Collecting internship data from 實習網 (shixi.com) with Scrapy

1. Collection Task Analysis

1.1 Choosing the Information Source

Collection target: internship listings for university students

Target website: 實習網, https://www.shixi.com/

Output format: JSON

Checking robots.txt:

https://www.shixi.com/robots.txt

User-agent: *
Disallow: http://us.shixi.com
Disallow: http://eboler.com
Disallow: http://www.eboler.com
Disallow: http://shixigroup.com
Disallow: http://www.shixi.com/%7B%7B_HTTP_HOST%7D%7D
Disallow: http://www.shixi.com/index/index
Disallow: http://www.shixi.com/index
Disallow: https://api.app.shixi.com
Disallow: https://api.wechat.shixi.com

Skimming the file, it directly lists the back-office login site… and even the app's API endpoints?

In the end, the robots.txt does not forbid collecting data from the search pages.

1.2 Collection Strategy

For the crawl entry point I found the main search page; the internship listings there follow a very regular layout, so that is where the crawl starts.

https://www.shixi.com/search/index

[image]

Page 1:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=1

Page 2:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=2

Page 1000:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=1000

Clearly, the page parameter in the URL determines the current page number, while the other parameters are used for filtering.

The site also does not load its listings asynchronously via AJAX or similar techniques, so a plain GET request with suitable request headers is enough to obtain a response containing the target information.

I therefore decided to crawl with the Scrapy framework, using the following plan:

① Generate the list of index pages (index_url) to crawl from the page parameter, e.g. the index_urls for pages 1-100 (see the sketch after this list);

② Send a GET request for each index_url in the list and obtain the corresponding index_response (status code 2xx or 3xx);

③ Parse each index_response for the detail job links (detail_url); given the site's layout of 10 postings per page, each index_response yields 10 detail_urls;

④ Send a GET request for each detail_url, then parse the detail_response to extract the fields of each posting;

⑤ Write each posting's fields to a JSON file, one posting per JSON object {}
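
A minimal sketch of step ① in plain Python (outside Scrapy), assuming the query string works without the stray "/%20" seen in the browser URLs above:

# Build the list of index URLs for pages 1-100 (step ①).
BASE_URL = (
    "https://www.shixi.com/search/index?key=&districts=0&education=70"
    "&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317"
    "&lang=zh_cn&page={page}"
)

index_urls = [BASE_URL.format(page=page) for page in range(1, 101)]
print(index_urls[0])   # page 1
print(index_urls[-1])  # page 100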

2.網頁結構與內容解析

2.1 網頁結構

First, look at the target URL of the request.

[image]

The target URL is:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=1

Request it with the requests library:

import requests


def get_html():
    '''
    Try removing some of the headers below to see which ones are actually required.
    '''
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "UM_distinctid=171eeb58d40151-0714854878229a-335e4e71-100200-171eeb58d427ce; PHPSESSID=3us0g85ngmh6fv12qech489ce3; \
    Hm_lvt_536f42de0bcce9241264ac5d50172db7=1588847808,1588999500; \
    CNZZDATA1261027457=1718251162-1588845770-https%253A%252F%252Fwww.baidu.com%252F%7C1589009349; \
    Hm_lpvt_536f42de0bcce9241264ac5d50172db7=1589013570",
        "Host": "www.shixi.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
    }
    url = "https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317&lang=zh_cn&page=1"
    response = requests.get(url=url, headers=headers)
    print(response.text)
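
Calling get_html() should print the raw HTML of the first search page. If the response looks like a login or anti-bot page instead, the Cookie value above has probably expired and needs to be copied again from a fresh browser session.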

In the rendered HTML, each posting corresponds to one <div class="job-pannel-list">.

[image]

2.2 Content Parsing


from lxml import html as lxml_html


def my_parser():
    etree = lxml_html.etree

    with open('./shixiw01.html', 'r', encoding='utf-8') as fp:
        html = fp.read()

    html_parser_01 = etree.HTML(text=html)
    html_parser_02 = lxml_html.fromstring(html)  # parse the string into an lxml Element

    # Total number of pages, read from the "last page" pagination element.
    page_num = int(html_parser_01.xpath('//li[@jp-role="last"]/@jp-data')[0])
    print(page_num)

    # All job blocks on the page.
    jobs = html_parser_02.cssselect(".left_list.clearfix .job-pannel-list")
    print(jobs)

# Output: 2520 pages in total, each page holds 10 jobs
2520
[<Element div at 0x2500bb6da48>, <Element div at 0x2500c394f98>, <Element div at 0x2500c394c28>, <Element div at 0x2500c394e08>, <Element div at 0x2500c394e58>, <Element div at 0x2500c394ea8>, <Element div at 0x2500c394ef8>, <Element div at 0x2500c394f48>, <Element div at 0x2500c39c048>, <Element div at 0x2500c39c098>]

Job fields are extracted with CSS selectors. Note that the .css() / .get() calls below use Scrapy's Selector API, exactly as they appear in the spider later:

for job in jobs:
    item = dict()

    # Job title
    item['work_name'] = job.css("div.job-pannel-one > dl > dt > a::text").get().strip()

    # City
    item['city'] = job.css(".job-pannel-two > div.company-info > span:nth-child(1) > a::text").get().strip()

    # Company name
    item['company'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").get().strip()

    # Salary
    item['salary'] = job.css(".job-pannel-two > div.company-info > div::text").get().strip().replace(' ', '')

    # Education requirement
    item['degree'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").get().strip()

    # Publication date
    item['publish_time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").get().strip()

    # Detail page link
    next = job.css(".job-pannel-one > dl > dt > a::attr('href')").get()
    url = response.urljoin(next)
    item['detail_url'] = url
    
    
'''
The job description has to be parsed from the detail_url page:
# Job description
            description = response.css("div.work.padding_left_30 > div.work_b::text").get().split()
            description = ''.join(description)
'''
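
Note that the .css() / ::text / .get() calls above are Scrapy's Selector (parsel) API rather than lxml's cssselect. A minimal sketch for testing the same selectors outside Scrapy against the saved page (parsel is installed together with Scrapy; shixiw01.html is the file saved earlier):

from parsel import Selector

with open('./shixiw01.html', 'r', encoding='utf-8') as fp:
    sel = Selector(text=fp.read())

# Print a few fields for every job block on the page.
for job in sel.css(".left_list.clearfix .job-pannel-list"):
    work_name = job.css("div.job-pannel-one > dl > dt > a::text").get()
    salary = job.css(".job-pannel-two > div.company-info > div::text").get()
    href = job.css(".job-pannel-one > dl > dt > a::attr(href)").get()
    print(work_name.strip() if work_name else None, salary, href)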

3. Collection Process and Implementation

For the collection I used the Scrapy framework, Version: 1.6.0.

3.1 Writing the Item

First, decide which fields to scrape. Based on the earlier observations, the following information is available:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class ShiXiWangItem(scrapy.Item):

    # Detail page link
    detail_url = scrapy.Field()

    # Job title
    work_name = scrapy.Field()

    # City
    city = scrapy.Field()

    # Company name
    company = scrapy.Field()

    # Salary
    salary = scrapy.Field()

    # Education requirement
    degree = scrapy.Field()

    # Publication date
    publish_time = scrapy.Field()

    # Job description
    description = scrapy.Field()

3.2 Writing the Spider

# -*- coding: utf-8 -*-
import scrapy
from ..items import ShiXiWangItem


# from scrapy.linkextractors import LinkExtractor
# from scrapy.spiders import CrawlSpider, Rule


class ShixiwangSpider(scrapy.Spider):
    name = 'shixiwang'

    # Domains the spider is allowed to crawl
    # allowed_domains = ['https://www.shixi.com/']

    # The first URL fetched when the spider starts
    # start_urls = ['https://www.shixi.com/search/index']

    def __init__(self):
        super().__init__()

        # Index page URL template (the page number is filled in later)
        self.base_url = ('https://www.shixi.com/search/index?key=&districts=0&education=70'
                         '&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317'
                         '&lang=zh_cn&page={page}')

        # Item counter
        self.item_count = 0

    def closed(self, reason):
        print(f'Crawl finished: {self.item_count} internship postings collected')

    def start_requests(self):
        # base_url = "https://www.shixi.com/search/index?key=大數據&page={}"
        # dont_filter=True : keep the first request from being dropped by the duplicate filter
        yield scrapy.Request(url=self.base_url.format(page=1), callback=self.set_page, dont_filter=True)

    def set_page(self, response):
        # Total number of pages, read from the pagination element
        page_num = int(response.xpath('//ul[@id="shixi-pagination"]/@data-pages').get())
        print(f'{page_num} pages in total')
        target_page = int(input('Number of pages to crawl: ').strip())
        print(f'Target: {target_page} pages, starting crawl...')
        for page in range(1, target_page + 1):
            yield scrapy.Request(url=self.base_url.format(page=page), callback=self.parse_index)

    def parse_index(self, response):
        try:
            # All job blocks on this page
            jobs = response.css(".left_list.clearfix .job-pannel-list")

            for job in jobs:
                item = dict()

                # Job title
                item['work_name'] = job.css("div.job-pannel-one > dl > dt > a::text").get().strip()

                # City
                item['city'] = job.css(".job-pannel-two > div.company-info > span:nth-child(1) > a::text").get().strip()

                # Company name
                item['company'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").get().strip()

                # Salary
                item['salary'] = job.css(".job-pannel-two > div.company-info > div::text").get().strip().replace(' ', '')

                # Education requirement
                item['degree'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").get().strip()

                # Publication date
                item['publish_time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").get().strip()

                # Detail page link
                next = job.css(".job-pannel-one > dl > dt > a::attr('href')").get()
                url = response.urljoin(next)
                item['detail_url'] = url

                yield scrapy.Request(
                    url=url,
                    callback=self.parse_detail,
                    meta={'item': item},
                )
        except Exception:
            print('Failed to parse index page')

    def parse_detail(self, response):
        # print(response.request.headers['User-Agent'], '\n')
        self.item_count += 1

        item = response.meta['item']

        try:
            # Job description
            description = response.css("div.work.padding_left_30 > div.work_b::text").get().split()
            description = ''.join(description)
        except Exception:
            description = ''

        item['description'] = description

        # One complete item
        yield ShiXiWangItem(**item)

3.3 Writing the Pipeline

The pipeline is where items are processed, typically for data cleaning and for saving the data.

For example, JsonItemExporter can be used to export the data to a JSON file.

from scrapy.exporters import CsvItemExporter, JsonItemExporter, JsonLinesItemExporter
from scrapy import signals
import os

class JSONPipeline(object):
    def __init__(self):
        os.makedirs("data", exist_ok=True)  # make sure the output directory exists
        self.fp = open("data/data.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
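
For reference, the ITEM_PIPELINES setting in 3.4 actually points at a JsonLinesPipeline, which is not listed above. A minimal sketch of such a pipeline using JsonLinesItemExporter (one JSON object per line; the output path here is my own choice):

from scrapy.exporters import JsonLinesItemExporter
import os

class JsonLinesPipeline(object):
    def __init__(self):
        # One JSON object per line; no start/finish bookkeeping is needed.
        os.makedirs("data", exist_ok=True)
        self.fp = open("data/data.jl", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()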

3.4 Configuring settings

# Main settings

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "UM_distinctid=171eeb58d40151-0714854878229a-335e4e71-100200-171eeb58d427ce; PHPSESSID=3us0g85ngmh6fv12qech489ce3; \
    Hm_lvt_536f42de0bcce9241264ac5d50172db7=1588847808,1588999500; \
    CNZZDATA1261027457=1718251162-1588845770-https%253A%252F%252Fwww.baidu.com%252F%7C1589009349; \
    Hm_lpvt_536f42de0bcce9241264ac5d50172db7=1589013570",
    "Host": "www.shixi.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
}

# Delay (in seconds) between requests to the site
DOWNLOAD_DELAY = 1

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 10



# Downloader middlewares: swap proxies, IPs, cookies, etc.; requests pass through here before being sent
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
#   'MySpider01.middlewares.Myspider01DownloaderMiddleware': 543,
    'MySpider01.middlewares.RandomUserAgentMiddlware': 543,
}

# Item processing pipelines
ITEM_PIPELINES = {
    #   'MySpider01.pipelines.Myspider01Pipeline': 300,
    #    'MySpider01.pipelines.JsonLinesPipeline': 301,
    'MySpider01.pipelines.JsonLinesPipeline': 302,
}
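
The DOWNLOADER_MIDDLEWARES entry above enables a RandomUserAgentMiddlware whose code is not shown in this post. A minimal sketch of what such a downloader middleware might look like (the class name is taken from the settings; the implementation and the User-Agent list are assumptions):

# middlewares.py (sketch)
import random

class RandomUserAgentMiddlware(object):
    # A few desktop browser User-Agent strings to rotate through; extend as needed.
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before each request is sent.
        request.headers['User-Agent'] = random.choice(self.user_agents)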

3.5 Launching the Crawler

# main.py
from scrapy.cmdline import execute
import sys
import os

if __name__ == '__main__':
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'crawl', 'shixiwang'])
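
The spider can also be started without main.py by running scrapy crawl shixiwang from the project root; main.py simply wraps that command so the crawl can be launched from an IDE or debugger.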

[images]

4. Analysis of the Collected Data

4.1 Collection Results

A sample of the resulting JSON:

  {
    "work_name": ".Net軟件開發工程師",
    "city": "江蘇省/蘇州市",
    "company": "蘇州麥粒信息科技有限公司",
    "salary": "¥5000/月",
    "degree": "本科",
    "publish_time": "2020-04-08",
    "detail_url": "https://www.shixi.com/personals/jobshow/73974",
    "description": "根據產品設計要求,按期完成量化開發任務;"
  },
  {
    "work_name": "業務員",
    "city": "廣東省/江門市",
    "company": "江門市滿紅網絡科技有限公司",
    "salary": "¥8000/月",
    "degree": "本科",
    "publish_time": "2020-03-31",
    "detail_url": "https://www.shixi.com/personals/jobshow/74682",
    "description": "崗位職責:1、負責公司產品的推廣;2、收集客戶意見及信息;3、爲客戶提供準確專業的銷售及諮詢服務;4、根據市場計劃完成銷售指標;5、維護客戶關係以及客戶間長期戰略合作計劃;6、跟進未成交的客戶,促進客戶轉介紹;7、負責管轄區市場信息的收集8、爲客戶提供優質的服務職位要求:1、語言表達能力強,語言表達清晰、流暢;2、思維清晰,反應敏捷,具有較強的溝通能力及交際技巧,有親和力;3、具備一定的市場分析及判斷能力,有良好的客戶意識;4、有責任心,有團隊精神,善於挑戰;5、有理想有目標,敢於挑戰高薪。待遇:1、每天工作8小時,月休4天,2、有法定假期,3、加班補貼,4、五險一金,5、年終分紅"
  },
  {
    "work_name": "產品運營實習生",
    "city": "北京市/海淀區",
    "company": "北京七視野文化創意發展有限公司",
    "salary": "面議",
    "degree": "本科",
    "publish_time": "2020-04-01",
    "detail_url": "https://www.shixi.com/personals/jobshow/22904",
    "description": "【崗位職責】"
  },
  {
    "work_name": "數據內容編輯實習生",
    "city": "北京市/海淀區",
    "company": "北京歲月桔子科技有限公司",
    "salary": "¥120/月",
    "degree": "本科",
    "publish_time": "2020-04-10",
    "detail_url": "https://www.shixi.com/personals/jobshow/64212",
    "description": "職位描述:"
  },
  {
    "work_name": "人力實習生",
    "city": "北京市/海淀區",
    "company": "北京職業夢科技有限公司",
    "salary": "¥2000/月",
    "degree": "本科",
    "publish_time": "2020-04-17",
    "detail_url": "https://www.shixi.com/personals/jobshow/22980",
    "description": "職位描述:"
  },
  {
    "work_name": "內容電商運營",
    "city": "北京市/海淀區",
    "company": "愛天教育科技(北京)有限公司",
    "salary": "¥120/天",
    "degree": "本科",
    "publish_time": "2020-04-19",
    "detail_url": "https://www.shixi.com/personals/jobshow/23020",
    "description": ""
  }

4.2 Brief Analysis

[images]

As the charts show, most internet-related positions are concentrated in Beijing, Shanghai, and Guangzhou.
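
A minimal sketch of how a city breakdown like the one in the charts could be computed from the exported data.json (assuming pandas is available; the field names match the items defined in 3.1):

import pandas as pd

# Load the exported items (a single JSON array written by JSONPipeline).
df = pd.read_json("data/data.json")

# "city" looks like "北京市/海淀區"; keep only the province/municipality part.
city_counts = df["city"].str.split("/").str[0].value_counts()
print(city_counts.head(10))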

5. Summary and Takeaways

  • Became more familiar with the Scrapy framework, clarified what the downloader middleware and spider middleware do, and improved the corresponding hands-on skills;

  • Learned how the dont_filter parameter is used; it stops Scrapy from automatically dropping a request as a duplicate:

scrapy.Request(url=self.base_url.format(page=1), callback=self.set_page, dont_filter=True)

  • Gained a deeper understanding of CSS selectors, which also helps a lot when writing JS+CSS web pages;
  • Browsed a large amount of internship information and gained a better sense of future work options.