Collecting internship data from 實習網 (shixi.com) with Scrapy

1. Collection Task Analysis

1.1 Choosing the Information Source

Collection target: internship listings for university students

Target website: 實習網, https://www.shixi.com/

Output format: JSON

Checking robots.txt:

https://www.shixi.com/robots.txt

User-agent: *
Disallow: http://us.shixi.com
Disallow: http://eboler.com
Disallow: http://www.eboler.com
Disallow: http://shixigroup.com
Disallow: http://www.shixi.com/%7B%7B_HTTP_HOST%7D%7D
Disallow: http://www.shixi.com/index/index
Disallow: http://www.shixi.com/index
Disallow: https://api.app.shixi.com
Disallow: https://api.wechat.shixi.com

Skimming the file, it directly lists the back-office login site… and even the app's API endpoints?

In the end, the robots.txt does not forbid collecting data from the search pages.

1.2 Collection Strategy

For the crawl entry point I found the main search page; the internship listings there follow a very regular layout, so that is where the crawl starts.

https://www.shixi.com/search/index

[image]

Page 1:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=1

Page 2:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=2

Page 1000:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=1000

Clearly, the page parameter in the URL determines the current page number, while the other parameters are used for filtering.

The site also does not load its listings asynchronously via AJAX or similar techniques, so a plain GET request with suitable request headers is enough to obtain a response containing the target information.

I therefore decided to crawl with the Scrapy framework, using the following plan:

① Generate the list of index pages (index_url) to crawl from the page parameter, e.g. the index_urls for pages 1-100 (see the sketch after this list);

② Send a GET request for each index_url in the list and obtain the corresponding index_response (status code 2xx or 3xx);

③ Parse each index_response for the detail job links (detail_url); given the site's layout of 10 postings per page, each index_response yields 10 detail_urls;

④ Send a GET request for each detail_url, then parse the detail_response to extract the fields of each posting;

⑤ Write each posting's fields to a JSON file, one posting per JSON object {}
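
A minimal sketch of step ① in plain Python (outside Scrapy), assuming the query string works without the stray "/%20" seen in the browser URLs above:

# Build the list of index URLs for pages 1-100 (step ①).
BASE_URL = (
    "https://www.shixi.com/search/index?key=&districts=0&education=70"
    "&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317"
    "&lang=zh_cn&page={page}"
)

index_urls = [BASE_URL.format(page=page) for page in range(1, 101)]
print(index_urls[0])   # page 1
print(index_urls[-1])  # page 100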

2.網頁結構與內容解析

2.1 網頁結構

First, look at the target URL of the request.

[image]

The target URL is:

https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&/%20practice_days=0&nature=0&trades=317&lang=zh_cn&page=1

Request it with the requests library:

import requests


def get_html():
    '''
    Try removing some of the headers below to see which ones are actually required.
    '''
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "UM_distinctid=171eeb58d40151-0714854878229a-335e4e71-100200-171eeb58d427ce; PHPSESSID=3us0g85ngmh6fv12qech489ce3; \
    Hm_lvt_536f42de0bcce9241264ac5d50172db7=1588847808,1588999500; \
    CNZZDATA1261027457=1718251162-1588845770-https%253A%252F%252Fwww.baidu.com%252F%7C1589009349; \
    Hm_lpvt_536f42de0bcce9241264ac5d50172db7=1589013570",
        "Host": "www.shixi.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
    }
    url = "https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317&lang=zh_cn&page=1"
    response = requests.get(url=url, headers=headers)
    print(response.text)
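
Calling get_html() should print the raw HTML of the first search page. If the response looks like a login or anti-bot page instead, the Cookie value above has probably expired and needs to be copied again from a fresh browser session.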

In the rendered HTML, each posting corresponds to one <div class="job-pannel-list">.

[image]

2.2 Content Parsing


from lxml import html as lxml_html


def my_parser():
    etree = lxml_html.etree

    with open('./shixiw01.html', 'r', encoding='utf-8') as fp:
        html = fp.read()

    html_parser_01 = etree.HTML(text=html)
    html_parser_02 = lxml_html.fromstring(html)  # parse the string into an lxml Element

    # Total number of pages, read from the "last page" pagination element.
    page_num = int(html_parser_01.xpath('//li[@jp-role="last"]/@jp-data')[0])
    print(page_num)

    # All job blocks on the page.
    jobs = html_parser_02.cssselect(".left_list.clearfix .job-pannel-list")
    print(jobs)

# Output: 2520 pages in total, each page holds 10 jobs
2520
[<Element div at 0x2500bb6da48>, <Element div at 0x2500c394f98>, <Element div at 0x2500c394c28>, <Element div at 0x2500c394e08>, <Element div at 0x2500c394e58>, <Element div at 0x2500c394ea8>, <Element div at 0x2500c394ef8>, <Element div at 0x2500c394f48>, <Element div at 0x2500c39c048>, <Element div at 0x2500c39c098>]

Job fields are extracted with CSS selectors. Note that the .css() / .get() calls below use Scrapy's Selector API, exactly as they appear in the spider later:

for job in jobs:
    item = dict()

    # Job title
    item['work_name'] = job.css("div.job-pannel-one > dl > dt > a::text").get().strip()

    # City
    item['city'] = job.css(".job-pannel-two > div.company-info > span:nth-child(1) > a::text").get().strip()

    # Company name
    item['company'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").get().strip()

    # Salary
    item['salary'] = job.css(".job-pannel-two > div.company-info > div::text").get().strip().replace(' ', '')

    # Education requirement
    item['degree'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").get().strip()

    # Publication date
    item['publish_time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").get().strip()

    # Detail page link
    next = job.css(".job-pannel-one > dl > dt > a::attr('href')").get()
    url = response.urljoin(next)
    item['detail_url'] = url
    
    
'''
The job description has to be parsed from the detail_url page:
# Job description
            description = response.css("div.work.padding_left_30 > div.work_b::text").get().split()
            description = ''.join(description)
'''
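
Note that the .css() / ::text / .get() calls above are Scrapy's Selector (parsel) API rather than lxml's cssselect. A minimal sketch for testing the same selectors outside Scrapy against the saved page (parsel is installed together with Scrapy; shixiw01.html is the file saved earlier):

from parsel import Selector

with open('./shixiw01.html', 'r', encoding='utf-8') as fp:
    sel = Selector(text=fp.read())

# Print a few fields for every job block on the page.
for job in sel.css(".left_list.clearfix .job-pannel-list"):
    work_name = job.css("div.job-pannel-one > dl > dt > a::text").get()
    salary = job.css(".job-pannel-two > div.company-info > div::text").get()
    href = job.css(".job-pannel-one > dl > dt > a::attr(href)").get()
    print(work_name.strip() if work_name else None, salary, href)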

3. Collection Process and Implementation

For the collection I used the Scrapy framework, Version: 1.6.0.

3.1 Writing the Item

First, decide which fields to scrape. Based on the earlier observations, the following information is available:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class ShiXiWangItem(scrapy.Item):

    # Detail page link
    detail_url = scrapy.Field()

    # Job title
    work_name = scrapy.Field()

    # City
    city = scrapy.Field()

    # Company name
    company = scrapy.Field()

    # Salary
    salary = scrapy.Field()

    # Education requirement
    degree = scrapy.Field()

    # Publication date
    publish_time = scrapy.Field()

    # Job description
    description = scrapy.Field()

3.2 Writing the Spider

# -*- coding: utf-8 -*-
import scrapy
from ..items import ShiXiWangItem


# from scrapy.linkextractors import LinkExtractor
# from scrapy.spiders import CrawlSpider, Rule


class ShixiwangSpider(scrapy.Spider):
    name = 'shixiwang'

    # Domains the spider is allowed to crawl
    # allowed_domains = ['https://www.shixi.com/']

    # The first URL fetched when the spider starts
    # start_urls = ['https://www.shixi.com/search/index']

    def __init__(self):
        super().__init__()

        # Index page URL template (the page number is filled in later)
        self.base_url = ('https://www.shixi.com/search/index?key=&districts=0&education=70'
                         '&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317'
                         '&lang=zh_cn&page={page}')

        # Item counter
        self.item_count = 0

    def closed(self, reason):
        print(f'Crawl finished: {self.item_count} internship postings collected')

    def start_requests(self):
        # base_url = "https://www.shixi.com/search/index?key=大數據&page={}"
        # dont_filter=True : keep the first request from being dropped by the duplicate filter
        yield scrapy.Request(url=self.base_url.format(page=1), callback=self.set_page, dont_filter=True)

    def set_page(self, response):
        # Total number of pages, read from the pagination element
        page_num = int(response.xpath('//ul[@id="shixi-pagination"]/@data-pages').get())
        print(f'{page_num} pages in total')
        target_page = int(input('Number of pages to crawl: ').strip())
        print(f'Target: {target_page} pages, starting crawl...')
        for page in range(1, target_page + 1):
            yield scrapy.Request(url=self.base_url.format(page=page), callback=self.parse_index)

    def parse_index(self, response):
        try:
            # All job blocks on this page
            jobs = response.css(".left_list.clearfix .job-pannel-list")

            for job in jobs:
                item = dict()

                # Job title
                item['work_name'] = job.css("div.job-pannel-one > dl > dt > a::text").get().strip()

                # City
                item['city'] = job.css(".job-pannel-two > div.company-info > span:nth-child(1) > a::text").get().strip()

                # Company name
                item['company'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").get().strip()

                # Salary
                item['salary'] = job.css(".job-pannel-two > div.company-info > div::text").get().strip().replace(' ', '')

                # Education requirement
                item['degree'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").get().strip()

                # Publication date
                item['publish_time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").get().strip()

                # Detail page link
                next = job.css(".job-pannel-one > dl > dt > a::attr('href')").get()
                url = response.urljoin(next)
                item['detail_url'] = url

                yield scrapy.Request(
                    url=url,
                    callback=self.parse_detail,
                    meta={'item': item},
                )
        except Exception:
            print('Failed to parse index page')

    def parse_detail(self, response):
        # print(response.request.headers['User-Agent'], '\n')
        self.item_count += 1

        item = response.meta['item']

        try:
            # Job description
            description = response.css("div.work.padding_left_30 > div.work_b::text").get().split()
            description = ''.join(description)
        except Exception:
            description = ''

        item['description'] = description

        # One complete item
        yield ShiXiWangItem(**item)

3.3 Writing the Pipeline

The pipeline is where items are processed, typically for data cleaning and for saving the data.

For example, JsonItemExporter can be used to export the data to a JSON file.

from scrapy.exporters import CsvItemExporter, JsonItemExporter, JsonLinesItemExporter
from scrapy import signals
import os

class JSONPipeline(object):
    def __init__(self):
        os.makedirs("data", exist_ok=True)  # make sure the output directory exists
        self.fp = open("data/data.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
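
For reference, the ITEM_PIPELINES setting in 3.4 actually points at a JsonLinesPipeline, which is not listed above. A minimal sketch of such a pipeline using JsonLinesItemExporter (one JSON object per line; the output path here is my own choice):

from scrapy.exporters import JsonLinesItemExporter
import os

class JsonLinesPipeline(object):
    def __init__(self):
        # One JSON object per line; no start/finish bookkeeping is needed.
        os.makedirs("data", exist_ok=True)
        self.fp = open("data/data.jl", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()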

3.4 Configuring settings

# Main settings

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "UM_distinctid=171eeb58d40151-0714854878229a-335e4e71-100200-171eeb58d427ce; PHPSESSID=3us0g85ngmh6fv12qech489ce3; \
    Hm_lvt_536f42de0bcce9241264ac5d50172db7=1588847808,1588999500; \
    CNZZDATA1261027457=1718251162-1588845770-https%253A%252F%252Fwww.baidu.com%252F%7C1589009349; \
    Hm_lpvt_536f42de0bcce9241264ac5d50172db7=1589013570",
    "Host": "www.shixi.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
}

# Delay (in seconds) between requests to the site
DOWNLOAD_DELAY = 1

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 10



# Downloader middlewares: swap proxies, IPs, cookies, etc.; requests pass through here before being sent
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
#   'MySpider01.middlewares.Myspider01DownloaderMiddleware': 543,
    'MySpider01.middlewares.RandomUserAgentMiddlware': 543,
}

# Item processing pipelines
ITEM_PIPELINES = {
    #   'MySpider01.pipelines.Myspider01Pipeline': 300,
    #    'MySpider01.pipelines.JsonLinesPipeline': 301,
    'MySpider01.pipelines.JsonLinesPipeline': 302,
}
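
The DOWNLOADER_MIDDLEWARES entry above enables a RandomUserAgentMiddlware whose code is not shown in this post. A minimal sketch of what such a downloader middleware might look like (the class name is taken from the settings; the implementation and the User-Agent list are assumptions):

# middlewares.py (sketch)
import random

class RandomUserAgentMiddlware(object):
    # A few desktop browser User-Agent strings to rotate through; extend as needed.
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before each request is sent.
        request.headers['User-Agent'] = random.choice(self.user_agents)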

3.5 Launching the Crawler

# main.py
from scrapy.cmdline import execute
import sys
import os

if __name__ == '__main__':
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'crawl', 'shixiwang'])
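
The spider can also be started without main.py by running scrapy crawl shixiwang from the project root; main.py simply wraps that command so the crawl can be launched from an IDE or debugger.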

[images]

4. Analysis of the Collected Data

4.1 Collection Results

A sample of the resulting JSON:

  {
    "work_name": ".Net軟件開發工程師",
    "city": "江蘇省/蘇州市",
    "company": "蘇州麥粒信息科技有限公司",
    "salary": "¥5000/月",
    "degree": "本科",
    "publish_time": "2020-04-08",
    "detail_url": "https://www.shixi.com/personals/jobshow/73974",
    "description": "根據產品設計要求,按期完成量化開發任務;"
  },
  {
    "work_name": "業務員",
    "city": "廣東省/江門市",
    "company": "江門市滿紅網絡科技有限公司",
    "salary": "¥8000/月",
    "degree": "本科",
    "publish_time": "2020-03-31",
    "detail_url": "https://www.shixi.com/personals/jobshow/74682",
    "description": "崗位職責:1、負責公司產品的推廣;2、收集客戶意見及信息;3、爲客戶提供準確專業的銷售及諮詢服務;4、根據市場計劃完成銷售指標;5、維護客戶關係以及客戶間長期戰略合作計劃;6、跟進未成交的客戶,促進客戶轉介紹;7、負責管轄區市場信息的收集8、爲客戶提供優質的服務職位要求:1、語言表達能力強,語言表達清晰、流暢;2、思維清晰,反應敏捷,具有較強的溝通能力及交際技巧,有親和力;3、具備一定的市場分析及判斷能力,有良好的客戶意識;4、有責任心,有團隊精神,善於挑戰;5、有理想有目標,敢於挑戰高薪。待遇:1、每天工作8小時,月休4天,2、有法定假期,3、加班補貼,4、五險一金,5、年終分紅"
  },
  {
    "work_name": "產品運營實習生",
    "city": "北京市/海淀區",
    "company": "北京七視野文化創意發展有限公司",
    "salary": "面議",
    "degree": "本科",
    "publish_time": "2020-04-01",
    "detail_url": "https://www.shixi.com/personals/jobshow/22904",
    "description": "【崗位職責】"
  },
  {
    "work_name": "數據內容編輯實習生",
    "city": "北京市/海淀區",
    "company": "北京歲月桔子科技有限公司",
    "salary": "¥120/月",
    "degree": "本科",
    "publish_time": "2020-04-10",
    "detail_url": "https://www.shixi.com/personals/jobshow/64212",
    "description": "職位描述:"
  },
  {
    "work_name": "人力實習生",
    "city": "北京市/海淀區",
    "company": "北京職業夢科技有限公司",
    "salary": "¥2000/月",
    "degree": "本科",
    "publish_time": "2020-04-17",
    "detail_url": "https://www.shixi.com/personals/jobshow/22980",
    "description": "職位描述:"
  },
  {
    "work_name": "內容電商運營",
    "city": "北京市/海淀區",
    "company": "愛天教育科技(北京)有限公司",
    "salary": "¥120/天",
    "degree": "本科",
    "publish_time": "2020-04-19",
    "detail_url": "https://www.shixi.com/personals/jobshow/23020",
    "description": ""
  }

4.2 Brief Analysis

[images]

As the charts show, most internet-related positions are concentrated in Beijing, Shanghai, and Guangzhou.
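
A minimal sketch of how a city breakdown like the one in the charts could be computed from the exported data.json (assuming pandas is available; the field names match the items defined in 3.1):

import pandas as pd

# Load the exported items (a single JSON array written by JSONPipeline).
df = pd.read_json("data/data.json")

# "city" looks like "北京市/海淀區"; keep only the province/municipality part.
city_counts = df["city"].str.split("/").str[0].value_counts()
print(city_counts.head(10))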

5. Summary and Takeaways

  • Became more familiar with the Scrapy framework, clarified what the downloader middleware and spider middleware do, and improved the corresponding hands-on skills;

  • Learned how the dont_filter parameter is used; it stops Scrapy from automatically dropping a request as a duplicate:

scrapy.Request(url=self.base_url.format(page=1), callback=self.set_page, dont_filter=True)

  • Gained a deeper understanding of CSS selectors, which also helps a lot when writing JS+CSS web pages;
  • Browsed a large amount of internship information and gained a better sense of future work options.