1. Scraping Task Analysis
1.1 Selecting the Information Source
Target information: internship listings for university students
Target website: Shixi.com (實習網), https://www.shixi.com/
Output format: JSON
Checking robots.txt:
https://www.shixi.com/robots.txt
User-agent: *
Disallow: http://us.shixi.com
Disallow: http://eboler.com
Disallow: http://www.eboler.com
Disallow: http://shixigroup.com
Disallow: http://www.shixi.com/%7B%7B_HTTP_HOST%7D%7D
Disallow: http://www.shixi.com/index/index
Disallow: http://www.shixi.com/index
Disallow: https://api.app.shixi.com
Disallow: https://api.wechat.shixi.com
Skimming through it, I noticed that it directly lists the backend login site… and even the app's request endpoints?
In the end, though, the file does not disallow scraping the search pages, which is what matters here.
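To double-check, the standard library's robots.txt parser can be asked directly whether the search page may be fetched (a quick sketch; it should print True):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.shixi.com/robots.txt')
rp.read()  # fetch and parse the live robots.txt
print(rp.can_fetch('*', 'https://www.shixi.com/search/index'))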
1.2 Scraping Strategy
As the crawl entry point I chose the main search page, where the internship listings are laid out very regularly:
https://www.shixi.com/search/index
Page 1:
https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317&lang=zh_cn&page=1
Page 2:
https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317&lang=zh_cn&page=2
Page 1000:
https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317&lang=zh_cn&page=1000
Clearly, the page parameter in the url determines the current page number, while the other parameters are used for filtering.
Observation also shows that the site does not load its listings through Ajax or other asynchronous techniques, so a plain GET with the requests library and suitable request headers is enough to obtain a response containing the target information.
I therefore decided to crawl with the Scrapy framework. The plan is as follows (a minimal sketch of step ① follows the list):
① Generate the list of index pages (index_url) to crawl from the page parameter, e.g. the index_urls for pages 1-100;
② Send a GET request for each index_url in the list to obtain the corresponding index_response (status code 2xx or 3xx);
③ Parse each index_response for the detail-page links (detail_url); given the site's layout of 10 postings per page, each index_response yields 10 detail_urls;
④ Send a GET request for each detail_url, then parse the detail_response for the fields of each posting;
⑤ Write each posting's fields to a JSON file, one JSON object {} per posting.
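A minimal sketch of step ①, assuming only the page parameter varies while the filter parameters stay fixed:

# Template of the index url; {page} is the only part that changes
BASE_URL = ('https://www.shixi.com/search/index?key=&districts=0&education=70'
            '&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317'
            '&lang=zh_cn&page={page}')

# Step ①: the index urls for pages 1-100
index_urls = [BASE_URL.format(page=p) for p in range(1, 101)]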
2. Page Structure and Content Parsing
2.1 Page Structure
First, check the url the browser actually requests.
The target url is:
https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317&lang=zh_cn&page=1
Fetch it with the requests library:
import requests

def get_html():
    '''
    Try removing individual headers to find out which ones are redundant.
    '''
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "UM_distinctid=171eeb58d40151-0714854878229a-335e4e71-100200-171eeb58d427ce; "
                  "PHPSESSID=3us0g85ngmh6fv12qech489ce3; "
                  "Hm_lvt_536f42de0bcce9241264ac5d50172db7=1588847808,1588999500; "
                  "CNZZDATA1261027457=1718251162-1588845770-https%253A%252F%252Fwww.baidu.com%252F%7C1589009349; "
                  "Hm_lpvt_536f42de0bcce9241264ac5d50172db7=1589013570",
        "Host": "www.shixi.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
    }
    url = "https://www.shixi.com/search/index?key=&districts=0&education=70&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317&lang=zh_cn&page=1"
    response = requests.get(url=url, headers=headers)
    print(response.text)
In the rendered HTML, each job posting corresponds to one <div class="job-pannel-list">.
2.2 Content Parsing
from lxml import html as lxml_html

def my_parser():
    etree = lxml_html.etree
    with open('./shixiw01.html', 'r', encoding='utf-8') as fp:
        html = fp.read()
    html_parser_01 = etree.HTML(text=html)
    html_parser_02 = lxml_html.fromstring(html)  # parse the string into an lxml Element
    # total page count, read from the "last page" button of the pagination bar
    page_num = int(html_parser_01.xpath('//li[@jp-role="last"]/@jp-data')[0])
    print(page_num)
    # one Element per job posting on the page
    jobs = html_parser_02.cssselect(".left_list.clearfix .job-pannel-list")
    print(jobs)

# Output: 2520 pages in total, 10 jobs per page
2520
[<Element div at 0x2500bb6da48>, <Element div at 0x2500c394f98>, <Element div at 0x2500c394c28>, <Element div at 0x2500c394e08>, <Element div at 0x2500c394e58>, <Element div at 0x2500c394ea8>, <Element div at 0x2500c394ef8>, <Element div at 0x2500c394f48>, <Element div at 0x2500c39c048>, <Element div at 0x2500c39c098>]
The job fields are extracted with CSS selectors. Note that this snippet uses Scrapy's selector API rather than lxml: here job is a Scrapy Selector for one posting and response is the index-page response, exactly as inside the spider callback in section 3.2:
for job in jobs:
    item = dict()
    # job title
    item['work_name'] = job.css("div.job-pannel-one > dl > dt > a::text").get().strip()
    # city
    item['city'] = job.css(".job-pannel-two > div.company-info > span:nth-child(1) > a::text").get().strip()
    # company name
    item['company'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").get().strip()
    # salary
    item['salary'] = job.css(".job-pannel-two > div.company-info > div::text").get().strip().replace(' ', '')
    # educational requirement
    item['degree'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").get().strip()
    # publication date
    item['publish_time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").get().strip()
    # detail-page link
    detail_href = job.css(".job-pannel-one > dl > dt > a::attr(href)").get()
    url = response.urljoin(detail_href)
    item['detail_url'] = url
    '''
    The job description has to be parsed from detail_url itself:
    # job description
    description = response.css("div.work.padding_left_30 > div.work_b::text").get().split()
    description = ''.join(description)
    '''
3. Scraping Process and Implementation
For this scrape I used the Scrapy framework, version 1.6.0.
3.1 Writing the Item
First, be clear about which fields to scrape. The earlier inspection showed that the following information is available:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class ShiXiWangItem(scrapy.Item):
    # detail-page link
    detail_url = scrapy.Field()
    # job title
    work_name = scrapy.Field()
    # city
    city = scrapy.Field()
    # company name
    company = scrapy.Field()
    # salary
    salary = scrapy.Field()
    # educational requirement
    degree = scrapy.Field()
    # publication date
    publish_time = scrapy.Field()
    # job description
    description = scrapy.Field()
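As a side note, scrapy.Item instances behave like dicts, which is what lets the spider below build a plain dict first and then unpack it with ShiXiWangItem(**item). A quick illustration (the values here are made up):

item = ShiXiWangItem(work_name='test job', city='北京市/海淀區')
item['salary'] = '面議'
print(dict(item))
# {'work_name': 'test job', 'city': '北京市/海淀區', 'salary': '面議'}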
3.2 Writing the Spider
# -*- coding: utf-8 -*-
import scrapy
from ..items import ShiXiWangItem
# from scrapy.linkextractors import LinkExtractor
# from scrapy.spiders import CrawlSpider, Rule

class ShixiwangSpider(scrapy.Spider):
    name = 'shixiwang'
    # Restrict the domains the spider may crawl (takes bare domains, not urls)
    # allowed_domains = ['shixi.com']
    # The first url(s) the spider fetches on startup
    # start_urls = ['https://www.shixi.com/search/index']

    def __init__(self):
        super().__init__()
        # Entry url; implicit string concatenation avoids the stray spaces
        # that a backslash continuation would insert into the url
        self.base_url = ('https://www.shixi.com/search/index?key=&districts=0&education=70'
                         '&full_opportunity=0&stage=0&practice_days=0&nature=0&trades=317'
                         '&lang=zh_cn&page={page}')
        # item counter
        self.item_count = 0

    def closed(self, reason):
        print(f'Crawl finished: {self.item_count} internship postings collected')

    def start_requests(self):
        # base_url = "https://www.shixi.com/search/index?key=大數據&page={}"
        # dont_filter=True: keep the page-1 request from being dropped later
        # as a duplicate
        yield scrapy.Request(url=self.base_url.format(page=1), callback=self.set_page, dont_filter=True)

    def set_page(self, response):
        page_num = int(response.xpath('//ul[@id="shixi-pagination"]/@data-pages').get())
        print(f'{page_num} pages in total')
        target_page = int(input('Number of pages to crawl: ').strip())
        print(f'Target: {target_page} pages, starting crawl...')
        for page in range(1, target_page + 1):
            yield scrapy.Request(url=self.base_url.format(page=page), callback=self.parse_index)

    def parse_index(self, response):
        try:
            # every job posting on this index page
            jobs = response.css(".left_list.clearfix .job-pannel-list")
            for job in jobs:
                item = dict()
                # job title
                item['work_name'] = job.css("div.job-pannel-one > dl > dt > a::text").get().strip()
                # city
                item['city'] = job.css(".job-pannel-two > div.company-info > span:nth-child(1) > a::text").get().strip()
                # company name
                item['company'] = job.css(".job-pannel-one > dl > dd:nth-child(2) > div > a::text").get().strip()
                # salary
                item['salary'] = job.css(".job-pannel-two > div.company-info > div::text").get().strip().replace(' ', '')
                # educational requirement
                item['degree'] = job.css(".job-pannel-one > dl > dd.job-des > span::text").get().strip()
                # publication date
                item['publish_time'] = job.css(".job-pannel-two > div.company-info > span.job-time::text").get().strip()
                # detail-page link
                detail_href = job.css(".job-pannel-one > dl > dt > a::attr(href)").get()
                url = response.urljoin(detail_href)
                item['detail_url'] = url
                yield scrapy.Request(
                    url=url,
                    callback=self.parse_detail,
                    meta={'item': item},
                )
        except Exception as e:
            print(f'Failed to parse index page: {e}')

    def parse_detail(self, response):
        # print(response.request.headers['User-Agent'], '\n')
        self.item_count += 1
        item = response.meta['item']
        try:
            # job description
            description = response.css("div.work.padding_left_30 > div.work_b::text").get().split()
            description = ''.join(description)
        except Exception:
            description = ''
        item['description'] = description
        # one item completed
        yield ShiXiWangItem(**item)
3.3 Writing the Pipeline
The pipeline is where items are post-processed, typically for data cleaning and persistence.
For example, JsonItemExporter exports the data as a JSON file:
from scrapy.exporters import JsonItemExporter, JsonLinesItemExporter
import os

class JSONPipeline(object):
    def __init__(self):
        os.makedirs('data', exist_ok=True)  # make sure the output directory exists
        self.fp = open("data/data.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
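Note that the ITEM_PIPELINES setting in section 3.4 actually enables a JsonLinesPipeline rather than the JSONPipeline above, and that class is not shown anywhere else in this write-up. A minimal sketch of what it might look like, reusing the JsonLinesItemExporter import (one self-contained JSON object per line, which survives interrupted crawls better than a single JSON array; the data/data.jl path is my own choice):

class JsonLinesPipeline(object):
    def __init__(self):
        os.makedirs('data', exist_ok=True)
        self.fp = open('data/data.jl', 'wb')
        # one JSON object per line (.jl / JSON Lines format)
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()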
3.4 Configuring settings
# The main settings

# Default request headers sent with every request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "UM_distinctid=171eeb58d40151-0714854878229a-335e4e71-100200-171eeb58d427ce; "
              "PHPSESSID=3us0g85ngmh6fv12qech489ce3; "
              "Hm_lvt_536f42de0bcce9241264ac5d50172db7=1588847808,1588999500; "
              "CNZZDATA1261027457=1718251162-1588845770-https%253A%252F%252Fwww.baidu.com%252F%7C1589009349; "
              "Hm_lpvt_536f42de0bcce9241264ac5d50172db7=1589013570",
    "Host": "www.shixi.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
}
# Delay (in seconds) between consecutive requests
DOWNLOAD_DELAY = 1
# Obey robots.txt rules (robots.txt was already checked by hand in
# section 1.1; the search pages are not disallowed)
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 10
# Downloader middlewares: swap proxies, IPs, cookies, etc. before each
# request is sent
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'MySpider01.middlewares.Myspider01DownloaderMiddleware': 543,
    'MySpider01.middlewares.RandomUserAgentMiddlware': 543,
}
# Item pipelines
ITEM_PIPELINES = {
    # 'MySpider01.pipelines.Myspider01Pipeline': 300,
    # 'MySpider01.pipelines.JsonLinesPipeline': 301,
    'MySpider01.pipelines.JsonLinesPipeline': 302,
}
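DOWNLOADER_MIDDLEWARES above enables a RandomUserAgentMiddlware that is not shown elsewhere in this write-up. A minimal sketch of what such a middleware might look like (the class name comes from the settings entry; the User-Agent pool here is illustrative):

# middlewares.py
import random

class RandomUserAgentMiddlware(object):
    # a small illustrative pool; in practice this would hold many real UA strings
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    ]

    def process_request(self, request, spider):
        # overwrite the User-Agent header before the request goes out
        request.headers['User-Agent'] = random.choice(self.user_agents)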
3.5 Launching the Crawler
# main.py
from scrapy.cmdline import execute
import sys
import os

if __name__ == '__main__':
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'crawl', 'shixiwang'])
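Running python main.py from the project root is equivalent to invoking scrapy crawl shixiwang on the command line; the script mainly makes it convenient to launch and debug the spider from an IDE.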
4. Analysis of the Collected Data
4.1 Results
Part of the resulting JSON:
{
"work_name": ".Net軟件開發工程師",
"city": "江蘇省/蘇州市",
"company": "蘇州麥粒信息科技有限公司",
"salary": "¥5000/月",
"degree": "本科",
"publish_time": "2020-04-08",
"detail_url": "https://www.shixi.com/personals/jobshow/73974",
"description": "根據產品設計要求,按期完成量化開發任務;"
},
{
"work_name": "業務員",
"city": "廣東省/江門市",
"company": "江門市滿紅網絡科技有限公司",
"salary": "¥8000/月",
"degree": "本科",
"publish_time": "2020-03-31",
"detail_url": "https://www.shixi.com/personals/jobshow/74682",
"description": "崗位職責:1、負責公司產品的推廣;2、收集客戶意見及信息;3、爲客戶提供準確專業的銷售及諮詢服務;4、根據市場計劃完成銷售指標;5、維護客戶關係以及客戶間長期戰略合作計劃;6、跟進未成交的客戶,促進客戶轉介紹;7、負責管轄區市場信息的收集8、爲客戶提供優質的服務職位要求:1、語言表達能力強,語言表達清晰、流暢;2、思維清晰,反應敏捷,具有較強的溝通能力及交際技巧,有親和力;3、具備一定的市場分析及判斷能力,有良好的客戶意識;4、有責任心,有團隊精神,善於挑戰;5、有理想有目標,敢於挑戰高薪。待遇:1、每天工作8小時,月休4天,2、有法定假期,3、加班補貼,4、五險一金,5、年終分紅"
},
{
"work_name": "產品運營實習生",
"city": "北京市/海淀區",
"company": "北京七視野文化創意發展有限公司",
"salary": "面議",
"degree": "本科",
"publish_time": "2020-04-01",
"detail_url": "https://www.shixi.com/personals/jobshow/22904",
"description": "【崗位職責】"
},
{
"work_name": "數據內容編輯實習生",
"city": "北京市/海淀區",
"company": "北京歲月桔子科技有限公司",
"salary": "¥120/月",
"degree": "本科",
"publish_time": "2020-04-10",
"detail_url": "https://www.shixi.com/personals/jobshow/64212",
"description": "職位描述:"
},
{
"work_name": "人力實習生",
"city": "北京市/海淀區",
"company": "北京職業夢科技有限公司",
"salary": "¥2000/月",
"degree": "本科",
"publish_time": "2020-04-17",
"detail_url": "https://www.shixi.com/personals/jobshow/22980",
"description": "職位描述:"
},
{
"work_name": "內容電商運營",
"city": "北京市/海淀區",
"company": "愛天教育科技(北京)有限公司",
"salary": "¥120/天",
"degree": "本科",
"publish_time": "2020-04-19",
"detail_url": "https://www.shixi.com/personals/jobshow/23020",
"description": ""
}
4.2 Brief Analysis
As the sample shows, most Internet-related jobs are concentrated in the first-tier cities of Beijing, Shanghai, and Guangzhou. A quick count over the exported file backs this up; see the sketch below.
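A sketch to put numbers behind that claim, assuming the JSONPipeline from section 3.3 wrote the export to data/data.json:

import json
from collections import Counter

with open('data/data.json', encoding='utf-8') as fp:
    jobs = json.load(fp)

# 'city' values look like '北京市/海淀區'; keep the province/city part before the slash
city_counts = Counter(job['city'].split('/')[0] for job in jobs)
print(city_counts.most_common(10))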
5. Summary and Takeaways
- Became more familiar with the Scrapy framework, worked out what downloader middlewares and spider middlewares do, and improved the corresponding practical skills;
- Understood the dont_filter parameter, which stops Scrapy from automatically discarding a request as a duplicate:
  scrapy.Request(url=self.base_url.format(page=1), callback=self.set_page, dont_filter=True)
- Gained a deeper grasp of CSS selectors, which is also a big help for writing JS+CSS web pages;
- Read through a large number of internship listings and came away with a better understanding of the jobs ahead.