Basic usage of CrawlSpider, Request, Response, DownloadMiddlewares, and Settings
CrawlSpider
- Every custom spider class inherits from scrapy.Spider, the most basic spider class.
- CrawlSpider is an extension of the Spider class.
- Compared with Spider, CrawlSpider adds the Rule class, which is used to match and extract the links found in a page.
The Rule class
class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)
- The most commonly used parameters:
- link_extractor: the link-matching object, which holds the regular-expression rules for which links to extract.
- callback: the callback function; it receives a response as its first argument. Note: when writing crawl rules, avoid using parse as the callback, because CrawlSpider implements its own logic in the parse() method — overriding parse() breaks the CrawlSpider.
- follow: whether to follow links extracted from pages matched by this rule. Defaults to None, which means True when callback is None and False otherwise.
The LinkExtractor class
- LinkExtractor is the class of the link_extractor object passed as the first argument when creating a Rule. It does exactly one thing: extract links from a page.
class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a', 'area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)
Most commonly used parameter: allow, a regular expression (written inside the parentheses) that URLs must match to be extracted.
Other, less common parameters:
deny: URLs matching this regular expression are never extracted (takes priority over allow).
allow_domains: only links in these domains are extracted.
deny_domains: links in these domains are never extracted.
restrict_xpaths: XPath expressions that work together with allow to restrict where in the page links are extracted from.
Steps for using the LinkExtractor class:
- Create a link object, initialized with the matching rule:
link = LinkExtractor(allow=r'start=\d+')
- Call extract_links(response) on the link object; it returns the list of matching links:
result = link.extract_links(response)
- Notice how similar this is to using regular expressions directly:
re1 = re.compile('pattern')
result = re1.findall('string to search')
Example: crawling the Tencent recruitment site
- Goal: scrape the information on the list pages and on the detail pages.
tencent.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from TestCrawlSpider.items import TestcrawlspiderListItem, TestcrawlspiderDetailItem
class TencentSpider(CrawlSpider):
    name = 'tencent3'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?keywords=python&tid=87&lid=2156&start=0']

    # rules: the link extraction rules
    rules = (
        # first, extract the list pages (pagination links) and keep following them
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_list', follow=True),
        # then, extract the detail pages
        Rule(LinkExtractor(allow=r'position_detail.php'), callback='parse_detail', follow=False),
    )

    def parse_list(self, response):
        # parse the rows of the listing table
        tr_list = response.xpath('//tr[@class="even"]|//tr[@class="odd"]')
        for tr in tr_list:
            item = TestcrawlspiderListItem()
            item['work_name'] = tr.xpath('./td[1]/a/text()').extract_first()
            item['work_type'] = tr.xpath('./td[2]/text()').extract_first()
            item['work_count'] = tr.xpath('./td[3]/text()').extract_first()
            item['work_place'] = tr.xpath('./td[4]/text()').extract_first()
            item['work_time'] = tr.xpath('./td[5]/text()').extract_first()
            # parsed data flows: item -> engine -> pipeline
            yield item

    def parse_detail(self, response):
        # parse the detail page
        ul_list = response.xpath('//ul[@class="squareli"]')
        item = TestcrawlspiderDetailItem()
        item['work_duty'] = ''.join(ul_list[0].xpath('./li/text()').extract())
        item['work_requir'] = ''.join(ul_list[1].xpath('./li/text()').extract())
        # parsed data flows: item -> engine -> pipeline
        yield item
items.py
import scrapy
class TestcrawlspiderListItem(scrapy.Item):
    # target fields for the list page
    work_name = scrapy.Field()
    work_type = scrapy.Field()
    work_count = scrapy.Field()
    work_place = scrapy.Field()
    work_time = scrapy.Field()

class TestcrawlspiderDetailItem(scrapy.Item):
    # target fields for the detail page
    work_duty = scrapy.Field()
    work_requir = scrapy.Field()
pipelines.py
import json
from TestCrawlSpider.items import TestcrawlspiderDetailItem, TestcrawlspiderListItem

class TestcrawlspiderListPipeline(object):
    def open_spider(self, spider):
        self.file = open('list.json', 'w')

    def process_item(self, item, spider):
        # only store the item if it comes from a list page
        if isinstance(item, TestcrawlspiderListItem):
            dict_item = dict(item)
            str_item = json.dumps(dict_item) + '\n'
            self.file.write(str_item)
        return item

    def close_spider(self, spider):
        self.file.close()

class TestcrawlspiderDetailPipeline(object):
    def open_spider(self, spider):
        self.file = open('detail.json', 'w')

    def process_item(self, item, spider):
        # only store the item if it comes from a detail page
        if isinstance(item, TestcrawlspiderDetailItem):
            dict_item = dict(item)
            str_item = json.dumps(dict_item) + '\n'
            self.file.write(str_item)
        return item

    def close_spider(self, spider):
        self.file.close()
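One detail worth knowing when items contain Chinese text: json.dumps escapes non-ASCII characters by default, so the stored files hold \uXXXX sequences. Passing ensure_ascii=False keeps them readable. A small stdlib demonstration, not part of the project code (the item data is hypothetical):

```python
import json

item = {'work_name': '爬蟲工程師'}  # hypothetical item data
print(json.dumps(item))                      # escapes to \u722c\u87f2...
print(json.dumps(item, ensure_ascii=False))  # keeps the characters readable
```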
middlewares.py
import random
from TestCrawlSpider.settings import USER_AGENT_LIST

# middleware that sets a random User-Agent on every request
class UserAgentMiddleWares(object):
    def process_request(self, request, spider):
        # pick a random User-Agent
        user_agent = random.choice(USER_AGENT_LIST)
        # overwrite the request's User-Agent header
        request.headers['User-Agent'] = user_agent
        print('***' * 30)
        print(user_agent)

# middleware that sets a proxy on every request
class ProxyMiddleWares(object):
    def process_request(self, request, spider):
        # pick a proxy (paid or free)
        proxy = 'http://162.138.3.1:8888'
        # set the proxy through the request's meta
        request.meta['proxy'] = proxy
        print('----' * 30)
        print(proxy)
settings.py
BOT_NAME = 'TestCrawlSpider'
SPIDER_MODULES = ['TestCrawlSpider.spiders']
NEWSPIDER_MODULE = 'TestCrawlSpider.spiders'
DOWNLOADER_MIDDLEWARES = {
    'TestCrawlSpider.middlewares.UserAgentMiddleWares': 543,
    'TestCrawlSpider.middlewares.ProxyMiddleWares': 888,
}
ITEM_PIPELINES = {
    'TestCrawlSpider.pipelines.TestcrawlspiderListPipeline': 300,
    'TestCrawlSpider.pipelines.TestcrawlspiderDetailPipeline': 301,
}
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]
Request&Response
- A request object is an instance of the Request class: it bundles everything needed to issue a request.
- Common parameters:
url: the URL to request and then process.
callback: the function that will handle the Response returned by this request.
method: usually not specified; defaults to GET. May be set to "GET", "POST", "PUT", etc., and must be an uppercase string.
headers: the request headers. Usually not needed. Typical contents:
# familiar to anyone who has written a crawler by hand
Host: media.readthedocs.org
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
Accept: text/css,*/*;q=0.1
Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
Cookie: _ga=GA1.2.1612165614.1415584110;
Connection: keep-alive
If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
Cache-Control: max-age=0
meta: frequently used; a dict for passing data between different requests.
    request_with_cookies = Request(
        url="http://www.example.com",
        cookies={'currency': 'USD', 'country': 'UY'},
        meta={'dont_merge_cookies': True}
    )
encoding: the default 'utf-8' is fine.
errback: the function called when the request fails.
- A response object is an instance of the Response class: it carries the information the server sent back.
- Common attributes:
status: the HTTP status code
body: the response body
url: the URL this response came from
request: the Request object that produced this response
DownloadMiddlewares
- Recall the Scrapy architecture diagram.
- Where there are crawlers there is anti-crawling, and where there is anti-crawling there is anti-anti-crawling... the struggle between the two never ends.
- Scrapy's built-in features handle basic crawling tasks; with CrawlSpider, simply setting follow=True is enough to crawl large numbers of list and detail pages automatically. Most sites are not that easy to crawl, though. Then extra measures are needed, configured as middlewares on top of the basic settings.
- Scrapy has two kinds of middleware: spider middlewares and downloader middlewares. Downloader middlewares are the common choice; they let you modify a request object just before it is actually sent.
Common anti-anti-crawling measures:
- simulating user login (capturing the cookies automatically) / disabling cookies
- setting a random User-Agent
- setting a random proxy IP
- throttling the request rate
- driving a browser with Selenium and extracting data from the rendered page
- ...
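Rate throttling, for instance, can be done with the built-in DOWNLOAD_DELAY setting, or sketched as a downloader middleware like the one below. The class name and delay range are assumptions for illustration, not Scrapy API:

```python
import random
import time

# hypothetical throttling middleware: sleep a random interval before each request
class RandomDelayMiddleware(object):
    def __init__(self, min_delay=0.5, max_delay=1.5):
        self.min_delay = min_delay
        self.max_delay = max_delay

    def process_request(self, request, spider):
        delay = random.uniform(self.min_delay, self.max_delay)
        time.sleep(delay)
        return None  # let the request continue through the middleware chain
```

Like the User-Agent and proxy middlewares shown earlier, it would be enabled through DOWNLOADER_MIDDLEWARES in settings.py.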
Example: simulating a user login to GitHub and fetching the signed-in page
# -*- coding: utf-8 -*-
import scrapy

class GithubSpider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    # simulated login: Scrapy keeps the session cookies automatically
    def parse(self, response):
        login_url = 'https://github.com/session'
        formdata = {
            'login': 'your-username',
            'password': 'your-password',
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': response.xpath('//*[@id="login"]/form/input[2]/@value').extract_first()
        }
        # send the login request as a POST
        yield scrapy.FormRequest(url=login_url, formdata=formdata, callback=self.parse_logined)

    def parse_logined(self, response):
        # response.body is bytes, so open the file in binary mode
        with open('222githublogin.html', 'wb') as f:
            f.write(response.body)
Settings
- The settings configure the crawler project as a whole. The available options include:
# -*- coding: utf-8 -*-
# Scrapy settings for GitHub project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'GitHub'
SPIDER_MODULES = ['GitHub.spiders']
NEWSPIDER_MODULE = 'GitHub.spiders'
# log file
# LOG_FILE = 'github.log'
# log level: one of five levels - DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# concurrency: how many requests are in flight at once (limited mainly by hardware)
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# delay between batches of requests
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# disable cookies
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'GitHub.middlewares.GithubSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'GitHub.middlewares.GithubDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'GitHub.pipelines.GithubPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'