Introduction to middleware
Let's first look at the rough flow of how Scrapy fetches data. Middleware sits at two points: around the downloader (downloader middleware) and around the spider (spider middleware). Setting a proxy or swapping the request header must happen on the leg between the scheduler and the downloader.
Downloader middleware
from scrapy import signals


class ScrapyTextDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # Scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that passes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - return a Response object
        # - return a Request object
        # - raise IgnoreRequest: the process_exception() methods of the
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called when the downloader has finished the HTTP request and is
        # passing the response back to the engine.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when the downloader or a process_request() (from another
        # downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops the process_exception() chain
        # - return a Request object: stops the process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
The process_request() method of a downloader middleware runs before the request is sent, so changing the proxy or the request header is simply a matter of doing it inside process_request().
Changing the request header
Handy reference sites
- View your current request headers (http://httpbin.org/user-agent)
- User-agent listing sites, which collect large numbers of request headers
Let's first make a simple request to see what we send by default: the request header is "user-agent": "Scrapy/2.0.1 (+https://scrapy.org)".
In the middleware module we can add a random-user-agent class of our own:
import random


class 隨機請求頭(object):
    def process_request(self, request, spider):
        # pick a random UA string from the PC_請求頭 list in settings.py
        user_agent = random.choice(spider.settings.get("PC_請求頭"))
        request.headers['User-Agent'] = user_agent
In settings.py we need to enable our custom middleware under DOWNLOADER_MIDDLEWARES and add the user-agent list.
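A minimal sketch of that settings change, assuming the middleware lives in myproject/middlewares.py (the project name is a placeholder for your own):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # enable our random user-agent middleware; 543 is the conventional
    # example priority from Scrapy's generated settings template
    'myproject.middlewares.隨機請求頭': 543,
}
```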
Request headers
Below are some relatively recent PC user-agent strings I have collected myself.
# Common PC user-agent strings
PC_請求頭 = [
    # Chrome
    # Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    # Linux
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36",
    # macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2762.73 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2656.18 Safari/537.36",
    # Firefox
    # Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:77.0) Gecko/20190101 Firefox/77.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/75.0",
    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:71.0) Gecko/20100101 Firefox/71.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:69.2.1) Gecko/20100101 Firefox/69.2",
    "Mozilla/5.0 (Windows NT 6.1; rv:68.7) Gecko/20100101 Firefox/68.7",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0",
    # Linux
    "Mozilla/5.0 (X11; Linux ppc64le; rv:75.0) Gecko/20100101 Firefox/75.0",
    "Mozilla/5.0 (X11; Linux; rv:74.0) Gecko/20100101 Firefox/74.0",
    "Mozilla/5.0 (X11; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0",
    # macOS
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.10; rv:75.0) Gecko/20100101 Firefox/75.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/73.0",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.10; rv:62.0) Gecko/20100101 Firefox/62.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:10.0) Gecko/20100101 Firefox/62.0",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.13; ko; rv:1.9.1b2) Gecko/20081201 Firefox/60.0"
]
Setting a proxy
View your current IP (http://httpbin.org/ip)
Again, start with a test spider, much like the one above:
# -*- coding: utf-8 -*-
import scrapy


class TestIpSpider(scrapy.Spider):
    name = 'test_ip'
    allowed_domains = ['httpbin.org']  # domain only; a path here would break offsite filtering
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
To switch to a proxy, set it in the request's meta dict under request.meta['proxy']. Based on that, we can add a proxy as shown below.
class 代理池(object):
    def process_request(self, request, spider):
        # the address below is only an example; substitute a live proxy
        request.meta['proxy'] = 'http://123.207.57.145:1080'
        print(request.meta['proxy'])
Building a simple proxy pool
Free proxies are unstable and short-lived, so if you genuinely depend on proxies I recommend paying for a service. Among the free sources, 小幻代理 (ip.ihuan.me) has worked relatively well for me, and it offers an API. The proxy pool below is built on it; if you buy a paid proxy that returns JSON, processing becomes even easier.
# -*- coding: utf-8 -*-
import scrapy


class A代理提取Spider(scrapy.Spider):
    name = '代理提取'
    allowed_domains = ['ip.ihuan.me']
    start_urls = ['https://ip.ihuan.me/address/5Lit5Zu9.html']

    def parse(self, response):
        ip_tr = response.xpath('//*[@class="table-responsive"]/table/tbody/tr')
        for td in ip_tr:
            item = {}
            item['ip'] = td.xpath('./td[1]/a/text()').extract_first()
            item['端口'] = td.xpath('./td[2]/text()').extract_first()
            item['位置'] = td.xpath('./td[3]/a[1]/text()').extract_first()
            print(item)
            yield item
        ip_page = response.xpath('//*[@class="pagination"]/li/a[@aria-label="Next"]/@href').extract_first()
        if ip_page:
            # response.follow resolves relative URLs against the current page,
            # so no manual concatenation with start_urls is needed
            print(ip_page)
            yield response.follow(url=ip_page, callback=self.parse)
Once we have the IPs we can pass them to wherever a proxy is needed, or store them; but free IPs usually expire quickly, so storing them is of limited value.
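Putting the pieces together, here is a sketch of a middleware that picks a random proxy from a list. The PROXY_LIST setting name is my own assumption; the item keys match those yielded by the extraction spider above:

```python
import random


class RandomProxyMiddleware(object):
    # Hypothetical middleware: choose a random proxy from a PROXY_LIST
    # setting filled with items like those the extraction spider yields.
    def process_request(self, request, spider):
        proxies = spider.settings.get('PROXY_LIST') or []
        if proxies:
            p = random.choice(proxies)  # e.g. {'ip': '1.2.3.4', '端口': '8080'}
            request.meta['proxy'] = 'http://%s:%s' % (p['ip'], p['端口'])
```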