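DOWNLOADER_MIDDLEWARES: downloader middleware
Code
A downloader middleware handles every request and response, as well as exceptions raised while a request is being processed.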
class TestDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        # To listen for signals, add `from scrapy import signals`, define
        # spider_opened on this class, and uncomment:
        # crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        spider.logger.info('process_request: %s', request.url)
        return None  # continue processing this request

    def process_response(self, request, response, spider):
        spider.logger.info('process_response: %s', request.url)
        return response  # pass the response on unchanged

    def process_exception(self, request, exception, spider):
        spider.logger.info('process_exception: %s', request.url)
        return None  # let other middlewares handle the exception
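Notes
from_crawler: creates the middleware instance. Extra signal handlers can be connected here; see the Scrapy signals documentation.
process_request: called for each request on its way to the downloader, before the download starts.
process_response: called after the downloader returns a response and before the spider parses it.
process_exception: called when the download handler or a process_request() method raises an exception.
Return values of process_request: returning None continues processing the request; returning a Request stops the current request and schedules the new one; returning a Response skips the download and goes straight to process_response; raising IgnoreRequest hands control to process_exception. When returning a Request, pass dont_filter=True so that it bypasses the duplicate filter.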
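The return-value rules above can be modeled in a few lines of plain Python. This is only a toy sketch: Request, Response, and IgnoreRequest are simplified stand-ins for the Scrapy classes of the same name, next_step is a hypothetical helper, and the hooks take only the request (the real process_request also receives the spider). It is not how Scrapy implements the dispatch.

```python
from dataclasses import dataclass


@dataclass
class Request:  # stand-in for scrapy.Request, not the real class
    url: str
    dont_filter: bool = False


@dataclass
class Response:  # stand-in for scrapy.http.Response
    url: str
    status: int = 200


class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest."""


def next_step(process_request, request):
    """Describe what happens after a single process_request call (hypothetical helper)."""
    try:
        result = process_request(request)
    except IgnoreRequest:
        return "process_exception"
    if result is None:
        return "continue to downloader"
    if isinstance(result, Request):
        return "reschedule (dont_filter=%s)" % result.dont_filter
    if isinstance(result, Response):
        return "process_response"
    raise TypeError("process_request must return None, a Request or a Response")


def pass_through(request):
    return None


def redirect(request):
    # dont_filter=True so the new request is not dropped as a duplicate
    return Request("https://example.com/new", dont_filter=True)


def cached(request):
    return Response(request.url)


def blocked(request):
    raise IgnoreRequest()


outcomes = [
    next_step(hook, Request("https://example.com"))
    for hook in (pass_through, redirect, cached, blocked)
]
print(outcomes)
```

Each hook exercises one of the four branches, so the printed list reads in the same order as the rules in the notes.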
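Configuration
- Global configuration, in settings.py (replace project_name with your project's package name):
  DOWNLOADER_MIDDLEWARES = {
      'project_name.middlewares.TestDownloaderMiddleware': 100,
  }
- Per-spider configuration, added as a class attribute on the spider:
  custom_settings = {
      "DOWNLOADER_MIDDLEWARES": {
          'project_name.middlewares.TestDownloaderMiddleware': 100,
      }
  }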
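The number assigned to each middleware sets its position in the chain: Scrapy calls process_request in increasing order of that number and process_response in decreasing order, and a value of None disables the middleware. A minimal sketch of that ordering follows; call_order is a hypothetical helper, the ProxyMiddleware and DisabledMiddleware entries are invented for illustration, and the merge with Scrapy's built-in DOWNLOADER_MIDDLEWARES_BASE is left out.

```python
def call_order(middlewares):
    """Return (process_request order, process_response order) for a settings dict."""
    # A value of None means the middleware is disabled
    enabled = {path: order for path, order in middlewares.items() if order is not None}
    request_order = sorted(enabled, key=enabled.get)
    # Responses travel back through the chain in the opposite direction
    return request_order, request_order[::-1]


settings = {
    "project_name.middlewares.ProxyMiddleware": 350,      # hypothetical
    "project_name.middlewares.TestDownloaderMiddleware": 100,
    "project_name.middlewares.DisabledMiddleware": None,  # hypothetical, disabled
}

request_order, response_order = call_order(settings)
print(request_order)
print(response_order)
```

With a value of 100, TestDownloaderMiddleware sees requests before the hypothetical ProxyMiddleware and sees responses after it.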
SPIDER_MIDDLEWARES: spider middleware
Code
class TestSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        # To listen for signals, add `from scrapy import signals`, define
        # spider_opened on this class, and uncomment:
        # crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        spider.logger.info('process_spider_input: %s', response)
        return None  # continue processing this response

    def process_spider_output(self, response, result, spider):
        spider.logger.info('process_spider_output: %s', result)
        # Pass every Request or item produced by the spider through unchanged
        yield from result

    def process_spider_exception(self, response, exception, spider):
        spider.logger.info('process_spider_exception: %s', exception)
        return None  # let other middlewares handle the exception

    def process_start_requests(self, start_requests, spider):
        spider.logger.info('process_start_requests: %s', start_requests)
        # Pass the spider's start requests through unchanged
        yield from start_requests
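Notes
from_crawler: creates the middleware instance. Extra signal handlers can be connected here; see the Scrapy signals documentation.
process_spider_input: called for each response after the downloader middlewares have run and before the spider callback. Return None to continue. If it raises an exception, Scrapy calls the request's errback if one is set and otherwise starts the process_spider_exception() chain.
process_spider_output: called with the results the spider returns after it has processed the response. It must return an iterable of Request or item objects.
process_spider_exception: called when the spider callback or a process_spider_output() method of a previous spider middleware raises an exception. Return None to let other middlewares handle it, or an iterable of Request or item objects.
process_start_requests: called with the spider's start requests before crawling begins. It must return an iterable of Request objects, not bare URLs.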
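As an illustration of process_spider_output, here is a small sketch of a middleware that drops items with an empty title and lets everything else through. DropUntitledItemsMiddleware and the "title" field are hypothetical, and Request is a simplified stand-in for scrapy.Request so that the snippet runs without Scrapy installed; items are plain dicts, which Scrapy accepts as items.

```python
from dataclasses import dataclass


@dataclass
class Request:  # stand-in for scrapy.Request, not the real class
    url: str


class DropUntitledItemsMiddleware:
    """Hypothetical spider middleware: filter out dict items without a title."""

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Drop dict items whose "title" is missing or empty
            if isinstance(obj, dict) and not obj.get("title"):
                continue
            # Requests and valid items are passed on unchanged
            yield obj


# Simulated output of a spider callback: two items and a follow-up request
callback_output = [
    {"title": "A"},
    {"title": ""},
    Request("https://example.com/page/2"),
]

middleware = DropUntitledItemsMiddleware()
kept = list(middleware.process_spider_output(None, callback_output, None))
print(kept)
```

Because the method is a generator, it still returns an iterable of Request or item objects, as the notes require.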
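Configuration
- Global configuration, in settings.py (replace project_name with your project's package name):
  SPIDER_MIDDLEWARES = {
      'project_name.middlewares.TestSpiderMiddleware': 100,
  }
- Per-spider configuration, added as a class attribute on the spider:
  custom_settings = {
      "SPIDER_MIDDLEWARES": {
          'project_name.middlewares.TestSpiderMiddleware': 100,
      }
  }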