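DOWNLOADER_MIDDLEWARES (downloader middleware)
Code
A downloader middleware processes every request and response, and handles exceptions raised while a request is being processed.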
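# from scrapy import signals  # needed only if the signal hookup below is enabled

class TestDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to create the middleware instance.
        s = cls()
        # crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request before it is downloaded.
        spider.logger.info('process_request: %s' % request.url)
        return None

    def process_response(self, request, response, spider):
        # Called with the downloaded response before it reaches the spider.
        spider.logger.info('process_response: %s' % request.url)
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download or a process_request() raises an exception.
        spider.logger.info('process_exception: %s' % request.url)
        return None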
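Notes
from_crawler: creates the middleware instance; extra signal listeners can be connected here (see signals).
process_request: called for each request that passes through the downloader middleware, before it is downloaded.
process_response: called after the downloader returns a response and before the spider parses it.
process_exception: called when the download handler or a process_request() method raises an exception.
Return values: in process_request, returning None continues processing the request, returning a Request stops the current one and schedules the new request, returning a Response skips the download and passes that response to process_response, and raising IgnoreRequest hands control to process_exception. process_response must return a Response or a Request (or raise IgnoreRequest); process_exception may return None, a Response, or a Request. A returned Request can be created with dont_filter=True so that the scheduler's duplicate filter does not drop it.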
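The return-value rules above can be sketched with a small retry middleware. This is only an illustration, not a replacement for Scrapy's built-in RetryMiddleware: the class name, the status_retries meta key, and the MAX_STATUS_RETRIES setting are all hypothetical names chosen for this example.

```python
class RetryOnStatusDownloaderMiddleware:
    """Sketch: re-schedule a request when the response has a retryable status."""

    def __init__(self, max_retries=2, retry_statuses=(503,)):
        self.max_retries = max_retries
        self.retry_statuses = set(retry_statuses)

    @classmethod
    def from_crawler(cls, crawler):
        # MAX_STATUS_RETRIES is a hypothetical setting name used only here.
        return cls(max_retries=crawler.settings.getint('MAX_STATUS_RETRIES', 2))

    def process_request(self, request, spider):
        # None tells Scrapy to keep processing this request normally.
        return None

    def process_response(self, request, response, spider):
        retries = request.meta.get('status_retries', 0)  # hypothetical meta key
        if response.status in self.retry_statuses and retries < self.max_retries:
            spider.logger.info('retrying %s (attempt %d)', request.url, retries + 1)
            new_meta = dict(request.meta, status_retries=retries + 1)
            # dont_filter=True keeps the duplicate filter from dropping the
            # re-scheduled request, whose URL has already been seen.
            return request.replace(meta=new_meta, dont_filter=True)
        # Any other response continues on toward the spider.
        return response

    def process_exception(self, request, exception, spider):
        # None leaves the exception to other middlewares or the request errback.
        return None
```

Returning a Request from process_response drops the current response and sends the new request back through the scheduler, which is why dont_filter=True matters in this case.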
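Configuration
- Global configuration, in settings.py (replace project_name with the name of your project package):
  DOWNLOADER_MIDDLEWARES = { 'project_name.middlewares.TestDownloaderMiddleware': 100, }
- Per-spider configuration, added to the spider class:
  custom_settings = { "DOWNLOADER_MIDDLEWARES": { 'project_name.middlewares.TestDownloaderMiddleware': 100, } }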
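The number assigned to each middleware is its order: process_request() methods run in increasing order and process_response() methods in decreasing order. A minimal settings.py sketch follows, assuming a second, hypothetical middleware named AnotherDownloaderMiddleware; project_name is a placeholder as well.

```python
# settings.py (fragment); "project_name" and both custom paths are placeholders.
DOWNLOADER_MIDDLEWARES = {
    # Lower numbers sit closer to the engine: their process_request() runs
    # earlier and their process_response() runs later.
    'project_name.middlewares.TestDownloaderMiddleware': 100,
    # Hypothetical second middleware, placed closer to the downloader.
    'project_name.middlewares.AnotherDownloaderMiddleware': 600,
    # Setting a built-in middleware to None disables it.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```

The same dictionary can be placed under the "DOWNLOADER_MIDDLEWARES" key of a spider's custom_settings to apply it to that spider only.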
SPIDER_MIDDLEWARES (spider middleware)
Code
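# from scrapy import signals  # needed only if the signal hookup below is enabled

class TestSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to create the middleware instance.
        s = cls()
        # crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response on its way into the spider.
        spider.logger.info('process_spider_input: %s' % response)
        return None

    def process_spider_output(self, response, result, spider):
        # Called with what the spider callback returned for this response.
        spider.logger.info('process_spider_output: %s' % result)
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when the spider callback or a process_spider_output() raises.
        spider.logger.info('process_spider_exception: %s' % exception)
        return None

    def process_start_requests(self, start_requests, spider):
        # Called with the spider's start requests; yield only Request objects.
        spider.logger.info('process_start_requests: %s' % start_requests)
        for r in start_requests:
            yield r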
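Notes
from_crawler: creates the middleware instance; extra signal listeners can be connected here (see signals).
process_spider_input: called after the downloader middlewares have run, for each response on its way into the spider. Return None to continue; if it raises an exception, the request's errback is called if there is one, otherwise process_spider_exception() handles it.
process_spider_output: called after the spider callback has processed the response, with the results it returned. It must return an iterable of Request or item objects.
process_spider_exception: called when the spider callback or a process_spider_output() method raises an exception. It returns None or an iterable of Request or item objects.
process_start_requests: called with the spider's start requests before crawling begins. The return value must be an iterable that contains only Request objects (not bare URLs and not items).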
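As an illustration of the two generator hooks described above, here is a minimal sketch of a spider middleware that filters the spider's output and tags start requests. It assumes items are plain dicts and that spider callbacks are synchronous; the class name, the title field, and the is_start_request meta key are hypothetical.

```python
class RequireFieldsSpiderMiddleware:
    """Sketch: drop dict items that lack required fields; pass requests through."""

    required_fields = ('title',)  # hypothetical field name for illustration

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Assumption: items are plain dicts; anything else (for example
            # a Request) is passed through untouched.
            if isinstance(obj, dict) and any(f not in obj for f in self.required_fields):
                spider.logger.info('dropping incomplete item: %r', obj)
                continue
            yield obj

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            # Tag each start request; only Request objects may be yielded here.
            request.meta['is_start_request'] = True  # hypothetical meta key
            yield request
```

Because both methods are generators, nothing runs until Scrapy iterates over the returned value, so the filtering happens lazily as results are consumed.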
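Configuration
- Global configuration, in settings.py (replace project_name with the name of your project package):
  SPIDER_MIDDLEWARES = { 'project_name.middlewares.TestSpiderMiddleware': 100, }
- Per-spider configuration, added to the spider class:
  custom_settings = { "SPIDER_MIDDLEWARES": { 'project_name.middlewares.TestSpiderMiddleware': 100, } }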