[Crawler] Scrapy: Writing a Custom Downloader Middleware

[Original link] https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

 

Writing your own downloader middleware

Each middleware component is a Python class that defines one or more of the following methods:

class scrapy.downloadermiddlewares.DownloaderMiddleware

Note

Any of the downloader middleware methods may also return a Deferred.

process_request(request, spider)

This method is called for each request that goes through the downloader middleware.

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request is performed (and its response downloaded).

If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it will return that response instead. The process_response() methods of the installed middleware are always called on every response.

If it returns a Request object, Scrapy will stop calling process_request() methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
  • request (Request object) – the request being processed
  • spider (Spider object) – the spider for which this request is intended
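As a sketch of the most common case, returning None, a middleware can mutate the outgoing request and let processing continue down the chain. The `Request` class and the `CustomHeaderMiddleware` name below are minimal illustrative stand-ins, not the real Scrapy classes:

```python
# Illustrative sketch only: Request is a minimal stand-in for
# scrapy.http.Request, and CustomHeaderMiddleware is a hypothetical name.

class Request:
    """Minimal request stub: just a URL and a headers dict."""
    def __init__(self, url, headers=None):
        self.url = url
        self.headers = headers or {}

class CustomHeaderMiddleware:
    """Sets a default header on every request, then returns None so
    Scrapy keeps executing the remaining middlewares and, eventually,
    the appropriate download handler."""
    def process_request(self, request, spider):
        request.headers.setdefault("User-Agent", "my-bot/1.0")
        return None  # continue down the middleware chain
```

Returning a Response object here instead would short-circuit the download entirely, which is how cache middlewares are typically built.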

process_response() (omitted here)

process_exception(request, exception, spider)

Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).

process_exception() should return either None, a Response object, or a Request object.

If it returns None, Scrapy will continue processing this exception, executing any other process_exception() methods of the installed middleware, until no middleware is left and the default exception handling kicks in.

If it returns a Response object, the process_response() method chain of the installed middleware is started, and Scrapy won't bother calling any other process_exception() methods of middleware.

If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This stops the execution of the process_exception() methods of the middleware, the same as returning a response would.

Parameters:
  • request (Request object) – the request that generated the exception
  • exception (an Exception object) – the raised exception
  • spider (Spider object) – the spider for which this request is intended
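The return-a-Request path can be sketched as a tiny retry middleware. `Request` and `RetryOnceMiddleware` below are hypothetical stand-ins for illustration, not Scrapy's built-in RetryMiddleware:

```python
# Illustrative sketch: Request is a minimal stand-in, not scrapy.http.Request.

class Request:
    def __init__(self, url):
        self.url = url

class RetryOnceMiddleware:
    """On the first failure of a URL, returns a fresh Request so the
    download is rescheduled (which also stops the process_exception
    chain); afterwards returns None so the remaining middlewares and
    the default exception handling run."""
    def __init__(self, max_retries=1):
        self.max_retries = max_retries
        self.attempts = {}  # url -> number of retries issued so far

    def process_exception(self, request, exception, spider):
        tries = self.attempts.get(request.url, 0)
        if tries < self.max_retries:
            self.attempts[request.url] = tries + 1
            return Request(request.url)  # rescheduled for download
        return None  # fall through to other process_exception methods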

from_crawler(cls, crawler)

If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the middleware to access them and hook its functionality into Scrapy.

Parameters:

crawler (Crawler object) – crawler that uses this middleware
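The from_crawler hook is typically used to read settings at construction time. In the sketch below, `FakeCrawler` mimics just enough of the real Crawler API (a `settings` object with a `get()` method) to show the pattern; the `USER_AGENT` key and class names are assumptions for illustration:

```python
# Illustrative sketch: FakeCrawler stands in for scrapy.crawler.Crawler,
# using a plain dict (which has .get) in place of the Settings object.

class FakeCrawler:
    def __init__(self, settings):
        self.settings = settings  # plain dict standing in for Settings

class SettingsAwareMiddleware:
    """Built via from_crawler so it can pull configuration from
    crawler.settings instead of hard-coding it."""
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # must return a new instance of the middleware
        return cls(user_agent=crawler.settings.get("USER_AGENT", "my-bot/1.0"))
```

With the real Crawler, the same classmethod could also connect signal handlers via crawler.signals before returning the instance.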
