[Crawler] Scrapy: Writing a Custom Downloader Middleware

[Original link] https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

 

Writing your own downloader middleware

Each middleware component is a Python class that defines one or more of the following methods:

class scrapy.downloadermiddlewares.DownloaderMiddleware

Note

Any of the downloader middleware methods may also return a Deferred.

process_request(request, spider)

This method is called for each request that goes through the downloader middleware.

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request is performed (and its response downloaded).

If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it will return that response instead. The process_response() methods of the installed middleware are always called on every response.

If it returns a Request object, Scrapy will stop calling process_request() methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

Parameters:
  • request (Request object) – the request being processed
  • spider (Spider object) – the spider for which this request is intended
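As a sketch of the most common case, returning None, a middleware can mutate the outgoing request and let processing continue down the chain. The `Request` class and the `CustomHeaderMiddleware` name below are minimal illustrative stand-ins, not the real Scrapy classes:

```python
# Illustrative sketch only: Request is a minimal stand-in for
# scrapy.http.Request, and CustomHeaderMiddleware is a hypothetical name.

class Request:
    """Minimal request stub: just a URL and a headers dict."""
    def __init__(self, url, headers=None):
        self.url = url
        self.headers = headers or {}

class CustomHeaderMiddleware:
    """Sets a default header on every request, then returns None so
    Scrapy keeps executing the remaining middlewares and, eventually,
    the appropriate download handler."""
    def process_request(self, request, spider):
        request.headers.setdefault("User-Agent", "my-bot/1.0")
        return None  # continue down the middleware chain
```

Returning a Response object here instead would short-circuit the download entirely, which is how cache middlewares are typically built.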

process_response() (omitted here)

process_exception(request, exception, spider)

Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).

process_exception() should return either None, a Response object, or a Request object.

If it returns None, Scrapy will continue processing this exception, executing any other process_exception() methods of the installed middleware, until no middleware is left and the default exception handling kicks in.

If it returns a Response object, the process_response() method chain of the installed middleware is started, and Scrapy won't bother calling any other process_exception() methods of middleware.

If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This stops the execution of the process_exception() methods of the middleware, the same as returning a response would.

Parameters:
  • request (Request object) – the request that generated the exception
  • exception (an Exception object) – the raised exception
  • spider (Spider object) – the spider for which this request is intended
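The return-a-Request path can be sketched as a tiny retry middleware. `Request` and `RetryOnceMiddleware` below are hypothetical stand-ins for illustration, not Scrapy's built-in RetryMiddleware:

```python
# Illustrative sketch: Request is a minimal stand-in, not scrapy.http.Request.

class Request:
    def __init__(self, url):
        self.url = url

class RetryOnceMiddleware:
    """On the first failure of a URL, returns a fresh Request so the
    download is rescheduled (which also stops the process_exception
    chain); afterwards returns None so the remaining middlewares and
    the default exception handling run."""
    def __init__(self, max_retries=1):
        self.max_retries = max_retries
        self.attempts = {}  # url -> number of retries issued so far

    def process_exception(self, request, exception, spider):
        tries = self.attempts.get(request.url, 0)
        if tries < self.max_retries:
            self.attempts[request.url] = tries + 1
            return Request(request.url)  # rescheduled for download
        return None  # fall through to other process_exception methods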

from_crawler(cls, crawler)

If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the middleware to access them and hook its functionality into Scrapy.

Parameters:

crawler (Crawler object) – crawler that uses this middleware
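The from_crawler hook is typically used to read settings at construction time. In the sketch below, `FakeCrawler` mimics just enough of the real Crawler API (a `settings` object with a `get()` method) to show the pattern; the `USER_AGENT` key and class names are assumptions for illustration:

```python
# Illustrative sketch: FakeCrawler stands in for scrapy.crawler.Crawler,
# using a plain dict (which has .get) in place of the Settings object.

class FakeCrawler:
    def __init__(self, settings):
        self.settings = settings  # plain dict standing in for Settings

class SettingsAwareMiddleware:
    """Built via from_crawler so it can pull configuration from
    crawler.settings instead of hard-coding it."""
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # must return a new instance of the middleware
        return cls(user_agent=crawler.settings.get("USER_AGENT", "my-bot/1.0"))
```

With the real Crawler, the same classmethod could also connect signal handlers via crawler.signals before returning the instance.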
