[Crawler] Python Scrapy Basic Concepts: Requests and Responses

[Original article] https://doc.scrapy.org/en/latest/topics/request-response.html

 

Scrapy uses Request and Response objects for crawling web sites.

Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.
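As a minimal illustrative sketch of this cycle (the spider name and URL below are assumptions made up for the example, not from the original page):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                          # hypothetical spider name
    start_urls = ["http://www.example.com/"]  # hypothetical URL

    def parse(self, response):
        # The Response produced by the Downloader travels back here,
        # to the spider that issued the Request.
        self.logger.info("Visited %s", response.url)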

Both the Request and Response classes have subclasses which add functionality not required in the base classes. These are described below in Request subclasses and Response subclasses.

Request objects

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])

A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.

Parameters:
  • url (string) – the URL of this request
  • callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
  • method (string) – the HTTP method of this request. Defaults to 'GET'.
  • meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
  • body (str or unicode) – the request body. If a unicode is passed, it is encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
  • headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
  • cookies (dict or list) – the request cookies. These can be sent in two forms.
    1. Using a dict:
      request_with_cookies = Request(url="http://www.example.com",
                                     cookies={'currency': 'USD', 'country': 'UY'})
      
    2. Using a list of dicts:
      request_with_cookies = Request(url="http://www.example.com",
                                     cookies=[{'name': 'currency',
                                              'value': 'USD',
                                              'domain': 'example.com',
                                              'path': '/currency'}])
      

    The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

    When some site returns cookies (in a response), those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies, you can set the dont_merge_cookies key to True in the Request.meta.

    Example of request without merging cookies:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})
    

    For more info see CookiesMiddleware.

  • encoding (string) – the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to str (if given as unicode).
  • priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order in which requests are processed. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
  • dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
  • errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
  • flags (list) – flags sent to the request, which can be used for logging or similar purposes. (A combined example using several of these parameters follows this list.)
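As an illustrative sketch (not from the original page), here is a Request combining several of the parameters above; the URL, header value, and meta contents are assumptions made up for the example:

import scrapy

request = scrapy.Request(
    url="http://www.example.com/api/items",     # hypothetical URL
    method="GET",
    headers={"Accept": "application/json"},     # single-valued header
    meta={"page": 1},                           # arbitrary metadata, shallow copied
    priority=10,                                # higher values execute earlier
    dont_filter=True,                           # bypass the duplicates filter
)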

url

A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the constructor.

This attribute is read-only. To change the URL of a Request use replace().

method

A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.

headers

A dictionary-like object which contains the request headers.

body

A str that contains the request body.

This attribute is read-only. To change the body of a Request use replace().

meta

A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

See Request.meta special keys for a list of special meta keys recognized by Scrapy.

This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.

copy()

Returns a new Request which is a copy of this Request. See also: Passing additional data to callback functions.

replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])

Returns a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value is given in the meta argument). See also Passing additional data to callback functions.
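A minimal sketch of replace(), assuming a request object already exists; everything not overridden (headers, meta, callback, etc.) is carried over unchanged:

# Hypothetical usage: derive a POST variant of an existing request.
new_request = request.replace(method="POST", body="name=value")
# new_request keeps request.url, request.meta, request.callback, ...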

Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.

Example:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

In some cases you may want to pass arguments to those callback functions, so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.

Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item

Using errbacks to catch exceptions in request processing

The errback of a request is a function that will be called when an exception is raised while processing it.

It receives a Twisted Failure instance as first parameter and can be used to track connection establishment timeouts, DNS errors, etc.

Here's an example spider logging all errors and catching some specific errors if needed:

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

Request.meta special keys

The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.

Those are:

bindaddress

The IP of the outgoing IP address to use for performing the request.

download_timeout

The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.

download_latency

The amount of time spent to fetch the response, since the request was started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.

download_fail_on_dataloss

Whether or not to fail on broken responses. See: DOWNLOAD_FAIL_ON_DATALOSS.

max_retry_times

This meta key is used to set the maximum retry times per request. When initialized, the max_retry_times meta key takes higher precedence over the RETRY_TIMES setting.
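As a sketch of how these special keys are set in practice (the URL and values are illustrative assumptions, scrapy is assumed imported, and bindaddress is omitted since its exact value format is not described here):

request = scrapy.Request(
    "http://www.example.com/slow-page",      # hypothetical URL
    meta={
        "download_timeout": 30,              # seconds before the downloader times out
        "download_fail_on_dataloss": False,  # don't fail on broken responses
        "max_retry_times": 5,                # overrides the RETRY_TIMES setting
    },
)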

Request subclasses

Here is the list of built-in Request subclasses. You can also subclass the Request class to implement your own custom functionality.

FormRequest objects (omitted)

Request usage examples

Using FormRequest to send data via HTTP POST

If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]

Using FormRequest.from_response() to simulate a user login

Web sites usually provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (on login pages). When scraping, you'll want these fields to be automatically pre-populated, and only override a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

Response objects

class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])

A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.

Parameters:
  • url (string) – the URL of this response
  • status (integer) – the HTTP status of the response. Defaults to 200.
  • headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
  • body (bytes) – the response body. To access the decoded text as str (unicode in Python 2) you can use response.text from an encoding-aware Response subclass, such as TextResponse.
  • flags (list) – a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
  • request (Request object) – the initial value of the Response.request attribute. This represents the Request that generated this response.

url

A string containing the URL of the response.

This attribute is read-only. To change the URL of a Response use replace().

status

An integer representing the HTTP status of the response. Example: 200, 404.

headers

A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name, or getlist() to return all header values with the specified name. For example, this call will give you all cookies in the headers:

response.headers.getlist('Set-Cookie')
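Similarly, a sketch of get() for a single-valued header (header values in Scrapy are bytes):

response.headers.get('Content-Type')   # e.g. b'text/html; charset=utf-8'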

body

The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).

This attribute is read-only. To change the body of a Response use replace().
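A minimal sketch of the distinction, assuming the response is an HTML page (so an encoding-aware TextResponse subclass is in use):

raw = response.body    # always bytes, e.g. b'<html>...'
text = response.text   # unicode str, decoded with the response encoding
                       # (only on TextResponse and subclasses)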

request

The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:

  • HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
  • Response.request.url doesn’t always equal Response.url
  • This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.

meta

A shortcut to the Request.meta attribute of the Response.request object (ie. self.request.meta).

Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.

See also

Request.meta attribute

flags

A list that contains the flags of this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. They are shown on the string representation of the Response (__str__ method), which is used by the engine for logging.

copy()

Returns a new Response which is a copy of this Response.

replace([url, status, headers, body, request, flags, cls])

Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
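A sketch of replace() on a Response, e.g. from a hypothetical downloader middleware that normalizes line endings in the body:

cleaned = response.replace(body=response.body.replace(b'\r\n', b'\n'))
# cleaned keeps response.url, response.status, response.headers, ...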

urljoin(url)

Constructs an absolute url by combining the Response’s url with a possible relative url.

This is a wrapper over urlparse.urljoin; it's merely an alias for making this call:

urlparse.urljoin(response.url, url)
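For example, assuming the response URL is http://www.example.com/a/b.html (an illustrative value):

response.urljoin('c.html')        # -> 'http://www.example.com/a/c.html'
response.urljoin('/index.html')   # -> 'http://www.example.com/index.html'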

follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None)

Returns a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.

TextResponse provides a follow() method which supports selectors in addition to absolute/relative URLs and Link objects.
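A sketch of follow() inside a spider callback; the CSS selector below is a made-up assumption about the page structure:

def parse(self, response):
    for href in response.css('a.next::attr(href)').extract():
        # follow() resolves the relative href against response.url
        yield response.follow(href, callback=self.parse)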

Response subclasses

Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.

TextResponse objects (omitted)

HtmlResponse objects (omitted)

XmlResponse objects (omitted)
