- Increase concurrency
  - By default Scrapy performs 16 concurrent requests (not threads — Scrapy is asynchronous), and this can be raised where the target can handle it. In settings.py set `CONCURRENT_REQUESTS = 100`.
- Lower the log level
  - Scrapy emits a large volume of log output while running. To reduce CPU usage, restrict logging to INFO or ERROR: `LOG_LEVEL = 'INFO'`.
- Disable cookies
  - Unless cookies are actually needed, disable them during crawling to reduce CPU usage and improve crawl speed: `COOKIES_ENABLED = False`.
- Disable retries
  - Re-requesting failed HTTP responses (retrying) slows the crawl, so retries can be disabled: `RETRY_ENABLED = False`.
- Reduce the download timeout
  - When crawling very slow links, a shorter download timeout lets stuck requests be abandoned quickly, improving overall efficiency: `DOWNLOAD_TIMEOUT = 10`.
```python
# settings.py
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
COOKIES_ENABLED = False
LOG_LEVEL = 'ERROR'
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 3

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
DOWNLOAD_DELAY = 3
```
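Besides the global settings.py, the same overrides can be applied to a single spider through the `custom_settings` class attribute. The sketch below shows this; the spider name and start URL are hypothetical, and the try/except fallback only exists so the sketch can be read and run even where Scrapy is not installed:

```python
# A minimal sketch of applying the tuning per spider via custom_settings
# instead of globally in settings.py. Spider name and URL are hypothetical.
try:
    import scrapy
    _Base = scrapy.Spider
except ImportError:  # fallback so the sketch runs without Scrapy installed
    _Base = object

class FastSpider(_Base):
    name = "fast"
    start_urls = ["https://example.com"]

    # These overrides apply only to this spider, not to the whole project.
    custom_settings = {
        "CONCURRENT_REQUESTS": 100,
        "COOKIES_ENABLED": False,
        "LOG_LEVEL": "INFO",
        "RETRY_ENABLED": False,
        "DOWNLOAD_TIMEOUT": 10,
    }

    def parse(self, response):
        yield {"url": response.url}
```

Per-spider overrides are handy when only one aggressive crawl in a project needs these settings while the others keep the defaults.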