Crawlers (Scrapy)
-
Installation
-
Linux: pip3 install scrapy
-
Windows:
- a. pip3 install wheel
- b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
- c. Change into the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
- d. pip3 install scrapy
- e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/
-
-
Shell debugging (install ipython via pip for a nicer shell):
scrapy shell "http://www.baidu.com"
- Inspect the response headers: response.headers
- Inspect the response body: response.body
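A minimal interactive session inside the shell might look like this; response and fetch() are provided by the Scrapy shell, while the XPath is only an illustrative assumption:
response.status                                     # HTTP status code of the request
response.xpath('//title/text()').extract_first()    # e.g. grab the page title
fetch("http://www.baidu.com")                       # download another URL in the same session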
-
Create a project:
scrapy startproject first_obj
Directory structure:
- first_obj directory (project root)
    - scrapy.cfg: project configuration info
    - first_obj directory (the package)
        - middlewares: middleware
        - items: item definitions / data formatting
        - pipelines: persistence
        - settings: configuration file
        - spiders: spider code
-
Create a spider:
cd first_obj
scrapy genspider baidu baidu.com
-
Run the spider:
scrapy crawl baidu [--nolog] [-o baidu.json]
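For reference, scrapy genspider baidu baidu.com generates roughly the following skeleton in spiders/baidu.py (the exact template varies slightly between Scrapy versions):
import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # extraction logic goes here; anything yielded here ends up in baidu.json when run with -o baidu.json
        pass
-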
Other commands:
- List all available spiders in the current project:
scrapy list
- Download the given URL and write the fetched content to standard output:
scrapy fetch <url>
- Open the given URL in a browser, shown as the Scrapy spider would fetch it:
scrapy view <url>
- Get the value of a Scrapy setting:
scrapy settings --get SETTING_NAME
- Show the Scrapy version:
scrapy version
- Run a quick performance benchmark:
scrapy bench
-
Configuration file:
settings:
ROBOTSTXT_OBEY: whether to obey the site's robots.txt rules
All setting names must be uppercase.
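A minimal sketch of the settings.py entries assumed in the rest of these notes (SIX is a custom key read later via crawler.settings.get('SIX'); its value is arbitrary):
# settings.py (excerpt)
ROBOTSTXT_OBEY = False   # ignore robots.txt while experimenting
DEPTH_LIMIT = 2          # optional built-in setting: limit how deep the crawl follows links
SIX = 6                  # custom setting used by the pipeline/extension examples below
-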
Basic usage
- selector
from scrapy.selector import Selector
hxs = Selector(response=response)
img_list = hxs.xpath("//div[@class='item']")
for item in img_list:
    title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
    url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
- yield
page_list = hxs.xpath('//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
for page in page_list:
    yield Request(url=page, callback=self.parse)
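A parse method usually yields both extracted data and follow-up Requests; a small sketch combining the two snippets above (the spider name and XPath expressions are illustrative assumptions):
import scrapy
from scrapy.http import Request


class HotSpider(scrapy.Spider):
    name = 'hot'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # yield plain dicts (or Item objects) for the pipelines
        for item in response.xpath("//div[@class='item']"):
            yield {'title': item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract_first()}
        # yield new Requests back to the scheduler for the next pages
        for page in response.xpath(r'//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract():
            yield Request(url=response.urljoin(page), callback=self.parse)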
- pipeline
- chouti.py
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request
from ..items import ChouTiItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        hxs = Selector(response=response)
        img_list = hxs.xpath("//div[@class='item']")
        for item in img_list:
            title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
            url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
            obj = ChouTiItem(title=title, url=url)
            yield obj
- items.py
import scrapy


class ChouTiItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
- pipelines.py
- Pipeline execution order:
  1. Check whether the Pipeline class defines from_crawler
     If it does:     obj = Pipeline.from_crawler()
     If it does not: obj = Pipeline()
  2. Spider opens: obj.open_spider()
  3. while True:
         the spider runs, executes parse... and yields items
         obj.process_item() is called for each item
  4. Spider closes: obj.close_spider()
  Usually overriding process_item alone is enough.
- Example pipeline:
from scrapy.exceptions import DropItem


class SavePipeline(object):
    def __init__(self, v):
        self.val = v
        self.file = open('chouti.txt', 'a+')

    def process_item(self, item, spider):
        # persist the item here
        # returning the item passes it on to the remaining pipelines
        self.file.write(str(item) + '\n')
        return item
        # to drop the item so that later pipelines never see it:
        # raise DropItem()

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called once at startup to create the pipeline object.
        :param crawler:
        :return:
        """
        val = crawler.settings.get('SIX')
        return cls(val)

    def open_spider(self, spider):
        """
        Called when the spider starts.
        :param spider:
        :return:
        """
        print('spider opened')

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        self.file.close()
        print('spider closed')
- settings.py
# The integer after each entry determines the run order: items pass through the pipelines from the lowest number to the highest. These numbers are usually kept in the 0-1000 range.
# Once raise DropItem() is hit, the item is not passed to any later pipeline (see the example after this settings block).
ITEM_PIPELINES = {
'fone.pipelines.SavePipeline': 300,
}
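For example, with two registered pipelines the lower number runs first, and a DropItem raised there stops the item from ever reaching the higher-numbered pipeline (FilterPipeline is an illustrative name, not part of the project):
from scrapy.exceptions import DropItem


class FilterPipeline(object):
    def process_item(self, item, spider):
        if not item.get('url'):
            # SavePipeline (priority 300) never sees this item
            raise DropItem('missing url')
        return item

# settings.py
# ITEM_PIPELINES = {
#     'fone.pipelines.FilterPipeline': 200,
#     'fone.pipelines.SavePipeline': 300,
# }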
- Note: ITEM_PIPELINES in settings.py is global (every spider runs these pipelines). To treat individual spiders differently, check spider.name inside the pipeline methods in pipelines.py:
def process_item(self, item, spider):
    if spider.name == 'chouti':
        pass
- Deduplication
- Default dedup rules:
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = False
# Directory for persisting the visited-request record, e.g. /root/ ; the final path becomes /root/requests.seen
JOBDIR = ""
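The dupe filter is consulted for every Request handed to the scheduler; to deliberately bypass it for a single request, pass the standard dont_filter argument:
# re-crawl a URL even though it has been seen before
yield Request(url=page, callback=self.parse, dont_filter=True)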
- Custom dedup rules:
- Create a new file rfd.py; the key method to override is request_seen
class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization.
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been seen.
        :param request:
        :return: True if it has been visited; False if not
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts.
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when crawling ends.
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log duplicate requests.
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)
- settings.py
DUPEFILTER_CLASS = 'fone.rfd.RepeatUrl'
- Custom extensions
- Create a new extensions.py
from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('SIX')
        ext = cls(val)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
- settings.py
EXTENSIONS = {
'fone.extensions.MyExtension': 100,
}
- More extension hooks are listed in scrapy.signals; a few commonly used ones are sketched below
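A few commonly used signals an extension can hook into, registered the same way as above (the handler method names are placeholders):
# inside from_crawler:
# crawler.signals.connect(ext.engine_started, signal=signals.engine_started)   # engine has started
# crawler.signals.connect(ext.engine_stopped, signal=signals.engine_stopped)   # engine has stopped
# crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)       # an item passed all pipelines
# crawler.signals.connect(ext.item_dropped, signal=signals.item_dropped)       # an item was dropped via DropItem
# crawler.signals.connect(ext.spider_error, signal=signals.spider_error)       # a spider callback raised an exception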
- Middleware
middlewares.py is generated when the project is created
-
Spider middleware (SpiderMiddleware) methods and execution order (a skeleton sketch follows this list):
- process_spider_input: called once the download has finished, before the response is handed to parse (2)
- process_spider_output: called with the results returned by the spider; must return an iterable containing Request or Item objects (3)
- process_spider_exception: called when an exception is raised; return None to let the following middlewares keep handling the exception, or an iterable containing Response or Item objects to hand to the scheduler or pipelines (4)
- process_start_requests: called when the spider starts, with an iterable containing the Request objects (1)
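A minimal skeleton wiring those four methods together, enabled through SPIDER_MIDDLEWARES in settings.py (class name, module path and priority are assumptions following this project's naming):
class MySpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # (2) response has been downloaded and is about to enter the spider
        return None

    def process_spider_output(self, response, result, spider):
        # (3) pass through whatever the spider yielded (Requests / Items)
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # (4) return None so later middlewares keep handling the exception
        return None

    def process_start_requests(self, start_requests, spider):
        # (1) called once with the spider's start requests
        for r in start_requests:
            yield r

# settings.py
# SPIDER_MIDDLEWARES = {
#     'fone.middlewares.MySpiderMiddleware': 543,
# }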
-
Downloader middleware (DownloaderMiddleware)
Example code and notes:
- Downloader middleware has a wide range of uses; the key point is what process_request returns, in particular None versus a Response object.
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that needs to be downloaded, through all downloader middlewares.
        :param request:
        :param spider:
        :return:
            None: continue to the following middlewares and download as usual
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and hand the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called with the response returned from the downloader.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: passed on to the process_response of the other middlewares
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when a download handler or process_request() (of a downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: continue passing the exception to the following middlewares
            Response object: stop calling further process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
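Like pipelines, a downloader middleware only runs once it is registered in settings.py; the module path and priority below follow this project's naming and are otherwise assumptions:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'fone.middlewares.DownMiddleware1': 543,
}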
- Custom commands
- Create a directory at the same level as spiders, e.g. commands
- Create a crawlall.py file inside it (the file name becomes the custom command name)
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
- Add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py, e.g. COMMANDS_MODULE = 'fone.commands'
- Run scrapy crawlall from the project directory