Getting Started with Scrapy

Web crawling

Blog reference

  • Installation

    • Linux: pip3 install scrapy

    • Windows

      • a. pip3 install wheel
      • b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
      • c. From the download directory, run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
      • d. pip3 install scrapy
      • e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/
  • Shell debugging (install ipython via pip first for a nicer shell):
    scrapy shell "http://www.baidu.com"

    • View the response headers: response.headers
    • View the response body: response.body
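    • A few more helpers available inside the shell (the search URL and selectors below are only illustrative):
fetch('http://www.baidu.com/s?wd=scrapy')         # download another page without leaving the shell
response.xpath('//title/text()').extract_first()  # run an XPath selector against the current response
response.css('title::text').extract_first()       # the same query as a CSS selector
view(response)                                    # open the downloaded response in a browser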
  • Create a project:
    scrapy startproject first_obj
    Directory structure (a fuller generated tree is sketched after this list):

    • first_obj directory
      • middlewares: middleware
      • items: item definitions (structured data)
      • pipelines: persistence
      • settings: configuration file
    • scrapy.cfg: project configuration info
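    For reference, a freshly generated project usually looks roughly like this (the exact files vary a little between Scrapy versions):

first_obj/
├── scrapy.cfg
└── first_obj/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py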
  • Create a spider:
    cd first_obj
    scrapy genspider baidu baidu.com
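    genspider drops a spider skeleton into first_obj/spiders/baidu.py; depending on the Scrapy version it looks roughly like this:

import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # parsing logic goes here
        pass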

  • Run the spider:
    scrapy crawl baidu [--nolog] [-o baidu.json]

  • Other commands:

    1. List all available spiders in the current project: scrapy list
    2. Download the given URL and write the fetched content to standard output: scrapy fetch <url>
    3. Open the given URL in a browser, rendered the way the Scrapy spider would receive it: scrapy view <url>
    4. Read a Scrapy setting value: scrapy settings --get <SETTING_NAME>
    5. Show the Scrapy version: scrapy version
    6. Run a quick benchmark: scrapy bench
  • Configuration file:
    settings.py:
    ROBOTSTXT_OBEY: whether to respect the site's robots.txt rules
    All setting names must be uppercase
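    A few other settings that are commonly adjusted early on (the values below are only examples):

# settings.py
ROBOTSTXT_OBEY = False      # ignore robots.txt (crawl responsibly)
USER_AGENT = 'Mozilla/5.0'  # example: send a browser-like User-Agent
DOWNLOAD_DELAY = 1          # wait 1 second between requests
CONCURRENT_REQUESTS = 16    # maximum concurrent requests (Scrapy's default)
LOG_LEVEL = 'INFO'          # quieter logging than the default DEBUG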

  • Basic usage

  1. selector
from scrapy.selector import Selector

# Wrap the response in a Selector and query it with XPath
hxs = Selector(response=response)
img_list = hxs.xpath("//div[@class='item']")
for item in img_list:
    # Relative XPath ("./...") runs against each matched item node
    title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
    url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
  2. yield
from scrapy.http import Request

# Follow pagination links and feed them back through parse
page_list = hxs.xpath('//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()

for page in page_list:
    yield Request(url=page, callback=self.parse)
  3. pipeline
  • chouti.py
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request

from ..items import ChouTiItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        hxs = Selector(response=response)
        img_list = hxs.xpath("//div[@class='item']")
        for item in img_list:
            title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
            url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
            obj = ChouTiItem(title=title, url=url)
            yield obj
  • items.py
import scrapy


class ChouTiItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
  • pipelines.py
    • Pipeline execution order:
      1. Check whether the Pipeline class defines from_crawler
      If it does: obj = Pipeline.from_crawler(crawler)
      If it does not: obj = Pipeline()
      2. Spider opens: obj.open_spider()
      3. while True:
      the spider runs, parse… yields items
      obj.process_item() is called for each yielded item
      4. Spider closes: obj.close_spider()
    • Usually implementing process_item is all you need
from scrapy.exceptions import DropItem


class SavePipeline(object):
    def __init__(self, v):
        # v is the value of the custom SIX setting passed in by from_crawler
        self.v = v
        self.file = open('chouti.txt', 'a+')

    def process_item(self, item, spider):
        # Persist the item; returning it hands it on to later pipelines.
        # To discard the item so no later pipeline sees it, raise DropItem() instead.
        self.file.write(str(dict(item)) + '\n')
        return item

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at start-up to create the pipeline object
        :param crawler:
        :return:
        """
        val = crawler.settings.get('SIX')
        return cls(val)

    def open_spider(self, spider):
        """
        Called when the spider starts
        :param spider:
        :return:
        """
        print('spider opened')

    def close_spider(self, spider):
        """
        Called when the spider closes
        :param spider:
        :return:
        """
        self.file.close()
        print('spider closed')
  • settings.py
# The integer after each pipeline determines its running order: items pass through
# pipelines from the lowest number to the highest; values are normally kept in the 0-1000 range.
# Once raise DropItem() is hit, later pipelines are skipped.
ITEM_PIPELINES = {
   'fone.pipelines.SavePipeline': 300,
}
  • Note: ITEM_PIPELINES in settings.py is global (it applies to every spider). To special-case an individual spider, check spider.name inside the pipeline methods in pipelines.py (an alternative using custom_settings is sketched after this snippet):
    def process_item(self, item, spider):
        if spider.name == 'chouti':
            pass
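    Alternatively, Scrapy's custom_settings class attribute lets a single spider override ITEM_PIPELINES for itself (a minimal sketch reusing this project's SavePipeline):
import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    start_urls = ['https://dig.chouti.com/']
    # Overrides the project-wide ITEM_PIPELINES for this spider only
    custom_settings = {
        'ITEM_PIPELINES': {'fone.pipelines.SavePipeline': 300},
    }

    def parse(self, response):
        pass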
  4. Deduplication
  • Default deduplication settings:
    DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
    DUPEFILTER_DEBUG = False
    # Directory where the log of visited requests is kept, e.g. "/root/" results in /root/requests.seen
    JOBDIR = ""
  • Custom deduplication rules:
    1. Create a file rfd.py; the key method to implement is request_seen
class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at start-up to create the dupefilter object
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been visited
        :param request:
        :return: True if it has been visited; False if not
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when the spider finishes crawling
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log a duplicate request
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)
  2. settings.py
DUPEFILTER_CLASS = 'fone.rfd.RepeatUrl'
  5. Custom extensions
  • Create extensions.py
from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('SIX')
        ext = cls(val)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
  • settings.py
EXTENSIONS = {
   'fone.extensions.MyExtension': 100,
}
  • For more hooks, see the signals exposed by from scrapy import signals
  6. Middleware
    middlewares.py is generated when the project is created
  • Spider middleware (SpiderMiddleware) methods and what they do (execution order in parentheses; a minimal skeleton follows the list):

    • process_spider_input: called once a response has been downloaded, before it is handed to parse (2)
    • process_spider_output: called with what the spider returns; must return an iterable containing Request or Item objects (3)
    • process_spider_exception: called on exceptions; return None to let later middleware keep handling the exception, or an iterable of Request or Item objects to hand to the scheduler or pipelines (4)
    • process_start_requests: called when the spider starts; must return an iterable of Request objects (1)
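    A minimal skeleton with these four hooks (the class name is arbitrary; enable it via SPIDER_MIDDLEWARES in settings.py):
class CustomSpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # Downloaded response on its way into the spider; return None to continue
        return None

    def process_spider_output(self, response, result, spider):
        # Whatever the spider returned; must yield Request/Item objects
        for item in result:
            yield item

    def process_spider_exception(self, response, exception, spider):
        # Return None to keep propagating, or an iterable of Request/Item objects
        return None

    def process_start_requests(self, start_requests, spider):
        # The spider's start requests; must yield Request objects
        for request in start_requests:
            yield request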
  • Downloader middleware (DownloaderMiddleware) example code and notes (registration in settings.py is shown after the example):

    • Downloader middleware has the widest range of uses, especially returning None or a Response object from process_request
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that is about to be downloaded, through each downloader middleware
        :param request:
        :param spider:
        :return:
            None: continue to the next middleware and download as usual
            Response object: stop running process_request and start running process_response
            Request object: stop the middleware chain and put the Request back into the scheduler
            raise IgnoreRequest: stop running process_request and start running process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called with the downloaded response on its way back to the spider
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: handed to the remaining middlewares' process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when a download handler or process_request (of a downloader middleware) raises an exception
        :param request:
        :param exception:
        :param spider:
        :return:
            None: keep handing the exception to later middleware
            Response object: stop running the remaining process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
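    Like pipelines and extensions, a downloader middleware only runs once it is registered in settings.py (the path below assumes DownMiddleware1 lives in this project's middlewares.py):
DOWNLOADER_MIDDLEWARES = {
   'fone.middlewares.DownMiddleware1': 543,
}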
  7. Custom commands
  • Create a directory (any name, e.g. commands) at the same level as spiders
  • Inside it create crawlall.py (the filename becomes the command name)
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
  • Add COMMANDS_MODULE = 'project_name.directory_name' to settings.py
  • Run scrapy crawlall from the project directory