Scrapy Learning Notes, Part 2: Architecture and a Simple Example

1. Scrapy Architecture

Reference: https://docs.scrapy.org/en/latest/topics/architecture.html#data-flow

The architecture diagram is shown below:

[Figure: Scrapy architecture]

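As the diagram shows, Scrapy is built from components. Each component implements one specific function, and the components are independent and loosely coupled, so in large-scale applications it should be possible to deploy them in a distributed fashion.

The red arrows represent the data flow, and the other elements represent components. For now, the goal is to form a rough impression of which components Scrapy contains and how data moves between them; later study will deepen that understanding.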

2. A Simple Scrapy Example

Reference: https://docs.scrapy.org/en/latest/intro/overview.html#walk-through-of-an-example-spider

The code is as follows:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

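Let's walk through this code. The parent class of QuotesSpider is scrapy.Spider, and every subclass of scrapy.Spider is treated as a spider. This code corresponds to "SPIDERS" in the architecture diagram above.

"SPIDERS" is plural, so there can be more than one; the name member of the QuotesSpider class is the unique identifier of this particular spider.

start_urls holds the starting URLs. The scrapy.Spider class provides a default implementation of a method that builds requests from the contents of start_urls and, by default, specifies that the response to each request is handled by the class's parse method. In other words, parse is the callback.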
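To make the start_urls mechanism concrete, here is a minimal sketch in plain Python. It is a simplified model, not Scrapy's actual source: Request and ToySpider are hypothetical stand-ins written for illustration. To my understanding, in the Scrapy 1.7 release used here the real default method is called start_requests(), so the sketch borrows that name.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class Request:
    """Toy stand-in for a request: a URL plus the callback for its response."""
    url: str
    callback: Callable


class ToySpider:
    """Simplified model of a spider base class (not Scrapy's real code)."""
    name = ""
    start_urls: List[str] = []

    def start_requests(self) -> Iterator[Request]:
        # Default behaviour: one request per start URL, with parse() as callback.
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        raise NotImplementedError


class QuotesSpider(ToySpider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/tag/humor/"]

    def parse(self, response):
        return []


# The "engine" would pull these initial requests and hand them to the scheduler.
initial_requests = list(QuotesSpider().start_requests())
```

The point of the model is only that the spider itself never calls parse; it merely attaches parse to each request, and something else invokes it once the response arrives.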

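Execution then proceeds through the steps shown in the architecture diagram above.

When the ENGINE finds that the SCHEDULER queue holds no more pending REQUESTs and that every RESPONSE has been processed by the SPIDER's parse method, so that no new REQUEST can enter the queue, it notifies the SPIDER that the job is done and the whole run ends.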
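The termination condition can be illustrated with a toy, synchronous engine loop. This is only a sketch under strong assumptions: FAKE_PAGES is a hypothetical in-memory "website", requests are plain URL strings, and everything runs sequentially. The real engine is asynchronous (it runs on Twisted), so it also has to account for requests that are still in flight, which this model does not need to do.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (quotes on the page, next page URL or None).
FAKE_PAGES = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], None),
}


def parse(url, page):
    """Toy spider callback: yields items (dicts) and follow-up requests (str)."""
    quotes, next_page = page
    for text in quotes:
        yield {"text": text, "source": url}
    if next_page is not None:
        yield next_page


def run_engine(start_urls):
    """Toy engine loop: runs until the scheduler queue is drained."""
    scheduler = deque(start_urls)      # pending requests
    items = []                         # what an item pipeline would receive
    while scheduler:                   # stop once no request is left
        url = scheduler.popleft()      # engine asks the scheduler for the next request
        page = FAKE_PAGES[url]         # the "downloader" produces a response
        for result in parse(url, page):    # the spider processes the response
            if isinstance(result, dict):
                items.append(result)       # items go towards the pipelines
            else:
                scheduler.append(result)   # new requests go back to the scheduler
    return items


items = run_engine(["/page/1/"])
```

Because a callback can only add requests while it is running, an empty queue after the last callback returns means nothing new can ever arrive, which is exactly the condition described above.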

Testing the code above

Start the Scrapy Docker image built earlier with the following command:

docker run -it --name scrapy-test scrapy-clear /bin/sh

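Here scrapy-clear is the name of the Scrapy image I built.

Once the container is running, create a temporary test directory such as scrapy-test, change into it, create a new file named quotes_spider.py, and copy the code above into that file.

Finally, run the following command: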

scrapy runspider quotes_spider.py -o quotes.json

It prints the following log:

/scrapy-test # scrapy runspider quotes_spider.py -o quotes.json
2019-07-24 07:21:40 [scrapy.utils.log] INFO: Scrapy 1.7.1 started (bot: scrapybot)
2019-07-24 07:21:40 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (default, May  3 2019, 11:24:39) - [GCC 8.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Linux-4.4.0-116-generic-x86_64-with
2019-07-24 07:21:40 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'quotes.json', 'SPIDER_LOADER_WARN_ONLY': True}
2019-07-24 07:21:40 [scrapy.extensions.telnet] INFO: Telnet Password: ee6cc9f3a24c449d
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-24 07:21:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-07-24 07:21:40 [scrapy.core.engine] INFO: Spider opened
2019-07-24 07:21:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-24 07:21:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-24 07:21:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/> (referer: None)
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”', 'author': 'Garrison Keillor'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”', 'author': 'Jim Henson'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': "“All you need is love. But a little chocolate now and then doesn't hurt.”", 'author': 'Charles M. Schulz'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': "“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”", 'author': 'Suzanne Collins'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Some people never go crazy. What truly horrible lives they must lead.”', 'author': 'Charles Bukowski'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”', 'author': 'Terry Pratchett'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”', 'author': 'Dr. Seuss'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '“The reason I talk to myself is because I’m the only one whose answers I accept.”', 'author': 'George Carlin'}
2019-07-24 07:21:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/page/2/> (referer: http://quotes.toscrape.com/tag/humor/)
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/page/2/>
{'text': '“I am free of all prejudice. I hate everyone equally. ”', 'author': 'W.C. Fields'}
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/page/2/>
{'text': "“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”", 'author': 'Jane Austen'}
2019-07-24 07:21:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-24 07:21:42 [scrapy.extensions.feedexport] INFO: Stored json feed (12 items) in: quotes.json
2019-07-24 07:21:42 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 511,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 3725,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.322739,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 24, 7, 21, 42, 3560),
 'item_scraped_count': 12,
 'log_count/DEBUG': 14,
 'log_count/INFO': 11,
 'memusage/max': 46931968,
 'memusage/startup': 46931968,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 7, 24, 7, 21, 40, 680821)}
2019-07-24 07:21:42 [scrapy.core.engine] INFO: Spider closed (finished)
/scrapy-test #

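This log should be very useful when learning Scrapy. Every record carries a tag such as [scrapy.core.engine], which appears to correspond to a component in the architecture diagram above. Reading the log gives a rough picture of how the components interact, how data flows, and what each component does.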
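One way to read the log by component is to group the records by the bracketed tag, which is the Python logger name and follows the module path. The sketch below uses three records copied from the log above; the regular expression is my own guess at the record layout (timestamp, tag, level, message), not an official format definition.

```python
import re
from collections import Counter

# Three records copied from the log above.
SAMPLE_LOG = """\
2019-07-24 07:21:40 [scrapy.core.engine] INFO: Spider opened
2019-07-24 07:21:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/> (referer: None)
2019-07-24 07:21:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
"""

# Assumed layout: "<date> <time> [<logger name>] <LEVEL>: <message>"
RECORD = re.compile(r"^(\S+ \S+) \[([\w.]+)\] (\w+): (.*)$")


def count_by_logger(log_text):
    """Count log records per logger name, skipping continuation lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RECORD.match(line)
        if match:  # item dicts and list continuations do not match
            counts[match.group(2)] += 1
    return counts


counts = count_by_logger(SAMPLE_LOG)
```

Run over the full log, a count like this would show at a glance which parts of the system were active during the crawl.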

The contents of quotes.json are as follows:

/scrapy-test # cat quotes.json
[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]/scrapy-test #
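The \u201c and \u201d sequences in the file are standard JSON escapes for the curly quotation marks that appear in the log, so no data is lost. A short check with Python's json module, using one record copied from the output above:

```python
import json

# One record copied from quotes.json; the raw string keeps the \u escapes literal.
raw = r'[{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"}]'

quotes = json.loads(raw)          # the JSON parser decodes the escapes
first = quotes[0]
opening, closing = first["text"][0], first["text"][-1]
```

After loading, opening and closing are the left and right double quotation marks (U+201C and U+201D), the same characters shown in the scraped items in the log.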

The scrapy runspider command simply looks for a subclass of scrapy.Spider in the specified file and, once it finds one, runs it.
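The idea of "find the Spider subclass in a file" can be sketched with the standard library alone. This is not how Scrapy itself is implemented; Spider here is a toy base class, SOURCE is a hypothetical file body, and find_spider_classes is a helper invented for this illustration.

```python
import inspect
import types


class Spider:
    """Toy base class standing in for scrapy.Spider."""
    name = None


# Hypothetical file contents; in practice this would be read from disk.
SOURCE = '''
class QuotesSpider(Spider):
    name = "quotes"

class Helper:
    pass
'''


def find_spider_classes(source, base):
    """Execute the source in a fresh module and return the subclasses of base."""
    module = types.ModuleType("spider_file")
    module.__dict__[base.__name__] = base   # make the base class visible by name
    exec(source, module.__dict__)
    return [
        obj for obj in vars(module).values()
        if inspect.isclass(obj) and issubclass(obj, base) and obj is not base
    ]


found = find_spider_classes(SOURCE, Spider)
```

Only QuotesSpider is returned: Helper is a class but does not inherit from the base, and the base class itself is excluded.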

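The architecture diagram above contains many components; it is a complex system. This example never touches on how those components are configured, so presumably everything uses default settings, and all components run on a single host.

A complex project will inevitably involve a good deal of configuration to define how all of these components work.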
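As a small preview of what such configuration looks like, the "Overridden settings" line in the log above already shows that the -o option was turned into the FEED_FORMAT and FEED_URI settings. In a full project, settings usually live in a settings.py module as plain module-level names. The fragment below is only a sketch: the setting names are real Scrapy settings as far as I know, but the values and the bot name are made up for illustration.

```python
# settings.py (sketch): module-level names that Scrapy reads as settings.
BOT_NAME = "quotes_bot"            # hypothetical project name
ROBOTSTXT_OBEY = True              # respect robots.txt rules
CONCURRENT_REQUESTS = 8            # cap on requests handled in parallel
DOWNLOAD_DELAY = 0.5               # seconds to wait between requests to the same site
FEED_EXPORT_ENCODING = "utf-8"     # write real characters instead of \uXXXX escapes
```

With a setting like FEED_EXPORT_ENCODING in place, the exported JSON should contain the curly quotes directly rather than their escaped form.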
