A First Look at the Scrapy Architecture

Introduction

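The Python real-time web crawler project was started with one goal: to work together to turn the Internet into one big database. Releasing source code is not all there is to open source. Its core is "open thinking": bringing together the best ideas, technology, and people. The project will therefore take many leading products as references, for example Scrapy, ScrapingHub, and import.io.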

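This article gives a brief walkthrough of the Scrapy architecture. And yes, the general-purpose extractor gsExtractor is meant to be integrated into the Scrapy architecture.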

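Please note that this article does not try to restate the original documentation. It looks for reference points for the direction of an open-source Python crawler, benchmarked against nine years of experience developing web crawlers, so it contains a good deal of the author's subjective commentary. If you want to read the official Scrapy text, see the Architecture page on the Scrapy website.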

Scrapy data flow

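In Scrapy the data flow is controlled by the execution engine. The English text below is quoted from the Scrapy website. I have added commentary based on my own guesses, to point the way for further development of the GooSeeker open-source crawler: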

The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.

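Who prepares the URLs? It looks as though the Spider prepares them itself, so we can guess that the Scrapy framework proper (excluding the Spider) mainly does event scheduling and is not concerned with storing URLs. This resembles the crawler compass in the GooSeeker member center, which prepares a batch of URLs for a target site and holds them in the compass, ready for crawl scheduling. So the next goal of this open-source project is to manage URLs in a centralized scheduling store.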
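As a rough illustration of step 1, the sketch below models a spider that supplies its own starting URLs and an engine-side helper that wraps them into requests and queues them. It uses only the Python standard library. `Request`, `ToySpider`, and `seed_scheduler` are hypothetical names made up for this illustration, not Scrapy's classes, and the example.com URLs are placeholders.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Toy stand-in for a crawl request (not scrapy.Request)."""
    url: str


class ToySpider:
    """Hypothetical spider: it owns the seed URLs, the framework does not."""
    start_urls = ["https://example.com/page/1", "https://example.com/page/2"]

    def start_requests(self):
        # Wrap each seed URL in a Request, one at a time.
        for url in self.start_urls:
            yield Request(url)


def seed_scheduler(spider, queue):
    """Engine-side step: pull the first requests from the spider and queue them."""
    for request in spider.start_requests():
        queue.append(request)
    return queue


queue = seed_scheduler(ToySpider(), deque())
print([r.url for r in queue])
```

The point of the split is that the framework side only sees `Request` objects. Where the URLs come from (a class attribute here, a central URL store in the project's plans) stays the spider's business.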

The Engine asks the Scheduler for the next URLs to crawl.

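This step is actually rather hard to follow at first, and you need to read some other documentation to grasp it fully. Continuing from step 1: once the engine has taken the URLs from the Spider, it wraps each one in a Request and hands it to the event loop, where the Scheduler collects it for scheduling. For now, think of this as queuing the Requests. The engine then asks the Scheduler for the address of the next page to download.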
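The "queue of Requests" reading can be sketched in a few lines. The method names below are borrowed from Scrapy's scheduler interface (`enqueue_request`, `next_request`, `has_pending_requests`), but the class itself is a hypothetical toy: the FIFO ordering and the set-based duplicate check are simplifying assumptions, and plain URL strings stand in for Request objects.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical scheduler: a FIFO queue that skips URLs it has already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue_request(self, url):
        # Return False when the URL is a duplicate and was not queued.
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_request(self):
        # Return the next URL to crawl, or None when nothing is pending.
        return self._queue.popleft() if self._queue else None

    def has_pending_requests(self):
        return bool(self._queue)


scheduler = ToyScheduler()
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
    scheduler.enqueue_request(url)

order = []
while scheduler.has_pending_requests():
    order.append(scheduler.next_request())
print(order)
```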

The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).

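The engine requests a task from the scheduler and hands it to the downloader. Between the downloader and the engine sits the downloader middleware. For a development framework this is an essential highlight: developers can plug their own customizations in here.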
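To make the idea of a request-direction middleware chain concrete, here is a minimal sketch under stated assumptions. In Scrapy the real hooks are called `process_request` and `process_response` and have their own return conventions. The hook name `before_download`, the two middleware classes, and the "return None to drop" rule below are all invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Toy request object (not scrapy.Request)."""
    url: str
    headers: dict = field(default_factory=dict)


class UserAgentMiddleware:
    """Hypothetical middleware: sets a default User-Agent header."""

    def before_download(self, request):
        request.headers.setdefault("User-Agent", "toy-crawler/0.1")
        return request


class BlocklistMiddleware:
    """Hypothetical middleware: drops requests to blocked hosts by returning None."""

    def __init__(self, blocked):
        self.blocked = blocked

    def before_download(self, request):
        if any(host in request.url for host in self.blocked):
            return None
        return request


def run_request_chain(request, middlewares):
    """Pass a request through each middleware in order; stop if one drops it."""
    for mw in middlewares:
        request = mw.before_download(request)
        if request is None:
            return None
    return request


chain = [UserAgentMiddleware(), BlocklistMiddleware(blocked=["ads.example.com"])]
kept = run_request_chain(Request("https://example.com/news"), chain)
dropped = run_request_chain(Request("https://ads.example.com/banner"), chain)
print(kept, dropped)
```

Each middleware sees the request only after the previous one has finished with it, which is what lets small, independent customizations be stacked without touching the engine or the downloader.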

Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).

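The download is complete: a Response is produced and passed through the downloader middleware to the engine. Note that Response, like Request earlier, is capitalized. I have not yet read the other Scrapy documents, but my guess is that these are event objects inside the Scrapy framework. From this one can also infer an asynchronous, event-driven engine, much like the three-level event loop in DS打數機. For a high-performance, low-overhead engine, this is a must.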
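The event-driven guess can be illustrated with a small sketch. Scrapy itself is built on Twisted. The code below swaps in the standard library's `asyncio` purely to show the idea, and `Response`, `fake_download`, and `crawl` are hypothetical names. No network access happens: `asyncio.sleep` stands in for download time.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Response:
    """Toy response object (not scrapy.http.Response)."""
    url: str
    body: str


async def fake_download(url, delay):
    # Stand-in for network I/O: sleeping yields control to the event loop.
    await asyncio.sleep(delay)
    return Response(url=url, body=f"<html>{url}</html>")


async def crawl(jobs):
    # Start all downloads at once and collect responses as they finish.
    tasks = [asyncio.create_task(fake_download(url, delay)) for url, delay in jobs]
    finished = []
    for done in asyncio.as_completed(tasks):
        response = await done
        finished.append(response.url)
    return finished


jobs = [("https://example.com/slow", 0.2), ("https://example.com/fast", 0.01)]
finished = asyncio.run(crawl(jobs))
print(finished)
```

Although the slow page is requested first, the fast page is reported first, because a single thread switches between downloads whenever one of them is waiting. That is the low-overhead property the commentary above has in mind.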

The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).

Another middleware appears here, again giving developers plenty of room to customize.

The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine.

Each Spider crawls pages one after another: when it finishes one page, it constructs another Request event and starts crawling the next page.
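A spider callback that returns both scraped items and follow-up requests can be sketched as a generator. This is a toy under stated assumptions: `Request`, `Response`, and `parse` are hypothetical stand-ins defined here, a plain `dict` plays the role of an item, and the regular expressions are only good enough for the hand-written page in the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Request:
    """Toy request object (not scrapy.Request)."""
    url: str


@dataclass
class Response:
    """Toy response object (not scrapy.http.Response)."""
    url: str
    body: str


def parse(response):
    """Hypothetical callback: yield one item for the page, then follow its links."""
    title = re.search(r"<title>(.*?)</title>", response.body)
    yield {"url": response.url, "title": title.group(1) if title else None}
    for href in re.findall(r'href="([^"]+)"', response.body):
        yield Request(href)


page = Response(
    url="https://example.com/",
    body='<title>Home</title><a href="https://example.com/about">About</a>',
)
results = list(parse(page))
print(results)
```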

The Engine passes scraped items and new Requests returned by a spider through Spider Middleware (output direction), and then sends processed items to Item Pipelines and processed Requests to the Scheduler.

The engine does the event dispatching.
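The dispatching step amounts to routing by type: items go one way, new requests go another. The following sketch assumes items are plain dicts and uses a hypothetical `Request` class and `dispatch` function. Two lists stand in for the item pipeline and the scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Toy request object (not scrapy.Request)."""
    url: str


def dispatch(outputs, item_pipeline, scheduler_queue):
    """Route each spider output: items to the pipeline, requests to the queue."""
    for obj in outputs:
        if isinstance(obj, Request):
            scheduler_queue.append(obj)
        elif isinstance(obj, dict):
            item_pipeline.append(obj)
        else:
            raise TypeError(f"unexpected spider output: {obj!r}")


item_pipeline, scheduler_queue = [], []
spider_outputs = [
    {"title": "Home"},
    Request("https://example.com/about"),
    {"title": "News"},
]
dispatch(spider_outputs, item_pipeline, scheduler_queue)
print(len(item_pipeline), len(scheduler_queue))
```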

The process repeats (from step 1) until there are no more requests from the Scheduler.

The loop keeps running continuously.
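Putting the steps together, the whole cycle and its stopping condition fit in one short loop. This is a single-threaded toy, not Scrapy's engine: `SITE` is a hypothetical in-memory website that replaces the network, and `download` and `run_engine` are names invented for the sketch.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (title, outgoing links).
SITE = {
    "https://example.com/": ("Home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("A", ["https://example.com/b"]),
    "https://example.com/b": ("B", []),
}


def download(url):
    """Stand-in downloader: look the page up instead of using the network."""
    return SITE[url]


def run_engine(start_urls):
    """Crawl until the queue is empty, collecting one item per page."""
    queue, seen, items = deque(start_urls), set(start_urls), []
    while queue:  # stop when the scheduler has nothing left
        url = queue.popleft()
        title, links = download(url)
        items.append({"url": url, "title": title})
        for link in links:
            if link not in seen:  # never schedule the same URL twice
                seen.add(link)
                queue.append(link)
    return items


items = run_engine(["https://example.com/"])
print([item["title"] for item in items])
```

The crawl ends on its own once every reachable page has been visited, because the queue only grows when a page yields a link that has not been seen before.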

Copyright owner: chenjiabing
If you repost this article, please credit the source: chenjiabing666.github.io6
