How Are Large-Scale Crawlers Built to Cover Thousands of Sites and Over 100 Million Records a Day?

> Overview: distributed crawling, intelligent parsing, message queues, deduplication, scheduling, and other key techniques

The biggest crawlers we encounter most often are the major search engines. But the way search engines crawl differs greatly from what working crawler engineers deal with, so they offer little reference value. Today we will look at crawlers for public-opinion (media monitoring) work — their architecture and the key technical principles — covering:

1. Intelligent web text extraction;
2. Distributed crawling;
3. Crawler DATA/URL deduplication;
4. Crawler deployment;
5. Distributed crawler scheduling;
6. Automated rendering;
7. Message queues in the crawling domain;
8. Anti-crawler measures in all their forms.

Grab your melon seeds, pull up a stool, settle in to learn, and get ready to compete for the prizes given away at the end of the article!

## 1. Intelligent Web Text Extraction

Public opinion is simply the state of public discourse; to track it, you must gather a large volume of content. Apart from a few large content and social platforms that offer commercial APIs (Weibo, for example), everything else has to be collected by crawlers. A crawler engineer working on media monitoring therefore faces thousands upon thousands of sites, each with different content and structure. The following diagram illustrates the problem they face:
![CwHXGu](https://static001.geekbang.org/infoq/4f/4faa9166238fbe0f7db6ecf8b11258fc.png)

That's right: their collector must adapt to the structure of thousands of different sites and extract the essential parts — title, body text, publish time, and author — from wildly varied HTML.

If it were you, what design would you use to meet this requirement?

I once pondered this problem myself, and I have seen similar questions raised in tech groups, but satisfying answers are hard to come by. Some suggestions:

1. Classification: group similar pages together, then configure one set of extraction rules per group;
2. Regular expressions: extract the content of specified tags;
3. Deep learning: use NLP to identify the semantically meaningful parts and extract them;
4. Computer vision: have a human click once, then classify and extract pages by visual similarity (essentially an automated version of the classification approach);
5. Algorithms: compute the text density, then extract the densest region.

Ideas of every kind keep coming up, but none of them has been reported in actual production use. At present most companies rely on manually configured XPath rules: at crawl time the URL is matched to its extraction rules, which are then applied to scrape multiple sites. This approach works, has been used in industry for a long time, and is fairly stable, but its drawbacks are obvious — it costs time, labor, and money!

Then one day I saw someone in a WeChat tech group (the excellent Python engineer Qingnan) publish a library for automated text extraction: [GeneralNewsExtractor](https://github.com/kingname/GeneralNewsExtractor) (GNE for short). The library is based on the paper "Web Body Extraction Based on Text and Symbol Density" by Hong Honghui, Ding Shitao, Huang Ao, and Guo Zhiyuan of the Wuhan Research Institute of Posts and Telecommunications, implemented concretely in Python as GNE. Its principle: extract the text and the punctuation within it from the page DOM, use the punctuation density of the text as the base signal, and let the algorithm expand from a sentence to a paragraph and then to a whole article.
![xS6M3T](https://static001.geekbang.org/infoq/9b/9be7487679bf66ea5ace5be621d83486.png)

GNE effectively filters out the "noise" around the body — ads, recommendation bars, info boxes — and accurately identifies the article body, with a reported accuracy of 99% (tested on articles from mainstream Chinese portal and media platforms).

> For the details of GNE's algorithm and a walkthrough of its source code, see Chapter 5 of 《Python3 網絡爬蟲寶典》.

With GNE, over 90% of the parsing needs of a media-monitoring crawler are covered; the remaining 10% can be handled with targeted rule adjustments or fully custom development — freeing a whole crowd of XPath engineers.
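GNE itself implements the full algorithm from the paper; purely as a rough illustration of the density idea (this is a toy sketch, not GNE's code, and the sample blocks are invented), one can score candidate text blocks by how much punctuation-rich prose they contain and keep the densest one:

```python
# Toy illustration of the text/punctuation-density idea behind GNE.
# Body text is long prose full of sentence punctuation; navigation
# bars and footers are short label runs with little punctuation.
PUNCT = set(",.!?;:\u3002\uff0c\uff01\uff1f\uff1b\uff1a")

def density_score(text: str) -> float:
    """Score a candidate block: length weighted by punctuation count."""
    if not text:
        return 0.0
    punct = sum(1 for ch in text if ch in PUNCT)
    # Punctuation acts as a proxy for "real sentences" in the block.
    return len(text) * (punct + 1)

def pick_body(blocks: list[str]) -> str:
    """Return the candidate block with the highest density score."""
    return max(blocks, key=density_score)

blocks = [
    "Home | News | Sports | Login",          # navigation bar
    "Copyright 2020. All rights reserved.",  # footer
    ("The article body is usually the longest block of prose, "
     "full of commas, periods, and other punctuation. That is "
     "the signal the density method exploits."),
]
print(pick_body(blocks)[:16])
```

Real extractors refine this with DOM structure, link-text ratios, and variance across sibling nodes, but the core intuition is the same.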
## 2. Crawler DATA/URL Deduplication

A media-monitoring business has to watch sites for newly published content, and the faster the better; given real software and hardware constraints, the usual requirement is to detect new content within 30 or even 15 minutes. The practical way to monitor a target site for changes is polling: keep requesting the page and check whether "new content" has appeared — crawl it if so, skip it otherwise.

So the question becomes: how does the application know which content is "new" and which is "old"?

Break the problem down: "new content" is content that has not been crawled yet. We need something that records whether an article has already been crawled, and we compare against that record every time a candidate article comes up. That is the solution.

But what do we compare on?

An article's URL is almost always stable and unique, so we can use the URL as the basis of the check: store every crawled URL in a list-like container, and for each pending URL test whether it is already in the container. If it is, the article has been crawled before, so discard it and move on to the next URL. The overall logic looks like this:

![ac9WnD](https://static001.geekbang.org/infoq/9e/9ea9ff24c7abccbd92b87685f90faae4.png)

This is "deduplication" in crawler parlance. Roughly, deduplication splits into content (DATA) dedup and link (URL) dedup. What we described is only the media-monitoring flavor; for e-commerce crawling, URLs cannot serve as the basis, because an e-commerce crawler (a price-comparison tool, say) mainly tracks price changes, so the check must be based on the product's key fields (such as price and discount) — that is DATA dedup.
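The flow in the diagram is essentially a set-membership test. A minimal single-process sketch (the URLs are invented; in production the container is usually shared across crawlers, e.g. a Redis set, whose `SADD` command returns 0 for an already-stored member):

```python
class UrlDeduper:
    """Minimal URL dedup container: remembers every URL it has seen."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

deduper = UrlDeduper()
pending = [
    "https://example.com/news/1",
    "https://example.com/news/2",
    "https://example.com/news/1",  # already crawled -> discard
]
to_crawl = [u for u in pending if deduper.is_new(u)]
print(to_crawl)  # the duplicate of /news/1 is dropped
```

For DATA dedup the key is simply different: instead of the URL, the recorded member would be a hash of the item's key fields (price, discount, and so on).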
With the principle clear, what should hold the dedup records? MySQL? Redis? MongoDB? Memory? In practice most engineers choose Redis as the dedup container, but MySQL, MongoDB, and in-memory structures can all serve as well. For why Redis wins out, and where it beats the other stores, see Chapter 3 of 《Python3 網絡爬蟲寶典》.
## 3. Distributed Crawling

Whether for media monitoring or e-commerce, the crawl volume to shoulder is huge — from millions of records per day at the low end to billions at the high end. The single-machine crawlers most people know cannot meet that demand in either performance or resources. If one is not enough, use ten, or a hundred! That is the background against which distributed crawling emerged.

As everyone knows, distributed systems face different problems than single machines: the business goal is the same, but a distributed system must also coordinate many workers, especially around shared and contended resources.

![Btwyu7](https://static001.geekbang.org/infoq/79/79a7bc7418b3c766dd95fb4f2e1bed04.png)

With a single crawler application, it alone reads the pending queue, it alone stores data, and it alone checks URLs for duplicates. With tens or hundreds of crawler applications, you must impose ordering and prevent multiple crawlers from fetching the same URL (which wastes both time and resources). Moreover, one crawler only needs to run on one computer (server) — but with this many crawler applications, how do you deploy them onto different machines? Upload them one by one and start them one by one, by hand?

### The resource problem

Start with sharing and contention: to share the pending and already-crawled URL queues (the containers for URLs mentioned above), the queues must live somewhere that all crawler applications can reach — for example, Redis deployed on a server.

A new issue then appears: as data volume grows, there are more and more URLs to store, and storage costs may keep climbing. Redis keeps data in memory, so more URLs means more RAM — and RAM is among the pricier pieces of hardware — so the problem cannot be ignored.
Fortunately, a man named Bloom invented an algorithm — the Bloom filter — which uses hash mapping to mark whether an object (here, a URL) has been seen, cutting memory use dramatically: for 100 million URLs stored as 32-character MD5 digests, the before/after difference is roughly 30×. For the Bloom filter's theory and a code walkthrough, see Chapter 3 of 《Python3 網絡爬蟲寶典》.
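To make the hash-mapping idea concrete, here is a toy pure-Python Bloom filter (the size and hash count are illustrative, not tuned; production systems use a vetted library or Redis bitmaps):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hashes set k bits per item. Lookups can
    give false positives but never false negatives."""

    def __init__(self, size_bits: int = 1 << 20, k: int = 5):
        self.size = size_bits
        self.k = k
        self.bits = bytearray(size_bits // 8)  # one bit per slot

    def _positions(self, item: str):
        # Derive k independent bit positions from salted MD5 digests.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/news/1")
print("https://example.com/news/1" in bf)  # True
print("https://example.com/news/2" in bf)  # almost certainly False
```

The saving is easy to see: 100 million 32-character MD5 strings are over 3 GB of raw payload, while a filter at around 10 bits per item needs on the order of 125 MB — the same order of magnitude as the 30× figure above.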
### The deployment problem

Uploading files one by one and launching crawlers by hand over and over is exhausting. You can ask your ops colleagues for help, but you can also explore automated deployment yourself to lighten the load. The best-known CI/CD options today are GitLab's GitLab Runner and GitHub Actions, or containerization via K8s. But they only get your code deployed and started; the management features a fleet of crawlers needs are beyond them. So today I will introduce another option: Crawlab.

Crawlab is a distributed crawler management platform built by an engineer at a well-known multinational. It supports not only Python crawlers but is compatible with most programming languages and applications. With Crawlab, we can spread crawler applications across different computers (servers), and from a visual UI we can schedule timed jobs and inspect each crawler's status, environment, and dependencies, as shown below:

![qEQqcK](https://static001.geekbang.org/infoq/ee/ee8bc8772fb6f1010882f7347e16a302.png)

![image-20201129201320525](https://static001.infoq.cn/static/write/img/img-copy-disabled.4f2g7h.png)

Faced with such a practical platform tool, we engineers naturally ask:

1. How does it distribute files to different computers?
2. How do the different computers (nodes) communicate?
3. How does it achieve multi-language compatibility?
4. ……
The parts we care about most: inter-node communication is built on Redis, and file distribution and sync on MongoDB. More details are in Chapter 6 of 《Python3 網絡爬蟲寶典》.

Besides platforms like this, what Python crawler engineers touch most is the Scrapy framework and the libraries derived from it. The Scrapy team officially maintains a library called Scrapyd dedicated to deploying Scrapy-based crawler applications. Deploying a Scrapy application usually takes a single command. Don't you want to know the logic behind it?

1. In what form is the program uploaded to the server?
2. How does the program run on the server?
3. Why can you see each job's start and end time?
4. How is mid-run job cancellation implemented?
5. How is its version control implemented?
6. For Python applications not written with Scrapy, can you get the same monitoring and controls?
In fact, a Scrapy application is packaged into a ".egg" archive and uploaded to the server over HTTP. When the server-side program needs to run it, it first copies the archive into the operating system's temporary directory, imports it into the current Python environment for execution, and deletes the file when the run finishes. The run times and the cancel operation rely on Python's process interfaces; for details, see Chapter 6 of 《Python3 網絡爬蟲寶典》.
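The copy-import-delete cycle described above can be imitated with the standard library alone — a simplified stand-in for Scrapyd's handling of .egg files (which are zip archives); the module name and contents here are invented:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny "egg": a zip archive containing one spider module.
# (Real .egg files produced for Scrapyd are also zip archives.)
tmp = tempfile.mkdtemp()
egg_path = os.path.join(tmp, "project.egg")
with zipfile.ZipFile(egg_path, "w") as zf:
    zf.writestr("spider.py", "def crawl():\n    return 'crawled 20 items'\n")

# Import the module straight out of the archive and run it, the way
# the server imports an uploaded egg into its Python environment.
sys.path.insert(0, egg_path)
spider = importlib.import_module("spider")
print(spider.crawl())  # prints "crawled 20 items"

# When the job finishes, the temporary copy is cleaned up.
sys.path.remove(egg_path)
os.remove(egg_path)
```

Scrapyd layers versioning, scheduling, and HTTP endpoints on top of this basic mechanism.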
## 4. Automated Rendering

To achieve flashy effects, or to save the bandwidth that static resources would consume, many sites use JavaScript to build up their page content. A Python program cannot interpret JavaScript or render HTML, so it cannot obtain the content we "see" in the browser — content that does not actually "exist" in the response. It exists only after the browser renders it: the HTML document still contains only the original text, the JavaScript files still contain only code, and the images, videos, and effects never appear in the source. Everything we see is the browser's doing.

Because Python cannot obtain the browser-rendered result, crawling such pages the usual way yields data that differs from what we see, and the task fails.

![image-20201129203201261](https://static001.geekbang.org/infoq/bd/bd5e225b6fdc8018c96dd3dccf68d1d7.png)

This is where automated rendering comes in. In fact, browsers like Chrome and Firefox expose interfaces that allow other programs to drive them according to a protocol. On that technical foundation, teams built tools such as Selenium and Puppeteer, which let us operate a browser from Python (other languages work too): enter a username and password, click the login button, render text and images, slide a captcha, and so on. The browser renders the content and hands it back to the Python program, breaking the barrier between Python and the browser, so the program gets exactly what we see on the page.

Apps present a similar situation besides browsers. For hands-on practice and case details, see Chapter 2 of 《Python3 網絡爬蟲寶典》.
## 5. Message Queues in the Crawling Domain

So far we have skipped the mechanics of crawling itself. Consider a typical scenario: a crawler first visits a site's article list page, then follows the URLs on it into the detail pages to crawl them. Note that detail pages always outnumber list pages by a factor of N — with 20 items per list page, by a factor of 20.

When there are many sites to crawl, we turn to distributed crawling. But if "distributed" just means running N copies of one crawler program, resources end up unevenly allocated, because in the scenario above every copy has to do all of that work. We can do better and squeeze the most out of our resources. For example, the list-page/detail-page split can be abstracted into the producer-consumer model:

![image-20201129204509533](https://static001.geekbang.org/infoq/9e/9e5237a9c44fe492b5c59cfb28044bad.png)

Crawlers 4 and 5 only extract detail-page URLs from list pages and push them onto a queue; the other crawlers pop detail-page URLs off the queue and crawl them. When the gap between list and detail pages is large, we add crawlers on the right; when it is small, we remove some on the right (or add some on the left, as the situation dictates).

Relative to the queue — this "data collection assembly line" — the crawlers on the left are producers and those on the right are consumers. With this structure we can tune the number of producers or consumers to actual conditions and maximize resource utilization. A further benefit: if producers gather URLs faster than consumers can process them, the URLs simply wait in the queue, and balance is restored once consumption capacity catches up. With such a pipeline we need not fear a sudden flood of URLs or a suddenly drained queue; this peak-shaving ability of queues, so prominent in backend systems, serves crawlers just as well.

For the concrete implementation of hooking crawlers (and distributed crawlers) up to a message queue, see Chapter 4 of 《Python3 網絡爬蟲寶典》.
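The assembly line maps directly onto a queue. A minimal in-process sketch with the standard library (the thread counts and URLs are illustrative; in a real deployment the queue is a networked broker such as Redis or RabbitMQ shared by separate machines):

```python
import queue
import threading

url_queue: "queue.Queue[str]" = queue.Queue()
results = []

def producer(list_pages: int) -> None:
    """List-page crawlers (left side): push detail URLs onto the queue."""
    for page in range(list_pages):
        for item in range(20):  # each list page links 20 detail pages
            url_queue.put(f"https://example.com/p{page}/detail{item}")

def consumer() -> None:
    """Detail-page crawlers (right side): pop URLs and 'crawl' them."""
    while True:
        try:
            url = url_queue.get(timeout=0.2)
        except queue.Empty:
            return  # queue drained -> this consumer shuts down
        results.append(url)  # stand-in for the real fetch + parse
        url_queue.task_done()

threads = [threading.Thread(target=producer, args=(3,))]
threads += [threading.Thread(target=consumer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 3 list pages x 20 items = 60 detail URLs
```

Scaling the right side is just a matter of changing the consumer count; the queue absorbs any imbalance in between.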
## 6. Anti-Crawler Measures in All Their Forms

You want it? Then I won't give it!

Websites do not let you crawl their content easily. They set up obstacles for crawler engineers at the levels of network protocol, browser fingerprint, programming-language differences, and human-versus-machine differences. Common measures include slider captchas, jigsaw captchas, IP bans, cookie checks, forced login, elaborate encryption logic, and obfuscated frontend code.
Block and parry! The back-and-forth between crawler engineers and the target site's engineers is as thrilling as any contest of stratagems. 《Python3 反爬蟲原理與繞過實戰》 covers more than 80% of the anti-crawler techniques and crawler tricks on the market, dissecting the moves used by both sides; readers will pick up plenty of usable technique. See the book for details and a taste of this corner of the technical world!

## Summary

Today we walked through the key technical points on the road to a large-scale crawler handling over 100 million records a day: intelligent text extraction, distributed crawling, deployment and scheduling, deduplication, and automated rendering. Master these techniques and weave them together, and a crawler at that scale is no longer a problem.

![double-book](https://static001.geekbang.org/infoq/e6/e605831fe49b3d8886e94f1319a3db8d.png)

This experience comes from front-line crawler engineers, and the techniques and designs have been validated over long production use; they can be applied directly in your work.