Web Scraper: From Beginner to Proficient

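[Abstract] This article introduces Web Scraper, a browser extension for pulling data off web pages. The goal is to learn something new and to improve together with readers. Thank you for reading. To keep things clear, the article relies on both text and screenshots. (If anything here infringes on your rights, please contact me promptly.) This article was written by x-team member 清泓. Last updated February 23, 2020 (updated continuously).

(Formal statement: the copyright of all scraped material belongs to the company or group that owns the scraped site. The scraped data is for study only. I strongly condemn commercializing it. Do not test the law.)

Main reference for this article: [1]

I. Installation

I downloaded the extension from the site in [2], an extension library; in my tests the download worked.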

 

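The download is a .crx file. Next, open Chrome. Key point: only the Chrome browser is supported.

1. Open Chrome's settings and find Extensions.

2. Turn on developer mode.

3. Change the .crx file's extension to .zip and unzip it.

4. On the Extensions page, click the "Load unpacked" button.

5. Web Scraper is now installed.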
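Step 3 works because a .crx file is a ZIP archive with a short header in front of it. As a rough sketch of doing the same thing in code, assuming the CRX2 and CRX3 header layouts and using made-up helper names (`crx_zip_offset`, `unpack_crx`), the header can be skipped and the rest unpacked with Python's standard `zipfile` module:

```python
import io
import struct
import zipfile


def crx_zip_offset(data: bytes) -> int:
    """Return the byte offset at which the ZIP payload of a CRX file starts."""
    if data[:4] != b"Cr24":
        # No CRX magic: assume the file is already a plain ZIP archive.
        return 0
    (version,) = struct.unpack_from("<I", data, 4)
    if version == 2:
        # CRX2: magic, version, public key length, signature length.
        key_len, sig_len = struct.unpack_from("<II", data, 8)
        return 16 + key_len + sig_len
    if version == 3:
        # CRX3: magic, version, header length.
        (header_len,) = struct.unpack_from("<I", data, 8)
        return 12 + header_len
    raise ValueError(f"unsupported CRX version: {version}")


def unpack_crx(crx_path: str, out_dir: str) -> list[str]:
    """Extract the extension files into out_dir and return their names."""
    with open(crx_path, "rb") as fh:
        data = fh.read()
    payload = data[crx_zip_offset(data):]
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

The extracted folder is what "Load unpacked" expects in step 4. Renaming to .zip and unzipping by hand, as described above, remains the simpler route.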

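That covers the basic installation steps. Now let's try a small first exercise.

II. First use: scraping every entry of the official CSDN blog

1. Scrape all titles on the first page of the blog.

(1) Open the web page, open the developer tools panel, find the Web Scraper tab, and click into it.

Note that the developer tools panel must be docked at the bottom of the browser window, as in the layout below.

(2) Add the start URL, which is simply the address of our page: https://blog.csdn.net/blogdevteam/

(3) Understand what the tool's options mean.

When creating a selector, use the Element preview and Data preview features to confirm that you have selected the right page elements and data.

1) selector - the CSS selector that picks the desired elements; [3]

2) multiple - check this when you want to select multiple records. Data extracted by two or more selectors that each have multiple checked is not merged into a single record;

3) delay - the delay before the selector takes effect;

4) parent selectors - choose parent selectors for this selector to build the selector tree;

5) Text selector;

6) Link selector;

7) Element selector.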

 

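(4) Data extraction selectors.

A data extraction selector simply returns data from the selected element. For example, the Text selector extracts text from the selected element. The following selectors can be used as data extraction selectors:

1) Text selector;

2) Link selector;

3) Link popup selector;

4) Image selector;

5) Table selector;

6) Element attribute selector;

7) HTML selector;

8) Grouped selector.

(5) Set up the rules.

(6) Run the scrape and view the results.

(7) The result: this is the data from the single-page title scrape we configured.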

 

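III. Scraping the title, description, date, view count, and comment count for the whole blog

(1) About multi-page scraping.

Multi-page scraping comes in many variants, so look at the site's rules first. The CSDN blog paginates as follows:

When you click through to page 2 of the blog, the URL becomes https://blog.csdn.net/blogdevteam//article/list/2?

Next, check how many pages the blog has.

As you can see, there are 37 pages in total.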
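The pagination rule above can be written down as a URL range. The sitemaps later in this article use start URLs such as `artlist_[1-5].html`, and the sketch below expands that bracket notation into individual page URLs. It is only an illustration of the pattern: `expand_range_url` is a made-up helper, not part of Web Scraper, and the CSDN template assumes that pages 1 to 37 all follow the `/article/list/N` form seen for page 2 (written here with a single slash).

```python
import re


def expand_range_url(template: str) -> list[str]:
    """Expand one "[start-end]" range in a URL into a list of concrete URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", template)
    if match is None:
        return [template]
    start, end = int(match.group(1)), int(match.group(2))
    head, tail = template[:match.start()], template[match.end():]
    return [f"{head}{page}{tail}" for page in range(start, end + 1)]


urls = expand_range_url("https://blog.csdn.net/blogdevteam/article/list/[1-37]")
print(len(urls))   # 37
print(urls[1])     # https://blog.csdn.net/blogdevteam/article/list/2
```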

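After configuring this, save the settings, run the scrape, and check whether the results are correct.

You can see that the smallest page number is 1 and the largest is 37, so the scrape succeeded. Now let's create several sibling data fields. The idea is the same as above, just with more content types. The current structure is as follows:

Next, create a few more sibling fields.

Let's try it out.

The result is incomplete: each row holds at most one value and the rest are gone, apparently lost at random. Why does this happen? Let's check:

1. The structure is fine.

2. A single record scrapes fine.

3. Checking the rules one by one reveals no problem.

The cause turned out to be multiple!

Only one selector can use it as the starting point. It feels a bit like a table having only one primary key.

With the suspected cause removed as shown below, the data loaded successfully and could be exported as an Excel document.

The exported Excel document looks like this: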

 

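Note: a common mistake is to create two selectors that both have multiple checked and expect their results to merge naturally. For example, if you select pagination links and navigation links at the same time, these links cannot merge naturally. The correct approach is to select the wrapping element with an Element selector and add the data selectors as children of that Element selector, rather than checking multiple on them.

Pay special attention to this. When I first scraped the site I was using multiple as if it were a selector type. The correct usage is to create an Element selector under the default _root and let it do the merging. It is like gathering a tuft of hair with a rubber band: the Element is the rubber band.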
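To see why independent multiple selectors lose the link between fields, here is a small Python sketch. It does not use Web Scraper; it only imitates the two strategies on a made-up, well-formed HTML fragment with the standard `xml.etree.ElementTree` module. The markup, class names, and function names are all invented for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """
<div>
  <div class="item"><span class="title">Post A</span><span class="views">10</span></div>
  <div class="item"><span class="title">Post B</span></div>
  <div class="item"><span class="title">Post C</span><span class="views">30</span></div>
</div>
"""

root = ET.fromstring(PAGE)


def flat_lists(root):
    """Two page-wide 'multiple' selectors: each returns its own list."""
    titles = [e.text for e in root.findall(".//span[@class='title']")]
    views = [e.text for e in root.findall(".//span[@class='views']")]
    return titles, views


def per_element(root):
    """One wrapper per record; each field is looked up inside its wrapper."""
    rows = []
    for item in root.findall(".//div[@class='item']"):
        title = item.find("span[@class='title']")
        views = item.find("span[@class='views']")
        rows.append({
            "title": title.text if title is not None else None,
            "views": views.text if views is not None else None,
        })
    return rows


print(flat_lists(root))   # (['Post A', 'Post B', 'Post C'], ['10', '30'])
print(per_element(root))
```

The flat lists have different lengths, so there is no way to tell which view count belongs to which title. The per-element rows keep each record together and simply leave a missing field empty.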

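Supplement, February 23, 2020: (thanks to the helpful Zhihu reader who raised this question; the images below were also provided by that reader)

As shown below, image multiple1 scrapes the selected data using multiple alone. The problem is that when several fields are scraped in bulk, you easily end up with the situation in image multiple2: the fields of a single record are not all captured, so each record is missing fields. For example, suppose I scrape movies with three fields (title, synopsis, and rating); the result contains only the synopsis, or only the title.

multiple1 (only multiple is set)

multiple2

The solution is to add an Element selector under _root to bundle these fields together, as shown in image element3 below:

element3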
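In sitemap JSON terms, this fix amounts to inserting one `SelectorElement` under `_root` and re-parenting the field selectors beneath it. The sketch below does that on a Python dict. The field names (`id`, `type`, `parentSelectors`, `selector`, `multiple`, `delay`) are the ones that appear in the exported sitemaps in this article; `wrap_in_element`, the example ids, the URL, and the CSS selectors are hypothetical.

```python
import copy


def wrap_in_element(sitemap: dict, element_id: str, css: str) -> dict:
    """Return a copy of sitemap whose root-level selectors sit under one Element."""
    result = copy.deepcopy(sitemap)
    for sel in result["selectors"]:
        if sel["parentSelectors"] == ["_root"]:
            sel["parentSelectors"] = [element_id]
            # Only the wrapper matches many nodes; each field is one per record.
            sel["multiple"] = False
    wrapper = {
        "id": element_id,
        "type": "SelectorElement",
        "parentSelectors": ["_root"],
        "selector": css,
        "multiple": True,
        "delay": 0,
    }
    result["selectors"].insert(0, wrapper)
    return result


flat = {
    "_id": "movies",
    "startUrl": ["https://example.com/movies"],
    "selectors": [
        {"id": "title", "type": "SelectorText", "parentSelectors": ["_root"],
         "selector": ".title", "multiple": True, "regex": "", "delay": 0},
        {"id": "rating", "type": "SelectorText", "parentSelectors": ["_root"],
         "selector": ".rating", "multiple": True, "regex": "", "delay": 0},
    ],
}
wrapped = wrap_in_element(flat, "movie", "div.movie")
print([(s["id"], s["parentSelectors"], s["multiple"]) for s in wrapped["selectors"]])
```

After wrapping, the child CSS selectors are evaluated inside each wrapper, so in practice they may need to be re-picked relative to it.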

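Supplement, March 21, 2020:

How to add an Element selector:

Choose a selector of type Element

The Element's content area is really just a parent container

Element created successfully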

 

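IV. Scraping multi-level data

To scrape second-level pages, set up a selector for the child data source.

Now let's scrape the blog of 一個處女座程序猿. We'll keep it simple: one field per listing page, plus the full body text of each article on the second-level pages. There are two focuses here: scraping data across multi-level pages, and setting up the bridge to the child data source.

1. First create a new start URL. We won't scrape too much for now, only pages 1-5 of the blog. The start URL is shown below; click save.

2. Create the parent selector.

With the parent selector created, we can create child selectors inside it.

Click into the parent selector to create a child selector. To keep the flow simple, each level carries only one field. This selector is itself the bridge: its type is Link, meaning the title is used as the link source (just as clicking a headline on Zhihu takes you to the full article). It of course sits under the root parent selector we just created.

3. Start scraping the second-level page content.

After opening a child page, simply add a Text selector inside the child selector we just created, and select the whole article body as its content.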
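The same two-level structure can also be written as sitemap JSON and imported. Below is a minimal sketch, assuming the field names seen in the exported sitemaps in this article. The start URL, ids, and CSS selectors are placeholders, not the ones used for the blog above, and `selector` is just a local helper.

```python
import json


def selector(sel_id, sel_type, parent, css, multiple):
    """Build one selector entry using the fields seen in exported sitemaps."""
    entry = {"id": sel_id, "type": sel_type, "parentSelectors": [parent],
             "selector": css, "multiple": multiple, "delay": 0}
    if sel_type == "SelectorText":
        entry["regex"] = ""
    return entry


sitemap = {
    "_id": "blog_two_level",
    "startUrl": ["https://example.com/blog/list/[1-5]"],
    "selectors": [
        # Parent selector: one block per article on the listing page.
        selector("root", "SelectorElement", "_root", "div.article-item", True),
        # Bridge: the title link that leads to the second-level page.
        selector("titlelink", "SelectorLink", "root", "a", False),
        # Text selector evaluated on the page that the link opens.
        selector("content", "SelectorText", "titlelink", "div.article-body", False),
    ],
}

print(json.dumps(sitemap, ensure_ascii=False))
```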

 

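4. Overall structure diagram.

V. That concludes the warm-up on multi-level pages. Next comes scraping multiple fields across multi-level pages. The approach is:

1. Create a common parent selector.

2. Create several branch selectors.

3. Under the branch selectors, create further child branches; children can have children, and grandchildren can have children of their own.

Here is the sitemap with the scraped fields: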

{"_id":"zhihu","startUrl":["https://www.zhihu.com/question/352108632"],
"selectors":[{"id":"anwer","type":"SelectorElementScroll","parentSelectors":
["_root"],"selector":"div.List-item:nth-of-type(1)","multiple":true,"delay":0},
{"id":"name","type":"SelectorText","parentSelectors":["anwer"],
"selector":"#Popover13-toggle a","multiple":false,"regex":"","delay":0},
{"id":"Agree with the number","type":"SelectorText","parentSelectors":["anwer"],
"selector":".Voters button","multiple":false,"regex":"","delay":0},
{"id":"content","type":"SelectorText","parentSelectors":["anwer"],
"selector":"span[itemprop='text']","multiple":false,"regex":"","delay":0},
{"id":"Editing time","type":"SelectorText","parentSelectors":
["anwer"],"selector":"a span","multiple":false,"regex":"","delay":0},
{"id":"comment","type":"SelectorText","parentSelectors":["anwer"],
"selector":"button.ContentItem-action:nth-of-type(1)",
"multiple":false,"regex":"","delay":0}]}
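The parent/child layout is easier to read as a tree than as a single JSON string. As a small sketch (the function name `selector_tree` is made up here), the `parentSelectors` field can be used to print an exported sitemap as an indented outline. The sample below is a shortened copy of the Zhihu sitemap above.

```python
import json


def selector_tree(sitemap: dict) -> list[str]:
    """Return indented lines describing the selector hierarchy of a sitemap."""
    children = {}
    for sel in sitemap["selectors"]:
        for parent in sel["parentSelectors"]:
            children.setdefault(parent, []).append(sel)

    lines = []

    def walk(parent_id, depth, seen):
        for sel in children.get(parent_id, []):
            lines.append("  " * depth + f'{sel["id"]} ({sel["type"]})')
            # Do not descend again into a selector that is its own ancestor.
            if sel["id"] not in seen:
                walk(sel["id"], depth + 1, seen | {sel["id"]})

    walk("_root", 0, frozenset({"_root"}))
    return lines


sitemap = json.loads("""
{"_id":"zhihu","startUrl":["https://www.zhihu.com/question/352108632"],
 "selectors":[
  {"id":"anwer","type":"SelectorElementScroll","parentSelectors":["_root"],
   "selector":"div.List-item:nth-of-type(1)","multiple":true,"delay":0},
  {"id":"name","type":"SelectorText","parentSelectors":["anwer"],
   "selector":"#Popover13-toggle a","multiple":false,"regex":"","delay":0},
  {"id":"content","type":"SelectorText","parentSelectors":["anwer"],
   "selector":"span[itemprop='text']","multiple":false,"regex":"","delay":0}]}
""")
print("\n".join(selector_tree(sitemap)))
```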

 

VI. Scraping multiple fields from scrolling pages

Using Zhihu answers as an example:

 

 

 

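Note: when choosing what to select, especially tag attributes, be sure to pick correctly; a wrong choice makes the scrape fail.

November 11-12, 2019: scraping content behind a fixed click event on second-level pages

While scraping a site recently, I found that some data on second-level pages sits in an expandable section; if it is not clicked open, the collapsed part cannot be scraped.

The practical problem when scraping 海詞 (dict.cn) [4]: the synonym/antonym section is sometimes expanded and sometimes collapsed, and Web Scraper cannot detect this automatically, so all of that data came back empty. The sitemap has to be written to suit the actual behavior of the site.

Reading the official documentation, I found something useful. I looked closely at Element click. I used to think it was only for clicking through paginated data, but it has another use: clicking a button on the current page and then scraping the content that appears once the click event finishes. So I studied it carefully.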

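4.11 Element click selector

The Element click selector works like the Element selector. Its main purpose is also element selection, acting as the parent of child selectors. The only difference is that the Element click selector can interact with the site by clicking buttons to load new elements, for example on pages that use JavaScript and AJAX for navigation or loading.

4.11.1 Configuration options

1) selector - the CSS selector that picks the elements serving as parents of the child selectors.

2) click selector - the CSS selector for the buttons that are clicked to load more elements.

3) click type - tells the selector how to know that no new elements remain and that it should stop clicking.

4) click element uniqueness - how the selector recognizes that a button has already been clicked.

5) multiple - select multiple records (should normally be checked). On child selectors, multiple is usually left unchecked.

6) delay - the interval between the click and the search for elements. This must be specified because data may not load immediately after a button click. Since server responses are not always prompt, set it to 2000 ms or more to avoid losing data.

7) Discard initial elements - the selector will not select elements that already existed before the first button click. This is useful for deduplication.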

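4.11.2 Click type

Key points:

1) Click Once

Click Once clicks each button only once. If new buttons matching the selector appear, they are clicked too. For example, pagination links may show only 1-5 at first, with 6-10 appearing later; the selector will click those (6-10) as well.

2) Click More

Click More keeps clicking the existing buttons until no new elements appear. New elements are identified by having unique text content.

4.11.3 Click element uniqueness

With Click Once, the same button is clicked only once. With Click More, clicking continues until no new elements are produced.

1) Unique Text - buttons with the same text content are treated as the same button

2) Unique HTML+Text - buttons with the same HTML and text content are treated as the same button

3) Unique HTML - buttons with the same HTML are treated as the same button

4) Unique CSS Selector - buttons with the same CSS selector are treated as the same button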
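One way to read the four uniqueness options is as four ways of computing a key for each button: a button whose key has been seen before counts as already clicked. The sketch below models only that idea in plain Python. It is not Web Scraper's implementation; the `Button` class and the `uniqueness_key` and `buttons_to_click` functions are hypothetical. The mode name `uniqueText` is the `clickElementUniquenessType` value found in the exported sitemaps in this article, while the other three names are assumed by analogy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    text: str
    html: str
    css_path: str


def uniqueness_key(button: Button, mode: str):
    """Compute the identity of a button under the chosen uniqueness mode."""
    if mode == "uniqueText":
        return button.text
    if mode == "uniqueHTML":
        return button.html
    if mode == "uniqueHTMLText":
        return (button.html, button.text)
    if mode == "uniqueCSSSelector":
        return button.css_path
    raise ValueError(f"unknown uniqueness mode: {mode}")


def buttons_to_click(buttons, mode, already_clicked):
    """Return the buttons whose key has not been clicked yet, recording them."""
    fresh = []
    for button in buttons:
        key = uniqueness_key(button, mode)
        if key not in already_clicked:
            already_clicked.add(key)
            fresh.append(button)
    return fresh


clicked = set()
page1 = [Button("1", "<a>1</a>", "ul > li:nth-of-type(1) > a"),
         Button("2", "<a>2</a>", "ul > li:nth-of-type(2) > a")]
page2 = page1 + [Button("3", "<a>3</a>", "ul > li:nth-of-type(3) > a")]

print([b.text for b in buttons_to_click(page1, "uniqueText", clicked)])  # ['1', '2']
print([b.text for b in buttons_to_click(page2, "uniqueText", clicked)])  # ['3']
```

In this model the second pass clicks only the newly revealed button, which matches the pagination example given for Click Once.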

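Example:

Specifically:

1. Click type

The click type. Click more means clicking repeatedly; since we want to scrape data in bulk, choose click more here. The other option, click once, clicks a single time.

2. Click element uniqueness

This option controls when Web Scraper stops scraping. For example, Unique Text means scraping stops when the text changes.

We all know a site's data is never endless; it always finishes loading at some point. At that point the "Load more" button text may change to something like "No more", "No more data", or "All loaded". When the text changes, Web Scraper knows there is no more data and stops scraping automatically.

3. Multiple

This is our old friend: whether to select multiple items. We want multiple records here, so check it.

4. Discard initial elements

Whether to discard the initial elements. This is mainly for removing duplicate data on some sites. It is not very important and we don't need it here, so choose Never discard.

5. Delay

The delay. After clicking "Load more", the data needs some time to load, and delay is that waiting time. We generally set it to 2000 or more, since a 2 s delay is a reasonable value; on a poor network, use a larger number.

This is exactly what the 海詞 dictionary case needs.

The problem is solved. Here is what I actually did:

First, this assumes we have our own server.

We build our own H5 page and write the links into it. This takes advantage of Web Scraper's ability to scrape second-level pages: on the first page we enter the links by hand, and then scrape the 海詞 dictionary data from the linked pages. (Formal statement: the copyright of all 海詞 dictionary material belongs to the company that owns it. The scraped data is for study only. I strongly condemn commercializing it. Do not test the law.)

The H5 page source is shown below:

Opened in the browser, the page looks like this:

Here we use Web Scraper's second-level page scraping to scrape 海詞 resources. We can see the following situation:

The page in this image also has click events, and clicking in reveals yet another "view more"...

That confirmed the problem I ran into earlier.

The fix is to modify the sitemap JSON from the first part of the image: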

{"_id":"test_python_bigboom","startUrl":
["http://shupai.downline.cn/local_test_db_009/001_center_data_shupai/000_test_python_webscraper_data_explesion.html"],
"selectors":[{"id":"root","type":"SelectorElement","parentSelectors":["_root"],
"selector":"a","multiple":true,"delay":0},{"id":"titlelink","type":"SelectorLink",
"parentSelectors":["root"],"selector":"_parent_","multiple":true,"delay":0},
{"id":"word_name","type":"SelectorText","parentSelectors":["titlelink"]
,"selector":"h1.keyword","multiple":false,"regex":"","delay":0},
{"id":"haici_n","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"haici_adj","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"haici_pron","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(3)","multiple":false,"regex":"","delay":0},
{"id":"Detailed interpretation","type":"SelectorText","parentSelectors":
["titlelink"],"selector":"div.detail","multiple":false,"regex":"","delay":0},
{"id":"Near antonym","type":"SelectorText","parentSelectors":["titlelink"],
"selector":"div.nfo","multiple":false,"regex":"","delay":0},
{"id":"Proximity word","type":"SelectorElementClick","parentSelectors":
["titlelink"],"selector":".rel h3.cur","multiple":false,"delay":0,
"clickElementSelector":".rel h3.cur","clickType":"clickOnce",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"}]}

Change it to:

{"_id":"test_python_bigboom","startUrl":
["http://shupai.downline.cn/local_test_db_009/001_center_data_shupai/000_test_python_webscraper_data_explesion.html"],
"selectors":[{"id":"root","type":"SelectorElement","parentSelectors":
["_root"],"selector":"a","multiple":true,"delay":0},{"id":"titlelink","type":
"SelectorLink","parentSelectors":["root"],"selector":"_parent_",
"multiple":true,"delay":0},{"id":"word_name","type":"SelectorText",
"parentSelectors":["titlelink"],"selector":"h1.keyword",
"multiple":false,"regex":"","delay":0},
{"id":"haici_n","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"haici_adj","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"haici_pron","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(3)","multiple":false,"regex":"","delay":0},
{"id":"Detailed interpretation","type":"SelectorText","parentSelectors":
["titlelink"],"selector":"div.detail","multiple":false,"regex":"","delay":0},
{"id":"Near antonym","type":"SelectorText","parentSelectors":["titlelink"],
"selector":"div.nfo","multiple":false,"regex":"","delay":0},
{"id":"Proximity word","type":"SelectorElementClick","parentSelectors":
["titlelink"],"selector":"div.nwd","multiple":true,"delay":"2000",
"clickElementSelector":".rel h3.cur","clickType":"clickMore",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"liju","type":"SelectorText","parentSelectors":
["titlelink"],"selector":"div.sort","multiple":false,"regex":"","delay":0},
{"id":"linjinyici","type":"SelectorText","parentSelectors":["Proximity word"],
"selector":"_parent_","multiple":false,"regex":"","delay":0}]}
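With sitemaps this long, the change is hard to spot by eye. A small helper can list what differs between two versions. `diff_selectors` below is a hypothetical utility that compares selectors by `id`. The two abbreviated sitemaps contain only the `Proximity word` selector from the two versions above plus the added `linjinyici` child; the added `liju` selector and the unchanged ones are left out for brevity.

```python
def diff_selectors(old: dict, new: dict) -> list[str]:
    """Describe selectors that were added, removed, or changed between sitemaps."""
    old_by_id = {s["id"]: s for s in old["selectors"]}
    new_by_id = {s["id"]: s for s in new["selectors"]}
    report = []
    for sel_id in sorted(old_by_id.keys() - new_by_id.keys()):
        report.append(f"removed {sel_id}")
    for sel_id in sorted(new_by_id.keys() - old_by_id.keys()):
        report.append(f"added {sel_id}")
    for sel_id in sorted(old_by_id.keys() & new_by_id.keys()):
        before, after = old_by_id[sel_id], new_by_id[sel_id]
        for field in sorted(before.keys() | after.keys()):
            if before.get(field) != after.get(field):
                report.append(
                    f"{sel_id}.{field}: {before.get(field)!r} -> {after.get(field)!r}")
    return report


old = {"selectors": [
    {"id": "Proximity word", "type": "SelectorElementClick",
     "parentSelectors": ["titlelink"], "selector": ".rel h3.cur",
     "multiple": False, "delay": 0, "clickElementSelector": ".rel h3.cur",
     "clickType": "clickOnce"}]}
new = {"selectors": [
    {"id": "Proximity word", "type": "SelectorElementClick",
     "parentSelectors": ["titlelink"], "selector": "div.nwd",
     "multiple": True, "delay": "2000", "clickElementSelector": ".rel h3.cur",
     "clickType": "clickMore"},
    {"id": "linjinyici", "type": "SelectorText",
     "parentSelectors": ["Proximity word"], "selector": "_parent_",
     "multiple": False, "regex": "", "delay": 0}]}

for line in diff_selectors(old, new):
    print(line)
```

On this sample the report shows the four settings that changed on `Proximity word` (selector, multiple, delay, and click type) and the new child selector that actually carries the text into the results.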

 


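The structure diagram follows:

Actual effect: at the end, "Proximity word" is an Element click selector, so it does not appear in the result table;

linjinyici, the Text selector under "Proximity word", is what actually shows up as data in the result table.

Look at the results; previously this could not be scraped.

The above is only the method. Real-world use surfaced some further problems,

so I made a second revision:

The tree diagram is as follows:

Web Scraper's tree diagram area is only this big (dragging sideways or vertically does not enlarge it), so bear with it; if it is hard to read, just import the JSON below.

The JSON is as follows: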

{"_id":"python_haici","startUrl":
["http://shupai.downline.cn/001_center_data_shupai/000_test_python_webscraper_data_explesion.html"],
"selectors":[{"id":"base","type":"SelectorElement","parentSelectors":["_root"],
"selector":"a","multiple":true,"delay":0},{"id":"links","type":"SelectorLink",
"parentSelectors":["base"],"selector":"_parent_","multiple":true,"delay":0},
{"id":"word","type":"SelectorText","parentSelectors":["links"],
"selector":"h1.keyword","multiple":false,"regex":"","delay":0},
{"id":"Basic interpretation","type":"SelectorText","parentSelectors":["links"],
"selector":"div.word","multiple":false,"regex":"","delay":0},
{"id":"type_one","type":"SelectorText","parentSelectors":["links"],
"selector":".detail span:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"Explain one","type":"SelectorText","parentSelectors":["links"],
"selector":".detail ol:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"type_two","type":"SelectorText","parentSelectors":["links"],
"selector":".detail span:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"Explain two","type":"SelectorText","parentSelectors":["links"],
"selector":".detail ol:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"type_three","type":"SelectorText","parentSelectors":["links"],
"selector":".layout span:nth-of-type(3)","multiple":false,"regex":"","delay":0},
{"id":"Explain_three","type":"SelectorText","parentSelectors":
["links"],"selector":".detail ol:nth-of-type(3)","multiple":false,"regex":"",
"delay":0},{"id":"type_four","type":"SelectorText","parentSelectors":["links"],
"selector":"span:nth-of-type(4)","multiple":false,"regex":"","delay":0},
{"id":"Explain four","type":"SelectorText","parentSelectors":
["links"],"selector":"ol:nth-of-type(4)","multiple":false,"regex":"","delay":0},
{"id":"type_five","type":"SelectorText","parentSelectors":["links"],
"selector":"span:nth-of-type(5)","multiple":false,"regex":"","delay":0},
{"id":"Explain_five","type":"SelectorText","parentSelectors":
["links"],"selector":"ol:nth-of-type(5)","multiple":false,"regex":"","delay":0},
{"id":"type_six","type":"SelectorText","parentSelectors":
["links"],"selector":"span:nth-of-type(6)","multiple":false,"regex":"","delay":0},
{"id":"Explain_six","type":"SelectorText","parentSelectors":
["links"],"selector":"ol:nth-of-type(6)","multiple":false,"regex":"","delay":0},
{"id":"English plus English interpretation click",
"type":"SelectorElementClick","parentSelectors":["links"],
"selector":"div.en","multiple":false,"delay":"400",
"clickElementSelector":".def h3.cur","clickType":"clickMore",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"English plus English interpretation++",
"type":"SelectorText","parentSelectors":
["English plus English interpretation click"],"selector":
"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Double interpretation click","type":
"SelectorElementClick","parentSelectors":["links"],"selector":
"div.dual","multiple":false,"delay":"400","clickElementSelector":
".def h3.cur","clickType":"clickMore","discardInitialElements":"do-not-discard",
"clickElementUniquenessType":"uniqueText"},
{"id":"Double interpretation++","type":"SelectorText","parentSelectors":
["Double interpretation click"],"selector":"_parent_","multiple":false,"regex":"",
"delay":0},{"id":"Example","type":"SelectorText","parentSelectors":["links"],
"selector":"div.sort","multiple":false,"regex":"","delay":0},
{"id":"Common sentence pattern click",
"type":"SelectorElementClick","parentSelectors":["links"],"selector":
"div.patt","multiple":false,"delay":"400","clickElementSelector":
".sent h3.cur","clickType":"clickOnce","discardInitialElements":
"do-not-discard","clickElementUniquenessType":"uniqueText"},
{"id":"Common sentence pattern++","type":"SelectorText","parentSelectors":
["Common sentence pattern click"],"selector":
"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Common Phrases click","type":
"SelectorElementClick","parentSelectors":["links"],
"selector":"div.phrase","multiple":false,"delay":"400","clickElementSelector":
".sent h3.cur","clickType":"clickOnce","discardInitialElements":"do-not-discard",
"clickElementUniquenessType":"uniqueText"},
{"id":"Common Phrases++","type":"SelectorText","parentSelectors":
["Common Phrases click"],"selector":"_parent_","multiple":false,"regex":"",
"delay":0},{"id":"Vocabulary matching click","type":
"SelectorElementClick","parentSelectors":["links"],"selector":"div.coll",
"multiple":false,"delay":0,"clickElementSelector":".sent h3.cur","clickType":
"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Vocabulary matching++",
"type":"SelectorText","parentSelectors":["Vocabulary matching click"],
"selector":"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Classic citation click","type":
"SelectorElementClick","parentSelectors":["links"],
"selector":"div.auth","multiple":false,"delay":"400","clickElementSelector":
".sent h3.cur","clickType":"clickOnce","discardInitialElements":"do-not-discard",
"clickElementUniquenessType":"uniqueText"},
{"id":"Classic citation++","type":"SelectorText","parentSelectors":
["Classic citation click"],"selector":"_parent_","multiple":false,"regex":
"","delay":0},{"id":"Word usage","type":"SelectorText","parentSelectors":
["links"],"selector":"div.ess","multiple":false,"regex":"","delay":0},
{"id":"Discrimination of word meaning click","type":
"SelectorElementClick","parentSelectors":
["links"],"selector":"div.discrim","multiple":false,"delay":"400",
"clickElementSelector":".learn h3.cur","clickType":
"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Discrimination of word meaning++",
"type":"SelectorText","parentSelectors":["Discrimination of word meaning click"],
"selector":"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Common mistakes click","type":"SelectorElementClick","parentSelectors":
["links"],"selector":"div.comn","multiple":false,"delay":"400",
"clickElementSelector":".learn h3.cur","clickType":"clickOnce",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Common mistakes++","type":"SelectorText","parentSelectors":
["Common mistakes click"],"selector":
"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Etymological explanation click","type":
"SelectorElementClick","parentSelectors":["links"],"selector":"div.etm",
"multiple":false,"delay":"400","clickElementSelector":
".learn h3.cur","clickType":"clickOnce","discardInitialElements":
"do-not-discard","clickElementUniquenessType":"uniqueText"},
{"id":"Etymological explanation++","type":"SelectorText","parentSelectors":
["Etymological explanation click"],
"selector":"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Near antonym","type":"SelectorText","parentSelectors":
["links"],"selector":"div.nfo","multiple":false,"regex":"","delay":0},
{"id":"Proximity word click","type":"SelectorElementClick","parentSelectors":
["links"],"selector":"div.nwd","multiple":false,
"delay":"400","clickElementSelector":".rel h3.cur","clickType":"clickOnce",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Proximity word++","type":"SelectorText","parentSelectors":
["Proximity word click"],"selector":"_parent_","multiple":false,"regex":
"","delay":0}]}

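There is nothing special to say here except one point: there are many parts of speech. Some words have many of them (verb, noun, and so on) and some have few. For words with many, that part of the data must not be missed; for words with few, some of the type columns come back empty.

This makes later sorting and management of the columns harder, but it has to be done this way, otherwise some of the data will not be scraped.

Example: (the tab bars for the words "no" and "one" differ, one having more entries than the other)

Here six type slots cover the detailed definitions, which is the section I focus on.

As a result, some type columns are empty, because some words don't have that many types while others have as many as six.

But I think writing this section out separately, type by type, is worthwhile because it helps later study: it keeps the basic parts of speech (verb, noun, pronoun, numeral, and so on) distinct.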
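The fixed six-slot layout can be pictured as padding each word's list of (part of speech, explanation) pairs to the same width. The sketch below shows that on made-up data. The column names loosely mirror the `type_one` ... `type_six` ids in the sitemap, but `to_fixed_columns`, the `explain_*` names, and the sample entries are hypothetical.

```python
SLOTS = ["one", "two", "three", "four", "five", "six"]


def to_fixed_columns(word: str, senses: list[tuple[str, str]]) -> dict:
    """Spread up to six (type, explanation) pairs over fixed columns."""
    if len(senses) > len(SLOTS):
        raise ValueError(f"{word!r} has more than {len(SLOTS)} types")
    row = {"word": word}
    for index, slot in enumerate(SLOTS):
        # Words with fewer types leave the remaining columns empty.
        kind, text = senses[index] if index < len(senses) else ("", "")
        row[f"type_{slot}"] = kind
        row[f"explain_{slot}"] = text
    return row


row = to_fixed_columns("ok", [("adj.", "acceptable"), ("n.", "approval")])
print(row["type_two"], "|", repr(row["type_three"]))   # n. | ''
```

Every row ends up with the same thirteen columns, which is what makes the export sortable even though many cells are blank.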

 

 

 

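For the word "OK", several columns are empty, but that is unavoidable.

Here is a screenshot of the scraped results:

That's the end of the explanation. This probably only becomes clear through hands-on work on real pages, so I hope you will try it yourself. Let's learn and improve together!

There is more to dig into in this extension, and it has many hidden problems, possibly including tricky ones such as attribute conflicts. It may also be that I'm not yet familiar enough with it and need more practice.

This article will keep being updated and improved. If you have questions about it or are interested in the topic, leave a comment to get in touch, and let's learn and improve together!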

 

 


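Update, December 6, 2019

Long time no see. Today I'll walk through a real project: scraping novels from 易讀網 [5]. I do not hold the copyright; remember that the data is for study only, and I bear no responsibility for any consequences of unauthorized use. Now, to the point.

At first glance the structure is clean and well suited to scraping. Let's give it a go.

There is no "clutter" (fancy layout, heavy ads, and so on), and it looks like a very easy site to scrape, yet it turns out to have a detached structure. There is no outer wrapper, which means an Element selector can't be set up, because an Element selector needs to cover one large region. After some thought, I simply made the outer title the bundling node. The structure is shown below.

The JSON string is as follows: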

{"_id":"yidu","startUrl":["https://yiduks.com/artlist_[1-5].html"],
"selectors":[{"id":"外部標題","type":"SelectorText","parentSelectors":["_root"],
"selector":".b_title a","multiple":true,"regex":"","delay":0},
{"id":"版權作者","type":"SelectorText","parentSelectors":["_root"],"selector":
".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"小說類型","type":"SelectorText","parentSelectors":["_root"],
"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"是否連載","type":"SelectorText","parentSelectors":["_root"],
"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"點擊閱讀","type":"SelectorLink","parentSelectors":["_root"],
"selector":".b_read a","multiple":true,"delay":0},
{"id":"章節鏈接","type":"SelectorLink","parentSelectors":["點擊閱讀"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"小說章節","type":"SelectorText","parentSelectors":
["章節鏈接"],"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"小說作者","type":"SelectorText","parentSelectors":["章節鏈接"],"selector":
".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"小說正文","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}

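Later I found that the site imposes the following restriction: (some chapters cannot be viewed; this is not because the scraper was detected, it is an issue with the site itself)

Solution:

OK, enter the invitation code; it does not affect the rest of our work. Keep scraping.

This is the result; the structure still needs some adjustment.

Update, November 19, 2019:

Further testing showed I was wrong earlier: the Element selector can be used with multiple entries, and the other issue I worried about (that it might only scrape a single record) is not a problem at all.

The lesson: practice is the sole criterion for testing truth. Don't jump to conclusions based on guesses, assumptions, or past experience.

This change cannot be made directly in the UI; it can only be made in the JSON string. You just add a bundling element at the head of the structure to tie the selectors together.

The modified JSON is as follows:

Test window:

The earlier problem needs a detailed explanation.

The problem arose when setting up the Element selector: with no large wrapper, I couldn't select all the data to scrape in one go, and without grouping, how could the content inside be scraped?

The bundling selector (a name I made up; it is the rubber band I described earlier in this article) contains all the content, and everything at the next level must be taken from this big plate.

Here the Element region is built up one entry on top of another. See the image below:

After clicking two consecutive entries, the selection extends to the rest, the same idea as selecting child entries.

I won't go into the scraping details here. To keep the length down I have cut many scraping methods already explained earlier in this article; if you have questions about basic usage, look back at how the earlier projects were scraped. Thanks again to everyone following the updates. The article really is getting long.

The structure for this scrape is given below. If you still have questions about it, copy the JSON string and study it at your leisure.

The test JSON is as follows: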

{"_id":"yidutwo","startUrl":
["https://yiduks.com/artlist_[1-2].html"],"selectors":
[{"id":"test","type":"SelectorElement","parentSelectors":["_root"],
"selector":"div.b_row","multiple":true,"delay":0},
{"id":"title","type":"SelectorText","parentSelectors":["test"],
"selector":".b_title a","multiple":false,"regex":"","delay":0},
{"id":"auther","type":"SelectorText","parentSelectors":["test"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"type","type":"SelectorText","parentSelectors":
["test"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"yesnonext","type":"SelectorText",
"parentSelectors":["test"],"selector":"div.b_staus",
"multiple":false,"regex":"","delay":0},
{"id":"readclick","type":"SelectorLink",
"parentSelectors":["test"],"selector":".b_read a",
"multiple":false,"delay":0}]}

 

Based on this JSON, the production sitemap JSON is changed as follows:

{"_id":"yidu","startUrl":
["https://yiduks.com/artlist_[1-5].html"],
"selectors":[{"id":"yiduelement","type":"SelectorElement","parentSelectors":
["_root"],"selector":"div.b_row","multiple":true,"delay":0},
{"id":"外部標題","type":"SelectorText","parentSelectors":["yiduelement"],"selector":
".b_title a","multiple":false,"regex":"","delay":0},
{"id":"版權作者","type":"SelectorText","parentSelectors":["yiduelement"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"小說類型","type":"SelectorText","parentSelectors":
["yiduelement"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"是否連載","type":"SelectorText","parentSelectors":["yiduelement"],
"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"點擊閱讀","type":"SelectorLink","parentSelectors":["yiduelement"],
"selector":".b_read a","multiple":false,"delay":0},
{"id":"章節鏈接","type":"SelectorLink","parentSelectors":["點擊閱讀"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"小說章節","type":"SelectorText","parentSelectors":
["章節鏈接"],"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"小說作者","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"小說正文","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}

 

{"_id":"yidu","startUrl":
["https://yiduks.com/artlist_[1-5].html"],
"selectors":[{"id":"外部標題","type":"SelectorText","parentSelectors":["_root"],
"selector":".b_title a","multiple":false,"regex":"","delay":0},
{"id":"版權作者","type":"SelectorText","parentSelectors":["_root"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"小說類型","type":"SelectorText","parentSelectors":
["_root"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"是否連載","type":"SelectorText","parentSelectors":
["_root"],"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"點擊閱讀","type":"SelectorLink","parentSelectors":
["_root"],"selector":".b_read a","multiple":false,"delay":0},
{"id":"章節鏈接","type":"SelectorLink","parentSelectors":["點擊閱讀"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"小說章節","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"小說作者","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"小說正文","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}

 

{"_id":"yidu","startUrl":
["https://yiduks.com/artlist_[1-5].html"],"selectors":
[{"id":"yidu","type":"SelectorElement","parentSelectors":["_root"],
"selector":"div.b_row","multiple":true,"delay":0},
{"id":"bubiaoti","type":"SelectorText","parentSelectors":["yidu"],
"selector":".b_title a","multiple":false,"regex":"","delay":0},
{"id":"banquanzuozhe","type":"SelectorText","parentSelectors":["yidu"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"xiaoshuoleixing","type":"SelectorText","parentSelectors":
["yidu"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"shifoulianzai","type":"SelectorText","parentSelectors":["yidu"],
"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"yuedu","type":"SelectorLink","parentSelectors":["yidu"],
"selector":".b_read a","multiple":false,"delay":0},
{"id":"lianjie","type":"SelectorLink","parentSelectors":["yuedu"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"zhangjie","type":"SelectorText","parentSelectors":["lianjie"],
"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"zuozhe","type":"SelectorText","parentSelectors":["lianjie"],
"selector":".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"hengwen","type":"SelectorText","parentSelectors":["lianjie"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}
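Hand-editing sitemap JSON makes it easy to leave a `parentSelectors` entry pointing at an id that does not exist, or to drop a comma between selectors. A quick check before importing can catch both. The sketch below is a hypothetical helper (`dangling_parents`), not a Web Scraper feature, and the sample is a shortened, deliberately broken sitemap built from ids used in this article.

```python
import json


def dangling_parents(sitemap: dict) -> list[tuple[str, str]]:
    """Return (selector id, missing parent id) pairs for unresolved parents."""
    known = {"_root"} | {sel["id"] for sel in sitemap["selectors"]}
    problems = []
    for sel in sitemap["selectors"]:
        for parent in sel["parentSelectors"]:
            if parent not in known:
                problems.append((sel["id"], parent))
    return problems


text = """
{"_id":"yidu","startUrl":["https://yiduks.com/artlist_[1-5].html"],
 "selectors":[
  {"id":"yiduelement","type":"SelectorElement","parentSelectors":["_root"],
   "selector":"div.b_row","multiple":true,"delay":0},
  {"id":"title","type":"SelectorText","parentSelectors":["test"],
   "selector":".b_title a","multiple":false,"regex":"","delay":0}]}
"""
sitemap = json.loads(text)   # raises json.JSONDecodeError on a missing comma
print(dangling_parents(sitemap))   # [('title', 'test')]
```

An empty list means every selector hangs off `_root` or off another selector that is actually defined.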

 



References

  1. Web data scraping tool: webscraper, the simplest data scraping tutorial that anyone can use. https://www.cnblogs.com/fengzheng/p/8440806.html
  2. crxdl extension site. https://crxdl.com
  3. Web Scraper official documentation. http://webscraper.top/543178
  4. 海詞網 (dict.cn). https://m.dict.cn/
  5. 易讀網. http://www.yidukk.com/