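Web Scraper: From Beginner to Mastery

[Abstract] This article introduces Web Scraper, a browser extension for pulling data out of web pages. The goal is to learn something new and to improve together with readers; thank you for reading. To keep things clear, the walkthrough combines text and screenshots. (If anything here infringes your rights, please contact me promptly.) This article was written by x-team member 清泓. "Last updated February 23, 2020 [continuously updated]"

(Solemn statement: the copyright of all scraped material belongs to the company or group that owns the scraped site. The scraped data is for learning only. I strongly condemn commercializing it. Do not break the law!)

Main reference for this article: [1]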

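I. Installation

The extension was downloaded from the site [2], an extension library; I tested it and it works.

The download is a crx file. Next, open Chrome. Important: only the Chrome browser is supported!

1. Open Chrome's settings and find Extensions.

2. Turn on Developer mode.

3. Change the crx file's extension to zip and unzip it.

4. On the Extensions page, click the "Load unpacked" button.

5. Web Scraper is now installed.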
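
Step 3 can also be scripted instead of renaming the file by hand. The following is a minimal sketch, not part of the original walkthrough: a crx file is a zip archive with an extra header in front, and Python's zipfile module finds the zip data from the end of the file, so it can usually read the package directly. The function name unpack_crx and the file names in the usage note are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def unpack_crx(crx_path: str, out_dir: str) -> list[str]:
    """Extract a .crx package into out_dir and return the archived file names.

    A .crx file is a zip archive preceded by an extra header, which the
    zipfile module skips because it reads the archive index from the end.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(crx_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

For example, unpack_crx("webscraper.crx", "webscraper_unpacked") would produce a folder that can then be chosen with "Load unpacked" in step 4 (both names are hypothetical).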

That covers the basic installation. Now let's try it out.

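II. First use: scraping all entries of the official CSDN blog.

1. Scrape all titles on the first page of the blog.

(1) Open the page, open the developer tools, find the Web Scraper tab, and click it.

Note that the developer tools must be docked at the bottom of the browser window, as in the layout below.

(2) Create the sitemap and set its start URL (what this article loosely calls the "request header"). This is simply the address of our page: https://blog.csdn.net/blogdevteam/.

(3) Understand what each option means.

When creating a selector, use the Element preview and Data preview features to make sure you have selected the right page elements and data.

1) selector - the CSS selector that picks the elements you want; [3]

2) multiple - check this when you want to select multiple records. Data extracted by two or more selectors that have multiple checked is not merged into a single record;

3) delay - how long to wait before the selector is applied;

4) parent selectors - the parent selector(s) of this selector, which builds the selector tree;

5) Text selector;

6) Link selector;

7) Element selector.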

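(4) Data extraction selectors.

Data extraction selectors simply return data from the selected element. For example, the Text selector extracts the text of the selected element. The following selectors can be used as data extraction selectors:

1) Text selector;

2) Link selector;

3) Link popup selector;

4) Image selector;

5) Table selector;

6) Element attribute selector;

7) HTML selector;

8) Grouped selector.

(5) Set up the rules.

(6) Run the scrape and view the results.

(7) Result: this is the title data scraped from a single page with the rules above.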

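III. Scraping the title, description, date, read count, and comment count for the whole blog.

(1) About multi-page scraping.

Multi-page scraping comes in many flavors, so first look at how the site paginates. The CSDN blog paginates as follows:

Clicking page 2 of the blog changes the URL to https://blog.csdn.net/blogdevteam//article/list/2?

Next, check how many pages this blog has.

There are 37 pages in total.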
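
To make the pagination rule concrete, here is a small sketch that expands a "[first-last]" page range into the individual page URLs. The bracket notation is the same one used in the start URLs of the sitemaps later in this article; the helper name expand_start_url and the cleaned-up CSDN URL (single slash, no trailing "?") are my assumptions, and this only imitates the pattern rather than reproducing the extension's own code.

```python
import re

# Matches one "[first-last]" page range such as "[1-37]".
RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_start_url(pattern: str) -> list[str]:
    """Expand the first "[first-last]" range in a URL pattern into concrete URLs."""
    match = RANGE.search(pattern)
    if match is None:
        return [pattern]  # no range present: a single start URL
    first, last = int(match.group(1)), int(match.group(2))
    return [
        pattern[: match.start()] + str(page) + pattern[match.end():]
        for page in range(first, last + 1)
    ]


urls = expand_start_url("https://blog.csdn.net/blogdevteam/article/list/[1-37]")
print(len(urls), urls[0], urls[-1])
```

With 37 pages, the list runs from .../list/1 to .../list/37, which is what the scrape results should cover.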

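After configuring, save the settings and run a scrape to check whether the results are correct.

The lowest page number is 1 and the highest is 37, so the scrape succeeded. Now let's create several sibling selectors at the same level. The idea is the same as before, just with more content types. The current structure is as follows:

Next, create a few more siblings.

Let's try it. Action.

The result is broken: each row holds at most one field, and everything else is missing, seemingly at random. Why does this happen? Let's check:

1. The structure itself is fine.

2. Each selector works on its own.

3. Checking the rules one by one turns up nothing.

The cause turns out to be multiple!

Only one selector can use it as the starting point, somewhat like a table having only one primary key.

After removing the suspected cause as shown below, the data loads correctly and can be exported as an Excel document.

The exported Excel document:

Note: a common mistake is to create two selectors that both have multiple checked and expect their results to merge. For example, if you select pagination links and navigation links this way, they will not be merged. The correct approach is to use an Element selector to select the wrapping element and add the data selectors as children of that Element selector, instead of checking multiple on each of them.

Pay special attention to this. When I first scraped the site I was using multiple as if it were a grouping selector. The correct approach is to create an Element selector under the default _root and let it do the merging. It is like tying a tuft of hair with a rubber band: the element is the rubber band.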
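
The difference can be illustrated with a toy model. This is only a sketch of the behavior described above, not Web Scraper's actual implementation; the function names and the sample posts are made up for illustration.

```python
def rows_without_element(columns: dict[str, list[str]]) -> list[dict[str, str]]:
    """Toy model: every 'multiple' selector emits its own one-field records."""
    return [{name: value} for name, values in columns.items() for value in values]


def rows_with_element(items: list[dict[str, str]]) -> list[dict[str, str]]:
    """Toy model: one record per wrapping element, carrying all child fields."""
    return [dict(item) for item in items]


# Hypothetical blog entries, each one a wrapping element with two child fields.
posts = [
    {"title": "Post A", "date": "2020-01-01"},
    {"title": "Post B", "date": "2020-01-02"},
]
# What two independent 'multiple' selectors would each see on the page.
columns = {
    "title": [p["title"] for p in posts],
    "date": [p["date"] for p in posts],
}

print(rows_without_element(columns))  # four rows, one field each
print(rows_with_element(posts))       # two rows, both fields in each
```

The first output has the "at most one field per row" shape seen in the broken export; the second is what the Element wrapper is meant to achieve.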

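Addendum, February 23, 2020: (thanks to the helpful Zhihu readers who raised this question; the images below were also provided by them)

As figure multiple1 shows, the data was scraped using only multiple. The problem is that when several fields are scraped in bulk, you easily end up with the situation in figure multiple2: not all fields of a record are captured, so records have missing fields. For example, suppose I scrape movies with three fields: title, synopsis, and rating. The result contains only the synopsis, or only the title.

multiple1 (only multiple is set)

multiple2

The solution is to add an element-type wrapper under _root that bundles these fields together, as shown in figure element3:

element3

Addendum, March 21, 2020:

How to add an element:

Select the Element selector type

The element's content area is really just a parent container

The element is created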

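IV. Scraping multi-level data.

To scrape a second-level page, set up a selector that acts as a child data source.

Now let's scrape the blog of 處女座程序猿, keeping it simple: a single field on each listing page, plus the full body text of the second-level pages. Two things are in focus here: scraping data across page levels, and setting up the bridge to the child data source.

1. First create a new start URL. We won't scrape much for now, just pages 1-5 of 處女座程序猿's blog. The start URL is as follows; click save.

2. Create the parent selector.

Once the parent selector exists, we can create child selectors inside it.

Click into the parent selector to create a child selector. To keep the flow simple, each level carries just one field. This child is itself the bridge: its type is Link, meaning the title is the link source (much like manually clicking a recommended title on Zhihu to open the full article). It of course sits under the root selector we just created.

3. Start scraping the second-level page content.

After opening the child page, add a Text selector directly inside the child selector just created, and select the whole article body as its content.

4. Structure diagram of the whole sitemap.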

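V. That ends the multi-level warm-up. Next is scraping multiple fields across multiple levels. The approach is:

1. Create a common parent selector.

2. Create several branch selectors.

3. Under the branch selectors, create further child branches; children can have children, and so on.

These are the scraped data fields: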

{"_id":"zhihu","startUrl":["https://www.zhihu.com/question/352108632"],
"selectors":[{"id":"anwer","type":"SelectorElementScroll","parentSelectors":
["_root"],"selector":"div.List-item:nth-of-type(1)","multiple":true,"delay":0},
{"id":"name","type":"SelectorText","parentSelectors":["anwer"],
"selector":"#Popover13-toggle a","multiple":false,"regex":"","delay":0},
{"id":"Agree with the number","type":"SelectorText","parentSelectors":["anwer"],
"selector":".Voters button","multiple":false,"regex":"","delay":0},
{"id":"content","type":"SelectorText","parentSelectors":["anwer"],
"selector":"span[itemprop='text']","multiple":false,"regex":"","delay":0},
{"id":"Editing time","type":"SelectorText","parentSelectors":
["anwer"],"selector":"a span","multiple":false,"regex":"","delay":0},
{"id":"comment","type":"SelectorText","parentSelectors":["anwer"],
"selector":"button.ContentItem-action:nth-of-type(1)",
"multiple":false,"regex":"","delay":0}]}

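Because the exported sitemap is a flat list, the parent/child structure is hard to see at a glance. The sketch below prints the selectors as an indented tree by following parentSelectors. The helper name selector_tree is an assumption, and the embedded sitemap is a shortened copy of the zhihu one above (three of its selectors), kept small so the example is self-contained.

```python
import json


def selector_tree(sitemap: dict, parent: str = "_root",
                  depth: int = 0, seen: tuple = ()) -> list[str]:
    """Return selector ids as indented lines, each child under its parent."""
    lines = []
    for sel in sitemap["selectors"]:
        # "seen" guards against selectors that list themselves as a parent.
        if parent in sel["parentSelectors"] and sel["id"] not in seen:
            lines.append("  " * depth + f'{sel["id"]} ({sel["type"]})')
            lines.extend(
                selector_tree(sitemap, sel["id"], depth + 1, seen + (sel["id"],))
            )
    return lines


sitemap = json.loads("""
{"_id": "zhihu", "startUrl": ["https://www.zhihu.com/question/352108632"],
 "selectors": [
  {"id": "anwer", "type": "SelectorElementScroll", "parentSelectors": ["_root"],
   "selector": "div.List-item:nth-of-type(1)", "multiple": true, "delay": 0},
  {"id": "name", "type": "SelectorText", "parentSelectors": ["anwer"],
   "selector": "#Popover13-toggle a", "multiple": false, "regex": "", "delay": 0},
  {"id": "content", "type": "SelectorText", "parentSelectors": ["anwer"],
   "selector": "span[itemprop='text']", "multiple": false, "regex": "", "delay": 0}
 ]}
""")
print("\n".join(selector_tree(sitemap)))
```

The output shows the scrolling element at the top level with its Text selectors indented beneath it, which matches the "common parent, then branches" approach listed above.
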
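VI. Scraping multiple fields from scrolling pages

Taking Zhihu answers as an example:

Note: when choosing what to select, especially tag attributes, be sure to pick the right one; a wrong choice makes the scrape fail.

November 11-12, 2019: scraping content behind a fixed click event on a second-level page

While scraping a site recently, I found that some data on the second-level page sits in an expandable section. Unless it is clicked, the collapsed content is not captured.

The actual problem when scraping the 海詞 (Haici) dictionary [4]: the synonym/antonym section is sometimes expanded and sometimes collapsed. Web Scraper cannot detect this on its own, so every value came back empty. The sitemap has to be written to fit the actual behavior of the site.

Looking through the official documentation, I found something useful: Element click. I used to think it was only for clicking through pagination, but it has another use: it can click a button on the current page and then scrape the content that is present after the click. I studied it carefully.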

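4.11 Element click selector

The Element click selector works like the Element selector. Its main purpose is also to select elements that serve as parents for child selectors. The only difference is that the Element click selector can interact with the site by clicking buttons to load new elements, for example on pages that use JavaScript and AJAX for navigation or loading.

4.11.1 Configuration options

1) selector - CSS selector for the elements that will be parents of the child selectors.

2) click selector - CSS selector for the button(s) to click to load more elements.

3) click type - how the selector knows that there are no new elements and that it should stop clicking.

4) click element uniqueness - how the selector determines that a button has already been clicked.

5) multiple - select multiple records (should normally be checked). multiple on the child selectors is usually left unchecked.

6) delay - the interval between a click and the search for elements. It needs to be set because data may not load immediately after the click. Since servers do not always respond promptly, set it to 2000 ms or more to avoid losing data.

7) Discard initial elements - the selector will not select elements that already existed before the first click. This is useful for de-duplication.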
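
In an exported sitemap these options appear as JSON keys. The sketch below builds one such entry using the key names found in the sitemaps later in this article (clickElementSelector, clickType, discardInitialElements, clickElementUniquenessType). The helper name element_click_selector and the CSS selectors in the example call are hypothetical.

```python
import json


def element_click_selector(
    selector_id: str,
    parent: str,
    selector: str,
    click_selector: str,
    click_type: str = "clickMore",
    delay_ms: int = 2000,
) -> dict:
    """Build one Element click entry for a sitemap's "selectors" list."""
    if click_type not in ("clickOnce", "clickMore"):
        raise ValueError(f"unknown click type: {click_type}")
    return {
        "id": selector_id,
        "type": "SelectorElementClick",
        "parentSelectors": [parent],
        "selector": selector,                    # option 1: parent elements
        "multiple": True,                        # option 5
        "delay": delay_ms,                       # option 6: wait after each click
        "clickElementSelector": click_selector,  # option 2: the button to click
        "clickType": click_type,                 # option 3
        "discardInitialElements": "do-not-discard",   # option 7
        "clickElementUniquenessType": "uniqueText",   # option 4
    }


# Hypothetical selectors for a page with a "load more" button.
entry = element_click_selector("more_items", "_root", "div.item", "button.load-more")
print(json.dumps(entry, indent=2))
```

Defaulting the delay to 2000 ms follows the advice in option 6; a slow site may need more.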

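4.11.2 Click type

Key points:

1) Click Once

Click Once clicks each button only once. If new buttons matching the click selector appear, they are clicked too. For example, pagination links may show only 1~5 at first, with 6~10 appearing later; the selector will click those (6~10) as well.

2) Click More

Click More keeps clicking the existing buttons until no new elements appear. New elements are identified by having unique text content.

4.11.3 Click element uniqueness

With Click Once, the same button is clicked only once. With Click More, clicking continues until no new elements are produced.

1) Unique Text - buttons with the same text content are treated as the same button

2) Unique HTML+Text - buttons with the same HTML and text content are treated as the same button

3) Unique HTML - buttons with the same HTML are treated as the same button

4) Unique CSS Selector - buttons with the same CSS selector are treated as the same button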
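
The Click More stopping rule can be pictured with a toy loop. This is a simplified model of the description above (keep clicking until a click yields no element with new text), not the extension's real logic; the function name click_more and the sample data are invented.

```python
def click_more(pages: list[list[str]]) -> list[str]:
    """Toy model of Click More: stop once a click adds no element with new text.

    pages[i] is the list of element texts visible after i clicks.
    """
    seen: list[str] = []
    for visible in pages:
        new = [text for text in visible if text not in seen]
        if not new and seen:
            break  # the last click produced nothing new, so stop clicking
        seen.extend(new)
    return seen


after_clicks = [
    ["a", "b"],                 # initial page
    ["a", "b", "c", "d"],       # after the first click
    ["a", "b", "c", "d"],       # the second click loads nothing new
    ["a", "b", "c", "d", "e"],  # never reached in this model
]
print(click_more(after_clicks))
```

In this model the delay option corresponds to how long one waits before looking at the next entry of pages: if the wait is too short, a click can look as if it loaded nothing and the loop stops early.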

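Example:

Specifically:

1. Click type

The click type. Click more means clicking repeatedly; since we want to scrape data in bulk, choose click more here. The other option, click once, clicks a single time.

2. Click element uniqueness

This option controls when Web Scraper stops scraping. Unique Text, for example, means scraping stops when the button text changes.

A site's data is never endless; it always finishes loading at some point. When that happens, the "load more" button text may change to "no more", "no more data", "all loaded", and so on. When the text changes, Web Scraper knows there is no more data and stops scraping automatically.

3. Multiple

Our old friend: whether to select multiple items. We want multiple records here, so check it.

4. Discard initial elements

Whether to discard the initial elements. This is mainly used to remove duplicate data on some sites. It is not very important and not needed here, so choose Never discard.

5. Delay

The wait time. After clicking load more, the data needs some time to load, and delay is how long to wait for it. It is generally set to 2000 or more, since a 2 s delay is a reasonable value; if the network is slow, set a larger number.

This fits the Haici dictionary case exactly.

The problem is now solved. Here is what I actually did:

First, this assumes we have our own server.

We build our own HTML5 page containing links. This relies on Web Scraper's ability to scrape second-level pages: the links are written by hand into the first page, and the Haici dictionary data is then scraped through them. (Solemn statement: the copyright of all Haici dictionary material belongs to the company that owns the Haici dictionary. The scraped data is for learning only. I strongly condemn commercializing it. Do not break the law!)

The HTML5 page source is shown below:

Opened in a browser, the page looks like this:

Here we use Web Scraper's second-level page scraping to fetch the Haici content. We see the following:

The page in this figure also has click events, and after clicking there is a further "view more"...

This confirms the problem I ran into earlier.

The fix is to change the JSON from the first part of the figure: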

{"_id":"test_python_bigboom","startUrl":
["http://shupai.downline.cn/local_test_db_009/001_center_data_shupai/000_test_python_webscraper_data_explesion.html"],
"selectors":[{"id":"root","type":"SelectorElement","parentSelectors":["_root"],
"selector":"a","multiple":true,"delay":0},{"id":"titlelink","type":"SelectorLink",
"parentSelectors":["root"],"selector":"_parent_","multiple":true,"delay":0},
{"id":"word_name","type":"SelectorText","parentSelectors":["titlelink"]
,"selector":"h1.keyword","multiple":false,"regex":"","delay":0},
{"id":"haici_n","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"haici_adj","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"haici_pron","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(3)","multiple":false,"regex":"","delay":0},
{"id":"Detailed interpretation","type":"SelectorText","parentSelectors":
["titlelink"],"selector":"div.detail","multiple":false,"regex":"","delay":0},
{"id":"Near antonym","type":"SelectorText","parentSelectors":["titlelink"],
"selector":"div.nfo","multiple":false,"regex":"","delay":0},
{"id":"Proximity word","type":"SelectorElementClick","parentSelectors":
["titlelink"],"selector":".rel h3.cur","multiple":false,"delay":0,
"clickElementSelector":".rel h3.cur","clickType":"clickOnce",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"}]}

Changed to:

{"_id":"test_python_bigboom","startUrl":
["http://shupai.downline.cn/local_test_db_009/001_center_data_shupai/000_test_python_webscraper_data_explesion.html"],
"selectors":[{"id":"root","type":"SelectorElement","parentSelectors":
["_root"],"selector":"a","multiple":true,"delay":0},{"id":"titlelink","type":
"SelectorLink","parentSelectors":["root"],"selector":"_parent_",
"multiple":true,"delay":0},{"id":"word_name","type":"SelectorText",
"parentSelectors":["titlelink"],"selector":"h1.keyword",
"multiple":false,"regex":"","delay":0},
{"id":"haici_n","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"haici_adj","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"haici_pron","type":"SelectorText","parentSelectors":["titlelink"],
"selector":".basic li:nth-of-type(3)","multiple":false,"regex":"","delay":0},
{"id":"Detailed interpretation","type":"SelectorText","parentSelectors":
["titlelink"],"selector":"div.detail","multiple":false,"regex":"","delay":0},
{"id":"Near antonym","type":"SelectorText","parentSelectors":["titlelink"],
"selector":"div.nfo","multiple":false,"regex":"","delay":0},
{"id":"Proximity word","type":"SelectorElementClick","parentSelectors":
["titlelink"],"selector":"div.nwd","multiple":true,"delay":"2000",
"clickElementSelector":".rel h3.cur","clickType":"clickMore",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"liju","type":"SelectorText","parentSelectors":
["titlelink"],"selector":"div.sort","multiple":false,"regex":"","delay":0},
{"id":"linjinyici","type":"SelectorText","parentSelectors":["Proximity word"],
"selector":"_parent_","multiple":false,"regex":"","delay":0}]}

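The two sitemaps are long, so the actual change is easy to miss. The sketch below compares the "Proximity word" entry before and after and prints only the fields that differ. The values are copied from the two JSON strings above; the helper name selector_changes is an assumption. Besides these field changes, the revised sitemap also adds two Text selectors, liju and linjinyici.

```python
def selector_changes(before: dict, after: dict) -> dict[str, tuple]:
    """Return {field: (old, new)} for every field whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# The "Proximity word" selector from the first sitemap.
before = {
    "id": "Proximity word", "type": "SelectorElementClick",
    "parentSelectors": ["titlelink"], "selector": ".rel h3.cur",
    "multiple": False, "delay": 0, "clickElementSelector": ".rel h3.cur",
    "clickType": "clickOnce", "discardInitialElements": "do-not-discard",
    "clickElementUniquenessType": "uniqueText",
}
# The same selector in the revised sitemap.
after = dict(before, selector="div.nwd", multiple=True,
             delay="2000", clickType="clickMore")

for field, (old, new) in selector_changes(before, after).items():
    print(f"{field}: {old!r} -> {new!r}")
```

In short: the selector now targets the content block (div.nwd) instead of the button, multiple is on, the click type is clickMore, and the delay is raised to 2000 ms.
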
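The structure diagram:

In the result, Proximity word is an Element click selector (click more), so it does not appear in the results table itself.

linjinyici, the Text selector under Proximity word, is what actually shows up in the results table.

Here is the result; previously this data could not be scraped.

The above is only the method. Applying it for real surfaced some further problems,

so I made a second revision:

Tree diagram:

The selector graph area in Web Scraper is only this big (dragging sideways or vertically does not enlarge it), so bear with it; if it is unreadable, just import the JSON file.

The JSON file: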

{"_id":"python_haici","startUrl":
["http://shupai.downline.cn/001_center_data_shupai/000_test_python_webscraper_data_explesion.html"],
"selectors":[{"id":"base","type":"SelectorElement","parentSelectors":["_root"],
"selector":"a","multiple":true,"delay":0},{"id":"links","type":"SelectorLink",
"parentSelectors":["base"],"selector":"_parent_","multiple":true,"delay":0},
{"id":"word","type":"SelectorText","parentSelectors":["links"],
"selector":"h1.keyword","multiple":false,"regex":"","delay":0},
{"id":"Basic interpretation","type":"SelectorText","parentSelectors":["links"],
"selector":"div.word","multiple":false,"regex":"","delay":0},
{"id":"type_one","type":"SelectorText","parentSelectors":["links"],
"selector":".detail span:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"Explain one","type":"SelectorText","parentSelectors":["links"],
"selector":".detail ol:nth-of-type(1)","multiple":false,"regex":"","delay":0},
{"id":"type_two","type":"SelectorText","parentSelectors":["links"],
"selector":".detail span:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"Explain two","type":"SelectorText","parentSelectors":["links"],
"selector":".detail ol:nth-of-type(2)","multiple":false,"regex":"","delay":0},
{"id":"type_three","type":"SelectorText","parentSelectors":["links"],
"selector":".layout span:nth-of-type(3)","multiple":false,"regex":"","delay":0},
{"id":"Explain_three","type":"SelectorText","parentSelectors":
["links"],"selector":".detail ol:nth-of-type(3)","multiple":false,"regex":"",
"delay":0},{"id":"type_four","type":"SelectorText","parentSelectors":["links"],
"selector":"span:nth-of-type(4)","multiple":false,"regex":"","delay":0},
{"id":"Explain four","type":"SelectorText","parentSelectors":
["links"],"selector":"ol:nth-of-type(4)","multiple":false,"regex":"","delay":0},
{"id":"type_five","type":"SelectorText","parentSelectors":["links"],
"selector":"span:nth-of-type(5)","multiple":false,"regex":"","delay":0},
{"id":"Explain_five","type":"SelectorText","parentSelectors":
["links"],"selector":"ol:nth-of-type(5)","multiple":false,"regex":"","delay":0},
{"id":"type_six","type":"SelectorText","parentSelectors":
["links"],"selector":"span:nth-of-type(6)","multiple":false,"regex":"","delay":0},
{"id":"Explain_six","type":"SelectorText","parentSelectors":
["links"],"selector":"ol:nth-of-type(6)","multiple":false,"regex":"","delay":0},
{"id":"English plus English interpretation click",
"type":"SelectorElementClick","parentSelectors":["links"],
"selector":"div.en","multiple":false,"delay":"400",
"clickElementSelector":".def h3.cur","clickType":"clickMore",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"English plus English interpretation++",
"type":"SelectorText","parentSelectors":
["English plus English interpretation click"],"selector":
"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Double interpretation click","type":
"SelectorElementClick","parentSelectors":["links"],"selector":
"div.dual","multiple":false,"delay":"400","clickElementSelector":
".def h3.cur","clickType":"clickMore","discardInitialElements":"do-not-discard",
"clickElementUniquenessType":"uniqueText"},
{"id":"Double interpretation++","type":"SelectorText","parentSelectors":
["Double interpretation click"],"selector":"_parent_","multiple":false,"regex":"",
"delay":0},{"id":"Example","type":"SelectorText","parentSelectors":["links"],
"selector":"div.sort","multiple":false,"regex":"","delay":0},
{"id":"Common sentence pattern click",
"type":"SelectorElementClick","parentSelectors":["links"],"selector":
"div.patt","multiple":false,"delay":"400","clickElementSelector":
".sent h3.cur","clickType":"clickOnce","discardInitialElements":
"do-not-discard","clickElementUniquenessType":"uniqueText"},
{"id":"Common sentence pattern++","type":"SelectorText","parentSelectors":
["Common sentence pattern click"],"selector":
"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Common Phrases click","type":
"SelectorElementClick","parentSelectors":["links"],
"selector":"div.phrase","multiple":false,"delay":"400","clickElementSelector":
".sent h3.cur","clickType":"clickOnce","discardInitialElements":"do-not-discard",
"clickElementUniquenessType":"uniqueText"},
{"id":"Common Phrases++","type":"SelectorText","parentSelectors":
["Common Phrases click"],"selector":"_parent_","multiple":false,"regex":"",
"delay":0},{"id":"Vocabulary matching click","type":
"SelectorElementClick","parentSelectors":["links"],"selector":"div.coll",
"multiple":false,"delay":0,"clickElementSelector":".sent h3.cur","clickType":
"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Vocabulary matching++",
"type":"SelectorText","parentSelectors":["Vocabulary matching click"],
"selector":"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Classic citation click","type":
"SelectorElementClick","parentSelectors":["links"],
"selector":"div.auth","multiple":false,"delay":"400","clickElementSelector":
".sent h3.cur","clickType":"clickOnce","discardInitialElements":"do-not-discard",
"clickElementUniquenessType":"uniqueText"},
{"id":"Classic citation++","type":"SelectorText","parentSelectors":
["Classic citation click"],"selector":"_parent_","multiple":false,"regex":
"","delay":0},{"id":"Word usage","type":"SelectorText","parentSelectors":
["links"],"selector":"div.ess","multiple":false,"regex":"","delay":0},
{"id":"Discrimination of word meaning click","type":
"SelectorElementClick","parentSelectors":
["links"],"selector":"div.discrim","multiple":false,"delay":"400",
"clickElementSelector":".learn h3.cur","clickType":
"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Discrimination of word meaning++",
"type":"SelectorText","parentSelectors":["Discrimination of word meaning click"],
"selector":"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Common mistakes click","type":"SelectorElementClick","parentSelectors":
["links"],"selector":"div.comn","multiple":false,"delay":"400",
"clickElementSelector":".learn h3.cur","clickType":"clickOnce",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Common mistakes++","type":"SelectorText","parentSelectors":
["Common mistakes click"],"selector":
"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Etymological explanation click","type":
"SelectorElementClick","parentSelectors":["links"],"selector":"div.etm",
"multiple":false,"delay":"400","clickElementSelector":
".learn h3.cur","clickType":"clickOnce","discardInitialElements":
"do-not-discard","clickElementUniquenessType":"uniqueText"},
{"id":"Etymological explanation++","type":"SelectorText","parentSelectors":
["Etymological explanation click"],
"selector":"_parent_","multiple":false,"regex":"","delay":0},
{"id":"Near antonym","type":"SelectorText","parentSelectors":
["links"],"selector":"div.nfo","multiple":false,"regex":"","delay":0},
{"id":"Proximity word click","type":"SelectorElementClick","parentSelectors":
["links"],"selector":"div.nwd","multiple":false,
"delay":"400","clickElementSelector":".rel h3.cur","clickType":"clickOnce",
"discardInitialElements":"do-not-discard","clickElementUniquenessType":
"uniqueText"},{"id":"Proximity word++","type":"SelectorText","parentSelectors":
["Proximity word click"],"selector":"_parent_","multiple":false,"regex":
"","delay":0}]}

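There is not much special to say here, except for one point: there are many parts of speech. Some words have many of them (verb, noun, and so on) and some have few. For words with many, those definitions must not be lost, so words with few end up with empty values in some columns.

This makes sorting and managing the columns harder later, but it has to be done this way, otherwise part of the data cannot be scraped.

Example: (the words "no" and "one" have different sets of usage tabs)

Here six type columns cover the detailed definitions, which is the section I focus on.

As a result some type columns are empty, because some words do not have that many parts of speech while others have as many as six.

Still, I think splitting this section by part of speech is worthwhile because it helps later study: it keeps the basic categories (verb, noun, pronoun, numeral, and so on) apart.

"OK", for example, has several empty values, but that is unavoidable.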
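
The six type/definition pairs all follow one pattern, so they can be generated rather than clicked together one by one. This is a sketch under assumptions: the ids type_1, explain_1, and so on are hypothetical (the sitemap above uses names such as type_one and Explain one), and every selector is given the .detail prefix, which the sitemap above only uses for some of them.

```python
def part_of_speech_selectors(parent: str, count: int = 6) -> list[dict]:
    """Generate paired Text selectors for the n-th part of speech and its definitions."""
    selectors = []
    for n in range(1, count + 1):
        # "span" holds the part-of-speech label, "ol" the list of definitions.
        for prefix, tag in (("type", "span"), ("explain", "ol")):
            selectors.append({
                "id": f"{prefix}_{n}",
                "type": "SelectorText",
                "parentSelectors": [parent],
                "selector": f".detail {tag}:nth-of-type({n})",
                "multiple": False,
                "regex": "",
                "delay": 0,
            })
    return selectors


for sel in part_of_speech_selectors("links"):
    print(sel["id"], "->", sel["selector"])
```

A word with only two parts of speech simply matches nothing for n = 3 to 6, which is where the empty columns come from.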

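Here is the result after scraping:

That's all for the explanation. This probably only sinks in by working on a real page, so I hope you will try it hands-on. Let's learn and improve together!

There is more to dig into in this extension, and it has plenty of hidden pitfalls, possibly including conflicts between options. I may simply not be familiar enough with it yet and need more practice.

This article will keep being updated and improved. If you have questions about it or are interested in this area, leave a comment to get in touch, and let's learn and improve together. Come on!

Update, December 6, 2019

Long time no see, everyone. Today's hands-on project is scraping novels from 易讀網 (yiduks.com) [5]. I do not hold the copyright. Remember that the data is for learning only; I accept no liability for any consequences of misusing it. Enough said, on to the topic.

At first glance the structure is clean and well suited to scraping. Let's give it a go.

There is no "clutter" (fancy layout, heavy ads, and so on), and it looks like a very easy site to scrape. Unexpectedly, its structure is split: there is no outer frame, which I took to mean that an Element selector could not be set, because an Element selector needs to cover a whole region. After some thought I used the outer title directly as the bundling point. The structure is shown below.

The JSON string: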

{"_id":"yidu","startUrl":["https://yiduks.com/artlist_[1-5].html"],
"selectors":[{"id":"外部標題","type":"SelectorText","parentSelectors":["_root"],
"selector":".b_title a","multiple":true,"regex":"","delay":0},
{"id":"版權作者","type":"SelectorText","parentSelectors":["_root"],"selector":
".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"小說類型","type":"SelectorText","parentSelectors":["_root"],
"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"是否連載","type":"SelectorText","parentSelectors":["_root"],
"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"點擊閱讀","type":"SelectorLink","parentSelectors":["_root"],
"selector":".b_read a","multiple":true,"delay":0},
{"id":"章節鏈接","type":"SelectorLink","parentSelectors":["點擊閱讀"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"小說章節","type":"SelectorText","parentSelectors":
["章節鏈接"],"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"小說作者","type":"SelectorText","parentSelectors":["章節鏈接"],"selector":
".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"小說正文","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}

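Later I found a problem: the site imposes the following restriction. (Some chapters cannot be read. This is not because the scraping was detected; it is a limitation of the site itself.)

Solution:

OK, enter the invitation code. It does not affect what follows, so keep scraping.

This is the result; the structure still needs some adjustment.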

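Update, November 19, 2019:

After further testing I found I had been wrong: Element can be used when there are multiple entries, and the other problem I was worried about, that it might only scrape a single record, is not a problem at all.

The lesson: practice is the sole criterion for testing truth. Don't jump to conclusions based on guesses, assumptions, or past experience.

This change cannot be made directly in that part of the interface. It has to be made in the JSON string, by adding a bundling element at the head of the structure to tie the selectors together.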
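
The JSON edit can be sketched in a few lines: insert an Element selector at the front and point the row-level selectors at it. The function name wrap_in_element and the demo ids (row, title, author) are assumptions for illustration; div.b_row and the two field selectors are the ones used in this article's sitemaps.

```python
import copy
import json


def wrap_in_element(sitemap: dict, element_id: str, css: str,
                    child_ids: list[str]) -> dict:
    """Return a copy of sitemap with child_ids re-parented under a new Element selector."""
    result = copy.deepcopy(sitemap)
    for sel in result["selectors"]:
        if sel["id"] in child_ids:
            sel["parentSelectors"] = [element_id]
            sel["multiple"] = False  # one value per wrapping element
    element = {
        "id": element_id, "type": "SelectorElement", "parentSelectors": ["_root"],
        "selector": css, "multiple": True, "delay": 0,
    }
    result["selectors"].insert(0, element)
    return result


# A cut-down, hypothetical version of the novel-list sitemap.
sitemap = {
    "_id": "yidu_demo",
    "startUrl": ["https://yiduks.com/artlist_[1-2].html"],
    "selectors": [
        {"id": "title", "type": "SelectorText", "parentSelectors": ["_root"],
         "selector": ".b_title a", "multiple": True, "regex": "", "delay": 0},
        {"id": "author", "type": "SelectorText", "parentSelectors": ["_root"],
         "selector": ".b_auth a", "multiple": False, "regex": "", "delay": 0},
    ],
}
wrapped = wrap_in_element(sitemap, "row", "div.b_row", ["title", "author"])
print(json.dumps(wrapped, ensure_ascii=False))
```

The printed string can be pasted into Web Scraper's sitemap import. Only the Element keeps multiple checked; its children each return one value per row.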

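The modified JSON data:

Test window:

The earlier problem deserves a detailed explanation.

The problem arose when setting up the Element selector: without a big outer frame, I could not select all the data to scrape in one go, and without grouping, how could the content inside be scraped?

The bundling selector (a name I made up; it is the rubber band described earlier in this article) contains all the content, and every lower-level selector pulls from this one big plate.

Here the element regions stack up one entry after another. See the figure below:

Clicking two consecutive entries stacks them, the same idea as selecting child data entries.

I won't explain the scraping in detail here. To keep the length down, I have cut many steps already covered above; if you have questions about basic usage, look back at how the earlier projects were scraped. Thanks again for following the updates. This article really is getting long.

The structure for this scrape is given below. If you still have questions about it, copy the JSON string and study it.

Test JSON: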

{"_id":"yidutwo","startUrl":
["https://yiduks.com/artlist_[1-2].html"],"selectors":
[{"id":"test","type":"SelectorElement","parentSelectors":["_root"],
"selector":"div.b_row","multiple":true,"delay":0},
{"id":"title","type":"SelectorText","parentSelectors":["test"],
"selector":".b_title a","multiple":false,"regex":"","delay":0},
{"id":"auther","type":"SelectorText","parentSelectors":["test"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"type","type":"SelectorText","parentSelectors":
["test"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"yesnonext","type":"SelectorText",
"parentSelectors":["test"],"selector":"div.b_staus",
"multiple":false,"regex":"","delay":0},
{"id":"readclick","type":"SelectorLink",
"parentSelectors":["test"],"selector":".b_read a",
"multiple":false,"delay":0}]}

Based on this JSON, the full sitemap is changed as follows:

{"_id":"yidu","startUrl":
["https://yiduks.com/artlist_[1-5].html"],
"selectors":[{"id":"yiduelement","type":"SelectorElement","parentSelectors":
["_root"],"selector":"div.b_row","multiple":true,"delay":0},
{"id":"外部標題","type":"SelectorText","parentSelectors":["yiduelement"],"selector":
".b_title a","multiple":false,"regex":"","delay":0},
{"id":"版權作者","type":"SelectorText","parentSelectors":["yiduelement"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"小說類型","type":"SelectorText","parentSelectors":
["yiduelement"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"是否連載","type":"SelectorText","parentSelectors":["yiduelement"],
"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"點擊閱讀","type":"SelectorLink","parentSelectors":["yiduelement"],
"selector":".b_read a","multiple":false,"delay":0},
{"id":"章節鏈接","type":"SelectorLink","parentSelectors":["點擊閱讀"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"小說章節","type":"SelectorText","parentSelectors":
["章節鏈接"],"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"小說作者","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"小說正文","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}

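When a sitemap is edited by hand like this, it is easy to leave a parentSelectors entry pointing at an id that does not exist, for example a child still naming the test element after the element was given a new id. Here is a minimal check, with a hypothetical helper name (dangling_parents) and a cut-down example sitemap:

```python
def dangling_parents(sitemap: dict) -> list[tuple[str, str]]:
    """Return (selector id, missing parent id) pairs for broken parent references."""
    known = {"_root"} | {sel["id"] for sel in sitemap["selectors"]}
    return [
        (sel["id"], parent)
        for sel in sitemap["selectors"]
        for parent in sel["parentSelectors"]
        if parent not in known
    ]


# Hypothetical hand-edited sitemap: "title" still points at the old id "test".
edited = {
    "_id": "yidu_demo",
    "selectors": [
        {"id": "yiduelement", "type": "SelectorElement", "parentSelectors": ["_root"]},
        {"id": "title", "type": "SelectorText", "parentSelectors": ["test"]},
        {"id": "author", "type": "SelectorText", "parentSelectors": ["yiduelement"]},
    ],
}
print(dangling_parents(edited))
```

An empty list means every selector hangs off _root or another selector that exists; anything else names the selector to fix before importing.
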
{"_id":"yidu","startUrl":
["https://yiduks.com/artlist_[1-5].html"],
"selectors":[{"id":"外部標題","type":"SelectorText","parentSelectors":["_root"],
"selector":".b_title a","multiple":false,"regex":"","delay":0},
{"id":"版權作者","type":"SelectorText","parentSelectors":["_root"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"小說類型","type":"SelectorText","parentSelectors":
["_root"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"是否連載","type":"SelectorText","parentSelectors":
["_root"],"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"點擊閱讀","type":"SelectorLink","parentSelectors":
["_root"],"selector":".b_read a","multiple":false,"delay":0},
{"id":"章節鏈接","type":"SelectorLink","parentSelectors":["點擊閱讀"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"小說章節","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"小說作者","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"小說正文","type":"SelectorText","parentSelectors":["章節鏈接"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}

{"_id":"yidu","startUrl":
["https://yiduks.com/artlist_[1-5].html"],"selectors":
[{"id":"yidu","type":"SelectorElement","parentSelectors":["_root"],
"selector":"div.b_row","multiple":true,"delay":0},
{"id":"bubiaoti","type":"SelectorText","parentSelectors":["yidu"],
"selector":".b_title a","multiple":false,"regex":"","delay":0},
{"id":"banquanzuozhe","type":"SelectorText","parentSelectors":["yidu"],
"selector":".b_auth a","multiple":false,"regex":"","delay":0},
{"id":"xiaoshuoleixing","type":"SelectorText","parentSelectors":
["yidu"],"selector":"div.b_artc","multiple":false,"regex":"","delay":0},
{"id":"shifoulianzai","type":"SelectorText","parentSelectors":["yidu"],
"selector":"div.b_staus","multiple":false,"regex":"","delay":0},
{"id":"yuedu","type":"SelectorLink","parentSelectors":["yidu"],
"selector":".b_read a","multiple":false,"delay":0},
{"id":"lianjie","type":"SelectorLink","parentSelectors":["yuedu"],
"selector":"td[width] a","multiple":true,"delay":0},
{"id":"zhangjie","type":"SelectorText","parentSelectors":["lianjie"],
"selector":"b","multiple":false,"regex":"","delay":0},
{"id":"zuozhe","type":"SelectorText","parentSelectors":["lianjie"],
"selector":".MC a[title]","multiple":false,"regex":"","delay":0},
{"id":"hengwen","type":"SelectorText","parentSelectors":["lianjie"],
"selector":"div.ART","multiple":false,"regex":"","delay":0}]}
