[Python Crawler] 6. Data Extraction with XPath and the lxml Library

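1. Unstructured and structured data

Generally, what we need to scrape is the content of some website or application, from which we extract the valuable parts. That content usually falls into two kinds: unstructured data and structured data.

Unstructured data: the data comes first, then the structure.
Structured data: the structure comes first, then the data.
Different kinds of data call for different processing methods.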

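Method	Unstructured data	Structured data
Regular expressions	Text, phone numbers, email addresses, HTML files	XML files
XPath	HTML files	XML files
CSS selectors	HTML files	XML files
JSON Path		JSON files
Conversion to Python types		JSON files (json module), XML files (xmltodict)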

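The previous chapter covered regular expressions in detail. Some readers said: I'm not good with regex, and processing HTML documents with it is tiring; is there another way? Yes: XPath. We can first parse the HTML file into an XML-style document tree, then use XPath to find HTML nodes or elements.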

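2. Understanding XML

XML stands for eXtensible Markup Language.
XML is a markup language, much like HTML.
XML was designed to transport data, not to display it.
XML tags are not predefined; we define them ourselves.
XML was designed to be self-descriptive.
XML is a W3C Recommendation.

W3School documentation: http://www.w3school.com.cn/xml/index.asp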

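(1) Differences between XML and HTML

Data format	Description	Design goal
XML	Extensible Markup Language	Designed to transport and store data; its focus is the content of the data.
HTML	HyperText Markup Language	Displaying data, and how to display it better.
HTML DOM	Document Object Model for HTML	Through the HTML DOM you can access all HTML elements together with the text and attributes they contain. You can modify and delete their content, and also create new elements.

(2) XML document example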

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

  <book category="cooking">
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price>
  </book>  

  <book category="children">
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price>
  </book>  

  <book category="web">
    <title lang="en">XQuery Kick Start</title>  
    <author>James McGovern</author>  
    <author>Per Bothner</author>  
    <author>Kurt Cagle</author>  
    <author>James Linn</author>  
    <author>Vaidyanathan Nagarajan</author>  
    <year>2003</year>  
    <price>49.99</price>
  </book>

  <book category="web" cover="paperback">
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price>
  </book>

</bookstore>

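(3) HTML DOM model example

The HTML DOM defines a standard way to access and manipulate HTML documents, representing an HTML document as a tree structure.

(4) XML node relationships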

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

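Name	Meaning	Example
Parent	Every element and attribute has one parent	The book element is the parent of the title, author, year, and price elements
Children	An element node may have zero, one, or more children	The title, author, year, and price elements are all children of the book element
Sibling	Nodes that share the same parent	The title, author, year, and price elements are all siblings
Ancestor	A node's parent, its parent's parent, and so on	The ancestors of the title element are the book element and the bookstore element
Descendant	A node's children, its children's children, and so on	The descendants of bookstore are the book, title, author, year, and price elements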
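
The relationships in the table can be checked from Python. The following is a minimal sketch that assumes lxml is installed; the variable names are chosen for illustration, and the element methods used (getparent, itersiblings, iterancestors, iterdescendants) are part of lxml's etree API.

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
</bookstore>"""

root = etree.fromstring(xml)          # root is the <bookstore> element
title = root.find("book/title")       # the <title> element

parent = title.getparent().tag                        # parent of title
children = [c.tag for c in title.getparent()]         # children of book
siblings = [s.tag for s in title.itersiblings()]      # following siblings of title
ancestors = [a.tag for a in title.iterancestors()]    # nearest ancestor first
descendants = [d.tag for d in root.iterdescendants()] # everything below bookstore

print(parent, children, siblings, ancestors, descendants, sep="\n")
```

Note that itersiblings() only yields the siblings that follow the element by default, which is why title itself and nothing before it appears in that list.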

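3. Understanding XPath

XPath (XML Path Language) is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document.

W3School documentation: http://www.w3school.com.cn/xpath/index.asp

(1) XPath development tools

Open-source XPath expression editor: XMLQuire (works with XML files)
Chrome extension: XPath Helper
Firefox extension: XPath Checker

Taking the Chrome extension XPath Helper as an example: matched tags get class="xh-highlight" added, which highlights them on the page. Beginners should practice with it. Results are echoed in the black box at the top right, and the number in parentheses after RESULTS is the count of matched targets.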

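(2) Selecting nodes

XPath uses path expressions to select nodes or node sets in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.

The most commonly used path expressions are listed below:

Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes anywhere below the current node, regardless of their position in the document.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.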

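The table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of the bookstore element, wherever they sit beneath bookstore.
//@lang	Selects all attributes named lang.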
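
As a rough illustration, these path expressions can be tried against a shortened version of the bookstore document. This sketch assumes lxml is installed; the shortened XML and the variable names are only for demonstration.

```python
from lxml import etree

xml = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)  # context node: the <bookstore> element

root_tag = root.xpath("/bookstore")[0].tag   # absolute path to the root element
child_books = root.xpath("book")             # book children of the context node
all_books = root.xpath("//book")             # every book in the document
nested_books = root.xpath("/bookstore//book")  # book descendants of bookstore
langs = root.xpath("//@lang")                # attribute values come back as strings
parent_tag = root.xpath("//title/..")[0].tag # .. steps up to the parent

print(root_tag, len(child_books), len(all_books), len(nested_books), langs, parent_tag)
```

Element results are lxml Element objects, while attribute results such as //@lang are returned as plain strings.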

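(3) Predicates

Predicates are used to find a specific node, or a node that contains a specific value. They are written inside square brackets.

The table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of the bookstore element.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements that have a lang attribute with the value eng.
/bookstore/book[price>35.00]	Selects all book elements of the bookstore element whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.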
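
The predicates above can be sketched in Python as follows. This assumes lxml is installed and reuses a shortened bookstore document; note that the sample data uses lang="en", so the attribute test below uses 'en' rather than 'eng'.

```python
from lxml import etree

xml = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
second_to_last = root.xpath("/bookstore/book[last()-1]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")
english_titles = root.xpath("//title[@lang='en']")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first, last, second_to_last, first_two, len(english_titles), expensive, sep="\n")
```

XPath positions start at 1, not 0, so book[1] is the first book.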

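(4) Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any type.

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
html/node()/meta/@*	Selects all attributes of the meta nodes under any child node of html.
//title[@*]	Selects all title elements that have at least one attribute.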
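
A small sketch of the wildcards, again assuming lxml and a shortened bookstore document (the variable names are illustrative):

```python
from lxml import etree

xml = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

child_elements = root.xpath("/bookstore/*")     # element children only
child_nodes = root.xpath("/bookstore/node()")   # elements plus whitespace text nodes
all_elements = root.xpath("//*")                # bookstore + 4 books + 8 leaf elements
book_attrs = root.xpath("//book/@*")            # every attribute value on every book
titles_with_attrs = root.xpath("//title[@*]")   # titles having at least one attribute

print(len(child_elements), len(child_nodes), len(all_elements))
print(book_attrs, len(titles_with_attrs))
```

node() returns more items than * here because the whitespace between the book elements counts as text nodes.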

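(5) Selecting several paths

Using the "|" operator in a path expression lets you select several paths at once.

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of all book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.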
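
A short sketch of the union operator, assuming lxml and a shortened bookstore document. It only checks which elements are selected and how many, without relying on the order of the results.

```python
from lxml import etree

xml = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

both = root.xpath("//book/title | //book/price")
mixed = root.xpath("/bookstore/book/title | //price")

union_tags = {el.tag for el in both}   # which element names were selected
print(len(both), len(mixed), sorted(union_tags))
```

A node matched by both sides of "|" appears only once in the result, since XPath node sets contain no duplicates.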

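(6) XPath operators

The following operators can be used in XPath expressions: | (union of two node sets), + - * div mod (arithmetic), = != < <= > >= (comparison), and or (boolean).

That covers the XPath syntax. To use it when scraping with Python, the HTML first has to be parsed into an XML-style element tree.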
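
To make the operators concrete, here is a minimal sketch using lxml on a shortened bookstore document. The sample data and variable names are assumptions for illustration only.

```python
from lxml import etree

xml = b"""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# Boolean and comparison operators inside predicates
web_and_pricey = root.xpath("//book[price>35 and @category='web']/title/text()")
cheap_or_cooking = root.xpath("//book[price<30 or @category='cooking']/title/text()")
not_web = root.xpath("//book[@category!='web']/title/text()")

# Arithmetic operators: numeric results come back as Python floats
even_books = root.xpath("//book[position() mod 2 = 0]/title/text()")
quotient = root.xpath("6 div 4")
remainder = root.xpath("7 mod 3")

print(web_and_pricey, cheap_or_cooking, not_web, even_books, quotient, remainder, sep="\n")
```

Division is written div rather than / because the slash is already used as the path separator.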

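(7) Summary:

Getting text
- a/text() gets the text directly under a
- a//text() gets the text of a and of all tags beneath it
- //a[text()='下一頁'] selects a tags whose text is exactly 下一頁 ("next page")
The @ symbol
- a/@href gets the href of a. By analogy: img/@src gets the src value of img
- //div[@id="detail-list"] selects the div whose id is detail-list. By analogy: //*[@class="aa"] locates any tag whose class is aa
//
- At the very start of an XPath, it means selecting from anywhere in the current HTML
- li//a means any a tag at any depth under li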
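
The points in this summary can be sketched on a tiny HTML snippet. This assumes lxml is installed; the snippet, the link text "Next page", and the file names are made up for the example.

```python
from lxml import etree

html = etree.HTML("""
<div id="detail-list">
  <ul>
    <li><a href="page1.html">first <b>bold</b> item</a></li>
    <li><a href="page2.html">Next page</a></li>
  </ul>
</div>""")

direct_text = html.xpath("//li[1]/a/text()")    # text nodes directly under <a>
all_text = html.xpath("//li[1]/a//text()")      # text of <a> and its descendants
next_href = html.xpath("//a[text()='Next page']/@href")
detail_divs = html.xpath('//div[@id="detail-list"]')
nested_bold = html.xpath("//li//b")             # <b> at any depth under <li>

print(direct_text, all_text, next_href, len(detail_divs), len(nested_bold), sep="\n")
```

The difference between text() and //text() shows up in the first link: the word inside the b tag is only returned by the double-slash form.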

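4. The lxml library

lxml is an HTML/XML parser whose main job is parsing HTML/XML and extracting data from it.

Like Python's regex engine, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and with the XPath syntax we just learned we can quickly locate specific elements and node information.

lxml Python documentation: http://lxml.de/index.html

It depends on C libraries; install it with pip: pip install lxml (or install it from a wheel)

(1) First steps

We can use it to parse HTML. A simple example: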

# lxml_test.py

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Parse the string into an HTML document with etree.HTML
html = etree.HTML(text)

# Serialize the HTML document (tostring returns bytes)
result = etree.tostring(html)

print(result.decode('utf-8'))

Output:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

lxml automatically repairs the HTML: in this example it not only added the missing closing li tag but also added the body and html tags.
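
The repair behaviour can be checked on an even smaller input. This is a minimal sketch assuming lxml is installed; the broken snippet is invented for the example.

```python
from lxml import etree

broken = "<div><ul><li>one</li><li>two</ul></div>"  # second <li> is never closed

root = etree.HTML(broken)                        # HTML parser tolerates the error
fixed = etree.tostring(root, encoding="unicode")  # serialize to a str, not bytes
items = root.xpath("//li")

print(root.tag)    # the parser wraps the fragment in <html><body>
print(len(items))
print(fixed)
```

Passing encoding="unicode" to tostring is one way to get a string directly instead of decoding the bytes afterwards.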

(2) Reading from a file:

Besides reading a string directly, lxml can also read content from a file. Let's create a file named hello.html:

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

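Then use the etree.parse() method to read the file. Because hello.html is an HTML fragment, we pass an HTML parser explicitly; the default parser of etree.parse() expects XML and would not add the html and body tags.

# lxml_parse.py

from lxml import etree

# Read the external file hello.html with an HTML parser
parser = etree.HTMLParser()
html = etree.parse('./hello.html', parser)
result = etree.tostring(html, pretty_print=True)

print(result.decode('utf-8'))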

The output has the same structure as before (details such as whitespace or an added DOCTYPE line may differ):

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>
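
The choice of parser matters when reading files. The sketch below, which assumes lxml is installed, writes a small broken HTML file to a temporary location (the file content is invented for the example) and parses it with both the HTML parser and the default XML parser.

```python
import os
import tempfile
from lxml import etree

content = ('<div><ul><li class="item-0"><a href="link1.html">first item</a></li>'
           '<li>unclosed</ul></div>')

# Write the sample to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as f:
    f.write(content)
    path = f.name

try:
    # The HTML parser repairs the markup and wraps it in <html><body>
    tree = etree.parse(path, etree.HTMLParser())
    root_tag = tree.getroot().tag
    items = tree.xpath("//li")

    # The default XML parser is strict and rejects the unclosed <li>
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError:
        xml_ok = False
finally:
    os.remove(path)

print(root_tag, len(items), xml_ok)
```

For pages fetched from real websites, which are rarely well-formed XML, the HTML parser is usually the safer choice.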

5. XPath examples

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

(1) Get all `<li>` tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
print(type(html))  # show the type returned by etree.parse()

result = html.xpath('//li')

print(result)      # print the list of <li> elements
print(len(result))
print(type(result))
print(type(result[0]))

Output:

<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>

(2) Continue: get attributes and elements from `hello.html`

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result1 = html.xpath('//li/@class')                # get the class attribute of every <li> tag
result2 = html.xpath('//li/a[@href="link1.html"]') # get the <a> tag under <li> whose href is link1.html
result3 = html.xpath('//li//span')                 # get all <span> tags under <li> (/ selects child elements only, and <span> is not a direct child of <li>, so a double slash is needed)
result4 = html.xpath('//li/a//@class')             # get every class attribute inside the <a> tags under <li>
result5 = html.xpath('//li[last()]/a/@href')       # get the href of the <a> in the last <li>
result6 = html.xpath('//li[last()-1]/a')           # get the <a> in the second-to-last <li>
result7 = html.xpath('//*[@class="bold"]')         # get the tag whose class is bold

print(result1)
print(result2)
print(result3)
print(result4)
print(result5)
print(result6[0].text)   # text content of that <a>
print(result7[0].tag)    # tag name of the matched element

Output:

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
[<Element a at 0x10ffaae18>]
[<Element span at 0x10d698e18>]
['bold']
['link5.html']
fourth item
span
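
One detail worth seeing in code is the difference between an element's .text and its full text content. The sketch below assumes lxml is installed and parses the hello.html markup from a string instead of a file, so it runs on its own.

```python
from lxml import etree

page = etree.HTML('''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
''')

links = page.xpath("//li/a")                       # a list of Element objects
hrefs = [a.get("href") for a in links]             # read an attribute from each element
third_text = links[2].text                         # None: the text lives inside <span>
full_texts = [a.xpath("string(.)") for a in links] # text including descendants

print(hrefs)
print(third_text)
print(full_texts)
```

When a tag may contain nested markup, string(.) or //text() is more robust than .text.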

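6. A crawler using XPath

Now let's build a simple crawler with XPath: we will try to crawl all the threads in a Tieba forum and save locally the URLs of the images posted on each floor of every thread.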

#coding=utf-8
import requests
from lxml import etree
import json

class Tieba:

    def __init__(self,tieba_name):
        self.tieba_name = tieba_name # name of the Tieba forum to crawl
        # use a mobile User-Agent
        self.headers = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1"}

    def get_total_url_list(self):
        '''Build the list of all list-page URLs'''
        url = "https://tieba.baidu.com/f?kw="+self.tieba_name+"&ie=utf-8&pn={}&"
        url_list = []
        for i in range(100): # build 100 URLs in a loop (50 threads per page)
            url_list.append(url.format(i*50))
        return url_list # return the list of 100 URLs

    def parse_url(self,url):
        '''Send a request, get the response, and parse the HTML with etree'''
        print("parsing url:",url)
        response = requests.get(url,headers=self.headers,timeout=10) # send the request
        html = response.content.decode() # decode the response body into an HTML string
        html = etree.HTML(html) # parse the string into an Element
        return html

    def get_title_href(self,url):
        '''Get the title and href of every thread on one page'''
        html = self.parse_url(url)  # an Element, which has an xpath method
        li_temp_list = html.xpath("//li[@class='tl_shadow']") # group by <li> tag
        total_items = []
        for i in li_temp_list: # iterate over the groups
            hrefs = i.xpath("./a/@href")
            href = hrefs[0] if len(hrefs)>0 else None
            if href is not None and not href.startswith("https:"):
                href = "https:"+href # complete a protocol-relative link
            text = i.xpath("./a/div[1]/span[1]/text()")
            text = text[0] if len(text)>0 else None
            item = dict(  # collect the fields in a dict
                href = href,
                text = text
            )
            total_items.append(item)
        return total_items # return all items on one page

    def get_img(self,url):
        '''Get all image URLs in one thread'''
        html = self.parse_url(url) # an Element, which has an xpath method
        img_list = html.xpath('//div[@data-class="BDE_Image"]/@data-url')
        img_list = [i.split("src=")[-1] for i in img_list] # keep the part after "src=", i.e. the image URL
        img_list = [requests.utils.unquote(i) for i in img_list] # URL-decode
        return img_list

    def save_item(self,item):
        '''Save one item'''
        # write as UTF-8 because ensure_ascii=False keeps non-ASCII characters
        with open("teibatupian.txt","a",encoding="utf-8") as f:
            f.write(json.dumps(item,ensure_ascii=False,indent=2))
            f.write("\n")

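    def run(self):
        # 1. Work out the URL pattern and build the URL list
        url_list = self.get_total_url_list()
        for url in url_list:
            # 2. Iterate over the URL list: send the request, get the response, parse the HTML with etree
            # 3. Extract title and href
            total_item = self.get_title_href(url)
            for item in total_item:
                href = item["href"]
                if href is None: # skip entries without a link
                    continue
                img_list = self.get_img(href) # list of image URLs in the thread
                item["img"] = img_list
                # 4. Save locally
                print(item)
                self.save_item(item)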

if __name__ == "__main__":
    tieba = Tieba("CSDN")
    tieba.run()

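Basic idea: once the target is chosen, the run method starts. get_total_url_list builds the URL of every list page in a loop; parse_url fetches each page and hands the HTML to lxml through etree; get_title_href and get_img then extract the useful data with XPath; finally, save_item stores the data.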
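
The extraction logic can be tried offline, without sending any requests. The sketch below assumes lxml is installed; the sample markup, the example.com URLs, and the simplified title path (./a/span) are made up for illustration and are not the real Tieba page structure.

```python
from urllib.parse import unquote
from lxml import etree

# Hypothetical markup that imitates a list page and one image block
sample = '''
<ul>
  <li class="tl_shadow"><a href="//tieba.baidu.com/p/1"><span>First thread</span></a></li>
  <li class="tl_shadow"><a><span>No link</span></a></li>
  <li class="other"><a href="//tieba.baidu.com/p/3"><span>Ignored</span></a></li>
</ul>
<div data-class="BDE_Image" data-url="https://example.com/view?src=https%3A%2F%2Fexample.com%2Fa.jpg"></div>
'''
root = etree.HTML(sample)

# Step 1: build list-page URLs by filling the pn parameter in a loop
template = "https://tieba.baidu.com/f?kw=CSDN&ie=utf-8&pn={}&"
page_urls = [template.format(i * 50) for i in range(3)]

# Step 2: group by <li>, then use relative XPath (starting with ./) inside each group
items = []
for li in root.xpath("//li[@class='tl_shadow']"):
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/span/text()")
    items.append({
        "href": ("https:" + hrefs[0]) if hrefs else None,
        "text": titles[0] if titles else None,
    })

# Step 3: take the part after "src=" and URL-decode it
data_urls = root.xpath('//div[@data-class="BDE_Image"]/@data-url')
images = [unquote(u.split("src=")[-1]) for u in data_urls]

print(page_urls)
print(items)
print(images)
```

Grouping first and then querying each group with a relative path keeps the title and link of one thread together, even when a thread is missing one of the two.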

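A crawler has four main steps:

Define the target (know which scope or website you are going to search)
Crawl (download all the content of the site)
Extract (discard the data that is of no use to us)
Process the data (store and use it the way we want)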

Step	Crawler step	Corresponding operation
1	Define the target	Tieba(tieba_name)
2	Crawl	get_total_url_list; parse_url
3	Extract	get_title_href; get_img
4	Process the data	save_item

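7. CSS selectors: BeautifulSoup4

Besides lxml, Beautiful Soup is also an HTML/XML parser whose main job is likewise parsing HTML/XML and extracting data. lxml only traverses locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory costs are much higher and its performance is lower than lxml's. For that reason we won't go into detail here.

Beautiful Soup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's XML parser.

Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Tool	Speed	Ease of use	Installation
Regex	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate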

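Upcoming posts:

[Python Crawler] 7. Structured Data Extraction with JSON and JsonPATH
[Python Crawler] 8. Handling Dynamic HTML with Selenium and PhantomJS
[Python Crawler] 9. Machine Image Recognition: Machine Vision and Tesseract
[Python Crawler] 10. Machine Image Recognition: Text and CAPTCHA Recognition
[Python Crawler] 11. The Scrapy Framework

If you have any questions or suggestions, I look forward to your comments!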