python中lxml模塊的使用

lxml是python的一個解析庫,支持HTML和XML的解析,支持XPath解析方式,而且解析效率非常高

1.lxml的安裝

pip install lxml

2.導入lxml 的 etree 庫

 from lxml import etree

3.利用etree.HTML,將字符串轉化爲Element對象,Element對象具有xpath的方法,返回結果的列表,能夠接受bytes類型的數據和str類型的數據。

from lxml import etree
html = etree.HTML(response.text) 
ret_list = html.xpath("xpath字符串")

也可以這樣使用:

from lxml import etree
htmlDiv = etree.HTML(response.content.decode())
hrefs = htmlDiv.xpath("//h4//a/@href")

4.把轉化後的element對象轉化爲字符串,返回bytes類型,etree.tostring(element)

假設我們現有如下的html字符換,嘗試對他進行操作:

<div> <ul> 
<li class="item-1"><a href="link1.html">first item</a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> # 注意,此處缺少一個 </li> 閉合標籤 
</ul> </div>

代碼示例:

from lxml import etree
text = ''' <div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''

html = etree.HTML(text)
print(type(html)) 

handeled_html_str = etree.tostring(html).decode()
print(handeled_html_str)

輸出結果:

<class 'lxml.etree._Element'>
<html><body><div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </li></ul> </div> </body></html>

可以發現,lxml確實能夠把確實的標籤補充完成,但是請注意lxml是人寫的,很多時候由於網頁不夠規範,或者是lxml的bug。
即使參考url地址對應的響應去提取數據,任然獲取不到,這個時候我們需要使用etree.tostring的方法,觀察etree到底把html轉化成了什麼樣子,即根據轉化後的html字符串去進行數據的提取。

5.lxml的深入練習

from lxml import etree
text = ''' <div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''

html = etree.HTML(text)

#獲取href的列表和title的列表
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")

#組裝成字典
for href in href_list:
    item = {}
    item["href"] = href
    item["title"] = title_list[href_list.index(href)]
    print(item)

輸出爲:

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

6.lxml模塊的進階使用

返回的是element對象,可以繼續使用xpath方法,對此我們可以在後面的數據提取過程中:先根據某個標籤進行分組,分組之後再示例如下:

from lxml import etree
text = ''' <div> <ul> 
        <li class="item-1"><a>first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''

html = etree.HTML(text)

li_list = html.xpath("//li[@class='item-1']")
print(li_list)

結果爲:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]數據的提取

可以發現結果是一個element對象,這個對象能夠繼續使用xpath方法
先根據li標籤進行分組,之後再進行數據的提取

from lxml import etree
text = ''' <div> <ul> 
        <li class="item-1"><a>first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''

#根據li標籤進行分組
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")

#在每一組中繼續進行數據的提取
for li in li_list:
    item = {}
    item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href"))>0 else None
    item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()"))>0 else None
    print(item)

結果是:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

7.案列:貼吧極速版:

在這裏插入圖片描述
代碼如下:

import requests
from lxml import etree


class TieBaSpider:
    def __init__(self, tieba_name):
        #1. start_url
        self.start_url= "http://tieba.baidu.com/mo/q---C9E0BC1BC80AA0A7CE472600CDE9E9E3%3AFG%3D1-sz%40320_240%2C-1-3-0--2--wapp_1525330549279_782/m?kw={}&lp=6024".format(tieba_name)
        self.headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36"}
        self.part_url = "http://tieba.baidu.com/mo/q---C9E0BC1BC80AA0A7CE472600CDE9E9E3%3AFG%3D1-sz%40320_240%2C-1-3-0--2--wapp_1525330549279_782"

    def parse_url(self,url): #發送請求,獲取響應
        # print(url)
        response = requests.get(url,headers=self.headers)
        return response.content

    def get_content_list(self,html_str): #3. 提取數據
        html = etree.HTML(html_str)
        div_list = html.xpath("//body/div/div[contains(@class,'i')]")
        content_list = []
        for div in div_list:
            item = {}
            item["href"] = self.part_url+div.xpath("./a/@href")[0]
            item["title"] = div.xpath("./a/text()")[0]
            item["img_list"] = self.get_img_list(item["href"], [])
            content_list.append(item)

        #提取下一頁的url地址
        next_url = html.xpath("//a[text()='下一頁']/@href")
        next_url = self.part_url + next_url[0] if len(next_url)>0 else None
        return content_list, next_url

    def get_img_list(self,detail_url, img_list):
        #1. 發送請求,獲取響應
        detail_html_str = self.parse_url(detail_url)
        #2. 提取數據
        detail_html = etree.HTML(detail_html_str)
        img_list += detail_html.xpath("//img[@class='BDE_Image']/@src")

        #詳情頁下一頁的url地址
        next_url = detail_html.xpath("//a[text()='下一頁']/@href")
        next_url = self.part_url + next_url[0] if len(next_url)>0 else None
        if next_url is not None: #當存在詳情頁的下一頁,請求
            return self.get_img_list(next_url, img_list)

        #else不用寫
        img_list = [requests.utils.unquote(i).split("src=")[-1] for i in img_list]
        return img_list

    def save_content_list(self,content_list):#保存數據
        for content in content_list:
            print(content)

    def run(self): #實現主要邏輯
        next_url = self.start_url
        while next_url is not None:
            #1. start_url
            #2. 發送請求,獲取響應
            html_str = self.parse_url(next_url)
            #3. 提取數據
            content_list, next_url = self.get_content_list(html_str)
            #4. 保存
            self.save_content_list(content_list)
            #5. 獲取next_url,循環2-5


if __name__ == '__main__':

    tieba = TieBaSpider("每日中國")
    tieba.run()
發佈了140 篇原創文章 · 獲贊 56 · 訪問量 4萬+
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章