HTML Parsing: XPath

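When HTML content is returned to the browser, the browser parses it and renders it.

HTML (HyperText Markup Language) was designed to go beyond plain text and make text more expressive.
XML (eXtensible Markup Language) was not created to replace HTML. HTML's design mixes in a lot of formatting and so takes on tasks beyond the data itself, which is why XML was designed to describe data only.

Both HTML and XML are structured: their tags nest to form a tree. The DOM (Document Object Model) is used to parse this nested tree structure. Browsers generally provide an API for DOM operations, so the DOM can be manipulated in an object-oriented way.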
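The object-oriented style of DOM access can be tried outside a browser as well. The following is a minimal sketch that assumes Python's standard-library `xml.dom.minidom` as a stand-in for a browser DOM API; the one-line document is a cut-down, illustrative version of a bookstore record.

```python
from xml.dom import minidom

# A tiny document; parseString builds a DOM tree of node objects from it.
doc = minidom.parseString(
    "<bookstore><book id='bk101'><title>XML Developer's Guide</title></book></bookstore>"
)

root = doc.documentElement                      # the <bookstore> element object
book = root.firstChild                          # its first child node: <book>
title = book.getElementsByTagName("title")[0]   # first <title> below <book>

print(root.tagName)                 # element name
print(book.getAttribute("id"))      # attribute value
print(title.firstChild.data)        # text node inside <title>
```

Each tag becomes an object with properties such as `tagName` and `firstChild`, which is what "operating on the DOM in an object-oriented way" amounts to.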

XPath

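Chinese tutorial: http://www.w3school.com.cn/xpath/index.asp
XPath is a language for finding information in an XML document. It can be used to navigate through the elements and attributes of an XML document.
Test tool: XMLQuire (on Windows 7 and later it requires .NET Framework 4.0-4.5).

Testing XML and XPath

Test document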

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications 
    with XML.</description>
</book>
<book id="bk102" class="bookinfo even">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies, 
    an evil sorceress, and her own childhood to become queen 
    of the world.</description>
</book>
<book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology 
    society in England, the young survivors lay the 
    foundation for a new society.</description>
</book>
<book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious 
    agent known only as Oberon helps to create a new life 
    for the inhabitants of London. Sequel to Maeve 
    Ascendant.</description>
</book>
</bookstore>

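Nodes

XPath has seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment, and document (root) nodes.
1. / is the root node
2. <bookstore> is an element node
3. <author>Corets, Eva</author> is an element node
4. id="bk104" is an attribute node; id is an attribute of the element node book
Nesting between nodes forms parent/child relationships.
Nodes that share the same parent are siblings.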
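
The parent and sibling relationships can be inspected directly. This is a minimal sketch assuming Python's lxml library (covered in the lxml section of this article); the one-line document is a cut-down, illustrative version of the bk103 record.

```python
from lxml import etree

root = etree.fromstring(
    b"<bookstore><book id='bk103'><author>Corets, Eva</author>"
    b"<title>Maeve Ascendant</title></book></bookstore>"
)

author = root.xpath("//author")[0]   # an element node
parent = author.getparent()          # its parent: <book>
sibling = author.getnext()           # next sibling with the same parent: <title>
book_id = parent.xpath("@id")        # the attribute node of <book>

print(parent.tag, sibling.tag)
print(book_id)
print(author.text)                   # content of the text node inside <author>
```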

Node selection

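Operator or expression	Meaning
`/`	Start from the root node; also separates the steps of a path
`//`	Search at any depth: from the document root when it starts a path, otherwise below the nodes matched by the preceding step
`.`	The current node
`..`	The parent of the current node
`@`	Select an attribute
`node name`	Select all child nodes of the current node that have this name
`*`	Match any element node
`@*`	Match any attribute node
`node()`	Match a node of any type
`text()`	Match text nodes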
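
The operators in the table can be tried one at a time. The sketch below assumes Python's lxml library (covered in the lxml section of this article) and a shortened copy of the test document that keeps only the id, class, title and price of each book; the variable names are illustrative.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
<book id="bk104"><title>Oberon's Legacy</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)
book = root.xpath("/bookstore/book")[0]     # "/" : path from the root

all_titles = root.xpath("//title")          # "//": any depth
all_ids = root.xpath("//@id")               # "@" : attributes
current = book.xpath(".")[0]                # "." : the current node
parent = book.xpath("..")[0]                # "..": the parent node
children = book.xpath("*")                  # "*" : child elements only
any_attr = book.xpath("@*")                 # "@*": every attribute of <book>
every_node = root.xpath("node()")           # node(): elements AND whitespace text
title_text = book.xpath("title/text()")     # text(): text nodes under <title>

print(len(all_titles), all_ids)
print(current.get("id"), parent.tag)
print([e.tag for e in children], any_attr, title_text)
print(len(root.xpath("*")), len(every_node))
```

Note that `node()` returns more items than `*` for the root element, because the line breaks between the book elements are text nodes.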

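Predicates

A predicate finds a specific node, or a node that contains a specified value.
Predicates are written inside square brackets.
A predicate is a query condition.
In other words, the condition is given in square brackets at a step of the path.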
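
A few predicates in action, as a minimal sketch. It assumes Python's lxml library (covered in the lxml section of this article) and a shortened copy of the test document; the variable names are illustrative.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
<book id="bk104"><title>Oberon's Legacy</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

# Condition on an attribute value
by_id = root.xpath("//book[@id='bk102']/title/text()")
# Condition on the value of a child element
expensive = root.xpath("//book[price>40]/@id")
# Condition on position (positions start at 1)
first = root.xpath("/bookstore/book[1]/@id")
last = root.xpath("/bookstore/book[last()]/@id")

print(by_id, expensive, first, last)
```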

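XPath Axes

An axis defines a node set relative to the current node.

Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, etc.) of the current node
ancestor-or-self	Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself
attribute	Selects all attributes of the current node. @id is equivalent to attribute::id
child	Selects all children of the current node. title is equivalent to child::title
descendant	Selects all descendants (children, grandchildren, etc.) of the current node
descendant-or-self	Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself
following	Selects everything in the document after the closing tag of the current node
following-sibling	Selects all siblings after the current node
namespace	Selects all namespace nodes of the current node
parent	Selects the parent of the current node
preceding	Selects all nodes in the document that come before the start tag of the current node, excluding its ancestors
preceding-sibling	Selects all siblings before the current node
self	Selects the current node. Equivalent to self::node()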
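
Several axes evaluated from one context node, as a minimal sketch. It assumes Python's lxml library (covered in the lxml section of this article) and a shortened copy of the test document; the context node is the bk102 book, and the variable names are illustrative.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
<book id="bk104"><title>Oberon's Legacy</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)
book2 = root.xpath("//book[@id='bk102']")[0]    # context node
title2 = book2.xpath("child::title")[0]

before = book2.xpath("preceding-sibling::book/@id")
after = book2.xpath("following-sibling::book/@id")
attrs = book2.xpath("attribute::*")
descendants = [e.tag for e in book2.xpath("descendant::*")]
ancestors = [e.tag for e in title2.xpath("ancestor::*")]
earlier_titles = book2.xpath("preceding::title/text()")
itself = book2.xpath("self::book/@id")

print(before, after)
print(attrs, descendants)
print(ancestors, earlier_titles, itself)
```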

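Step

Step syntax: axisname::nodetest[predicate]

Example	Result
`child::book`	Selects all book nodes that are children of the current node
`attribute::lang`	Selects the lang attribute of the current node
`child::*`	Selects all child elements of the current node
`attribute::*`	Selects all attributes of the current node
`child::text()`	Selects all text child nodes of the current node
`child::node()`	Selects all child nodes of the current node
`descendant::book`	Selects all book descendants of the current node
`ancestor::book`	Selects all book ancestors of the current node
`ancestor-or-self::book`	Selects all book ancestors of the current node, and the current node as well if it is a book node
`child::*/child::price`	Selects all price grandchildren of the current node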
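
The full step syntax and the abbreviated syntax select the same nodes. The sketch below compares the two forms; it assumes Python's lxml library (covered in the lxml section of this article) and a shortened copy of the test document, with illustrative variable names.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
<book id="bk104"><title>Oberon's Legacy</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)   # the current node is <bookstore>

# axis::nodetest, written out in full and then abbreviated
ids_full = root.xpath("child::book/attribute::id")
ids_short = root.xpath("book/@id")

# grandchildren: child::*/child::price
prices_full = root.xpath("child::*/child::price/child::text()")
prices_short = root.xpath("*/price/text()")

# axis::nodetest[predicate]
title_full = root.xpath("child::book[attribute::id='bk103']/child::title/child::text()")
title_short = root.xpath("book[@id='bk103']/title/text()")

print(ids_full == ids_short, prices_full == prices_short, title_full == title_short)
print(title_full)
```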

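XPath examples

A path that starts with a slash is an absolute path: it starts from the root.
A path that does not start with a slash is a relative path and is generally evaluated against the current node. The current node depends on the context, and it may well no longer be the root node.
For convenience, when the XML is deeply nested, // is commonly used to find nodes.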
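
The difference between absolute and relative paths shows up as soon as the context node is not the root. A minimal sketch, assuming Python's lxml library (covered in the lxml section of this article) and a shortened copy of the test document; the variable names are illustrative.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
<book id="bk104"><title>Oberon's Legacy</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)
book = root.xpath("/bookstore/book[2]")[0]    # context node: the second book

relative = book.xpath("title/text()")         # relative to <book>
absolute = book.xpath("/bookstore/book/title/text()")   # ignores the context node
anywhere = book.xpath("//title")              # "//" at the start: whole document
below = book.xpath(".//title")                # ".//": any depth below <book> only
missing = root.xpath("/book")                 # the root element is bookstore, not book

print(relative)
print(len(absolute), len(anywhere), len(below), missing)
```

A leading `//` searches the whole document even when it is evaluated on a nested node; `.//` keeps the search below the current node.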

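Path expression	Meaning
`title`	All title child nodes of the current node
`/book`	Looks for a book child of the root node; nothing is found, because the root element is bookstore
`book/title`	title nodes under all book children of the current node
`//title`	title nodes at any depth below the root node
`book//title`	title nodes at any depth below all book children of the current node
`//@id`	id attributes at any depth; the result is the attributes themselves
`//book[@id]`	book nodes at any depth that have an id attribute
`//*[@id]`	Nodes at any depth that have an id attribute
`//book[@id="bk102"]`	book nodes at any depth whose id attribute is bk102
`/bookstore/book[1]`	The first book node under the root element bookstore; positions start at 1
`/bookstore/book[1]/@id`	The id attribute of the first book node under the root element bookstore
`/bookstore/book[last()-1]`	The second-to-last book node under the root element bookstore; the function last() returns the index of the last element
`/bookstore/*`	All child elements of the root element bookstore, not recursive
`//*`	All element nodes in the document
`//*[@*]`	All nodes that have at least one attribute
`//book/title | //price`	title children of book nodes at any depth, or price nodes at any depth
`//book[position()=2]`	book nodes, taking the second one
`//book[position()<last()-1]`	book nodes whose position comes before the second-to-last
`//book[price>40]`	book nodes whose price child has a value greater than 40
`//book[2]/node()`	All child nodes, of any type, of the book node at position 2
`//book[1]/text()`	All text child nodes of the first book node
`//*[local-name()="book"]`	All nodes whose name, without the namespace prefix, is book. The local-name function returns the name without the prefix, so this amounts to selecting nodes by tag name
The following three expressions are equivalent: `//book[price<6]/price` `//book/price[text()<6]` `//book/child::node()[local-name()="price" and text()<6]`	price nodes under book nodes whose content is less than 6
`//book//*[self::title or self::price]` is equivalent to `//book//title | //book//price` and also to `//book//*[local-name()="title" or local-name()="price"]`	Descendants of all book nodes that are title or price nodes
`//*[@class]`	All nodes that have a class attribute
`//*[@class="bookinfo even"]`	All nodes whose class attribute is exactly "bookinfo even"
`//*[contains(@class,'even')]`	All nodes whose class attribute contains the string even
`//*[contains(local-name(),'book')]`	Nodes whose tag name contains book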
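
Some of the expressions from the table, run against the test data. This is a minimal sketch assuming Python's lxml library (covered in the lxml section of this article) and a shortened copy of the test document; the variable names are illustrative.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
<book id="bk104"><title>Oberon's Legacy</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

second = root.xpath("//book[position()=2]/@id")
early = root.xpath("//book[position()<last()-1]/@id")      # positions 1 and 2 of 4
cheap = root.xpath("//book[price<6]/price/text()")
even = root.xpath("//*[contains(@class,'even')]/@id")
with_attrs = [e.get("id") for e in root.xpath("//*[@*]")]
union = [e.tag for e in root.xpath("//book[1]/title | //book[1]/price")]
title_or_price = root.xpath("//book//*[self::title or self::price]")

print(second, early, cheap, even)
print(with_attrs, union, len(title_or_price))
```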

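Function summary

Function	Meaning
`local-name()`	Returns the name without the namespace prefix; used to select by tag name
`text()`	The text content between tags (strictly a node test that matches text nodes)
`node()`	All nodes (strictly a node test that matches any node type)
`contains(@class,str)`	True if the class attribute contains str
`starts-with(local-name(),"book")`	True if the name starts with book
`last()`	Index of the last element
`position()`	Index of the current element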
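
The functions above can be combined inside predicates. The sketch below assumes Python's lxml library (covered in the lxml section of this article) and a shortened copy of the test document. The second, namespaced document and its `urn:example` namespace are hypothetical, added only to show why `local-name()` is useful.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
<book id="bk104"><title>Oberon's Legacy</title><price>5.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

# starts-with + local-name: matches <bookstore> and every <book>
book_like = [e.tag for e in root.xpath("//*[starts-with(local-name(),'book')]")]
# position + last: everything between the first and the last book
middle = root.xpath("//book[position()>1 and position()<last()]/@id")
# last: the final book
final_title = root.xpath("//book[last()]/title/text()")

# Hypothetical namespaced document: a plain name test finds nothing,
# while local-name() ignores the prefix and the namespace.
ns_root = etree.fromstring(b"<b:shop xmlns:b='urn:example'><b:book/></b:shop>")
plain = ns_root.xpath("//book")
by_local_name = ns_root.xpath("//*[local-name()='book']")

print(book_like, middle, final_title)
print(len(plain), len(by_local_name))
```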

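lxml

lxml is a feature-rich XML and HTML parsing library for Python with very good performance. It wraps libxml2 and libxslt.
At the time of writing, the latest version supported Python 2.6+ and, on Python 3, version 3.6.

Building it from source on CentOS requires:

# yum install libxml2-devel libxslt-devel

Note that the requirements differ between platforms; see https://lxml.de/installation.html
Install lxml: $ pip install lxml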
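
After installation, a quick check is to import the library and print the versions it was built against. This is a small sketch, not part of the original instructions; it relies on the version tuples that lxml.etree exposes.

```python
from lxml import etree

# Each of these is a tuple of integers
print("lxml:", etree.LXML_VERSION)
print("libxml2:", etree.LIBXML_VERSION)
print("libxslt:", etree.LIBXSLT_VERSION)
```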

from lxml import etree

# Build an HTML tree with etree
root = etree.Element("html")
print(type(root))
print(root.tag)

body = etree.Element("body")
root.append(body)
print(etree.tostring(root))

# Add child nodes
sub = etree.SubElement(body, "child1")
print(type(sub))
# append() returns None, so keep the new element in a variable before appending to it
sub = etree.SubElement(body, "child2")
sub.append(etree.Element("child21"))
html = etree.tostring(root, pretty_print=True).decode()
print(html)
print("- " * 30)

r = etree.HTML(html)  # returns the root node
print(r.tag)
print(r.xpath("//*[contains(local-name(),'child')]"))

etree also provides two useful functions:
etree.HTML(text) parses an HTML document and returns the root node
anode.xpath('xpath path') evaluates an XPath expression against a node
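
The two calls are usually used together. The following minimal sketch parses a small HTML fragment and queries it; the fragment and its class name are made up for illustration.

```python
from lxml import etree

# A fragment without <html> or <body>; the HTML parser adds them.
snippet = (
    "<ul>"
    "<li class='item'><a href='/a'>First</a></li>"
    "<li class='item'><a href='/b'>Second</a></li>"
    "</ul>"
)

doc = etree.HTML(snippet)           # returns the root <html> element
texts = doc.xpath("//li[@class='item']/a/text()")
links = doc.xpath("//a/@href")
top = doc.xpath("//body/*")[0]      # first element placed inside <body>

print(doc.tag)
print(texts, links)
print(top.tag)
```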

Exercise: scrape the word-of-mouth chart
1. Get the "Weekly Word-of-Mouth Chart" (本週口碑榜) from Douban Movies

from lxml import etree
import requests

url = "https://movie.douban.com/"
ua = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36"

with requests.get(url, headers={"User-agent": ua}) as response:
    if response.status_code == 200:
        content = response.text  # HTML content
        html = etree.HTML(content)  # parse the HTML and return the DOM root node
        titles = html.xpath("//div[@class='billboard-bd']//tr/td/a/text()")  # returns a list of text strings
        for i in titles:  # Douban Movies weekly chart
            print(i)
    else:
        print("Request failed")
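
The XPath expression can be checked offline before any request is sent. The sketch below runs the same expression against a hand-written snippet. The markup, the movie names and the `extract_titles` helper are all hypothetical; the real page structure is an assumption and may differ or change over time.

```python
from lxml import etree

# Hypothetical markup that mimics the structure the XPath expects.
SAMPLE = """
<div class="billboard-bd">
  <table>
    <tr><td class="order">1</td><td class="title"><a href="#">Movie A</a></td></tr>
    <tr><td class="order">2</td><td class="title"><a href="#">Movie B</a></td></tr>
  </table>
</div>
"""

def extract_titles(page_source):
    """Return the link texts found in the billboard table (illustrative helper)."""
    doc = etree.HTML(page_source)
    found = doc.xpath("//div[@class='billboard-bd']//tr/td/a/text()")
    return [text.strip() for text in found]

titles = extract_titles(SAMPLE)
print(titles)
```

Keeping the extraction in a function that takes the page source makes it possible to test the XPath without network access.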
