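XPath (XML Path Language) is a language for locating information in XML documents. It was originally designed for searching XML, but it works just as well on HTML documents.
Official documentation: https://www.w3.org/TR/xpath/
Common XPath rules: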
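nodename selects the child nodes of the current node that are named nodename
/ selects direct children of the current node (at the start of an expression, the path starts from the document root)
// selects descendants of the current node at any depth
. selects the current node
.. selects the parent of the current node
@ selects attributes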
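To make the rules above concrete, here is a minimal sketch that applies each one to a tiny document. The shop/shelf/book markup is made up for this illustration and is not part of the tutorial's sample file.

```python
from lxml import etree

# Hypothetical sample markup, invented only to illustrate the rules.
doc = etree.fromstring(
    '<shop><shelf id="s1">'
    '<book lang="en">A</book><book lang="zh">B</book>'
    '</shelf></shop>'
)

shelf = doc.xpath('/shop/shelf')[0]   # "/" walks direct children, starting at the root
print(doc.xpath('//book/text()'))     # "//" finds book nodes at any depth
print(shelf.xpath('book/@lang'))      # "nodename" picks children by name, "@" reads an attribute
print(shelf.xpath('.')[0].tag)        # "." is the current node (shelf)
print(shelf.xpath('..')[0].tag)       # ".." is the parent node (shop)
```

Note that xpath() always returns a list, even when only one node matches, which is why the sketch indexes with [0].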
Introductory example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree
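def test1():
    # Note: the last <li> has no closing tag. etree.HTML() repairs such markup
    # and wraps the fragment in <html> and <body>, as the printed output shows.
    content = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''
    # etree.HTML() parses the string and returns the root element, which can be queried with XPath
    html = etree.HTML(content)
    result = etree.tostring(html)  # serialize the repaired tree back to bytes
    print(result.decode('utf-8'))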
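def test2():
    # Parse a file instead of a string. test.html is assumed to hold the same
    # markup as `content` in test1().
    html = etree.parse('test.html', etree.HTMLParser())
    result = etree.tostring(html)
    print(result.decode('utf-8'))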
# Detailed walkthrough starts here ------------------------------------------
# All nodes: an XPath expression that starts with // selects every matching node in the document
def demo1():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//*')  # * matches every element node
    print(result)
# Only the li nodes
# To select all li nodes, write // followed by the node name and pass the expression to xpath()
def demo2():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//li')
    print(result)
# Child nodes: get the a nodes that are direct children of li nodes
def demo3():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//li/a')  # / selects direct children only
    print(result)
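# Example: find every a node that is a descendant of a ul node
def demo4():
    html = etree.parse('test.html', etree.HTMLParser())
    # // is required here: a is not a direct child of ul, so '//ul/a' would match nothing
    result = html.xpath('//ul//a')
    print(result)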
# Parent node: .. selects the parent of the current node
# Example: get the class attribute of the parent of the a node whose href is link4.html
def demo5():
    # Method 1: the .. shorthand
    # html = etree.parse('test.html', etree.HTMLParser())
    # result = html.xpath("//a[@href='link4.html']/../@class")
    # Method 2: the parent axis
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
    print(result)
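# Attribute matching: @ inside square brackets filters nodes by attribute
def demo6():
    # Note: demo5 retrieved an attribute value with /@class.
    # Here [@class='item-0'] is a filter, so the result is the matching li nodes themselves.
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath("//li[@class='item-0']")
    print(result)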
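# Getting text: the text() node test in XPath selects the text inside a node
# Example: get the text under li nodes
def demo7():
    html = etree.parse('test.html', etree.HTMLParser())
    # /text() only sees text that is a direct child of li, not the text inside the nested a node:
    # result = html.xpath('//li[@class="item-0"]/text()')
    # print(result)
    # To get the text of the a nodes instead:
    # Method 1: step down to a first, then take its text
    result = html.xpath('//li[@class="item-0"]/a/text()')
    print(result)
    # Method 2: //text() collects the text of all descendants,
    # which may also include whitespace-only strings such as newlines
    result = html.xpath('//li[@class="item-0"]//text()')
    print(result)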
# Getting attributes
# Example: get the href attribute of every a node under an li node
def demo8():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//li/a/@href')
    print(result)
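# Matching attributes with multiple values
# When an attribute holds several values, use the contains() function to match one of them
# Syntax: contains(@attribute_name, "value")
def demo9():
    text = '''
    <li class="li li-first"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    # contains() is a substring test, so "li" would also match inside "li-first"
    result = html.xpath('//li[contains(@class,"li")]/a/text()')
    print(result)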
# Matching on multiple attributes
# When a node can only be identified by several attributes together, join the conditions with the and operator
def demo10():
    text = '''
    <li class="li li-first" name="item"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
    print(result)
# Selecting by position (XPath positions start at 1, not 0)
def demo11():
    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''
    html = etree.HTML(text)
    result = html.xpath('//li[1]/a/text()')  # the first li node
    print(result)
    result = html.xpath('//li[last()]/a/text()')  # the last li node
    print(result)
    result = html.xpath('//li[position()<3]/a/text()')  # position less than 3, i.e. the first two li nodes
    print(result)
    result = html.xpath('//li[last()-2]/a/text()')  # last() is the final position, so last()-2 is the third from the end
    print(result)
# Selecting along node axes
def demo12():
    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''
    html = etree.HTML(text)
    # ancestor::* selects all ancestors of the first li
    result = html.xpath('//li[1]/ancestor::*')
    print(result)
    # ancestor::div keeps only the ancestors that are div nodes
    result = html.xpath('//li[1]/ancestor::div')
    print(result)
    # attribute::* selects all attribute values of the first li
    result = html.xpath('//li[1]/attribute::*')
    print(result)
    # child:: selects direct children, here the a node whose href is link1.html
    result = html.xpath('//li[1]/child::a[@href="link1.html"]')
    print(result)
    # descendant:: selects descendants at any depth, here only the span nodes
    result = html.xpath('//li[1]/descendant::span')
    print(result)
    # following:: selects every node after the current node's closing tag; [2] keeps only the second one
    result = html.xpath('//li[1]/following::*[2]')
    print(result)
    # following-sibling:: selects all later siblings of the current node
    result = html.xpath('//li[1]/following-sibling::*')
    print(result)
# For more on axes, see: http://www.w3school.com.cn/xpath/xpath_axes.asp
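The demos above print raw result lists, so element matches show up as opaque Element objects. The following is a small sketch of a helper that turns each result item into a readable label. The name describe() is hypothetical and chosen for this illustration; it is not part of lxml. The markup is a shortened version of the sample list.

```python
from lxml import etree

def describe(nodes):
    """Return a readable label for each XPath result item (hypothetical helper)."""
    labels = []
    for node in nodes:
        if etree.iselement(node):
            # Element result: show the tag name and any attributes
            attrs = ' '.join(f'{k}="{v}"' for k, v in node.attrib.items())
            labels.append(f'<{node.tag} {attrs}>' if attrs else f'<{node.tag}>')
        else:
            # Attribute values and text() results are string-like
            labels.append(str(node))
    return labels

html = etree.HTML(
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul>'
)
print(describe(html.xpath('//li[1]/ancestor::*')))           # ['<html>', '<body>', '<ul>']
print(describe(html.xpath('//li[1]/following-sibling::*')))  # ['<li class="item-1">']
print(describe(html.xpath('//li[1]/attribute::*')))          # ['item-0']
```

The html and body ancestors appear because etree.HTML() wraps a fragment in a full document.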
XPath operators: besides and, XPath offers many other operators, such as or and mod. They are summarized here:
http://www.w3school.com.cn/xpath/xpath_operators.asp
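As a quick illustration of a few of these operators, here is a minimal sketch using a shortened version of the sample list. The variable names are chosen only for this example.

```python
from lxml import etree

html = etree.HTML(
    '<ul>'
    '<li class="item-0">first</li>'
    '<li class="item-1">second</li>'
    '<li class="item-inactive">third</li>'
    '<li class="item-1">fourth</li>'
    '</ul>'
)

# or: either condition may hold
either = html.xpath('//li[@class="item-0" or @class="item-inactive"]/text()')
# mod: remainder of a division, used here to keep odd positions
odd = html.xpath('//li[position() mod 2 = 1]/text()')
# !=: inequality test on an attribute value
not_item1 = html.xpath('//li[@class != "item-1"]/text()')
# |: union of two node sets, returned in document order
first_and_last = html.xpath('//li[1]/text() | //li[last()]/text()')

print(either)          # ['first', 'third']
print(odd)             # ['first', 'third']
print(not_item1)       # ['first', 'third']
print(first_and_last)  # ['first', 'fourth']
```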
That wraps up XPath. More content is on the way, though future updates will move to my own blog. If you found this useful, please like and follow!