Using Python parsing libraries: XPath in detail

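XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.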

Official documentation: https://www.w3.org/TR/xpath/

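Common XPath rules:

nodename     Selects the child elements of the current node that are named nodename
/            Selects direct children of the current node (at the start of an expression, starts from the document root)
//           Selects descendants of the current node at any depth
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes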
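
The rules above can be tried out on a small document. The following is a minimal sketch; the two-item list and the variable names are made up for illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical two-item list, used only to exercise each rule.
doc = etree.HTML(
    '<div><ul>'
    '<li class="a"><a href="x.html">one</a></li>'
    '<li class="b"><a href="y.html">two</a></li>'
    '</ul></div>'
)

ul = doc.xpath('//ul')[0]                               # //       : search at any depth
child_tags = [e.tag for e in ul.xpath('li')]            # nodename : child elements named li
link_tags = [e.tag for e in ul.xpath('./li/a')]         # . and /  : walk down one level per step
parent_tag = ul.xpath('..')[0].tag                      # ..       : parent of the ul node
classes = ul.xpath('li/@class')                         # @        : attribute values

print(child_tags, link_tags, parent_tag, classes)
```

Relative expressions such as 'li' or './li/a' are evaluated against the element that xpath() is called on, which is why they are run on ul rather than on the whole document.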

Introductory example:

#!/usr/bin/env python
#-*- coding:utf-8 -*-

from lxml import etree


def test1():
    content = '''
    <div>
        <ul>
             <li class="item-0"><a href="link1.html">first item</a></li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-inactive"><a href="link3.html">third item</a></li>
             <li class="item-1"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a>
         </ul>
     </div>
    '''

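    # Call etree.HTML() to build a tree that supports XPath queries. The HTML parser also
    # repairs the markup, e.g. it closes the last <li> above, which is missing its </li>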
    html = etree.HTML(content)
    result = etree.tostring(html)
    print(result.decode('utf-8'))


def test2():
    # Assumption: test.html holds the same markup as `content` in test1()
    html = etree.parse('test.html', etree.HTMLParser())
    result = etree.tostring(html)
    print(result.decode('utf-8'))


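# Detailed walkthrough starts here ------------------------------------------
# All nodes: an XPath expression that starts with // selects every matching node in the document
def demo1():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//*')
    print(result)
    # * matches every element node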

# Only the li nodes
# To select all li nodes, write // followed by the node name and pass the expression to xpath()

def demo2():
    html = etree.parse('test.html',etree.HTMLParser())
    result = html.xpath('//li')
    print(result)


# Child nodes: get the a nodes that are direct children of li nodes
def demo3():
    html = etree.parse('test.html',etree.HTMLParser())
    result = html.xpath('//li/a')
    print(result)

# Example: find all a nodes that are descendants of the ul node
def demo4():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//ul//a')
    print(result)


# Parent node: use .. to get the parent
# Example: get the class attribute of the parent of the a node whose href is link4.html
def demo5():
    # Option 1: the .. shorthand
    # html = etree.parse('test.html', etree.HTMLParser())
    # result = html.xpath("//a[@href='link4.html']/../@class")

    # Option 2: the parent:: axis
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
    print(result)




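# Attribute matching: use @ inside a predicate to filter nodes by attribute

def demo6():
    # Note: demo5 retrieved an attribute value with /@class; here [@class=...] filters nodes by that value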
    html = etree.parse('test.html',etree.HTMLParser())
    result = html.xpath("//li[@class='item-0']")
    print(result)


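# Getting text: the text() node test selects the text nodes of an element
# Example: get the text under the li nodes
def demo7():
    html = etree.parse('test.html', etree.HTMLParser())
    # result = html.xpath('//li[@class="item-0"]/text()')
    # print(result)
    # The text sits inside the a nodes, not directly under li, so select it through a
    # Method 1: step into a, then take its text
    result = html.xpath('//li[@class="item-0"]/a/text()')
    print(result)
    # Method 2: // collects text from every descendant, which can include whitespace-only text nodes
    result = html.xpath('//li[@class="item-0"]//text()')
    print(result)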



# Getting attributes
# Example: get the href attribute of every a node under the li nodes
def demo8():
    html = etree.parse('test.html',etree.HTMLParser())
    result = html.xpath('//li/a/@href')
    print(result)

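# Matching attributes with multiple values
# When an attribute holds several values, @class="li" no longer matches; use contains() instead
# Syntax: contains(@attribute_name, value). This is a substring test, so "li" would also match class="list"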
def demo9():
    text = '''
    <li class="li li-first"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    result = html.xpath('//li[contains(@class,"li")]/a/text()')
    print(result)


# Matching on multiple attributes
# When several attributes are needed to identify a node, join the conditions with the and operator
def demo10():
    text = '''
    <li class="li li-first" name="item"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
    print(result)



# Selecting by position (XPath positions start at 1)
def demo11():
    text = '''
    <div>
        <ul>
             <li class="item-0"><a href="link1.html">first item</a></li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-inactive"><a href="link3.html">third item</a></li>
             <li class="item-1"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a>
         </ul>
     </div>
    '''
    html = etree.HTML(text)
    result = html.xpath('//li[1]/a/text()')  # the first li node
    print(result)
    result = html.xpath('//li[last()]/a/text()')  # the last li node
    print(result)
    result = html.xpath('//li[position()<3]/a/text()')  # positions below 3, i.e. the first two li nodes
    print(result)
    result = html.xpath('//li[last()-2]/a/text()')  # last() is the last node, so last()-2 is the third from the end
    print(result)

# Selecting nodes with axes

def demo12():
    text = '''
    <div>
        <ul>
             <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-inactive"><a href="link3.html">third item</a></li>
             <li class="item-1"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a>
         </ul>
     </div>
    '''
    html = etree.HTML(text)
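    # ancestor::* selects all ancestors of the first li
    result = html.xpath('//li[1]/ancestor::*')
    print(result)
    # ancestor::div keeps only the div ancestor
    result = html.xpath('//li[1]/ancestor::div')
    print(result)
    # attribute::* selects all attribute values of the first li
    result = html.xpath('//li[1]/attribute::*')
    print(result)
    # child::a[...] selects the direct a children that satisfy the predicate
    result = html.xpath('//li[1]/child::a[@href="link1.html"]')
    print(result)
    # descendant::span selects span nodes at any depth below the first li
    result = html.xpath('//li[1]/descendant::span')
    print(result)
    # following::*[2] selects the second element that comes after the first li in document order
    result = html.xpath('//li[1]/following::*[2]')
    print(result)
    # following-sibling::* selects all later siblings of the first li
    result = html.xpath('//li[1]/following-sibling::*')
    print(result)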


# For more on using axes, see: http://www.w3school.com.cn/xpath/xpath_axes.asp
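
A few axes that demo12 does not touch follow the same axis::nodetest pattern. The sketch below is only an illustration; the three-item list is a shortened, well-formed variant of the example markup, and the variable names are hypothetical.

```python
from lxml import etree

# Shortened variant of the example list, with every <li> properly closed.
text = '''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# preceding-sibling:: siblings that come before the current node
before = html.xpath('//li[3]/preceding-sibling::li/@class')
# parent:: the long form of .., here restricted to li parents
parent_class = html.xpath('//a[@href="link2.html"]/parent::li/@class')
# self:: the current node itself, kept only if it passes the node test
self_class = html.xpath('//li[2]/self::li/@class')
# self::a does not match an li node, so nothing is selected
self_miss = html.xpath('//li[2]/self::a')

print(before, parent_class, self_class, self_miss)
```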

XPath also provides many other operators, such as or and mod. They are summarized here:

http://www.w3school.com.cn/xpath/xpath_operators.asp
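
As a rough illustration of a few of these operators, the sketch below applies or, mod, != and the union operator | to a well-formed copy of the five-item list used earlier. The variable names are hypothetical.

```python
from lxml import etree

text = '''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
  <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
'''
html = etree.HTML(text)

# or: keep li nodes that satisfy either condition
either = html.xpath('//li[@class="item-0" or @class="item-inactive"]/a/text()')
# mod: keep li nodes at even positions
even = html.xpath('//li[position() mod 2 = 0]/a/text()')
# !=: keep li nodes whose class is not item-1
not_item1 = html.xpath('//li[@class != "item-1"]/a/@href')
# |: union of two node sets
first_and_last = html.xpath('//li[1]/a/@href | //li[last()]/a/@href')
# count() returns a number, which lxml hands back as a float
total = html.xpath('count(//li)')

print(either, even, not_item1, first_and_last, total)
```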

That covers XPath. More content will follow, though it will move to my own blog. If you found this useful, please like and follow!
