Web Scraping with BeautifulSoup and XPath

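1. BeautifulSoup (a Pythonic way to iterate over, search, and modify HTML or XML)

1.1 Introduction

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.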

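1.2 Parsers

Beautiful Soup supports the HTML parser in Python's standard library (html.parser) as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup uses the standard-library parser. The lxml parser is more powerful and faster, so installing it is recommended.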
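The parser choice above can be sketched as follows. This is a minimal illustration that assumes bs4 is installed and lxml may or may not be; the helper name make_soup is made up for this example and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)  # world
```

Naming the parser explicitly keeps results consistent across machines, since different parsers can build slightly different trees from malformed HTML.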

1.3 Usage

The examples below borrow the Alice in Wonderland document from the official documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

1. Getting tags

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

# Get the first <a> tag
print(soup.a)       # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(type(soup.a))     # <class 'bs4.element.Tag'>


# Get the <title> node
print(soup.head.title)  # <title>The Dormouse's story</title>


print(soup.find_all("a"))   # all matching Tag objects, returned as a list

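2. Tag attributes, name, and text

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

for link in soup.find_all('a'):
    # print(link.name)       # tag name

    # print(link.get('href'))     # href of this <a> tag
    # print(link["href"])         # same value, but raises KeyError if the attribute is missing
    # print(link.get("id"))   # id of the tag
    # print(link.get("class"))    # class value of the tag, as a list
    # print(link.attrs)       # all attributes of the tag, as a dict
    # del link["id"]
    # print(link.attrs)   # all attributes except id

    print(link.text)    # text inside the <a> tag
    print(link.string)
    print(link.get_text())

# Difference between text and string

print(soup.p.string)    # The Dormouse's story (the first <p> has a single child, so .string resolves through it)
print(soup.find('p', class_='story').string)    # None (this <p> has several children)
print(soup.p.text)  # The Dormouse's story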
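To make the difference between .string and .text concrete, here is a small sketch on a made-up snippet (the markup and the class names "one" and "two" are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

snippet = (
    '<div><p class="one"><b>bold only</b></p>'
    '<p class="two">plain <b>bold</b> tail</p></div>'
)
soup = BeautifulSoup(snippet, "html.parser")

one = soup.find("p", class_="one")
two = soup.find("p", class_="two")

# .string follows a chain of single children down to one string.
print(one.string)   # bold only
# With several children there is no single string, so .string is None.
print(two.string)   # None
# .text (same as get_text()) concatenates every string underneath the tag.
print(two.text)     # plain bold tail
# stripped_strings yields each string with surrounding whitespace removed.
print(list(two.stripped_strings))  # ['plain', 'bold', 'tail']
```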

1.4 Getting information from the document tree

print(soup.head.title.string)   # chain through nodes to reach the text
print(soup.body.a.string)       # text of the first <a> tag under <body>

# Children and descendants
print(soup.p.contents)      # direct children of the first <p>, as a list
# [<b>The Dormouse's story</b>]

print(soup.p.children)  # an iterator over the direct children of <p>
for child in soup.p.children:
    print(child)    # each direct child of <p>

print(soup.p.descendants)   # a generator over all descendants of <p>, nested tags and strings included
for child in soup.p.descendants:
    print(child)

# Parent and ancestors
print(soup.p.parent)      # parent of the <p> tag, i.e. the whole <body>
print(soup.p.parents)     # a generator over all ancestors of <p>

# Siblings
print(soup.a.next_sibling)  # next sibling of the first <a>: the text node ",\n"
print(soup.a.next_siblings) # a generator over the following siblings

print(soup.a.previous_sibling)  # previous sibling: the text that precedes the first <a>
print(soup.a.previous_siblings) # a generator over the preceding siblings
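Because next_sibling often lands on a text node rather than a tag, it can help to ask for sibling tags by name. A small sketch on a simplified, made-up version of the sample paragraph:

```python
from bs4 import BeautifulSoup

snippet = (
    '<p>names: <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a

# The immediate sibling is the text between the two links.
print(repr(str(first.next_sibling)))        # ', '
# find_next_sibling skips text nodes and returns the next matching tag.
print(first.find_next_sibling("a")["id"])   # link2
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(later_ids)                            # ['link2', 'link3']
# parents walks upward until it reaches the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in first.parents]
print(ancestor_names)                       # ['p', '[document]']
```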

Filter types for searching the document tree (used with find_all)

Again borrowing the document from the official documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Type 1: string

ret = soup.find_all(name="a")   # all <a> tags

Type 2: regular expression

import re
tmp = re.compile("^h")
ret = soup.find_all(name=tmp)   # all tags whose name starts with "h", here <html> and <head>

Type 3: list

ret = soup.find_all(name=["a","b"])     # all <a> tags and all <b> tags

Type 4: function

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

for tag in soup.find_all(name=has_class_but_no_id):
    print(tag)  # tags that have a class attribute but no id attribute

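The limit parameter:

If we do not need every result, the limit parameter caps the number of results returned.

print(soup.find_all('a',limit=2))

The recursive parameter:

When find_all() is called on a tag, Beautiful Soup searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

print(soup.html.find_all('a',recursive=False))   # [] because no <a> is a direct child of <html>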
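A short sketch of how limit and recursive change the result, using a tiny made-up document in which one link is nested inside a paragraph and one sits directly under body:

```python
from bs4 import BeautifulSoup

snippet = (
    "<html><head><title>T</title></head>"
    "<body><p><a>one</a></p><a>two</a></body></html>"
)
soup = BeautifulSoup(snippet, "html.parser")

direct_under_html = soup.html.find_all("a", recursive=False)
direct_under_body = soup.body.find_all("a", recursive=False)
all_under_body = soup.body.find_all("a")
only_first = soup.find_all("a", limit=1)

print(direct_under_html)  # []
print(direct_under_body)  # [<a>two</a>]
print(all_under_body)     # [<a>one</a>, <a>two</a>]
print(only_first)         # [<a>one</a>]
```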

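Using find (returns a single result):

find_all() returns a list, even when there is only one match, whereas find() returns the first match directly (or None if nothing matches).

print(soup.find('a'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.head.title is shorthand that uses tag names as attributes. It works by calling find() on the current tag repeatedly:

soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>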
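One practical consequence is that find() and the attribute shorthand both give None when nothing matches, so chained lookups need a guard. A minimal sketch on a made-up document that contains no links:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body></body></html>", "html.parser"
)

missing = soup.find("a")
print(missing)             # None
print(soup.find_all("a"))  # []
print(soup.body.a)         # None, the shorthand behaves like find()

# Calling a method on None would raise AttributeError, so check first.
link_text = missing.get_text() if missing is not None else ""
print(repr(link_text))     # ''
```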

1.5 CSS selectors

As the name suggests, select() finds elements using CSS selectors.

ret=soup.select("a")          # by tag name
ret=soup.select("#link1")     # by id
ret=soup.select(".sister")    # by class name
ret=soup.select(".c1 p,a")    # combined: <p> tags inside .c1, plus all <a> tags
ret = soup.select("a[href='http://example.com/tillie']")  # by attribute
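As a sketch of what these selectors return, the following runs a few of them against the story paragraph from the sample document. select_one() and the child combinator (>) are standard parts of the bs4 selector support; the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

snippet = """<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</p>"""
soup = BeautifulSoup(snippet, "html.parser")

# select() always returns a list of Tag objects.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]
print(hrefs)

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("#link1")
print(first.get_text())   # Elsie

# Attribute selectors support suffix matching with $=.
tillie_ids = [a["id"] for a in soup.select('a[href$="tillie"]')]
print(tillie_ids)         # ['link3']
```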

See the official documentation for more.

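2. XPath (used here through lxml, a fast, easy-to-use, full-featured library for processing HTML and XML)

XPath stands for XML Path Language, a small query language. For extracting data it serves the same purpose as re and BeautifulSoup, but in most cases XPath is the preferred choice.

In Python, XPath support comes from the lxml library, so lxml must be installed first.

Usage:

from lxml import etree

selector=etree.HTML('source')   # parse the source into a tree that XPath can query
# <Element html at 0x29b7fdb6708>
ret = selector.xpath('expression')     # returns a list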
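What comes back from xpath() depends on the expression: element paths give a list of elements, @attribute and text() give lists of strings, and functions such as count() give a single value. A minimal sketch on a made-up fragment:

```python
from lxml import etree

source = '<div><a href="/a" class="x">first</a><a href="/b">second</a></div>'
selector = etree.HTML(source)  # lxml wraps the fragment in <html><body>

links = selector.xpath("//a")
tags = [a.tag for a in links]
hrefs_from_elements = [a.get("href") for a in links]  # attribute via the element API
hrefs = selector.xpath("//a/@href")                   # attribute via XPath
texts = selector.xpath("//a/text()")
total = selector.xpath("count(//a)")                  # XPath numbers come back as float

print(tags)                 # ['a', 'a']
print(hrefs_from_elements)  # ['/a', '/b']
print(hrefs)                # ['/a', '/b']
print(texts)                # ['first', 'second']
print(total)                # 2.0
```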

2.1 Query syntax

Source document

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

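First convert it into a form that XPath can query:

from lxml import etree
selector=etree.HTML(html_doc)   # parse the source into a tree that XPath can query

Node selection

Expression   Description                                              Example              Result
nodename     selects all nodes named nodename                         xpath('//div')       all div nodes
/            selects from the root node                               xpath('/div')        the div node at the root
//           selects matching nodes anywhere, regardless of position  xpath('//div')       all div nodes
.            selects the current node                                 xpath('./div')       div nodes under the current node
..           selects the parent of the current node                   xpath('..')          moves up one node
@            selects attributes                                       xpath('//@class')    all class attributes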

Usage

from lxml import etree
selector = etree.HTML(html_doc)

ret=selector.xpath("//p")
# [<Element p at 0x2a6126569c8>, <Element p at 0x2a612656a08>, <Element p at 0x2a612656a48>]
ret=selector.xpath("//p/text()")   # the text nodes directly under each <p>, newlines included

ret=selector.xpath("/p")     # [] because the root node is <html>, not <p>

ret=selector.xpath("//a[@id='link1']")     # [<Element a at 0x1c541e43808>]
ret=selector.xpath("//a[@id='link1']/text()")     # ['Elsie']
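The "." and ".." steps matter most when xpath() is called on an element instead of the whole tree. A small sketch with a made-up list shows that "./" stays inside the current element while "//" restarts from the document root:

```python
from lxml import etree

source = """
<ul>
  <li><a href="/1">one</a><span>10</span></li>
  <li><a href="/2">two</a><span>20</span></li>
</ul>
"""
selector = etree.HTML(source)
items = selector.xpath("//li")

rows = []
for li in items:
    # "./" is evaluated relative to this <li>.
    title = li.xpath("./a/text()")[0]
    price = li.xpath("./span/text()")[0]
    rows.append((title, price))
print(rows)  # [('one', '10'), ('two', '20')]

# "//a" ignores the starting element and searches the whole document.
links_from_first_item = items[0].xpath("//a")
print(len(links_from_first_item))  # 2

# ".." moves up to the parent element.
parent_tag = items[0].xpath("..")[0].tag
print(parent_tag)  # ul
```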

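Predicates (these all return element objects)

Expression                                 Result
xpath('//body/div[1]')                     the first div under body
xpath('//body/div[last()]')                the last div under body
xpath('//body/div[last()-1]')              the second-to-last div under body
xpath('//body/div[position()<3]')          the first two div nodes under body
xpath('//body/div[@class]')                div nodes under body that have a class attribute
xpath('//body/div[@class="main"]')         div nodes under body whose class attribute is "main"
xpath('//body/div[price>35.00]')           div nodes under body with a price child element whose value is greater than 35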
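The sample document has no div elements, so here is a sketch that applies the same predicates to its a and p nodes instead:

```python
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
selector = etree.HTML(html_doc)
base = "//p[@class='story']/a"

first = selector.xpath(base + "[1]/text()")
last = selector.xpath(base + "[last()]/text()")
second_to_last = selector.xpath(base + "[last()-1]/text()")
first_two = selector.xpath(base + "[position()<3]/text()")
with_class = selector.xpath("//p[@class]")

print(first)            # ['Elsie']
print(last)             # ['Tillie']
print(second_to_last)   # ['Lacie']
print(first_two)        # ['Elsie', 'Lacie']
print(len(with_class))  # 3
```

Note that XPath positions start at 1, not 0.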

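Wildcards

Expression             Result
xpath('//div/*')       all child elements of div
xpath('//div[@*]')     all div nodes that have at least one attribute

Selecting multiple paths

Expression                  Result
xpath('//div|//table')      all div and table nodes

Code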

from lxml import etree
selector = etree.HTML(html_doc)

ret = selector.xpath('//title/text()|//a/text()')
# ["The Dormouse's story", 'Elsie', 'Lacie', 'Tillie']

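2.2 XPath axes

An axis defines a set of nodes relative to the current node.

Axis name               Expression                          Description
ancestor                xpath('./ancestor::*')              all ancestors of the current node (parent, grandparent, and so on)
ancestor-or-self        xpath('./ancestor-or-self::*')      all ancestors of the current node plus the node itself
attribute               xpath('./attribute::*')             all attributes of the current node
child                   xpath('./child::*')                 all children of the current node
descendant              xpath('./descendant::*')            all descendants of the current node (children, grandchildren, and so on)
following               xpath('./following::*')             all nodes in the document after the closing tag of the current node
following-sibling       xpath('./following-sibling::*')     siblings that come after the current node
parent                  xpath('./parent::*')                the parent of the current node
preceding               xpath('./preceding::*')             all nodes in the document before the opening tag of the current node
preceding-sibling       xpath('./preceding-sibling::*')     siblings that come before the current node
self                    xpath('./self::*')                  the current node

Usage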

ret = selector.xpath('//a/ancestor::*')
# [<Element html at 0x168a62717c8>, <Element body at 0x168a6271748>, <Element p at 0x168a6271708>]

ret = selector.xpath('//a/parent::*/text()')
# ['Once upon a time there were three little sisters; and their names were\n', ',\n', ' and\n',
#  ';\nand they lived at the bottom of a well.']

ret = selector.xpath('//a/attribute::*')
# ['http://example.com/elsie', 'sister', 'link1', 'http://example.com/lacie', 'sister',
#  'link2', 'http://example.com/tillie', 'sister', 'link3']
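The sibling axes from the table can be tried on the same document. A short sketch (the variable names are chosen for this example):

```python
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
selector = etree.HTML(html_doc)

after = selector.xpath("//a[@id='link2']/following-sibling::a/text()")
before = selector.xpath("//a[@id='link2']/preceding-sibling::a/text()")
chain = [el.tag for el in selector.xpath("//b/ancestor-or-self::*")]
later_links = selector.xpath("//p[@class='title']/following::a/@id")

print(after)        # ['Tillie']
print(before)       # ['Elsie']
print(chain)        # ['html', 'body', 'p', 'b']
print(later_links)  # ['link1', 'link2', 'link3']
```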

2.3 Functions

XPath functions make fuzzy matching possible.

Function            Usage                                                          Explanation
starts-with         xpath('//div[starts-with(@id,"ma")]')                          div nodes whose id starts with "ma"
contains            xpath('//div[contains(@id,"ma")]')                             div nodes whose id contains "ma"
and                 xpath('//div[contains(@id,"ma") and contains(@id,"in")]')      div nodes whose id contains both "ma" and "in"
text()              xpath('//div[contains(text(),"ma")]')                          div nodes whose text contains "ma"

Usage

from lxml import etree
selector = etree.HTML(html_doc)

# Text of <a> tags whose id starts with "link", under <p> tags with class "story"
ret=selector.xpath("//p[@class='story']/a[starts-with(@id,'link')]/text()")
# ['Elsie', 'Lacie', 'Tillie']

# Text of <a> tags whose id contains "k", under <p> tags with class "story"
ret=selector.xpath("//p[@class='story']/a[contains(@id,'k')]/text()")
# ['Elsie', 'Lacie', 'Tillie']

# Text of <a> tags whose class contains "is", under <p> tags with class "story"
ret=selector.xpath("//p[@class='story']/a[contains(@class,'is')]/text()")
# ['Elsie', 'Lacie', 'Tillie']

# Text of <a> tags whose own text contains "ie", under <p> tags with class "story"
ret=selector.xpath("//p[@class='story']/a[contains(text(),'ie')]/text()")
# ['Elsie', 'Lacie', 'Tillie']
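The and operator from the table has no usage example above, so here is a small sketch that combines conditions. The or operator and the not() function are standard XPath 1.0 and are included for comparison.

```python
from lxml import etree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
selector = etree.HTML(html_doc)

both = selector.xpath(
    "//a[starts-with(@id,'link') and contains(@href,'lacie')]/text()"
)
either = selector.xpath(
    "//a[contains(@href,'elsie') or contains(@href,'tillie')]/@id"
)
negated = selector.xpath(
    "//a[contains(@class,'sister') and not(contains(text(),'El'))]/text()"
)

print(both)     # ['Lacie']
print(either)   # ['link1', 'link3']
print(negated)  # ['Lacie', 'Tillie']
```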

See the W3C documentation for more.

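2.4 Scraping second-hand housing listings from Lianjia

Open the Lianjia site, pick the information we need, right-click the element (in the browser's developer tools), and choose Copy XPath from the Copy menu.

This gives the following:

//*[@id="leftContent"]/ul/li[1]/div/div[1]/a

Code: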

import requests
from lxml import etree

response = requests.get("https://bj.lianjia.com/ershoufang/changping/pg1/",
        headers={
            'Referer':'https://bj.lianjia.com/ershoufang/changping/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36',
                    })

selector=etree.HTML(response.content) # parse the HTML source into a tree that XPath can query

ret = selector.xpath("//*[@id='leftContent']/ul/li[1]/div/div[1]/a/text()")
print(ret)      # ['商品房滿五年唯一 有電梯高樓層 東南2居室 已留鑰匙']

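To get the titles of all listings on the first page:

ret = selector.xpath("//*[@id='leftContent']/ul/li/div/div[1]/a/text()")

Note the difference between the two expressions: without the [1] index on li, the match is no longer tied to the position of the first list item.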
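When more than one field is needed per listing, one option is to select each li first and then query inside it with relative paths. The sketch below runs offline on made-up markup that only mimics the shape implied by the copied XPath; the real page structure may differ and can change at any time, and the dictionary keys "title" and "url" are choices made for this example.

```python
from lxml import etree

# Hypothetical markup, not the real Lianjia page.
sample = """
<div id="leftContent">
  <ul>
    <li><div><div><a href="/1.html">Listing A</a></div><div>2 rooms</div></div></li>
    <li><div><div><a href="/2.html">Listing B</a></div><div>3 rooms</div></div></li>
    <li><span>an item without the expected structure</span></li>
  </ul>
</div>
"""
selector = etree.HTML(sample)

first = selector.xpath("//*[@id='leftContent']/ul/li[1]/div/div[1]/a/text()")
print(first)  # ['Listing A']

listings = []
for li in selector.xpath("//*[@id='leftContent']/ul/li"):
    # Relative paths keep the title and the link of one listing together.
    titles = li.xpath("./div/div[1]/a/text()")
    links = li.xpath("./div/div[1]/a/@href")
    if titles and links:  # skip items that lack the expected structure
        listings.append({"title": titles[0].strip(), "url": links[0]})

print(listings)
```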

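3. Summary

A comparison of the ways to extract nodes:

Method	Performance	Difficulty of use
re (regular expressions)	Fast	Hard
BeautifulSoup	Slow	Easy
XPath (lxml)	Fast	Easy

In general, lxml is the best choice for scraping data: it is fast, pairs well with Chrome (which can copy XPath expressions for you), and offers richer functionality, whereas regular expressions and Beautiful Soup are only useful in certain specific scenarios.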
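To put the three approaches side by side, the sketch below extracts the same href values with each of them from a made-up snippet. It says nothing about speed; it only shows how the code differs.

```python
import re

from bs4 import BeautifulSoup
from lxml import etree

snippet = (
    '<p><a href="/a" class="sister">Elsie</a> '
    '<a href="/b" class="sister">Lacie</a></p>'
)

# Regular expression: short, but tied to the exact attribute order and quoting.
by_regex = re.findall(r'<a href="([^"]+)"', snippet)

# Beautiful Soup: navigates a parsed tree, so attribute order does not matter.
by_bs4 = [a["href"] for a in BeautifulSoup(snippet, "html.parser").find_all("a")]

# lxml with XPath: one expression selects the attribute values directly.
by_xpath = etree.HTML(snippet).xpath("//a/@href")

print(by_regex)  # ['/a', '/b']
print(by_bs4)    # ['/a', '/b']
print(by_xpath)  # ['/a', '/b']
```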
