Contents

I. Introduction to BeautifulSoup and installation

II. Using BeautifulSoup
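I. Introduction to BeautifulSoup and installation

1. Introduction

In short, BeautifulSoup is a Python parsing library whose main job is to parse the HTML of web pages.
The official documentation describes it as follows: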

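Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only have to state the original encoding.
Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try out different parsing strategies or trade flexibility for speed.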
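
The encoding behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: the byte string and the choice of Latin-1 are made up, from_encoding is the constructor argument used to state the original encoding, and the built-in html.parser is used so that nothing beyond beautifulsoup4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no encoding declaration.
raw = "<p>café</p>".encode("latin-1")

# State the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string          # inside the tree, text is Unicode
utf8_bytes = soup.p.encode()  # encoded output defaults to UTF-8

print(text)
print(utf8_bytes)
```

If the input is already a Python str, no decoding takes place and from_encoding is not needed.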

2. Installation

Install it directly with pip:

pip install beautifulsoup4

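II. Using BeautifulSoup

1. Notes

When you create a BeautifulSoup object you should specify a parser (if you omit it, Beautiful Soup picks the best one installed and emits a warning):

html.parser - ships with Python; less forgiving than the others, so parts of a badly formed page may be lost
lxml - fast; has to be installed separately
xml - also provided by the lxml library; parses XML documents
html5lib - the most forgiving, but somewhat slower

Both lxml and html5lib have to be installed separately with pip (lxml is the recommended choice).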
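
Since lxml and html5lib may or may not be installed on a given machine, one possible way to handle that is to try the parsers in order of preference. The helper below is a hypothetical sketch (make_soup and its parsers argument are names invented here, not part of the library); it relies on the FeatureNotFound exception that BeautifulSoup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `parsers`."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


# Deliberately broken markup: neither tag is closed.
soup, used = make_soup("<p>Hello<b>world")

print(used)           # name of the parser that was actually used
print(soup.b.string)  # the parser has closed the <b> tag for us
```

The parsers build slightly different trees for malformed input (lxml and html5lib add html and body wrappers, html.parser does not), so it is usually better to settle on one parser per project.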

2. Usage

Take the following HTML document fragment as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="beautiful title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""

Create the BeautifulSoup object, specifying lxml as the parser:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

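2.1 Getting tag information

Pretty-print the parsed HTML and read a tag's name, attributes, and text:

print(soup.prettify())  # pretty-print the parsed document (the parser has already added the missing closing tags)
print(soup.p)  # the first <p> tag, including its contents
print(soup.p.name)  # the tag name, 'p'
# Tag attributes
print(soup.p.attrs)  # all attributes of the <p> tag, as a dict
print(soup.p['class'])  # value of the class attribute, as a list (class can hold several values)
# Tag text
print(soup.p.string)  # the text of the <p> tag; None if the tag has more than one child node
print(soup.p.strings)  # every text string inside the <p> tag, as a generator
print(soup.p.text)  # all text inside the <p> tag, joined into a single string
print(soup.stripped_strings)  # every text string in the document with surrounding whitespace removed, as a generator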
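
The difference between .string, .strings, .text and .stripped_strings is easiest to see on a tag with several children. The sketch below uses a shortened, made-up variant of the story paragraph and the built-in html.parser, so the fragment and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the "story" paragraph.
html = '<p class="story">Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

single = p.string                    # None, because <p> has several children
pieces = list(p.strings)             # each text node, whitespace kept
joined = p.text                      # all text nodes joined into one str
stripped = list(p.stripped_strings)  # text nodes with whitespace trimmed

print(single)
print(pieces)
print(joined)
print(stripped)
print(p.a.get("href"))  # .get() returns None for a missing attribute
```

Indexing with p.a["href"] would raise KeyError here, which is why .get() is the safer choice when an attribute may be absent.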

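2.2 Navigating to related nodes

Get an element's parent and ancestors, its children and descendants, and its siblings:

# Parent and ancestors
print(soup.p.parent)  # the direct parent of the first <p> tag
print(soup.p.parents)  # all ancestors of the <p> tag, as a generator
# Children and descendants
print(soup.p.contents)  # the direct children of the <p> tag, as a list
print(soup.p.children)  # the direct children of the <p> tag, as an iterator
print(soup.p.descendants)  # all descendants of the <p> tag, as a generator
# Siblings
print(soup.a.previous_sibling)  # the node just before the first <a> tag (here a text node)
print(soup.a.previous_siblings)  # all siblings before the <a> tag, as a generator
print(soup.a.next_sibling)  # the node just after the first <a> tag (here a text node)
print(soup.a.next_siblings)  # all siblings after the <a> tag, as a generator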
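
Because the navigation attributes above return generators and also include text nodes, printing them directly is rarely informative. The following sketch, on a small made-up fragment parsed with html.parser, shows one way to consume them and to skip the whitespace text nodes between tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment with newlines between the <p> tags.
html = "<div><p>one</p>\n<p>two</p>\n<p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("p")[1]

# Generators have to be iterated (or wrapped in list()) to see their items.
ancestors = [tag.name for tag in second.parents]
child_tags = [c.name for c in soup.div.children if isinstance(c, Tag)]
descendant_count = len(list(soup.div.descendants))

# .next_sibling is the newline text node, not the next <p> tag.
raw_next = second.next_sibling
next_tag = second.find_next_sibling("p")
prev_tag = second.find_previous_sibling("p")

print(ancestors)
print(child_tags)
print(repr(raw_next), next_tag.string, prev_tag.string)
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.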

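2.3 Using the search methods

Not everything can be reached simply by walking the tree structure; searches are usually done with find() and find_all():

find() - returns the first match, or None if nothing matches
find_all() - returns a list of all matches

Because find() and find_all() are used in almost the same way, only find_all() is shown here.

import re  # needed for re.compile() below

print(soup.find_all(string=re.compile('Lacie'), limit=2))  # text nodes containing 'Lacie', matched by regex (limit caps the number of results; string= is the current name of the older text= argument)
print(soup.find_all('a', string='Lacie'))  # all <a> tags whose text is exactly 'Lacie'
print(soup.find_all('a', id='link2'))  # all <a> tags whose id is 'link2'
print(soup.find_all('a', class_='sister'))  # all <a> tags whose class includes 'sister'
print(soup.find_all('a', class_='sister', id='link2'))  # several conditions combined
print(soup.find_all(name='a'))  # all <a> tags
print(soup.find_all(attrs={'class': 'sister'}))  # all tags whose class includes 'sister'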
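
As a self-contained sketch of how find() and find_all() differ in what they return, the example below rebuilds a small part of the sample document (trimmed for brevity and parsed with html.parser, both choices made here for illustration) and also shows a regex on an attribute and a function used as a filter.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")        # a single Tag: the first match
missing = soup.find("table")  # None when nothing matches
all_links = soup.find_all("a")  # list-like result with every match
first_two = [a["id"] for a in soup.find_all("a", limit=2)]

# A compiled regex can be used as the value of an attribute filter.
by_href = [a.string for a in soup.find_all("a", href=re.compile(r"/(lacie|tillie)$"))]

# A function that takes a tag and returns True/False also works as a filter.
starts_with_t = soup.find_all(lambda tag: tag.name == "a" and tag.get_text().startswith("T"))

print(first["id"], missing, len(all_links))
print(first_two, by_href)
print([a["id"] for a in starts_with_t])
```

Because find() can return None, code that chains onto its result (for example soup.find("table").find("tr")) should check for None first.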

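2.4 Using CSS selectors

If you are familiar with CSS selectors, BeautifulSoup supports them as well through select():

. - selects by class
# - selects by id

print(soup.select('p'))  # all <p> tags, as a list
print(soup.select('p a'))  # all <a> tags nested inside a <p> tag, as a list
print(soup.select('p.story'))  # all <p> tags whose class includes 'story', as a list
print(soup.select('.story'))  # all elements whose class includes 'story', as a list
print(soup.select('.beautiful.title'))  # all elements that have both the 'beautiful' and 'title' classes, as a list
print(soup.select('#link1'))  # all elements whose id is 'link1', as a list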
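
Beyond class and id selectors, select() accepts other standard CSS syntax, and select_one() returns only the first match. The sketch below assumes a reasonably recent beautifulsoup4 (where selector support comes from the bundled soupsieve dependency) and uses a trimmed copy of the sample document with html.parser.

```python
from bs4 import BeautifulSoup

html = """
<p class="beautiful title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_story = soup.select_one("p.story")  # first match, or None
no_match = soup.select_one("#nope")

# '>' restricts the match to direct children of p.story.
direct_links = [a["id"] for a in soup.select("p.story > a")]

# Attribute selector: href ending in 'tillie'.
tillie = [a.get_text() for a in soup.select('a[href$="tillie"]')]

# Chained classes: the element must carry both of them.
both = soup.select(".beautiful.title")

print(first_story["class"], no_match)
print(direct_links, tillie)
print(both[0].b.string)
```

select() always returns a list, even for an id selector, so index into the result or use select_one() when a single element is expected.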

