The Python Crawler Engineer's Path, Part 7 (1): Beautiful Soup 4 (1)

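Introduction to Beautiful Soup 4

Like lxml, Beautiful Soup 4 is a toolkit for parsing, cleaning, and extracting data from HTML/XML documents.

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, it falls back to Python's built-in html.parser.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't need to think about encodings unless the document doesn't declare one and Beautiful Soup can't detect it. In that case, you only need to specify the original encoding.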
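As a minimal sketch of the encoding behaviour described above, the snippet below feeds Beautiful Soup some Latin-1 bytes and names the encoding explicitly through the `from_encoding` argument. The sample markup is an assumption made up for illustration; it is not part of the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical input: "café" encoded as Latin-1 bytes, with no charset declared.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup which encoding to try first.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                 # decoded to Unicode while parsing
utf8_bytes = soup.p.encode("utf-8")  # re-encoded as UTF-8 on output

print(text)
print(utf8_bytes)
```

If the document declares its own charset, or the bytes are already UTF-8, `from_encoding` is usually unnecessary.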

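Beautiful Soup 4 Parsers

Common Beautiful Soup 4 parsers and their pros and cons

Parser	Usage	Advantages	Disadvantages
html.parser	BeautifulSoup(markup, "html.parser")	Built into Python, decent speed, lenient	Less lenient before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast, lenient	External C dependency
lxml XML parser	BeautifulSoup(markup, "xml") or BeautifulSoup(markup, "lxml-xml")	Very fast, the only supported XML parser	External C dependency
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient, parses pages the same way a browser does	Very slow, external Python dependency

It's fine if this doesn't all make sense yet; a rough overview is enough for now.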
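To see how the parsers in the table differ in practice, one option is to feed the same broken markup to each of them. The sketch below assumes only that `bs4` is installed; a parser that is missing raises `FeatureNotFound`, which is caught and reported. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>Hello<b>world"  # deliberately unclosed tags

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # Each parser repairs the broken markup in its own way.
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # The third-party parser is not installed in this environment.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")
```

Because the repaired output can differ between parsers, naming the parser explicitly keeps results reproducible across machines.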

Installing Beautiful Soup 4

Install the latest version

pip install beautifulsoup4

Installing Parsers for Beautiful Soup 4

Install the lxml parser (recommended)

pip install lxml

Install the html5lib parser

pip install html5lib

Basic Usage of Beautiful Soup 4

Demo document (an excerpt from Alice in Wonderland)

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this document with BeautifulSoup yields a BeautifulSoup object, which can print the document as a nested structure with standard indentation

In:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

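Get the first tag with a given name, including everything inside it

print(soup.title)  # the first <title> tag and its contents
print(soup.p)  # the first <p> tag and its contents
print(soup.a)  # the first <a> tag and its contents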

Get the name of the first tag of a given kind

print(soup.title.name)  # name of the <title> tag
print(soup.p.name)  # name of the first <p> tag
print(soup.a.name)  # name of the first <a> tag

Get the text inside the first tag of a given kind

print(soup.title.string)  # text inside <title>
print(soup.p.string)  # text inside the first <p>
print(soup.a.string)  # text inside the first <a>
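One detail worth knowing about `.string`: it only returns text when a tag has a single child (or a single chain of nested children ending in text). For a tag with mixed content it returns `None`. The following is a small sketch on shortened markup made up for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>Title</b></p>'
    '<p class="story">Once <a>Elsie</a> and more</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string      # one nested child chain, so the text is returned
mixed = second_p.string      # several children, so the result is None
text = second_p.get_text()   # get_text() joins all the pieces instead

print(single, mixed, text)
```

When a tag may contain several children, `get_text()` is usually the safer choice.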

Get the id value of the first tag of a given kind

print(soup.a['id'])  # value of the id attribute of the first <a>

Get every tag with a given name, including their contents

print(soup.find_all('a'))  # every <a> tag in the document
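A common follow-up to `find_all` is to loop over the result and pull out one attribute per tag. The sketch below uses a shortened copy of the demo markup and collects every link's `href`.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result, so it can be iterated directly.
hrefs = [a.get("href") for a in soup.find_all("a")]
names = [a.string for a in soup.find_all("a")]

# When nothing matches, the result is simply empty.
images = soup.find_all("img")

print(hrefs)
print(names)
print(images)
```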

Search by a known attribute value

print(soup.find(id="link3"))  # look up the tag with id="link3"
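Unlike `find_all`, `find` returns a single tag, or `None` when nothing matches, so it is worth checking the result before using it. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link3">Tillie</a>', "html.parser"
)

hit = soup.find(id="link3")    # the matching Tag
miss = soup.find(id="link9")   # no such id, so None

# Guard against None before touching attributes of the result.
name = hit.string if hit is not None else None

print(name, miss)
```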

Get all the text in the document

print(soup.get_text())
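`get_text()` also accepts a separator and a `strip` flag, which helps when the raw text contains stray whitespace. The following sketch uses made-up markup to compare the variants.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> Hello </p><p>world</p></div>", "html.parser")

raw = soup.get_text()                                # text exactly as in the markup
joined = soup.get_text(separator=" | ", strip=True)  # stripped pieces, joined
pieces = list(soup.stripped_strings)                 # the same pieces as a list

print(repr(raw))
print(joined)
print(pieces)
```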

Complete code for this section, with every call except the last one commented out:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# print(soup.prettify())
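# print(soup.title)  # the first <title> tag and its contents
# print(soup.p)  # the first <p> tag and its contents
# print(soup.a)  # the first <a> tag and its contents

# print(soup.title.name)  # name of the <title> tag
# print(soup.p.name)  # name of the first <p> tag
# print(soup.a.name)  # name of the first <a> tag

# print(soup.title.string)  # text inside <title>
# print(soup.p.string)  # text inside the first <p>
# print(soup.a.string)  # text inside the first <a>
# print(soup.a['id'])  # value of the id attribute of the first <a>
# print(soup.find_all('a'))  # every <a> tag in the document
# print(soup.find(id="link3"))  # look up the tag with id="link3"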
print(soup.get_text())

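The Four Object Types in Beautiful Soup 4

Beautiful Soup 4 transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A tag in bs4 corresponds to a tag in the original XML or HTML document. Put simply, it is an HTML tag. A tag has many attributes; the two most important are name and attrs:

name:

Every tag has a name, available as .name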

tag = soup.p
print(tag.name)

Changing a tag's name is reflected in any HTML markup generated by the current Beautiful Soup object

tag = soup.p
tag.name = 'ppp'  # renames the first <p> tag in the soup object
print(tag)
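To confirm that a rename changes the document itself and not just the `tag` variable, the sketch below renames a tag and then renders the whole soup. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold text</b></p>", "html.parser")

tag = soup.b
old_name = tag.name   # 'b'
tag.name = "strong"   # rename the tag in place

# The change shows up when the whole document is rendered again.
rendered = str(soup)

print(old_name, "->", tag.name)
print(rendered)
```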

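attrs:

A tag may have any number of attributes. They are handled the same way as a dictionary: you can add, remove, and modify them

tag = soup.a
print(tag['class'])  # attributes are accessed like dictionary keys
print(tag.attrs)  # all attributes of this tag, as a dictionary
tag['class'] = 'class_tag'  # modify an attribute value
del tag['id']  # delete the tag's id attribute
print(tag['class'])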
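The dictionary analogy extends to missing keys: indexing a missing attribute raises `KeyError`, while `.get()` returns `None`. A short sketch on made-up markup, covering add, modify, delete, and lookup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser"
)
tag = soup.a

tag["title"] = "first sister"  # add a new attribute
tag["id"] = "first"            # modify an existing attribute
del tag["href"]                # delete an attribute

missing = tag.get("href")      # .get() returns None for a missing attribute

try:
    tag["href"]                # indexing a missing attribute raises KeyError
except KeyError:
    raised = True
else:
    raised = False

print(tag.attrs, missing, raised)
```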

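Multi-valued attributes:
The most common multi-valued attribute in HTML is class (a tag can have more than one class). Others include rel, rev, accept-charset, headers, and accesskey

In Beautiful Soup, the value of a multi-valued attribute is returned as a list:

css_soup = BeautifulSoup('<p class="value1 value2"></p>', 'html.parser')
print(css_soup.p['class'])

If an attribute looks like it has several values but is not defined as multi-valued by the HTML standard, Beautiful Soup returns it as a single string

css_soup = BeautifulSoup('<p id="value1 value2"></p>', 'html.parser')
print(css_soup.p['id'])

When a tag is converted back to a string, the values of a multi-valued attribute are joined into one

css_soup = BeautifulSoup('<p class="value1 value2"></p>', 'html.parser')
print(css_soup.p['class'])
print(css_soup.p)

Tags in an XML document have no multi-valued attributes

css_soup = BeautifulSoup('<p class="value1 value2"></p>', 'xml')
print(css_soup.p['class'])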
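The cases above can be checked side by side in one snippet. This sketch sticks to `html.parser` so that it runs without lxml, and also uses `get_attribute_list`, which always returns a list whether or not the attribute is multi-valued.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="value1 value2" id="value1 value2"></p>', "html.parser"
)
p = soup.p

classes = p["class"]                  # multi-valued attribute: a list
ident = p["id"]                       # ordinary attribute: a single string
id_list = p.get_attribute_list("id")  # always a list, even for id
rendered = str(p)                     # the class values are joined again on output

print(classes)
print(ident)
print(id_list)
print(rendered)
```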

NavigableString

Strings usually sit inside tags. Beautiful Soup uses the NavigableString class to wrap the strings inside a tag:

Use tag.string to get the text inside a tag

css_soup = BeautifulSoup('''<p class="value1 value2">The Dormouse's story</p>''','xml')
tag=css_soup.p
print(tag.string)
print(type(tag.string))

A NavigableString behaves like a Python Unicode string. You can convert a NavigableString object to a plain string with str() (unicode() in Python 2)
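A brief sketch of that conversion in Python 3, on made-up markup: the `NavigableString` still knows its position in the tree, while the result of `str()` is an ordinary string with no link back to the document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>The Dormouse's story</p>", "html.parser")

ns = soup.p.string   # a NavigableString, still attached to the tree
plain = str(ns)      # a plain Python string, detached from the tree

print(type(ns).__name__, type(plain).__name__)
print(ns.parent.name)  # the NavigableString can navigate to its parent tag
```

Converting to a plain string is useful when the text should outlive the parsed document, since a `NavigableString` keeps a reference to the whole tree.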

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

css_soup = BeautifulSoup('''<p class="value1 value2">The Dormouse's story</p>''','xml')
tag=css_soup.p
tag.string.replace_with("hello bs4")
print(tag.string)
print(type(tag.string))
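The same replacement can be tried with `html.parser`, which avoids the lxml dependency of the 'xml' parser. The sketch below also shows that the old string is detached from the tree rather than modified.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>The Dormouse's story</p>", "html.parser")
tag = soup.p

old = tag.string
old.replace_with("hello bs4")  # swap the old string for a new one

new_text = tag.string
rendered = str(tag)

print(new_text)
print(rendered)
print(old.parent)  # the old string no longer belongs to the tree
```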

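BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. Most of the time you can treat it as a Tag object. It has the following properties:

Name:

Use .name to get the name of the BeautifulSoup object

Type:

Use type() to get the type of the BeautifulSoup object

Attributes:

Use .attrs to get the attributes of the BeautifulSoup object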

soup = BeautifulSoup('''<p class="value1 value2">The Dormouse's story</p>''','xml')
print(soup.name)
print(type(soup))
print(soup.attrs)
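For reference, here is a sketch of what those three properties look like on a small document parsed with `html.parser`. Since the object does not correspond to a real tag, its name is the special value `[document]` and its attribute dictionary is empty.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>text</p>", "html.parser")

doc_name = soup.name            # special name for the document root
is_tag = isinstance(soup, Tag)  # BeautifulSoup is a subclass of Tag
attrs = soup.attrs              # the document itself carries no attributes
first = soup.find("p")          # Tag methods such as find() work on it

print(doc_name, is_tag, attrs, first)
```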

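Comment

A Comment object is a special kind of NavigableString. Reading it through .string returns the text of the comment without the comment delimiters, so the result looks like ordinary text unless you check its type.

html_a = '''<a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新聞--></a>'''
soup = BeautifulSoup(html_a, 'html.parser')
comment = soup.a.string
print(comment)
print(type(comment))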
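Because `.string` hides the comment delimiters, a scraper that only wants visible text may want to check the type first. The sketch below, on made-up markup, tells a comment apart from real text and collects every comment in the document.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<p><a><!--news--></a><b>real text</b></p>", "html.parser"
)

a_string = soup.a.string   # the comment body, without <!-- -->
b_string = soup.b.string   # ordinary text

is_comment = isinstance(a_string, Comment)
is_plain = not isinstance(b_string, Comment)

# Collect every comment node in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# get_text() skips comments and keeps only the visible text.
visible = soup.get_text()

print(is_comment, is_plain, comments, visible)
```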
