Basic Usage of Python BeautifulSoup

Original

2020-06-04 07:43

Example code:

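# coding: utf-8
from bs4 import BeautifulSoup
import urllib.request

# Fetch the page; give up if the server does not respond within 20 seconds.
url = 'http://reeoo.com'
request = urllib.request.Request(url)
response = urllib.request.urlopen(request, timeout=20)
content = response.read()

# Parse the raw bytes with Python's built-in HTML parser and print the tree.
soup = BeautifulSoup(content, 'html.parser')
print(soup)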

# html_doc = """
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title" style="color:red"><b>The Dormouse's story</b></p>
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
# <a href="http://example.com/tillie" rel="external nofollow"  class="sister" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
#
# <p class="story">...</p>
# """
#
# soup = BeautifulSoup(html_doc, 'lxml')
# # print(soup.prettify())
# print(soup.p.attrs)

Key points:
1. Import the library:

from bs4 import BeautifulSoup

The library must be installed first:

pip install beautifulsoup4

2. Initialize: parse the response with BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')

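Here 'html.parser' names the parser. It ships with Python's standard library, runs at a moderate speed, and tolerates malformed markup well, although it is less lenient in versions before Python 2.7.3 and Python 3.2.2. The other parsers are:
lxml: fast and tolerant of malformed markup, but requires a C library to be installed.
xml: fast and the only parser that supports XML, but also requires a C library to be installed.
html5lib: the most tolerant parser; it parses the document the way a browser does and produces valid HTML5, but it is slow. It is pure Python, so it needs no C extension, though it must still be installed separately.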
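Because lxml and html5lib are optional installs, a script can fall back to the built-in parser when they are missing. The following is a minimal sketch; make_soup is a hypothetical helper written for this post, not part of Beautiful Soup, and the preference order is only an assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns a (soup, parser_name) tuple.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser
            # library is not available in the current environment.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately malformed markup: neither <p> is closed.
soup, used = make_soup("<p>Unclosed paragraph<p>Second paragraph")
print(used)                      # which parser was actually chosen
print(len(soup.find_all("p")))   # both paragraphs are still found
```

Different parsers may build different trees from the same broken markup, so it is worth fixing the parser name once a script works as expected.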

3. Features:
Output the document structure with standard indentation:

print(soup.prettify())

Output the first matching tag:

print(soup.title)

Output the content of the first matching tag:

print(soup.title.string)

Output all matching tags:

print(soup.find_all('a'))

Match the first tag by one of its attributes:

print(soup.find(id='link1'))
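The calls above can be tried without any network access by parsing an inline document. This is a small sketch that reuses a shortened version of the sample HTML from the commented-out example; the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title              # first <title> tag in the document
title_text = soup.title.string      # the text inside that tag
links = soup.find_all("a")          # list of every <a> tag
first_link = soup.find(id="link1")  # first tag whose id is "link1"

print(title_tag)
print(title_text)
print([a["href"] for a in links])
print(first_link.get_text())
```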

4. Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

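Tag: has two important attributes, name and attrs. name is the name of the tag (for example p or a). attrs is a dictionary of all of the tag's attributes, such as class, id, and href, not only its class.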

print(soup.p.name)
print(soup.p.attrs)

NavigableString: the text inside a tag

print(soup.p.string)
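To make the difference between the two types concrete, here is a short sketch on a one-line document. The markup is made up for illustration. Note that class can hold several values, so Beautiful Soup stores it as a list.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

markup = '<p class="title big" id="intro"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

print(isinstance(p, Tag))   # soup.p is a Tag object
print(p.name)               # the tag name
print(p.attrs)              # dictionary of every attribute on the tag
print(p["id"])              # index a single attribute like a dict
print(p.get("style"))       # .get() returns None instead of raising KeyError

# <p> has exactly one child (<b>), which has exactly one string,
# so .string walks down and returns that NavigableString.
text = p.string
print(isinstance(text, NavigableString))
print(text)
```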

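BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.
Comment: a special type of NavigableString. Its printed content does not include the comment delimiters, so unless it is handled explicitly, comment text can be mistaken for ordinary text and cause unexpected problems in text processing.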

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> comment = soup.b.string
>>> print(comment)
Hey, buddy. Want to buy a used parser?
>>> type(comment)
<class 'bs4.element.Comment'>
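One way to avoid mistaking a comment for real text is to check the type before using .string. The sketch below assumes a made-up snippet of markup; visible_string is a hypothetical helper, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<div><b><!--only a comment--></b><i>real text</i></div>"
soup = BeautifulSoup(markup, "html.parser")


def visible_string(tag):
    """Hypothetical helper: return tag.string as str, or None for comments."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_string(soup.b))   # the only string in <b> is a comment
print(visible_string(soup.i))   # ordinary text is returned unchanged

# Collect every comment in the document by filtering strings on their type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
```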

5. find_all(name, attrs, recursive, string, limit, **kwargs) (releases before 4.4.0 call the string argument text)

soup.find_all('b')
soup.find_all(id='link1')
soup.find_all(attrs={'id':'link1'})
soup.find_all(class_='sister')
soup.find_all('a', string='Elsie')

The recursive parameter: by default Beautiful Soup searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False.
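The effect of recursive=False is easiest to see on a tag that has matches at two depths. This is a minimal sketch on invented markup.

```python
from bs4 import BeautifulSoup

html = (
    "<div id='outer'>"
    "<p>direct child</p>"
    "<section><p>nested grandchild</p></section>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="outer")

all_p = outer.find_all("p")                      # every descendant <p>
direct_p = outer.find_all("p", recursive=False)  # only <p> directly under <div>

print(len(all_p), len(direct_p))
print(direct_p[0].get_text())
```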

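6. find()
The only difference from find_all() is the return value: find_all() returns a list of every match (a list with one element when only one tag matches), whereas find() returns the first match directly, or None when nothing matches.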
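The different return values matter most when nothing matches, because find() then gives None and any attribute access on it fails. A short sketch on invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1" href="/elsie">Elsie</a>', "html.parser")

found = soup.find("a")                  # a single Tag
found_list = soup.find_all("a")         # a list holding that same tag
missing = soup.find("table")            # None, because nothing matches
missing_list = soup.find_all("table")   # an empty list

# Guard against None before reading attributes from a find() result.
href = found["href"] if found is not None else None
print(href)
print(missing, len(missing_list))
```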

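7. get_text()
If you only want the text inside a tag, use get_text(). It returns all of the text contained in the tag, including the text of its descendants, as a single string.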
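get_text() also accepts a separator and a strip flag, which help when the markup contains stray whitespace. A minimal sketch on invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello, <b>world</b> <i> again </i></p>", "html.parser")

# Default: every piece of text is concatenated exactly as it appears.
raw = soup.p.get_text()

# strip=True trims each piece and drops empty ones;
# separator is placed between the remaining pieces.
clean = soup.p.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```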
