Parsing Web Pages with Beautiful Soup 4

Installing Beautiful Soup 4 and related notes

The latest version of Beautiful Soup is 4.1.1; it can be downloaded here: http://www.crummy.com/software/BeautifulSoup/bs4/download/

 

Documentation:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

 

Usage:

from bs4 import BeautifulSoup

Example:

An HTML document:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Code:

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html_doc)  # an explicit parser can also be given: BeautifulSoup(html_doc, "html.parser")

From here you can start using the various features.

soup.X (where X is any tag name; returns the whole tag, including its attributes, contents, etc.)

e.g. soup.title

    # <title>The Dormouse's story</title>

    soup.p 

    # <p class="title"><b>The Dormouse's story</b></p>

    soup.a  (note: this returns only the first match)

    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a') (find_all returns all matches)

    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 

    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 

    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


    find can also search by attribute

    soup.find(id="link3")

    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
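Beyond id, any attribute can be matched by passing an attrs dictionary, and find() accepts several criteria at once. A minimal sketch against the html_doc defined above (the variable names here are illustrative):

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# attrs matches on any attribute, not just id
sisters = soup.find_all("a", attrs={"class": "sister"})
print(len(sisters))  # 3

# find() with several attributes narrows the search to a single tag
lacie = soup.find("a", attrs={"class": "sister", "id": "link2"})
print(lacie.get_text())  # Lacie
```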


    To get a particular attribute from tags, combine find_all with get

    for link in soup.find_all('a'): 

      print(link.get('href')) 

    # http://example.com/elsie 

    # http://example.com/lacie 

    # http://example.com/tillie


    To extract all of the text from an HTML document, use get_text()

    print(soup.get_text()) 

    # The Dormouse's story 

    #

    # The Dormouse's story 

    #

    # Once upon a time there were three little sisters; and their names were 

    # Elsie, 

    # Lacie and 

    # Tillie; 

    # and they lived at the bottom of a well.

    #

    # ...


    To parse an HTML file from disk, use:

    soup = BeautifulSoup(open("index.html"))



Objects in BeautifulSoup

tag (corresponds to an HTML tag)

tag.attrs (returns all of the tag's attributes as a dictionary)

You can add, delete, and modify a tag's attributes directly, just as you would with a dictionary:

tag['class'] = 'verybold' 

tag['id'] = 1 

tag 

# <blockquote class="verybold" id="1">Extremely bold</blockquote> 


del tag['class'] 

del tag['id'] 

tag 

# <blockquote>Extremely bold</blockquote> 


tag['class'] 

# KeyError: 'class' 

print(tag.get('class')) 

# None
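Like dict.get, Tag.get also accepts a fallback default, which avoids both the KeyError and the None check. A brief sketch (the fragment below is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# A default value is returned when the attribute is missing
print(tag.get("id", "no-id"))  # no-id
```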


X.contents (where X is a tag; returns the tag's children as a list)

e.g.

head_tag = soup.head 

head_tag 

# <head><title>The Dormouse's story</title></head> 

head_tag.contents 

# [<title>The Dormouse's story</title>] 

title_tag = head_tag.contents[0] 

title_tag 

# <title>The Dormouse's story</title> 

title_tag.contents 

# [u'The Dormouse's story']
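When a tag has exactly one string child, the .string shortcut saves the contents[0] step. A sketch against the same head fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
title_tag = soup.title

# .string returns the tag's single NavigableString child directly
print(title_tag.string)       # The Dormouse's story
# .contents holds the same value, wrapped in a list
print(title_tag.contents[0])  # The Dormouse's story
```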


Fixing garbled characters (mojibake) when parsing a page:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.leeon.me')
soup = BeautifulSoup(page, from_encoding="gb18030")

print(soup.original_encoding)
print(soup.prettify())

If a Chinese page is encoded as gb2312 or gbk, passing from_encoding="gb18030" to the BeautifulSoup constructor solves the mojibake problem; even if the page being parsed is actually utf-8, using gb18030 will not produce garbled output.
