Installing Beautiful Soup 4 and related notes
The latest version of Beautiful Soup is 4.1.1; it can be downloaded here (http://www.crummy.com/software/BeautifulSoup/bs4/download/)
Documentation:
(http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
Usage:
from bs4 import BeautifulSoup
Example:
HTML document:
html_doc = """ <html><head><title>The Dormouse's story</title></head> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
Now the various features are available:
soup.X (where X is any tag name; returns the entire first matching tag, including its attributes and contents)
e.g. soup.title
# <title>The Dormouse's story</title>
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.a (note: this returns only the first match)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a') (find_all returns every match)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
find can also search by attribute:
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
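Attribute searches also accept CSS classes and arbitrary attribute dictionaries. A minimal sketch (the one-line document here is a stand-in for html_doc above):

```python
from bs4 import BeautifulSoup

doc = '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
soup = BeautifulSoup(doc, "html.parser")

# "class" is a reserved word in Python, so bs4 uses the class_ keyword
print(soup.find_all("a", class_="sister"))
# Any attribute can also be matched through the attrs dictionary
print(soup.find("a", attrs={"id": "link2"}))
```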
To read a specific attribute from tags, combine find_all with get:
for link in soup.find_all('a'):
print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
To extract all of the text in an HTML document, use get_text():
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
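get_text() also accepts a separator string and a strip flag, which helps when the raw text contains stray whitespace; a small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One</p><p> Two </p>", "html.parser")
# Join each piece of text with "|" and strip leading/trailing whitespace
print(soup.get_text("|", strip=True))  # One|Two
```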
To parse an HTML file opened from disk, use:
soup = BeautifulSoup(open("index.html"), "html.parser")
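A self-contained variant of the same idea: a context manager closes the file, and an explicit parser ("html.parser" from the standard library) keeps the result reproducible. The temporary file below exists only to make the sketch runnable:

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a tiny page to a temporary file so the example can actually run
path = os.path.join(tempfile.gettempdir(), "index.html")
with open(path, "w", encoding="utf-8") as fp:
    fp.write("<html><head><title>Hello</title></head></html>")

# A file object works just like a string in the constructor
with open(path, encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")
print(soup.title.string)  # Hello
```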
Objects in BeautifulSoup
tag (corresponds to an HTML tag)
tag.attrs (returns all of a tag's attributes as a dictionary)
A tag's attributes can be added, deleted, and modified directly, just like dictionary entries:
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
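The transcript above starts from a tag that is never constructed; a self-contained sketch that reproduces the same steps:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

tag["class"] = "verybold"  # add attributes like dictionary entries
tag["id"] = 1
print(tag)

del tag["class"]           # delete like dictionary keys
print(tag.get("class"))    # None -- get() avoids the KeyError
```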
X.contents (where X is a tag; returns the tag's direct children as a list)
e.g.
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u"The Dormouse's story"]
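When a tag has exactly one text child, .string is a convenient shortcut for .contents[0]; a sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
title_tag = soup.title

# .string is the single text child -- the same NavigableString as .contents[0]
print(title_tag.string)  # The Dormouse's story
print(title_tag.string == title_tag.contents[0])  # True
```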
Fixing garbled text (mojibake) when parsing a web page:
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen('http://www.leeon.me')
# bs4 spells the parameter from_encoding (Beautiful Soup 3 called it fromEncoding)
soup = BeautifulSoup(page, "html.parser", from_encoding="gb18030")

print(soup.original_encoding)
print(soup.prettify())
If a Chinese page is encoded as gb2312 or gbk, passing from_encoding="gb18030" to the BeautifulSoup constructor resolves the mojibake, since gb18030 is a superset of both. If the page is actually UTF-8, however, forcing gb18030 onto the bytes can itself garble the text, so when the encoding is unknown it is safer to omit from_encoding and let Beautiful Soup auto-detect.
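The same from_encoding behaviour can be observed without any network access by feeding the constructor GB18030-encoded bytes directly (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Simulate a page served as GB18030 bytes
raw = "<html><body><p>編碼測試</p></body></html>".encode("gb18030")

soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")
print(soup.original_encoding)
print(soup.p.string)  # 編碼測試
```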