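Installing and Using BeautifulSoup

BeautifulSoup is a great little tool.

Official site: http://www.crummy.com/software/BeautifulSoup/

Download page: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/ (the 4.1.2 source package is attached to this post).

Documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html . This is a Chinese translation, but it is dated: it covers version 3.0, so it is of limited use.

I recommend reading http://www.crummy.com/software/BeautifulSoup/bs4/doc/ instead. It is the official English documentation for Beautiful Soup 4, and it is much clearer and more pleasant to read.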

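If you want to read web pages in Python with jQuery-style queries, without being tripped up by sloppy markup and malformed tags, BeautifulSoup is a good choice. According to the official site, BeautifulSoup is a library for screen-scraping projects, designed to save you the work of handling HTML tags and pulling information off the web. A few features make it especially powerful: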

1: Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application. Keywords: Pythonic style, simple methods.

2: Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding. Keywords: encoding conversion. Anyone who uses Python knows how tedious encodings can be, and BeautifulSoup simplifies this.

3: Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. Keywords: works with other HTML parsers, so you can swap them at will.
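The second point is easy to try out. Here is a minimal sketch, assuming Python 3 and the built-in html.parser; the Latin-1 sample bytes are made up for illustration, and the encoding is passed explicitly because the snippet carries no declaration of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no encoding declaration.
raw = "<h1>Sacré bleu!</h1>".encode("latin-1")

# Tell Beautiful Soup the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

heading = soup.h1.string            # decoded to a Unicode string
utf8_bytes = soup.encode("utf-8")   # the document re-encoded as UTF-8 bytes

print(heading)
print(utf8_bytes)
```

Whatever the input encoding, the strings you read from the tree are Unicode, and encode() gives you bytes in the encoding you ask for.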

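If these features have caught your interest, let's look at installing BeautifulSoup first:

Installation methods:

1: apt-get install python-bs4

2: easy_install beautifulsoup4

3: pip install beautifulsoup4

4: From source: python setup.py install

Pick the method that suits your operating system. All of them work, and they differ only in the tool used. On my own system I used the fourth method, so here is a brief walkthrough: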

Shell code

curl -o beautifulsoup4-4.1.2.tar.gz http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/beautifulsoup4-4.1.2.tar.gz
tar zxvf beautifulsoup4-4.1.2.tar.gz
cd beautifulsoup4-4.1.2
python setup.py install

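OK, you will see the installation output, ending with a message that the install succeeded.

With the install done, you are surely eager to try it out. You open a Python prompt and happily type:

Python code

from BeautifulSoup import BeautifulSoup

Sorry: ImportError. Why is there an import error when everything is installed? A second look at the official site explains it: what I installed is BeautifulSoup 4.1, and this import is the 3.x form. Reopen the prompt and type:

Python code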

from bs4 import BeautifulSoup

Hmm, no output. Congratulations, the BeautifulSoup package has been imported successfully.
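If you want something more reassuring than silence, a quick smoke test helps. This is just a sketch, assuming Python 3 and the built-in html.parser; the inline snippet is made up.

```python
import bs4
from bs4 import BeautifulSoup

# bs4.__version__ holds the installed version string.
version = bs4.__version__
print(version)

# Parse a tiny inline snippet to confirm the package actually works.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```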

Following my previous post, http://isilic.iteye.com/blog/1733560 , I wanted to try the dir command to see which methods BeautifulSoup provides:

Python code

dir(BeautifulSoup)

That prints a big pile of names, which is a bit overwhelming. Listing them one per line makes them much easier to read.

Python code

>>> for method in dir(BeautifulSoup):
...     print(method)
...

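Look closely at the findXxx, nextXxx, and previousXxx methods. They provide traversal, backtracking, searching, and matching over an HTML page, and they are already enough to extract information from a page.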
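To get a feel for these methods without touching the network, here is a small sketch on a made-up menu snippet. It assumes Python 3 and the built-in html.parser, and uses the underscore method names.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<ul id="menu">
  <li class="first"><a href="/news">News</a></li>
  <li><a href="/maps">Maps</a></li>
  <li><a href="/video">Video</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li", class_="first")
next_li = first.find_next_sibling("li")    # next <li>, skipping whitespace text
parent = first.find_parent("ul")           # enclosing <ul>
last_link = soup.find_all("a")[-1]
prev_link = last_link.find_previous("a")   # nearest earlier <a> in the document

print(next_li.a.string)
print(parent["id"])
print(prev_link["href"])
```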

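Let's take the Baidu home page as an example and try out what BeautifulSoup can do.

Python code

>>> import urllib2  # Python 2 module; on Python 3 use urllib.request instead
>>> page = urllib2.urlopen('http://www.baidu.com')
>>> soup = BeautifulSoup(page)
>>> print(soup.title)
>>> soup.title.string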

The result looks good. A hello-world tutorial really is satisfying.
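For readers on Python 3, roughly the same thing might look like the sketch below. The helper names page_title and fetch_title are my own, not part of BeautifulSoup, and the sample markup is made up so the parsing part can be tried offline.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def page_title(markup):
    """Return the text of <title>, or None if the page has no title."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.title.string if soup.title else None


def fetch_title(url):
    """Download a page and return its title (needs network access)."""
    with urlopen(url, timeout=10) as page:
        return page_title(page.read())


# Offline check with made-up markup.
sample = "<html><head><title>Example page</title></head><body></body></html>"
print(page_title(sample))
```

Calling fetch_title('http://www.baidu.com') would then play the role of the session above.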

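Going a step further, I want to find every link on the Baidu home page. That sounds hard, with all sorts of regular expressions and special handling. But wait, we are talking about BeautifulSoup here, so let's see how it handles this.

Python code

>>> for link in soup.find_all('a'):
...     print(link.get('href'))  # get() returns None when an <a> has no href
...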

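See the output? Simple, isn't it?

For anyone used to jQuery and CSS, though, this style is torture: you have to keep looping over whatever was selected. The output above also contains many entries such as # that are not real URLs. To filter those out, the select syntax makes it easy.

Python code

>>> for link in soup.select('a[href^="http"]'):
...     print(link['href'])
...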

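Some will ask: can't I just inspect each URL and handle it myself? Of course you can. I only wanted to try out select here. As for the selector syntax itself, you can look it up on your own. To be fair, select deserves an article of its own.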
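As a small taste of select, here is a sketch with a few common selector forms on made-up markup. It assumes Python 3, the built-in html.parser, and a bs4 install with CSS selector support.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div id="nav">
  <a class="ext" href="http://example.com/a">A</a>
  <a href="#">top</a>
  <a href="/duty">duty</a>
</div>
<p><a class="ext" href="https://example.org/b">B</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute prefix match: href values starting with "http".
http_links = [a["href"] for a in soup.select('a[href^="http"]')]
# Descendant selector: every <a> inside the element with id="nav".
nav_links = [a["href"] for a in soup.select("#nav a")]
# Class selector: every <a> with class "ext".
ext_texts = [a.get_text() for a in soup.select("a.ext")]

print(http_links)
print(nav_links)
print(ext_texts)
```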

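Going further still: links such as / or /duty are meaningful, since they are absolute paths relative to this site, so how do we keep these /-prefixed links from being filtered out? How do we keep full absolute URLs from being filtered out as well? How do we filter out an href attribute that holds javascript? If CSS and JS files matter, how do we find their URLs too? And beyond that, how do we work out the addresses of the ajax requests inside the JS? These are all requirements that could be explored further.

All right, I admit that this last batch of URL filtering goes beyond what BeautifulSoup itself can do, but from a purely functional point of view it all needs considering. It is enough to think through how each piece could be implemented. If that leads you to study further, count it as a bonus from this article.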
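Some of these questions can be answered by pairing BeautifulSoup with the standard library. The sketch below is one possible approach, not the only one: collect_urls is a hypothetical helper of my own, the sample markup is made up, and it does nothing about ajax addresses buried inside JS.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def collect_urls(markup, base_url):
    """Collect http(s) URLs from <a>, <link> and <script> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    # (tag name, attribute) pairs that can carry a URL.
    for tag_name, attr in (("a", "href"), ("link", "href"), ("script", "src")):
        for tag in soup.find_all(tag_name):
            value = (tag.get(attr) or "").strip()
            if not value or value.startswith("#"):
                continue  # missing attribute or in-page anchor
            absolute = urljoin(base_url, value)  # resolves / and relative paths
            if urlparse(absolute).scheme in ("http", "https"):
                found.append(absolute)  # drops javascript: and similar schemes
    return found


sample = """
<link rel="stylesheet" href="/css/site.css">
<script src="js/app.js"></script>
<a href="/">home</a>
<a href="/duty">duty</a>
<a href="#">top</a>
<a href="javascript:void(0)">noop</a>
<a href="http://example.com/x">x</a>
"""
urls = collect_urls(sample, "http://www.baidu.com/")
for url in urls:
    print(url)
```

Resolving every value against the page URL first means / and /duty survive the filter, while # and javascript: entries are dropped.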

Here is a quick pass over what BeautifulSoup offers:

Python code

DEFAULT_BUILDER_FEATURES
FORMATTERS
ROOT_TAG_NAME
STRIP_ASCII_SPACES: built-in attributes of BeautifulSoup
__call__
__class__
__contains__
__delattr__
__delitem__
__dict__
__doc__
__eq__
__format__
__getattr__
__getattribute__
__getitem__
__hash__
__init__
__iter__
__len__
__module__
__ne__
__new__
__nonzero__
__reduce__
__reduce_ex__
__repr__
__setattr__
__setitem__
__sizeof__
__str__
__subclasshook__
__unicode__
__weakref__
_all_strings
_attr_value_as_string
_attribute_checker
_feed
_find_all
_find_one
_lastRecursiveChild
_last_descendant
_popToTag: internal methods of BeautifulSoup; using them requires a deeper understanding of Python.
append: modifies the element tree
attribselect_re
childGenerator
children
clear: removes the contents of a tag
decode
decode_contents
decompose
descendants
encode
encode_contents
endData
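extract: this method is important and is covered below
fetchNextSiblings: following sibling elements
fetchParents: set of parent elements
fetchPrevious: previous elements
fetchPreviousSiblings: preceding sibling elements. These methods search the parents and siblings of the current element.
find: returns only the first match (the same as limit=1)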
findAll
findAllNext
findAllPrevious
findChild
findChildren: set of children
findNext: next element
findNextSibling: next sibling
findNextSiblings: all following siblings
findParent: parent element
findParents: set of all parent elements
findPrevious
findPreviousSibling
findPreviousSiblings: these methods search by traversing the current element and its children.
find_all_next
find_all_previous
find_next
find_next_sibling
find_next_siblings
find_parent
find_parents
find_previous
find_previous_sibling
find_previous_siblings: these underscore names are the bs4 method names and are the recommended ones
format_string
get
getText
get_text: returns the text inside a tag, excluding the tags themselves and their attributes
handle_data
handle_endtag
handle_starttag
has_attr
has_key
index
insert
insert_after
insert_before: modifies the element tree
isSelfClosing
is_empty_element
new_string
new_tag
next
nextGenerator
nextSibling
nextSiblingGenerator
next_elements
next_siblings
object_was_parsed
parentGenerator
parents
parserClass
popTag
prettify: pretty-prints the HTML document
previous
previousGenerator
previousSibling
previousSiblingGenerator
previous_elements
previous_siblings
pushTag
recursiveChildGenerator
renderContents
replaceWith
replaceWithChildren
replace_with
replace_with_children: modifies the contents of elements in the element tree
reset
select: selection using jQuery/CSS-style selector syntax.
setup
string
strings
stripped_strings
tag_name_re
text
unwrap
wrap

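Note that some BeautifulSoup methods come in two spellings, camelCase and underscore, with the same meaning. This is for compatibility with 3.x: the camelCase form is the 3.x spelling and the underscore form is the 4.x spelling. Prefer the latter, the underscore methods.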
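A quick way to convince yourself that the two spellings behave the same is the sketch below, assuming Python 3 and the built-in html.parser. The warnings filter is there because newer bs4 releases may emit a deprecation warning for the old names.

```python
import warnings

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>two</b></p>", "html.parser")

# 4.x spelling.
new_style = soup.find_all("b")

# 3.x spelling, kept for compatibility.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    old_style = soup.findAll("b")

same = [t.string for t in new_style] == [t.string for t in old_style]
print(same)
```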

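With these methods you should be able to traverse, extract from, modify, and normalize a document. If you get to use BeautifulSoup in your work, your understanding will certainly deepen.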
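To make that concrete, here is a short sketch that touches a few of the methods from the list: get_text, new_tag, append, replace_with, and prettify. The markup is made up, and it assumes Python 3 and the built-in html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p class="x">Hello <b>world</b></p></div>', "html.parser"
)

# Extract: text only, without tags or attributes.
text = soup.get_text()

# Modify: build a new tag and append it to the <div>.
new_link = soup.new_tag("a", href="http://example.com/")
new_link.string = "link"
soup.div.append(new_link)

# Modify: swap the <b> element for a plain string.
soup.b.replace_with("there")

# Normalize: one tag per line, indented.
pretty = soup.prettify()

print(text)
print(soup.div)
print(pretty)
```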

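BeautifulSoup supports several parsers. The default is HTML parsing, and there are also the xml parser, the lxml parser, the html5lib parser, and html.parser. Each one needs the corresponding parser library to be available: html.parser ships with Python, while lxml and html5lib are installed separately.

html, the default parser

Python code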

BeautifulSoup("<a></a>")
# <html><head></head><body><a></a></body></html>

xml parser

Python code

BeautifulSoup("<a></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a></a>

lxml parser

Python code

BeautifulSoup("<a>", "lxml")
# <html><body><a></a></body></html>

html5lib parser

Python code

BeautifulSoup("<a>", "html5lib")
# <html><head></head><body><a></a></body></html>

html.parser

Python code

BeautifulSoup("<a>", "html.parser")
# <a></a>

The examples above show how the parsers differ.
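If you are not sure which parsers are installed on your machine, a sketch like the following runs the same markup through each and reports the missing ones. parse_with is a hypothetical helper of my own; it assumes Python 3.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the parsed document as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


results = {
    name: parse_with("<a>", name)
    for name in ("html.parser", "lxml", "html5lib")
}
for name, output in results.items():
    print(name, "->", output if output is not None else "not installed")
```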

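When BeautifulSoup parses a document, it loads the whole thing into memory as one large, dense tree. If all you want from that structure is a single string, keeping all of that data in memory is wasteful. Moreover, when you hold a reference to a Tag, that Tag references other Tag objects, so every part of the tree has to be kept, which means the whole tree stays in memory. The extract method breaks those links by detaching part of the tree. If you extract the one Tag you need, the rest of the tree is cut loose from it and can be reclaimed by the garbage collector. It works the other way too: if you do not care about some part of the document, you can extract it from the tree and leave it to the garbage collector, which saves memory while still getting your job done.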
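Both uses of extract can be sketched in a few lines. The markup and ids are made up, and this assumes Python 3 and the built-in html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><div id='ads'>noise</div><p id='keep'>content</p></body></html>",
    "html.parser",
)

# Case 1: throw away a part of the document you do not care about.
ads = soup.find(id="ads").extract()   # extract() returns the detached tag
remaining = str(soup.body)

# Case 2: keep only the tag you need, detached from the rest of the tree.
wanted = soup.find(id="keep").extract()

print(remaining)
print(wanted)
print(ads.parent, wanted.parent)
```

Once wanted is detached, dropping every other reference to soup leaves the rest of the tree free to be collected.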

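As Leonard, the author of BeautifulSoup, puts it, he wrote BeautifulSoup to save other people time and reduce their workload. Once you get used to BeautifulSoup, you can handle the content of many sites very quickly. That is the spirit of open source: automate as much work as possible and cut the workload. In a sense programmers ought to be fairly lazy, and it is exactly that laziness that drives the software industry forward.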
