[Python] - Web Scraping: Basic Usage of Beautiful Soup

Introduction to Beautiful Soup

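Beautiful Soup is a Python library for extracting data from HTML and XML documents. Its main use is scraping data from web pages.

The official description is as follows:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.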

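Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it automatically. In that case, you only need to state the original encoding.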
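A minimal sketch of stating the original encoding yourself, assuming the input arrives as Latin-1 bytes with no charset declaration. The sample markup and variable names are made up for illustration; `from_encoding` and `original_encoding` are part of the Beautiful Soup API.

```python
from bs4 import BeautifulSoup

# Latin-1 bytes with no <meta charset> declaration, so nothing in the
# document itself tells the parser how to decode them.
raw = "<p>café</p>".encode("latin-1")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string            # already decoded to Unicode
print(text)
print(soup.original_encoding)   # the encoding Beautiful Soup ended up using

# Asking for bytes gives UTF-8 output.
utf8_bytes = soup.encode("utf-8")
print(utf8_bytes)
```

When the input is already a Python str, no decoding takes place and from_encoding is not needed.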

Beautiful Soup works on top of popular Python parsers such as lxml and html5lib, so you can try different parsing strategies or trade flexibility for speed.

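Installing Beautiful Soup

Command-line installation

You can install it with pip or easy_install; either of the following commands works. Note that easy_install is deprecated in current setuptools releases, so pip is the safer choice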

easy_install beautifulsoup4
# or
pip3 install beautifulsoup4

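Installing from the source package

After the download finishes, extract the archive: Beautiful Soup 4.3.2

Then run the following command to complete the installation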

sudo python setup.py install

Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml

$ easy_install lxml
or
$ pip install lxml
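Because lxml is an optional install, a script may want to fall back to the built-in parser when it is missing. The following is a rough sketch of one way to do that; `make_soup` is a hypothetical helper name, not part of Beautiful Soup, while `FeatureNotFound` is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml if it is installed, otherwise use the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from invalid markup, so it is worth testing with the parser you expect to run in production.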

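Basic usage

The official documentation is linked below. It is long and not especially well organized, so this post walks through a selection of commonly used features
Official documentation

The BeautifulSoup object

Pass a document to the BeautifulSoup constructor to get a document object. You can pass in a string or a file handle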

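from bs4 import BeautifulSoup
# Open an HTML file; naming the parser explicitly avoids the "no parser specified" warning
with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# Parse an HTML string
soup = BeautifulSoup('<html>data</html>', 'html.parser')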

Kinds of objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment
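To see all four kinds side by side, here is a small sketch that parses a made-up snippet containing text, a comment, and a nested tag, then prints the class name of each node. The markup is an illustration only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p>text<!-- a note --><b>bold</b></p>", "html.parser")

# The document itself is a BeautifulSoup object.
root_kind = type(soup).__name__
print(root_kind)

# The children of <p> are a string, a comment, and a tag, in that order.
kinds = [type(node).__name__ for node in soup.p.contents]
print(kinds)

# Comment is a special kind of NavigableString.
comment = soup.p.contents[1]
print(isinstance(comment, Comment), isinstance(comment, NavigableString))
```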

Tag

What is a Tag? Put simply, it is one of the tags in the HTML
A Tag object corresponds to a tag in the original XML or HTML document

>>> soup = BeautifulSoup('<b class="boldest">Extremely</b>', "lxml")
>>> tag = soup.b
>>> tag
<b class="boldest">Extremely</b>
>>> type(tag)
<class 'bs4.element.Tag'>

The most important attributes of a tag:

Name: every tag has a name, available as .name
Attributes: the tag's attributes, which are manipulated the same way as a dictionary

Getting and modifying the `name` attribute

# Get the tag's name
>>> tag.name
'b'
# Change the tag's name
# Changing a tag's name is reflected in any HTML generated from the current Beautiful Soup object
>>> tag.name = "blockquote"
>>> tag
<blockquote class="boldest">Extremely</blockquote>

Working with `Attributes`

A tag can have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest"

A tag's attributes can be read, added, removed, or modified in the same way as dictionary entries

# Get the value of the tag's class attribute; a list is returned
>>> tag['class']
['boldest']

>>> tag['class'][0]
'boldest'

>>> tag.attrs
{'class': ['boldest']}

# Modify the tag's class attribute and add an id attribute
>>> tag['class'] = 'mazy'
>>> tag['id'] = 1
>>> tag
<blockquote class="mazy" id="1">Extremely</blockquote>

# Delete the tag's class attribute
>>> del tag['class']
>>> tag
<blockquote id="1">Extremely</blockquote>

# Delete the tag's id attribute
>>> del tag['id']
>>> tag
<blockquote>Extremely</blockquote>
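Since attributes behave like dictionary entries, reading one that does not exist raises KeyError. The sketch below shows the usual ways around that, using the same `<b class="boldest">` tag as above; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely</b>', "html.parser")
tag = soup.b

# Indexing a missing attribute raises KeyError, just like a dict.
try:
    tag["id"]
    missing_raises = False
except KeyError:
    missing_raises = True

# .get() returns None (or the default you supply) instead of raising.
tag_id = tag.get("id")
tag_lang = tag.get("lang", "en")

# has_attr() tests for presence without reading the value.
has_class = tag.has_attr("class")

print(missing_raises, tag_id, tag_lang, has_class)
```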

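Multi-valued attributes

HTML defines several attributes that can hold more than one value. The most common is class (a tag can have several CSS classes). In Beautiful Soup, a multi-valued attribute is returned as a list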

>>> css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
>>> css_soup.p['class']
['body', 'strikeout']

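If an attribute looks like it has more than one value but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string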

>>> id_soup = BeautifulSoup('<p id="my id"></p>', 'lxml')
>>> id_soup.p['id']
'my id'
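When code has to handle both cases uniformly, `get_attribute_list()` always returns a list, whatever the attribute. A short sketch on a made-up tag, which also shows that editing the class list changes the rendered markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

# get_attribute_list() wraps single-valued attributes in a list too.
classes = p.get_attribute_list("class")
ids = p.get_attribute_list("id")
print(classes, ids)

# The class value is an ordinary list, so it can be edited in place;
# the values are joined with spaces again when the tag is rendered.
p["class"].append("wide")
rendered = str(p)
print(rendered)
```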

NavigableString

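Strings are usually contained inside tags. Beautiful Soup wraps the string inside a tag in the NavigableString class

The string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method

>>> tag
<blockquote>Extremely</blockquote>

>>> tag.string
'Extremely'

>>> type(tag.string)
<class 'bs4.element.NavigableString'>

# replace_with() returns the string that was replaced
>>> tag.string.replace_with('No longer bold')
'Extremely'
>>> tag
<blockquote>No longer bold</blockquote>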
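One thing that often surprises newcomers is that `.string` only works when a tag has exactly one child. The sketch below, on a made-up snippet, contrasts `.string` with `get_text()` and shows how to turn a NavigableString into a plain Python str.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string       # <b> has one string child
mixed = soup.p.string        # <p> has two children, so .string is None
all_text = soup.p.get_text() # concatenates every string under <p>

# NavigableString keeps a reference to the whole tree; str() gives a
# plain string that can be stored without holding on to the tree.
plain = str(single)

print(repr(single), mixed, repr(all_text), type(plain).__name__)
```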

Traversing the document tree

Sample document used in the examples:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Creating the `BeautifulSoup` document object

>>> soup = BeautifulSoup(html_doc, 'html.parser')
# The soup object
>>> soup
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Child nodes

A Tag may contain strings or other Tags; these are all children of that Tag

Navigating by `Tag` name

# Get the head tag
>>> soup.head
<head><title>The Dormouse's story</title></head>

# Get the title tag
>>> soup.title
<title>The Dormouse's story</title>

# A handy trick: you can chain tag names repeatedly to walk down the tree. The code below gets the first <b> tag inside <body>
>>> soup.body.b
<b>The Dormouse's story</b>

# Get the first a tag
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# To get all the <a> tags, or anything beyond the first tag with a given name, use find_all()
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The `.contents` and `.children` attributes of a `Tag`

A Tag's .contents attribute returns its child nodes as a list
Strings have no .contents attribute because a string has no children

# Get the contents of the head tag
>>> soup.head.contents
[<title>The Dormouse's story</title>]

# Get the first element of the head tag's contents
>>> soup.head.contents[0]
<title>The Dormouse's story</title>

# Get the contents of that first element
>>> soup.head.contents[0].contents
["The Dormouse's story"]

The .children generator lets you loop over a Tag's child nodes

for child in soup.body.p.children:
    print(child) 

# <b>The Dormouse's story</b>
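Both .contents and .children look only at direct children. As a rough illustration of the difference from a full walk of the subtree, the sketch below compares .children with .descendants on the `<head>` element from the sample document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)

# .children yields direct children only: just the <title> tag.
children = list(soup.head.children)

# .descendants walks the whole subtree: the <title> tag, then its string.
descendants = list(soup.head.descendants)

print(len(children), len(descendants))
```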

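Searching the document tree

Beautiful Soup defines many search methods. This post focuses on two of them:

find()
find_all()

With find_all() and similar methods you can locate the content you are looking for in the document

String search

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for content that matches the string exactly (here, tag names)

The following example finds all <b> tags in the document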

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]

Regular expression search

import re

# The following example finds all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# body
# b

List search

If you pass in a list, Beautiful Soup returns everything that matches any item in the list

The following code finds all <a> tags and <b> tags in the document:

>>> soup.find_all(['a','b'])

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

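Method / function search

If none of the other filters fits, you can define a function that takes an element as its only argument. The function should return True if the element matches and False otherwise

The following function checks the current element and returns True if it has both a class attribute and an id attribute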

def has_class_and_id(tag):
    return tag.has_attr('class') and tag.has_attr('id')

# Passing this function to find_all() returns all the <a> tags:
result = soup.find_all(has_class_and_id)
print(result)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
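For simple attribute checks, a custom function is often more than is needed: find_all() also accepts attribute filters as keyword arguments. The sketch below uses a trimmed-down copy of the sample document; `class_` (with the trailing underscore) is the keyword Beautiful Soup uses because class is a reserved word in Python.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find_all(id="link2")                   # filter on one attribute
by_class = soup.find_all("a", class_="sister")      # tag name plus CSS class
by_href = soup.find_all(href=re.compile("elsie"))   # regex on an attribute value
by_attrs = soup.find_all("a", attrs={"id": "link1"})  # dict form of the same idea

print([a.string for a in by_id], len(by_class), by_href[0]["id"], by_attrs[0].string)
```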

Using find()

find( name , attrs , recursive , string , **kwargs )

Calling find_all() with limit=1 is less convenient than calling find() directly.

The following two calls are nearly equivalent:

>>> soup.find_all('title', limit=1)
[<title>The Dormouse's story</title>]

# nearly equivalent to
>>> soup.find('title')
<title>The Dormouse's story</title>

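Differences between `find_all()` and `find()`:

With limit=1, find_all() returns a list containing the single match, while find() returns the match itself
When nothing matches, find_all() returns an empty list, while find() returns None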
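Because find() returns None when nothing matches, chaining straight onto its result can raise AttributeError. A minimal sketch of guarding against that, on a made-up document with no `<h1>` or `<table>`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Story</title></head></html>", "html.parser"
)

table = soup.find("table")   # no match: None
rows = soup.find_all("tr")   # no match: empty list, safe to loop over

# Check for None before using the result of find().
heading = soup.find("h1")
heading_text = heading.get_text() if heading is not None else ""

title_text = soup.find("title").get_text()

print(table, rows, repr(heading_text), title_text)
```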

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax

>>> soup.select('title')
[<title>The Dormouse's story</title>]

Searching level by level through `tag` names

>>> soup.select('body a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> soup.select('html head title')
[<title>The Dormouse's story</title>]

Finding direct children of a `tag`

>>> soup.select('head > title')
[<title>The Dormouse's story</title>]

>>> soup.select('p > a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> soup.select('p > #link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

>>> soup.select('body > a')
[]

Searching by `CSS` class name

>>> soup.select('.sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Searching by a `tag`'s `id`

>>> soup.select('#link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

>>> soup.select('a#link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Matching any of several `CSS` selectors at once

>>> soup.select('#link1, #link2')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Returning only the first matching element

>>> soup.select_one('.sister')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
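In a scraper, the selected tags are usually just a step toward the data inside them. The sketch below, on a trimmed-down copy of the sample document, pulls link text and href values out of select() results and guards against select_one() returning None; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's text to its href, using a child combinator plus a class.
links = {a.get_text(): a["href"] for a in soup.select("p.story > a.sister")}

# select_one() returns None when nothing matches, like find().
missing = soup.select_one("#no-such-id")
first = soup.select_one(".sister")
first_href = first["href"] if first is not None else None

print(links, missing, first_href)
```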