1. Introduction to Beautiful Soup

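Beautiful Soup takes its name from Alice's Adventures in Wonderland. It is a Python library for extracting data from HTML and XML files. The documentation referenced here covers version 4.4.0. Beautiful Soup 3 is no longer developed, and the project recommends Beautiful Soup 4 (BS4 for short). Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/. I have to say that its readability beats a certain lx**'s docs hands down.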

Beautiful Soup is mainly used to pull data out of web pages. Its three main features:

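Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract, so a complete application takes very little code;
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You only need to specify the original encoding when the document does not declare one and Beautiful Soup fails to detect it;
Beautiful Soup sits on top of Python parsers such as lxml and html5lib, so parsing is efficient and you can flexibly choose among different parsing strategies.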
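The encoding behaviour in the second point can be sketched as follows. This is a minimal example under stated assumptions: the GB2312 byte string is made up for illustration, and from_encoding is the constructor argument used to name the original encoding when detection cannot be relied on.

```python
from bs4 import BeautifulSoup

# Illustrative input: bytes in a legacy encoding with no charset declaration.
raw = "<p>小王子</p>".encode("gb2312")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb2312")

text = soup.p.string                 # already decoded to Unicode
utf8_bytes = soup.p.encode("utf-8")  # re-encoded as UTF-8 on output

print(text)
print(utf8_bytes.decode("utf-8"))
```

When the input is already a Python str, no decoding takes place and from_encoding is not needed.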

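Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. The options are: the standard library's html.parser, the lxml HTML parser, the lxml XML parser, and html5lib. lxml is the recommended choice because it is fast.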
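Since lxml is a separate install, one way to prefer it without making it a hard requirement is to fall back to the built-in parser. The sketch below assumes a hypothetical helper named make_soup (not part of Beautiful Soup); FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: parse with lxml if installed, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser from the standard library.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from invalid markup, so results on broken pages may vary between the two branches.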

Download: https://www.crummy.com/software/BeautifulSoup/

Install Beautiful Soup:

$ pip install beautifulsoup4

Or in PyCharm: File → Settings → Project: Python Notes → Project Interpreter → search for "beautifulsoup4" → Install Package.

2. Basic usage of Beautiful Soup

Let's practice on the page source of the Douban book entry for The Little Prince (小王子):

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : beausoup_dbbook.py
# @Project: Python Notes
# @CreateTime : 2020/5/8 15:08:46

from bs4 import BeautifulSoup

html_doc = """<li class="subject-item"> <div class="pic"> <a 
class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width 
="90"> </a> </div> <div class="info"> <h2 class=""> <a 
href="https://book.douban.com/subject/1084336/" title="小王子" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div 
class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> ( 
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p> 
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a 
href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div> </li>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the link of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))
# Get all the text in the document
print(soup.get_text())
# Print with standard indentation
print(soup.prettify())

"Run" output (screenshot not preserved).

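2.1 Kinds of objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

(1) A Tag object corresponds to a tag in the HTML. Tag has many methods and attributes; the two most important are name and attrs (its attributes).

Every Tag has a name, available as .name. Accessing a tag by its name returns the first tag with that name, for example the first <p> tag: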

print(soup.p)
# Output:
<p>小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>

The attrs attribute holds the attributes of a Tag as a dictionary. A tag can have any number of attributes, and you work with them exactly as you would with a dict:

print(soup.a.attrs)
# Output:
{'class': ['nbg'], 'href': 'https://book.douban.com/subject/1084336/', 'onclick': "moreurl(this,{i:'0',query:'',subject_id:'1084336', \nfrom:'book_subject_search'})"}

(2) NavigableString: strings usually sit inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class, and .string returns it.

print(soup.p.string)
# Output:
小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...

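A NavigableString supports most of the attributes described under navigating and searching the tree. A string cannot contain anything else (a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.

To use a NavigableString outside Beautiful Soup, convert it to a plain Unicode string with str() (unicode() in Python 2). Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you are done with Beautiful Soup, which wastes memory.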
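The conversion can be sketched like this, using a small made-up document. The point is that the NavigableString still knows its place in the tree (it has a .parent), while the converted value is an ordinary str with no link back to the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

node = soup.b.string   # NavigableString, still attached to the parse tree
plain = str(node)      # plain Python string, detached from the tree

print(type(node).__name__, node.parent.name)
print(type(plain).__name__, hasattr(plain, "parent"))
```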

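(3) BeautifulSoup: this object represents the document as a whole. Most of the time you can treat it as a Tag object; it supports most of the methods described under navigating and searching the tree. Because a BeautifulSoup object is not a real HTML or XML tag, it has no name or attributes of its own. Looking at .name is sometimes convenient, though, so it is given the special .name value "[document]".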

print(type(soup.span))
print(soup.name)
print(soup.attrs)
# Output:
<class 'bs4.element.Tag'>
[document]
{}

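(4) Comment: a special type of NavigableString. Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain, such as comments. When a Comment appears as part of an HTML document, it is printed with special formatting: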

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
# Use a separate name so that soup still refers to html_doc in later examples
comment_soup = BeautifulSoup(markup, 'html.parser')
print(comment_soup.b.prettify())
# Output:
<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
print(type(comment_soup.b.string))
# Output:
<class 'bs4.element.Comment'>

If you do not want comments and other special strings from the page, test for them and filter them out, for example: if isinstance(soup.a.string, bs4.element.Comment).
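One way to apply that check across a whole tag is to walk its descendants and separate comments from ordinary text. This is a sketch on a made-up snippet, not the only way to do it.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Comment is a subclass of NavigableString, so test for it explicitly.
visible = [str(node) for node in soup.p.descendants
           if isinstance(node, NavigableString) and not isinstance(node, Comment)]
comments = [str(node) for node in soup.p.descendants
            if isinstance(node, Comment)]

print(visible)
print(comments)
```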

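2.2 Navigating the tree

2.2.1 Child nodes

A Tag may contain strings and other Tags; these are its children. Beautiful Soup offers many attributes for navigating and iterating over children. The simplest way to navigate the tree is to name the tag you want:

To get the first <span> tag: soup.span;
To get the first <img> tag under the <li> tag: soup.li.img;
Attribute access like this only returns the first tag with that name. To get all <a> tags, use soup.find_all('a'); find_all() is covered under searching the tree (section 2.3).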

(1) .contents and .children

A tag's .contents attribute returns its children as a list:

print(soup.a.contents)
# Output:
[' ', <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/>, ' ']

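As shown above, .contents gives a list of the direct children. .children returns an iterator over the same children, so you can loop over them. In html_doc, for example, the first <a> node contains an <img> child, and a for loop prints each child in turn.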

# .children is an iterator, not a list
print(soup.a.children)
for child in soup.a.children:
    print(child)
# Output:
<list_iterator object at 0x000001E95CFB7A90>
 
<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/>

(2) .descendants

Continuing from above: .contents and .children only cover a tag's direct children. To visit all descendants, use .descendants. The following walks everything under the first <div>:

for child in soup.div.descendants:
    print(child)
# Output:
<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>
 
<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/>

(3) .string

If a tag has exactly one child and that child is a NavigableString, .string returns it. If a tag contains more than one child, it is unclear which one .string should refer to, so .string is None:

print(soup.a.string)
# Output:
None
print(soup.p.string)
# Output:
小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...

(4) .strings and .stripped_strings

When a tag contains more than one string, loop over them with .strings:

for string in soup.strings:
    print(repr(string))
# Output:
' '
' '
' '
' '
' '
' '
' '
' '
' 小王子 '
' '
' '
' [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 '
' '
' '
' '
'9.0'
' '
' ( \n561845人評價) '
' '
' '
'小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...'
'\n'
' '
' '
' '
' '
' '
'紙質版47.30元起'
' '
' '
' '
' '
' '

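The output above contains a lot of whitespace. .stripped_strings removes it: strings consisting entirely of whitespace are skipped, and leading and trailing whitespace is stripped from the rest: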

for string in soup.stripped_strings:
    print(repr(string))
# Output:
'小王子'
'[法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元'
'9.0'
'( \n561845人評價)'
'小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...'
'紙質版47.30元起'

2.2.2 Parent nodes

(1) .parent: returns an element's parent. In html_doc, the <div> tag is the parent of the first <a> tag:

print(soup.a.parent)
# Output:
<div class="pic"> <a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a> </div>

(2) .parents: iterates over all of an element's ancestors. The loop below walks from the first <a> tag up to the root of the document:

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# Output:
div
li
[document]

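2.2.3 Sibling nodes

(1) .next_sibling and .previous_sibling: return the node immediately after/before the current Tag at the same level. In real documents these are usually strings of whitespace rather than tags, because of the whitespace between tags. In html_doc, both siblings of the first <a> tag are single-space strings.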

print(soup.a.next_sibling)
print(soup.a.previous_sibling)

(2) .next_siblings and .previous_siblings: iterate over all the siblings after/before the current Tag. Both return a generator.

for sibling in soup.div.next_siblings:
    print(repr(sibling))
for sibling in soup.div.previous_siblings:
    print(repr(sibling))
# Output:
' '
<div class="info"> <h2 class=""> <a href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})" title="小王子"> 小王子 </a> </h2> <div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div> <div class="star clearfix"> <span class="allstar45"></span> <span class="rating_nums">9.0</span> <span class="pl"> ( 
561845人評價) </span> </div> <p>小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>
<div class="ft"> <div class="collect-info"> </div> <div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div> </div> </div>
' '
' '

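2.2.4 Going back and forth

(1) .next_element and .previous_element

.next_element points to whatever was parsed immediately after the current object (a string or a Tag). It may be the same as .next_sibling, but usually it is not. .previous_element is the opposite: it points to whatever was parsed immediately before the current object: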

print(soup.p.next_element)
print(soup.p.previous_element)
# Output:
小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...
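To make the difference between .next_element and .next_sibling concrete, here is a small sketch on a made-up two-tag snippet. The next sibling of <a> is the <b> tag at the same level, while the next element is the string inside <a>, because that string was parsed right after the opening <a> tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a><b>two</b></p>", "html.parser")
a = soup.a

sibling = a.next_sibling      # the <b> tag, same level as <a>
element = a.next_element      # the string 'one', parsed right after <a>
after_text = a.string.next_element  # parsing continues with the <b> tag

print(repr(sibling))
print(repr(element))
print(repr(after_text))
```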

(2) .next_elements and .previous_elements: return generators that move forward/backward through the document in the order it was parsed:

for element in soup.span.next_elements:
    print(repr(element))

for element in soup.span.previous_elements:
    print(repr(element))

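2.3 Searching the tree

Beautiful Soup defines many search methods. The two most common are find(name, attrs, recursive, text, **kwargs) and find_all(name, attrs, recursive, text, limit, **kwargs).

2.3.1 Filters

Filters can be applied to a tag's name, to its attributes, to the text of a string, or to a combination of these.

String: matches tags whose name is exactly that string.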

print(soup.find_all('a'))
# Output:
[<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>, <a href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})" title="小王子"> 小王子 </a>, <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a>]

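Regular expression: if you pass a compiled regular expression, Beautiful Soup filters with its search() method, so the pattern may match anywhere in the tag name unless you anchor it.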

import re

print(soup.find_all(re.compile('^p')))
# Output:
[<p>小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>]
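The effect of anchoring a pattern can be sketched on a small made-up document. An unanchored pattern matches any tag whose name contains the letter, while ^ restricts the match to names that start with it.

```python
import re
from bs4 import BeautifulSoup

doc = "<html><head><title>t</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Unanchored: tag names containing "t" anywhere.
anywhere = [tag.name for tag in soup.find_all(re.compile("t"))]
# Anchored: tag names starting with "t".
at_start = [tag.name for tag in soup.find_all(re.compile("^t"))]

print(anywhere)
print(at_start)
```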

List: if you pass a list, Beautiful Soup returns everything that matches any item in the list.

print(soup.find_all(['p', 'span']))
# Output:
[<span class="allstar45"></span>, <span class="rating_nums">9.0</span>, <span class="pl"> ( 
561845人評價) </span>, <p>小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>, <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>]

True: the value True matches every tag, but not text strings.

for tag in soup.find_all(True):
    print(tag.name)
# Output:
li
div
a
img
div
h2
a
div
div
span
span
span
p
div
div
div
span
a

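Function: if none of the other filters fit, define a function that takes an element as its only argument and returns True if the element matches, False otherwise. The function below returns True when the element has a title attribute; passing it to find_all() returns every tag that has a "title" attribute.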

def has_title(tag):
    return tag.has_attr('title')


print(soup.find_all(has_title))
# Output:
[<a href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})" title="小王子"> 小王子 </a>]

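The attrs argument can also serve as a filter, and the limit argument caps the number of results returned.

2.3.2 find_all(name, attrs, recursive, text, limit, **kwargs)

find_all() looks through all the descendants of the current tag and returns those that match the filters.

(1) name: finds tags whose name is name; text strings are ignored. The value can be any kind of filter: a string, a regular expression, a list, a function, or True.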

print(soup.find_all('span'))
# Output:
[<span class="allstar45"></span>, <span class="rating_nums">9.0</span>, <span class="pl"> ( 
561845人評價) </span>, <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>]

(2) attrs: a Python dict of attribute names and the values they must have. The example finds <span> tags whose class attribute is "buy-info".

print(soup.find_all('span', {'class': 'buy-info'}))
print(soup.find_all('span', attrs={'class': 'buy-info'}))
# Output:
[<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>]

(3) recursive: find_all() on a tag examines all of the tag's descendants. To consider only direct children, pass recursive=False.

print(soup.find_all("span", recursive=False))
# Output:
[]
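The empty result above is expected: the only direct child of soup is the <li> tag. A case where recursive=False still finds something can be sketched on a made-up snippet with nested <div> tags (the id values are invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<li><div id="pic"><div id="inner"></div></div><div id="info"></div></li>'
soup = BeautifulSoup(html, "html.parser")

# Default: every <div> descendant of <li>, at any depth.
all_divs = [d["id"] for d in soup.li.find_all("div")]
# recursive=False: only <div> tags that are direct children of <li>.
direct_divs = [d["id"] for d in soup.li.find_all("div", recursive=False)]

print(all_divs)
print(direct_divs)
```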

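(4) text: matches on the text content rather than on tag attributes. With text you search for strings in the document. Like name, it accepts a string, a regular expression, a list, a function, or True. (Beautiful Soup 4.4.0 and later also accept this argument under the name string.)

For example, to find the strings on the page above that contain "紙質版", or to count the strings that contain "小王子":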

print(soup.find_all(text=re.compile('紙質版')))
# Output:
['紙質版47.30元起']
bookname = soup.find_all(text=re.compile("小王子"))
print(len(bookname))
# Output:
2
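A text search returns strings, not tags. If the enclosing tag is what you are after, one option is to step up with .parent. The sketch below uses a made-up snippet and the string keyword, which is assumed to be available (Beautiful Soup 4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup

html = '<span class="buy-info"><a href="/buylinks">Paperback from 47.30</a></span>'
soup = BeautifulSoup(html, "html.parser")

match = soup.find(string=re.compile("Paperback"))  # a NavigableString
link = match.parent                                # the <a> tag that holds it

print(repr(match))
print(link.name, link["href"])
```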

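(5) limit: find_all() returns every result that matches. On a large tree the search can be slow, so if you do not need all the results, pass limit. It works like the LIMIT keyword in SQL: the search stops as soon as it has found limit results. Below, three tags in the tree match, but only one is returned because of the limit.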

print(soup.find_all("a", limit=1))
# Output:
[<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>]

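(6) kwargs: any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name. Passing title, for instance, makes Beautiful Soup filter on each tag's "title" attribute.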

print(soup.find_all(title="小王子"))
# Output:
[<a href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})" title="小王子"> 小王子 </a>]

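Note: keyword arguments occasionally cause trouble. class is a reserved word in Python, so class="pub" is a syntax error, as shown below. Either pass the first two arguments (name and attrs) instead, or use the class_ keyword that Beautiful Soup provides for this purpose (from 4.1.2 on).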

print(soup.find_all(class="pub"))
# Output:
  File "G:/pycharm/Python Notes/venv/beausoup_dbbook.py", line 91
    print(soup.find_all(class="pub"))
                        ^
SyntaxError: invalid syntax
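Two ways around the SyntaxError can be sketched on a made-up snippet: the class_ keyword, or an attrs dictionary. Both also match tags that carry several classes.

```python
from bs4 import BeautifulSoup

html = '<div class="pub">A</div><div class="star clearfix">B</div>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="pub")           # trailing underscore avoids the reserved word
by_attrs = soup.find_all(attrs={"class": "pub"})   # equivalent dict form
multi = soup.find_all(class_="clearfix")           # matches one class among several

print(by_keyword)
print(by_attrs)
print(multi)
```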

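2.3.3 Other search methods

(1) find( name , attrs , recursive , text , **kwargs )

find_all() returns every matching tag in the document, while find() returns only the first match. When you know the document has just one matching tag, scanning for all of them with find_all() is overkill, and find() is more direct than find_all() with limit=1. The two calls below find the same tag; the only difference is that find_all() returns a list containing it, while find() returns the tag itself: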

print(soup.find_all("a", limit=1))
print(soup.find('a'))
# Output:
[<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>]
<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>
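The two methods also differ when nothing matches: find_all() returns an empty list, while find() returns None. The sketch below, on a made-up document, shows a guard that avoids calling a method on None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>only a paragraph</p>", "html.parser")

missing_all = soup.find_all("table")   # empty list
missing_one = soup.find("table")       # None

# Guard before using the result of find().
caption = missing_one.get_text() if missing_one is not None else ""

print(missing_all, missing_one, repr(caption))
```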

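(2) find_parents() and find_parent()

find_all() and find() work their way down the tree through a tag's descendants. find_parents() and find_parent() work their way up through a tag's ancestors, using the same filters as the other search methods. Against html_doc: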

print(soup.find('span', {'class': 'buy-info'}))
print(soup.find('span', {'class': 'buy-info'}).find_parent("div"))
print(soup.find('span', {'class': 'buy-info'}).find_parents("h2"))
# Output:
<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>
<div class="cart-actions"> <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span> </div>
[]

(3) find_next_siblings() and find_next_sibling()

These two methods iterate over the siblings that follow the current tag, using .next_siblings. find_next_siblings() returns all the following siblings that match; find_next_sibling() returns only the first one.

print(soup.a.find_next_siblings("a"))
print(soup.find("p").find_next_sibling("span"))

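(4) find_previous_siblings() and find_previous_sibling()

These two methods iterate over the siblings that precede the current tag, using .previous_siblings. find_previous_siblings() returns all the preceding siblings that match; find_previous_sibling() returns only the first one.

(5) find_all_next() and find_next()

These two methods iterate over the tags and strings that come after the current tag, using .next_elements. find_all_next() returns all matching nodes; find_next() returns only the first one.

(6) find_all_previous() and find_previous()

These two methods iterate over the tags and strings that come before the current node, using .previous_elements. find_all_previous() returns all matching nodes; find_previous() returns only the first one.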
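Methods (3) to (6) can be tried side by side on a small made-up snippet. Starting from the <span>, the sibling methods look only at nodes on the same level, while find_next() and find_previous() follow parse order.

```python
from bs4 import BeautifulSoup

html = "<div><h2>Title</h2><p>first</p><span>9.0</span><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")
span = soup.span

prev_p = span.find_previous_sibling("p")   # nearest <p> before <span>
next_ps = span.find_next_siblings("p")     # every <p> sibling after <span>
prev_h2 = span.find_previous("h2")         # nearest earlier <h2> in parse order
next_p = span.find_next("p")               # nearest later <p> in parse order

print(prev_p, next_ps, prev_h2, next_p)
```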

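2.3.4 CSS selectors

Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

(1) Find tags by name, level by level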

print(soup.select("p"))
print(soup.select("li span"))
# Output:
[<p>小王子是一個超凡脫俗的仙童，他住在一顆只比他大一丁點兒的小行星上。陪伴他的是一朵他非常喜愛的小玫瑰花。但玫瑰花的虛榮心傷害了小王子對她的感情。小王子告別小行...</p>]
[<span class="allstar45"></span>, <span class="rating_nums">9.0</span>, <span class="pl"> ( 
561845人評價) </span>, <span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>]

(2) Find tags directly beneath other tags

print(soup.select("a > img"))
print(soup.select("li > a"))
# Output:
[<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/>]
[]

(3) Find tags by CSS class

print(soup.select(".buy-info"))
print(soup.select("[class~=nbg]"))
# Output:
[<span class="buy-info"> <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a> </span>]
[<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>]

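(4) Find tags by attribute: put the attribute in square brackets. The attribute belongs to the same node as the tag name, so there must be no space between them; with a space, the selector would look for descendants of the tag instead and would not match: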

print(soup.select('a[href]'))
# Output:
[<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>, <a href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})" title="小王子"> 小王子 </a>, <a href="https://book.douban.com/subject/1084336/buylinks">紙質版47.30元起</a>]

print(soup.select('a[class="nbg"]'))
# Output:
[<a class="nbg" href="https://book.douban.com/subject/1084336/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1084336', 
from:'book_subject_search'})"> <img class="" src="https://img3.doubanio.com/view/subject/s/public/s1103152.jpg" width="90"/> </a>]
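The selector forms above can be combined, and when only the first match is needed, select_one() returns a single tag (or None) instead of a list. The sketch below runs on a trimmed-down copy of the book entry, shortened here for illustration.

```python
from bs4 import BeautifulSoup

html = """<li class="subject-item"><div class="info">
<h2><a href="https://book.douban.com/subject/1084336/" title="小王子">小王子</a></h2>
<span class="rating_nums">9.0</span></div></li>"""
soup = BeautifulSoup(html, "html.parser")

# Class, descendant, and child selectors in one expression.
title_link = soup.select_one("li.subject-item h2 > a")
rating = soup.select_one("span.rating_nums")
missing = soup.select_one("div.ft")   # not present in this snippet

print(title_link["title"], rating.get_text(), missing)
```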

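3. Scraping Douban's classic book list with Beautiful Soup

Let's try scraping the book list under Douban Books' "經典" (classics) tag, as a simple application: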

# -*- coding: utf-8 -*-
# @Author : ChengYu
# @File : beausoup_dbbook.py
# @Project: Python Notes
# @CreateTime : 2020/5/8 15:08:46
import requests
import csv
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the site treats the request as coming from a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/81.0.4044.122 Safari/537.36'}
# Create the CSV file and write the header row
# (index, title, author/publisher/price, rating, number of ratings, summary, where to buy)
with open('DBbooks.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['序號', '書名', '作者/出版信息/價格', '評分', '評價數', '簡介', '購買渠道'])
    count = 0
    for page in range(10):
        # Each page lists 20 books, so start goes 0, 20, 40, ...
        # https://book.douban.com/tag/%E7%BB%8F%E5%85%B8
        # https://book.douban.com/tag/%E7%BB%8F%E5%85%B8?start=0&type=T
        # https://book.douban.com/tag/%E7%BB%8F%E5%85%B8?start=20&type=T
        # https://book.douban.com/tag/%E7%BB%8F%E5%85%B8?start=40&type=T
        url = 'https://book.douban.com/tag/%E7%BB%8F%E5%85%B8?start=' + str(page * 20) + '&type=T'
        # Fetch the page source
        res = requests.get(url, headers=headers).text
        soup = BeautifulSoup(res, 'html.parser')
        # Extract book information from every element with class subject-item
        for bookdata in soup.find_all(attrs={'class': 'subject-item'}):
            # Collect the text with the .stripped_strings generator; the list can be post-processed by hand
            bookitem = list(bookdata.stripped_strings)
            count += 1
            bookitem = [count] + bookitem
            print(bookitem)
            # Write the row to the CSV file
            writer.writerow(bookitem)
# The with statement closes the file, so no explicit close() is needed

Part of the "Run" output (screenshot not preserved).
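Because .stripped_strings simply yields whatever text an entry contains, a book with a missing field (for example, no rating) would produce a shorter row and shift the CSV columns. A possible alternative, sketched below, picks each field with its own selector. The helper names text_of and parse_items are hypothetical, the selectors are taken from the sample html_doc in section 2, and the second sample entry is invented to show a missing rating.

```python
from bs4 import BeautifulSoup


def text_of(item, selector):
    """Hypothetical helper: stripped text of the first match, or '' if absent."""
    node = item.select_one(selector)
    return node.get_text(strip=True) if node is not None else ""


def parse_items(html):
    """Hypothetical helper: one dict with fixed keys per subject-item."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "title": text_of(item, "h2 a"),
            "pub": text_of(item, "div.pub"),
            "rating": text_of(item, "span.rating_nums"),
        }
        for item in soup.select("li.subject-item")
    ]


sample = """
<li class="subject-item"><div class="info"><h2><a> 小王子 </a></h2>
<div class="pub"> [法]聖埃克蘇佩裏/馬振聘/人民文學出版社/2003-8/22.00元 </div>
<span class="rating_nums">9.0</span></div></li>
<li class="subject-item"><div class="info"><h2><a> Unrated book </a></h2>
<div class="pub"> Some author / Some press </div></div></li>
"""
rows = parse_items(sample)
for row in rows:
    print(row)
```

Rows built this way always have the same keys, so they could be written with csv.DictWriter instead of csv.writer.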

It's a long road. Pulling down my mask for a breath of fresh air. END, 2020-05-11.

Python Notes 10: Scraping Douban's classic book list with Beautiful Soup
