BeautifulSoup Usage Guide - 0x02_Manipulating the Parse Tree

GitHub@orca-j35 - all of my notes are hosted in the python_notes repository.
Reposting in any form is welcome, but please be sure to credit the source.

Navigating the parse tree

Reference: Navigating the tree

Before learning about the navigation attributes of the parse tree, we first need to understand how a BeautifulSoup parse tree is structured. Take the following HTML and its parse tree:

from bs4 import BeautifulSoup

markup = '''
<p>To find out
    <em>more</em> see the
    <a href="http://www.w3.org/XML">standard</a>.
</p>'''
soup = BeautifulSoup(markup, 'lxml')

(Figure: the parse tree built from the markup above)

⚠"導航字段"的返回值總是節點對象(如,Tag 對象、NavigableString 對象),或由節點對象組成的列表(或迭代器)。

Going down

Strings and tags contained inside a Tag are that Tag's children (or, more generally, its descendants). To make navigating among children (and descendants) easy, BeautifulSoup provides a number of related attributes.

⚠ String nodes in BeautifulSoup (e.g., NavigableString and comments) do not support these downward navigation attributes, because a string node can never contain children.

Tag names

You can select a target node by using its tag name as an attribute; this returns the first node with that name among the descendants:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(repr(f"{type(soup.head)}:{soup.head}"))
print(repr(f"{type(soup.title)}:{soup.title}"))
print(repr(f"{type(soup.a)}:{soup.a}"))

Output:

"<class 'bs4.element.Tag'>:<head>\n<title>The Dormouse's story</title>\n</head>"
"<class 'bs4.element.Tag'>:<title>The Dormouse's story</title>"
'<class \'bs4.element.Tag\'>:<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'

.contents🔧

The .contents attribute returns a list of the node's direct children:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
p = soup.find('p', 'story')
pprint(p.contents)
pprint([type(i) for i in p.contents])

Output:

['Once upon a time there were three little sisters; and their names were\n'
 '        ',
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 ',\n        ',
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 ' and\n        ',
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
 ';\n        and they lived at the bottom of a well.\n    ']
[<class 'bs4.element.NavigableString'>,
 <class 'bs4.element.Tag'>,
 <class 'bs4.element.NavigableString'>,
 <class 'bs4.element.Tag'>,
 <class 'bs4.element.NavigableString'>,
 <class 'bs4.element.Tag'>,
 <class 'bs4.element.NavigableString'>]

The elements of the list returned by .contents are node objects, not plain strings.
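
Because the list mixes Tag and NavigableString objects, a common pattern is to filter it by type when you only want the child tags (or only the text pieces). A minimal sketch, continuing from the p selected above:

from bs4.element import NavigableString, Tag

child_tags = [c for c in p.contents if isinstance(c, Tag)]
text_parts = [c for c in p.contents if isinstance(c, NavigableString)]
print([t.name for t in child_tags])
#> ['a', 'a', 'a']
print(len(text_parts))
#> 4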

⚠ String nodes in BeautifulSoup (e.g., NavigableString and comments) do not support the .contents attribute, because a string node can never contain children; trying to access it raises an exception:

soup = BeautifulSoup(html_doc, 'html.parser')
pprint(soup.title.contents[0].contents)
#> AttributeError: 'NavigableString' object has no attribute 'contents'

.children🔧

.children is the iterator counterpart of .contents; its source code is simply:

#Generator methods
@property
def children(self):
    # return iter() to make the purpose of the method clear
    return iter(self.contents)  # XXX This seems to be untested.
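
In practice, .children is what you reach for when you just want a for loop over the direct children without building the intermediate .contents list. A small sketch, reusing the soup from the previous example:

p = soup.find('p', 'story')
for child in p.children:
    print(type(child).__name__)
#> NavigableString
#> Tag
#> NavigableString
#> Tag
#> NavigableString
#> Tag
#> NavigableString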

.descendants🔧

The .descendants attribute returns a generator over all descendant nodes, allowing you to traverse every descendant of the current node recursively:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head.descendants)
print(list(soup.head.descendants))

Output:

<generator object Tag.descendants at 0x000001D502BA2750>
['\n', <title>The Dormouse's story</title>, "The Dormouse's story", '\n']
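
The difference from .contents and .children is that .descendants recurses: <head> has only three direct children (two whitespace strings and the <title> tag), but four descendants, because the string inside <title> is included as well:

print(len(soup.head.contents))
#> 3
print(len(list(soup.head.descendants)))
#> 4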

.string🔧

The .string attribute is used to get the string inside a tag. Its value can be a NavigableString, None, or a Comment, in the following cases:

  • If the tag contains exactly one string child, a NavigableString wrapping that string is returned:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
    tag = soup.b
    print(type(tag.string))
    #> <class 'bs4.element.NavigableString'>
    print(tag.string)
    #> Extremely bold
  • If the tag contains exactly one child tag, and that child itself contains exactly one string child, a NavigableString wrapping that string is returned; this rule applies recursively:

    soup = BeautifulSoup('<b class="boldest"><i><i>Extremely bold</i></i></b>',
                         'lxml')
    tag = soup.b
    print(type(tag.string))
    #> <class 'bs4.element.NavigableString'>
    print(tag.string)
    #> Extremely bold
  • If the tag has no children, if its only child contains no string, or if it has multiple children (including multiple string children), None is returned; a get_text() fallback for this case is sketched after this list:

    # No children
    soup = BeautifulSoup('<b class="boldest"></b>', 'lxml')
    tag = soup.b
    print(type(tag.string))
    #> <class 'NoneType'>
    print(tag.string)
    #> None
    
    # The child contains no string
    soup = BeautifulSoup('<b class="boldest"><i></i></b>', 'lxml')
    print(soup.b.string)
    #> None
    
    # Multiple children: None is returned even though a string is present
    soup = BeautifulSoup('<b class="boldest">link to <i>example.com</i></b>',
                         'lxml')
    print(soup.b.string)
    #> None
  • If the tag contains exactly one comment child, a Comment object wrapping that comment is returned:

    from bs4 import BeautifulSoup
    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup, 'lxml')
    comment = soup.b.string
    print(type(comment))
    #> <class 'bs4.element.Comment'>
    print(comment)
    #> Hey, buddy. Want to buy a used parser?
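
When .string comes back as None because a tag holds several children, the usual fallback is get_text(), which concatenates every string in the subtree. A quick sketch of the multi-child case from above:

soup = BeautifulSoup('<b class="boldest">link to <i>example.com</i></b>',
                     'lxml')
print(soup.b.string)
#> None
print(soup.b.get_text())
#> link to example.com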

.strings🔧

If a tag has several descendants that contain strings, the .strings attribute lets you iterate over all of those strings recursively:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.strings)
pprint(list(soup.strings))

Output:

<generator object Tag._all_strings at 0x0000013C23342750>
['\n',
 '\n',
 '\n',
 "The Dormouse's story",
 '\n',
 '\n',
 '\n',
 "The Dormouse's story",
 '\n',
 'Once upon a time there were three little sisters; and their names were\n'
 '        ',
 'Elsie',
 ',\n        ',
 'Lacie',
 ' and\n        ',
 'Tillie',
 ';\n        and they lived at the bottom of a well.\n    ',
 '\n',
 '...',
 '\n']

.stripped_strings🔧

.stripped_strings works like .strings, but strips the extra whitespace: strings made up entirely of whitespace are skipped, and leading and trailing whitespace is removed from the remaining strings.

from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.stripped_strings)
pprint(list(soup.stripped_strings))

Output:

<generator object Tag.stripped_strings at 0x000002644BE22750>
["The Dormouse's story",
 "The Dormouse's story",
 'Once upon a time there were three little sisters; and their names were',
 'Elsie',
 ',',
 'Lacie',
 'and',
 'Tillie',
 ';\n        and they lived at the bottom of a well.',
 '...']
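
A common follow-up is to join the stripped strings back into a single piece of readable text (whitespace inside each individual string is preserved):

text = ' '.join(soup.stripped_strings)
print(text)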

Going up

Every tag and every string has a parent: the node that contains it.

.parent🔧

The .parent attribute accesses the current node's parent.

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.parent)
print(soup.html.parent.name)
print(soup.title.parent.name)

Output:

None
[document]
head
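
Strings have parents too: the .parent of a NavigableString is the tag that directly contains it. Continuing with the soup above:

print(soup.title.string.parent.name)
#> title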

.parents🔧

The .parents attribute returns a generator over all ancestor nodes, which can be used to iterate over every ancestor of the current node:

from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.a
print(link.parents)
print([i.name for i in link.parents])

Output:

<generator object PageElement.parents at 0x0000013D87571750>
['p', 'body', 'html', '[document]']

Going sideways

First consider the following example:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
                             'html.parser')
print(sibling_soup.prettify())

Output:

<a>
 <b>
  text1
 </b>
 <c>
  text2
 </c>
</a>

<b> and <c> are siblings because they share the same parent; the strings 'text1' and 'text2' are not siblings, because their parents differ.
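
You can verify this by checking the parents directly:

print(sibling_soup.b.parent.name, sibling_soup.c.parent.name)
#> a a
print(sibling_soup.b.string.parent.name, sibling_soup.c.string.parent.name)
#> b c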

.next_sibling🔧 and .previous_sibling🔧

The .next_sibling attribute selects the next sibling node, and .previous_sibling selects the previous sibling node:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
                             'html.parser')
print(sibling_soup.b.previous_sibling)
print(sibling_soup.b.next_sibling)

print(sibling_soup.c.previous_sibling)
print(sibling_soup.c.next_sibling)

Output:

None
<c>text2</c>
<b>text1</b>
None

<c> has no .next_sibling because no sibling follows it; <b> has no .previous_sibling because no sibling precedes it.

⚠ In a real document, a node's .next_sibling (or .previous_sibling) is often a string containing whitespace:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <b>The</b>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(repr(soup.a.next_sibling))

Output:

',\n        '
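
If you want to skip over these whitespace strings and jump straight to the next tag, the search methods find_next_sibling() and find_previous_sibling() accept a tag name as a filter; for example:

print(soup.a.find_next_sibling('a'))
#> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>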

.next_siblings🔧 and .previous_siblings🔧

.next_siblings and .previous_siblings return generators over the following and the preceding sibling nodes, respectively:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <b>The</b>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.a.next_siblings)
pprint([repr(i) for i in soup.a.next_siblings])

pprint([repr(i) for i in soup.find(id='link3').previous_siblings])

Output:

<generator object PageElement.next_siblings at 0x000001DDDD0C2750>
["',\\n        '",
 '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>',
 "' and\\n        '",
 '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>',
 "';\\n        and they lived at the bottom of a well.\\n    '"]
["' and\\n        '",
 '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>',
 "',\\n        '",
 '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>',
 "'Once upon a time there were three little sisters; and their names "
 "were\\n        '"]

Going back and forth

First, look at this fragment of the "three sisters" HTML document:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

After receiving the HTML above, the parser turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. BeautifulSoup provides tools for replaying the document's initial parse.

.next_element🔧 and .previous_element🔧

The .next_element attribute points to whatever was parsed immediately after the current node. This is usually not the same as .next_sibling: for a tag, the thing parsed next is normally its own first child, not the node that follows the whole tag:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <b>The</b>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(repr(soup.find('a', id='link3').next_sibling)) # the next sibling node
print(repr(soup.find('a', id='link3').next_element)) # the next node in parse order

Output:

';\n        and they lived at the bottom of a well.\n    '
'Tillie'

The .previous_element attribute points to whatever was parsed immediately before the current node, which is usually not the same as .previous_sibling:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
                             'html.parser')

print(repr(sibling_soup.c.previous_element))  # the node parsed right before <c>
print(repr(sibling_soup.c.previous_sibling))  # the previous sibling tag

Output:

'text1'
<b>text1</b>

.next_elements🔧 and .previous_elements🔧

.next_elements returns a generator that yields, in parse order, every node parsed after the current one; .previous_elements returns a generator that yields, in reverse parse order, every node parsed before it.

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
                             'html.parser')

pprint([repr(i) for i in sibling_soup.a.next_elements])
pprint([repr(i) for i in sibling_soup.c.previous_elements])

Modifying the parse tree


BeautifulSoup's strength is searching the document tree, but you can also use it to modify the tree and save the modified document as new HTML or XML.

