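GitHub@orca-j35. All notes are hosted in the python_notes repository.
Reposting in any form is welcome, but please credit the source.
Navigating the parse tree
Reference: Navigating the tree
Before looking at the navigation attributes, we need to understand how BeautifulSoup structures its parse tree. Consider the following HTML, which BeautifulSoup parses into a tree of nested nodes: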
from bs4 import BeautifulSoup
markup = '''
<p>To find out
<em>more</em> see the
<a href="http://www.w3.org/XML">standard</a>.
</p>'''
soup = BeautifulSoup(markup, 'lxml')
⚠ A navigation attribute always returns a node object (such as a Tag or NavigableString), a list (or iterator) of node objects, or None when no such node exists.
Going down
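The strings and tags contained in a Tag are that Tag's children (and, more broadly, its descendants). BeautifulSoup provides a number of attributes for navigating among a tag's children and descendants.
⚠ String nodes in BeautifulSoup (such as NavigableString and Comment) do not support these downward navigation attributes, because a string node can never contain children.
Tag names
You can select a node by using its tag name as an attribute. This returns the first descendant with that name.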
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(repr(f"{type(soup.head)}:{soup.head}"))
print(repr(f"{type(soup.title)}:{soup.title}"))
print(repr(f"{type(soup.a)}:{soup.a}"))
Output:
"<class 'bs4.element.Tag'>:<head>\n<title>The Dormouse's story</title>\n</head>"
"<class 'bs4.element.Tag'>:<title>The Dormouse's story</title>"
'<class \'bs4.element.Tag\'>:<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
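As a minimal sketch of tag-name access, the snippet below uses a small document invented for this illustration (it is not the "three sisters" example). It suggests that attribute access returns only the first match, that names can be chained, and that a name which does not occur yields None instead of raising.

```python
from bs4 import BeautifulSoup

# A tiny document made up for this sketch.
doc = ("<html><head><title>T</title></head>"
       "<body><p><b>one</b></p><p><b>two</b></p></body></html>")
soup = BeautifulSoup(doc, "html.parser")

# Attribute access returns only the first matching descendant.
first_b = soup.b
# Names can be chained to zoom in one level at a time.
chained_b = soup.body.p.b
# find_all() returns every match rather than just the first one.
all_b = soup.find_all("b")
# A tag name that does not occur in the document yields None.
missing = soup.table

print(first_b)               # <b>one</b>
print(chained_b is first_b)  # True
print(len(all_b))            # 2
print(missing)               # None
```

Because a missing name returns None, chaining such as soup.table.tr would fail with an AttributeError on None, so it is worth checking intermediate results when the document structure is not guaranteed.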
.contents🔧
The .contents attribute returns a list of the tag's direct children:
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
p = soup.find('p', 'story')
pprint(p.contents)
pprint([type(i) for i in p.contents])
Output:
['Once upon a time there were three little sisters; and their names were\n'
' ',
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
',\n ',
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
' and\n ',
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
';\n and they lived at the bottom of a well.\n ']
[<class 'bs4.element.NavigableString'>,
<class 'bs4.element.Tag'>,
<class 'bs4.element.NavigableString'>,
<class 'bs4.element.Tag'>,
<class 'bs4.element.NavigableString'>,
<class 'bs4.element.Tag'>,
<class 'bs4.element.NavigableString'>]
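⚠ The elements of the list returned by .contents are node objects, not plain str objects.
⚠ String nodes in BeautifulSoup (such as NavigableString and Comment) do not support the .contents attribute, because a string node can never contain children. Accessing it raises an exception: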
soup = BeautifulSoup(html_doc, 'html.parser')
pprint(soup.title.contents[0].contents)
#> AttributeError: 'NavigableString' object has no attribute 'contents'
.children🔧
.children is the iterator version of .contents. Its source code is:
#Generator methods
@property
def children(self):
# return iter() to make the purpose of the method clear
return iter(self.contents) # XXX This seems to be untested.
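A short sketch of how .children might be used in practice, on a small list invented for this example. Since the direct children mix Tag and string nodes, the loop filters with isinstance before touching tag-only attributes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Markup made up for this sketch: two <li> tags, a space, and a loose string.
soup = BeautifulSoup("<ul><li>a</li> <li>b</li>text</ul>", "html.parser")
ul = soup.ul

# .children walks the same nodes as .contents, one at a time.
same_nodes = list(ul.children) == ul.contents

# Keep only Tag children; string children have no children of their own.
child_tags = [child.name for child in ul.children if isinstance(child, Tag)]

print(same_nodes)        # True
print(len(ul.contents))  # 4
print(child_tags)        # ['li', 'li']
```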
.descendants🔧
The .descendants attribute returns a generator over all descendants of the current node, which lets you walk recursively through every node beneath it.
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head.descendants)
print(list(soup.head.descendants))
Output:
<generator object Tag.descendants at 0x000001D502BA2750>
['\n', <title>The Dormouse's story</title>, "The Dormouse's story", '\n']
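To make the difference between direct children and descendants concrete, here is a minimal sketch on a fragment invented for this example. The <div> has one direct child but four descendants, because the nested tag and both strings are visited as well.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Fragment made up for this sketch.
soup = BeautifulSoup("<div><p>Hi <b>there</b></p></div>", "html.parser")
div = soup.div

direct = list(div.children)         # only the <p> tag
everything = list(div.descendants)  # <p>, 'Hi ', <b>, 'there'

# Split the descendants into tags and strings, in document order.
tag_names = [node.name for node in everything if isinstance(node, Tag)]
strings = [str(node) for node in everything if not isinstance(node, Tag)]

print(len(direct), len(everything))  # 1 4
print(tag_names)                     # ['p', 'b']
print(strings)                       # ['Hi ', 'there']
```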
.string🔧
The .string attribute retrieves the string inside a tag. It can return a NavigableString, None, or a Comment, as follows:
- If the tag has a single string child, .string returns that string as a NavigableString object:
  from bs4 import BeautifulSoup
  soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
  tag = soup.b
  print(type(tag.string))
  #> <class 'bs4.element.NavigableString'>
  print(tag.string)
  #> Extremely bold
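- If the tag has a single child tag, and that child tag has a single string child, .string returns that string as a NavigableString object. This rule applies recursively. Note that whitespace between the tags would count as an extra string child and make .string return None, so the markup below contains none:
  soup = BeautifulSoup('<b class="boldest"><i><i>Extremely bold</i></i></b>', 'lxml')
  tag = soup.b
  print(type(tag.string))
  #> <class 'bs4.element.NavigableString'>
  print(tag.string)
  #> Extremely bold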
- If the tag has no children, or its single child contains no string, or it has more than one child (including several string children), .string returns None:
  # No children
  soup = BeautifulSoup('<b class="boldest"></b>', 'lxml')
  tag = soup.b
  print(type(tag.string))
  #> <class 'NoneType'>
  print(tag.string)
  #> None
  # The only child contains no string
  soup = BeautifulSoup('<b class="boldest"><i></i></b>', 'lxml')
  print(soup.b.string)
  #> None
  # Several children: None is returned even though they contain strings
  soup = BeautifulSoup('<b class="boldest">link to <i>example.com</i></b>', 'lxml')
  print(soup.b.string)
  #> None
- If the tag has a single comment child, .string returns that comment as a Comment object:
  from bs4 import BeautifulSoup
  markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
  soup = BeautifulSoup(markup, 'lxml')
  comment = soup.b.string
  print(type(comment))
  #> <class 'bs4.element.Comment'>
  print(comment)
  #> Hey, buddy. Want to buy a used parser?
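Because .string returns None as soon as a tag has more than one child, code that only wants the text often needs a fallback. The sketch below, on a fragment taken from the cases above but parsed with html.parser, contrasts .string with get_text(), which concatenates every string beneath the tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>link to <i>example.com</i></b>", "html.parser")
b = soup.b

single = b.i.string   # <i> has exactly one string child
multi = b.string      # <b> has two children, so this is None
text = b.get_text()   # joins all strings under <b>

print(single)  # example.com
print(multi)   # None
print(text)    # link to example.com
```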
.strings🔧
If a tag has several descendants that contain strings, the .strings attribute lets you walk through those strings recursively:
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.strings)
pprint(list(soup.strings))
Output:
<generator object Tag._all_strings at 0x0000013C23342750>
['\n',
'\n',
'\n',
"The Dormouse's story",
'\n',
'\n',
'\n',
"The Dormouse's story",
'\n',
'Once upon a time there were three little sisters; and their names were\n'
' ',
'Elsie',
',\n ',
'Lacie',
' and\n ',
'Tillie',
';\n and they lived at the bottom of a well.\n ',
'\n',
'...',
'\n']
.stripped_strings🔧
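.stripped_strings works like .strings but strips out extra whitespace: it skips strings that consist entirely of whitespace, and it removes whitespace from the beginning and end of the remaining strings. Whitespace in the middle of a string is kept. The example reuses html_doc from above.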
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.stripped_strings)
pprint(list(soup.stripped_strings))
Output:
<generator object Tag.stripped_strings at 0x000002644BE22750>
["The Dormouse's story",
"The Dormouse's story",
'Once upon a time there were three little sisters; and their names were',
'Elsie',
',',
'Lacie',
'and',
'Tillie',
';\n and they lived at the bottom of a well.',
'...']
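A common use of .stripped_strings is to build one clean line of text from a tag. The following sketch uses a paragraph invented for this example and compares the raw and stripped strings before joining them.

```python
from bs4 import BeautifulSoup

# Paragraph made up for this sketch, with deliberately messy whitespace.
soup = BeautifulSoup("<p>  Hello,\n  <b> world </b>\n  !  </p>", "html.parser")

raw = list(soup.p.strings)             # whitespace preserved
clean = list(soup.p.stripped_strings)  # leading/trailing whitespace removed
joined = " ".join(soup.p.stripped_strings)

print(raw)     # ['  Hello,\n  ', ' world ', '\n  !  ']
print(clean)   # ['Hello,', 'world', '!']
print(joined)  # Hello, world !
```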
Going up
Every tag and every string has a parent: the node that contains it.
.parent🔧
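The .parent attribute gives access to the current node's parent. The BeautifulSoup object itself has no parent, so its .parent is None.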
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.parent)
print(soup.html.parent.name)
print(soup.title.parent.name)
Output:
None
[document]
head
.parents🔧
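The .parents attribute returns a generator over all ancestors of the current node, from the nearest one up to the BeautifulSoup object. The example reuses html_doc from above: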
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.a
print(link.parents)
print([i.name for i in link.parents])
Output:
<generator object PageElement.parents at 0x0000013D87571750>
['p', 'body', 'html', '[document]']
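One way .parents can be put to work is locating the nearest enclosing tag of a given name. The helper nearest_ancestor below is a hypothetical name introduced for this sketch, and the markup is invented. BeautifulSoup's own find_parent() covers the same need, so the helper is only meant to show the iteration.

```python
from bs4 import BeautifulSoup

# Markup made up for this sketch.
soup = BeautifulSoup(
    "<div id='outer'><section><p><a href='#'>x</a></p></section></div>",
    "html.parser",
)
link = soup.a


def nearest_ancestor(node, name):
    """Hypothetical helper: return the closest ancestor named `name`, or None."""
    for parent in node.parents:
        if parent.name == name:
            return parent
    return None


names = [parent.name for parent in link.parents]
div = nearest_ancestor(link, "div")

print(names)                            # ['p', 'section', 'div', '[document]']
print(div["id"])                        # outer
print(nearest_ancestor(link, "table"))  # None
print(link.find_parent("div") is div)   # True
```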
Going sideways
Consider this example first:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
'html.parser')
print(sibling_soup.prettify())
Output:
<a>
<b>
text1
</b>
<c>
text2
</c>
</a>
<b> and <c> are siblings because they share the same parent. The strings 'text1' and 'text2' are not siblings because they have different parents.
.next_sibling🔧.previous_sibling🔧
The .next_sibling attribute selects the next sibling node, and the .previous_sibling attribute selects the previous sibling node:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
'html.parser')
print(sibling_soup.b.previous_sibling)
print(sibling_soup.b.next_sibling)
print(sibling_soup.c.previous_sibling)
print(sibling_soup.c.next_sibling)
Output:
None
<c>text2</c>
<b>text1</b>
None
<c> has no .next_sibling because no sibling follows <c>; <b> has no .previous_sibling because no sibling precedes <b>.
⚠ In real documents, a node's .next_sibling (or .previous_sibling) is often a string containing whitespace rather than the neighbouring tag:
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<b>The</b>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(repr(soup.a.next_sibling))
Output:
',\n '
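Since whitespace strings sit between tags, code that wants the next tag rather than the next node usually has to skip them. The sketch below, on a list invented for this example, shows two ways that might be done: filtering .next_siblings by type, or calling find_next_sibling().

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# List made up for this sketch; the newlines become string siblings.
soup = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n<li>c</li>\n</ul>", "html.parser")
first = soup.li

raw_next = first.next_sibling  # the newline string, not the next <li>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
next_tag = first.find_next_sibling("li")

print(repr(raw_next))                   # '\n'
print([t.string for t in tag_siblings])  # ['b', 'c']
print(next_tag)                          # <li>b</li>
```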
.next_siblings🔧.previous_siblings🔧
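.next_siblings and .previous_siblings return generators over the siblings that follow or precede the current node. .previous_siblings yields the nearest sibling first, so its order is the reverse of document order: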
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<b>The</b>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.a.next_siblings)
pprint([repr(i) for i in soup.a.next_siblings])
pprint([repr(i) for i in soup.find(id='link3').previous_siblings])
Output:
<generator object PageElement.next_siblings at 0x000001DDDD0C2750>
["',\\n '",
'<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>',
"' and\\n '",
'<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>',
"';\\n and they lived at the bottom of a well.\\n '"]
["' and\\n '",
'<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>',
"',\\n '",
'<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>',
"'Once upon a time there were three little sisters; and their names "
"were\\n '"]
Going back and forth
First, look at a fragment of the "three sisters" HTML document:
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
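An HTML parser takes this document and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. BeautifulSoup offers tools for reconstructing this initial parse of the document.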
.next_element🔧.previous_element🔧
The .next_element attribute points to whatever was parsed immediately after the current node. The result is usually different from .next_sibling:
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<b>The</b>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(repr(soup.find('a', id='link3').next_sibling)) # the next sibling node
print(repr(soup.find('a', id='link3').next_element)) # the next node that was parsed
Output:
';\n and they lived at the bottom of a well.\n '
'Tillie'
The .previous_element attribute points to whatever was parsed immediately before the current node. The result is usually different from .previous_sibling:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
                             'html.parser')
print(repr(sibling_soup.c.previous_element))
print(repr(sibling_soup.c.previous_sibling))
Output:
'text1'
<b>text1</b>
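The four single-step attributes can be compared side by side. This sketch uses a fragment invented for the example: from the first <a>, the "element" attributes follow parse order (into the tag or back to its parent), while the "sibling" attributes stay on the same level.

```python
from bs4 import BeautifulSoup

# Fragment made up for this sketch.
soup = BeautifulSoup("<p><a>one</a>, <a>two</a></p>", "html.parser")
first = soup.a

next_parsed = first.next_element          # the string inside the first <a>
next_same_level = first.next_sibling      # the string between the two <a> tags
prev_parsed = first.previous_element      # the enclosing <p>, parsed just before
prev_same_level = first.previous_sibling  # nothing precedes it inside <p>

print(repr(next_parsed))      # 'one'
print(repr(next_same_level))  # ', '
print(prev_parsed is soup.p)  # True
print(prev_same_level)        # None
```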
.next_elements🔧.previous_elements🔧
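.next_elements returns a generator that yields, in parse order, the nodes parsed after the current node. .previous_elements returns a generator that yields, in reverse parse order, the nodes parsed before the current node.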
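from pprint import pprint
from bs4 import BeautifulSoup
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",
                             'html.parser')
# Nodes parsed after <a>, in parse order
pprint([repr(i) for i in sibling_soup.a.next_elements])
# Nodes parsed before <c>, most recently parsed first
pprint([repr(i) for i in sibling_soup.c.previous_elements])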
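As a small worked sketch of both generators, the snippet below walks forward from <b> and backward from <c> in a cleaned-up version of the sibling fragment (without the stray closing tag). Only the first two backward steps are shown, since what comes at the very start of the chain may depend on the library version.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

# Everything parsed after <b>, in parse order.
forward = [str(node) for node in soup.b.next_elements]
# Everything parsed before <c>, most recently parsed first.
backward = [str(node) for node in soup.c.previous_elements]

print(forward)       # ['text1', '<c>text2</c>', 'text2']
print(backward[:2])  # ['text1', '<b>text1</b>']
```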
Modifying the parse tree
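BeautifulSoup's main strength is searching the document tree, but you can also use it to modify the tree and save the result as a new HTML or XML document. The available operations are:
- Changing tag names and attributes
- Modifying .string
- append() - append content to a tag
- extend() - added in 4.7.0; extend a tag's contents
- NavigableString() & .new_tag() - add new text or a new tag to a tag
- insert() - insert content into a tag at a given position
- insert_before() & insert_after() - insert content before (or after) the current tag
- clear() - remove the contents of the current tag
- extract() - remove the current tag from the tree and return it
- decompose() - remove the current tag from the tree and destroy it completely
- replace_with() - replace content in the tree
- wrap() - wrap the specified element
- unwrap() - unwrap the specified element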
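A few of the operations listed above can be sketched on a fragment invented for this example. This is only an illustration of renaming, setting an attribute, new_tag(), append(), unwrap() and decompose(); it does not cover the remaining methods.

```python
from bs4 import BeautifulSoup

# Fragment made up for this sketch.
soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")
p = soup.p

p.b.name = "strong"   # rename the tag
p["class"] = "note"   # set an attribute

# Build a new tag and append it to <p>.
link = soup.new_tag("a", href="http://example.com/")
link.string = "link"
p.append(link)
after_append = str(soup)

p.strong.unwrap()     # keep the text, drop the <strong> wrapper
p.a.decompose()       # remove <a> from the tree and destroy it
after_cleanup = str(soup)

print(after_append)
print(after_cleanup)  # <p class="note">bold text</p>
```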