The Beautiful Soup Library
The B and the S are capitalized.
1. Purpose
- The Beautiful Soup library parses, traverses, and maintains a "tag tree".
Tag tree:
<html>
<body>
<p class="title">...</p>
</body>
</html>
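As a rough illustration of those three roles, the sketch below parses the tag tree shown above, walks it, and then edits it. The text "Hello" stands in for the elided "..." and the variable names are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

# The tag tree from above; "Hello" replaces the elided "..." (an assumption).
html = '<html><body><p class="title">Hello</p></body></html>'

# Parse: build the tag tree from the markup.
soup = BeautifulSoup(html, "html.parser")

# Traverse: collect the name of every tag in document order.
tag_names = [tag.name for tag in soup.find_all(True)]

# Maintain: change an attribute and the text in place.
soup.p["class"] = "heading"
soup.p.string = "Hi"
result = str(soup)

print(tag_names)
print(result)
```

The edits are applied to the tree itself, so converting the soup back to a string reflects them.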
2. The BeautifulSoup class
- HTML page <-> tag tree <-> BeautifulSoup class
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "html.parser") # "html.parser" is the HTML parser
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
- A BeautifulSoup object corresponds to the entire content of one HTML/XML document.
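When the document comes from a file, one possible variation is to open it in a with-block so the handle is closed once parsing finishes. The sketch below writes a small stand-in file first so that it runs anywhere; the temporary path and the file contents are assumptions, not the original demo.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small stand-in HTML file (path and contents are assumptions).
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><body><p>data</p></body></html>")

# Parse from the open file; the with-block closes the handle afterwards.
with open(path, encoding="utf-8") as f:
    file_soup = BeautifulSoup(f, "html.parser")

soup_type = type(file_soup).__name__
first_paragraph = str(file_soup.p.string)

print(soup_type)
print(first_paragraph)
```

The parsed tree stays usable after the file is closed, because Beautiful Soup reads the whole document while constructing the object.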
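3. Basic elements
- A NavigableString can span multiple tag levels: when a tag contains exactly one nested chain of tags, .string on the outer tag returns the innermost text.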
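A short sketch of how this looks in practice, using a fragment modeled on the demo page. The second fragment is a made-up counterexample showing that .string gives None once a tag has more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<p class="title"><b>The demo python</b></p>', "html.parser")

p = soup.p
is_tag = isinstance(p, Tag)   # a parsed element is a Tag
name = p.name                 # the tag's name
attrs = p.attrs               # the tag's attributes as a dict

# <p> has a single child <b>, which has a single text child,
# so .string reaches through both levels.
text = p.string
is_navigable = isinstance(text, NavigableString)

# Hypothetical counterexample: two children, so .string is None.
mixed = BeautifulSoup("<p>a<b>b</b></p>", "html.parser")
mixed_string = mixed.p.string

print(is_tag, name, attrs)
print(text, is_navigable)
print(mixed_string)
```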
4. Understanding the library
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string
'This is a comment'
type(newsoup.b.string)
<class 'bs4.element.Comment'>
newsoup.p.string
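- <!-- --> marks an HTML comment. .string drops the comment markers and returns only the text inside, so the b tag and the p tag both appear to hold plain text. To tell them apart, check the type of the returned object: bs4.element.Comment for a comment, bs4.element.NavigableString for ordinary text.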
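The type check described above can be sketched as follows. The helper name is_comment is made up for this example. Because Comment is a subclass of NavigableString, the test has to target Comment specifically.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def is_comment(node):
    """Return True if the node is an HTML comment (hypothetical helper)."""
    return isinstance(node, Comment)

b_is_comment = is_comment(newsoup.b.string)
p_is_comment = is_comment(newsoup.p.string)

# Keep only ordinary text nodes, skipping comments.
visible_text = [
    str(node)
    for node in newsoup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]

print(b_is_comment, p_is_comment)
print(visible_text)
```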
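5. Traversing HTML content with bs4
- Downward traversal of the tag tree (.contents, .children, .descendants)
- Upward traversal of the tag tree (.parent, .parents)
- Sibling traversal of the tag tree (.next_sibling, .previous_sibling, .next_siblings, .previous_siblings)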
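The three directions can be sketched on a small document. The HTML below is a made-up fragment with no whitespace between tags; in real pages the siblings and children often include NavigableString nodes for the line breaks, so checking the node type before using .name is a sensible precaution.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment for illustration, with no whitespace between tags.
html = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Downward: direct children, then every descendant that is a Tag.
child_names = [child.name for child in soup.body.children]
descendant_names = [n.name for n in soup.html.descendants if isinstance(n, Tag)]

# Upward: every ancestor of <title>, ending at the BeautifulSoup object itself.
parent_names = [parent.name for parent in soup.title.parents]

# Sibling: nodes that share the same parent.
first_p = soup.body.p
next_name = first_p.next_sibling.name
previous = first_p.previous_sibling  # None: nothing comes before the first <p>

print(child_names)
print(descendant_names)
print(parent_names)
print(next_name, previous)
```

The last ancestor is named "[document]" because the BeautifulSoup object sits at the root of the tree, above the html tag.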
6. Formatted HTML output with bs4
- Pretty-print the HTML output
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
>>>demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
>>>soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
print(soup.prettify()) # pretty-print: puts each tag and string on its own line
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>