Introduction

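BeautifulSoup is an HTML/XML parser, used mainly to parse HTML/XML documents and extract data from them.

It follows a DOM-style model: the whole document is loaded and parsed into a complete tree, so its time and memory overhead is noticeably higher, and it is generally slower than using lxml directly.

BeautifulSoup makes HTML parsing simple and its API is very approachable. It supports CSS selectors, works with the HTML parser in the Python standard library, and can also use lxml's parsers, including the lxml XML parser.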
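
As a minimal sketch of how the parser is chosen, the second argument to BeautifulSoup names it. The example below assumes the standard-library "html.parser" so that nothing besides beautifulsoup4 is required; passing "lxml" instead needs a separate `pip install lxml`. The names `snippet` and `first_bold` are illustrative only.

```python
from bs4 import BeautifulSoup

snippet = "<p class='title'><b>The Dormouse's story</b></p>"

# "html.parser" ships with Python; "lxml" would work the same way
# here but must be installed separately.
soup = BeautifulSoup(snippet, "html.parser")

# select_one() returns the first element matching a CSS selector, or None.
first_bold = soup.select_one("p.title > b")
print(first_bold.get_text())
```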

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Extraction tool	Speed	Difficulty
Regular expressions	Fastest	Complex
BeautifulSoup	Slow	Simple
XPath	Fast	Simple

Installation

In the PyCharm terminal window, run:
pip install beautifulsoup4
Alternatively, use the Tsinghua PyPI mirror, which can be somewhat faster:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple beautifulsoup4

Core concepts

Initialization

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

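# For comparison, the lxml + XPath approach starts like this:
# from lxml import etree
# root = etree.HTML(html_doc)
# root.xpath("//title")

# Load a string and build a BeautifulSoup object, naming the parser explicitly
soup = BeautifulSoup(html_doc, "lxml")
# CSS selectors go through soup.select(), which requires a selector string
print(soup.select("title"))
# Load from a file ("path" is a placeholder for the HTML file's path)
with open("path") as f:
    soup = BeautifulSoup(f, "lxml")
# Pretty-print the loaded content
print(soup.prettify())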

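Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. These objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.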
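
The four kinds can be seen side by side in a small sketch. It assumes the standard-library "html.parser" and a one-line document invented for illustration; the variable names are not from the original text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

doc = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(doc, "html.parser")

p = soup.p
# The p tag has exactly two children: a text node and a comment node.
text_node, comment_node = p.contents

kinds = [type(obj).__name__ for obj in (soup, p, text_node, comment_node)]
print(kinds)

# BeautifulSoup is a subclass of Tag, and Comment is a subclass of NavigableString.
print(isinstance(soup, Tag), isinstance(comment_node, NavigableString))
```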

Tag

This is the most important kind of object; almost all page parsing later on relies on it.

Getting Tags with BeautifulSoup

from bs4 import BeautifulSoup

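# Load a string and build a BeautifulSoup object, naming the parser explicitly
soup = BeautifulSoup(html_doc, "lxml")
# Or load from a file instead ("path" is a placeholder):
# with open("path") as f:
#     soup = BeautifulSoup(f, "lxml")
# Pretty-print the loaded content
print(soup.prettify())

# Note: attribute-style access returns only the first matching tag in the whole document.
# Equivalent to the XPath (//title)[1]
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)
print(type(soup.p))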

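A Tag has two important attributes: name and attrs (attributes).

# The soup object itself is a special case: its name is [document]
print(soup.name)
print(soup.head.name)

# Print all attributes of the p tag; the result is a dictionary
print(soup.p.attrs)
# Get one attribute value by name; class is multi-valued, so the result is a list
print(soup.p['class'])
# XPath: (//p)[1]/@class
print(soup.p.get('class'))

=== Rarely needed ===
# Attributes and content can also be modified
soup.p['class'] = "newClass"
print(soup.p)
# Delete an attribute
del soup.p['class']
print(soup.p)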
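
The difference between the two access styles matters when an attribute may be absent. The following is a small sketch under stated assumptions: the "html.parser" parser, and a single a tag with a second class value ("big") added purely for illustration.

```python
from bs4 import BeautifulSoup

doc = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(doc, "html.parser")
link = soup.a

classes = link["class"]                # multi-valued attribute, returned as a list
href = link["href"]                    # single-valued attribute, returned as a str
missing = link.get("title")            # .get() returns None when the attribute is absent
fallback = link.get("title", "n/a")    # or a caller-supplied default

try:
    link["title"]                      # subscript access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(classes, href, missing, fallback, raised)
```

When scraping pages whose markup is not fully under your control, `.get()` is usually the safer choice, since a missing attribute then shows up as None rather than as an exception.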

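NavigableString (text content of a tag)

Use .string to get the text inside a tag, for example:

print(soup.p.string)
# The Dormouse's story

# XPath: roughly (//p)[1]//text(), since the text sits inside the nested b tag
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>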
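
One caveat worth illustrating: .string only returns a value when the tag has a single child (or a single chain of single children); otherwise it is None. The sketch below assumes "html.parser" and a shortened two-paragraph document made up for this example, and uses get_text() and stripped_strings as alternatives.

```python
from bs4 import BeautifulSoup

doc = (
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>Once <a>Elsie</a> and <a>Lacie</a></p>"
)
soup = BeautifulSoup(doc, "html.parser")
title_p, story_p = soup.find_all("p")

single = title_p.string        # one nested child, so .string reaches the text
mixed = story_p.string         # several children, so .string is None
joined = story_p.get_text()    # concatenates every text node under the tag
pieces = list(story_p.stripped_strings)  # each text node, whitespace-stripped

print(single, mixed, joined, pieces)
```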

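BeautifulSoup (root object; rarely used directly)

The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a special kind of Tag, so we can inspect its type, name, and attributes.

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

# attributes
print(soup.attrs)
# {} (the document itself has no attributes)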

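Comment

A Comment object is a special type of NavigableString; when printed, it shows the comment text without the comment markers.

The output below assumes the first a tag in the document is written as <a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>, that is, its content is a comment rather than plain text.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>

print(soup.a.string)
# Elsie

print(type(soup.a.string))
# <class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we output it with .string the comment markers are already stripped, so it is worth checking the type before treating the value as ordinary text.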
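
One way to guard against this is an isinstance check against Comment. This is only a sketch: the helper name `visible_text` is hypothetical, the two-link document is invented for the example, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

doc = '<a id="link1"><!--Elsie--></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(doc, "html.parser")


def visible_text(tag):
    """Return tag.string as str, or None if it is missing or an HTML comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


# Map each link id to its visible text (None for the commented-out one).
results = {a["id"]: visible_text(a) for a in soup.find_all("a")}
print(results)
```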

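Traversing the document tree

.contents

A tag's .contents attribute returns the tag's direct children as a list.

print(soup.head.contents)
print(soup.head.contents[0])

.children

.children does not return a list, but we can iterate over it to get all direct children.
Printing .children shows that it is an iterator object rather than a list.

print(soup.head.children)
for child in soup.body.children:
    print(child)

.descendants

.contents and .children include only a tag's direct children. The .descendants attribute walks recursively through all of a tag's descendants.
As with .children, we need to iterate over it to get the nodes.

for child in soup.descendants:
    print(child)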
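
The difference between the three attributes is easiest to see by counting nodes. A minimal sketch, assuming "html.parser" and a head element cut down from the sample document:

```python
from bs4 import BeautifulSoup

doc = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(doc, "html.parser")
head = soup.head

direct = head.contents               # list of direct children
children = list(head.children)       # same nodes, obtained by iterating
descendants = list(head.descendants) # children plus everything nested below them

print(len(direct), len(children), len(descendants))
# The extra descendant is the text node inside <title>.
print(descendants[-1])
```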

CSS selectors

Finding by tag name

Syntax:
# XPath: //tag_name
soup.select('tag_name')

# Find all title tags
print(soup.select('title'))
# Find all a tags
print(soup.select('a'))
# Find all b tags
print(soup.select('b'))

Finding by class name

Syntax:
# XPath: //*[@class="class_name"]
soup.select('.class_name')

# Find all tags whose class is sister
print(soup.select('.sister'))

Finding by ID

Syntax:
# XPath: //*[@id="id_name"]
soup.select('#id_name')

# Find all tags whose id is link1
print(soup.select('#link1'))
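
Note that select() always returns a list, even for an ID lookup. The sketch below puts the three basic selectors together and also uses select_one(), which returns a single Tag or None. It assumes "html.parser" and a trimmed copy of the sample links.

```python
from bs4 import BeautifulSoup

doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

by_tag = soup.select("a")          # every <a> tag
by_class = soup.select(".sister")  # every tag with class "sister"
by_id = soup.select("#link1")      # still a list, here with one element

second = soup.select_one("#link2")  # a single Tag
absent = soup.select_one("#nope")   # None when nothing matches

print(len(by_tag), len(by_class), len(by_id))
print(second.get_text(), absent)
```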

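Combined lookups

Direct children

To match direct child tags, separate the parts with >.
For example, to find title elements that are direct children of the head tag, separate the two with >.

# XPath: //head/title
print(soup.select("head > title"))

All descendants

Join each part of the expression with a space.
For example, to find the element with id link1 anywhere inside a p tag, separate the two with a space.

# XPath: //p//*[@id="link1"]
print(soup.select('p #link1'))

Attribute lookup

A selector can also include an attribute condition, written in square brackets. The attribute and the tag name refer to the same node, so there must be no space between them; otherwise nothing will match.

# Find all a tags whose class attribute is "sister"
# XPath: //a[@class="sister"]
print(soup.select('a[class="sister"]'))

# Find all a tags whose href attribute is "http://example.com/elsie"
# XPath: //a[@href="http://example.com/elsie"]
print(soup.select('a[href="http://example.com/elsie"]'))

Attribute conditions can likewise be combined with the lookups above; parts that refer to different nodes are separated by a space (or by >).

# Under any p tag, find a tags whose href is "http://example.com/elsie",
# then their direct children with class "className"
# (className is a placeholder; the sample document has no such element, so this prints [])
# XPath: //p//a[@href="http://example.com/elsie"]/*[@class="className"]
print(soup.select('p a[href="http://example.com/elsie"] > .className'))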
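
To see how the combinators change what is matched, here is a small sketch under stated assumptions: "html.parser" as the parser, and a reduced document in which a third link (id link3, href ending in /outside) is placed directly under body purely for illustration.

```python
from bs4 import BeautifulSoup

doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
<a href="http://example.com/outside" class="sister" id="link3">Outside</a>
</body></html>
"""
soup = BeautifulSoup(doc, "html.parser")

# ">" matches direct children only; a space matches any descendant.
direct_ids = [a["id"] for a in soup.select("body > a")]
nested_ids = [a["id"] for a in soup.select("body a")]

# Attribute condition attached to the tag name, combined with a descendant lookup.
hrefs_in_p = [a["href"] for a in soup.select('p a[class="sister"]')]

titles = [t.get_text() for t in soup.select("head > title")]

print(direct_ids, nested_ids, hrefs_in_p, titles)
```

Pulling the attribute or text out in a list comprehension, as above, is a common pattern once select() has narrowed the matches down.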

Exercises

Scrape the Double Color Ball (雙色球) lottery draw information
http://zst.aicai.com/ssq/openInfo/

Scrape the new-book listings on Douban
https://book.douban.com/latest
