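Searching the document tree
1. Beautiful Soup defines many search methods; this article focuses on two of them: find() and find_all()
2. find_all() and the methods modelled on it locate the parts of a document you are looking for
Filters
1. Before introducing find_all(), here are the filter types that run through the whole search API. A filter can match against a tag's name, its attributes, the text of a string, or a combination of these
2. A filter is simply what you pass to a search method as an argument: whatever you want to find, you pass it to find_all() or a similar method
Strings
The simplest filter is a string (a tag name). Pass a string to a search method and Beautiful Soup matches the tags whose name equals that string exactly
Example 1: find all <p>, <b> and <a> tags in the document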
from bs4 import BeautifulSoup  # import the bs4 library
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_p = soup.find_all("p")
print(tag_p)
tag_b = soup.find_all("b")  # the <b> tag is nested inside the first <p>
print(tag_b)
tag_a = soup.find_all("a")  # the <a> tags are nested inside the second <p>
print(tag_a)
"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""
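Notes on the example above:
1. The result is a list of every matching tag (each element is a Tag object); each matching tag is one element of the list
2. A matching tag is returned whole, including everything nested inside it: <b> is nested inside the first <p>, so searching for <p> returns that <p> with its <b> included
3. The reverse does not hold: searching for <b> returns only the <b> tag, not the <p> that contains it
4. If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8; pass a Unicode string instead to avoid encoding errors
5. Iterating over the list yields Tag objects one at a time, so every Tag method is available on each element
Example 1_1:
for i in tag_a:
    print(i, type(i))
    print(i.get("href"))  # read the attribute from the current tag; soup.a would always be the first <a>
"""
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a> <class 'bs4.element.Tag'>
http://example.com/elsie
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <class 'bs4.element.Tag'>
http://example.com/lacie
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> <class 'bs4.element.Tag'>
http://example.com/tillie
"""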
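Regular expressions
If you pass a compiled regular expression, Beautiful Soup filters against it using its search() method, so the pattern can match anywhere in a tag name unless it is anchored
Example 2: find all tags whose name starts with b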
from bs4 import BeautifulSoup  # import the bs4 library
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_b = soup.find_all(re.compile("^b"))  # this also returns a list
for i in tag_b:
    print(i, type(i))
    print(i.name)
"""
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body> <class 'bs4.element.Tag'>
body
<b>The Dormouse's story</b> <class 'bs4.element.Tag'>
b
"""
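Notes on the example above:
1. The filter passed to find_all() is a regular expression (tag names starting with b). Two tags in the document match, <body> and <b>, so both are printed
2. Each result is a Tag object, so Tag methods and attributes are available on it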
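The official English documentation states that the pattern is applied with search(), so an unanchored pattern matches anywhere in a tag name. A minimal sketch of the difference, assuming a small made-up document and the built-in html.parser (neither comes from the example above):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document, parsed with the standard-library parser
doc = "<html><head><title>T</title></head><body><p><b>x</b></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Anchored pattern: only names that start with "b"
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored pattern: any name that contains "t"
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```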
Lists
If you pass a list, Beautiful Soup returns everything that matches any item in the list
Example 3: find all <a> and <b> tags in the document
from bs4 import BeautifulSoup  # import the bs4 library
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_a_b = soup.find_all(["a", "b"])  # this also returns a list
print(tag_a_b, type(tag_a_b))
"""
[<b>The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<class 'bs4.element.ResultSet'>
"""
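Notes on the example above:
1. To search for several tags at once, put the names in a list and pass the list to find_all() as the filter
2. The result is a list-like ResultSet of all matching tags, and each element is still a Tag object
True
True matches every value. The code below finds all the tags in the document but returns none of the text strings
Example 4: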
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag = soup.find_all(True)
print(tag)
# This filter does not seem to be used very often; it is enough to know that it exists
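A compact way to see what True matches is to print only the tag names. This is a sketch on a made-up document, parsed with the built-in html.parser (an assumption, chosen because it does not add tags of its own):

```python
from bs4 import BeautifulSoup

# Hypothetical document; every tag is matched, no text string is returned
doc = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)  # ['html', 'head', 'title', 'body', 'p']
```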
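Functions
1. If none of the other filters fits, you can define a function that takes a single element as its argument and returns True when the element matches and False otherwise
2. The element passed in is a tag node of the HTML document, never a text node
Example 5: return True for tags that have a class attribute but no id attribute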
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup = BeautifulSoup(html, "lxml")
tag = soup.find_all(has_class_but_no_id)  # pass the function itself to find_all()
print(tag)
"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
<p class="story">...</p>]
"""
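Note:
The search condition is "has a class attribute but no id attribute", so only the three <p> tags match. The <a> tags do not match because they have an id, but they are nested inside a <p>, so they appear when that <p> is printed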
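A function can also be the value of an attribute filter. According to the official documentation, the function then receives the attribute value rather than the whole tag. The sketch below assumes a made-up fragment and a hypothetical helper named points_to_lacie; it guards against None for tags that lack the attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links with an href, one without
doc = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="no-href">Nobody</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

def points_to_lacie(href):
    # href is the attribute value; treat a missing attribute (None) as no match
    return href is not None and href.endswith("/lacie")

matches = soup.find_all(href=points_to_lacie)
ids = [tag["id"] for tag in matches]
print(ids)  # ['link2']
```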
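The find_all() method
Syntax:
find_all(name , attrs , recursive , text , limit , **kwargs )
Description:
1. find_all() looks through all the tag descendants of the current tag and returns those that match the filters
2. The filters described above apply here as well; the difference is that the remaining parameters let you match on a tag's attributes and text, not only on its name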
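The signature also lists recursive, which the examples here do not use. By default find_all() examines every descendant of the tag; with recursive=False it examines only the direct children. A sketch on a made-up document:

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Default: every descendant of <html> is examined
all_levels = soup.html.find_all("title")

# recursive=False: only direct children; <title> is a grandchild of <html>
children_only = soup.html.find_all("title", recursive=False)

print(len(all_levels), len(children_only))  # 1 0
```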
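The name parameter
1. The name parameter finds the tags whose name matches name; text strings are ignored automatically
2. The value of name can be any type of filter: a string, a regular expression, a list, a function, or True
Example 6: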
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_title = soup.find_all("title")
print(tag_title)
tag_a = soup.find_all("a")
print(tag_a)
"""
[<title>The Dormouse's story</title>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""
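Note:
As the output shows, this is the same filter mechanism described earlier: the first argument of find_all() is name, and it accepts any type of filter
Keyword arguments
Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name
For instance, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute
Example 7: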
from bs4 import BeautifulSoup
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_link = soup.find_all(id="link2")  # filter on the id attribute
print(tag_link)
tag_href = soup.find_all(href=re.compile("example"))  # filter on the href attribute
print(tag_href)
tag_True = soup.find_all(id=True)  # True: any tag that has an id attribute
print(tag_True)
tag_all = soup.find_all(href=re.compile("example"), id='link1')  # several keyword arguments at once
print(tag_all)
tag_class = soup.find_all(class_="sister")  # filter on the class attribute
print(tag_class)
"""
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""
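Notes:
The example shows several ways to search with keyword arguments. The value of an attribute filter can be a string, a regular expression, a list, or True, and the filters can be combined
1. The id keyword: given an argument named id, Beautiful Soup filters on each tag's "id" attribute
2. The href keyword: given an argument named href, Beautiful Soup filters on each tag's "href" attribute
3. True as the value: find_all(id=True) finds every tag that has an id attribute, whatever its value
4. Several keywords together: passing several keyword arguments filters on several attributes at once
5. The class keyword: class is a reserved word in Python, so the argument is spelled class_, with a trailing underscore
6. Combining several filter types narrows the search and makes the results more precise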
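Some attribute names cannot be passed as keyword arguments: name collides with the first parameter of find_all(), and names such as data-role are not valid Python identifiers. The official documentation suggests putting such filters in the attrs dictionary. A sketch on a made-up fragment (the data-role attribute is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a "name" attribute and a data-* attribute
doc = """<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<div data-role="note">n</div>"""
soup = BeautifulSoup(doc, "html.parser")

by_tag_name = soup.find_all(name="dromouse")              # looks for <dromouse> tags
by_attribute = soup.find_all(attrs={"name": "dromouse"})  # looks at the name attribute
by_data = soup.find_all(attrs={"data-role": "note"})

print(len(by_tag_name), len(by_attribute), len(by_data))  # 0 1 1
```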
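Searching by CSS class
1. Searching for tags by CSS class is very useful, but the attribute name class is a reserved word in Python, so using class as a keyword argument is a syntax error. Since Beautiful Soup 4.1.1 you can search for tags with a given CSS class through the class_ argument (also shown in the example above)
Example 8: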
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_class_1 = soup.find_all(class_="sister", id="link3")  # combine class_ with id
print(tag_class_1)
"""
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""
2. The class_ argument also accepts the different filter types: a string, a regular expression, a function, or True
Example 8_1:
from bs4 import BeautifulSoup
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_class_1 = soup.find_all(class_=re.compile("itl"))
print(tag_class_1)
#[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
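A single tag can carry several CSS classes. As described in the official documentation, class_ matches against any one of them, or against the exact attribute string, but not against the same classes in a different order. A sketch on a made-up one-line document; the select() call is included for comparison and assumes the CSS selector support bundled with recent bs4 releases:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

single = soup.find_all("p", class_="strikeout")          # one of the classes
exact = soup.find_all("p", class_="body strikeout")      # the exact attribute string
reordered = soup.find_all("p", class_="strikeout body")  # same classes, other order
selected = soup.select("p.strikeout.body")               # CSS selector, order-independent

print(len(single), len(exact), len(reordered), len(selected))  # 1 1 0 1
```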
The text parameter
1. The text parameter searches the strings in the document. Like the name parameter, text accepts a string, a regular expression, a list, or True
Example 9:
from bs4 import BeautifulSoup
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
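soup = BeautifulSoup(html, "lxml")  # specify the parser
# "Elsie" is not returned: in this document it only occurs inside a comment, as " Elsie " with spaces
more = soup.find_all(text=["Tillie", "Elsie", "Lacie"])
print(more)
story_strings = soup.find_all(text=re.compile("story"))  # renamed from all to avoid shadowing the built-in
print(story_strings)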
"""
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
"""
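2. Although text is meant for finding strings, it can be combined with other arguments to filter tags: Beautiful Soup then finds the tags whose .string matches the value of text
Example 9_1: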
from bs4 import BeautifulSoup
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_a = soup.find_all("a", text="Tillie")
print(tag_a)
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
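The official documentation notes that, starting with Beautiful Soup 4.4.0, this parameter is named string, and text is the older name for it. The sketch below repeats both kinds of search with string= on a made-up fragment; it assumes bs4 4.4.0 or later is installed:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two links
doc = '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(doc, "html.parser")

strings = soup.find_all(string=re.compile("ill"))  # returns the matching strings
tags = soup.find_all("a", string="Tillie")         # returns tags whose .string matches
tag_ids = [tag["id"] for tag in tags]

print(strings)  # ['Tillie']
print(tag_ids)  # ['link3']
```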
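The limit parameter
find_all() returns every match, which can be slow when the document tree is large. If you do not need all the results, use the limit parameter to cap the number returned. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and the results are returned.
Example 10: three tags in the document match the search, but only two are returned because of the limit
soup.find_all("a", limit=2)
"""
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
"""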
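The find() method
Syntax:
find( name , attrs , recursive , text , **kwargs )
Description:
1. find_all() returns every matching tag, but sometimes only one result is wanted. If the document has a single <body> tag, for example, scanning for all of them with find_all() is unnecessary; rather than calling find_all() with limit=1, call find() directly.
Example 11: the two calls below find the same tag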
import re
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser
print(soup.find_all('title', limit=1))  # returns a list
print(soup.find('title'))  # returns a Tag
"""
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
"""
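Notes:
1. The only difference between the two calls is that find_all() returns a list containing a single element (without limit it would contain every matching tag), whereas find() returns the result itself
2. When nothing matches, find_all() returns an empty list and find() returns None
3. As the output shows, find_all() returns a list that has to be indexed or iterated to reach a Tag object, while find() returns a Tag object directly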
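Because find() returns None when nothing matches, it is worth checking the result before calling methods on it. A minimal sketch on a made-up document that contains no <table> tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>x</p></body></html>", "html.parser")

missing_all = soup.find_all("table")  # empty list
missing_one = soup.find("table")      # None

# Guard against None; calling missing_one.get_text() directly would raise AttributeError
table_text = missing_one.get_text() if missing_one is not None else ""
print(missing_all, missing_one, repr(table_text))  # [] None ''
```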
Example:
from bs4 import BeautifulSoup  # import the bs4 library
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # specify the parser and create the BeautifulSoup object
p_string = soup.p.string
print("string of the first <p> tag:", p_string)
p = soup.find_all("p")
print("tags:", p)
for i in p:
    print("find the tags first, then read .string:", i.string)
    print("find the tags first, then read an attribute value:", i["class"])
    print(i.get("class"))
"""
string of the first <p> tag: The Dormouse's story
tags: [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
find the tags first, then read .string: The Dormouse's story
find the tags first, then read an attribute value: ['title']
['title']
find the tags first, then read .string: None
find the tags first, then read an attribute value: ['story']
['story']
find the tags first, then read .string: ...
find the tags first, then read an attribute value: ['story']
['story']
"""
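Notes:
1. find_all() returns a list of all the Tag objects that match; iterate over it to reach the individual tags
2. find() returns the first Tag object that matches, directly
3. Two ways to read the string inside a tag:
⑴ Use "soup.tagname.string" directly: this gives the string of the first tag with that name
⑵ Find all the matching Tag objects first, then read "tag.string" on each one: this gives the string of every matching tag (.string is None when a tag has more than one child, as for the second <p> above)
4. When parsing an XML document, specify the "xml" parser rather than "lxml": "lxml" is the HTML parser, so it treats the input as HTML and may not preserve the XML structure
5. The main things to extract from an HTML or XML document are:
⑴ Tags: their attributes and attribute values (key: value). Once you have a Tag object, index it like a dictionary to read a specific attribute
⑵ Strings: the text between a tag's opening and closing tags, read as described in note 3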
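The None in the output above comes from a tag with several children. A sketch contrasting .string with get_text(), which joins all the text inside a tag, on a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with three children: text, an <a> tag, text
doc = '<p class="story">Once upon a time <a href="#">Lacie</a> lived.</p>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

single_string = p.string   # None, because <p> has more than one child
full_text = p.get_text()   # all text inside <p>, joined together
link_string = p.a.string   # <a> has exactly one child, so .string works

print(single_string)  # None
print(full_text)      # Once upon a time Lacie lived.
print(link_string)    # Lacie
```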
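CSS selectors
1. Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax
2. CSS selectors are a separate document-search syntax; see http://www.w3school.com.cn/css/css_selector_type.asp
3. There are many kinds of CSS selectors. This section covers one very common workflow; for the rest, see
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id87
Step 1: open the page, press F12 to open the developer tools, select the element you need, then right-click -> Copy -> Copy selector to copy the path of that tag
Step 2: paste the path into any text editor (copying several paths makes them easier to compare). For example:
⑴#mainBox > main > div.article-list > div:nth-child(4) > h4 > a
⑵#mainBox > main > div.article-list > div:nth-child(5) > h4 > a
Step 3: comparing the paths from step 2, the only part that differs is ":nth-child(num)". Delete that part, colon included, to get a path that matches all the items:
#mainBox > main > div.article-list > div > h4 > a
Example 12: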
import requests
from bs4 import BeautifulSoup
url = 'https://blog.csdn.net/qq_39********'
header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
html = requests.get(url,headers = header)
# Use the built-in html.parser: slower, but it works without extra dependencies
soup = BeautifulSoup(html.text,'html.parser')
tag = soup.select("#mainBox > main > div.article-list > div > h4 > a")
print(tag)
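The effect of removing ":nth-child(num)" can be checked offline. The sketch below uses made-up markup that only imitates the structure implied by the copied selectors; it is not the real page, and the href values are invented:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure in the copied selectors
doc = """
<div id="mainBox"><main><div class="article-list">
<div><h4><a href="/a/1">first</a></h4></div>
<div><h4><a href="/a/2">second</a></h4></div>
<div><h4><a href="/a/3">third</a></h4></div>
</div></main></div>
"""
soup = BeautifulSoup(doc, "html.parser")

# With :nth-child(2) only the second item is selected
one = soup.select("#mainBox > main > div.article-list > div:nth-child(2) > h4 > a")
# Without it, the same path matches every item
every = soup.select("#mainBox > main > div.article-list > div > h4 > a")

one_hrefs = [a["href"] for a in one]
every_hrefs = [a["href"] for a in every]
print(one_hrefs)    # ['/a/2']
print(every_hrefs)  # ['/a/1', '/a/2', '/a/3']
```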
Going further
Test HTML
<div class="postlist">
<ul id="pins">
<li><a href="https://www.mzitu.com/198830" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/198830_14a48_236.jpg"
<li><a href="https://www.mzitu.com/189169" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/189169_11b11_236.jpg"
<li><a href="https://www.mzitu.com/190884" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190884_20c59_236.jpg"
<li><a href="https://www.mzitu.com/190416" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190416_18d11_236.jpg"
<li><a href="https://www.mzitu.com/190947" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190947_21a24_236.jpg"
<li><a href="https://www.mzitu.com/190259" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190259_18a38_236.jpg"
<li><a href="https://www.mzitu.com/195585" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/195585_16a41_236.jpg"
<li><a href="https://www.mzitu.com/190177" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190177_16e34_236.jpg"
<li><a href="https://www.mzitu.com/191199" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/191199_22c42_236.jpg"
<li><a href="https://www.mzitu.com/190636" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190636_19c12_236.jpg"
<li><a href="https://www.mzitu.com/191054" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/191054_21c15_236.jpg"
<li><a href="https://www.mzitu.com/190302" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190302_18b26_236.jpg"
<nav class="navigation pagination" role="navigation">
<div class="nav-links"><span aria-current="page" class="page-numbers current">1</span>
<a class="page-numbers" href="https://www.mzitu.com/page/2/">2</a>
<a class="page-numbers" href="https://www.mzitu.com/page/3/">3</a>
<a class="page-numbers" href="https://www.mzitu.com/page/4/">4</a>
<span class="page-numbers dots">…</span>
<a class="page-numbers" href="https://www.mzitu.com/page/228/">228</a>
<a class="next page-numbers" href="https://www.mzitu.com/page/2/">下一頁»</a></div>
</nav> </div>
Example 13:
import requests
from bs4 import BeautifulSoup
url = 'http://www.mzitu.com'
header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
html = requests.get(url,headers = header)
# Use the built-in html.parser: slower, but it works without extra dependencies
soup = BeautifulSoup(html.text,'html.parser')
# Method 1
"""
# The links we want are all the <a> tags inside the first div with class="postlist"
all_a = soup.find('div',class_='postlist').find_all('a',target='_blank')
for a in all_a:
    print(a["href"])
"""
# Method 2
all_a = soup.find_all('a',target="_blank")
for a in all_a:
    print(a["href"])
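Note: the example above uses two methods to find the tags that match ('a', target="_blank"), and the two methods give different output
1. An HTML page can contain tags that satisfy the search condition but hold information we do not need
2. Looking at the page, everything we need sits under the <div class="postlist"> tag. So we first call find() to get that Tag object, then call find_all() on it to find the tags we need. Tags outside <div class="postlist"> that also satisfy the condition of method 2 are then not returned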
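The difference between the two methods can be reproduced without a network request. The sketch below uses made-up markup (the header div and all href values are invented) and adds a None check, since find() returns None when the container is missing:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one unwanted link outside the postlist container
doc = """
<div class="header"><a href="/about" target="_blank">about</a></div>
<div class="postlist"><ul>
<li><a href="/post/1" target="_blank">one</a></li>
<li><a href="/post/2" target="_blank">two</a></li>
</ul></div>
"""
soup = BeautifulSoup(doc, "html.parser")

# Method 2: search the whole document
everywhere = [a["href"] for a in soup.find_all("a", target="_blank")]

# Method 1: narrow the search to the container first
container = soup.find("div", class_="postlist")
scoped = []
if container is not None:
    scoped = [a["href"] for a in container.find_all("a", target="_blank")]

print(everywhere)  # ['/about', '/post/1', '/post/2']
print(scoped)      # ['/post/1', '/post/2']
```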
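Notes:
1. From my own experience, the form used most often is find_all(tag name, keyword arguments). find_all(tag name) alone also works, but adding keyword arguments makes the match more precise
2. This article follows the official Beautiful Soup documentation. It is only a record of my own learning, kept for later reference, and it certainly contains mistakes and omissions; if you come across them, please bear with me, or go straight to the official documentation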
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id87