Python web scraping: searching the document tree with BeautifulSoup

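Searching the document tree

1. Beautiful Soup defines many search methods. These notes focus on two of them: find() and find_all().

2. With find_all() and the methods modeled on it, you can locate whatever content you need in a document.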

 

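Filters

1. Before looking at find_all() itself, it helps to go through the kinds of filters, because they are used throughout the search API. A filter can be applied to a tag's name, to its attributes, to its string content, or to a combination of these.

2. A filter is only ever passed as an argument to a search method, so "argument type" may be the more accurate name: whatever you want to look for, you pass it as an argument to find_all() or a similar method.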

 

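Strings

The simplest filter is a string (a tag name). Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly.

Example 1: find all <b> tags in the document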

from bs4 import BeautifulSoup  # import the bs4 library

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_p = soup.find_all("p")
print(tag_p)

tag_b = soup.find_all("b")  # the <b> tag is nested inside the first <p> tag
print(tag_b)

tag_a = soup.find_all("a")  # the <a> tags are nested inside the second <p> tag
print(tag_a)

"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

[<b>The Dormouse's story</b>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

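Notes on the example above:
1. The output is a list of every matching tag. Each element is a Tag object, and each matching tag is one element of the list.

2. A matching tag is output whole, including everything nested inside it. Here <b> is nested in the first <p>, so the <p> results include the <b> tag as well.

3. The reverse does not hold: searching for <b> returns only the <b> tag, not the <p> tag that contains it.

4. If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8. Pass a Unicode string instead to avoid encoding problems.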

5. Iterating over the list yields Tag objects one at a time, so every Tag method is available on them.
Example 1_1:

for i in tag_a:
    print(i, type(i))
    print(i.get("href"))  # read href from the current tag; soup.a would always be the first <a>

"""
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a> <class 'bs4.element.Tag'>
http://example.com/elsie
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <class 'bs4.element.Tag'>
http://example.com/lacie
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> <class 'bs4.element.Tag'>
http://example.com/tillie
"""

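The points about exact matching and nesting can be checked with a much smaller document. This is only a sketch: the HTML is made up for illustration, and it uses the built-in "html.parser" so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML, small enough to reason about by eye.
html = "<html><body><p class='title'><b>Bold</b> text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# A string filter matches the tag name exactly, so "b" does not match <body>.
b_names = [tag.name for tag in soup.find_all("b")]
print(b_names)

# A matched <p> is returned whole, with the nested <b> still inside it.
first_p = soup.find_all("p")[0]
print(first_p)
print(first_p.b.string)
```

Searching for "b" gives one tag, while the <p> result still carries its <b> child.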
 

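Regular expressions

If you pass a compiled regular expression, Beautiful Soup filters against it using its search() method.
Example 2: find all tags whose names start with "b"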

from bs4 import BeautifulSoup  # import the bs4 library
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_b = soup.find_all(re.compile("^b"))  # this also returns a list

for i in tag_b:
    print(i,type(i))
    print(i.name)

"""
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body> <class 'bs4.element.Tag'>
body
<b>The Dormouse's story</b> <class 'bs4.element.Tag'>
b

"""

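Notes on the example above:
1. The filter passed to find_all() is a regular expression for names starting with "b". Two tags in the document satisfy it, <body> and <b>, so both are output.

2. Each result is a Tag object, so Tag methods can be used on it.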
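Because the pattern is applied as a search, an unanchored pattern matches anywhere in the tag name, which is why Example 2 needs the "^". A small sketch with made-up HTML, using the built-in "html.parser":

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p><b>x</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: every tag whose name contains a "t" anywhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]
print(contains_t)

# Anchored with ^: only names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
print(starts_with_b)
```

The first search picks up html and title; the second picks up body and b, in document order.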

 

Lists

If you pass a list, Beautiful Soup returns everything that matches any item in the list.
Example 3: find all <a> tags and <b> tags in the document

from bs4 import BeautifulSoup  # import the bs4 library

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
tag_a_b = soup.find_all(["a", "b"])  # this also returns a list

print(tag_a_b,type(tag_a_b))

"""
[<b>The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 

<class 'bs4.element.ResultSet'>
"""

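Notes on the example above:
1. To search for several tags at once, put the names in a list and pass the list to find_all() as the filter.

2. The result is a ResultSet, which behaves like a list of all the matching tags; each element in it is still a Tag object.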
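A short sketch of what the list filter gives back, again with made-up HTML and the built-in parser. It shows that the results come in document order rather than in the order of the list, and that the ResultSet can be used like an ordinary list:

```python
from bs4 import BeautifulSoup

html = '<p><b>Bold</b> and <a href="/x">link</a> and <i>italic</i></p>'
soup = BeautifulSoup(html, "html.parser")

result = soup.find_all(["a", "b"])

# Document order: <b> comes before <a> in the HTML, so it comes first here.
names = [tag.name for tag in result]
print(names)

# ResultSet is a list subclass, so indexing and len() work as usual.
print(type(result).__name__, isinstance(result, list), len(result))
```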

 

True

True matches any value. The code below finds every tag in the document but returns none of the text strings.
Example 4:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
tag = soup.find_all(True)
print(tag)

# I don't expect to use this filter often, so it is enough to know that it exists

 

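Functions

1. If none of the other filters fits, you can define a function that takes a single element as its argument. It should return True if the element matches and False otherwise.

2. The element argument is a tag node from the HTML document; it cannot be a text node.

Example 5: a function that returns True for tags that have a class attribute but no id attribute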

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup = BeautifulSoup(html,"lxml")

tag = soup.find_all(has_class_but_no_id)  # pass the function itself to find_all()
print(tag)

"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,

<p class="story">...</p>]
"""

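Note:
The search condition is "has a class attribute but no id attribute", and only the three <p> tags satisfy it. The <a> tags do not match, but they are nested inside a <p> tag, so they appear when that <p> is output.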
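Two more function filters, as a sketch. The function names has_two_attributes and ends_with_elsie and the HTML are made up for illustration. The first function receives a tag, like the one in Example 5; the second is passed as the value of a keyword argument, so it receives the attribute value instead (None when the tag has no such attribute).

```python
from bs4 import BeautifulSoup

html = """
<p class="title" name="dromouse"><b>Heading</b></p>
<p class="story">Body text</p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
"""
soup = BeautifulSoup(html, "html.parser")

def has_two_attributes(tag):
    # tag.attrs is a dict of the tag's attributes
    return len(tag.attrs) == 2

def ends_with_elsie(href):
    # href is None for tags that have no href attribute
    return href is not None and href.endswith("elsie")

two_attr_names = [tag.name for tag in soup.find_all(has_two_attributes)]
elsie_ids = [tag["id"] for tag in soup.find_all(href=ends_with_elsie)]
print(two_attr_names, elsie_ids)
```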

 

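find_all() method

Syntax:

find_all(name, attrs, recursive, text, limit, **kwargs)

Description:

1. find_all() looks through all the descendants of the current tag and keeps those that match the filters.

2. The filters described above are the values you can pass in. The parameters described below decide what those values are compared against: the tag name, the tag's attributes, or its text.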
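The recursive parameter in the signature is not covered in the sections that follow. As a small sketch with made-up HTML: by default find_all() examines every descendant of the tag it is called on, while recursive=False limits the search to direct children.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# <title> is a grandchild of <html>, so the default search finds it.
all_levels = soup.html.find_all("title")

# With recursive=False only direct children are examined: <head> yes, <title> no.
direct_title = soup.html.find_all("title", recursive=False)
direct_head = soup.html.find_all("head", recursive=False)

print(len(all_levels), len(direct_title), len(direct_head))
```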


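name parameter

1. The name parameter finds every tag whose name matches; text strings are ignored automatically.

2. The value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.
Example 6: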

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_title = soup.find_all("title")
print(tag_title)

tag_a = soup.find_all("a")
print(tag_a)

"""
[<title>The Dormouse's story</title>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

Note:
As the output shows, this is the same thing as the filters described earlier: the value of the name parameter can be any kind of filter.

 

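Keyword arguments

Any named argument that is not one of the built-in search parameters is treated as a filter on the tag attribute of that name.
For instance, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute.
Example 7: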

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_link = soup.find_all(id="link2")  # filter on the id attribute
print(tag_link)

tag_href = soup.find_all(href=re.compile("example"))  # filter on the href attribute
print(tag_href)

tag_True = soup.find_all(id=True)  # True: any tag that has an id attribute
print(tag_True)

tag_all = soup.find_all(href=re.compile("example"), id='link1')  # several keyword arguments at once
print(tag_all)

tag_class = soup.find_all(class_="sister")  # filter on the class attribute
print(tag_class)

"""
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

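Note:
The example shows several ways to search with keyword arguments. The value for a named attribute can be a string, a regular expression, a list, or True, and the arguments can be combined freely.
    1. id keyword: with an argument named id, Beautiful Soup filters on each tag's "id" attribute
    2. href keyword: with an argument named href, Beautiful Soup filters on each tag's "href" attribute
    3. True as the value: id=True finds every tag in the tree that has an id attribute, whatever its value
    4. Several keywords together: passing more than one named argument filters on several attributes at once
    5. class keyword: class is a reserved word in Python, so the argument is written with a trailing underscore, class_
    6. Combining several kinds of filter narrows the search and makes the matches more precise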
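Some attributes cannot be searched with a keyword argument. The sample document has a name="dromouse" attribute, but name is already the tag-name parameter of find_all(), and an attribute such as data-foo is not a valid Python identifier. For those cases the attrs parameter takes a dictionary. A sketch with made-up HTML (the data-foo attribute is only an illustration):

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="dromouse">Heading</p><div data-foo="value">foo!</div>'
soup = BeautifulSoup(html, "html.parser")

# name= is the tag-name parameter, so this looks for a <dromouse> tag.
by_keyword = soup.find_all(name="dromouse")

# attrs takes a dict, so any attribute name can be used as a key.
by_attrs = soup.find_all(attrs={"name": "dromouse"})
by_data = soup.find_all(attrs={"data-foo": "value"})

print(by_keyword, by_attrs, by_data)
```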

 

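Searching by CSS class

1. Searching for tags by CSS class is very useful, but the attribute name class is a reserved word in Python, so using class as a keyword argument is a syntax error. Since Beautiful Soup 4.1.1 you can search by CSS class with the class_ argument instead (as already shown in the example above).

Example 8: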

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_class_1 = soup.find_all(class_="sister", id="link3")  # combine class_ with id
print(tag_class_1)

"""
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

2. The class_ argument accepts the same kinds of filter: a string, a regular expression, a function, or True
Example 8_1:

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_class_1 = soup.find_all(class_=re.compile("itl"))
print(tag_class_1)

#[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
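One thing the examples above do not show is a tag with more than one CSS class. A sketch, using a made-up tag with two classes: class_ matches against each class separately, and also against the exact attribute string, but not against the same classes written in a different order.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

# Either class on its own matches the tag.
by_one_class = soup.find_all("p", class_="strikeout")

# The exact attribute string matches too...
by_exact_string = soup.find_all("p", class_="body strikeout")

# ...but the same classes in a different order do not.
by_reordered = soup.find_all("p", class_="strikeout body")

# A CSS selector matches both classes regardless of order.
by_selector = soup.select("p.strikeout.body")

print(len(by_one_class), len(by_exact_string), len(by_reordered), len(by_selector))
```

When the order of the classes is not known in advance, the select() form is the safer choice.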

 

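text parameter

1. The text parameter searches the strings in the document rather than the tags. Like name, it accepts a string, a regular expression, a list, or True. (Since Beautiful Soup 4.4.0 this parameter is called string; text is the older name.)

Example 9: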

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

more = soup.find_all(text=["Tillie", "Elsie", "Lacie"])
print(more)

story_strings = soup.find_all(text=re.compile("story"))  # renamed from all, which shadows the built-in all()
print(story_strings)

"""
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
"""

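2. Although text is meant for searching strings, it can be combined with other arguments to filter tags: Beautiful Soup then finds the tags whose .string attribute matches the text value
Example 9_1: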

from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

tag_a = soup.find_all("a", text="Tillie")
print(tag_a)

#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
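The same two searches written with the newer spelling of the parameter, string, which Beautiful Soup has accepted since 4.4.0. This is a sketch on made-up HTML; it also shows that a string-only search returns the strings themselves, each of which still knows its parent tag.

```python
import re
from bs4 import BeautifulSoup

html = """<p><b>The Dormouse's story</b></p><a id="link3">Tillie</a>"""
soup = BeautifulSoup(html, "html.parser")

# On its own, string= returns the matching strings, not tags.
strings = soup.find_all(string=re.compile("story"))
print(strings, strings[0].parent.name)

# Combined with a tag name, it returns the tags whose .string matches.
tags = soup.find_all("a", string="Tillie")
print(tags)
```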

 

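limit parameter

find_all() returns every result, so on a large document tree the search can be slow. If you do not need all the results, pass limit to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns what it has found.

Example 10: three tags in the document match the search, but only two are returned because the number of results is limited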

soup.find_all("a", limit=2)

"""
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
"""

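Example 10 relies on a soup object built earlier. A self-contained version, as a sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

# Three <a> tags match, but the search stops after the first two.
first_two = soup.find_all("a", limit=2)
ids = [a["id"] for a in first_two]
print(ids)
```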
 

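find() method

Syntax:

find( name , attrs , recursive , text , **kwargs )

Description
1. find_all() returns every tag in the document that matches, but sometimes you only want one result. If the document has only one <body> tag, for example, scanning for all of them with find_all() is not a good fit, and calling find_all() with limit=1 is clumsier than simply calling find().
Example 11: the two calls below are nearly equivalent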

import re
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

print(soup.find_all('title', limit=1))  # returns a list


print(soup.find('title'))  # returns a Tag

"""
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
"""

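Notes:
1. The only difference between the two calls is that find_all() returns a list containing a single element (without limit it would contain every matching tag), while find() returns the result directly.

2. When nothing matches, find_all() returns an empty list and find() returns None.

3. As the output shows, find_all() returns a list that has to be iterated over to get at each Tag object, whereas find() returns a Tag object directly.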
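The None result matters in practice, because calling a Tag method on None raises AttributeError. A sketch of one way to guard against it; the HTML and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

missing_all = soup.find_all("table")  # nothing matches: empty list
missing_one = soup.find("table")      # nothing matches: None

# Check for None before using the result of find().
caption = missing_one.get_text() if missing_one is not None else ""

title = soup.find("title")
title_text = title.get_text() if title is not None else ""

print(missing_all, missing_one, repr(caption), title_text)
```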

Example:

from bs4 import BeautifulSoup  # import the bs4 library

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser and create the BeautifulSoup object


p_string = soup.p.string
print("string read directly from the tag:", p_string)


p = soup.find_all("p")
print("tags:", p)
for i in p:
    print("find the tags first, then read each tag's string:", i.string)
    print("find the tags first, then read an attribute value:", i["class"])
    print(i.get("class"))

"""string read directly from the tag: The Dormouse's story
tags: [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
find the tags first, then read each tag's string: The Dormouse's story
find the tags first, then read an attribute value: ['title']
['title']
find the tags first, then read each tag's string: None
find the tags first, then read an attribute value: ['story']
['story']
find the tags first, then read each tag's string: ...
find the tags first, then read an attribute value: ['story']
['story']
"""

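Notes:

1. find_all() returns a list of all the Tag objects that match; iterate over it to get at each individual Tag.

2. find() returns the first Tag object that matches, directly as a Tag.

3. Ways to read the string inside a tag:

⑴ Use soup.tagname.string directly: this gives the string of the first matching tag only

⑵ Find all the matching Tag objects first, then use .string on each one: this gives the strings of every matching tag

4. When parsing an XML document, specify the parser as "xml" rather than "lxml". "lxml" is an HTML parser and treats the document as HTML, which can fail or give wrong results.

5. For both HTML and XML documents, the main pieces are:

⑴ Tags, with their attributes and attribute values (key: value). Once you have a Tag object, you can read a specific attribute value from it with dictionary-style access

⑵ Strings, the text between a tag's opening and closing tags, read as described in note 3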
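In the output above, .string is None for the second <p> because that tag has several children. A sketch of the difference between .string and get_text() in that case, together with the dictionary-style attribute access from note 5; the HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>Bold</b></p><p class="story">Once <a href="/x">link</a> end</p>'
soup = BeautifulSoup(html, "html.parser")

title_p, story_p = soup.find_all("p")

# A single chain of children: .string reaches through <b> to its text.
print(title_p.string)

# Several children: .string is None...
print(story_p.string)

# ...but get_text() joins all the text inside the tag.
print(story_p.get_text())

# class can hold several values, so it comes back as a list; href is a plain string.
print(story_p["class"], story_p.a["href"])
```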

 

 

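CSS selectors

1. Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or of the BeautifulSoup object to find tags using CSS selector syntax

2. CSS selectors are a separate document-search syntax of their own; see http://www.w3school.com.cn/css/css_selector_type.asp

3. There are many kinds of CSS selector. These notes cover one very common workflow; for the rest, see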

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id87

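Step 1: On the original web page, press F12 to open the developer tools, select the element you need, then right-click -> Copy -> Copy selector to copy the path of that tag


Step 2: Paste the path into any text file (copying a few of them makes it easier to compare). For example: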

    ⑴#mainBox > main > div.article-list > div:nth-child(4) > h4 > a
    ⑵#mainBox > main > div.article-list > div:nth-child(5) > h4 > a

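Step 3: Comparing the paths from step 2, the only part that differs is ":nth-child(num)". Delete that part, from the colon to the closing parenthesis, to get a path that matches all the items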

#mainBox > main > div.article-list > div > h4 > a
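The effect of step 3 can be tried offline. This is a sketch on a made-up HTML fragment that only imitates the structure in the selectors above; the hrefs and link text are invented.

```python
from bs4 import BeautifulSoup

# Made-up fragment with the same nesting as the copied selectors.
html = """
<div id="mainBox"><main><div class="article-list">
  <div><h4><a href="/article/1">First post</a></h4></div>
  <div><h4><a href="/article/2">Second post</a></h4></div>
  <div><h4><a href="/article/3">Third post</a></h4></div>
</div></main></div>
"""
soup = BeautifulSoup(html, "html.parser")

# As copied from the browser: pinned to a single item by :nth-child.
one = soup.select("#mainBox > main > div.article-list > div:nth-child(2) > h4 > a")

# With :nth-child removed: every item in the list.
every = soup.select("#mainBox > main > div.article-list > div > h4 > a")

one_hrefs = [a["href"] for a in one]
all_hrefs = [a["href"] for a in every]
print(one_hrefs, all_hrefs)
```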

Example 12:

import requests
from bs4 import BeautifulSoup

url = 'https://blog.csdn.net/qq_39********'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url,headers = header)

# parse with the built-in html.parser; slower, but works everywhere
soup = BeautifulSoup(html.text,'html.parser')

tag = soup.select("#mainBox > main > div.article-list > div > h4 > a")
print(tag)

 

Extension

Test HTML

<div class="postlist">
        <ul id="pins">
                  <li><a href="https://www.mzitu.com/198830" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/198830_14a48_236.jpg" 
           <li><a href="https://www.mzitu.com/189169" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/189169_11b11_236.jpg" 
         
           <li><a href="https://www.mzitu.com/190884" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190884_20c59_236.jpg" 
           <li><a href="https://www.mzitu.com/190416" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190416_18d11_236.jpg" 
           <li><a href="https://www.mzitu.com/190947" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190947_21a24_236.jpg" 
           <li><a href="https://www.mzitu.com/190259" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/190259_18a38_236.jpg"
           <li><a href="https://www.mzitu.com/195585" target="_blank"><img class="lazy" src="https://i.meizitu.net/thumbs/2019/08/195585_16a41_236.jpg" 
           <li><a href="https://www.mzitu.com/190177" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190177_16e34_236.jpg"
           <li><a href="https://www.mzitu.com/191199" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/191199_22c42_236.jpg" 
           <li><a href="https://www.mzitu.com/190636" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190636_19c12_236.jpg"
           <li><a href="https://www.mzitu.com/191054" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/191054_21c15_236.jpg" 
           <li><a href="https://www.mzitu.com/190302" target="_blank"><img class="lazy" src="https://i.meizitu.net/pfiles/img/lazy.png" data-original="https://i.meizitu.net/thumbs/2019/08/190302_18b26_236.jpg" 
      
    <nav class="navigation pagination" role="navigation">
        
        <div class="nav-links"><span aria-current="page" class="page-numbers current">1</span>
<a class="page-numbers" href="https://www.mzitu.com/page/2/">2</a>
<a class="page-numbers" href="https://www.mzitu.com/page/3/">3</a>
<a class="page-numbers" href="https://www.mzitu.com/page/4/">4</a>
<span class="page-numbers dots">…</span>
<a class="page-numbers" href="https://www.mzitu.com/page/228/">228</a>
<a class="next page-numbers" href="https://www.mzitu.com/page/2/">下一頁»</a></div>
    </nav>    </div>

Example 13:

import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url,headers = header)

# parse with the built-in html.parser; slower, but works everywhere
soup = BeautifulSoup(html.text,'html.parser')

# Method 1
"""
# What we want are all the <a> tags inside the first div with class="postlist"
all_a = soup.find('div',class_='postlist').find_all('a',target='_blank')

for a in all_a:
    print(a["href"])
"""
# Method 2
all_a = soup.find_all('a',target="_blank")

for a in all_a:
    print(a["href"])

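Note: the example uses two ways of finding tags that match ('a', target="_blank"), and the two produce different output

1. An HTML page may contain tags that match the search condition but hold information we do not actually need

2. Looking at the page's HTML, everything we need sits under a tag <div class="postlist">. So we can first use find() to get that Tag object and then call find_all() on it to look for the tags we want. Tags outside <div class="postlist"> that would match the Method 2 search are then not returned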
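The difference between the two methods can be reproduced without a network request. This is a sketch on a made-up page; the sidebar block and all hrefs are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page: one link outside the post list, two inside it.
html = """
<div class="sidebar"><a href="/ad" target="_blank">Ad</a></div>
<div class="postlist">
  <ul id="pins">
    <li><a href="/post/1" target="_blank">Post 1</a></li>
    <li><a href="/post/2" target="_blank">Post 2</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Whole-document search: also picks up the sidebar link.
everywhere = [a["href"] for a in soup.find_all("a", target="_blank")]

# Scoped search: find the container first, then search inside it.
container = soup.find("div", class_="postlist")
scoped = []
if container is not None:  # find() returns None when nothing matches
    scoped = [a["href"] for a in container.find_all("a", target="_blank")]

print(everywhere)
print(scoped)
```

The None check is there because on a real page the container may be missing, for example when the site changes its layout.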

 

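Notes:

1. From my own experience so far, the form I use most is find_all(tag name, keyword arguments). The same search also works as find_all(tag name) alone; adding keyword arguments just makes the match more precise

2. These notes follow the official BeautifulSoup documentation. They are only a record of my own study, kept so I can look things up later, and they are bound to contain mistakes and omissions. If you happen to read them, please bear with that; you can also go straight to the official documentation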

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id87


 
