beautifulsoup4 (bs4): the find_all and find functions explained

Original article

2019-07-30 15:05

Assume soup is the parsed object for a web page we have already downloaded.

from bs4 import BeautifulSoup

soup = BeautifulSoup(a, "html.parser")  # a is the HTML string of the downloaded page

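# Method 1: pass the attribute name directly as a keyword argument.
# This does not work for every attribute, for example hyphenated names like a-b.
soup.find_all('p', id='p1')  # the usual case
soup.find_all('p', class_='p3')  # class is a Python reserved word, so append an underscore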

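# The most general method: pass a dictionary through attrs
soup.find_all('p', attrs={'class': 'p3'})  # matches any tag that has this attribute; other attributes may also be present
soup.find_all('p', attrs={'class': 'p3', 'id': 'pp'})  # match on several attributes at once
soup.find_all('p', attrs={'class': 'p3', 'id': False})  # require that an attribute is absent
soup.find_all('p', attrs={'id': ['p1', 'p2']})  # attribute value is p1 or p2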

# Matching with a regular expression
import re
soup.find_all('p', attrs={'id': re.compile('^p')})  # id starts with "p"
soup.find_all('p', attrs={'class': True})  # any tag that has a class attribute, whatever its value
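
The filters above can be tried on a small document. The following is only a sketch: the sample HTML, the `data-role` attribute, and the variable names are made up for illustration and do not come from the page discussed here.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration
html = '''
<p id="p1">one</p>
<p id="p2" class="p3">two</p>
<p class="p3 extra" data-role="note">three</p>
<p>four</p>
'''
soup = BeautifulSoup(html, "html.parser")

# Keyword form: fine for names that are valid Python identifiers
by_id = soup.find_all('p', id='p1')
by_class = soup.find_all('p', class_='p3')

# data-role is not a valid identifier (data-role='note' as a keyword is a
# SyntaxError), so it has to go through the attrs dictionary
by_data = soup.find_all('p', attrs={'data-role': 'note'})

# Regular expression, True, and list-of-values filters
by_regex = soup.find_all('p', attrs={'id': re.compile('^p')})
has_class = soup.find_all('p', attrs={'class': True})
id_in_list = soup.find_all('p', attrs={'id': ['p1', 'p2']})

for label, result in [('id', by_id), ('class', by_class), ('data-role', by_data)]:
    print(label, [p.get_text() for p in result])
```

Note that `class_='p3'` also matches the third paragraph, whose class attribute is `p3 extra`: class is multi-valued, and matching any one of its values is enough.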

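Matching by the content inside a tag

This part still uses find_all, with the additional text argument. (Since Beautiful Soup 4.4.0 the same argument is also available under the name string; text is the older spelling.)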

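a = '''
<p id='p1'>Paragraph 1</p>
<p class='p3'>Paragraph 2</p>
<p class='p3'>Article</p>
<p></p>
'''

soup = BeautifulSoup(a, "html.parser")

soup.find_all('p', text='Article')  # exact match on the tag's string
soup.find_all('p', text=['Paragraph 1', 'Paragraph 2'])  # any of the listed strings

# Regular expression
import re
soup.find_all('p', text=re.compile('Paragraph'))
soup.find_all('p', text=True)  # any <p> that has a string at all

# Passing a function: it is called with the tag's string and returns True to keep the tag
def nothing(c):
    return c not in ['Paragraph 1', 'Paragraph 2', 'Article']
soup.find_all('p', text=nothing)

# Same as above, this time testing for a missing string directly
def nothing(c):
    return c is None
soup.find_all('p', text=nothing)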
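
One case the text filter does not cover is a tag with mixed content. The sketch below uses made-up sample HTML and variable names, and uses the `string=` spelling of the argument. It shows that a `<p>` containing a nested tag has no single string, so the text filter skips it, and that a function receiving the whole tag is one possible workaround.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; the last <p> mixes text with a nested <b>
html = '''
<p id="p1">Paragraph 1</p>
<p class="p3">Paragraph 2</p>
<p class="p3">Article</p>
<p>Paragraph <b>3</b></p>
'''
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all('p', string='Article')
by_regex = soup.find_all('p', string=re.compile('Paragraph'))

# The last <p> has two children (a string and a <b>), so its .string is None
# and the string filter above does not see it. A function passed as the first
# argument receives the tag itself and can inspect its full text instead.
by_full_text = soup.find_all(
    lambda tag: tag.name == 'p' and 'Paragraph' in tag.get_text()
)

print([p.get_text() for p in by_regex])
print([p.get_text() for p in by_full_text])
```

Whether to filter on the tag's string or on `get_text()` depends on the page: the first is stricter, the second also looks inside nested tags.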

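Matching by position

Find the i-th a tag
Find the i-th and j-th a tags

Sometimes three tags are identical in every respect: same attributes, same structure (the content may differ, but it is of the same kind), and we only want the second one. Attribute and content filters cannot tell them apart, but the tag's position may be distinctive, so we can select by position instead. In practice, use find_all to get the list of matches, then pick from that list by index.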
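
A minimal sketch of selecting by position, assuming three otherwise identical links. The HTML, the index values, and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: three links that differ only in position and href
html = '''
<a href="/first">link</a>
<a href="/second">link</a>
<a href="/third">link</a>
'''
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all('a')  # list-like result, in document order

i, j = 1, 2  # zero-based positions, chosen for illustration
second = links[i]
second_and_third = [links[k] for k in (i, j)]

# Guard against pages that contain fewer matches than expected
fourth = links[3] if len(links) > 3 else None

# limit stops the search early when only the first few matches are needed
first_two = soup.find_all('a', limit=2)

print(second['href'])
```

Indexing is zero-based, so the "second" tag is `links[1]`. Checking the length first avoids an IndexError when the page layout changes.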

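Usage

Attributes and methods on a tag

.name: the tag's name (its type)

.attrs: a dictionary of all the tag's attributes; very handy when building a find query from a tag's features

.has_attr(name): checks whether the tag has the given attribute, returns True or False (rarely needed)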

a = '''
<body>
    <h><a href='www.biaoti.com'>Title</a></h>
    <p>Paragraph 1</p>
    <p></p>
</body>
'''
soup = BeautifulSoup(a, 'html.parser')
for i in soup.body.find_all(True):  # True matches every tag
    print(i.name)  # the tag's name
    print(i.attrs)  # dictionary of all the tag's attributes
    print(i.has_attr('href'))  # whether the tag has an href attribute
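
The same three accessors can be collected into plain Python values, which makes the output easier to inspect. This is a sketch under the same sample markup; `summary`, `hrefs`, `first_link`, and `missing` are names chosen here. It also shows `find`, which takes the same filters as `find_all` but returns only the first match, or None when nothing matches.

```python
from bs4 import BeautifulSoup

html = '''
<body>
    <h><a href='www.biaoti.com'>Title</a></h>
    <p>Paragraph 1</p>
    <p></p>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

tags = soup.body.find_all(True)  # every tag under <body>, in document order

# (name, attribute dictionary, has href?) for each tag
summary = [(t.name, t.attrs, t.has_attr('href')) for t in tags]

# tag.get(...) is a shortcut that returns None instead of raising
# when the attribute is missing
hrefs = [t.get('href') for t in tags]

# find returns a single tag or None, so check the result before using it
first_link = soup.find('a')
missing = soup.find('table')

for row in summary:
    print(row)
```

Because `find` can return None, code such as `soup.find('table').attrs` fails on pages without a table; testing the result first avoids that.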
