BeautifulSoup4

Original

2020-02-25 19:12

Basic usage

from bs4 import BeautifulSoup

html = """
<div>test</div>
"""
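# The second argument selects the parser (if omitted, bs4 picks the best one installed):
# html.parser is built into Python and needs no extra install, but is less forgiving of malformed markup
# lxml is fast and tolerant of malformed markup, but requires the lxml package (a C extension); it is the usual choice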
bs = BeautifulSoup(html, 'lxml')
print(bs.prettify())
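
Since lxml is a separate install, a script may want to fall back to the built-in parser when it is missing. The following is a minimal sketch under that assumption; `make_soup` is a hypothetical helper name, not part of bs4, while `FeatureNotFound` is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    Hypothetical helper for illustration only.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<div>test</div>")
div_text = soup.div.get_text()
print(div_text)
```

Note that different parsers can build slightly different trees for the same input (lxml wraps fragments in html and body tags, html.parser does not), so code that depends on exact structure should pin one parser.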

Extracting elements

from bs4 import BeautifulSoup

html = """
<tr>
    <td>1</td>
    <td>2</td>
</tr>
<tr class='even'>
    <td>1</td>
    <td>2</td>
</tr>
<a class='test' id='test' href="www.baidu.com">2</a>
<a href="www.baidu.com"></a>
"""
soup = BeautifulSoup(html, 'lxml')
# 1. Get all tr tags
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
# 2. Get the second tr tag
# limit caps the number of elements returned
tr = soup.find_all('tr', limit=2)[1]
# 3. Get all tr tags whose class is even (two equivalent forms)
trs = soup.find_all('tr', class_='even')
trs = soup.find_all('tr', attrs={'class': 'even'})
print(trs)
# 4. Extract every a tag whose id is test and whose class is also test
aList = soup.find_all('a', id='test', class_='test')
# or, equivalently
aList = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})
print(aList)
# 5. Get the href attribute of every a tag
aList = soup.find_all('a')
for a in aList:
    # via subscript access (raises KeyError if the attribute is missing)
    href = a['href']
    print(href)
    # via the attrs dictionary
    href = a.attrs['href']
    print(href)
# 6. Get the plain text
trs = soup.find_all('tr')
for tr in trs:
    print(tr)
    print(tr.string)
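# .string is None when a tag has more than one child, as each tr does here, so it cannot return the text
# 7. Get all the text under each tr tag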
trs = soup.find_all('tr')
for tr in trs:
    print(list(tr.stripped_strings))
# find() vs find_all()
# find() returns the first matching tag (or None if nothing matches); find_all() returns all matching tags as a list.
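
The differences between find() and find_all(), and between .string and stripped_strings, are easier to see on concrete values. This is a small sketch using made-up markup and the built-in html.parser (chosen here only so the example runs without lxml); `a.get('href')` is shown as a way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>1</td><td>2</td></tr>
  <tr class="even"><td>3</td><td>4</td></tr>
</table>
<a href="www.baidu.com">link</a>
<a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_tr = soup.find("tr")           # first match only
all_trs = soup.find_all("tr")        # every match, list-like
missing = soup.find("span")          # no match: None
missing_all = soup.find_all("span")  # no match: empty list

tr_string = first_tr.string               # None: tr has two td children
td_string = first_tr.td.string            # single text child, so a string
texts = list(first_tr.stripped_strings)   # all text below tr, whitespace stripped

# .get() returns None instead of raising KeyError when href is absent
hrefs = [a.get("href") for a in soup.find_all("a")]

print(len(all_trs), missing, tr_string, td_string, texts, hrefs)
```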

select

from bs4 import BeautifulSoup

html = ''
soup = BeautifulSoup(html, 'lxml')
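# 1. Select by tag name
p = soup.select('p')
# 2. Select by class name
p = soup.select('.className')
# 3. Select by id
p = soup.select('#idName')
# 4. Select with combinators: descendant (space) and direct child (>)
p = soup.select('.box p')
p = soup.select('.box>p')
# 5. Select by attribute value
p = soup.select('a[name="a"]')
# 6. When selecting by class name or id, prefix the tag name to also filter by tag
p = soup.select('div.line')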
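
Because the snippet above runs against an empty document, every select() call returns an empty list. The sketch below applies the same kinds of selectors to a small made-up document so the results are visible; the markup and the use of html.parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="box" id="main">
  <p class="line">direct child</p>
  <section><p>nested</p></section>
</div>
<p class="line">outside</p>
<a name="a" href="#top">anchor</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: any p anywhere inside .box
descendants = [p.get_text() for p in soup.select(".box p")]
# Child combinator: only p elements that are direct children of .box
children = [p.get_text() for p in soup.select(".box > p")]
# Attribute selector
by_attr = soup.select('a[name="a"]')
# Tag name combined with class name
by_tag_and_class = [p.get_text() for p in soup.select("p.line")]
# select_one() returns the first match (or None) instead of a list
main = soup.select_one("#main")

print(descendants, children, len(by_attr), by_tag_and_class, main["class"])
```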

Four common objects

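Tag: every tag in BeautifulSoup is a Tag object, and the BeautifulSoup object itself is a kind of Tag, so methods such as find() and find_all() actually belong to Tag rather than to BeautifulSoup.
NavigableString: subclasses Python's str and can be used just like a str.
Comment: a subclass of NavigableString.
BeautifulSoup: subclasses Tag; it represents the whole document and is used to build the BeautifulSoup tree.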
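
The inheritance relationships listed above can be checked with isinstance. A short sketch on a made-up fragment, parsed with html.parser so that no extra install is assumed:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
p = soup.p
# The p tag has two children: a text node and a comment node
text_node, comment_node = p.contents

checks = {
    "soup is a Tag": isinstance(soup, Tag),
    "p is a Tag": isinstance(p, Tag),
    "text is a NavigableString": isinstance(text_node, NavigableString),
    "text is a str": isinstance(text_node, str),
    "comment is a Comment": isinstance(comment_node, Comment),
    "comment is a NavigableString": isinstance(comment_node, NavigableString),
}
for label, result in checks.items():
    print(label, result)
```

Because Comment is a NavigableString, code that collects text nodes may want to test for Comment first if comments should be skipped.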

Traversal

Both attributes return the direct children of a tag, including strings.

contents: returns a list
children: returns an iterator
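
A brief sketch of the two attributes on a made-up list fragment (html.parser is assumed so the example needs no extra install). It shows that both yield the same direct children, strings included, and that only contents supports indexing.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<ul><li>a</li>text<li>b</li></ul>", "html.parser")
ul = soup.ul

contents = ul.contents   # a real list: supports len() and indexing
children = ul.children   # an iterator: consume it with a loop or list()

# Classify each direct child as a tag or a bare string
kinds = ["tag" if isinstance(child, Tag) else "string" for child in ul.children]

first_child = contents[0]
same_items = list(ul.children) == contents

print(len(contents), kinds, first_child, same_items)
```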
