6. Using BeautifulSoup
- Beautiful Soup is a library for extracting data from HTML and XML documents
6.1 Basic usage
Method | Purpose |
---|---|
BeautifulSoup(html_doc, 'lxml') | build a BeautifulSoup object |
bs.prettify() | pretty-print the document |
bs.title | get a tag by its name (here, title) |
bs.title.name | get the tag's name |
bs.title.string | get the text inside the tag |
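A minimal sketch of the calls in the table above; the markup is illustrative, and Python's built-in html.parser is used here so the snippet runs without installing lxml.

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
bs = BeautifulSoup(html_doc, 'html.parser')  # build the soup object

print(bs.prettify())     # the document, pretty-printed with indentation
print(bs.title)          # <title>The Dormouse's story</title>
print(bs.title.name)     # title
print(bs.title.string)   # The Dormouse's story
```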
6.2 The bs4 object types
Object | Description |
---|---|
Tag | a tag |
NavigableString | a navigable string (the text inside a tag) |
BeautifulSoup | the soup object for the whole document |
Comment | a comment |
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# print(type(soup)) # <class 'bs4.BeautifulSoup'>
#
# print(type(soup.title)) # <class 'bs4.element.Tag'>
# print(type(soup.a)) # <class 'bs4.element.Tag'>
# print(type(soup.p)) # <class 'bs4.element.Tag'>
#
# print(soup.p.string) # The Dormouse's story
# print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>
title_tag = soup.p
print(title_tag)
print(title_tag.name)
print(title_tag.string)
html_comment = '<a><!-- this is a comment --></a>'
soup = BeautifulSoup(html_comment,'lxml')
print(soup.a.string)
print(type(soup.a.string)) # <class 'bs4.element.Comment'>
6.3 Tree traversal: child nodes
bs4 supports three kinds of operations: traversing the tree, searching it, and modifying it
- contents, children, descendants
- contents returns a list
- children returns an iterator, which can be looped over
- descendants returns a generator that walks all descendants (children, grandchildren, and so on)
- string, strings, stripped_strings
- string gets the text inside a tag
- strings returns a generator used to get the text of multiple tags
- stripped_strings works like strings, but removes the extra whitespace
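The attributes above can be contrasted in one short runnable sketch (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div><p>  hello </p><p><b>world</b></p></div>", 'html.parser')
div = soup.div

print(type(div.contents))                                       # a plain list
print([t.name for t in div.contents])                           # direct children: two p tags
print([t.name for t in div.descendants if isinstance(t, Tag)])  # p, p, and the nested b
print(list(div.strings))                                        # raw text, whitespace kept
print(list(div.stripped_strings))                               # text with whitespace stripped
```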
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# tag
# print(soup.title)
# print(soup.p)
# print(soup.p.b)
# print(soup.a)
# all_p = soup.find_all('p')
#
# print(all_p)
# use ['attr'] to read a tag's attributes
title_tag = soup.p
print(title_tag['class'])  # ['title']
# contents returns a list
# children returns an iterator, which can be looped over
# iterating means visiting elements one by one; in Python, while and for
# loops implement iteration: for ... in ...
# descendants returns a generator that walks all descendants
# contents returns a list
# links = soup.contents
# print(type(links)) # <class 'list'>
# print(links)
# children returns an iterator, which can be looped over
html = '''
<div>
<a href='#'>Baidu</a>
<a href='#'>Alibaba</a>
<a href='#'>Tencent</a>
</div>
'''
# we want the data under the div tag
soup2 = BeautifulSoup(html,'lxml')
# links = soup2.contents
# print(type(links))
# print(links)
# for i in links:
#     print(i)
# links = soup2.div.children
# print(type(links)) # <class 'list_iterator'>
# for link in links:
#     print(link)
# descendants returns a generator that walks all descendants
# print(len(soup.contents))
# print(len(soup.descendants)) # TypeError: object of type 'generator' has no len()
# for x in soup.descendants:
#     print('----------------')
#     print(x)
# string gets the text inside a tag
# strings returns a generator used to get the text of multiple tags
# stripped_strings works like strings, but removes the extra whitespace
# title_tag = soup.title
# print(title_tag)
# print(title_tag.string)
# head_tag = soup.head
# print(head_tag.string)
# print(soup.html.string)
# strings = soup.strings
# print(strings) # <generator object _all_strings at 0x000001D9053745C8>
# for s in strings:
#     print(s)
strings = soup.stripped_strings
for s in strings:
    print(s)
6.4 Tree traversal: parent nodes
- parent gets the direct parent node
- parents gets all ancestor nodes
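A tiny sketch of the two attributes (nested markup assumed for illustration); parents walks all the way up to the document object itself:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>hi</b></p></body></html>", 'html.parser')
b_tag = soup.b

print(b_tag.parent.name)                # p -- the direct parent
print([t.name for t in b_tag.parents])  # every ancestor, ending at '[document]'
```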
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# parent gets the direct parent node
# title_tag = soup.title
# print(title_tag)
# print(title_tag.parent)
# print(soup.html.parent)
# parents gets all the ancestor nodes
a_tag = soup.a
# print(a_tag)
# print(a_tag.parents) # <generator object parents at 0x0000025F937E9678>
for x in a_tag.parents:
    print(x)
    print('----------------')
6.5 Tree traversal: sibling nodes
- next_sibling : the next sibling node
- previous_sibling : the previous sibling node
- next_siblings : all following sibling nodes
- previous_siblings : all preceding sibling nodes
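A short sketch of sibling navigation. The markup is made up, and deliberately has no whitespace between the tags, so each sibling really is the next tag (with whitespace, next_sibling would be a text node first):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a>one</a><a>two</a><a>three</a></div>", 'html.parser')
first = soup.a

print(first.next_sibling)                       # <a>two</a>
print([a.string for a in first.next_siblings])  # both tags after it
last = soup.find_all('a')[-1]
print(last.previous_sibling)                    # <a>two</a>
```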
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# html = '<a><b>bbb</b><c>ccc</c></a>'
soup = BeautifulSoup(html_doc,'lxml')
#
# # print(soup.prettify())
# b_tag = soup.b
# print(b_tag)
# print(b_tag.next_sibling)
# c_tag = soup.c
# # print(c_tag.next_sibling)
# print(c_tag.previous_sibling)
a_tag = soup.a
# print(a_tag)
for x in a_tag.next_siblings:
    print(x)
6.6 Searching the tree
- string filter
- regular-expression filter: compile a pattern with re.compile() and pass it to find() or find_all() to search with a regex
- list filter
- True filter
- function filter
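Each of the five filter kinds can be passed straight to find_all(); a compact sketch against made-up markup:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>t</title><p class='x'>a</p><b>b</b>", 'html.parser')

print(soup.find_all('p'))                                  # string filter
print([t.name for t in soup.find_all(re.compile('^t'))])   # regex: names starting with t
print([t.name for t in soup.find_all(['p', 'b'])])         # list filter
print(len(soup.find_all(True)))                            # True filter: every tag
print([t.name for t in soup.find_all(lambda t: t.has_attr('class'))])  # function filter
```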
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# • string filter
# • regex filter: compile a pattern with re.compile() and pass it to find()/find_all()
# • list filter
# • True filter
# • function filter
# string filter
# a_tag2 = soup.a
# a_tags = soup.find_all('a')
# print(a_tags)
# regex filter: find every tag whose name starts with t
# print(soup.find_all(re.compile('^t')))
# list filter: find both p tags and a tags
# print(soup.find_all(['p','a']))
# print(soup.find_all(['title','b']))
# print(soup.find_all(True)) # True filter: matches every tag
# function filter: keep only tags that have a class attribute
def fn(tag):
    return tag.has_attr('class')
print(soup.find_all(fn))
6.7 Review
Method | Purpose |
---|---|
soup.prettify() | pretty-print the source |
soup.title | the whole title tag |
soup.title.name | the tag's name |
soup.title.string | the tag's text |
soup.contents | returns a list |
soup.div.children | returns an iterator |
soup.descendants | returns a generator that walks all descendants |
soup.string | gets one tag's text |
soup.strings | gets the text of all tags |
soup.stripped_strings | gets the text of all tags with extra whitespace removed |
soup.a.parent | the parent node of the a tag |
soup.a.previous_sibling | the previous sibling node |
soup.a.next_sibling | the next sibling node |
soup.a.next_siblings | all following sibling nodes |
soup.a.previous_siblings | all preceding sibling nodes |
6.8 find
Function | Purpose |
---|---|
find('tag', class_='value') | find a single tag |
find_all() | find all matching tags |
find_parents() | search all ancestors |
find_parent() | search a single parent |
find_next_siblings() | search all following siblings |
find_next_sibling() | search a single following sibling |
find_previous_siblings() | search all preceding siblings |
find_previous_sibling() | search a single preceding sibling |
find_all_next() | search all later elements |
find_next() | find a single later element |
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
name : the tag name
attrs : the tag's attributes
recursive : whether to search recursively
text : match by text content
limit : the maximum number of results to return
**kwargs : extra keyword arguments, passed by name
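How the individual parameters behave, sketched on a fragment of the example document (string= is the newer spelling of the text= parameter):

```python
from bs4 import BeautifulSoup

doc = ('<p class="story"><a class="sister" id="link1">Elsie</a>'
       '<a class="sister" id="link2">Lacie</a></p>')
soup = BeautifulSoup(doc, 'html.parser')

print(len(soup.find_all('a')))                            # name: 2 matches
print(len(soup.find_all('a', limit=1)))                   # limit: stop after 1
print(soup.find_all(id='link2')[0].string)                # **kwargs: match by attribute
print(len(soup.find_all('p', attrs={'class': 'story'})))  # attrs as a dict
print(len(soup.find_all('p', recursive=False)))           # recursive=False: top level only
print(soup.find_all(string='Elsie'))                      # string/text: match text nodes
```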
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# find_all(self, name=None, attrs={}, recursive=True, text=None,
#          limit=None, **kwargs)
# name : the tag name
# attrs : the tag's attributes
# recursive : whether to search recursively
# text : match by text content
# limit : the maximum number of results to return
# **kwargs : extra keyword arguments, passed by name
# a_tags = soup.find_all('a')
# p_tags = soup.find_all('p','title')
# print(soup.find_all(id = 'link1'))
# print(soup.find_all('a',limit=2))
# print(soup.a)
# print(soup.find('a'))
# print(soup.find_all('a',recursive=True))
# print(soup.find_all('a',limit=1)[0])
# print(soup.find('a'))
# find_parents() searches all ancestors
# find_parent() searches a single parent
# find_next_siblings() searches all following siblings
# find_next_sibling() searches a single following sibling
title_tag = soup.title
# print(title_tag.find_parent('head')) # <head><title>The Dormouse's story</title></head>
s = soup.find(text = 'Elsie')
# print(s.find_previous('p'))
# print(s.find_parents('p'))
# a_tag = soup.a
# print(a_tag)
# print(a_tag.find_next_sibling('a'))
# print(a_tag.find_next_siblings('a'))
# find_previous_siblings() searches all preceding siblings
# find_previous_sibling() searches a single preceding sibling
# find_all_next() searches all later elements
# find_next() finds a single later element
a_tag = soup.find(id='link3')
# print(a_tag)
# print(a_tag.find_previous_sibling())
# print(a_tag.find_previous_siblings())
p_tag = soup.p
# print(p_tag.find_all_next())
print(p_tag.find_next('a'))
6.9 Modifying the document tree
- modify a tag's name and attributes
- modify string: assigning to the string attribute replaces the tag's current content
- append() adds content to a tag, much like Python's list .append() method
- decompose() deletes a node; useful for removing unneeded parts of a page
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# 1. modify a tag's name and attributes
# tag_p = soup.p
# print(tag_p)
# tag_p.name = 'w' # change the tag name
# tag_p['class'] = 'content' # change an attribute
# print(tag_p)
# 2. modify string: assignment replaces the content
tag_p = soup.p
# print(tag_p.string)
# tag_p.string = '521 wo ai ni men'
# print(tag_p.string)
# 3. tag.append() adds content to a tag
# print(tag_p)
# tag_p.append('hahaha')
# print(tag_p)
# 4. decompose() removes a node from the tree
result = soup.find(class_ = 'title')
result.decompose()
print(soup)
6.10 Scraping data from a weather site
- Key points:
- find('div', class_='conMidtab') : get an element by tag name and attribute
- table.find_all('tr')[2:] : skip the first two tr rows
- enumerate returns two values: the index, and the element at that index
- besides lxml, BeautifulSoup can use other parsers, such as the built-in html.parser and html5lib
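The choice of parser matters mainly for broken markup: html5lib (pip install html5lib) repairs a page the way a browser would, which is why the scraper below uses it. A quick sketch with only the stdlib parser, so it runs without extra installs:

```python
from bs4 import BeautifulSoup

broken = "<p>one<p>two"   # unclosed tags, as often found in the wild
soup = BeautifulSoup(broken, 'html.parser')
print(soup.find_all('p')) # both p elements are still recovered
```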
import requests
from bs4 import BeautifulSoup

# parse one page of the site
def parse_page(url):
    response = requests.get(url)
    # decode manually to fix the encoding
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text, 'html5lib')  # pip install html5lib
    # parse the markup
    # 1. the div with class="conMidtab"
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)
    # 2. the tables inside it
    tables = conMidtab.find_all('table')
    # print(tables)
    for table in tables:
        # print(table)
        # 3. the tr rows, skipping the first two (header rows)
        trs = table.find_all('tr')[2:]
        # enumerate returns two values: the index, and the element at that index
        for index, tr in enumerate(trs):
            # print(tr)
            tds = tr.find_all('td')
            city_td = tds[0]  # the city cell
            if index == 0:
                # the first row of each province carries the province name in
                # the first cell, so the city sits in the second cell
                city_td = tds[1]
            # get the text of all of the tag's descendants
            city = list(city_td.stripped_strings)[0]
            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            print('city:', city, 'temperature:', temp)
            # break  # print Beijing first
    # 4. td
    # print(text)

def main():
    url = 'http://www.weather.com.cn/textFC/hb.shtml'    # North China
    # url = 'http://www.weather.com.cn/textFC/db.shtml'  # Northeast China
    url = 'http://www.weather.com.cn/textFC/gat.shtml'   # Hong Kong, Macao and Taiwan
    urls = ['http://www.weather.com.cn/textFC/hb.shtml',
            'http://www.weather.com.cn/textFC/db.shtml',
            'http://www.weather.com.cn/textFC/gat.shtml']
    for url in urls:
        parse_page(url)

if __name__ == '__main__':
    main()