6. Using BeautifulSoup
- Beautiful Soup is a library for extracting data from HTML and XML documents
6.1 Basic usage
Method | Purpose |
---|---|
BeautifulSoup(html_doc, 'lxml') | build a BeautifulSoup object |
bs.prettify() | pretty-print the document |
bs.title | get a tag by its name (here, title) |
bs.title.name | get the tag's name |
bs.title.string | get the text inside the tag |
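A minimal sketch of the calls in the table above; the markup is illustrative, and Python's built-in html.parser is used here so the snippet runs without installing lxml.

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
bs = BeautifulSoup(html_doc, 'html.parser')  # build the soup object

print(bs.prettify())     # the document, pretty-printed with indentation
print(bs.title)          # <title>The Dormouse's story</title>
print(bs.title.name)     # title
print(bs.title.string)   # The Dormouse's story
```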
6.2 The bs4 object types
Object | Description |
---|---|
Tag | a tag |
NavigableString | a navigable string (the text inside a tag) |
BeautifulSoup | the soup object for the whole document |
Comment | a comment |
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# print(type(soup)) # <class 'bs4.BeautifulSoup'>
#
# print(type(soup.title)) # <class 'bs4.element.Tag'>
# print(type(soup.a)) # <class 'bs4.element.Tag'>
# print(type(soup.p)) # <class 'bs4.element.Tag'>
#
# print(soup.p.string) # The Dormouse's story
# print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>
title_tag = soup.p
print(title_tag)
print(title_tag.name)
print(title_tag.string)
html_comment = '<a><!-- this is a comment --></a>'
soup = BeautifulSoup(html_comment,'lxml')
print(soup.a.string)
print(type(soup.a.string)) # <class 'bs4.element.Comment'>
6.3 Tree traversal: child nodes
bs4 supports three kinds of operations: traversing the tree, searching it, and modifying it
- contents, children, descendants
- contents returns a list
- children returns an iterator, which can be looped over
- descendants returns a generator that walks all descendants (children, grandchildren, and so on)
- string, strings, stripped_strings
- string gets the text inside a tag
- strings returns a generator used to get the text of multiple tags
- stripped_strings works like strings, but removes the extra whitespace
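The attributes above can be contrasted in one short runnable sketch (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div><p>  hello </p><p><b>world</b></p></div>", 'html.parser')
div = soup.div

print(type(div.contents))                                       # a plain list
print([t.name for t in div.contents])                           # direct children: two p tags
print([t.name for t in div.descendants if isinstance(t, Tag)])  # p, p, and the nested b
print(list(div.strings))                                        # raw text, whitespace kept
print(list(div.stripped_strings))                               # text with whitespace stripped
```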
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# tag
# print(soup.title)
# print(soup.p)
# print(soup.p.b)
# print(soup.a)
# all_p = soup.find_all('p')
#
# print(all_p)
# use ['attr'] to read a tag's attributes
title_tag = soup.p
print(title_tag['class'])  # ['title']
# contents returns a list
# children returns an iterator, which can be looped over
# iterating means visiting elements one by one; in Python, while and for
# loops implement iteration: for ... in ...
# descendants returns a generator that walks all descendants
# contents returns a list
# links = soup.contents
# print(type(links)) # <class 'list'>
# print(links)
# children returns an iterator, which can be looped over
html = '''
<div>
<a href='#'>Baidu</a>
<a href='#'>Alibaba</a>
<a href='#'>Tencent</a>
</div>
'''
# we want the data under the div tag
soup2 = BeautifulSoup(html,'lxml')
# links = soup2.contents
# print(type(links))
# print(links)
# for i in links:
#     print(i)
# links = soup2.div.children
# print(type(links)) # <class 'list_iterator'>
# for link in links:
#     print(link)
# descendants returns a generator that walks all descendants
# print(len(soup.contents))
# print(len(soup.descendants)) # TypeError: object of type 'generator' has no len()
# for x in soup.descendants:
#     print('----------------')
#     print(x)
# string gets the text inside a tag
# strings returns a generator used to get the text of multiple tags
# stripped_strings works like strings, but removes the extra whitespace
# title_tag = soup.title
# print(title_tag)
# print(title_tag.string)
# head_tag = soup.head
# print(head_tag.string)
# print(soup.html.string)
# strings = soup.strings
# print(strings) # <generator object _all_strings at 0x000001D9053745C8>
# for s in strings:
#     print(s)
strings = soup.stripped_strings
for s in strings:
    print(s)
6.4 Tree traversal: parent nodes
- parent gets the direct parent node
- parents gets all ancestor nodes
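A tiny sketch of the two attributes (nested markup assumed for illustration); parents walks all the way up to the document object itself:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>hi</b></p></body></html>", 'html.parser')
b_tag = soup.b

print(b_tag.parent.name)                # p -- the direct parent
print([t.name for t in b_tag.parents])  # every ancestor, ending at '[document]'
```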
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# parent gets the direct parent node
# title_tag = soup.title
# print(title_tag)
# print(title_tag.parent)
# print(soup.html.parent)
# parents gets all the ancestor nodes
a_tag = soup.a
# print(a_tag)
# print(a_tag.parents) # <generator object parents at 0x0000025F937E9678>
for x in a_tag.parents:
    print(x)
    print('----------------')
6.5 Tree traversal: sibling nodes
- next_sibling : the next sibling node
- previous_sibling : the previous sibling node
- next_siblings : all following sibling nodes
- previous_siblings : all preceding sibling nodes
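A short sketch of sibling navigation. The markup is made up, and deliberately has no whitespace between the tags, so each sibling really is the next tag (with whitespace, next_sibling would be a text node first):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a>one</a><a>two</a><a>three</a></div>", 'html.parser')
first = soup.a

print(first.next_sibling)                       # <a>two</a>
print([a.string for a in first.next_siblings])  # both tags after it
last = soup.find_all('a')[-1]
print(last.previous_sibling)                    # <a>two</a>
```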
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# html = '<a><b>bbb</b><c>ccc</c></a>'
soup = BeautifulSoup(html_doc,'lxml')
#
# # print(soup.prettify())
# b_tag = soup.b
# print(b_tag)
# print(b_tag.next_sibling)
# c_tag = soup.c
# # print(c_tag.next_sibling)
# print(c_tag.previous_sibling)
a_tag = soup.a
# print(a_tag)
for x in a_tag.next_siblings:
    print(x)
6.6 Searching the tree
- string filter
- regular-expression filter: compile a pattern with re.compile() and pass it to find() or find_all() to search with a regex
- list filter
- True filter
- function filter
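Each of the five filter kinds can be passed straight to find_all(); a compact sketch against made-up markup:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>t</title><p class='x'>a</p><b>b</b>", 'html.parser')

print(soup.find_all('p'))                                  # string filter
print([t.name for t in soup.find_all(re.compile('^t'))])   # regex: names starting with t
print([t.name for t in soup.find_all(['p', 'b'])])         # list filter
print(len(soup.find_all(True)))                            # True filter: every tag
print([t.name for t in soup.find_all(lambda t: t.has_attr('class'))])  # function filter
```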
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# • string filter
# • regex filter: compile a pattern with re.compile() and pass it to find()/find_all()
# • list filter
# • True filter
# • function filter
# string filter
# a_tag2 = soup.a
# a_tags = soup.find_all('a')
# print(a_tags)
# regex filter: find every tag whose name starts with t
# print(soup.find_all(re.compile('^t')))
# list filter: find both p tags and a tags
# print(soup.find_all(['p','a']))
# print(soup.find_all(['title','b']))
# print(soup.find_all(True)) # True filter: matches every tag
# function filter: keep only tags that have a class attribute
def fn(tag):
    return tag.has_attr('class')
print(soup.find_all(fn))
6.7 Review
Method | Purpose |
---|---|
soup.prettify() | pretty-print the source |
soup.title | the whole title tag |
soup.title.name | the tag's name |
soup.title.string | the tag's text |
soup.contents | returns a list |
soup.div.children | returns an iterator |
soup.descendants | returns a generator that walks all descendants |
soup.string | gets one tag's text |
soup.strings | gets the text of all tags |
soup.stripped_strings | gets the text of all tags with extra whitespace removed |
soup.a.parent | the parent node of the a tag |
soup.a.previous_sibling | the previous sibling node |
soup.a.next_sibling | the next sibling node |
soup.a.next_siblings | all following sibling nodes |
soup.a.previous_siblings | all preceding sibling nodes |
6.8 find
Function | Purpose |
---|---|
find('tag', class_='value') | find a single tag |
find_all() | find all matching tags |
find_parents() | search all ancestors |
find_parent() | search a single parent |
find_next_siblings() | search all following siblings |
find_next_sibling() | search a single following sibling |
find_previous_siblings() | search all preceding siblings |
find_previous_sibling() | search a single preceding sibling |
find_all_next() | search all later elements |
find_next() | find a single later element |
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
name : the tag name
attrs : the tag's attributes
recursive : whether to search recursively
text : match by text content
limit : the maximum number of results to return
**kwargs : extra keyword arguments, passed by name
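How the individual parameters behave, sketched on a fragment of the example document (string= is the newer spelling of the text= parameter):

```python
from bs4 import BeautifulSoup

doc = ('<p class="story"><a class="sister" id="link1">Elsie</a>'
       '<a class="sister" id="link2">Lacie</a></p>')
soup = BeautifulSoup(doc, 'html.parser')

print(len(soup.find_all('a')))                            # name: 2 matches
print(len(soup.find_all('a', limit=1)))                   # limit: stop after 1
print(soup.find_all(id='link2')[0].string)                # **kwargs: match by attribute
print(len(soup.find_all('p', attrs={'class': 'story'})))  # attrs as a dict
print(len(soup.find_all('p', recursive=False)))           # recursive=False: top level only
print(soup.find_all(string='Elsie'))                      # string/text: match text nodes
```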
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# find_all(self, name=None, attrs={}, recursive=True, text=None,
#          limit=None, **kwargs)
# name : the tag name
# attrs : the tag's attributes
# recursive : whether to search recursively
# text : match by text content
# limit : the maximum number of results to return
# **kwargs : extra keyword arguments, passed by name
# a_tags = soup.find_all('a')
# p_tags = soup.find_all('p','title')
# print(soup.find_all(id = 'link1'))
# print(soup.find_all('a',limit=2))
# print(soup.a)
# print(soup.find('a'))
# print(soup.find_all('a',recursive=True))
# print(soup.find_all('a',limit=1)[0])
# print(soup.find('a'))
# find_parents() searches all ancestors
# find_parent() searches a single parent
# find_next_siblings() searches all following siblings
# find_next_sibling() searches a single following sibling
title_tag = soup.title
# print(title_tag.find_parent('head')) # <head><title>The Dormouse's story</title></head>
s = soup.find(text = 'Elsie')
# print(s.find_previous('p'))
# print(s.find_parents('p'))
# a_tag = soup.a
# print(a_tag)
# print(a_tag.find_next_sibling('a'))
# print(a_tag.find_next_siblings('a'))
# find_previous_siblings() searches all preceding siblings
# find_previous_sibling() searches a single preceding sibling
# find_all_next() searches all later elements
# find_next() finds a single later element
a_tag = soup.find(id='link3')
# print(a_tag)
# print(a_tag.find_previous_sibling())
# print(a_tag.find_previous_siblings())
p_tag = soup.p
# print(p_tag.find_all_next())
print(p_tag.find_next('a'))
6.9 Modifying the document tree
- modify a tag's name and attributes
- modify string: assigning to the string attribute replaces the tag's current content
- append() adds content to a tag, much like Python's list .append() method
- decompose() deletes a node; useful for removing unneeded parts of a page
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# 1. modify a tag's name and attributes
# tag_p = soup.p
# print(tag_p)
# tag_p.name = 'w' # change the tag name
# tag_p['class'] = 'content' # change an attribute
# print(tag_p)
# 2. modify string: assignment replaces the content
tag_p = soup.p
# print(tag_p.string)
# tag_p.string = '521 wo ai ni men'
# print(tag_p.string)
# 3. tag.append() adds content to a tag
# print(tag_p)
# tag_p.append('hahaha')
# print(tag_p)
# 4. decompose() removes a node from the tree
result = soup.find(class_ = 'title')
result.decompose()
print(soup)
6.10 Scraping data from a weather site
- Key points:
- find('div', class_='conMidtab') : get an element by tag name and attribute
- table.find_all('tr')[2:] : skip the first two tr rows
- enumerate returns two values: the index, and the element at that index
- besides lxml, BeautifulSoup can use other parsers, such as the built-in html.parser and html5lib
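The choice of parser matters mainly for broken markup: html5lib (pip install html5lib) repairs a page the way a browser would, which is why the scraper below uses it. A quick sketch with only the stdlib parser, so it runs without extra installs:

```python
from bs4 import BeautifulSoup

broken = "<p>one<p>two"   # unclosed tags, as often found in the wild
soup = BeautifulSoup(broken, 'html.parser')
print(soup.find_all('p')) # both p elements are still recovered
```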
import requests
from bs4 import BeautifulSoup

# parse one page of the site
def parse_page(url):
    response = requests.get(url)
    # decode manually to fix the encoding
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text, 'html5lib')  # pip install html5lib
    # parse the markup
    # 1. the div with class="conMidtab"
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)
    # 2. the tables inside it
    tables = conMidtab.find_all('table')
    # print(tables)
    for table in tables:
        # print(table)
        # 3. the tr rows, skipping the first two (header rows)
        trs = table.find_all('tr')[2:]
        # enumerate returns two values: the index, and the element at that index
        for index, tr in enumerate(trs):
            # print(tr)
            tds = tr.find_all('td')
            city_td = tds[0]  # the city cell
            if index == 0:
                # the first row of each province carries the province name in
                # the first cell, so the city sits in the second cell
                city_td = tds[1]
            # get the text of all of the tag's descendants
            city = list(city_td.stripped_strings)[0]
            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            print('city:', city, 'temperature:', temp)
            # break  # print Beijing first
    # 4. td
    # print(text)

def main():
    url = 'http://www.weather.com.cn/textFC/hb.shtml'    # North China
    # url = 'http://www.weather.com.cn/textFC/db.shtml'  # Northeast China
    url = 'http://www.weather.com.cn/textFC/gat.shtml'   # Hong Kong, Macao and Taiwan
    urls = ['http://www.weather.com.cn/textFC/hb.shtml',
            'http://www.weather.com.cn/textFC/db.shtml',
            'http://www.weather.com.cn/textFC/gat.shtml']
    for url in urls:
        parse_page(url)

if __name__ == '__main__':
    main()