6. Using BeautifulSoup

  • Beautiful Soup is a web-page data-extraction library: it can pull data out of HTML and XML documents

6.1 Basic Usage

Method                             Function
BeautifulSoup(html_doc, 'lxml')    build a BeautifulSoup object
bs.prettify()                      pretty-print the document
bs.title                           get a tag by its name (here, title)
bs.title.name                      get the tag's name
bs.title.string                    get the text inside the tag
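The accessors above can be sketched on a tiny document. This is a minimal example; it uses Python's built-in html.parser instead of 'lxml' so nothing extra needs to be installed, but both behave the same here:

```python
from bs4 import BeautifulSoup

# A tiny document to exercise the basic accessors
html_doc = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>"

# html.parser is the parser bundled with Python; the notes use 'lxml',
# which works identically here but must be installed separately
bs = BeautifulSoup(html_doc, 'html.parser')

print(bs.prettify())    # the whole document, indented one tag per line
print(bs.title)         # <title>Hello</title>
print(bs.title.name)    # title
print(bs.title.string)  # Hello
```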

6.2 Object Types in bs4

Object            Kind
Tag               a tag
NavigableString   a navigable string
BeautifulSoup     the soup object itself
Comment           a comment
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# print(type(soup))          # <class 'bs4.BeautifulSoup'>

# print(type(soup.title))    # <class 'bs4.element.Tag'>
# print(type(soup.a))        # <class 'bs4.element.Tag'>
# print(type(soup.p))        # <class 'bs4.element.Tag'>

# print(soup.p.string)       # The Dormouse's story
# print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>

p_tag = soup.p
print(p_tag)         # <p class="title"><b>The Dormouse's story</b></p>
print(p_tag.name)    # p
print(p_tag.string)  # The Dormouse's story

html_comment = '<a><!-- this is a comment --></a>'
soup = BeautifulSoup(html_comment, 'lxml')
print(soup.a.string)        # the comment text
print(type(soup.a.string))  # <class 'bs4.element.Comment'>

6.3 Traversing the Tree: Child Nodes

bs supports three kinds of operations: traversing the tree, searching it, and modifying it.

  • contents / children / descendants

    • contents returns a list

    • children returns an iterator, which you can loop over

    • descendants returns a generator that walks every descendant

  • .string / .strings / .stripped_strings

    • string gets the text inside a single tag

    • strings returns a generator used to get the text of multiple tags

    • stripped_strings works like strings, but strips the extra whitespace

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc,'lxml')

# Tag access: the first matching tag by name
# print(soup.title)
# print(soup.p)
# print(soup.p.b)
# print(soup.a)

# all_p = soup.find_all('p')
# print(all_p)

# Use [] to read a tag's attributes
title_tag = soup.p
print(title_tag['class'])  # ['title']

# contents returns a list

# children returns an iterator; iteration means visiting elements one
# by one in a loop, which Python does with for ... in ...

# descendants returns a generator that walks every descendant

html = '''
<div>
<a href='#'>百度</a>
<a href='#'>阿里</a>
<a href='#'>騰訊</a>
</div>
'''
# We want the data under the div tag
soup2 = BeautifulSoup(html,'lxml')

# contents returns a list
# links = soup2.contents
# print(type(links))  # <class 'list'>
# print(links)

# for i in links:
#     print(i)

# children returns an iterator
# links = soup2.div.children
# print(type(links))  # <class 'list_iterator'>
# for link in links:
#     print(link)

# descendants returns a generator that walks every descendant
# print(len(soup.contents))
# print(len(soup.descendants))  # TypeError: object of type 'generator' has no len()
# for x in soup.descendants:
#     print('----------------')
#     print(x)

# string gets the text inside a single tag
# strings returns a generator over the text of multiple tags
# stripped_strings works like strings, but strips the extra whitespace

# title_tag = soup.title
# print(title_tag)
# print(title_tag.string)
#
# head_tag = soup.head
# print(head_tag.string)
#
# print(soup.html.string)

# strings = soup.strings


# print(strings) # <generator object _all_strings at 0x000001D9053745C8>

# for s in strings:
#     print(s)

strings = soup.stripped_strings

for s in strings:
    print(s)

6.4 Traversing the Tree: Parent Nodes

  • parent gets the direct parent node

  • parents gets all ancestor nodes

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc,'lxml')
# parent returns the direct parent node
# title_tag = soup.title
# print(title_tag)
# print(title_tag.parent)

# print(soup.html.parent)

# parents yields every ancestor, one level at a time
a_tag = soup.a
# print(a_tag)
# print(a_tag.parents) # <generator object parents at 0x0000025F937E9678>

for x in a_tag.parents:
    print(x)
    print('----------------')

6.5 Traversing the Tree: Sibling Nodes

  • next_sibling: the next sibling node

  • previous_sibling: the previous sibling node

  • next_siblings: all following sibling nodes

  • previous_siblings: all preceding sibling nodes

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# html = '<a><b>bbb</b><c>ccc</c></a>'
soup = BeautifulSoup(html_doc,'lxml')

# print(soup.prettify())
# b_tag = soup.b
# print(b_tag)
# print(b_tag.next_sibling)
# c_tag = soup.c
# print(c_tag.next_sibling)
# print(c_tag.previous_sibling)

a_tag = soup.a
# print(a_tag)

for x in a_tag.next_siblings:
    print(x)

6.6 Searching the Tree

  • string filter

  • regular-expression filter: compile a pattern with re.compile and pass it to find or find_all to search with a regex

  • list filter

  • True filter

  • function filter

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


soup = BeautifulSoup(html_doc,'lxml')


# String filter
# a_tags = soup.find_all('a')
# print(a_tags)

# Regex filter: find every tag whose name contains 't'
# print(soup.find_all(re.compile('t')))

# List filter: find both p tags and a tags
# print(soup.find_all(['p','a']))
# print(soup.find_all(['title','b']))

# True filter: matches every tag
# print(soup.find_all(True))

# Function filter: keep tags that have a class attribute
def fn(tag):
    return tag.has_attr('class')

print(soup.find_all(fn))

6.7 Review

Method                     Function
soup.prettify()            pretty-print the source
soup.title                 the entire title tag
soup.title.name            the tag's name
soup.title.string          the tag's text
soup.contents              returns a list
soup.div.children          returns an iterator
soup.descendants           returns a generator over all descendants
soup.string                the text of a single tag
soup.strings               the text of every tag (generator)
soup.stripped_strings      every string, with extra whitespace stripped
soup.a.parent              the a tag's parent node
soup.a.previous_sibling    the previous sibling node
soup.a.next_sibling        the next sibling node
soup.a.next_siblings       all following sibling nodes
soup.a.previous_siblings   all preceding sibling nodes
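A few rows of this table can be exercised together on a tiny tree. This is a minimal sketch using the built-in html.parser rather than lxml:

```python
from bs4 import BeautifulSoup

# A tiny two-paragraph tree; the second <p> has padding spaces so that
# stripped_strings has something to strip
html = "<div><p>one</p><p>  two  </p></div>"
soup = BeautifulSoup(html, 'html.parser')

print(type(soup.div.contents))      # contents is a plain list
print(soup.p.string)                # one
print(soup.p.next_sibling)          # the second <p> tag
print(list(soup.stripped_strings))  # ['one', 'two']
print(soup.p.parent.name)           # div
```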

6.8 find

Function                      Purpose
find('tag', class_='value')   find a single tag
find_all()                    find all matching tags
find_parents()                search all ancestors
find_parent()                 search a single parent
find_next_siblings()          search all following siblings
find_next_sibling()           search a single following sibling
find_previous_siblings()      search all preceding siblings
find_previous_sibling()       search a single preceding sibling
find_all_next()               search all elements after this one
find_next()                   search the single next element

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
  name: the tag name
  attrs: the tag's attributes
  recursive: whether to search recursively
  text: text content to match
  limit: cap on the number of results
  **kwargs: extra filters passed as keyword arguments
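The main find_all parameters can be sketched on a small made-up fragment (minimal example using the built-in html.parser):

```python
from bs4 import BeautifulSoup

# A hypothetical three-paragraph fragment to show each parameter
html = '''<p class="a" id="x">one</p>
<p class="b">two</p>
<p class="b">three</p>'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('p', limit=2))     # name + limit: at most 2 results
print(soup.find_all('p', class_='b'))  # attribute filter (note: class_, not class)
print(soup.find_all(id='x'))           # **kwargs keyword filter
print(soup.find('p'))                  # find returns only the first match
```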
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


soup = BeautifulSoup(html_doc,'lxml')

# find_all(self, name=None, attrs={}, recursive=True, text=None,
#          limit=None, **kwargs)
# name: the tag name
# attrs: the tag's attributes
# recursive: whether to search recursively
# text: text content to match
# limit: cap on the number of results
# **kwargs: extra filters passed as keyword arguments
# a_tags = soup.find_all('a')

# p_tags = soup.find_all('p','title')



# print(soup.find_all(id = 'link1'))
# print(soup.find_all('a',limit=2))
# print(soup.a)
# print(soup.find('a'))

# print(soup.find_all('a',recursive=True))

# print(soup.find_all('a',limit=1)[0])
# print(soup.find('a'))

# find_parents() searches all ancestors
# find_parent() searches a single parent
# find_next_siblings() searches all following siblings
# find_next_sibling() searches a single following sibling

title_tag = soup.title

# print(title_tag.find_parent('head')) # <head><title>The Dormouse's story</title></head>

s = soup.find(text = 'Elsie')

# print(s.find_previous('p'))
# print(s.find_parents('p'))

# a_tag = soup.a
#
# # print(a_tag)
# #
# # print(a_tag.find_next_sibling('a'))
#
# print(a_tag.find_next_siblings('a'))


# find_previous_siblings() searches all preceding siblings
# find_previous_sibling() searches a single preceding sibling
# find_all_next() searches all elements after this one
# find_next() searches the single next element

a_tag = soup.find(id='link3')

# print(a_tag)

# print(a_tag.find_previous_sibling())

# print(a_tag.find_previous_siblings())

p_tag = soup.p

# print(p_tag.find_all_next())

print(p_tag.find_next('a'))

6.9 Modifying the Document Tree

  • Modify a tag's name and attributes

  • Modify string: assigning to the string attribute replaces the original content with the new one

  • append(): add content to a tag, much like Python's list .append() method

  • decompose(): delete a tag from the tree, handy for removing unwanted sections

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc,'lxml')

# 1. Modify a tag's name and attributes

# tag_p = soup.p
# print(tag_p)

# tag_p.name = 'w'            # change the tag's name
# tag_p['class'] = 'content'  # change an attribute

# print(tag_p)


# 2. Modify string: assignment replaces the tag's content
tag_p = soup.p
# print(tag_p.string)

# tag_p.string = '521 wo ai ni men'

# print(tag_p.string)


# 3. tag.append(): add content to a tag

# print(tag_p)
# tag_p.append('hahaha')
# print(tag_p)

# 4. decompose(): delete a tag from the tree

result = soup.find(class_ = 'title')

result.decompose()

print(soup)

6.10 Scraping Data from a Weather Site

  • Key points
    • find('div', class_='conMidtab'): select data by tag attribute
    • table.find_all('tr')[2:]: skip the first two tr rows
    • enumerate returns two values: the index and the element at that index
    • BeautifulSoup supports several parsers; html5lib (used below) is more tolerant of malformed HTML than lxml
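The enumerate behaviour relied on below can be seen on its own. A plain-Python sketch (the row labels are made up for illustration):

```python
# enumerate yields (index, element) pairs; the scraper uses index == 0
# to treat each table's first data row specially
rows = ['capital-row', 'city-row', 'city-row']  # hypothetical row labels
for index, row in enumerate(rows):
    if index == 0:
        print(index, row, '(first row: special case)')
    else:
        print(index, row)

print(list(enumerate(['a', 'b'])))  # [(0, 'a'), (1, 'b')]
```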
import requests

from bs4 import BeautifulSoup

# Parse one page of the weather site
def parse_page(url):

    response = requests.get(url)
    # Decode manually to avoid mojibake
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text, 'html5lib')  # pip install html5lib
    # Page structure:
    # 1. the div with class="conMidtab"
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)
    # 2. the table elements inside it
    tables = conMidtab.find_all('table')
    # print(tables)

    for table in tables:
        # print(table)
        # 3. the tr rows, skipping the first two header rows
        trs = table.find_all('tr')[2:]
        # enumerate returns two values: the index and the element at that index
        for index, tr in enumerate(trs):
            # print(tr)
            tds = tr.find_all('td')

            city_td = tds[0]  # city

            if index == 0:
                city_td = tds[1]  # first row: td[1] is the provincial capital

            # Get the text of all descendant nodes of the tag
            city = list(city_td.stripped_strings)[0]

            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            print('City:', city, 'Temperature:', temp)
        # break  # print only the first table while debugging

    # 4. the td cells

    # print(text)


def main():

    # url = 'http://www.weather.com.cn/textFC/hb.shtml'   # North China
    # url = 'http://www.weather.com.cn/textFC/db.shtml'   # Northeast China
    # url = 'http://www.weather.com.cn/textFC/gat.shtml'  # Hong Kong, Macau and Taiwan

    urls = ['http://www.weather.com.cn/textFC/hb.shtml',
            'http://www.weather.com.cn/textFC/db.shtml',
            'http://www.weather.com.cn/textFC/gat.shtml']

    for url in urls:
        parse_page(url)


if __name__ == '__main__':
    main()