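In this lesson we look at what the BeautifulSoup library is.

Purpose: parsing data out of web pages.

BeautifulSoup4 converts a complex HTML document into a tree structure.

Every node in the tree is a Python object, and every object is one of the following four kinds: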

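-Tag: a tag together with its content (for example <title>百度一下，你就知道</title>, where title is the tag); attribute access such as bs.title returns only the first matching tag

-NavigableString: the text inside a tag, as a string (in the example above, 百度一下，你就知道 is the content)

-BeautifulSoup: the whole document (used to call methods on the document as a whole)

-Comment: a special kind of NavigableString; when printed, its text does not include the comment markers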
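To make the four kinds concrete, here is a minimal sketch that builds each one from a tiny inline snippet. The snippet is hypothetical and only stands in for the saved page; the title text is the one quoted above, while the link and its comment are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet (not the real baidu.html): one title, one link holding only a comment
html = (
    '<html><head><title>百度一下，你就知道</title></head>'
    '<body><a class="mnav" href="#"><!--news--></a></body></html>'
)

bs = BeautifulSoup(html, "html.parser")

title = bs.title        # Tag: the first <title> element
text = bs.title.string  # NavigableString: the text inside that tag
note = bs.a.string      # Comment: the only child of <a> is a comment

print(type(bs).__name__, bs.name)   # the document object and its special name
print(type(title).__name__, title)  # the tag, printed with its markup
print(type(text).__name__, text)    # just the text
print(type(note).__name__, note)    # the comment text, without <!-- and -->
```

Because Comment is a subclass of NavigableString, it is worth checking the type before treating the result of .string as visible page text.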

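from bs4 import BeautifulSoup

# Read the saved page as bytes; the with block closes the file afterwards
with open("./baidu.html", 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # parse the HTML document with the html.parser parser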

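# 1. Tag: a tag and its content (for example <title>百度一下，你就知道</title>, where title is the tag)
# 2. NavigableString: the text inside a tag, as a string (for example 百度一下，你就知道 is the content)
# Get the first matching tag together with its content (good to know)
print(bs.title)
print(bs.a)
print(bs.head)

# Get only the text of the first matching tag (the tag itself is not printed)
print(bs.title.string)    # a NavigableString

# Get all attributes of the first matching tag (the tag itself is not printed)
print(bs.title.attrs) # a dict

# 3. BeautifulSoup: the whole document (used to call methods on the document as a whole)
print(bs)
print(bs.name)      # the document object's name is "[document]"
print(bs.a.string)  # the string inside the first <a> tag, or None if it holds more than one child node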

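# ----------------------------------------------------------------------(use these in practice)

# Traversing the document tree (search online for more)

# .contents: all direct children of a Tag, returned as a list
# .children: all direct children of a Tag, returned as an iterator
# .descendants: all descendants of a Tag (children, grandchildren, and so on)


print(bs.head)
print('-----------------------------')
print(bs.head.contents) # .contents is an attribute; it returns the children of head as a list

# Searching the document
# 1. find_all(): find content using a string, a regular expression, or a custom function
# 2. kwargs: keyword arguments that filter on attributes
# 3. text: filter on text (the argument is named string in current bs4 releases)
# 4. limit: cap the number of results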



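#-------------------------------------------------------------------1.find_all()
# find_all('string'): string filter; returns the tags whose name matches the string exactly
t_list = bs.find_all('a')   # all <a> tags
print(t_list)


# Regular-expression filter: each tag name is tested with the pattern's search() method
import re
t_list = bs.find_all(re.compile("a"))   # every tag whose name contains "a", such as a or head
print(t_list)


# Function filter: pass in a function; tags for which it returns True are kept (good to know)
def name_is_exists(tag):
    return tag.has_attr('name') # True for tags that have a name attribute

t_list = bs.find_all(name_is_exists)

for i in t_list:
    print(i)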

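#-----------------------------------------------------------------------2.kwargs

t_list = bs.find_all(id="head")     # tags whose id is "head"

t_list1 = bs.find_all(class_=True)  # tags that have a class attribute; class_ avoids the Python keyword class

t_list2 = bs.find_all(id="head")    # same query as t_list

for item in t_list:
    print(item)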
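#-----------------------------------------------------------------------3.text argument (now named string)
# bs4 4.4.0 renamed text to string; text is the older spelling of the same filter
t_list = bs.find_all(string="hao123")   # strings equal to "hao123"

t_list1 = bs.find_all(string=["hao123",'地圖'])   # strings equal to any item in the list

t_list2 = bs.find_all(string=re.compile(r"\d"))    # regular expression: strings that contain a digit
for item in t_list:
    print(item)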

#-----------------------------------------------------------------------4.limit

t_list = bs.find_all("a",limit=3)  # cap the number of results: at most the first 3 <a> tags

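# CSS selectors

# By tag name: bs.select('tag')
print(bs.select('title'))

# By class name: bs.select('.classname')
print(bs.select('.mnav'))

# By id: bs.select('#id')
print(bs.select('#u1'))

# By tag and attribute: bs.select('tag[attribute]')
print(bs.select("a[class='bri']"))

# By child relationship: bs.select('tag>child')
print(bs.select("head>title"))

# By sibling relationship: bs.select('.classname ~ .siblingclass')
print(bs.select(".mnav ~ .bri"))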
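The calls above depend on a saved baidu.html that is not part of this post, so their output cannot be shown here. As a rough sketch, the same filters can be tried on a small inline snippet. The markup below is hypothetical: only the names mnav, bri, u1 and hao123 come from the lesson, and everything else (the hrefs, the name attribute, the link texts) is assumed for illustration. It also assumes a bs4 release recent enough to accept the string argument and to ship select().

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class and id names from the lesson
html = """
<html><head><title>demo</title></head>
<body>
<div id="u1">
<a class="mnav" href="/news" name="tj_news">news</a>
<a class="mnav" href="/hao123">hao123</a>
<a class="mnav" href="/map">map 2</a>
<a class="bri" href="/more">more</a>
</div>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

# find_all() with the filter kinds covered in the lesson
links = bs.find_all("a")                               # exact tag name
first_two = bs.find_all("a", limit=2)                  # stop after two matches
by_regex = bs.find_all(re.compile("^h"))               # tag names that start with "h"
named = bs.find_all(lambda tag: tag.has_attr("name"))  # function filter
by_id = bs.find_all(id="u1")                           # keyword argument on an attribute
with_digit = bs.find_all(string=re.compile(r"\d"))     # strings that contain a digit

# select() with CSS selectors
by_class = bs.select(".mnav")
by_attr = bs.select("a[class='bri']")
child = bs.select("head > title")
siblings = bs.select(".mnav ~ .bri")

print(len(links), len(first_two))
print([tag.name for tag in by_regex])
print([str(s) for s in with_digit])
print([tag.get_text() for tag in siblings])
```

Note that find_all(string=...) returns the matching strings themselves, while the other calls return tags.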

Did you get all of that?
