Summary of Common beautifulsoup4 Parsing Methods for Python Web Scraping

Original

Lee_Tech

2019-02-25 23:03


Abstract

How to use beautifulsoup4 to parse web pages in a variety of situations.

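Using beautifulsoup4

The official documentation already covers beautifulsoup4 in detail. Here I simply summarize the parsing methods I use most often, for quick reference.

Loading an HTML document

The first step in using beautifulsoup is to load the HTML document into it, which produces a BeautifulSoup object.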

import requests
from bs4 import BeautifulSoup
url = "http://new.qq.com/omn/20180705/20180705A0920X.html"
r = requests.get(url)
htmls = r.text
#print(htmls)
soup = BeautifulSoup(htmls, 'html.parser')

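When initializing the BeautifulSoup class, pass two arguments: the first is the HTML source we fetched, and the second is the name of the HTML parser. Three parsers are commonly used: "html.parser", "lxml", and "html5lib". The official documentation recommends lxml for its speed, though it has to be installed first with pip install lxml.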
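
Since lxml is an optional install, a script may want to fall back to the built-in parser when it is missing. The following is only a minimal sketch of that idea: the helper name make_soup and its parsers argument are my own, not part of beautifulsoup4. It relies on the FeatureNotFound exception that bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("lxml", "html.parser")):
    """Parse html with the first parser in `parsers` that is installed.

    Hypothetical helper: returns (soup, name_of_parser_used).
    """
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


page = "<html><head><title>demo</title></head><body><p>hi</p></body></html>"
soup, used = make_soup(page)
print(used, soup.title.string)
```

Because "html.parser" ships with Python, putting it last in the tuple means the helper should always find something to use.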

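Note that the three parsers can produce different objects in some cases, for example when a tag is incomplete (here only the closing half of a p tag is present):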

soup = BeautifulSoup("<a></p>", "html.parser")
# A lone start tag is closed automatically; a lone end tag is ignored
# Result: <a></a>
soup = BeautifulSoup("<a></p>", "lxml")
# Result: <html><body><a></a></body></html>
soup = BeautifulSoup("<a></p>", "html5lib")
# html5lib completes any tag that appears only by half
# Result: <html><head></head><body><a><p></p></a></body></html>
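
To check the difference on your own machine, something like the sketch below can be used. It simply runs the same fragment through each parser and skips any that is not installed; the variable names are mine.

```python
from bs4 import BeautifulSoup, FeatureNotFound

fragment = "<a></p>"
results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(fragment, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed here

for parser, markup in results.items():
    shown = markup if markup is not None else "not installed"
    print(f"{parser:12} -> {shown}")
```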

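Usage

I have tried to order the methods below by how often I use them, since this is meant as a quick reference.

Getting a single tag by tag name, id, class, and similar information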

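import re

html = '<p class="title" id="p1"><b>The Dormouses story</b></p>'
soup = BeautifulSoup(html, 'lxml')
# Get the whole tag (with its contents) whose class is "title"
soup.find(class_="title")
# Or filter on tag name, class, and id together
soup.find("p", class_="title", id="p1")
# Get the text of the p tag with class "title": "The Dormouses story"
soup.find(class_="title").get_text()
# get_text() accepts a separator to put between the strings of different tags, and strip=True trims surrounding whitespace.
soup = BeautifulSoup('<p class="title" id="p1"><b> The Dormouses story </b></p><p class="title" id="p1"><b>The Dormouses story</b></p>', "html5lib")
# Called on the whole document so that the strings of both p tags are joined (find() alone would return only the first p)
soup.get_text("|", strip=True)
# Result: The Dormouses story|The Dormouses story
# Get the id of the p tag with class "title"
soup.find(class_="title").get("id")
# Match the class name against a regular expression
soup.find_all(class_=re.compile("tit"))
# recursive=False restricts the search to direct children of the current tag
soup = BeautifulSoup('<html><head><title>abc','lxml')
soup.html.find_all("title", recursive=False)
# Result: [] because <title> is a child of <head>, not a direct child of <html>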
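
The calls above can be tried together on one small document. This is just an illustrative sketch: the HTML string is made up, and it uses the built-in "html.parser" so nothing extra needs to be installed.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<div id="box">'
    '<p class="title" id="p1"><b> First story </b><i>draft</i></p>'
    '<p class="subtitle" id="p2">Second story</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p", class_="title")
first_id = first.get("id")                      # attribute lookup
joined = first.get_text("|", strip=True)        # strings joined with "|", whitespace stripped
# The regex is applied with search(), so "title" also matches "subtitle".
regex_ids = [p["id"] for p in soup.find_all(class_=re.compile("title"))]
# recursive=False only looks at direct children; here the only one is the div.
top_level_b = soup.find_all("b", recursive=False)

print(first_id, joined, regex_ids, len(top_level_b))
```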

Getting multiple tags by tag name, id, class, and similar information

soup = BeautifulSoup('<p class="title" id="p1"><b> The like story </b></p><p class="title" id="p1"><b>The Dormouses story</b></p>', "html5lib")
# Get every tag whose class is "title"
for i in soup.find_all(class_="title"):
    print(i.get_text())
# Get at most a given number of tags whose class is "title"
for i in soup.find_all(class_="title", limit=2):
    print(i.get_text())
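
One detail worth seeing in action: find_all() always returns a list-like result, which is empty when nothing matches, whereas find() returns None. A small sketch with made-up HTML, using the built-in "html.parser":

```python
from bs4 import BeautifulSoup

html = "".join(f'<p class="title">Story {n}</p>' for n in range(1, 5))
soup = BeautifulSoup(html, "html.parser")

all_titles = [p.get_text() for p in soup.find_all(class_="title")]
first_two = [p.get_text() for p in soup.find_all(class_="title", limit=2)]
missing = soup.find_all(class_="no-such-class")  # empty result, not None

print(all_titles)
print(first_two)
print(len(missing))
```

So a for loop over find_all() is safe even when the page contains no matching tag.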

Getting a tag by its other attributes

html = '<a alog-action="qb-ask-uname" href="/usercent" rel="external nofollow" target="_blank">蝸牛宋</a>'
soup = BeautifulSoup(html, 'lxml')
# Get "蝸牛宋". The tag has neither a class nor an id, so the lookup has to be based on its other attributes
author = soup.find('a', {"alog-action": "qb-ask-uname"}).get_text()
# Or
author = soup.find(attrs={"alog-action": "qb-ask-uname"}).get_text()
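
The attrs dictionary is needed here because a hyphenated name such as alog-action cannot be written as a Python keyword argument. The sketch below, with made-up HTML and the built-in "html.parser", shows that together with the href=True form, which matches any tag that has the attribute at all.

```python
from bs4 import BeautifulSoup

html = (
    '<a alog-action="qb-ask-uname" href="/usercent">user one</a>'
    '<a data-role="tag" href="/tag/python">python</a>'
    '<a id="anchor-only">no link</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Hyphenated attribute names must go through attrs=.
author = soup.find("a", attrs={"alog-action": "qb-ask-uname"}).get_text()
tag_link = soup.find(attrs={"data-role": "tag"})["href"]
# href=True matches every <a> that has an href attribute, whatever its value.
with_href = [a["href"] for a in soup.find_all("a", href=True)]

print(author, tag_link, with_href)
```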

Finding preceding and following tags

soup.find_all_previous("p")
soup.find_previous("p")
soup.find_all_next("p")
soup.find_next("p")
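
In practice these methods are usually called on a tag you have already located, because nothing comes before the BeautifulSoup object itself. A short sketch with made-up HTML, using the built-in "html.parser":

```python
from bs4 import BeautifulSoup

html = (
    '<p id="intro">Intro</p>'
    '<div><p id="middle">Middle</p></div>'
    '<p id="outro">Outro</p>'
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="middle")

# Search backwards and forwards in document order, starting from `middle`.
prev_ids = [p["id"] for p in middle.find_all_previous("p")]
next_ids = [p["id"] for p in middle.find_all_next("p")]
nearest_prev = middle.find_previous("p")["id"]
nearest_next = middle.find_next("p")["id"]

print(prev_ids, next_ids, nearest_prev, nearest_next)
print(soup.find_previous("p"))  # None: nothing precedes the document root
```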

Finding parent tags

soup.find_parents("div")
soup.find_parent("div")
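
As with the previous group, these are normally called on a tag found earlier. The sketch below (made-up HTML, built-in "html.parser") shows that find_parent() returns the nearest matching ancestor while find_parents() returns all of them, nearest first.

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><div id="inner"><p><a id="link">x</a></p></div></div>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find(id="link")

nearest_div = link.find_parent("div")["id"]
all_divs = [d["id"] for d in link.find_parents("div")]
no_table = link.find_parent("table")  # None when no ancestor matches

print(nearest_div, all_divs, no_table)
```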

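CSS selectors

soup.select("title")  # by tag name
soup.select("html head title")  # by nested tag names (descendant selector)
soup.select("p > a")  # a tags that are direct children of a p tag
soup.select("p > #link1")  # direct child of a p tag, matched by id
soup.select("#link1 ~ .sister")  # all following siblings of #link1 with class sister
soup.select("#link1 + .sister")  # only the sibling immediately after #link1, if it has class sister
soup.select(".sister")  # by class name
soup.select("#sister")  # by id
soup.select('a[href="http://example.com/elsie"]')  # by attribute value
soup.select('a[href$="tillie"]')  # attribute value ends with "tillie"
soup.select_one(".sister")  # first match only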
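
The selectors above refer to ids and classes from the sample document in the official tutorial. The sketch below rebuilds a trimmed version of that document so the selectors can actually be run; the helper ids() is my own shorthand, and "html.parser" is used to avoid extra installs.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Hypothetical helper: ids of all tags matched by a CSS selector."""
    return [tag.get("id") for tag in soup.select(selector)]


print(ids("p > a"))                  # direct children of <p>
print(ids("#link1 ~ .sister"))       # all later siblings of #link1
print(ids("#link1 + .sister"))       # only the next sibling of #link1
print(ids('a[href$="tillie"]'))      # href ends with "tillie"
print(soup.select_one(".sister")["id"])
print(soup.select_one("#missing"))   # None when nothing matches
```

Note that select() returns a list and select_one() returns a single tag or None, mirroring find_all() and find().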

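A few errors you may run into. Catching them with try/except keeps the crawler process from being terminated:

UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or any other kind of UnicodeEncodeError)

The text needs to be re-encoded. This usually comes up when printing to a console or writing to a file whose encoding cannot represent the character; explicitly encoding the string as UTF-8 avoids it.

AttributeError: 'NoneType' object has no attribute 'foo'

The attribute does not exist. This typically means find() matched nothing and returned None, and an attribute or method was then accessed on that None.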
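
A rough sketch of how both errors might be guarded against in a crawler loop. The helper text_or_default is hypothetical, the HTML is made up, and the built-in "html.parser" is used.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title">café 蝸牛</p>', "html.parser")


def text_or_default(tag, default=""):
    """Hypothetical helper: the tag's text, or `default` if find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


# find() returns None when nothing matches, so calling a method on the
# result raises AttributeError; catching it lets the crawl continue.
try:
    author = soup.find(class_="author").get_text()
except AttributeError:
    author = None

title = text_or_default(soup.find(class_="title"))
subtitle = text_or_default(soup.find(class_="subtitle"), default="(no subtitle)")

# Encode explicitly rather than relying on the console or platform default,
# which is where UnicodeEncodeError tends to come from.
encoded = title.encode("utf-8")

print(author, subtitle, encoded)
```

Checking for None up front, as text_or_default does, is often tidier than try/except when the missing tag is an expected case rather than an error.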

That is all for now. It should cover most page structures.

