Python Web Scraping: The BeautifulSoup Library in Detail

Original

2020-02-21 12:39

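Python Web Scraping (Part 4)

Notes and takeaways from learning web scraping with Python, organized so that I can look things up later and, I hope, discuss them with others.

The BeautifulSoup Library in Detail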

1. Installing the BeautifulSoup Library

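pip can install it directly. If pip is not yet installed on your machine, follow the tutorial for your platform:
Linux: installation tutorial.
Windows: installation tutorial.
macOS: run this in the terminal: sudo easy_install pip. (On recent Python versions, where easy_install is no longer shipped, python3 -m ensurepip --upgrade does the same job.)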

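With pip in place, BeautifulSoup can be installed.
Run: pip install bs4.
This works because BeautifulSoup lives in the bs4 module; the package is published on PyPI as beautifulsoup4, so pip install beautifulsoup4 is equivalent.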
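
A quick way to confirm that the installation worked is to import the package and parse a tiny document. This is only a minimal sketch: the version string it prints depends on whichever release pip installed, and the one-line HTML snippet is made up for the check.

```python
# Confirm that the bs4 package is importable after installation.
import bs4
from bs4 import BeautifulSoup

# __version__ is a plain string; its value depends on the installed release.
print(bs4.__version__)

# Parsing a tiny document shows that the built-in "html.parser" backend works.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package for a different Python interpreter than the one running the script.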

2. Importing the BeautifulSoup Library

First, import BeautifulSoup in Python:

from bs4 import BeautifulSoup

If you are new to the library, it helps to see which attributes and methods BeautifulSoup offers. dir() lists them:

print(dir(BeautifulSoup))

The output lists the methods available on BeautifulSoup.
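
The raw dir() output is long because it includes private and double-underscore names. The sketch below filters those out; the variable name public_names is chosen here for illustration only.

```python
from bs4 import BeautifulSoup

# dir() also lists private and dunder names; keep only the public ones.
public_names = [name for name in dir(BeautifulSoup) if not name.startswith("_")]
print(len(public_names))

# Check for a few of the methods used later in this article.
for name in ("find", "find_all", "prettify"):
    print(name, name in public_names)
```

Once a name looks interesting, help(BeautifulSoup.find_all) prints its docstring in the same session.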

# Download a page and hand its raw bytes to BeautifulSoup for parsing.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.baidu.com")
n = r.content
m = BeautifulSoup(n, "html.parser")

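BeautifulSoup's main job is to pull the data we need out of a web page. It parses the HTML into a tree of Python objects: tags can be reached as attributes, a tag's HTML attributes read like a dictionary, and search results behave like a list. Compared with regular expressions, this greatly simplifies processing. The code above shows the conversion of an HTML document into such an object (m).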
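
To see the dictionary-like and list-like behavior without a network connection, the following sketch parses a small hand-written page. The html_doc string and the example.com URLs are made up for illustration; they stand in for the bytes that r.content would hold.

```python
from bs4 import BeautifulSoup

# A small hand-written page stands in for r.content (illustrative only).
html_doc = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <a href="http://example.com/a" id="first">A</a>
    <a href="http://example.com/b">B</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Tags are reached as attributes of the parsed tree.
print(soup.title.string)

# A tag's HTML attributes are exposed as a dictionary.
first_link = soup.a
print(first_link.attrs)

# find_all returns a list-like collection of tags.
links = soup.find_all("a")
print(len(links))
```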

3. Common Methods (Used with the requests Library)

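import requests
from bs4 import BeautifulSoup

# Download the page and parse the raw bytes with Python's built-in parser.
r = requests.get("https://www.baidu.com")
n = r.content
m = BeautifulSoup(n, "html.parser")

print(m.prettify())      # indented view of the whole document
print(m.head.title)      # the <title> tag inside <head>
print(m.p)               # the first <p> tag, or None if there is none

# Every <p> tag in the document; each item is a Tag object, not a str.
for i in m.find_all("p"):
    print(i)
    print(type(i))

# Convert the first <p> tag to a plain string (assumes at least one exists).
st = m.find_all("p")[0]
st = str(st)
print(st)
print(type(st))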

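html.parser (argument): names the parser explicitly. If it is omitted, BeautifulSoup picks the best parser installed on the machine and prints a warning, so the same code can behave differently in different environments.
prettify: returns the document with each tag and string on its own line, indented by nesting depth. Printing the object directly gives the markup without that added indentation.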
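
The difference between the two forms of output can be reproduced offline. This is a minimal sketch on a made-up one-line document; the names compact and pretty are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# A one-line document, so any line breaks in the output come from prettify().
html_doc = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

compact = str(soup)        # markup as parsed, with no added line breaks
pretty = soup.prettify()   # tags and strings on separate, indented lines

print(compact)
print(pretty)
```

prettify() is meant for reading the structure by eye. Because it inserts whitespace, its output is better not fed back into further processing.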
title: gets the title tag from the source. You can qualify it with a location first; for example, to get the title inside head:

print(m.head.title)

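title.name: gets the name of the title tag.
title.string: gets the text inside title.
title.parent.name: gets the name of title's parent tag. (title.parent.string would instead return the parent's text, or None when the parent has more than one child.)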
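
The three title lookups can be tried on a small inline document. This is a sketch under the assumption that the page has a head containing a title; the html_doc string is made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.head.title
print(title)              # the whole tag, markup included
print(title.name)         # the tag's name
print(title.string)       # the text inside the tag
print(title.parent.name)  # the name of the enclosing tag
```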

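p: gets the p tag from the source. When there is more than one, the first is returned.
- p.name: gets the name of the p tag.
- p.string: gets the text inside p.
- p.parent.name: gets the name of p's parent tag.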
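
One detail worth seeing in practice is that .string only returns text when the tag has a single child. The sketch below uses a made-up fragment in which the first p contains a nested b tag; get_text() is shown as the usual alternative in that case.

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="intro"><p>First <b>bold</b> paragraph</p><p>Second paragraph</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                   # only the first <p> is returned
print(first_p.name)
print(first_p.parent.name)
print(first_p.string)              # None: this <p> has several children
print(first_p.get_text())          # text of the tag and everything inside it

second_p = soup.find_all("p")[1]
print(second_p.string)             # a single text child, so .string works
```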
find_all(): gets every occurrence of a given tag in the source. For example, the code below finds all p tags:

  for i in m.find_all("p"):
    print(i)

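A single call can also search for several tag names at once by passing them as a list, for example find_all(["p", "a", "title"]). Passing them as separate positional arguments does not work, because the second positional argument is interpreted as an attribute filter. The value returned by find_all is a ResultSet, which is a subclass of list, so it can be indexed like one. For example, to get the second p tag in the source (index 1, since counting starts at 0):

print(m.find_all("p")[1])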
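
Both points, the list-like result and the multi-name search, can be checked on an inline document. This is a minimal sketch; the html_doc string is made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>Demo</title></head>
<body>
<p>one</p><p>two</p><p>three</p>
<a href="http://example.com/">link</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

paragraphs = soup.find_all("p")
print(isinstance(paragraphs, list))   # ResultSet is a subclass of list
print(paragraphs[1].string)           # index 1 is the second <p>

# A list of names matches any of them, in document order.
mixed = soup.find_all(["p", "a", "title"])
print([tag.name for tag in mixed])
```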

So what type are the items that find_all returns? Printing type(i) shows <class 'bs4.element.Tag'>: each item is a Tag object, not a string. To get a string, convert it with str().
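
The type check and the conversion can be sketched as follows, using a made-up one-line fragment. Note that str() keeps the markup, whereas get_text() returns only the text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p class='note'>Hello <b>world</b></p>", "html.parser")
tag = soup.find_all("p")[0]

print(type(tag))          # a Tag object, not a str
markup = str(tag)         # serialized markup, tags included
text = tag.get_text()     # text only, tags stripped

print(markup)
print(text)
```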

Finding attributes: the methods above only look up tags. Suppose instead we want the href attribute in the following source:

<a href="http://ir.baidu.com">About Baidu</a>

This takes the following code:

  for i in m.find_all("a"):
    print(i["href"])
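
On real pages some a tags have no href, and indexing such a tag with ["href"] raises KeyError. The sketch below shows two common ways around that, on a made-up fragment that includes one a tag without an href.

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://ir.baidu.com">About Baidu</a>
<a name="anchor-only">No link here</a>
<a href="http://example.com/help">Help</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# tag["href"] raises KeyError when the attribute is missing;
# tag.get("href") returns None instead.
for a in soup.find_all("a"):
    print(a.get("href"))

# href=True keeps only the tags that actually carry the attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```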

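This first finds the a tags and then reads the href attribute of each one. To pick out just one of them, filter on something that identifies it, such as its id or class value (note: in a search, class is written class_, because class is a reserved word in Python):

  s = m.find_all(id="link2", class_="sister")[0]["href"]
  print(s)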
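
The id and class values above do not occur on every page, so here is a self-contained sketch with a made-up fragment that does contain them. It uses find(), which returns the first match or None, as an alternative to indexing find_all(...)[0], which raises IndexError when nothing matches.

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="brother" href="http://example.com/tom" id="link3">Tom</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Filter by CSS class; the keyword is class_ because class is reserved.
sisters = soup.find_all("a", class_="sister")
print([a["id"] for a in sisters])

# find() returns the first match, or None when nothing matches.
link = soup.find(id="link2", class_="sister")
href = link["href"] if link is not None else None
print(href)
```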

Regular expressions: BeautifulSoup also accepts regular expressions:

import requests
from bs4 import BeautifulSoup
import re

r = requests.get("https://www.baidu.com")
n = r.content
m = BeautifulSoup(n,"html.parser")
# A compiled pattern is matched against tag names; "^p" matches names starting with p.
for tag in m.find_all(re.compile("^p")):
    print(tag.name)

This prints the name of every tag whose name begins with p.
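
The same idea can be tried offline. The sketch below runs on a made-up fragment and also shows a pattern applied to an attribute value rather than to the tag name; the names list and the secure variable are chosen here for illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><body>
<p>text</p>
<pre>code</pre>
<a href="http://example.com/a">http</a>
<a href="https://example.com/b">https</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Pattern applied to tag names: every tag whose name starts with "p".
names = [tag.name for tag in soup.find_all(re.compile("^p"))]
print(names)

# Pattern applied to an attribute value: only links that use https.
secure = soup.find_all("a", href=re.compile(r"^https://"))
print([a["href"] for a in secure])
```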

For more methods and details, see the official Chinese documentation.

果粒陳橙
