Scraping web pages in Python with Beautiful Soup

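This article shows how to use the Beautiful Soup library to extract content from an HTML page. After extraction, we will use Beautiful Soup to convert that content into a Python list or dictionary.

To scrape a web page in Python, we will follow three basic steps:

Fetch the HTML content using the requests library.

Analyze the HTML structure and identify the tags that contain our content.

Extract the tags using Beautiful Soup and put the data in a Python list.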

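Installing the libraries

First, install the libraries we need. The requests library fetches the HTML content from a website. Beautiful Soup parses the HTML and converts it into Python objects. For Python 3, install these two libraries: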

[root@localhost ~]# pip3 install requests beautifulsoup4
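
As a quick check that both libraries import correctly, and to illustrate what converting HTML into Python objects looks like, here is a minimal sketch. The HTML fragment and the variable names are made up for illustration and do not come from the target site.

```python
import requests
from bs4 import BeautifulSoup

# A tiny hand-written HTML fragment (illustrative only)
sample = '<p class="greeting">Hello, <b>world</b>!</p>'

soup = BeautifulSoup(sample, 'html.parser')

# Tags become Python objects with attributes and methods
paragraph = soup.p
print(paragraph['class'])      # class is multi-valued, so this is a list
print(paragraph.get_text())    # text of the tag and all of its children
print(requests.__version__)    # confirms that requests is installed
```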

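Extracting the HTML

This article scrapes the Technology page of the website notes.ayushsharma.in. If you go to that page, you will see a list of articles, each with a title, an excerpt, and a publication date. Our goal is to build a list of articles containing that information.

The full URL of the Technology page is:

https://notes.ayushsharma.in/technology

We can use requests to get the HTML content from this page: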

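#!/usr/bin/python3
import requests

# Full URL of the Technology page
url = 'https://notes.ayushsharma.in/technology'

# Send a GET request; the result is a Response object
data = requests.get(url)

# Print the HTML source of the page
print(data.text)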

The variable data holds the response object, and data.text contains the HTML source of the page.
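
The request above assumes the server answers quickly and successfully. A slightly more defensive variant could look like the sketch below. The helper name fetch_html and the 10-second default timeout are assumptions chosen for illustration and are not part of the original script.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML text at url, raising on network or HTTP errors."""
    # timeout stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

With such a helper, fetch_html('https://notes.ayushsharma.in/technology') would stand in for the requests.get(url) call and the data.text access.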

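Extracting content from the HTML

To extract the article information from data, we need to identify which tags hold the content we need.

If you skim through the HTML, you will find this section near the top: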

<div class="col">
  <a href="/2021/08/using-variables-in-jekyll-to-define-custom-content" class="post-card">
    <div class="card">
      <div class="card-body">
        <h5 class="card-title">Using variables in Jekyll to define custom content</h5>
        <small class="card-text text-muted">I recently discovered that Jekyll's config.yml can be used to define custom
          variables for reusing content. I feel like I've been living under a rock all this time. But to err over and
          over again is human.</small>
      </div>
      <div class="card-footer text-end">
        <small class="text-muted">Aug 2021</small>
      </div>
    </div>
  </a>
</div>

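This section repeats across the page, once for every article. We can see that .card-title holds the article title, .card-text holds the excerpt, and the small tag under the .card-footer class holds the publication date.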
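
Before touching the live page, the three selectors can be tried offline against a copy of the card markup. This is only a sketch: the excerpt is shortened to its first sentence, and the whitespace cleanup is an addition that the original script does not perform.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the card markup shown above
card_html = """
<div class="col">
  <a href="/2021/08/using-variables-in-jekyll-to-define-custom-content" class="post-card">
    <div class="card">
      <div class="card-body">
        <h5 class="card-title">Using variables in Jekyll to define custom content</h5>
        <small class="card-text text-muted">I recently discovered that Jekyll's config.yml
          can be used to define custom variables for reusing content.</small>
      </div>
      <div class="card-footer text-end">
        <small class="text-muted">Aug 2021</small>
      </div>
    </div>
  </a>
</div>
"""

card = BeautifulSoup(card_html, 'html.parser')

# select_one() returns the first match for a CSS selector, or None
title = card.select_one('.card-title').get_text()
# The excerpt spans several lines, so collapse runs of whitespace
excerpt = ' '.join(card.select_one('.card-text').get_text().split())
pub_date = card.select_one('.card-footer small').get_text()

print(title)
print(excerpt)
print(pub_date)
```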

Let's use Beautiful Soup to extract this content from the live page.

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)

# Will hold one dictionary per article
my_data = []

# Parse the HTML and select every article card link
html = BeautifulSoup(data.text, 'html.parser')
articles = html.select('a.post-card')

for article in articles:

    # select() returns a list; take the first match inside this card
    title = article.select('.card-title')[0].get_text()
    excerpt = article.select('.card-text')[0].get_text()
    pub_date = article.select('.card-footer small')[0].get_text()

    my_data.append({"title": title, "excerpt": excerpt, "pub_date": pub_date})

pprint(my_data)

The code above extracts the articles and puts them in the my_data variable. I am using pprint to pretty-print the output.
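
The loop above raises an IndexError if a card is missing one of the three elements. One possible hardening is sketched below. The function names text_of and parse_articles, the None fallback, and the extra link field are all assumptions added for illustration; only the CSS selectors come from the script above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def text_of(parent, selector):
    """Return whitespace-normalized text of the first match, or None."""
    node = parent.select_one(selector)
    if node is None:
        return None
    return ' '.join(node.get_text().split())


def parse_articles(html_text, base_url):
    """Build a list of article dictionaries from the page HTML."""
    soup = BeautifulSoup(html_text, 'html.parser')
    articles = []
    for card in soup.select('a.post-card'):
        articles.append({
            'title': text_of(card, '.card-title'),
            'excerpt': text_of(card, '.card-text'),
            'pub_date': text_of(card, '.card-footer small'),
            # href is relative in the markup, so resolve it against the page URL
            'link': urljoin(base_url, card.get('href', '')),
        })
    return articles
```

Passing data.text and url from the script above to parse_articles would give the same kind of list as my_data, with missing fields reported as None instead of stopping the run.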

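Summary

With the articles collected in a Python list, we can return the data as JSON to another application, or convert it to HTML with custom styling.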
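
Both follow-ups need only the standard library. The sketch below uses a one-item stand-in for the scraped list, with values copied from the sample card and the excerpt shortened to its first sentence. The list markup is an arbitrary choice and not something the original article prescribes.

```python
import json
from html import escape

# Stand-in for the scraped list (values copied from the sample card)
my_data = [
    {
        'title': 'Using variables in Jekyll to define custom content',
        'excerpt': "I recently discovered that Jekyll's config.yml can be "
                   "used to define custom variables for reusing content.",
        'pub_date': 'Aug 2021',
    },
]

# JSON: ensure_ascii=False keeps non-ASCII characters readable
as_json = json.dumps(my_data, ensure_ascii=False, indent=2)

# HTML: escape() neutralizes markup characters in the scraped text
items = [
    '<li><strong>{}</strong> ({}): {}</li>'.format(
        escape(a['title']), escape(a['pub_date']), escape(a['excerpt']))
    for a in my_data
]
as_html = '<ul>\n' + '\n'.join(items) + '\n</ul>'

print(as_json)
print(as_html)
```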

Original article: https://www.linuxprobe.com/bs4-python-web.html
