Parsing Data with Beautiful Soup

1. Introduction

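Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape, and because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding. Beautiful Soup works on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.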
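As a rough illustration of the encoding point above, here is a minimal sketch. The byte string is a made-up input with no declared charset, and `from_encoding` is the constructor argument used to state the original encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical input: ISO-8859-1 bytes with no <meta charset> declaration.
raw = "<p>café</p>".encode("iso-8859-1")

# State the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

text = soup.p.string               # decoded to a Unicode string
utf8_bytes = soup.encode("utf-8")  # document re-encoded as UTF-8 bytes
print(text)
```

If the input is already a `str`, no decoding takes place and `from_encoding` is not needed.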

2. Installation

Download: https://pypi.python.org/pypi/beautifulsoup4/4.3.2

Official documentation:

http://beautifulsoup.readthedocs.org/zh_CN/latest

from bs4 import BeautifulSoup

First we create a string that the examples below will use:

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

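Create the BeautifulSoup object, naming the parser explicitly so the result does not depend on which parsers happen to be installed:

soup = BeautifulSoup(html, "html.parser")

Next, print the contents of the soup object as formatted output:

print(soup.prettify())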

3.1 Finding tags

Print tags directly:

print(soup.title)
#<title>The Dormouse's story</title>

print(soup.head)
#<head><title>The Dormouse's story</title></head>

print(soup.a)
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

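Writing soup followed by a tag name gets us these tags with very little effort, which is far more convenient than regular expressions. One caveat: this form returns only the first matching tag in the whole document.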
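The "first match only" behaviour can be sketched as follows. The two-link snippet is a made-up stand-in for the example document; `find_all` is the method to use when every match is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, used only for this illustration.
doc = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")

first = soup.a                      # shorthand for soup.find("a"): first match only
all_links = soup.find_all("a")      # every <a> tag, in document order
ids = [a["id"] for a in all_links]
print(first["id"], ids)
```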

A tag has two important attributes, name and attrs. Let's look at each in turn.

print(soup.name)
#[document]

print(soup.head.name)
#head

The soup object itself is a special case: its name is [document]. For any other tag, the value is the name of the tag itself.

print(soup.p.attrs)
#{'class': ['title'], 'name': 'dromouse'}

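Here we printed every attribute of the p tag; the result is a dictionary.

To get a single attribute, index the tag by the attribute name. For example, to get the value of its class: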

print(soup.p['class'])
#['title']
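Indexing a tag with an attribute it does not have raises `KeyError`, so `.get()` is often the safer choice. A small sketch, using a one-tag document invented for the purpose:

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document mirroring the p tag above.
soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]             # class is multi-valued, so this is a list
name = p["name"]                 # single-valued attributes come back as str
missing = p.get("id")            # None instead of a KeyError
fallback = p.get("id", "no-id")  # or a default of your choosing
print(classes, name, missing, fallback)
```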

3.2 Getting text

Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:

print(soup.p.string)
#The Dormouse's story
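One thing worth knowing about `.string`: it yields a value only when the tag has a single child chain leading to one string. With several children it is `None`, and `get_text()` is the usual alternative. A minimal sketch with a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical document: one p with a single child, one p with mixed children.
doc = "<p><b>The Dormouse's story</b></p><p>Hello <i>world</i>!</p>"
soup = BeautifulSoup(doc, "html.parser")
first, second = soup.find_all("p")

single = first.string        # descends through the lone <b> to its string
multi = second.string        # None, because the tag has several children
joined = second.get_text()   # concatenates all text inside the tag
print(single, multi, joined)
```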

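3.3 CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements through soup.select(), which returns a list.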
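Since `select()` always returns a list, an empty result is `[]` rather than `None`. When only the first match is needed, `select_one()` is an option. A short sketch with an invented snippet (the `.brother` class is deliberately absent):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links sharing a class.
doc = ('<div><a class="sister" id="link1">Elsie</a>'
       '<a class="sister" id="link2">Lacie</a></div>')
soup = BeautifulSoup(doc, "html.parser")

matches = soup.select(".sister")     # list of every match
first = soup.select_one(".sister")   # first match only
empty = soup.select(".brother")      # no match: empty list
nothing = soup.select_one(".brother")  # no match: None
print(len(matches), first["id"], len(empty), nothing)
```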

3.3.1 Selecting by tag name

print(soup.select('title'))
#[<title>The Dormouse's story</title>]

3.3.2 Selecting by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3.3.3 Selecting by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

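3.3.4 Combined selectors

Combined selectors follow the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]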

3.3.5 Selecting direct children

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]
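The difference between the space (any descendant) and `>` (direct child only) is easiest to see side by side. A minimal sketch; the `box`, `inner`, and `direct` ids are invented for the illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: one link inside a <p>, one directly under the <div>.
doc = '<div id="box"><p><a id="inner">x</a></p><a id="direct">y</a></div>'
soup = BeautifulSoup(doc, "html.parser")

# A space matches at any depth below #box.
descendants = [a["id"] for a in soup.select("#box a")]
# ">" matches only immediate children of #box.
children = [a["id"] for a in soup.select("#box > a")]
print(descendants, children)
```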

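3.3.6 Selecting by attribute

An attribute can also be part of the selector. It goes in square brackets, and because the attribute belongs to the same node as the tag, there must be no space between them; otherwise nothing will match.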

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

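Attribute selectors can likewise be combined with the other forms above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are not.

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]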
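Beyond exact matches, the standard CSS attribute operators `^=` (starts with) and `$=` (ends with) can also be tried with `select()`. The sketch below assumes a reasonably recent Beautiful Soup, and its three-link document is made up for the example. It also shows a common follow-up step, pulling the `href` values out of the matches.

```python
from bs4 import BeautifulSoup

# Hypothetical document: two absolute links and one relative link.
doc = ('<p><a href="http://example.com/elsie" class="sister">Elsie</a>'
       '<a href="http://example.com/lacie" class="sister">Lacie</a>'
       '<a href="/local">Local</a></p>')
soup = BeautifulSoup(doc, "html.parser")

# Links inside <p> whose href starts with the given prefix.
hrefs = [a["href"] for a in soup.select('p a[href^="http://example.com/"]')]
# Links whose href ends with "lacie".
ends = [a.get_text() for a in soup.select('a[href$="lacie"]')]
print(hrefs, ends)
```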
