Python 16: BeautifulSoup

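I. What Is BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML or XML files. It is widely used in web scraping, where it helps pull specific content out of an HTML document for analysis.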

II. Basic BeautifulSoup Usage

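from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download a page to parse (the URL here is only a placeholder)
html = urlopen("http://example.com").read()

# Returns a BeautifulSoup object parsed with lxml. The lxml package must be
# installed, but it does not need to be imported.
# A BeautifulSoup object represents the entire content of a document.
soup = BeautifulSoup(html, "lxml")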
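
A minimal sketch of the same call that parses an inline HTML string instead of a downloaded page, so it runs without network access. The sample markup is made up for illustration, and the built-in "html.parser" is used here only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used in place of a downloaded page
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; pass "lxml" instead if it is installed
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string    # text inside <title>
first_paragraph = soup.p.string   # text inside the first <p>
print(title_text, first_paragraph)
```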

III. Objects in BeautifulSoup

<html>
    <head>
        <title>
            The Dormouse's story
        </title>
    </head>
    <body>
           <p class="title">
               <b>
                   The Dormouse's story
               </b>
           </p>
           <p class="story">
               Once upon a time there were three little sisters; and their names were
               <a class="sister" href="http://example.com/elsie" id="link1">
               Elsie
               </a>
                   ,
               <a class="sister" href="http://example.com/lacie" id="link2">
                   Lacie
               </a>
                   and
               <a class="sister" href="http://example.com/tillie" id="link3">
                  Tillie
               </a>
                  ; and they lived at the bottom of a well.
           </p>
           <p class="story">
              ...
          </p>
    </body>
</html>

1. Tag objects

A Tag object corresponds to a tag in the original XML or HTML document. In the example below, soup.b is a Tag object.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')

# Returns a Tag instance that represents the whole tag: its name, attributes, and content
>>> soup.b
<b class="boldest">Extremely bold</b>
>>> type(soup.b)
<class 'bs4.element.Tag'>

# tag is a Tag instance obtained from the BeautifulSoup object
>>> tag = soup.b
# Get the value of one attribute (class is multi-valued, so a list is returned)
>>> tag["class"]
['boldest']
# Get the text content of the tag
>>> tag.string
'Extremely bold'
# Get the name of the tag
>>> tag.name
'b'
# Get all attributes of the tag as a dict
>>> tag.attrs
{'class': ['boldest']}
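
Attribute access on a Tag behaves like a dictionary lookup, so asking for an attribute the tag does not have raises KeyError. The sketch below shows one way to handle that with tag.get() and tag.has_attr(). The "no-id" default is just an illustrative value, and "html.parser" is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# Indexing a missing attribute raises KeyError, just like a dict
try:
    tag["id"]
    missing_raises = False
except KeyError:
    missing_raises = True

# .get() returns None (or the supplied default) instead of raising
tag_id = tag.get("id")
tag_id_default = tag.get("id", "no-id")

# .has_attr() checks for an attribute without reading it
has_class = tag.has_attr("class")
```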

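2. NavigableString objects

A string is the text content inside a tag. BeautifulSoup wraps it in a NavigableString object. For example, in <b class="boldest">Extremely bold</b> the NavigableString is Extremely bold. It can be retrieved with tag.string.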

>>> soup = BeautifulSoup('<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>', 'lxml')
>>> tag = soup.a
>>> tag.string
'Lacie'
>>> tag.name
'a'
>>> tag.attrs
{'class': ['sister'], 'href': 'http://example.com/lacie', 'id': 'link2'}
>>> tag["href"]
'http://example.com/lacie'
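
To make the type of tag.string visible, here is a small sketch that checks it against NavigableString and converts it to an ordinary Python string. It reuses the markup from the example above, with "html.parser" chosen only so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

markup = '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
soup = BeautifulSoup(markup, "html.parser")
text = soup.a.string

is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)    # NavigableString is a subclass of str
plain = str(text)                 # convert to an ordinary Python string
parent_name = text.parent.name    # the string still knows which tag holds it
```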

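3. BeautifulSoup objects

A BeautifulSoup object represents the entire content of a document. Accessing a tag name on it as an attribute (for example soup.b) returns a Tag object.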
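
A short sketch of how the BeautifulSoup object itself behaves. Since it stands for the whole document rather than a real tag, it reports the special name "[document]". The markup is reused from the Tag example, and "html.parser" is an arbitrary parser choice for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

doc_name = soup.name                    # special name for the whole document
first_b = soup.b                        # attribute access returns a Tag
first_b_type = type(first_b).__name__
found = soup.find("b")                  # soup.b is shorthand for soup.find("b")
```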

4. Comment objects
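
A Comment is a special kind of NavigableString that holds the text of an HTML comment (<!-- ... -->). A minimal sketch, using made-up markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup whose only child is an HTML comment
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string
is_comment = isinstance(comment, Comment)
is_navigable = isinstance(comment, NavigableString)  # Comment subclasses NavigableString
comment_text = str(comment)                          # text without the <!-- --> markers
```

Checking isinstance(node, Comment) is a common way to skip comments when collecting the visible text of a page.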

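IV. Navigating the Document Tree

Child nodes

A Tag may contain several strings or other Tags. All of these are child nodes of that Tag. Beautiful Soup provides many attributes for working with and iterating over child nodes. Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.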
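
The note about strings can be checked directly. In this sketch (made-up markup, "html.parser" as an arbitrary parser choice), the <p> tag has two children, while its string child has no .contents attribute at all.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

child_count = len(p.contents)    # the string "Hello " and the <b> tag
text_node = p.contents[0]        # a NavigableString

# Strings have no children, so they do not expose .contents
string_has_contents = hasattr(text_node, "contents")
```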

Accessing soup.tagName returns the first tag with that name, including everything inside it. For example:

>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

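# Find a specific child node by tag name
>>> soup = BeautifulSoup(html_doc, 'lxml')
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.title
<title>The Dormouse's story</title>
# When several tags share a name, only the first one is returned
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Chain attribute access with . to reach a tag nested inside another tag
>>> soup.head.title
<title>The Dormouse's story</title>
# soup.p is the first <p> tag, which contains no <a> tag, so the result is None
# (the interpreter prints nothing for a bare None)
>>> soup.p.a
>>> print(soup.p.a)
None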
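
Because chained access can return None, code that goes on to read an attribute should check the result first. The following is one possible guard, shown on a shortened, illustrative version of the document above and parsed with "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the sample document
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p_link = soup.p.a   # None: the first <p> has no <a> inside it

# Guard before indexing, since None does not support ["href"]
href = first_p_link["href"] if first_p_link is not None else None

# soup.a searches the whole document, so it does find the link
doc_link_href = soup.a["href"]
```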


# The .contents attribute returns a tag's child nodes
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>> soup.a.contents
['Elsie']
>>> soup.html.contents
[<head><title>The Dormouse's story</title></head>, '\n', <body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>]

Tag attributes

# A tag's .contents attribute returns its child nodes as a list:
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>> soup.a.contents
['Elsie']
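
Since .contents is an ordinary list that mixes Tag and string children, a loop can separate the two kinds. This is a sketch on made-up markup, with "html.parser" as an arbitrary parser choice.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = '<p class="story">Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

child_tags = []      # names of the child tags
child_strings = []   # text of the child strings

for child in soup.p.contents:
    if isinstance(child, Tag):
        child_tags.append(child.name)
    else:
        child_strings.append(str(child))
```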
