Parsing Web Pages with Beautiful Soup 4

Installing Beautiful Soup 4 and related notes

The latest version of Beautiful Soup is 4.1.1; it can be downloaded here: http://www.crummy.com/software/BeautifulSoup/bs4/download/

 

Documentation:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

 

Usage:

from bs4 import BeautifulSoup

Example:

An HTML document:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Code:

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html_doc)  # an explicit parser can also be given: BeautifulSoup(html_doc, "html.parser")

From here you can start using the various features.

soup.X (where X is any tag name; returns the whole tag, including its attributes, contents, etc.)

e.g. soup.title

    # <title>The Dormouse's story</title>

    soup.p 

    # <p class="title"><b>The Dormouse's story</b></p>

    soup.a  (note: this returns only the first match)

    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a') (find_all returns all matches)

    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 

    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 

    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


    find can also search by attribute

    soup.find(id="link3")

    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
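Beyond id, any attribute can be matched by passing an attrs dictionary, and find() accepts several criteria at once. A minimal sketch against the html_doc defined above (the variable names here are illustrative):

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# attrs matches on any attribute, not just id
sisters = soup.find_all("a", attrs={"class": "sister"})
print(len(sisters))  # 3

# find() with several attributes narrows the search to a single tag
lacie = soup.find("a", attrs={"class": "sister", "id": "link2"})
print(lacie.get_text())  # Lacie
```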


    To get a particular attribute from tags, combine find_all with get

    for link in soup.find_all('a'): 

      print(link.get('href')) 

    # http://example.com/elsie 

    # http://example.com/lacie 

    # http://example.com/tillie


    To extract all of the text from an HTML document, use get_text()

    print(soup.get_text()) 

    # The Dormouse's story 

    #

    # The Dormouse's story 

    #

    # Once upon a time there were three little sisters; and their names were 

    # Elsie, 

    # Lacie and 

    # Tillie; 

    # and they lived at the bottom of a well.

    #

    # ...


    To parse an HTML file from disk, use:

    soup = BeautifulSoup(open("index.html"))



Objects in BeautifulSoup

tag (corresponds to an HTML tag)

tag.attrs (returns all of the tag's attributes as a dictionary)

You can add, delete, and modify a tag's attributes directly, just as you would with a dictionary:

tag['class'] = 'verybold' 

tag['id'] = 1 

tag 

# <blockquote class="verybold" id="1">Extremely bold</blockquote> 


del tag['class'] 

del tag['id'] 

tag 

# <blockquote>Extremely bold</blockquote> 


tag['class'] 

# KeyError: 'class' 

print(tag.get('class')) 

# None
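Like dict.get, Tag.get also accepts a fallback default, which avoids both the KeyError and the None check. A brief sketch (the fragment below is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# A default value is returned when the attribute is missing
print(tag.get("id", "no-id"))  # no-id
```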


X.contents (where X is a tag; returns the tag's children as a list)

e.g.

head_tag = soup.head 

head_tag 

# <head><title>The Dormouse's story</title></head> 

head_tag.contents 

# [<title>The Dormouse's story</title>] 

title_tag = head_tag.contents[0] 

title_tag 

# <title>The Dormouse's story</title> 

title_tag.contents 

# [u'The Dormouse's story']
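When a tag has exactly one string child, the .string shortcut saves the contents[0] step. A sketch against the same head fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
title_tag = soup.title

# .string returns the tag's single NavigableString child directly
print(title_tag.string)       # The Dormouse's story
# .contents holds the same value, wrapped in a list
print(title_tag.contents[0])  # The Dormouse's story
```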


Fixing garbled characters (mojibake) when parsing a page:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.leeon.me')
soup = BeautifulSoup(page, from_encoding="gb18030")

print(soup.original_encoding)
print(soup.prettify())

If a Chinese page is encoded as gb2312 or gbk, passing from_encoding="gb18030" to the BeautifulSoup constructor solves the mojibake problem; even if the page being parsed is actually utf-8, using gb18030 will not produce garbled output.
