BeautifulSoup in Practice

Original

2020-02-23 08:03

Lately I have been writing mostly on my own site, so I rarely update on CSDN. Readers are welcome to visit my site at http://a2bgeek.me

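A recent project of mine needs weather data. After looking at a few weather sites, I decided to scrape the data from China Weather (weather.com.cn). I don't know many Python scraping frameworks; BeautifulSoup is the only one I had heard of. Below I record the whole process of scraping data with BeautifulSoup. For reference, see the official BeautifulSoup documentation.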

Here is a brief introduction to `BeautifulSoup`. Like the JavaScript DOM, it treats an HTML document as a tree.

The following code gets the root node:

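from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup
# Placeholder URL; substitute the page you want to scrape.
rawdata = urlopen("http://xxx.xxx.xx")
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(rawdata, "html.parser")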

soup is the root node. With it you can traverse the whole document tree.
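
To try this without a network connection, the same constructor also accepts a plain string. The sketch below is only an illustration: `SAMPLE_HTML` is a made-up page, not anything from the weather site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """<html><head><title>Demo page</title></head>
<body><p class="intro">Hello</p></body></html>"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

root_name = soup.name           # the root object uses the special name "[document]"
title_text = soup.title.string  # text inside <title>
print(root_name, title_text)
```

The root object behaves like a tag, so everything described below for tags also works on soup itself.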

find quickly returns the first matching tag, together with everything nested inside it:

head = soup.find('head')

which gives

<head>
    <title>...</title>
    <style>...</style>
</head>

You can also restrict the search by id:

soup.find(id="link3")
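
A small, self-contained sketch of both forms of find. The markup in `SAMPLE_HTML` is an assumed sample built around the three links shown in this post; it also shows that find returns None when nothing matches, which is worth checking before indexing into the result.

```python
from bs4 import BeautifulSoup

# Assumed sample page containing the three "sister" links.
SAMPLE_HTML = """<html><head><title>The Dormouse's story</title></head>
<body>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_link = soup.find("a")            # first <a> in document order
link3 = soup.find(id="link3")          # the tag whose id is "link3"
missing = soup.find(id="no-such-id")   # no match, so this is None

print(first_link["id"], link3.string, missing)
```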

find_all returns every tag that matches the given filter:

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
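
find_all also accepts filters beyond the tag name. The sketch below, again on an assumed sample page, uses the `class_` keyword (with a trailing underscore because class is a Python keyword) and the `limit` argument.

```python
from bs4 import BeautifulSoup

# Assumed sample page; the second link deliberately has a different class.
SAMPLE_HTML = """<body>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="other" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</body>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the <a> tags whose class is "sister".
sister_hrefs = [a["href"] for a in soup.find_all("a", class_="sister")]

# Stop after the first two matches.
first_two_ids = [a["id"] for a in soup.find_all("a", limit=2)]

print(sister_hrefs, first_two_ids)
```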

Attribute access with . is also used a lot:

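soup.title
# gives <title>The Dormouse's story</title>
soup.title.name
# gives the tag name
soup.title.string
# gives the text between the opening and closing tags
soup.p['class']
# gives the class attribute of the first p tag (as a list, because class can hold several values)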
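
The same accesses as a runnable sketch on an assumed sample page. It also shows `get`, which returns None for a missing attribute where square brackets would raise KeyError.

```python
from bs4 import BeautifulSoup

# Assumed sample page; the p tag carries two classes on purpose.
SAMPLE_HTML = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title story"><b>The Dormouse's story</b></p></body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tag_name = soup.title.name      # "title"
title_text = soup.title.string  # text inside <title>
p_classes = soup.p["class"]     # class is multi-valued, so this is a list
p_id = soup.p.get("id")         # attribute is absent, so this is None

print(tag_name, title_text, p_classes, p_id)
```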

contents is also fairly common:

soup.body.contents
# a list of all direct children of body, including both tags and text nodes
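
Because text nodes count as children, the whitespace between tags takes up list positions too. This is why indexing into contents usually lands on odd positions such as 1 and 3. A minimal sketch on an assumed two-div page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed sample: two divs separated by newlines.
SAMPLE_HTML = """<body>
<div id="a">first</div>
<div id="b">second</div>
</body>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
children = soup.body.contents

# Positions of the real tags; the other entries are "\n" text nodes.
tag_positions = [i for i, child in enumerate(children) if isinstance(child, Tag)]

# Keeping only Tag objects gives indexes that do not depend on whitespace.
tags_only = [child for child in children if isinstance(child, Tag)]

print(len(children), tag_positions, [t["id"] for t in tags_only])
```

If the page's whitespace changes, positional indexes into contents shift, so filtering to tags first is the safer habit.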

Now for the process of scraping data from China Weather.

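First open the Lanzhou weather page, then open Firebug and locate the part we need to scrape. As the screenshot shows, the div with id="7d" is the part we want. Here is the code: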

tag7d = soup.find(id="7d")
# the div with id="7d"
tagweatherYubaoBox = tag7d.contents[3]
# the div with class="weatherYubaoBox"
resultset = [tagweatherYubaoBox.contents[5].find_all("a"), tagweatherYubaoBox.contents[9].find_all("a"), tagweatherYubaoBox.contents[13].find_all("a")]
# I only want today, tomorrow, and the day after, i.e. the first three tables with class="yuBaoTable". Next, loop over them to extract the data.
result = []
for item in resultset:
    tmp_dict = {}
    tmp_dict["date"] = item[0].string
    tmp_dict["imgurl"] = ''.join(["http://www.weather.com.cn", item[1].contents[1]["src"]])
    tmp_dict["weather"] = item[2].string
    # The labels u"夜間" and u"白天" mean "night" and "daytime".
    if len(item) == 6:
        # After 6 o'clock China Weather no longer provides daytime data, so check for that case here.
        tmp_dict["low"] = ' '.join([u"夜間", item[3].contents[1].contents[0], item[3].contents[1].contents[1].string])
    else:
        tmp_dict["high"] = ' '.join([u"白天", item[3].contents[1].contents[0], item[3].contents[1].contents[1].string])
        tmp_dict["low"] = ' '.join([u"夜間", item[8].contents[1].contents[0], item[8].contents[1].contents[1].string])
    result.append(tmp_dict)
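
The positional indexes above (3, 5, 9, 13) break as soon as the page's whitespace or layout changes. One possible alternative, sketched here and not the approach used in this post, is to select the tables by class. Only the id "7d" and the class names weatherYubaoBox and yuBaoTable come from the page described above; the rest of `SAMPLE_HTML` is invented and much simpler than the real markup.

```python
from bs4 import BeautifulSoup

# Invented, heavily simplified stand-in for the forecast markup.
SAMPLE_HTML = """
<div id="7d">
  <div class="weatherYubaoBox">
    <table class="yuBaoTable"><tr><td><a>day 1</a></td><td><a>sunny</a></td></tr></table>
    <table class="yuBaoTable"><tr><td><a>day 2</a></td><td><a>cloudy</a></td></tr></table>
    <table class="yuBaoTable"><tr><td><a>day 3</a></td><td><a>rain</a></td></tr></table>
    <table class="yuBaoTable"><tr><td><a>day 4</a></td><td><a>snow</a></td></tr></table>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Walk down by id and class instead of by position in contents.
box = soup.find(id="7d").find("div", class_="weatherYubaoBox")

# limit=3 keeps only the first three forecast tables.
tables = box.find_all("table", class_="yuBaoTable", limit=3)

# Collect the link texts of each table, mirroring find_all("a") above.
forecast = [[a.get_text(strip=True) for a in table.find_all("a")] for table in tables]
print(forecast)
```

On the real page each table holds more links than this sample, so the per-item indexing from the loop above would still be needed after this selection step.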

That's all for today. Criticism is welcome.
