Three Ways to Scrape Web Page Data with Python
1. Extracting page content with regular expressions
Parsing speed: regular expressions > lxml > BeautifulSoup
Code:
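# Python 2 (urllib2 and the print statement)
import re
import urllib2

urllist = 'http://example.webscraping.com/places/default/view/United-Kingdom-239'
html = urllib2.urlopen(urllist).read()
# Capture the contents of every <td class="w2p_fw"> cell (non-greedy match)
num = re.findall('<td class="w2p_fw">(.*?)</td>', html)
print num
# Print the second matched cell
print "num[1]: ", num[1]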
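To see what the non-greedy group (.*?) in the pattern above actually does, here is a minimal sketch that runs offline. It is written for Python 3, unlike the Python 2 listing, and the sample_html string and the extract_cells helper are made up for illustration; the fragment only imitates the shape of the real page and its values are placeholders.

```python
import re

# Hypothetical one-line fragment imitating two rows of the page (placeholder values).
sample_html = (
    '<tr id="places_area__row"><td class="w2p_fw">123 square kilometres</td></tr>'
    '<tr id="places_population__row"><td class="w2p_fw">456</td></tr>'
)


def extract_cells(html):
    """Return the inner text of every <td class="w2p_fw"> cell."""
    # (.*?) is non-greedy: each match stops at the nearest </td>.
    return re.findall(r'<td class="w2p_fw">(.*?)</td>', html)


cells = extract_cells(sample_html)

# The greedy form (.*) runs on to the LAST </td> on the line, so both
# cells are swallowed into a single, unusable match.
greedy = re.findall(r'<td class="w2p_fw">(.*)</td>', sample_html)

print(cells)
print(len(greedy))
```

The pattern is tied to the exact spelling of the tag, so an extra attribute or a change in whitespace on the page would make it miss cells. That fragility is the price paid for the speed of the regular-expression approach.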
2. Extracting page content with BeautifulSoup
The code is as follows:
from bs4 import BeautifulSoup
import urllib2

urllist = 'http://example.webscraping.com/places/default/view/United-Kingdom-239'
html = urllib2.urlopen(urllist).read()
# Parse the HTML, repairing malformed markup along the way
soup = BeautifulSoup(html, 'html.parser')
# Find the tr tag whose id attribute is places_area__row. find() returns only
# the first match; find_all() would return every match instead.
tr = soup.find('tr', attrs={'id': 'places_area__row'})
td = tr.find('td', attrs={'class': 'w2p_fw'})
# Extract the tag's text content
area = td.text
print "area: ", area
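The difference between find and find_all mentioned in the comment can be checked on a small inline document. The sketch below is written for Python 3 and assumes the bs4 package is installed; sample_html and the row id no_such_row are hypothetical, chosen only to show the three cases (first match, all matches, no match).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two rows that each contain a td.w2p_fw cell.
sample_html = """
<table>
  <tr id="places_area__row"><td class="w2p_fw">123 square kilometres</td></tr>
  <tr id="places_population__row"><td class="w2p_fw">456</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
first_cell = soup.find("td", attrs={"class": "w2p_fw"})

# find_all() returns a list-like ResultSet holding every matching Tag.
all_cells = soup.find_all("td", attrs={"class": "w2p_fw"})

# Check for None before reading .text; calling .text on a missing row
# would otherwise raise AttributeError.
missing = soup.find("tr", attrs={"id": "no_such_row"})
missing_text = missing.text if missing is not None else None

print(first_cell.text)
print([td.text for td in all_cells])
print(missing_text)
```

The None check matters in real scraping: if the site renames a row id, an unguarded tr.find(...) call fails with an AttributeError that says nothing about which element was missing.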
3. lxml
The lxml library offers functionality and usage similar to BeautifulSoup, but it parses faster.
Code:
import lxml.html
import urllib2

urllist = 'http://example.webscraping.com/places/default/view/United-Kingdom-239'
html = urllib2.urlopen(urllist).read()
tree = lxml.html.fromstring(html)
# cssselect() requires the separate cssselect package to be installed
td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
area = td.text_content()
print area
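If the cssselect package is not available, the same cell can be reached with lxml's built-in XPath support. This is a minimal Python 3 sketch, not the method used in the listing above; sample_html is a hypothetical fragment with placeholder values, and the XPath expression assumes the class attribute is exactly "w2p_fw".

```python
import lxml.html

# Hypothetical markup imitating the relevant rows of the page.
sample_html = """
<table>
  <tr id="places_area__row"><td class="w2p_fw">123 square kilometres</td></tr>
  <tr id="places_population__row"><td class="w2p_fw">456</td></tr>
</table>
"""

tree = lxml.html.fromstring(sample_html)

# XPath counterpart of the CSS selector 'tr#places_area__row > td.w2p_fw'
# for markup where the class attribute has no other values.
cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')

# xpath() returns a list, so an empty result can be handled without
# the IndexError that an unguarded [0] would raise.
area = cells[0].text_content() if cells else None

print(area)
```

Checking the list before indexing keeps the script from crashing when the row is absent, which the [0] lookup in the original listing does not guard against.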