Three Ways to Scrape Web Page Data with Python

幸運券發放

2019-01-29 11:16

1. Extracting page content with regular expressions

Parsing speed: regular expressions > lxml > BeautifulSoup

Code:

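import re
import urllib.request

urllist = 'http://example.webscraping.com/places/default/view/United-Kingdom-239'

# Download the page. read() returns bytes, so decode first (assuming UTF-8).
html = urllib.request.urlopen(urllist).read().decode('utf-8')

# Capture the contents of every <td class="w2p_fw"> cell (non-greedy match).
num = re.findall(r'<td class="w2p_fw">(.*?)</td>', html)
print(num)
print("num[1]: ", num[1])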
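To try the pattern without a network connection, here is a minimal sketch that runs the same regular expression against an inline HTML string. The sample markup, the SAMPLE_HTML name and the extract_values helper are illustrative assumptions, not part of the original article or of the real page.

```python
import re

# Hypothetical sample markup, loosely modelled on the rows scraped above.
SAMPLE_HTML = (
    '<table>'
    '<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">244,820 square kilometres</td></tr>'
    '<tr id="places_population__row"><td class="w2p_fl">Population: </td>'
    '<td class="w2p_fw">62,348,447</td></tr>'
    '</table>'
)


def extract_values(html):
    """Return the inner text of every <td class="w2p_fw"> cell."""
    # (.*?) is non-greedy, so each match stops at the nearest </td>.
    return re.findall(r'<td class="w2p_fw">(.*?)</td>', html)


values = extract_values(SAMPLE_HTML)
print(values)

# The same pattern finds nothing once the markup changes slightly,
# for example when an extra attribute is added to the tag.
changed = SAMPLE_HTML.replace('<td class="w2p_fw">', '<td class="w2p_fw" style="">')
print(extract_values(changed))
```

The second call returns an empty list, which illustrates the trade-off: the regular expression is fast, but it depends on the exact text of the tag.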

2. Extracting page content with BeautifulSoup

The code is as follows:

from bs4 import BeautifulSoup
import urllib.request

urllist = 'http://example.webscraping.com/places/default/view/United-Kingdom-239'
html = urllib.request.urlopen(urllist).read()

# Parse the HTML; the parser also repairs malformed markup.
soup = BeautifulSoup(html, 'html.parser')

# Find the tr tag whose id attribute is places_area__row. find() returns only
# the first match; switching to find_all() would return every match.
tr = soup.find('tr', attrs={'id': 'places_area__row'})
td = tr.find('td', attrs={'class': 'w2p_fw'})

# Extract the text inside the tag.
area = td.text
print("area: ", area)
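The difference between find and find_all described in the comment above can be seen on a small inline document. This is a sketch only: the sample markup and the SAMPLE_HTML name are assumptions made for illustration, and it uses Python's built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on the page scraped above.
SAMPLE_HTML = """
<table>
  <tr id="places_area__row">
    <td class="w2p_fl">Area: </td>
    <td class="w2p_fw">244,820 square kilometres</td>
  </tr>
  <tr id="places_population__row">
    <td class="w2p_fl">Population: </td>
    <td class="w2p_fw">62,348,447</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching tag (or None if nothing matches).
first_cell = soup.find("td", attrs={"class": "w2p_fw"})
print(first_cell.text)

# find_all() returns every matching tag as a list.
all_cells = soup.find_all("td", attrs={"class": "w2p_fw"})
print([cell.text for cell in all_cells])
```

Because find returns None when nothing matches, it is worth checking the result before reading .text when scraping a real page.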

3. lxml

The lxml library offers functionality and usage similar to BeautifulSoup, but it parses faster.

Code:

import lxml.html
import urllib.request

urllist = 'http://example.webscraping.com/places/default/view/United-Kingdom-239'
html = urllib.request.urlopen(urllist).read()

tree = lxml.html.fromstring(html)

# cssselect() needs the separate cssselect package to be installed.
td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
area = td.text_content()
print(area)
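The cssselect method relies on the separate cssselect package. If that package is unavailable, roughly the same lookup can be written with lxml's built-in XPath support. The sketch below assumes a small inline document (SAMPLE_HTML is a hypothetical stand-in for the downloaded page) and is not taken from the original article.

```python
import lxml.html

# Hypothetical sample markup, loosely modelled on the page scraped above.
SAMPLE_HTML = (
    '<html><body><table>'
    '<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">244,820 square kilometres</td></tr>'
    '<tr id="places_population__row"><td class="w2p_fl">Population: </td>'
    '<td class="w2p_fw">62,348,447</td></tr>'
    '</table></body></html>'
)

tree = lxml.html.fromstring(SAMPLE_HTML)

# XPath counterpart of the CSS selector 'tr#places_area__row > td.w2p_fw'.
# Note: @class="w2p_fw" is an exact string comparison, so unlike the CSS
# class selector it would not match a cell carrying several classes.
cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')
area = cells[0].text_content()
print(area)
```

Both xpath and cssselect return a list, so indexing with [0] raises IndexError when the row is missing; a real scraper may want to check the list first.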

This article is reposted from 老鷹a's 51CTO blog. Original link: http://blog.51cto.com/laoyinga/1939999
