First, Python (2.x) ships with the urllib and urllib2 modules, which are enough for basic page fetching; the third-party requests library is also very useful.
Requests:
import requests
response = requests.get(url)
content = response.content  # reuse the response instead of fetching the page a second time
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = response.read()  # reuse the response instead of fetching the page a second time
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
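The snippets above use Python 2 syntax. In Python 3, urllib and urllib2 were merged into urllib.request; a minimal Python 3 sketch of the same fetch (a data: URL stands in for a real page here so the example runs without network access; substitute a real http:// URL in practice):

```python
import urllib.request

# data: URL used as an offline stand-in for a real page
url = "data:text/plain;charset=utf-8,hello"
response = urllib.request.urlopen(url)
content = response.read()             # bytes
print("response headers:", response.headers)
print("content:", content)            # b'hello'
```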
For a URL with a query string, a GET request generally appends the data to the URL: a '?' separates the URL from the data, and multiple parameters are joined with '&'.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict (or JSON)
import requests
response = requests.get(url=url, params=data)
Urllib2: data must be a URL-encoded string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
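In Python 3 the urlencode helper moved to urllib.parse; the same query-string construction can be sketched as follows (the base URL and data values are placeholders):

```python
from urllib.parse import urlencode

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
query = urlencode(data)          # 'data1=XXXXX&data2=XXXXX'
full_url = 'http://example.com/get' + '?' + query
print(full_url)
```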
The re module's .group() method:
import re

a = "123abc456"
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0)   # 123abc456, the whole match
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1)   # 123
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2)   # abc
print re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3)   # 456
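Note that re.search returns None when nothing matches, so in real scraping code it is safer to check the match object before calling group(); groups() also returns all captures at once. A small Python 3 sketch:

```python
import re

a = "123abc456"
m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", a)
if m:                      # re.search returns None on no match
    print(m.groups())      # ('123', 'abc', '456')
    print(m.group(2))      # abc
```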