Python Crawler: Basic Page Fetching


First, Python 2 ships with the urllib and urllib2 modules, which are enough for most basic page fetching; the third-party requests library is also very useful.

Requests:
	import requests
	response = requests.get(url)
	content = response.content   # reuse the response instead of fetching the page a second time
	print "response headers:", response.headers
	print "content:", content
Urllib2:
	import urllib2
	response = urllib2.urlopen(url)
	content = response.read()   # reuse the response instead of fetching the page a second time
	print "response headers:", response.headers
	print "content:", content
Httplib2:
	import httplib2
	http = httplib2.Http()
	response_headers, content = http.request(url, 'GET')
	print "response headers:", response_headers
	print "content:", content
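
The snippets above are Python 2 (urllib2 does not exist in Python 3, where it was folded into urllib.request). A self-contained Python 3 sketch of the same fetch pattern, using a throwaway local server so no external URL is assumed:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a fixed body so the example runs offline
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello, crawler")

    def log_message(self, *args):
        pass  # silence request logging

# Port 0 lets the OS pick a free port
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

response = urllib.request.urlopen(url)
content = response.read()
print("response headers:", response.headers["Content-Type"])
print("content:", content)

server.shutdown()
```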

For a URL with query fields, a GET request generally appends the request data to the URL itself: a ? separates the URL from the data, and multiple parameters are joined with &.

data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data as a dict (or JSON)
	import requests
	response = requests.get(url=url, params=data)
Urllib2: data as a URL-encoded string
	import urllib, urllib2    
	data = urllib.urlencode(data)
	full_url = url+'?'+data
	response = urllib2.urlopen(full_url)
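
In Python 3, urllib.urlencode moved to urllib.parse.urlencode. The manual query-string construction used in the urllib2 variant above can be sketched offline like this (the example.com URL is just a placeholder):

```python
import urllib.parse

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}

# urlencode joins key=value pairs with & and percent-encodes special characters
query = urllib.parse.urlencode(data)
full_url = 'http://example.com/search' + '?' + query
print(full_url)  # http://example.com/search?data1=XXXXX&data2=XXXXX
```

With requests, passing the dict via params=data builds exactly this query string for you.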


The re module's .group():

import re

a = "123abc456"
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(0)   #123abc456, the whole match
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(1)   #123
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(2)   #abc
print re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(3)   #456
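The same demonstration in Python 3 syntax, searching once and reusing the match object instead of rerunning the search for every group:

```python
import re

a = "123abc456"
m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", a)
print(m.group(0))  # 123abc456, the whole match
print(m.group(1))  # 123
print(m.group(2))  # abc
print(m.group(3))  # 456
print(m.groups())  # ('123', 'abc', '456'), all capture groups at once
```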
