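I've been learning web scraping recently. After watching some videos and picking up a few things, I wrote a crawler for Baidu Tieba and am posting it here.
For now it only fetches the pages and saves them to local files.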
# -*- coding: utf-8 -*-
# The encoding declaration above lets Python 2 accept Chinese text in this file
import urllib2

def load_page(url):
    # Send a browser-like User-Agent and return the raw page body
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    page = response.read()
    return page
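
def tieba_spider(url, beginPage, endPage):
    '''
    A small Tieba crawler: fetch pages beginPage..endPage and save each one as <page>.html
    '''
    for i in range(beginPage, endPage + 1):
        # The pn offset appended to the URL grows by 50 per page
        myurl = url + str(50 * (i - 1))
        print "url :" + myurl
        html = load_page(myurl)
        file_name = str(i) + ".html"
        writeFile(file_name, html)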

def writeFile(file_name, txt):
    # Write in binary mode so the downloaded bytes are saved unchanged
    with open(file_name, 'wb') as f:
        f.write(txt)

if __name__ == "__main__":
    url = raw_input("please input the url :")
    beginPage = int(raw_input("begin : "))
    endPage = int(raw_input("end : "))
    tieba_spider(url, beginPage, endPage)
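
The script above is written for Python 2 (`urllib2`, `raw_input`, the `print` statement). As a rough sketch only, assuming Python 3, the page-loading step could look something like the following. `build_request` and `USER_AGENT` are hypothetical names introduced here for illustration and are not part of the original script.

```python
import urllib.request

# Same browser-like User-Agent string the original script sends
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 "
              "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11")


def build_request(url):
    """Hypothetical helper: build (but do not send) a request with the User-Agent set."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def load_page(url):
    """Fetch the page and return its body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Splitting the request construction from the network call is just a convenience here: it lets the headers be checked without actually hitting the site.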
In a few days, when I have time, I'll learn some regular expressions and add them in.
Example URL to enter at the prompt (the script appends the pn offset to it): http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=
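
To make the page-to-URL mapping concrete: the crawler appends `50*(i-1)` to the base URL, so page 1 gets `pn=0`, page 2 gets `pn=50`, and so on. Here is a small sketch of just that calculation, written for Python 3. `build_page_urls` is a hypothetical helper name, and the `step` default of 50 is taken from the script rather than from any Tieba documentation.

```python
def build_page_urls(base_url, begin_page, end_page, step=50):
    """Return (page_number, url) pairs for begin_page..end_page inclusive."""
    return [
        (page, base_url + str(step * (page - 1)))
        for page in range(begin_page, end_page + 1)
    ]


if __name__ == "__main__":
    base = "http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn="
    for page, page_url in build_page_urls(base, 1, 3):
        print(page, page_url)
```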