Python project: scraping proxy IP addresses

Scrape proxy IP addresses from a website, parse them, and save them to a text file.

Practice source code

# -*- coding: utf-8 -*-

####################################################
# coding by 劉雲飛
####################################################

import requests
import re

URL_S="http://www.xicidaili.com/"
headers = {
    'Host':'www.xicidaili.com',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Cookie':'_free_proxy_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJTYxMDdmMjBlZGVjMTMyN2QxZjVmMTM1OGI1ZWRiNTVmBjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMVQzaWNQazE2ZHovZ0NReWFKeFpMakp3dURJOVpyMkZXNUp6WUVqNjJJZ2c9BjsARg%3D%3D--fcb2c5aed90070f18b85d2262278f9e5811f6b56; CNZZDATA1256960793=1456382766-1453291871-http%253A%252F%252Fwww.baidu.com%252F%7C1453291871',
    'Connection':'keep-alive',
    # If-None-Match is left out: if the ETag still matched, the server would answer 304 with an empty body.
    'Cache-Control': 'max-age=0'
}

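sess = requests.session()
resp = sess.get(URL_S, headers=headers, timeout=10)
resp.raise_for_status()  # stop on an HTTP error instead of parsing an error page
text = resp.text

# Match an IPv4 address cell followed by a port cell: <td>1.2.3.4</td> <td>8080</td>
comp = re.compile(r'(?isu)<td>(\d+)\.(\d+)\.(\d+)\.(\d+)</td>\s*<td>(\d+)</td>')
all_ip = comp.findall(text)
str_all = ""

for ip in all_ip:
    # ip is a tuple of four octets plus the port; write it as ip:port
    str_all += '.'.join(ip[:4]) + ':' + ip[4] + "\n"
    print(ip)

with open('ip.txt', 'w') as f:
    f.write(str_all)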
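
The parsing step can be tried without any network access. The following is a minimal sketch, not part of the original script: `parse_proxies` and `SAMPLE_HTML` are hypothetical names, and the sample rows are made up for illustration. It reuses the same regular expression, writes each entry as `ip:port`, and, as an added assumption, skips matches whose octets or port are out of range.

```python
import re

# Same pattern as the script: four octets in one cell, then a port cell.
PROXY_RE = re.compile(r'(?is)<td>(\d+)\.(\d+)\.(\d+)\.(\d+)</td>\s*<td>(\d+)</td>')


def parse_proxies(html):
    """Return the 'ip:port' strings found in html (hypothetical helper)."""
    proxies = []
    for *octets, port in PROXY_RE.findall(html):
        # Keep only valid IPv4 octets (0-255) and TCP ports (1-65535).
        if all(0 <= int(o) <= 255 for o in octets) and 0 < int(port) <= 65535:
            proxies.append('.'.join(octets) + ':' + port)
    return proxies


# Made-up table rows, only for illustration.
SAMPLE_HTML = """
<tr><td>10.0.0.1</td>
    <td>8080</td></tr>
<tr><td>999.1.1.1</td><td>80</td></tr>
<tr><td>192.168.1.20</td><td>3128</td></tr>
"""

if __name__ == "__main__":
    for proxy in parse_proxies(SAMPLE_HTML):
        print(proxy)
```

With the sample above, the row containing `999.1.1.1` is dropped and the other two entries are printed. Keeping the parsing in a function of its own makes it possible to check the pattern against saved HTML before pointing the script at the live site.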