Crawler【9】Building your own IP pool
Series recap:
- Crawler【1】Open a website and fetch its content
- Crawler【2】Customizing the UserAgent
- Crawler【3】URL encoding
- Crawler【4】Crawl Baidu Tieba and generate static pages
- Crawler【5】Crawl the Maoyan top-100 movie list and save it to csv
- Crawler【6】Crawl Lianjia second-hand housing listings and images and save them locally
- Crawler【7】Crawl Lianjia second-hand housing listings and images and save them locally
- Crawler【8】request.get() parameters in detail
- Crawler【9】Building your own IP pool
Why build your own IP pool
Crawling almost inevitably involves proxy IPs, but buying them outright is rather expensive for an individual user. By building our own IP pool, we get to use proxy IPs for free!
Crawling Xicidaili
Pick the domestic high-anonymity proxies.
Analyzing the URL
The URL pattern for these pages is very simple:
https://www.xicidaili.com/nn/{}
That is all it takes. There are more than 4000 pages in total, which feels like a full day of crawling.
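Filling in the page number is a one-liner with `str.format`; for example:

```python
# Build the per-page URLs from the template; page numbers start at 1.
base = 'https://www.xicidaili.com/nn/{}'
urls = [base.format(i) for i in range(1, 4)]
print(urls)
```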
XPath
Right-click to view the page source; the whole page turns out to be static. A little analysis yields the two expressions:
IP address: //tr[@class="odd"]/td[position()=2]/text()
Port: //tr[@class="odd"]/td[position()=3]/text()
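As a quick sanity check, the two expressions can be exercised against a hand-made fragment that mimics the table layout (the fragment is my own stand-in, not the real page):

```python
from lxml import etree

# Minimal stand-in for the Xicidaili table: rows carry class="odd",
# column 2 holds the IP address, column 3 the port.
html = etree.HTML(
    '<table><tr class="odd">'
    '<td></td><td>222.95.240.13</td><td>3000</td>'
    '</tr></table>'
)
ips = html.xpath('//tr[@class="odd"]/td[position()=2]/text()')
ports = html.xpath('//tr[@class="odd"]/td[position()=3]/text()')
print(ips, ports)
```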
Class design
We design a class whose public interface consists of update and get. Once packaged and placed on the Python path, it can be imported and called directly.
Note: the IPs are written to a csv file, which must not start out empty; initialize it like this:
ip
222.95.240.13:3000
171.221.79.223:8118
117.88.176.234:3000
117.88.5.41:3000
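The same seed file can be produced with pandas instead of by hand; a minimal sketch (the sample addresses are just the ones listed above):

```python
import pandas as pd

# Seed ip.csv with a header plus a couple of known proxies, so that
# later reads of the 'ip' column never hit an empty file.
seed = pd.DataFrame({'ip': ['222.95.240.13:3000', '171.221.79.223:8118']})
seed.to_csv('ip.csv', index=False)
print(pd.read_csv('ip.csv')['ip'].tolist())
```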
Without further ado, the crawler is simple enough that it needs little explanation.
"""
獲取西刺代理中的ip,並寫入csv文件中
"""
from fake_useragent import UserAgent
import pandas as pd
import requests, time, random
from lxml import etree
class ObtainIP:
def __init__(self):
self.url = 'https://www.xicidaili.com/nn/{}'
def __get_useragent(self):
ua = UserAgent()
return ua.random
def __get_html(self, url):
return requests.get(url=url, headers={'User-Agent': self.__get_useragent()}).text
def __parse_html(self, html):
html = etree.HTML(html)
ip = html.xpath('//tr[@class="odd"]/td[position()=2]/text()')
host = html.xpath('//tr[@class="odd"]/td[position()=3]/text()')
return ip, host
def __get_ip(self, ip, host):
assert len(ip) == len(host)
IP = []
for i in range(len(ip)):
IP.append(ip[i] + ':' + host[i])
return IP
def __check_ip(self, IP, mode='new'):
url = 'https://www.baidu.com/?tn=78040160_26_pg&ch=8'
# 獲取已經存在的ip
exist_ip = self.get()
# 除去已經存在的ip
for item in IP:
if item in exist_ip:
IP.remove(item)
# 去除不能用的ip
for item in IP:
proxies = {
'http': 'http://' + item,
'https': 'https://' + item,
}
useragent = self.__get_useragent()
headers = {'User-Agent': useragent}
try:
requests.get(url=url, headers=headers, proxies=proxies, timeout=5)
except:
IP.remove(item)
return IP
def __write_into_csv(self, IP):
# 讀取csv文件
ips = pd.read_csv('ip.csv')
for item in IP:
item = pd.DataFrame([item], columns=['ip'])
up = ips.loc[:2]
down = ips.loc[3:]
ips = pd.concat([up, item, down], ignore_index=True)
ips.to_csv('ip.csv', index=False)
def update(self):
for i in range(1, 4047):
print('爬蟲到第 %i 頁' % i)
url = self.url.format(i)
html = self.__get_html(url)
ip, post = self.__parse_html(html)
ip = self.__get_ip(ip, post)
ip = self.__check_ip(ip)
self.__write_into_csv(ip)
time.sleep(random.randint(50, 60))
def sift(self):
"""
篩選以前保存的ip
"""
url = 'https://www.baidu.com/?tn=78040160_26_pg&ch=8'
ips = pd.read_csv('ip.csv')
axises = []
axis = -1
for item in ips['ip']:
axis += 1
proxies = {
'http': 'http://' + item,
'https': 'https://' + item,
}
useragent = self.__get_useragent()
headers = {'User-Agent': useragent}
try:
requests.get(url=url, headers=headers, proxies=proxies, timeout=5)
except:
axises.append(axis)
ips.drop(axis=axises,inplace=True)
ips.to_csv('ip.csv', index=False)
# 獲取csv文件中的ip地址
def get(self):
exist_ip = []
ips = pd.read_csv('ip.csv')
for item in ips['ip']:
exist_ip.append(item)
return exist_ip
if __name__ == '__main__':
aaa = ObtainIP()
aaa.update()
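Once the class is on the Python path, a consumer grabs the list from get(), picks a random entry, and shapes it into the mapping that requests expects. A sketch of that last step (the helper name `pick_proxies` is mine, not part of the class):

```python
import random

def pick_proxies(pool):
    """Pick one 'ip:port' entry from the pool and shape it into the
    proxies mapping that requests.get() accepts."""
    addr = random.choice(pool)
    return {'http': 'http://' + addr, 'https': 'https://' + addr}

pool = ['117.88.176.234:3000', '117.88.5.41:3000']
proxies = pick_proxies(pool)
print(proxies['http'])
```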