Python Web Scraping | Obtaining and Using Proxy IPs

Original post

Xylon_

2019-08-25 13:29

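When a crawler scrapes a site at scale, it will sometimes run into anti-scraping measures. For example, when a site detects frequent requests from a single IP address, it assumes the client is a crawler and blocks that address. The usual countermeasures are adding a delay between downloads, changing the User-Agent, or using proxy IPs.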
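The first two countermeasures can be sketched in a few lines. This is only an illustration: the `USER_AGENTS` pool and the helper names `random_headers` and `polite_delay` are assumptions made for this example, not part of the scripts in this post.

```python
import random
import time

# Hypothetical pool of User-Agent strings; replace with values you actually want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep for a random interval between min_s and max_s seconds and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    print(random_headers())
    print("slept for", polite_delay(0.0, 0.01), "seconds")
```

Calling `polite_delay()` between requests and passing `random_headers()` as the `headers` argument covers the two simpler options; the rest of the post deals with the third, proxy IPs.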

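Environment:

IDE: PyCharm

Third-party libraries: requests (the code below also imports BeautifulSoup from bs4 and uses the lxml parser)

Browser: Chrome

Proxy IP source: https://www.xicidaili.com/nn/

Free proxy IPs are unreliable, so besides collecting them we also need to filter out the ones that actually work.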

1. Preparation

Import the modules and prepare the request headers

from bs4 import BeautifulSoup
import requests
import random
import concurrent.futures

# Browser-like request headers shared by every request in this script
headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    # Only advertise encodings that requests can decode without extra packages
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'close',
}

# Echo service that reports the IP address it sees
ip_url = 'http://httpbin.org/ip'

2. Fetching the proxy IPs

Getting data from a static page is simple: inspect the page and find the tags that hold the IP address and port.
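To make the "find the tags" step concrete, here is a small offline sketch. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without any installation, and `SAMPLE_HTML` is a hand-written stand-in: the column order (flag, IP, port) is an assumption about the proxy listing page, not a copy of it.

```python
from html.parser import HTMLParser

# Hand-written sample table; the real page layout is assumed, not copied.
SAMPLE_HTML = """
<table>
  <tr><th>Flag</th><th>IP</th><th>Port</th></tr>
  <tr><td></td><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td></td><td>10.0.0.2</td><td>3128</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data.strip()


def extract_proxies(html):
    """Return 'ip:port' strings built from the second and third cell of each row."""
    parser = RowCollector()
    parser.feed(html)
    # Header rows contain only <th> cells, so the length check skips them.
    return [f"{row[1]}:{row[2]}" for row in parser.rows if len(row) > 2]


if __name__ == "__main__":
    print(extract_proxies(SAMPLE_HTML))
```

The idea is the same whichever parser is used: walk the table rows, skip the header, and join the IP cell and the port cell with a colon.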

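After collecting every IP address on the page (100 of them), start multiple threads to send a test request through each proxy and find the usable ones.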

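def get_ip_list(url):
    page = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(page.text, 'lxml')
    ip_list = []
    # Skip the first row, which is the table header
    for row in soup.find_all('tr')[1:]:
        td = row.find_all('td')
        if len(td) > 2:  # ignore rows without IP and port cells
            ip_list.append(td[1].text.strip() + ':' + td[2].text.strip())
    ip_list = list(set(ip_list))  # remove duplicates
    print(ip_list)
    if not ip_list:
        return  # ThreadPoolExecutor rejects a worker count of 0
    # One thread per proxy, since each test mostly waits on the network
    with concurrent.futures.ThreadPoolExecutor(len(ip_list)) as x:
        for ip in ip_list:
            x.submit(ip_test, ip)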
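The function above submits the tests and lets each thread report on its own. A variation, sketched here under assumptions, collects the results in the main thread through the returned futures. `fake_check` is a made-up stand-in so the example runs offline; in real use it would be a function that sends a request through the proxy and returns True or False.

```python
import concurrent.futures


def fake_check(ip):
    """Stand-in for a real proxy test: pretends addresses ending in '0' work."""
    return ip.endswith("0")


def filter_working(ip_list, check, max_workers=20):
    """Run check(ip) in a thread pool and return the addresses that passed."""
    if not ip_list:
        return []
    working = []
    workers = min(max_workers, len(ip_list))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the address it is testing
        futures = {pool.submit(check, ip): ip for ip in ip_list}
        for future in concurrent.futures.as_completed(futures):
            try:
                if future.result():
                    working.append(futures[future])
            except Exception as exc:
                print(futures[future], "failed:", exc)
    return sorted(working)


if __name__ == "__main__":
    candidates = ["10.0.0.1:8080", "10.0.0.2:3128", "10.0.0.3:80"]
    print(filter_working(candidates, fake_check))
```

Capping the pool size and gathering results in one place also means only the main thread has to write the output file.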

3. Filtering usable IPs

The URL used to test each IP is http://httpbin.org/ip

This site directly returns the IP address the request came from:

{
"origin": "163.xxx.xxx.210, 163.xxx.xxx.210"
}

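Combine each IP and port into a proxy setting, the proxies dict.

Then request the site through the proxy and check the response; proxies that connect successfully are written to a file for later use.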

def ip_test(ip):
    # Assumes a plain HTTP proxy, which is reached via http:// even for HTTPS targets
    proxies = {
        'http': 'http://' + ip,
        'https': 'http://' + ip,
    }
    print(proxies)
    try:
        # timeout: maximum number of seconds to wait for a response
        response = requests.get(ip_url, headers=headers, proxies=proxies, timeout=3)
        if response.status_code == 200:
            # Append the working proxy to the file 可用IP.txt
            with open('可用IP.txt', 'a') as f:
                f.write(ip + '\n')
            print('Test passed')
            print(proxies)
            print(response.text)
    except Exception as e:
        print(e)
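A 200 status only shows that the proxy answered. To check whether it also hides the original address, the `origin` field returned by httpbin can be compared with the proxy's IP. The helper below is a sketch under that assumption: `proxy_is_masking` is a name invented for this example, and a proxy that exits through a different address than the one it listens on would fail this check even though it works.

```python
import json


def proxy_is_masking(response_text, proxy_address):
    """Return True if httpbin's reported origin contains the proxy's IP."""
    proxy_ip = proxy_address.split(":")[0]
    try:
        origin = json.loads(response_text)["origin"]
        # httpbin may list several comma-separated addresses
        seen = [part.strip() for part in origin.split(",")]
    except (ValueError, KeyError, TypeError, AttributeError):
        return False  # not the JSON shape we expected
    return proxy_ip in seen


if __name__ == "__main__":
    sample = '{"origin": "10.0.0.1, 10.0.0.1"}'
    print(proxy_is_masking(sample, "10.0.0.1:8080"))
    print(proxy_is_masking(sample, "10.0.0.2:8080"))
```

Inside a test function it could be called as `proxy_is_masking(response.text, ip)` before the address is written to the file.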

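As the test run shows, only a few of the free proxy IPs are actually usable; if you depend on proxies, paid ones are the better choice.

The usable IPs that pass the test end up in 可用IP.txt.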

4. Using the proxy IPs

With the proxy IPs in hand, let's try them out and see whether they really mask our own address, again using http://httpbin.org/ip as the example.

import requests

headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    # Only advertise encodings that requests can decode without extra packages
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    # Ask for the connection to be closed after each response
    'Connection': 'close',
}

ip_url = 'http://httpbin.org/ip'

if __name__ == '__main__':
    # Load the proxies saved by the filtering step, one ip:port per line
    with open('可用IP.txt', 'r') as f:
        ip_list = [line.strip() for line in f if line.strip()]
    print(ip_list)
    for ip in ip_list:
        # Assumes plain HTTP proxies, reached via http:// for both schemes
        proxies = {
            'http': 'http://' + ip,
            'https': 'http://' + ip,
        }
        try:
            page = requests.get(ip_url, headers=headers, proxies=proxies, timeout=3)
            if page.status_code == 200:
                print('Usable ' + str(proxies))
                print(page.text)
        except Exception as e:
            print(e)
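The loop above tries every saved proxy in order. For real scraping it is more common to pick one at random and fall back to another when it fails. The sketch below shows that pattern; `build_proxies`, `fetch_with_rotation`, and `fake_fetch` are names invented for this example, and the network call is passed in as a function so the code runs offline.

```python
import random


def build_proxies(address):
    """Build a proxies dict for a plain HTTP proxy given as 'ip:port'."""
    return {"http": "http://" + address, "https": "http://" + address}


def fetch_with_rotation(url, addresses, fetch, max_attempts=3):
    """Try up to max_attempts randomly ordered proxies; return (address, result)."""
    pool = list(addresses)
    random.shuffle(pool)
    for address in pool[:max_attempts]:
        try:
            return address, fetch(url, build_proxies(address))
        except Exception as exc:
            print(address, "failed:", exc)
    return None, None  # every attempted proxy failed


def fake_fetch(url, proxies):
    """Stand-in for a real request: the proxy on port 3128 always fails."""
    if proxies["http"].endswith(":3128"):
        raise ConnectionError("proxy refused connection")
    return "ok via " + proxies["http"]


if __name__ == "__main__":
    saved = ["10.0.0.1:8080", "10.0.0.2:3128"]
    print(fetch_with_rotation("http://httpbin.org/ip", saved, fake_fetch))
```

With requests, the `fetch` argument could be something like `lambda url, proxies: requests.get(url, proxies=proxies, timeout=3).text`.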

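Because proxy IPs are unstable, even ones that have just passed the test will fail to varying degrees.

Part of the reason is the uneven quality of the proxies; the other factor is too many HTTP connections being left open
(for example, the error HTTPConnectionPool(host='180.154.173.175', port=8118): Max retries exceeded with url: http://httpbin.org/ip )

The cause is hard to pin down, and none of the many fixes I found online helped, so I split fetching the proxy IPs and using them into two separate parts to keep this error to a minimum.

In the next post, I will try using proxy IPs in practice: inflating page views.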
