Scraping free proxy IPs with Scrapy and storing them in MongoDB

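Acknowledgement: Liu Shuo (劉碩)

Part of the code comes from Liu Shuo's book 《精通scrapy網絡爬蟲》 (Mastering Scrapy Web Crawling), which I credit here.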

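When crawling larger sites, the most frustrating obstacle is usually their anti-scraping measures: crawl a little too fast and you get banned, crawl too slowly and you are stuck waiting. That is why many people route their requests through proxies. A paid proxy service is relatively expensive, though, so for those of us on a budget, scraping free proxy IPs is a better fit. If you have the funds, by all means buy a proxy service instead. Creating the Scrapy project and launching the spider are skipped here, since I assume everyone knows how.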

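What this spider can do (it is also beginner-friendly):

1. Control how many proxy IPs are scraped (by choosing how many pages of the proxy site to crawl)

2. Discard proxy IPs that do not work

3. Store the results however you like (you write that part yourself; here I store them in MongoDB)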

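Without further ado:

1. First, pick a site that lists free proxy IPs. Here we use http://www.xicidaili.com, shown in the figure below.

Open the browser's developer tools with F12 and find the element that holds each row of proxy data. Take a moment to explore the page yourself: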

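All of the rows sit under the XPath "//table[@id="ip_list"]/tr[position()>1]" (the predicate skips the first row, presumably the table header). Each column is then picked out with a CSS selector:

IP address: td:nth-child(2)::text

Port: td:nth-child(3)::text

Type (protocol, HTTP or HTTPS): td:nth-child(6)::text

The extraction code looks like this: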

        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            # Extract the proxy IP, the port, and the scheme (http or https)
            ip = sel.css('td:nth-child(2)::text').extract_first()
            port = sel.css('td:nth-child(3)::text').extract_first()
            scheme = sel.css('td:nth-child(6)::text').extract_first().lower()

Explanation: select every row matching "//table[@id="ip_list"]/tr[position()>1]" and iterate over them, pulling out the fields needed for the next step.
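One detail worth guarding against: extract_first() returns None when a selector matches nothing, so calling .lower() on the result raises AttributeError for a malformed row. The sketch below shows one way to validate the three values before building the proxy URL. The helper name build_proxy is hypothetical, and restricting the scheme to http/https is my assumption, not something the original code does.

```python
def build_proxy(ip, port, scheme):
    """Return 'scheme://ip:port', or None if a cell is missing or unusable.

    Hypothetical helper: ip, port and scheme are the raw strings (or None)
    returned by extract_first() for one table row.
    """
    if not (ip and port and scheme):
        return None  # a cell was missing in this row
    scheme = scheme.strip().lower()
    if scheme not in ("http", "https"):
        return None  # assumption: only HTTP(S) proxies are wanted
    return "%s://%s:%s" % (scheme, ip.strip(), port.strip())


print(build_proxy("1.2.3.4", "8080", "HTTP"))
print(build_proxy("1.2.3.4", None, "HTTPS"))
```

Inside parse, a row for which the helper returns None could simply be skipped with continue.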

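Before extracting anything, of course, we need to build the page URLs. The code is below; changing the numbers passed to range changes how many pages are crawled (which indirectly sets how many IPs you get):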

    # Override how the initial requests are built
    def start_requests(self):
        for i in range(1, 2):
            yield scrapy.Request('http://www.xicidaili.com/nn/%s' % i)
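Note that range(1, 2) produces only the value 1, because the end of a range is exclusive, so the code above crawls a single page. As a minimal sketch of how a page count maps to URLs, the function below builds the list for the first N pages. The names page_urls and pages are hypothetical; the URL pattern is the one used in the spider.

```python
BASE_URL = "http://www.xicidaili.com/nn/%s"


def page_urls(pages):
    """Return the listing URLs for pages 1..pages (inclusive)."""
    # range's upper bound is exclusive, hence pages + 1
    return [BASE_URL % i for i in range(1, pages + 1)]


for url in page_urls(3):
    print(url)
```

In start_requests, each of these URLs would be wrapped in a scrapy.Request exactly as in the original loop.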

2. Verify that each scraped IP actually works, and check whether it is a high-anonymity proxy (one that hides your real IP):

    def parse(self, response):
        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            # Extract the proxy IP, the port, and the scheme (http or https)
            ip = sel.css('td:nth-child(2)::text').extract_first()
            port = sel.css('td:nth-child(3)::text').extract_first()
            scheme = sel.css('td:nth-child(6)::text').extract_first().lower()

            # Send a request through the scraped proxy to check that it works
            url = '%s://httpbin.org/ip' % scheme
            proxy = '%s://%s:%s' % (scheme, ip, port)

            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,

                # The next two fields are passed along to check_available for the check
                '_proxy_scheme': scheme,
                '_proxy_ip': ip,
            }

            # One request per proxy; dont_filter=True because the same httpbin URL
            # is requested repeatedly and must not be de-duplicated
            yield scrapy.Request(url, callback=self.check_available, meta=meta, dont_filter=True)

    def check_available(self, response):
        proxy_ip = response.meta['_proxy_ip']
        if proxy_ip == json.loads(response.text)["origin"]:
            ip = IpProxyItem()
            ip["scheme"] = response.meta["_proxy_scheme"]
            ip["proxy"] = response.meta["proxy"]
            yield ip

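Explanation: check_available decides whether the proxy is high-anonymity by comparing the "origin" field in the httpbin.org response with the proxy's IP. If execution reaches this callback at all, the request through the proxy succeeded, so the proxy works. We then put it into an Item, ready to be written to MongoDB. The complete spider code (proxy.py):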

# -*- coding: utf-8 -*-
import scrapy
import json
from ..items import IpProxyItem


class ProxySpider(scrapy.Spider):
    # Collects free proxy IPs
    name = 'proxy'
    allowed_domains = ['www.xicidaili.com']
    start_urls = ['http://www.xicidaili.com/']

    # Override how the initial requests are built
    def start_requests(self):
        for i in range(1, 2):
            yield scrapy.Request('http://www.xicidaili.com/nn/%s' % i)

    # Parse the listing page
    def parse(self, response):
        for sel in response.xpath('//table[@id="ip_list"]/tr[position()>1]'):
            # Extract the proxy IP, the port, and the scheme (http or https)
            ip = sel.css('td:nth-child(2)::text').extract_first()
            port = sel.css('td:nth-child(3)::text').extract_first()
            scheme = sel.css('td:nth-child(6)::text').extract_first().lower()

            # Send a request through the scraped proxy to check that it works
            url = '%s://httpbin.org/ip' % scheme
            proxy = '%s://%s:%s' % (scheme, ip, port)

            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,

                # The next two fields are passed along to check_available for the check
                '_proxy_scheme': scheme,
                '_proxy_ip': ip,
            }

            # One request per proxy; dont_filter=True because the same httpbin URL
            # is requested repeatedly and must not be de-duplicated
            yield scrapy.Request(url, callback=self.check_available, meta=meta, dont_filter=True)

    def check_available(self, response):
        proxy_ip = response.meta['_proxy_ip']
        if proxy_ip == json.loads(response.text)["origin"]:
            ip = IpProxyItem()
            ip["scheme"] = response.meta["_proxy_scheme"]
            ip["proxy"] = response.meta["proxy"]
            yield ip
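The comparison inside check_available can be pulled out into a small pure function, which makes it easy to try on sample response bodies. The sketch below is a slightly more defensive variant, not the original logic: it tolerates a body that is not valid JSON, and it assumes the "origin" value may contain several comma-separated addresses (an assumption about the response format; the original code expects a single address). The name is_high_anonymity is hypothetical.

```python
import json


def is_high_anonymity(proxy_ip, body_text):
    """Return True if the only address reported in 'origin' is the proxy's own IP."""
    try:
        origin = json.loads(body_text)["origin"]
    except (ValueError, KeyError, TypeError):
        return False  # not JSON, or no "origin" field
    # Assumption: "origin" may list several addresses separated by commas
    seen = [part.strip() for part in str(origin).split(",")]
    return all(part == proxy_ip for part in seen)


print(is_high_anonymity("1.2.3.4", '{"origin": "1.2.3.4"}'))
print(is_high_anonymity("1.2.3.4", '{"origin": "9.9.9.9, 1.2.3.4"}'))
```

If any address other than the proxy's shows up, the target server can see where the request really came from, so the proxy is rejected.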

3. Store the results in MongoDB. The comments are fairly detailed, so I will not explain further. The code (pipelines.py):

# -*- coding: utf-8 -*-
from scrapy import Item
import pymongo
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class IpProxyPipeline(object):
    DB_URL = 'mongodb://localhost:27017/'
    DB_NAME = 'proxy'
    
    # Called when the spider starts: connect to MongoDB
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URL)
        self.db = self.client[self.DB_NAME]
    
    # Called when the spider finishes: close the connection
    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
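        # Use the spider's name as the collection name
        con = self.db[spider.name]
        # MongoDB needs a plain dict (an Item cannot be inserted directly),
        # so convert the item if it is an Item instance
        post = dict(item) if isinstance(item, Item) else item
        # Insert the document into MongoDB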
        con.insert_one(post)
        return item
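Because insert_one adds a new document on every call, running the spider twice stores the same proxy twice. One possible alternative, sketched below, is an upsert keyed on the proxy URL using pymongo's Collection.update_one(filter, update, upsert=True). The function save_proxy is hypothetical, and FakeCollection is only an in-memory stand-in so the sketch runs without a MongoDB server; in the pipeline you would pass self.db[spider.name] instead.

```python
class FakeCollection:
    """In-memory stand-in that mimics only the update_one call used below."""

    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        key = filter["proxy"]
        if key in self.docs or upsert:
            # Create the document if needed, then apply the $set fields
            self.docs.setdefault(key, {}).update(update["$set"])


def save_proxy(collection, item):
    """Insert the proxy, or overwrite the stored document with the same URL."""
    doc = dict(item)  # works for a Scrapy Item as well as a plain dict
    collection.update_one({"proxy": doc["proxy"]}, {"$set": doc}, upsert=True)
    return doc


col = FakeCollection()
save_proxy(col, {"scheme": "http", "proxy": "http://1.2.3.4:8080"})
save_proxy(col, {"scheme": "http", "proxy": "http://1.2.3.4:8080"})
print(len(col.docs))
```

The trade-off is one extra lookup per item in exchange for a collection without duplicates.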

4. Add the following fields in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class IpProxyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    scheme = scrapy.Field()
    proxy = scrapy.Field()

5. In settings.py, enable the pipeline (the setting is commented out by default, so uncomment it), and also set the User-Agent header and the robots.txt rule:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'ip_proxy.pipelines.IpProxyPipeline': 300,
}


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

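With that in place, run the spider and the data is written to the database automatically.

If anything is unclear, feel free to leave a comment.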
