Setting Random Request Headers + Building an IP Proxy Pool
Setting random request headers
Overview
First, create a Scrapy spider; this was covered in detail in Python Crawler - Scrapy Study Notes (1).
With the spider created, visit http://httpbin.org/headers in a browser first; the page echoes back the request headers it received.
Next we fetch the same page from the spider (but first, in settings.py, disable the robots.txt check and adjust the default request headers).
In the spider we just created, request http://httpbin.org/headers with the following code:
import scrapy
import json


class HeadersettingSpider(scrapy.Spider):
    name = 'headersetting'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        print("=" * 60)
        # httpbin echoes the request headers back as JSON
        result = json.loads(response.text)['headers']['User-Agent']
        print(result)
The result is shown below (identical to what the browser request produced):
Setting the request headers
Here is a recommended website where you can look up the request headers of all major browsers.
Next, open the project's middlewares.py
and add a new class:
import random  # needed for random.choice below


class UserAgentDownloadMiddlewares(object):
    # A small pool of real User-Agent strings covering Chrome, Firefox,
    # Edge, IE and other browsers.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Flock/3.5.3.4628',
        'Mozilla/5.0 (X11; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14931',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; Browzar)',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
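The middleware's behaviour can be checked outside Scrapy with a stand-in request object (a minimal sketch; DummyRequest is an invented stand-in for scrapy.Request, not part of Scrapy):

```python
import random


class UserAgentDownloadMiddlewares(object):
    # Shortened pool, just for this check
    USER_AGENTS = [
        'Mozilla/5.0 (X11; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)


class DummyRequest:
    """Invented stand-in for scrapy.Request, just for this check."""
    def __init__(self):
        self.headers = {}


mw = UserAgentDownloadMiddlewares()
req = DummyRequest()
mw.process_request(req, spider=None)
print(req.headers['User-Agent'])  # one of the strings from USER_AGENTS
```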
Then open settings.py and enable the new middleware in DOWNLOADER_MIDDLEWARES.
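A sketch of the settings.py change; "myproject" is a placeholder for the actual project name, and the number is the middleware's priority:

```python
# settings.py -- "myproject" is a placeholder for the real project name
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.UserAgentDownloadMiddlewares': 543,
}
```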
In the spider, append yield scrapy.Request(self.start_urls[0], dont_filter=True) at the end of parse(); dont_filter=True bypasses Scrapy's duplicate-request filter so the same URL can be fetched repeatedly.
Sending the request several times:
import scrapy
import json


class HeadersettingSpider(scrapy.Spider):
    name = 'headersetting'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        print("=" * 60)
        result = json.loads(response.text)['headers']['User-Agent']
        print(result)
        # Request the same URL again; dont_filter=True bypasses the
        # duplicate-request filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Then run it (from the command line, or by creating a start.py file):
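A conventional start.py runner is sketched below (an assumption about the file name; it must live in the Scrapy project root, and the argument must match the spider's name):

```python
# start.py -- run the spider from an IDE instead of typing the
# "scrapy crawl" command in a shell. Place it in the project root.
from scrapy import cmdline

cmdline.execute("scrapy crawl headersetting".split())
```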
The output shows a different User-Agent on every request.
Building an IP Proxy Pool
Here is a recommended website for checking your current public IP address.
First, create another spider in the same project from the command line,
then create a new class in middlewares.py:
import random  # needed for random.choice below


class IpProxyDownloadMiddlewares(object):
    # Free proxies collected by hand; swap in your own (paid) proxies
    # for anything serious.
    IP_Pool = [
        '182.35.85.61:9999',
        '123.163.27.103:9999',
        '58.22.177.109:9999',
        '175.155.142.249:1133',
        '112.84.99.134:9999',
        '182.149.83.56:9999',
    ]

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        proxy = random.choice(self.IP_Pool)
        request.meta['proxy'] = 'http://' + proxy
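As with the User-Agent middleware, this one can be exercised outside Scrapy with a stand-in request object (a sketch; DummyRequest is an invented stand-in, not a Scrapy class):

```python
import random


class IpProxyDownloadMiddlewares(object):
    # Shortened pool, just for this check
    IP_Pool = ['182.35.85.61:9999', '123.163.27.103:9999']

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://' + random.choice(self.IP_Pool)


class DummyRequest:
    """Invented stand-in for scrapy.Request, just for this check."""
    def __init__(self):
        self.meta = {}


mw = IpProxyDownloadMiddlewares()
req = DummyRequest()
mw.process_request(req, spider=None)
print(req.meta['proxy'])  # e.g. http://182.35.85.61:9999
```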
Then open settings.py and enable the middleware in DOWNLOADER_MIDDLEWARES.
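A sketch of the settings.py change for this middleware; "myproject" is again a placeholder for the actual project name:

```python
# settings.py -- "myproject" is a placeholder for the real project name
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IpProxyDownloadMiddlewares': 543,
}
```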
The spider code:
# -*- coding: utf-8 -*-
import scrapy
import json


class IpporxySpider(scrapy.Spider):
    name = 'ipporxy'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # httpbin echoes back the origin IP of the request
        result = json.loads(response.text)['origin']
        print('=' * 60)
        print(result)
        # Request the same URL again; dont_filter=True bypasses the
        # duplicate-request filter
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Run it (these are free proxies rather than purchased ones, so the results are less dependable; some requests may fail or time out):