Preface
A while ago I read some blog posts introducing Scrapy and using it to scrape information from the web. Overall they carried too little information; for a mature framework, blog posts alone are not enough, so I read through the official documentation.
After finishing it, I wanted a hands-on exercise. Conveniently, while browsing around some time earlier I had found an aggregator site, built by someone in China, that indexes a certain foreign blogging platform. It lists a large number of blog addresses, and clicking a blog lists all the video URLs under it. In effect, that site is itself a crawler.
Downloading all the videos is unrealistic. Saving the blog addresses is enough; later, when needed, I can write another spider to parse all the images, text, and videos under a given blog.
Installing Scrapy
Scrapy can be installed with pip. This exercise uses Python 3.5.2 on 64-bit Windows 7, bundled with Anaconda. The official site recommends installing it like this:
conda install -c scrapinghub scrapy
But after installing this way, `startproject` raised an error. I uninstalled scrapy with pip and then reinstalled it with pip, and everything worked; the exact cause is unclear.
Starting the project
Open cmd in the directory where you want the project to live and run the following (XXX is the project name):
scrapy startproject XXX
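For reference, `startproject` generates a skeleton roughly like this (the exact contents can vary a little between Scrapy versions):

```
XXX/
    scrapy.cfg            # deploy configuration
    XXX/
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules go here
            __init__.py
```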
Writing the item
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class XXXItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Blogs_per_page = scrapy.Field()
Writing the pipeline
Installing and starting MongoDB
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Program Files\MongoDB\Server\3.2\bin>mongod
2016-10-11T12:36:54.234+0800 I CONTROL  [main] Hotfix KB2731284 or later update is not installed, will zero-out data files
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] MongoDB starting : pid=33256 port=27017 dbpath=C:\data\db\ 64-bit host=CJPC160816-051
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] db version v3.2.10
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] git version: 79d9b3ab5ce20f51c272b4411202710a082d0317
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1t-fips 3 May 2016
2016-10-11T12:36:54.236+0800 I CONTROL  [initandlisten] allocator: tcmalloc
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten] modules: none
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten] build environment:
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten]     distmod: 2008plus-ssl
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten]     distarch: x86_64
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten]     target_arch: x86_64
2016-10-11T12:36:54.237+0800 I CONTROL  [initandlisten] options: {}
2016-10-11T12:36:54.239+0800 I -        [initandlisten] Detected data files in C:\data\db\ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-10-11T12:36:54.241+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-10-11T12:36:55.115+0800 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-10-11T12:36:55.115+0800 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'C:/data/db/diagnostic.data'
2016-10-11T12:36:55.147+0800 I NETWORK  [initandlisten] waiting for connections on port 27017
This shows the MongoDB server is up and listening on local port 27017.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from .items import XXXItem
class XXXPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["XXX"]
        self.blogs = db["Blogs"]

    def process_item(self, item, spider):
        if isinstance(item, XXXItem):
            # insert_one replaces the deprecated Collection.insert in pymongo 3
            self.blogs.insert_one(dict(item))
        # return the item so any later pipelines still receive it
        return item
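As the scaffold comment above says, the pipeline only runs if it is registered in `ITEM_PIPELINES`. A minimal settings.py fragment for this project would look like this (300 is just a conventional priority value, my own choice):

```python
# settings.py -- register the pipeline so Scrapy actually calls it;
# the number is the execution order (lower runs first)
ITEM_PIPELINES = {
    'XXX.pipelines.XXXPipeline': 300,
}
```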
Writing the spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from XXX.items import XXXItem


class XXXSpider(scrapy.Spider):
    # the unique name of this spider
    name = 'XXX_spider'
    # the link crawling starts from
    start_urls = ['http://www.XXX.com/blogs.html?page=1&name=']

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.page_count = 100

    def start_requests(self):
        '''Keeps constructing requests, one per listing page.'''
        # page 1 must be requested here as well, because overriding
        # start_requests means start_urls is never used
        for page in range(1, self.page_count + 1):
            # build the request and point it at the callback;
            # meta passes extra arguments through to the callback
            yield scrapy.Request(url='http://www.XXX.com/blogs.html?page=%s&name=' % page,
                                 meta={'count': page},
                                 callback=self.parse)

    def parse(self, response):
        # meta carries the arguments attached when the request was built
        print('Starting page {0:d}'.format(response.meta['count']))
        # parse the page with XPath
        selector1 = Selector(response=response)
        blogs_per_page_list = selector1.xpath('//*[@id="amz-main"]/div/div/table/tbody/tr/td/a/span/text()').extract()
        # instantiate the item that stores the data
        XXX_item = XXXItem()
        XXX_item['Blogs_per_page'] = blogs_per_page_list
        # yield the item so the pipeline can process it further
        yield XXX_item
Page parsing uses XPath, which is faster than bs4 (BeautifulSoup). Tutorials are available on w3school.
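The extract pattern can be illustrated with only the standard library; `xml.etree.ElementTree` supports a limited XPath subset, so this is a stand-in for Scrapy's Selector on a made-up snippet, not the real page structure:

```python
import xml.etree.ElementTree as ET

# a tiny stand-in for the listing page's markup (the real page is
# parsed with Scrapy's Selector and a longer XPath expression)
html = """
<table id="blogs">
  <tr><td><a href="/b/1"><span>blog-one</span></a></td></tr>
  <tr><td><a href="/b/2"><span>blog-two</span></a></td></tr>
</table>
"""

root = ET.fromstring(html)
# mirrors the 'tr/td/a/span/text()' tail of the spider's XPath
names = [span.text for span in root.findall('.//tr/td/a/span')]
print(names)  # ['blog-one', 'blog-two']
```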
Setting up middleware
- Lower the crawl frequency
- Set the User-Agent to disguise the spider as a browser
- Set proxy IPs
import random
from XXX.settings import USER_AGENT_LIST
from XXX.settings import PROXY_LIST
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # attach a random User-Agent to every outgoing request
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers["User-Agent"] = ua


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # route the request through a random proxy from the list
        ip1 = 'http://' + random.choice(PROXY_LIST)
        request.meta['proxy'] = ip1
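Outside Scrapy, the User-Agent middleware's core logic can be exercised with a plain dict standing in for `request.headers` (the UA strings here are placeholders, not the real list from settings):

```python
import random

USER_AGENT_LIST = ['UA-1', 'UA-2', 'UA-3']  # placeholder strings

def set_random_ua(headers):
    """The body of RandomUserAgentMiddleware.process_request, on a bare dict."""
    ua = random.choice(USER_AGENT_LIST)
    if ua:
        headers['User-Agent'] = ua
    return headers

headers = set_random_ua({})
print(headers['User-Agent'] in USER_AGENT_LIST)  # True
```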
Writing the settings file
# -*- coding: utf-8 -*-
# Scrapy settings for XXX project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XXX'
SPIDER_MODULES = ['XXX.spiders']
NEWSPIDER_MODULE = 'XXX.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1.0
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'XXX.middlewares.RandomUserAgentMiddleware': 300,
    'XXX.middlewares.ProxyMiddleware': 310,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
LOG_LEVEL = 'INFO'
USER_AGENT_LIST = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
]
# read proxies from ip.txt, one per line; skip blank lines so
# random.choice can never return an empty string
with open('ip.txt', 'r') as f:
    PROXY_LIST = [line.strip() for line in f if line.strip()]
The User-Agent strings were found online; there are plenty, so just pick a few suitable ones. Free proxy IPs can also be found online, but honestly they don't work well, and sometimes requests fail to go through. To use proxy IPs seriously, you also need code that validates them and keeps fetching fresh ones.
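As a sketch of that validation code (httpbin.org as the test URL and the 5-second timeout are my own choices, not part of the original setup):

```python
import urllib.request

def load_proxies(text):
    """Parse the contents of ip.txt into clean host:port entries,
    skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def proxy_works(proxy, test_url='http://httpbin.org/ip', timeout=5):
    """Return True if a GET through the proxy succeeds within the timeout."""
    handler = urllib.request.ProxyHandler({'http': 'http://' + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except Exception:
        return False

print(load_proxies('1.2.3.4:8080\n\n5.6.7.8:3128\n'))
# ['1.2.3.4:8080', '5.6.7.8:3128']
```

A production setup would run this check periodically and drop dead proxies from PROXY_LIST.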
Finally, write a small launcher script
from scrapy import cmdline
cmdline.execute("scrapy crawl XXX_spider".split())
Crawl results
Summary
- Manually analyze the relevant pages and decide which data to save
- Create a new Scrapy project
- Write items.py, building the item that will hold the data
- Write pipelines.py so the relevant items get processed
- Write the spider: the start URLs, looping request construction and link following, response parsing, and so on
- Write the middlewares, to avoid getting banned or to add other features
- In settings.py set the crawl frequency, register the middlewares, set the log level, and so on
- Start the spider
- The actual crawl took less time than shown above: I went to eat partway through and the computer went into standby, so subtract roughly 50 minutes.
- The real data runs past 7,719 pages, but the server seemed to have occasional problems mid-crawl, so some requests failed and returned no data.
- Overall the crawl was deliberately kept slow to avoid a ban; with proxy IPs and a higher frequency, it would be much faster.
- The site is fairly simple: all requests are plain GETs, and no login simulation was needed.