Install Scrapy
https://scrapy-chs.readthedocs.org/zh_CN/latest/intro/install.html
Install with pip:
pip install Scrapy
Ubuntu packages
- Add the GPG key used to sign Scrapy packages to the APT keyring:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
- Create the file /etc/apt/sources.list.d/scrapy.list with the following command:
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
- Update the package list and install scrapy-0.25:
sudo apt-get update && sudo apt-get install scrapy-0.25
Install redis-py
sudo easy_install redis
Install pymongo
sudo pip install pymongo==2.7.2
Install Graphite
http://blog.csdn.net/u011008734/article/details/47166469
Note: carbon has to be started each time afterwards:
$ cd /opt/graphite/
$ sudo ./bin/carbon-cache.py start
(See statscol/graphite.py for how to configure it.)
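Once carbon-cache is running, it accepts metrics over its plaintext protocol on port 2003 (the same port used as GRAPHITE_PORT in the settings below). A minimal sketch for pushing one metric by hand, handy for checking that carbon is up; the metric name here is just an example, not one the project actually emits:

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Build one line of carbon's plaintext protocol: "<path> <value> <timestamp>\n"."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="127.0.0.1", port=2003):
    """Push a single metric to a running carbon-cache instance."""
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(format_metric(path, value).encode("ascii"))
    finally:
        sock.close()

# Usage (with carbon-cache running):
#   send_metric("scrapy.test.item_scraped_count", 1)
```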
Configuration
- In /opt/graphite/webapp/content/js/composer_widgets.js, change the value of 'interval' in the toggleAutoRefresh function from 60 to 1.
- Add a storage-aggregation.conf file under the /opt/graphite/conf directory:
[scrapy_min]
pattern = ^scrapy\..*_min$
xFilesFactor = 0.1
aggregationMethod = min
[scrapy_max]
pattern = ^scrapy\..*_max$
xFilesFactor = 0.1
aggregationMethod = max
[scrapy_sum]
pattern = ^scrapy\..*_count$
xFilesFactor = 0.1
aggregationMethod = sum
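The pattern entries above are regular expressions matched against metric names, so only metrics ending in _min, _max, or _count get the corresponding aggregation method. A quick sketch of how that matching behaves (the metric names are made up for illustration):

```python
import re

# The same patterns as in storage-aggregation.conf above.
RULES = [
    (re.compile(r"^scrapy\..*_min$"), "min"),
    (re.compile(r"^scrapy\..*_max$"), "max"),
    (re.compile(r"^scrapy\..*_count$"), "sum"),
]

def aggregation_for(metric):
    """Return the aggregation method selected by the first matching rule."""
    for pattern, method in RULES:
        if pattern.match(metric):
            return method
    return "average"  # graphite's default when no rule matches
```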
- In settings set:
# For the distributed setup:
STATS_CLASS = 'scrapygraphite.RedisGraphiteStatsCollector'
GRAPHITE_HOST = '127.0.0.1'
GRAPHITE_PORT = 2003
# For the single-machine setup:
STATS_CLASS = 'scrapygraphite.GraphiteStatsCollector'
GRAPHITE_HOST = '127.0.0.1'
GRAPHITE_PORT = 2003
Install MongoDB
- Import the public key used by the package management system.
The Ubuntu package management tools ensure package consistency and authenticity by verifying packages with GPG keys. Import the MongoDB public GPG key (http://docs.mongodb.org/10gen-gpg-key.asc):
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
- Create a list file for MongoDB.
Create the file /etc/apt/sources.list.d/mongodb.list with the following command:
echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list
- Reload the local package database:
sudo apt-get update
- Install the latest stable version of MongoDB:
sudo apt-get install mongodb-org
- Start the mongod process:
sudo service mongod start
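With mongod running and the pinned pymongo installed, a quick connectivity check can be sketched as follows (default localhost host/port assumed; adjust to your deployment):

```python
def mongo_uri(host="localhost", port=27017):
    """Default connection URI for a local mongod."""
    return "mongodb://%s:%d/" % (host, port)

def check_mongod(host="localhost", port=27017):
    """Round-trip to the server and return its version string."""
    import pymongo  # pymongo==2.7.2 as pinned above
    client = pymongo.MongoClient(mongo_uri(host, port))
    return client.server_info()["version"]

# Usage (with mongod running):
#   print(check_mongod())
```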
Install redis
wget http://download.redis.io/releases/redis-2.8.12.tar.gz
tar xzf redis-2.8.12.tar.gz
cd redis-2.8.12
make
Start the redis server:
src/redis-server
Other modules that may need to be installed
IPy
zope.interface
pip install IPy
pip install zope.interface
Modify the project
Change the parent class of WoaiduSpider in woaidu_detail_spider.py
(see https://scrapy-chs.readthedocs.org/zh_CN/latest/intro/tutorial.html#spider):
the class imported with
from scrapy.spider import BaseSpider
is deprecated; use import scrapy instead and subclass
scrapy.spider.Spider
In distribute_crawler/woaidu_crawler/woaidu_crawler/pipelines/file.py, change line 18:
from scrapy.contrib.pipeline.images import MediaPipeline
to:
from scrapy.contrib.pipeline.media import MediaPipeline
Set up the MongoDB sharded cluster
cd woaidu_crawler/commands/
sudo python init_sharding_mongodb.py --path=/usr/bin
Update ITEM_PIPELINES in settings.py
(see https://scrapy-chs.readthedocs.org/zh_CN/latest/topics/item-pipeline.html#id4):
ITEM_PIPELINES = {'woaidu_crawler.pipelines.cover_image.WoaiduCoverImage':300,
'woaidu_crawler.pipelines.mongodb_book_file.MongodbWoaiduBookFile':400,
'woaidu_crawler.pipelines.drop_none_download.DropNoneBookFile':500,
'woaidu_crawler.pipelines.mongodb.ShardMongodbPipeline':600,
'woaidu_crawler.pipelines.final_test.FinalTestPipeline':700,}
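The numbers in ITEM_PIPELINES are priorities: lower values run first, so each item flows through the pipelines in order 300 → 700. A minimal sketch of what one of these pipeline classes looks like; the book_file field name is an assumption for illustration, and a real Scrapy pipeline would raise scrapy.exceptions.DropItem, for which ValueError stands in here to keep the sketch dependency-free:

```python
class DropNoneBookFile(object):
    """Sketch of a pipeline that discards items with no downloaded file."""

    def process_item(self, item, spider):
        # 'book_file' is an assumed field name for illustration.
        if not item.get("book_file"):
            # A real project would raise scrapy.exceptions.DropItem here.
            raise ValueError("missing book file: %r" % item)
        return item  # passed on to the next pipeline in priority order
```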
In woaidu_crawler/pipelines/mongodb_book_file.py, change line 130:
info = self.spiderinfo.spider
to:
info = self.spiderinfo
In the directory containing the log folder (switch to root first: sudo su), run:
scrapy crawl woaidu
Open http://127.0.0.1/ to watch the spider's real-time status in graphs.
To try the distributed setup, run this project from another directory as well.
Set up the MongoDB server
cd woaidu_crawler/commands/
python init_single_mongodb.py
Set in settings.py:
ITEM_PIPELINES = {'woaidu_crawler.pipelines.cover_image.WoaiduCoverImage':300,
'woaidu_crawler.pipelines.bookfile.WoaiduBookFile':400,
'woaidu_crawler.pipelines.drop_none_download.DropNoneBookFile':500,
'woaidu_crawler.pipelines.mongodb.SingleMongodbPipeline':600,
'woaidu_crawler.pipelines.final_test.FinalTestPipeline':700,}
In the directory containing the log folder (switch to root first: sudo su), run:
scrapy crawl woaidu
Open http://127.0.0.1/ (the URL of your graphite-web instance) to watch the spider's real-time status in graphs.
To try the distributed setup, run this project from another directory as well.
Note
After every run, execute commands/clear_stats.py to clear the stats information stored in redis:
python clear_stats.py
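clear_stats.py removes the stats keys that the RedisGraphiteStatsCollector leaves behind between runs. If you need to do the same by hand, something along these lines works; the "stats" key prefix and default host/port are assumptions here, so check clear_stats.py for the project's actual key names:

```python
def stats_keys(keys, prefix="stats"):
    """Pick out the keys that belong to the stats collector."""
    return [k for k in keys if k.startswith(prefix)]

def clear_stats(host="localhost", port=6379, prefix="stats"):
    """Delete all matching stats keys from redis; returns the deleted keys."""
    import redis  # imported here so the helper above stays dependency-free
    r = redis.StrictRedis(host=host, port=port)
    raw = r.keys(prefix + "*")
    keys = [k.decode() if isinstance(k, bytes) else k for k in raw]
    if keys:
        r.delete(*keys)
    return keys

# Usage (with redis running):
#   print(clear_stats())
```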