Running Scrapy Spiders in Batch

Original

SteveForever

2018-09-02 07:17

There are two ways to run multiple Scrapy spiders in one batch:

1. Use CrawlerProcess

https://doc.scrapy.org/en/latest/topics/practices.html
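
As a rough illustration of the first approach, the script below queues every spider in a project on a single CrawlerProcess. It is a minimal sketch rather than the documentation's exact example: the helper name schedule_spiders and its parameters are hypothetical, and the script is assumed to be run from inside a Scrapy project (next to scrapy.cfg) so that get_project_settings() can find the project settings.

```python
"""Run every spider in a Scrapy project from one script (sketch)."""


def schedule_spiders(process, names=None, **spider_kwargs):
    """Queue spiders on a CrawlerProcess-like object and return their names.

    `process` only needs `spider_loader.list()` and `crawl()`, which
    scrapy.crawler.CrawlerProcess provides. The helper name is hypothetical.
    """
    selected = list(names) if names else list(process.spider_loader.list())
    for name in selected:
        # crawl() only schedules the spider; nothing runs until start().
        process.crawl(name, **spider_kwargs)
    return selected


def main():
    # Imported here so the helper above can be reused without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for name in schedule_spiders(process):
        print("Scheduled spider:", name)
    process.start()  # blocks until every scheduled spider has finished


if __name__ == "__main__":
    main()
```

Because all spiders share one process, start() is called only once, after every spider has been queued.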

2. Modify the crawl command source and register it as a custom command

(1) Open scrapy/commands/crawl.py and you will see:

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()

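This is the run() method in crawl.py. It decides which spider to run, so running all spiders means changing this method.

Inside run(), the call crawler_process.crawl(spname, **opts.spargs) schedules the spider, where spname is the spider name. To run several spiders we first need the names of all spiders in the project, which crawler_process.spider_loader.list() returns.

(2) Implementation:

a. Alongside the spiders directory, create a folder named mycmd to hold the command source, and create a file mycrawl.py inside it;

b. Copy the code from crawl.py into mycrawl.py, then modify it: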

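# Modified run() method
    def run(self, args, opts):
        # Get the names of all spiders in the project
        spd_loader_list = self.crawler_process.spider_loader.list()
        # Schedule each spider; args is used only if no spiders were found
        for spname in spd_loader_list or args:
            self.crawler_process.crawl(spname, **opts.spargs)
            print("Starting spider: " + spname)
        # Start the crawl once, after every spider has been scheduled
        self.crawler_process.start()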
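
Note that the expression `spd_loader_list or args` falls back to args only when the project has no spiders, so names given on the command line are normally ignored. If you would rather let command-line names narrow the run, the selection could be factored out as in the sketch below. The function select_spiders is a hypothetical helper, not part of Scrapy, and it raises ValueError where a real command would more likely raise UsageError.

```python
def select_spiders(available, requested):
    """Return the spider names to run.

    available: names reported by spider_loader.list()
    requested: names passed on the command line (may be empty)
    """
    if not requested:
        # No names given: run everything the project defines.
        return list(available)
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError("Unknown spider(s): " + ", ".join(unknown))
    # Drop duplicates while keeping the order given on the command line.
    return list(dict.fromkeys(requested))
```

Inside run(), the loop would then iterate over select_spiders(spd_loader_list, args) instead of `spd_loader_list or args`.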

You can also update the command description:

    def short_desc(self):
        return "Run all spiders"

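c. Add an initialization file __init__.py to the mycmd folder, then add a setting of the form "COMMANDS_MODULE = '<project package>.<custom command directory>'" to the project settings file (settings.py);

For example: COMMANDS_MODULE = 'firstpjt.mycmd'

Running "scrapy -h" afterwards should list the newly added mycrawl command.

With that in place, all spiders can be started together with the command: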

scrapy mycrawl --nolog
