scrapy_redis notes: migrating spider progress saved in jobdir to Redis

0. References

Hidden Scrapy bug: the saved request count read back from requests.queue may be wrong after force-closing a spider

1. Overview

With Scrapy's jobdir set, the directory saved after the spider stops looks like this:

crawl/apps/
├── requests.queue
│   ├── active.json
│   ├── p0
│   └── p1
├── requests.seen
└── spider.state

The requests.queue/p0 file holds the unscheduled requests with priority=0. The p-1 file corresponds to high-priority requests whose actual priority is 1; since a smaller score sorts earlier in a Redis sorted set, those requests are transferred with score -1. Likewise, p1 corresponds to low-priority requests with actual priority -1, which get score 1.
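
In other words, the number embedded in the queue file name is the negated request priority, and it can be reused directly as the sorted-set score. A minimal sketch of that mapping (illustrative only):

# The file name suffix is -priority, which is exactly the Redis score:
# scrapy_redis pops the member with the smallest score first.
for filename in ('p-1', 'p0', 'p1'):
    score = int(filename[1:])   # 'p-1' -> -1, 'p0' -> 0, 'p1' -> 1
    priority = -score           # the original request priority
    print(filename, 'priority:', priority, 'redis score:', score)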

requests.seen stores the fingerprint hashes the request duplicate filter computed for requests already enqueued, one value per line.
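
Each of these values is the fingerprint Scrapy's RFPDupeFilter computes per request, and scrapy_redis keeps the same fingerprints in a Redis set, which is why the lines can be copied over verbatim. A minimal sketch, assuming a Scrapy version that exposes scrapy.utils.request.request_fingerprint:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Prints a 40-character hex SHA1 fingerprint, the same form as each line
# in requests.seen and each member of the '<name>:dupefilter' set.
print(request_fingerprint(Request('https://example.com')))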

spider.state concerns persistent storage of custom spider attributes and is outside the scope of this article.

 

2. Implementation

import os
from os.path import join
import re
import struct

import redis


def sadd_dupefilter(jobdir, redis_server, name):
    """Copy requests.seen fingerprints into the scrapy_redis dupefilter set.

    See python/lib/site-packages/scrapy/dupefilters.py
    """
    file = join(jobdir, 'requests.seen')
    key = '%s:dupefilter' % name
    with open(file) as f:
        print('Processing %s, it may take minutes...' % file)
        for line in f:
            # One fingerprint per line; strip the trailing newline.
            redis_server.sadd(key, line.rstrip())
    print('Result: {} {}'.format(key, redis_server.scard(key)))


def zadd_requests(jobdir, redis_server, name):
    """Copy the on-disk queues into the scrapy_redis requests sorted set.

    See python/lib/site-packages/queuelib/queue.py (LifoDiskQueue): the file
    starts with a 4-byte big-endian item count, and each record is the
    serialized request followed by its 4-byte big-endian length, so the
    records are walked backwards from the end of the file.
    """
    SIZE_FORMAT = ">L"
    SIZE_SIZE = struct.calcsize(SIZE_FORMAT)

    key = '%s:requests' % name
    queue_dir = join(jobdir, 'requests.queue')
    # Map each queue file ('p0', 'p-1', 'p1', ...) to its sorted-set score:
    # the number in the file name is already -priority, i.e. the score.
    file_score_dict = {f: int(f[1:]) for f in os.listdir(queue_dir)
                       if re.match(r'^p-?\d+$', f)}
    for file, score in file_score_dict.items():
        print('Processing %s, it may take minutes...' % file)
        with open(join(queue_dir, file), 'rb') as f:
            # Header: the number of items the queue believes it holds.
            total_size, = struct.unpack(SIZE_FORMAT, f.read(SIZE_SIZE))
            f.seek(0, os.SEEK_END)

            actual_size = 0
            while f.tell() != SIZE_SIZE:
                # Read the length field stored after each record...
                f.seek(-SIZE_SIZE, os.SEEK_CUR)
                size, = struct.unpack(SIZE_FORMAT, f.read(SIZE_SIZE))
                # ...then jump back to the start of the record and read it.
                f.seek(-size - SIZE_SIZE, os.SEEK_CUR)
                data = f.read(size)
                redis_server.execute_command('ZADD', key, score, data)
                # Rewind to the end of the previous record.
                f.seek(-size, os.SEEK_CUR)
                actual_size += 1
        print('total_size {}, actual_size {}, score {}'.format(
                total_size, actual_size, score))
        # zlexcount('-', '+') counts every member regardless of score.
        print('Result: {} {}'.format(key, redis_server.zlexcount(key, '-', '+')))


if __name__ == '__main__':
    name = 'test'
    jobdir = '/home/yourproject/crawl/apps'
    database_num = 0
    # apps/
    # ├── requests.queue
    # │   ├── active.json
    # │   ├── p0
    # │   └── p1
    # ├── requests.seen
    # └── spider.state
    
    password = 'password'
    host = '127.0.0.1'
    port = '6379'
    redis_server = redis.StrictRedis.from_url('redis://:{password}@{host}:{port}/{database_num}'.format(
                                                password=password, host=host,
                                                port=port, database_num=database_num))
    
    sadd_dupefilter(jobdir, redis_server, name)
    zadd_requests(jobdir, redis_server, name)
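
With both keys populated, the crawl can be resumed under scrapy_redis. A settings.py sketch under assumed defaults (class paths and default key formats differ slightly between scrapy_redis versions, so verify against the installed one):

# settings.py sketch (assumed scrapy_redis defaults). The default key names
# '%(spider)s:requests' and '%(spider)s:dupefilter' match the keys written above.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"  # sorted-set queue
SCHEDULER_PERSIST = True   # keep the Redis keys when the spider closes
REDIS_URL = 'redis://:password@127.0.0.1:6379/0'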

 

 

3. Results

 
