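Before looking at deduplication in scrapy, think about how we deduplicate when using requests. requests is only a downloader and offers no deduplication of its own, so we have to do it ourselves. A typical approach is to define a set of crawled URLs up front and check whether the URL about to be fetched is already in it, as follows: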

crawled_urls = set()


def check_url(url):
    # Return True only the first time a URL is seen, and remember it.
    if url not in crawled_urls:
        crawled_urls.add(url)
        return True
    return False

This set lives in memory, and it keeps growing as the crawler fetches more pages. What can be done about that?

Read on and you will find out.

Deduplication in scrapy

Turning deduplication off for a request in scrapy is simple: just set dont_filter to True on the request object, for example:

yield scrapy.Request(url, callback=self.get_response, dont_filter=True)

Let's look at how the source code computes the fingerprint used for deduplication:

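import hashlib
import weakref

from scrapy.utils.python import to_bytes
from w3lib.url import canonicalize_url

# Cache fingerprints per request object so each is computed only once.
_fingerprint_cache = weakref.WeakKeyDictionary()


def request_fingerprint(request, include_headers=None):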
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                 for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

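The source has a lot of comments, so I removed them from the listing. The docstring says roughly the following.

Return the request fingerprint.

The request fingerprint is a hash that uniquely identifies the resource the request points to. For example, take the following two URLs: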

http://www.example.com/query?id=111&cat=222
http://www.example.com/query?cat=222&id=111

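Even though these are two different URLs, both point to the same resource and are equivalent (that is, they should return the same response).

Another example is cookies used to store session IDs. Suppose the following page is only accessible to authenticated users: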

http://www.example.com/members/offers.html

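Many sites use a cookie to store the session ID, which adds a random component to the HTTP request and should therefore be ignored when calculating the fingerprint.

For this reason, request headers are ignored by default when calculating the fingerprint. If you want to include specific headers, use the include_headers argument, which is a list of request headers to include.

In other words, scrapy hashes each request object with SHA-1 (this is hashing, not encryption) and produces a 40-character hexadecimal digest, such as 'fad8cefa4d6198af8cb1dcf46add2941b4d32d78'.

Looking at the source, the key lines are the following: one creates the hash object and three feed data into it.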

        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')

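If include_headers is not given, only the method, the canonicalized URL, and the binary body are hashed. Let's compute a few fingerprints:

import scrapy
from scrapy.utils.request import request_fingerprint

print(request_fingerprint(scrapy.Request('http://www.example.com/query?id=111&cat=222')))
print(request_fingerprint(scrapy.Request('http://www.example.com/query?cat=222&id=111')))
print(request_fingerprint(scrapy.Request('http://www.example.com/query')))

Output: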

fad8cefa4d6198af8cb1dcf46add2941b4d32d78
fad8cefa4d6198af8cb1dcf46add2941b4d32d78
b64c43a23f5e8b99e19990ce07b75c295165a923

You can see that the first and second fingerprints are identical. That is because canonicalize_url is called before hashing; it behaves as follows:

>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url(u'http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'
>>>
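To make the combination of URL canonicalization and hashing concrete, here is a minimal standard-library sketch of the same idea. It is not scrapy's implementation: canonical_url and simple_fingerprint are hypothetical names, the canonicalization only sorts query arguments and drops the fragment, and the digests are not guaranteed to match what scrapy produces.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Sort query arguments so equivalent URLs compare equal (simplified)."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def simple_fingerprint(method, url, body=b"", headers=None, include_headers=()):
    """SHA-1 over method + canonical URL + body, plus any opted-in headers."""
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(canonical_url(url).encode())
    fp.update(body)
    # Headers are ignored unless explicitly listed in include_headers.
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    for name in sorted(h.lower() for h in include_headers):
        if name in headers:
            fp.update(name.encode())
            fp.update(headers[name].encode())
    return fp.hexdigest()


a = simple_fingerprint("GET", "http://www.example.com/query?id=111&cat=222")
b = simple_fingerprint("GET", "http://www.example.com/query?cat=222&id=111")
c = simple_fingerprint("GET", "http://www.example.com/query")
print(a == b)  # True: same resource, different argument order
print(a == c)  # False: different URL
```

Passing include_headers=["Cookie"] in this sketch makes two requests with different session cookies hash differently, which is exactly why headers are left out by default.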

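By default scrapy keeps its dedup set in memory, so if the job restarts, the whole in-memory set is lost (unless a JOBDIR is configured, in which case the seen fingerprints are also written to disk).

Deduplication in scrapy-redis

scrapy-redis replaces scrapy's scheduler and dedup filter, so the following two settings need to be changed: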

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

In redis we then typically see two keys: one for the dedup set and one for the seed URLs.

Let's look at the important part of the code first:

    def request_seen(self, request):
        """Returns True if request was already seen.
        Parameters
        ----------
        request : scrapy.http.Request
        Returns
        -------
        bool
        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.
        Parameters
        ----------
        request : scrapy.http.Request
        Returns
        -------
        str
        """
        return request_fingerprint(request)

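Given a scrapy.http.Request, request_seen first calls self.request_fingerprint, which is scrapy's SHA-1 fingerprint, and then adds that fingerprint to redis.

In short, the function computes the request fingerprint and adds it to the dedup set in redis; if the fingerprint was already there, it returns True.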
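The "add, then check how many were added" pattern can be tried without a redis server. The sketch below is only an illustration: FakeRedisSet is a hypothetical in-memory stand-in that mimics the return value of the Redis SADD command for a single member (1 if newly added, 0 if already present), SketchDupeFilter is a made-up class, and the key name is an assumption.

```python
class FakeRedisSet:
    """Hypothetical in-memory stand-in for the Redis SADD command."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, value):
        # Like SADD with one member: 1 if newly added, 0 if already present.
        members = self._sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1


class SketchDupeFilter:
    """Mirrors the shape of request_seen, but takes a fingerprint directly."""

    def __init__(self, server, key):
        self.server = server
        self.key = key

    def request_seen(self, fingerprint):
        added = self.server.sadd(self.key, fingerprint)
        return added == 0


dupefilter = SketchDupeFilter(FakeRedisSet(), "myspider:dupefilter")
first = dupefilter.request_seen("fad8cefa4d6198af8cb1dcf46add2941b4d32d78")
second = dupefilter.request_seen("fad8cefa4d6198af8cb1dcf46add2941b4d32d78")
print(first, second)  # False True
```

Because the add and the membership check are a single set operation, several spiders sharing one key cannot both conclude that the same fingerprint is new.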

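As we can see, once DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" is set in settings, a new dedup set key appears in redis. Here are the pros and cons of doing it this way:

Pros: the dedup set moves from memory into redis, so it can be reused even if the crawler restarts or shuts down. You can use SCHEDULER_PERSIST to control whether the queue and the dedup set are kept when the spider closes.
Cons: if you have too many fingerprints to deduplicate, redis uses too much space. 8GB = 8589934592 bytes; at an average of 40 bytes per fingerprint, that holds about 214,748,000 fingerprints (roughly 200 million), ignoring redis's own per-entry overhead. So for crawlers that walk relationship networks, storing fingerprints in redis may not be a good fit, and keeping them in memory is no better. That is where the Bloom filter comes in.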
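The capacity estimate above is plain integer arithmetic and can be checked in a few lines. The 40 bytes per fingerprint is the post's own assumption (one SHA-1 hex digest) and leaves out any storage overhead.

```python
GIB = 1024 ** 3
budget_bytes = 8 * GIB      # memory budget from the text
fingerprint_bytes = 40      # one SHA-1 hex digest, overhead not counted
capacity = budget_bytes // fingerprint_bytes
print(budget_bytes)  # 8589934592
print(capacity)      # 214748364
```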

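Bloom filter

The principle: an element is passed through k hash functions, which map it to k bit positions, and those positions are set to 1 in a bitmap. To check membership we only need to verify whether those bits are all 1. If any of them is 0, the element is definitely not in the set; if all of them are 1, the element is very likely in the set (other elements may have been mapped to the same bits).

This also means an element cannot be deleted from a Bloom filter: its bits may be shared with other elements, and we can never be sure that a given element really is in the set. It also brings false positives: as more data is added, a "probably in the set" answer becomes less and less reliable, because hash collisions can make an element outside the set look like a member.

In short, the drawback of a Bloom filter is misjudgment. Something that is not in the set may be reported as present, whereas something that is in the set is always reported as present. In addition, data cannot be removed from it.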
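The mechanism described above fits in a few lines of pure Python. This is a teaching sketch, not a production filter: TinyBloomFilter is a hypothetical class, and deriving the k positions by salting SHA-1 with the hash index is just one simple choice among many.

```python
import hashlib


class TinyBloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the bitmap, all zeros

    def _positions(self, item):
        # Derive k bit positions by salting SHA-1 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Any zero bit proves absence; all ones means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = TinyBloomFilter(num_bits=1 << 16, num_hashes=6)
for fruit in ("apple", "pear", "orange"):
    bloom.add(fruit)
print("apple" in bloom)  # True: added items are always found
```

There is no remove method on purpose: clearing a bit could silently erase another element that shares it.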

>>> import pybloomfilter
>>> fruit = pybloomfilter.BloomFilter(100000, 0.1, '/tmp/words.bloom')
>>> fruit.update(('apple', 'pear', 'orange', 'apple'))
>>> len(fruit)
3
>>> 'mike' in fruit
False
>>> 'apple' in fruit
True

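The above is an example of using pybloomfilter with Python 3.

So how do we use a Bloom filter in scrapy? Cui (崔大大) has already written one, ScrapyRedisBloomFilter. It is packaged and can be installed directly: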

pip install scrapy-redis-bloomfilter

Configure settings like this:

# Ensure use this Scheduler
SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"

# Redis URL
REDIS_URL = 'redis://localhost:6379/0'

# Number of Hash Functions to use, defaults to 6
BLOOMFILTER_HASH_NUMBER = 6

# Redis Memory Bit of Bloomfilter Usage, 30 means 2^30 bits = 128MB, defaults to 30
BLOOMFILTER_BIT = 30

# Persist
SCHEDULER_PERSIST = True

This, too, works by replacing the scheduler and the dedup filter. Take a look at the source if you are interested.
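To get a feel for the two Bloom filter settings, the sketch below converts the bit exponent into a memory size and estimates the false positive rate with the standard approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions, and n the number of stored items. The item counts are arbitrary examples, and the estimate assumes ideal, independent hash functions, which a real implementation may not have.

```python
import math

bit_exponent = 30   # value of BLOOMFILTER_BIT
hash_count = 6      # value of BLOOMFILTER_HASH_NUMBER

num_bits = 1 << bit_exponent
size_mib = num_bits // 8 // (1024 * 1024)
print(size_mib)  # 128


def false_positive_rate(num_items, num_bits, hash_count):
    """Standard approximation: (1 - e^(-k*n/m)) ** k."""
    return (1 - math.exp(-hash_count * num_items / num_bits)) ** hash_count


# Example item counts only; pick values that match your own crawl.
for n in (10_000_000, 100_000_000, 200_000_000):
    print(n, false_positive_rate(n, num_bits, hash_count))
```

The rate climbs as the filter fills up, so the bit size should be chosen with the expected number of fingerprints in mind.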

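A few days ago, while looking into scrapy's filters, I came across a blog post that I thought was well written, so I reposted it here. It is for technical study only; if it infringes on anything, let me know and I will remove it.

Reposted from: https://www.v2ex.com/t/548825

Bloom filter deduplication (recommended): https://blog.csdn.net/golden1314521/article/details/33268719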

