Search Engine: Using the Bloom Filter Algorithm for URL Deduplication in a Scrapy Crawler

Host environment: Ubuntu 13.04

Python version: 2.7.4

When reposting, please credit the source: http://blog.geekcome.com/archives/135

1. Installation

sudo pip install pybloomfiltermmap

Alternatively, get the latest source code from GitHub, then build and install it:

sudo python setup.py install

2. Usage

class pybloomfilter.BloomFilter(capacity : int, error_rate : float, filename : string)

Create a new BloomFilter object with a given capacity and error_rate. Note that we do not check capacity. This is important, because I want to be able to support logical OR and AND (see below). The capacity and error_rate then together serve as a contract: if you add fewer than capacity items, the Bloom filter will have an error rate below error_rate.

NEW: If you specify None for the filename, then the bloom filter will be backed by malloc’d memory, rather than by a file.
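To make the capacity/error_rate contract more concrete, here is a minimal sketch that estimates how large a Bloom filter needs to be for a given capacity and error rate. It uses the standard textbook formulas for a classic Bloom filter; the helper name bloom_parameters is hypothetical, and pybloomfilter's own internal sizing may differ in its details.

```python
import math


def bloom_parameters(capacity, error_rate):
    """Return (num_bits, num_hashes) for a classic Bloom filter.

    Textbook formulas (an assumption, not pybloomfilter's exact code):
        m = -n * ln(p) / (ln 2)^2   bits in the filter
        k = (m / n) * ln 2          number of hash functions
    """
    num_bits = int(math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
    num_hashes = max(1, int(round(num_bits / float(capacity) * math.log(2))))
    return num_bits, num_hashes


# Example: one million URLs with a 1% false-positive target.
num_bits, num_hashes = bloom_parameters(1000000, 0.01)
print("bits: %d, hash functions: %d" % (num_bits, num_hashes))
```

Under these formulas the cost is a little under 10 bits per item at a 1% error rate, which is why a Bloom filter is attractive for deduplicating millions of URLs in memory.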

BloomFilter.add(item) → Boolean

Add the item to the bloom filter.

  • Parameters: item – Hashable object
  • Return type: Boolean (True if item already in the filter)
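The return value of add() is what makes URL deduplication a one-liner: if add() returns True, the URL was (probably) seen before and can be skipped. The sketch below illustrates the idea with a small pure-Python stand-in so that it runs without any third-party package. SimpleBloomFilter and dedupe_urls are hypothetical names introduced here, and the hashing scheme (double hashing over a SHA-256 digest) is an assumption for illustration, not the pybloomfilter implementation.

```python
import hashlib
import math


class SimpleBloomFilter(object):
    """Pure-Python stand-in for illustration; NOT pybloomfilter itself."""

    def __init__(self, capacity, error_rate):
        # Size the filter with the standard textbook formulas.
        self.num_bits = int(math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, int(round(self.num_bits / float(capacity) * math.log(2))))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
        h1 = int(digest[:16], 16)
        h2 = int(digest[16:32], 16) | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        """Add item; return True if it was (probably) already present."""
        already_present = True
        for pos in self._positions(item):
            byte_index, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte_index] & mask:
                already_present = False
                self.bits[byte_index] |= mask
        return already_present


def dedupe_urls(urls, seen):
    """Yield only the URLs that the filter has not seen before."""
    for url in urls:
        if not seen.add(url):
            yield url


seen = SimpleBloomFilter(capacity=10000, error_rate=0.001)
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
unique = list(dedupe_urls(urls, seen))
print(unique)
```

Because dedupe_urls relies only on add() returning True for an item that is already present, it should in principle work unchanged with a pybloomfilter.BloomFilter instance in place of the stand-in. Keep in mind that a Bloom filter can report false positives, so a small fraction of genuinely new URLs may be skipped; it never reports a previously added URL as new.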