Introduction
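FuzzyWuzzy is a highly starred project on GitHub that scores the similarity of two sequences based on edit distance. The edit distance between two strings is the minimum number of edit operations needed to transform one into the other. The edit operations are substitution, insertion, and deletion; in general, the smaller the edit distance between two strings, the more similar they are. (Note that a smaller edit distance means greater similarity, but FuzzyWuzzy returns a similarity score, so the larger the return value, the more similar the strings.)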
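To make the definition of edit distance concrete, here is a minimal sketch of the classic dynamic-programming computation, using only the standard library. The function name `edit_distance` is made up for illustration; this is not FuzzyWuzzy's own code, only the textbook algorithm the description above refers to.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; it starts as the row for i = 0.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))                  # 3
print(edit_distance("this is a test", "this is a test!"))  # 1
```

A distance of 1 between two strings of about 15 characters is what ends up as a high similarity score once the distance is normalized by the string lengths.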
Installation
pip install fuzzywuzzy
or
git clone git://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
cd fuzzywuzzy
python setup.py install
Usage
- Imports
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
- Simple ratio
>>> fuzz.ratio("this is a test", "this is a test!")
97
- Partial ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
- Token sort ratio (ignores word order)
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
- Token set ratio (ignores duplicate words)
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
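The scores above can be roughly reproduced with the standard library, which may help show what each mode does. The sketch below is a simplified approximation built on `difflib.SequenceMatcher`; the function names are made up for illustration, and the real library applies extra preprocessing and can use a different backend, so scores will not always agree.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Sort the words of each string before comparing, so word order is ignored."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(normalize(a), normalize(b))


def token_set_ratio_sketch(a: str, b: str) -> int:
    """Compare the shared words against each string's word set, so repeats are ignored."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        simple_ratio(common, combined_a),
        simple_ratio(common, combined_b),
        simple_ratio(combined_a, combined_b),
    )


print(simple_ratio("this is a test", "this is a test!"))                                   # 97
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))         # 100
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))                # 100
```

Sorting the tokens makes the two "bear" strings identical, and reducing each string to a set of words removes the repeated "fuzzy", which is why both token-based modes reach 100 where a plain ratio would not.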
- Fuzzy matching against a list of choices
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
('Dallas Cowboys', 90)
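You can also pass an additional argument to extractOne to select a specific scorer. A typical use is matching file paths (songs here is a list of file paths):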
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)
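The idea of a pluggable scorer can be sketched in a few lines: score every candidate with the chosen function and keep the best one. The names `default_scorer` and `extract_one_sketch` below are hypothetical, the scorer is a plain `difflib` ratio rather than the library's default, and the real `process.extractOne` does more (preprocessing, score cutoffs), so treat this only as an illustration of the pattern.

```python
from difflib import SequenceMatcher


def default_scorer(query: str, candidate: str) -> int:
    """Case-insensitive similarity from 0 to 100 (stand-in for a fuzz.* scorer)."""
    matcher = SequenceMatcher(None, query.lower(), candidate.lower())
    return round(100 * matcher.ratio())


def extract_one_sketch(query, choices, scorer=default_scorer):
    """Return (best_choice, score) using any scorer(query, candidate) -> number."""
    best = max(choices, key=lambda candidate: scorer(query, candidate))
    return best, scorer(query, best)


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract_one_sketch("new york jets", choices))  # ('New York Jets', 100)
print(extract_one_sketch("cowboys", choices)[0])     # Dallas Cowboys
```

Because the scorer is just a function argument, swapping in a different comparison (for example one that sorts tokens first) changes which candidate wins without touching the selection logic, which is the effect the two extractOne calls above show.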