FuzzyWuzzy: A Fuzzy String Matching Tool

Introduction

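FuzzyWuzzy is a highly starred project on GitHub that scores how similar two sequences are, based on edit distance. The edit distance between two strings is the minimum number of edit operations needed to turn one into the other, where the operations are substitution, insertion, and deletion. In general, the smaller the edit distance between two strings, the more similar they are. (Note that a smaller edit distance means greater similarity, but FuzzyWuzzy returns a similarity score from 0 to 100 rather than a distance, so a larger return value means the strings are more similar.)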
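To make the definition of edit distance concrete, here is a minimal sketch of the classic dynamic-programming computation in plain Python. The function names edit_distance and similarity are hypothetical and are not part of FuzzyWuzzy. The similarity normalization shown is only one possible choice, an assumption made for illustration. FuzzyWuzzy normalizes differently, which is why the last line prints 93 while fuzz.ratio reports 97 for the same pair of strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    substitutions, insertions, and deletions that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 score (NOT FuzzyWuzzy's formula):
    one minus the distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - edit_distance(a, b) / longest))


print(edit_distance("kitten", "sitting"))               # 3
print(similarity("this is a test", "this is a test!"))  # 93
```

The smaller the distance, the larger the score, which is the inverse relationship described above.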

Installation

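Using pip:

    pip install fuzzywuzzy

Or manually from source (GitHub no longer serves the unauthenticated git:// protocol, so clone over HTTPS):

    git clone https://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
    cd fuzzywuzzy
    python setup.py install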

Usage

  1. Imports
    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process

  2. Simple ratio

    >>> fuzz.ratio("this is a test", "this is a test!")
    97

  3. Partial ratio

    >>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

  4. Token sort ratio (ignores word order)

    >>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

  5. Token set ratio (ignores duplicated words)

    >>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100

  6. Fuzzy matching against a list of choices

    >>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
    >>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
    >>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)
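
Conceptually, extracting matches from a list comes down to scoring every choice against the query and keeping the best ones. The following is a rough sketch of that idea using only the standard library. The names simple_ratio, extract, and extract_one are hypothetical stand-ins, and simple_ratio is an assumed scorer, not the one FuzzyWuzzy uses by default, so the scores it produces will generally differ from the library's output shown above.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Stand-in scorer (an assumption, not FuzzyWuzzy's default):
    case-insensitive character similarity scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract(query, choices, scorer=simple_ratio, limit=5):
    """Score every choice against the query and return the best `limit` pairs."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    # Highest score first; the sort is stable, so ties keep their list order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def extract_one(query, choices, scorer=simple_ratio):
    """Return the single best (choice, score) pair, or None for an empty list."""
    best = extract(query, choices, scorer=scorer, limit=1)
    return best[0] if best else None


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract("new york jets", choices, limit=2))
print(extract_one("cowboys", choices))
```

Because the scorer is just a function argument in this sketch, swapping in a different comparison strategy does not require touching the ranking logic.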

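You can also pass additional arguments to extractOne to make it use a specific scorer. A typical use is matching file paths (here, songs is a list of file paths whose definition is not shown):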

    >>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
    ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
    >>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)
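
The choice of scorer changes the result because each scorer prepares the strings differently before comparing them. As a minimal sketch, the two functions below approximate a plain ratio and a token sort ratio with the standard library's difflib.SequenceMatcher. They reuse the names ratio and token_sort_ratio only for readability and are simplified stand-ins, not FuzzyWuzzy's code. In particular, this version only lowercases and splits on whitespace, which is an assumption, so its handling of punctuation may differ from the library's.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Character-level similarity of the strings exactly as given, 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Lowercase, split into words, sort the words, then compare.
    Sorting removes the effect of word order."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio(normalize(a), normalize(b))


print(ratio("this is a test", "this is a test!"))  # 97
# Both strings sort to "a bear fuzzy was wuzzy", so this prints 100.
print(token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
# The plain ratio is below 100 because the first two words are swapped.
print(ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
```

A scorer that ignores word order suits queries whose words may appear in a different sequence than in the target, which is the situation in the file path example above.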