String similarity in R: the stringdist package

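String similarity can be computed with the adist function in the utils package, the stringDist function in the MKmisc package, or distance functions such as jarowinkler in the RecordLinkage package. This post covers the stringdist and stringdistmatrix functions from the stringdist package.
The stringdist package was written by Mark van der Loo.
stringdist computes the distance between the strings in a and b element by element (a smaller distance means more similar strings); when one argument has fewer elements than the other, the shorter one is recycled. stringdistmatrix returns the full distance matrix, with one row per element of a and one column per element of b.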
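
For comparison with the base-R option mentioned above, here is a minimal sketch using adist from utils. It computes a generalized Levenshtein distance and, like stringdistmatrix, puts the first argument on the rows and the second on the columns. The variable names are only illustrative.

```r
# adist ships with base R (utils), so no extra package is needed.
a <- c("foo", "bar", "boo")
b <- c("baz", "buz")

# Generalized Levenshtein distance: rows follow a, columns follow b.
d <- adist(a, b)
dimnames(d) <- list(a, b)   # label the matrix for readability
print(d)
```

The result is a 3 x 2 matrix, one row per string in a and one column per string in b.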

stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, nthread = getOption("sd_num_thread"))

stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, useNames = c("none", "strings", "names"), ncores = 1, cluster = NULL, nthread = getOption("sd_num_thread"))

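Parameters:
a, b: the character vectors to compare
method: the distance method; defaults to "osa" and can be set to any value listed in the signature above, such as "jaccard", "hamming" or "jw" (Jaro-Winkler).
useBytes: compare byte by byte instead of character by character
weight: for "osa", "lv" and "dl", the penalties for deletion, insertion, substitution and transposition (d, i, s, t, in that order); each weight must be positive and must not exceed 1
maxDist: upper limit on the distance
q: the q-gram size, used with method = "qgram", "jaccard" or "cosine"; must be nonnegative
p: penalty factor for the Jaro-Winkler distance, between 0 and 0.25; with the default of 0 the plain Jaro distance is returned
nthread: maximum number of threads
useNames: how to label the rows and columns of the output: not at all ("none"), with the input strings ("strings"), or with the names of the inputs ("names")
ncores: number of cores
cluster: a custom cluster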
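
As a rough illustration of how a few of these parameters combine, the sketch below calls stringdistmatrix with method and useNames, and stringdist with method = "jw" and p. It assumes the stringdist package is installed and is skipped otherwise; the input vectors are made up for the example.

```r
# Assumes the stringdist package is installed; the example is skipped otherwise.
if (requireNamespace("stringdist", quietly = TRUE)) {
  a <- c("foo", "bar", "boo")
  b <- c("baz", "buz")

  # Levenshtein distance, with rows and columns labelled by the input strings.
  m <- stringdist::stringdistmatrix(a, b, method = "lv", useNames = "strings")
  print(m)

  # Jaro-Winkler with a penalty factor inside the allowed 0 to 0.25 range.
  print(stringdist::stringdist("MARTHA", "MATHRA", method = "jw", p = 0.1))
}
```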

Examples:

> stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))
     [,1] [,2]
[1,]    3    3
[2,]    1    2
[3,]    2    2

> # string distance matching is case sensitive:
> stringdist("ABC","abc")
[1] 3
> 
> # so you may want to normalize a bit:
> stringdist(tolower("ABC"),"abc")
[1] 0
> 
> # stringdist recycles the shortest argument:
> stringdist(c('a','b','c'),c('a','c'))
Warning message: longer object length is not a multiple of shorter object length
[1] 0 1 1
> 
> # different edit operations may be weighted; e.g. weighted transposition (the fourth weight, t):
> stringdist('ab','ba',weight=c(1,1,1,0.5))
[1] 0.5
> 
> # Non-unit weights for insertion and deletion make the distance metric asymmetric
> stringdist('ca','abc')
[1] 3
> stringdist('abc','ca')
[1] 3
> stringdist('ca','abc',weight=c(0.5,1,1,1))
[1] 2
> stringdist('abc','ca',weight=c(0.5,1,1,1))
[1] 2.5

> # q-grams are based on the difference between occurrences of q consecutive characters
> # in string a and string b.
> # Since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0:
> stringdist('abc','cba',method='qgram',q=1)
[1] 0
> 
> # since the first string consists of 'ab','bc' and the second 
> # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
> stringdist('abc','cba',method='qgram',q=2)
[1] 4
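
The q-gram distance described above can be sketched in base R as the sum of absolute differences between the q-gram counts of the two strings. This is a simplified illustration, not the package's implementation; the helper names qgrams and qgram_dist are hypothetical, and q >= 1 is assumed.

```r
# Split a string into its q-grams (runs of q consecutive characters).
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# q-gram distance: sum of absolute differences between the two count profiles.
qgram_dist <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  grams <- union(ga, gb)
  # Counting over a shared set of levels gives 0 for q-grams absent from a string.
  ca <- as.integer(table(factor(ga, levels = grams)))
  cb <- as.integer(table(factor(gb, levels = grams)))
  sum(abs(ca - cb))
}

print(qgram_dist("abc", "cba", q = 1))  # 0: same characters, same counts
print(qgram_dist("abc", "cba", q = 2))  # 4: 'ab','bc' vs 'cb','ba' share nothing
```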

> stringdist('MARTHA','MATHRA',method='jw')
[1] 0.08333333
> # Note that stringdist gives a  _distance_ where wikipedia gives the corresponding 
> # _similarity measure_. To get the wikipedia result:
> 1 - stringdist('MARTHA','MATHRA',method='jw')
[1] 0.9166667
> 
> # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
> stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.06666667
> # or, as a similarity measure
> 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.9333333
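
The two distances above can be reproduced by hand from the usual Jaro and Winkler formulas. The sketch below is a worked calculation under stated assumptions (the match and transposition counts were derived by hand for this pair of strings, not taken from the package source), and the variable names are illustrative.

```r
# Hand-derived inputs for 'MARTHA' vs 'MATHRA' (assumptions, see lead-in):
len_a <- 6
len_b <- 6
matches <- 6           # M, A, R, T, H, A all find a partner within the match window
transpositions <- 3 / 2  # R, T, H are matched out of order; count half of them

# Jaro similarity and the corresponding distance.
jaro_sim  <- (matches / len_a + matches / len_b +
              (matches - transpositions) / matches) / 3
jaro_dist <- 1 - jaro_sim

# Winkler adjustment: scale the distance down by p per shared prefix character
# (the prefix is conventionally capped at 4 characters).
prefix_len <- 2        # common prefix "MA"
p <- 0.1
jw_dist <- jaro_dist * (1 - prefix_len * p)

print(round(c(jaro = jaro_dist, jw = jw_dist), 7))
```

The two printed values agree with the p = 0 and p = 0.1 results shown above.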
> 
> # This gives distance 1 since Euler and Gauss translate to different soundex codes.
> stringdist('Euler','Gauss',method='soundex')
[1] 1
> # Euler and Ellery translate to the same code and have distance 0
> stringdist('Euler','Ellery',method='soundex')
[1] 0
> 

References
https://www.rdocumentation.org/packages/stringdist/versions/0.9.4.2/topics/stringdist
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
