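Python: computing various distance measures

I. Introduction to the scipy.spatial module

The most important submodule of scipy.spatial is probably distance, the distance-computation module.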

from scipy import spatial

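Distance computation
Matrix distance functions
Each row of the matrix argument is one observation, and the result is the metric distance between every pair of rows (a distance matrix computed from a collection of raw observation vectors stored in a rectangular array).

Vector distance functions
These are distance functions between two vectors u and v. Computing distances over a large collection of vectors is inefficient with these functions; use pdist for that purpose.

The arguments should be vectors, i.e. arrays of shape (n,). In the SciPy version referenced here (0.14.0), an array of shape (1, n) is also accepted, because squeeze is applied to drop the dimensions of length 1. An array with two or more dimensions larger than 1 raises an error.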

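e.g. spatial.distance.correlation(u, v)    # correlation distance between u and v, i.e. 1 minus the Pearson correlation coefficient (centered cosine)

Note: if u and v each contain only one element, or if all elements of one vector are equal (so the denominator norm(u - u.mean()) is 0), the correlation is undefined and nan is returned.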
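As a small sketch of the point above (the sample vectors u, v and c are made up for illustration), the value returned by correlation can be compared with 1 minus the Pearson coefficient from NumPy. The constant vector is only used to show that its centered norm is zero, which is why the ratio is undefined in that case.

```python
import numpy as np
from scipy.spatial import distance

# Illustrative vectors (not from the original article)
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 1.0, 4.0, 3.0])

# correlation() returns a distance: 1 minus the Pearson correlation coefficient
d = distance.correlation(u, v)
r = np.corrcoef(u, v)[0, 1]
print(d, 1.0 - r)

# For a constant vector the centered norm is zero, so the ratio is undefined
c = np.array([5.0, 5.0, 5.0, 5.0])
zero_norm = np.linalg.norm(c - c.mean())
print(zero_norm)
```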

braycurtis(u, v)    Computes the Bray-Curtis distance between two 1-D arrays.
canberra(u, v)    Computes the Canberra distance between two 1-D arrays.
chebyshev(u, v)    Computes the Chebyshev distance.
cityblock(u, v)    Computes the City Block (Manhattan) distance.
correlation(u, v)    Computes the correlation distance between two 1-D arrays.
cosine(u, v)    Computes the Cosine distance between 1-D arrays.
dice(u, v)    Computes the Dice dissimilarity between two boolean 1-D arrays.
euclidean(u, v)    Computes the Euclidean distance between two 1-D arrays.
hamming(u, v)    Computes the Hamming distance between two 1-D arrays.
jaccard(u, v)    Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays.
kulsinski(u, v)    Computes the Kulsinski dissimilarity between two boolean 1-D arrays.
mahalanobis(u, v, VI)    Computes the Mahalanobis distance between two 1-D arrays.
matching(u, v)    Computes the Matching dissimilarity between two boolean 1-D arrays.
minkowski(u, v, p)    Computes the Minkowski distance between two 1-D arrays.
rogerstanimoto(u, v)    Computes the Rogers-Tanimoto dissimilarity between two boolean 1-D arrays.
russellrao(u, v)    Computes the Russell-Rao dissimilarity between two boolean 1-D arrays.
seuclidean(u, v, V)    Returns the standardized Euclidean distance between two 1-D arrays.
sokalmichener(u, v)    Computes the Sokal-Michener dissimilarity between two boolean 1-D arrays.
sokalsneath(u, v)    Computes the Sokal-Sneath dissimilarity between two boolean 1-D arrays.
sqeuclidean(u, v)    Computes the squared Euclidean distance between two 1-D arrays.
wminkowski(u, v, p, w)    Computes the weighted Minkowski distance between two 1-D arrays.
yule(u, v)    Computes the Yule dissimilarity between two boolean 1-D arrays.
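A few of the functions in this list can be tried on a pair of hand-picked vectors. This is only a sketch; u and v are illustrative values chosen so that the differences are 3, 4 and 0 and the results are easy to check by hand.

```python
import numpy as np
from scipy.spatial import distance

# Illustrative vectors: |u - v| = [3, 4, 0]
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

d_euclid = distance.euclidean(u, v)      # sqrt(9 + 16 + 0)
d_sq = distance.sqeuclidean(u, v)        # 9 + 16 + 0
d_city = distance.cityblock(u, v)        # 3 + 4 + 0
d_cheb = distance.chebyshev(u, v)        # max(3, 4, 0)
d_mink1 = distance.minkowski(u, v, p=1)  # Minkowski with p=1 is the city block distance
d_ham = distance.hamming(u, v)           # fraction of positions that differ

print(d_euclid, d_sq, d_city, d_cheb, d_mink1, d_ham)
```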

[Distance and similarity computation]
scipy.spatial.distance.pdist(X, metric='euclidean', p=2, w=None, V=None, VI=None)
pdist(X[, metric, p, w, V, VI])    Pairwise distances between observations in n-dimensional space. The larger the distance, the lower the similarity.

Note: when converting distances to similarities, remember that the distance from an observation to itself is not computed and is taken to be 0. So first expand the result into a dense matrix with dist = spatial.distance.squareform(dist), and then compute the similarity as 1 - dist.
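The conversion described in this note might look like the following sketch. The matrix X is a made-up example, and cosine is used as the metric because 1 minus the cosine distance is the usual cosine similarity.

```python
import numpy as np
from scipy.spatial import distance

# Illustrative observations, one per row
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Condensed form: only the n*(n-1)/2 pairs with i < j, no self-distances
dist = distance.pdist(X, metric='cosine')

# Dense form: the diagonal (self-distance) is filled with 0
dense = distance.squareform(dist)

# Similarity matrix; the diagonal becomes 1
sim = 1.0 - dense
print(sim)
```

Applying 1 - dist directly to the condensed vector would also work element-wise, but the self-similarities on the diagonal would then be missing.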

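metric:

1. You can compute distances with a function of your own. Y = pdist(X, f) computes the distance between all pairs of vectors in X using the user-supplied 2-arity function f.

For example, the Euclidean distance can be computed like this:

dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum()))

However, if SciPy already provides the distance function, do not call it as dm = pdist(X, sokalsneath). That passes the Python function sokalsneath, which is then called C(n, 2) times, once per pair of rows. Use SciPy's optimized C version instead by passing the name as a string: dm = pdist(X, 'sokalsneath').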
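To illustrate the three ways of passing a metric, here is a sketch on random data (the array X and its size are arbitrary). It uses the Euclidean distance rather than sokalsneath so that the lambda shown earlier can be included in the comparison; all three calls should give the same numbers, and only the string form goes through the compiled implementation.

```python
import numpy as np
from scipy.spatial import distance
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((20, 5))  # 20 illustrative observations with 5 features

# User-supplied lambda: called once per pair of rows in Python
dm_lambda = pdist(X, lambda u, v: np.sqrt(((u - v) ** 2).sum()))

# SciPy function object: also called once per pair in Python
dm_func = pdist(X, distance.euclidean)

# Metric name as a string: uses the optimized implementation
dm_name = pdist(X, 'euclidean')

print(np.allclose(dm_lambda, dm_name), np.allclose(dm_func, dm_name))
```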

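As another example, the cause-effect values between all pairs of matrix rows can be computed like this:

def causal_effect(m):
    # m: 2-D array, one observation per row (the 1 - u term suggests 0/1 entries)
    effect = lambda u, v: u.dot(v) / sum(u) - (1 - u).dot(v) / sum(1 - u)
    # pdist evaluates effect once per pair of rows i < j; squareform mirrors the result
    return spatial.distance.squareform(spatial.distance.pdist(m, metric=effect))

2. What is computed here is the pairwise distance, not the similarity. After computing the cosine distance, for instance, the similarity is 1 - distance, because the cosine distance is defined as 1 - u·v / (||u|| ||v||).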

Y = pdist(X, 'euclidean')    # d = sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)

Y = pdist(X, 'minkowski', p=p)

scipy.spatial.distance.cdist(XA, XB, metric='euclidean', p=2, V=None, VI=None, w=None)
cdist(XA, XB[, metric, p, V, VI, w])    Computes the distance between each pair of the two collections of inputs.

The simplest form of XA and XB is a 2-D array with a single row each. They must be 2-D; otherwise you get ValueError: XA must be a 2-dimensional array. In that case the result is just the metric distance between the two vectors.
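A short sketch of cdist on two small collections (XA, XB, u and v are made-up values): the result has one row per observation in XA and one column per observation in XB. The last part passes 1-D arrays on purpose to show the error mentioned above.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative collections with 2 and 3 observations
XA = np.array([[0.0, 0.0],
               [1.0, 0.0]])
XB = np.array([[0.0, 1.0],
               [3.0, 4.0],
               [1.0, 0.0]])

# D[i, j] is the distance between XA[i] and XB[j]
D = distance.cdist(XA, XB, metric='euclidean')
print(D.shape)
print(D)

# Two single vectors must be reshaped to one-row 2-D arrays first
u = np.array([3.0, 4.0])
v = np.array([0.0, 0.0])
d_uv = distance.cdist(u.reshape(1, -1), v.reshape(1, -1))
print(d_uv)

# Passing the 1-D arrays directly is rejected
try:
    distance.cdist(u, v)
except ValueError as exc:
    print("ValueError:", exc)
```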

scipy.spatial.distance.squareform(X, force='no', checks=True)
squareform(X[, force, checks])    Converts a vector-form distance vector to a square-form distance matrix, and vice-versa.

In other words, it converts the vector (condensed) form of the distances into a dense matrix, and back.

Note: when a matrix is passed in, the distance matrix 'X' must be symmetric and its diagonal must be zero.
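The round trip between the two forms can be sketched as follows, with a made-up condensed vector y of three distances (which corresponds to three observations). The last part feeds in a deliberately asymmetric matrix to show the symmetry check.

```python
import numpy as np
from scipy.spatial import distance

# Illustrative condensed distances for 3 observations: d(0,1), d(0,2), d(1,2)
y = np.array([1.0, 2.0, 3.0])

# Vector form -> square matrix with a zero diagonal
D = distance.squareform(y)
print(D)

# Square matrix -> vector form again
y_back = distance.squareform(D)
print(y_back)

# An asymmetric matrix fails the default checks
bad = np.array([[0.0, 1.0],
                [2.0, 0.0]])
try:
    distance.squareform(bad)
except ValueError as exc:
    print("ValueError:", exc)
```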


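Matrix distance examples
Example 1
import numpy as np
from scipy.spatial import distance as dis

x = np.array([[0, 2, 3],
              [2, 0, 6],
              [3, 6, 0]])
y = dis.pdist(x)        # condensed Euclidean distances between the rows of x
y
array([ 4.12310563,  5.83095189,  8.54400375])
z = dis.squareform(y)   # expand to the full symmetric matrix
z
array([[ 0.        ,  4.12310563,  5.83095189],
       [ 4.12310563,  0.        ,  8.54400375],
       [ 5.83095189,  8.54400375,  0.        ]])
type(z)
numpy.ndarray
type(y)
numpy.ndarray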

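Example 2
import numpy as np
from scipy import spatial

sim = np.array([[-2.85, -0.45], [-2.5, 1.04]])   # one vector per row
print(sim)
print(spatial.distance.cdist(sim[0].reshape((1, 2)), sim[1].reshape((1, 2)), metric='cosine'))
print(spatial.distance.pdist(sim, metric='cosine'))

Output:
[[-2.85 -0.45]
 [-2.5   1.04]]
[[ 0.14790689]]
[ 0.14790689]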


Predicates for checking the validity of distance matrices
Predicates for checking the validity of distance matrices, both condensed and redundant. This module also contains functions for computing the number of observations in a distance matrix.

is_valid_dm(D[, tol, throw, name, warning])    Returns True if input array is a valid distance matrix.
is_valid_y(y[, warning, throw, name])    Returns True if the input array is a valid condensed distance matrix.
num_obs_dm(d)    Returns the number of original observations that correspond to a square, redundant distance matrix.
num_obs_y(Y)    Returns the number of original observations that correspond to a condensed distance matrix.
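The four helpers above can be tried on a small example. This is a sketch with made-up points X; the variable names (ok_dm, asym and so on) are only for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Three illustrative points on a line; pairwise distances are 5, 10, 5
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

y = distance.pdist(X)        # condensed form
D = distance.squareform(y)   # square, redundant form

ok_dm = distance.is_valid_dm(D)      # symmetric with a zero diagonal
ok_y = distance.is_valid_y(y)        # length 3 = C(3, 2) is a valid condensed length
n_from_dm = distance.num_obs_dm(D)   # number of observations from the square matrix
n_from_y = distance.num_obs_y(y)     # number of observations from the condensed vector
print(ok_dm, ok_y, n_from_dm, n_from_y)

# Breaking the symmetry makes the matrix invalid
asym = D.copy()
asym[0, 1] = 99.0
bad_dm = distance.is_valid_dm(asym)
print(bad_dm)
```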
from:http://blog.csdn.net/pipisorry/article/details/48814183
ref: Distance computations (scipy.spatial.distance)

Spatial algorithms and data structures (scipy.spatial)

scipy-ref-0.14.0-p933
--------------------- 
 

 

II. Computing various distances in Python

from scipy.spatial.distance import pdist, squareform
The usage of each metric is annotated below, following the API documentation:
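1. Y = pdist(X, 'euclidean')
Euclidean distance between the samples (rows) of array X. The return value Y is a condensed distance vector (the same holds for the items below).
2. Y = pdist(X, 'minkowski', p=p)
Minkowski distance between the samples.
3. Y = pdist(X, 'cityblock')
Manhattan (city block) distance between the samples.
4. Y = pdist(X, 'seuclidean', V=None)
Standardized Euclidean distance between the samples. V is the variance vector, where V[i] is the variance of the i-th component; if it is omitted, it is computed automatically.
5. Y = pdist(X, 'sqeuclidean')
Squared Euclidean distance between the samples.
6. Y = pdist(X, 'cosine')
Cosine distance between the samples, given by 1 - u·v / (||u|| ||v||).
7. Y = pdist(X, 'correlation')
Correlation distance between the samples.
8. Y = pdist(X, 'hamming')
Hamming distance between the samples.
9. Y = pdist(X, 'jaccard')
Jaccard distance between the samples.
10. Y = pdist(X, 'chebyshev')
Chebyshev distance between the samples.
11. Y = pdist(X, 'canberra')
Canberra distance between the samples.
12. Y = pdist(X, 'mahalanobis', VI=None)
Mahalanobis distance between the samples.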
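Several of the metrics in this list can be run in one go. The sketch below uses random data (X, the results dictionary and the key names are made up for illustration), and it passes V and VI explicitly, computed as the per-column sample variance and the inverse covariance matrix, rather than relying on the defaults. Hamming and Jaccard are left out because they are mainly meant for discrete or boolean data.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # 6 illustrative samples with 3 features
n = X.shape[0]

results = {}
for name in ['euclidean', 'cityblock', 'sqeuclidean', 'cosine',
             'correlation', 'chebyshev', 'canberra']:
    results[name] = pdist(X, name)

# Minkowski with p=1 reduces to the city block distance
results['minkowski_p1'] = pdist(X, 'minkowski', p=1)

# Standardized Euclidean with an explicit variance vector (one value per column)
V = X.var(axis=0, ddof=1)
results['seuclidean'] = pdist(X, 'seuclidean', V=V)

# Mahalanobis with an explicit inverse covariance matrix
VI = np.linalg.inv(np.cov(X.T))
results['mahalanobis'] = pdist(X, 'mahalanobis', VI=VI)

# Every result is a condensed vector of length n*(n-1)/2
for name, y in results.items():
    print(name, y.shape)
```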
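Many other, less commonly used distances are not listed here; see the scipy.spatial.distance documentation for the full list.
Besides the named metrics, pdist also accepts a lambda expression, for example:
dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum()))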
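Second, once you have the condensed distance vector, one more step is needed:
Y = scipy.spatial.distance.squareform(X, force='no', checks=True)
Here X is the condensed distance vector Y returned by pdist above. As in MATLAB, setting force to 'tovector' or 'tomatrix' makes the input be treated as a distance matrix or as a distance vector, respectively.
checks: with the default True, the input is checked for symmetry and a zero diagonal. It can be set to False when X - X.T is known to be small and diag(X) is close to zero. The return value Y is a distance matrix in which Y[i, j] is the distance between sample i and sample j.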
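Putting the two steps together, a sketch with four made-up points shows that Y[i, j] is indeed the distance between sample i and sample j. It also shows where that pair sits in the condensed vector, using the index formula given in the SciPy documentation for pairs with i < j. The names m, i, j and k are only illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, euclidean

# Four illustrative samples
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

y = pdist(X, 'euclidean')   # condensed vector with 4*3/2 = 6 entries
Y = squareform(y)           # 4 x 4 distance matrix

# Y[i, j] is the distance between sample i and sample j
i, j = 1, 3
print(Y[i, j], euclidean(X[i], X[j]))

# Position of the pair (i, j), i < j, in the condensed vector
m = X.shape[0]
k = m * i + j - ((i + 2) * (i + 1)) // 2
print(k, y[k])

# Going back to the vector form, telling squareform explicitly what the input is
y_again = squareform(Y, force='tovector', checks=False)
print(np.allclose(y_again, y))
```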
 
