data mining notes

The dissimilarity between two objects i and j can be computed from the ratio of mismatches:

d(i,j) = (p-m)/p;

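where m is the number of matches (i.e., the number of attributes for which i and j are in the same state), and p is the total number of attributes describing the objects.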

Similarity:

sim(i,j)=1-d(i,j);
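
As a rough illustration of the mismatch ratio and the similarity derived from it, here is a minimal Python sketch. The function names `nominal_dissimilarity` and `nominal_similarity` and the sample objects are made up for this example; they do not come from any library.

```python
def nominal_dissimilarity(a, b):
    """Mismatch ratio d(i,j) = (p - m) / p for two equal-length attribute tuples."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)  # total number of attributes
    if p == 0:
        raise ValueError("objects must have at least one attribute")
    # m counts the attributes on which both objects are in the same state
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


def nominal_similarity(a, b):
    """Similarity as the complement of the mismatch ratio."""
    return 1.0 - nominal_dissimilarity(a, b)


# Hypothetical objects described by three nominal attributes
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
print(nominal_dissimilarity(obj_i, obj_j))  # one mismatch out of three attributes
print(nominal_similarity(obj_i, obj_j))
```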

 

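For symmetric binary attributes, both states are equally important. Dissimilarity based on symmetric binary attributes is called symmetric binary dissimilarity. Let q be the number of attributes that equal 1 for both i and j, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both. Then: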

d(i,j)=(r+s)/(q+r+s+t);

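For asymmetric binary attributes, the two states are not equally important. In the asymmetric binary dissimilarity, the number of negative matches t is considered unimportant and is left out: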

d(i,j)=(r+s)/(q+r+s);
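
The two binary formulas can be sketched in Python as follows. This assumes the usual contingency counts: q (both 1), r (1 for i, 0 for j), s (0 for i, 1 for j), and t (both 0). The function names are hypothetical, and raising an error when q+r+s is zero is a choice made for this sketch only.

```python
def contingency_counts(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    q = r = s = t = 0
    for x, y in zip(a, b):
        if (x, y) == (1, 1):
            q += 1
        elif (x, y) == (1, 0):
            r += 1
        elif (x, y) == (0, 1):
            s += 1
        elif (x, y) == (0, 0):
            t += 1
        else:
            raise ValueError("binary attributes must be 0 or 1")
    return q, r, s, t


def symmetric_binary_dissimilarity(a, b):
    q, r, s, t = contingency_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_dissimilarity(a, b):
    q, r, s, _ = contingency_counts(a, b)  # negative matches t are ignored
    if q + r + s == 0:
        # Assumption of this sketch: treat the all-zero case as undefined
        raise ValueError("undefined when no attribute is 1 in either object")
    return (r + s) / (q + r + s)


# Hypothetical objects with six binary attributes
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(contingency_counts(obj_i, obj_j))  # (2, 0, 1, 3)
print(symmetric_binary_dissimilarity(obj_i, obj_j))
print(asymmetric_binary_dissimilarity(obj_i, obj_j))
```

Because t is dropped from the denominator, the asymmetric value is never smaller than the symmetric one for the same pair of objects.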

Dissimilarity of numeric attributes: Euclidean distance, Manhattan distance, and Minkowski distance.

Euclidean distance: d(i,j)=sqrt(power((x1-y1),2)+power((x2-y2),2)+...+power((xn-yn),2));

Manhattan distance: d(i,j)=abs(x1-y1)+abs(x2-y2)+...+abs(xn-yn);

supremum distance (also called Chebyshev distance): the maximum absolute difference between the two objects over all dimensions, d(i,j)=max(abs(x1-y1),abs(x2-y2),...,abs(xn-yn));
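
The numeric distances above can be sketched in plain Python. The function names are chosen for this illustration only. `minkowski` takes an order parameter `h`, which is an assumed name: h=1 gives the Manhattan distance, h=2 gives the Euclidean distance, and the maximum per-dimension difference is the limit as h grows.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def supremum(x, y):
    """Largest absolute difference over all dimensions."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical two-dimensional objects
p1 = (1, 2)
p2 = (3, 5)
print(euclidean(p1, p2))  # sqrt(2**2 + 3**2)
print(manhattan(p1, p2))  # 5
print(supremum(p1, p2))   # 3
```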

-------------------------------------------------------

Weighted Euclidean distance: each attribute is given its own weight (w1, w2, ..., wn) according to its importance:

d(i,j)=sqrt(w1*power((x1-y1),2)+w2*power((x2-y2),2)+...+wn*power((xn-yn),2));
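
A minimal sketch of the weighted Euclidean distance, assuming one weight per attribute. The name `weighted_euclidean` and the sample weights are hypothetical.

```python
import math


def weighted_euclidean(x, y, weights):
    """sqrt(sum_f w_f * (x_f - y_f)**2) with one weight per attribute."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# With all weights equal to 1 this reduces to the ordinary Euclidean distance
print(weighted_euclidean((0, 0), (3, 4), (1, 1)))  # 5.0
# Doubling the importance of the first attribute changes the result
print(weighted_euclidean((0, 0), (3, 4), (2, 1)))
```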


--------------------------------------------------------


So how can we compute the dissimilarity between objects described by attributes of mixed types?

One approach is to group the attributes by type and run a separate data mining analysis for each type. In real applications, however, analyzing each attribute type separately is unlikely to produce compatible results.

A better approach is to process all attribute types together in a single analysis. One such technique combines the different attributes into a single dissimilarity matrix, bringing all meaningful attributes onto the common scale of the interval [0.0,1.0].

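Assume that the data set contains p attributes of mixed types. The dissimilarity between objects i and j is defined as

d(i,j)=(delta1*d1+delta2*d2+...+deltap*dp)/(delta1+delta2+...+deltap);

where df is the dissimilarity contributed by attribute f, scaled to [0.0,1.0], and deltaf is 0 if the value of attribute f is missing for i or j (or if f is an asymmetric binary attribute and both values are 0), and 1 otherwise.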
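
One way to sketch a combined dissimilarity over mixed attribute types is shown below. It rests on several assumptions that are not spelled out in these notes: each attribute contributes a value in [0.0,1.0]; numeric attributes are scaled by their range (max minus min); nominal and binary attributes contribute 0 on a match and 1 on a mismatch; and an attribute is skipped when a value is missing (`None`) or when it is asymmetric binary and both values are 0. The names `mixed_dissimilarity`, `kinds`, and `ranges` are hypothetical, and ordinal attributes are not handled.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity over the comparable attributes.

    kinds[f]  is "numeric", "nominal" or "asymmetric" (asymmetric binary).
    ranges[f] is (min, max) of attribute f for numeric attributes, else None.
    """
    total = 0.0
    comparable = 0  # number of attributes that actually take part
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # missing value: attribute does not count
        if kind == "asymmetric" and x == 0 and y == 0:
            continue  # negative match on an asymmetric binary attribute
        if kind == "numeric":
            lo, hi = rng
            d = abs(x - y) / (hi - lo) if hi > lo else 0.0
        else:
            d = 0.0 if x == y else 1.0
        total += d
        comparable += 1
    if comparable == 0:
        raise ValueError("no comparable attributes")
    return total / comparable


# Hypothetical objects: (color, test result, age)
kinds = ("nominal", "asymmetric", "numeric")
ranges = (None, None, (0.0, 40.0))
obj_i = ("red", 1, 20.0)
obj_j = ("blue", 0, 30.0)
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))  # (1 + 1 + 0.25) / 3 = 0.75
```

Applying this function to every pair of objects would fill the single dissimilarity matrix described above.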


-------------------------------------------------

Cosine similarity, where i*j is the dot product of the two vectors and |i| is the Euclidean norm of i:

s(i,j)=(i*j)/(|i|*|j|)=((x1*y1)+(x2*y2)+...+(xn*yn))/(sqrt(power(x1,2)+power(x2,2)+...+power(xn,2))*sqrt(power(y1,2)+power(y2,2)+...+power(yn,2)));
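
The cosine similarity can be sketched in plain Python as follows. The function name and the sample vectors are made up for this example, and raising an error for a zero vector is a choice of this sketch, since the measure is undefined when either norm is 0.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical term-frequency vectors
print(cosine_similarity((1, 0), (0, 1)))        # orthogonal vectors: 0.0
print(cosine_similarity((1, 2, 3), (2, 4, 6)))  # same direction: close to 1
```

Unlike the distance measures earlier in these notes, a larger value here means the objects are more alike.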


---------------------------------------------------

 
