R Clustering

Original

Clark Kent 2000

2020-06-16 12:42

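What is Clustering?

Clustering is a statistical technique for unsupervised learning: it creates groups in data without using labels.

Objects in the same cluster are more similar to each other than to objects in different clusters.

Applications:

Customer preferences
Gene function prediction
Personalized medicine
......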

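Hierarchical Clustering

Hierarchical clustering starts by placing every observation in its own cluster.
It then examines the distances between all pairs of observations (the distance can be computed with different metrics, such as Euclidean distance or Manhattan distance) and merges the two closest ones into a new cluster.
This process repeats, merging the two closest clusters each time, until only a single cluster remains.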
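
As a small illustration of the merging process described above, the sketch below runs `hclust()` on four made-up 2-D points (the points and their names are assumptions for illustration only) and prints the order in which they are merged. It also shows how `dist()` switches between Euclidean and Manhattan distance.

```r
# Four toy 2-D points, made up for illustration
pts <- matrix(c(0, 0,
                1, 0,
                5, 5,
                6, 5.5), ncol = 2, byrow = TRUE)
rownames(pts) <- c("a", "b", "c", "d")

d_euc <- dist(pts, method = "euclidean")  # straight-line distance
d_man <- dist(pts, method = "manhattan")  # sum of absolute coordinate differences

hc <- hclust(d_euc)  # default linkage is "complete"
hc$merge             # one row per merge; negative entries are single observations
hc$height            # distance at which each merge happened
```

In `hc$merge`, a negative number refers to an original observation and a positive number refers to the cluster formed at that earlier row, so the matrix can be read top to bottom as the merge history.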

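An example:

With hierarchical clustering, the eight points 1-8 in the figure below are grouped in the following order.

The hierarchical clustering model can be visualized as a dendrogram: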

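Advantages / Disadvantages

The advantage of hierarchical clustering is that a single run yields a whole family of nested clusterings: cutting the tree at different levels gives different numbers of groups, so we can compute several nested cluster solutions at once.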
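
A minimal sketch of what "nested" means in practice, assuming the built-in `iris` data and the default complete linkage (both are choices made here for illustration): the same tree is cut into 2 and into 3 groups, and the cross table shows that every group of the 3-cluster cut sits inside exactly one group of the 2-cluster cut.

```r
data(iris)
hc <- hclust(dist(iris[, 1:4]))  # one tree for the whole data set

cut2 <- cutree(hc, k = 2)  # cut the tree into 2 groups
cut3 <- cutree(hc, k = 3)  # cut the same tree into 3 groups

# Each column (a 3-cluster group) has counts in only one row (a 2-cluster group)
table(cut2, cut3)
```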

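One disadvantage of hierarchical clustering is that the best clusters may not be nested! Another is that it is computationally expensive!

Suppose the data contain gender (M/F) and nationality (UK, US, Australia).

The best 2-cluster solution may split the data by gender.

The best 3-cluster solution may split the data by nationality.

These clusterings are not nested, so hierarchical clustering performs poorly here.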

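K-means Clustering

1. Choose K cluster centres (at random).

2. Assign each data point to its nearest cluster centre.

3. Recompute each cluster centre as the mean of all data points assigned to it.

4. Reassign the data points to their nearest cluster centre.

5. Repeat steps 3 and 4 until no observation is reassigned or the maximum number of iterations is reached. Note: R's kmeans() uses a maximum of 10 iterations by default (iter.max = 10).

6. A heuristic proof that the algorithm converges:

In both steps (3 and 4), the sum of squared distances between the points and their centroids does not increase (by definition)
The number of possible cluster configurations is finite (though possibly very large)
Therefore the algorithm must converge (possibly to a local optimum)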
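
The steps above can be sketched by hand as follows. This is a minimal illustration, not how R's `kmeans()` is implemented internally: `simple_kmeans` is a hypothetical helper written for this example, and `iris` with `k = 3` is an assumed test case. The recorded `sse` values make the convergence argument visible, since the within-cluster sum of squares never goes up from one iteration to the next.

```r
# simple_kmeans is a hypothetical helper for illustration, not part of base R
simple_kmeans <- function(x, k, iter_max = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]    # step 1: random centres

  assign_points <- function(centers) {
    # squared Euclidean distance from every point to every centre (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    max.col(-d2, ties.method = "first")               # index of the nearest centre
  }

  cluster <- assign_points(centers)                   # step 2: initial assignment
  sse <- numeric(0)
  for (iter in seq_len(iter_max)) {
    for (j in seq_len(k)) {                           # step 3: recompute centres
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    new_cluster <- assign_points(centers)             # step 4: reassign points
    sse <- c(sse, sum((x - centers[new_cluster, , drop = FALSE])^2))
    if (all(new_cluster == cluster)) break            # step 5: nothing moved, stop
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, sse = sse)
}

set.seed(1)
fit <- simple_kmeans(iris[, 1:4], k = 3)
fit$sse  # within-cluster sum of squares after each iteration
```

For real work, the built-in `kmeans()` is the better choice; the hand-written loop is only meant to make the individual steps visible.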

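Some properties:

Compared with hierarchical clustering, K-means can handle much larger data sets.
The resulting clusters are not nested; for example, a 4-cluster solution is not produced by splitting one cluster of the 3-cluster solution.
However, the use of means implies that all variables must be continuous, and the method can be strongly affected by outliers (what can we do about outliers? Removing them or scaling the data are both worth considering).
For different starting points, the clusters the algorithm converges to are generally not unique.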
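
Two of the points above can be addressed directly in the call to `kmeans()`. The sketch below, which assumes the built-in `iris` data and 3 clusters purely for illustration, standardizes the variables with `scale()` and uses the `nstart` argument to try several random starting configurations and keep the best one.

```r
x <- scale(iris[, 1:4])  # standardize so that no single variable dominates the distance

set.seed(42)
single <- kmeans(x, centers = 3, nstart = 1)   # one random start
multi  <- kmeans(x, centers = 3, nstart = 25)  # best of 25 random starts

# Compare the total within-cluster sum of squares of the two fits
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

With more starts, the returned solution is usually at least as good as a single-start run, although this is not guaranteed for any particular seed.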

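Interpreting K-means output (this is actually important too)

Unlike hierarchical clustering, K-means requires the number of clusters to be specified in advance.
For a K-means solution, a plot of the total within-cluster sum of squares against the number of clusters is useful. (Both quantities are available in the model object that R returns.)
Looking at this plot can help suggest an appropriate number of clusters. For example, in the figure below we pick the most suitable number of clusters from the curve; by eye it might be 8. The point we are really after is the point of maximum curvature of the curve. I may add how to compute this later when I have time.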
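
A minimal sketch of such a plot, assuming the built-in `iris` data, candidate values of k from 1 to 10, and 20 random starts per fit (all choices made here for illustration):

```r
set.seed(123)
x <- scale(iris[, 1:4])
ks <- 1:10

# Total within-cluster sum of squares for each candidate number of clusters
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

The curve always decreases as k grows, so the interesting feature is where it bends and additional clusters stop helping much.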

Hierarchical clustering in R:

data(iris)
plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)],
     main="Iris Data")
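#Distance matrix on the full data set (not used below; the dendrogram is built from a 40-point sample)
d1 <- dist(as.matrix(iris[,1:4]))
#Hierarchical clustering is typically used when the number of points is not too high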
set.seed(28102019)
#sampling 40 points
chooserows <- sample(1:150, 40)
sampleiris <- iris[chooserows,]
#calculating our distance matrix
distance <- dist(sampleiris[,-5])
#preparing the hierarchical cluster
cluster <- hclust(distance)
#plotting our dendrogram, with hang=-1 to have all labels at the same level
plot(cluster, hang=-1, labels=sampleiris$Species)
iriscluster3 <- cutree(cluster,3)
table(iris$Species[chooserows],iriscluster3) #Cross tab of 3 cluster cut vs. Species

K-means clustering in R:

library(plyr)     #rename()
library(ggplot2)  #ggplot() for the plot at the end
#K-means clustering
animals <- read.csv("XXXXXX//Animals.csv",header=TRUE)
animals<-rename(animals,c(x="Weight",y="Height",z="Species"))
animals$Species <- factor(animals$Species, letters[1:4],
                          labels=c("Ostrich","Deer","Bear","Giant tortoise"))
x = animals[,-3]
y = animals$Species
x_axis <- c(2:10)
y_axis <- NULL
for (i in 1:9) {
  kc <- kmeans(x,i+1)
  #By the API doc
  #tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  y_axis[i] <- kc$tot.withinss
}

my.dataframe <- data.frame(x_axis, y_axis)
#The point of maximum curvature of the curve occurs at the value of x where k'(x)=0
out.spl <- with(my.dataframe, smooth.spline(x_axis, y_axis, df = 3))
derivative.out <- with(my.dataframe, predict(out.spl, x = x_axis, deriv = 2))
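derivative.out
#Largest second derivative, and the number of clusters at which it occurs
max(derivative.out$y)
x_axis[which.max(derivative.out$y)]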

ggplot()+
  geom_line(data = my.dataframe,aes(x_axis,y_axis),size=1)+
  geom_point(data = my.dataframe,aes(x_axis,y_axis),size=3)+
  geom_vline(aes(xintercept=4), data=my.dataframe,
             colour="#990000", linetype="dashed")+
  xlab("Number of clusters")+ylab("sum of squares")
