Cluster analysis in R: determine the optimal number of clusters

This article is a translation of: Cluster analysis in R: determine the optimal number of clusters

Being a newbie in R, I'm not very sure how to choose the best number of clusters to do a k-means analysis. After plotting a subset of the data below, how many clusters will be appropriate? How can I perform a cluster dendrogram analysis?

n = 1000
kk = 10    
x1 = runif(kk)
y1 = runif(kk)
z1 = runif(kk)    
x4 = sample(x1,length(x1))
y4 = sample(y1,length(y1)) 
randObs <- function()
{
  ix = sample( 1:length(x4), 1 )
  iy = sample( 1:length(y4), 1 )
  rx = rnorm( 1, x4[ix], runif(1)/8 )
  ry = rnorm( 1, y4[ix], runif(1)/8 )
  return( c(rx,ry) )
}  
x = c()
y = c()
for ( k in 1:n )
{
  rPair  =  randObs()
  x  =  c( x, rPair[1] )
  y  =  c( y, rPair[2] )
}
z <- rnorm(n)
d <- data.frame( x, y, z )

#1

Reference: https://stackoom.com/question/12W1D/R中的聚類分析-確定最佳聚類數


#2

If your question is how can I determine how many clusters are appropriate for a kmeans analysis of my data?, then here are some options. The Wikipedia article on determining numbers of clusters has a good review of some of these methods.

First, some reproducible data (the data in the Q are... unclear to me):

n = 100
g = 6 
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d)

(figure: scatter plot of the simulated data d)

One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:

mydata <- d
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

We might conclude that 4 clusters would be indicated by this method: (figure: SSE scree plot with an elbow around k = 4)
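
As a small, assumed tweak to the snippet above (not part of the original answer), setting a seed and using kmeans' nstart argument makes the curve less sensitive to random starts:

set.seed(123)
wss <- sapply(1:15, function(k) kmeans(mydata, centers = k, nstart = 25)$tot.withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")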

Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.

library(fpc)
pamk.best <- pamk(d)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
plot(pam(d, pamk.best$nc))

(figures: pam silhouette and cluster plots for the pamk solution)

# we could also do:
library(fpc)
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(d, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
# still 4
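
A small add-on (not part of the original answer): plotting the average silhouette width against k makes the maximum at 4 easy to see.

plot(2:20, asw[2:20], type = "b",
     xlab = "Number of clusters", ylab = "Average silhouette width")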

Three. Calinski criterion: another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.

require(vegan)
fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
# 5 clusters!

(figure: cascadeKM criterion plot)

Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models.

# See http://www.jstatsoft.org/v18/i06/paper
# http://www.stat.washington.edu/research/reports/2006/tr504.pdf
#
library(mclust)
# Run the function to see how many clusters
# it finds to be optimal, set it to search for
# at least 1 model and up to 20.
d_clust <- Mclust(as.matrix(d), G=1:20)
m.best <- dim(d_clust$z)[2]
cat("model-based optimal number of clusters:", m.best, "\n")
# 4 clusters
plot(d_clust)

(figures: mclust model-selection plots)

Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800

library(apcluster)
d.apclus <- apcluster(negDistMat(r=2), d)
cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "\n")
# 4
heatmap(d.apclus)
plot(d.apclus, d)

(figures: apcluster heatmap and cluster plot)

Six. Gap Statistic for estimating the number of clusters. See also some code for a nice graphical output. Trying 2-10 clusters here:

library(cluster)
clusGap(d, kmeans, 10, B = 100, verbose = interactive())

Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering Gap statistic ["clusGap"].
B=100 simulated reference sets, k = 1..10
 --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW   E.logW        gap     SE.sim
 [1,] 5.991701 5.970454 -0.0212471 0.04388506
 [2,] 5.152666 5.367256  0.2145907 0.04057451
 [3,] 4.557779 5.069601  0.5118225 0.03215540
 [4,] 3.928959 4.880453  0.9514943 0.04630399
 [5,] 3.789319 4.766903  0.9775842 0.04826191
 [6,] 3.747539 4.670100  0.9225607 0.03898850
 [7,] 3.582373 4.590136  1.0077628 0.04892236
 [8,] 3.528791 4.509247  0.9804556 0.04701930
 [9,] 3.442481 4.433200  0.9907197 0.04935647
[10,] 3.445291 4.369232  0.9239414 0.05055486

Here's the output from Edwin Chen's implementation of the gap statistic: (figure: gap statistic plot)
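
As an assumed alternative to Edwin Chen's code, the cluster package itself can plot the gap curve and extract the suggested k from a clusGap object:

gap <- clusGap(d, kmeans, K.max = 10, B = 100)
plot(gap, main = "Gap statistic")
# pick k with the same "firstSEmax" rule reported above
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")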

Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment; see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details. A rough sketch of the idea follows.
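
I haven't reproduced the blog's clustergram() function here; the following is just a rough, assumed sketch of the same idea in base R: run kmeans over a range of k and trace where each observation's assigned centre lands as k grows.

set.seed(1)
ks <- 2:8
# one column per k: the mean of the centre each observation is assigned to
cl.means <- sapply(ks, function(k) {
  km <- kmeans(scale(d), centers = k, nstart = 25)
  rowMeans(km$centers)[km$cluster]
})
matplot(ks, t(cl.means), type = "l", lty = 1, col = rgb(0, 0, 0, 0.1),
        xlab = "Number of clusters", ylab = "Cluster mean",
        main = "Rough clustergram-style plot")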

Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.

library(NbClust)
nb <- NbClust(d, diss = NULL, distance = "euclidean",
              method = "kmeans", min.nc = 2, max.nc = 15,
              index = "alllong", alphaBeale = 0.1)
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
# Looks like 3 is the most frequently determined number of clusters
# and curiously, four clusters is not in the output at all!

(figure: histogram of the number of clusters chosen by the NbClust indices)

If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these: http://www.statmethods.net/advstats/cluster.html http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/ And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html

Here are a few examples:

d_dist <- dist(as.matrix(d))   # find distance matrix
plot(hclust(d_dist))           # apply hierarchical clustering and plot

(figure: hierarchical clustering dendrogram)

# a Bayesian clustering method, good for high-dimension data, more details:
# http://vahid.probstat.ca/paper/2012-bclust.pdf
install.packages("bclust")
library(bclust)
x <- as.matrix(d)
d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
dptplot(d.bclus, scale = 20, horizbar.plot = TRUE,
        varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
# I just include the dendrogram here

(figure: bclust dendrogram)

Also for high-dimension data is the pvclust library, which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (it won't work on such low-dimensional data as in my example):

library(pvclust)
library(MASS)
data(Boston)
boston.pv <- pvclust(Boston)
plot(boston.pv)
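
An assumed extra step: pvclust's pvrect() can highlight, on the same dendrogram, the clusters whose approximately-unbiased p-values exceed a threshold.

pvrect(boston.pv, alpha = 0.95)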

(figure: pvclust dendrogram with p-values)

Does any of that help?


#3

It's hard to add something to such an elaborate answer. Though I feel we should mention identify here, particularly because @Ben shows a lot of dendrogram examples.

d_dist <- dist(as.matrix(d))   # find distance matrix 
plot(hclust(d_dist)) 
clusters <- identify(hclust(d_dist))

identify lets you interactively choose clusters from a dendrogram and stores your choices in a list. Hit Esc to leave interactive mode and return to the R console. Note that the list contains the indices, not the rownames (as opposed to cutree).
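
For comparison, a small sketch of the non-interactive route: cutree() on the same tree returns the memberships keyed by rownames, e.g. cutting into 4 groups:

hc <- hclust(d_dist)
groups <- cutree(hc, k = 4)
table(groups)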


#4

To determine the optimal number of clusters, I usually use the elbow method combined with parallel processing to save time. The code can look like this:

Elbow method

# elbow.k() relies on css.hclust() and elbow.batch() from the GMD package
elbow.k <- function(mydata){
  dist.obj <- dist(mydata)
  hclust.obj <- hclust(dist.obj)
  css.obj <- css.hclust(dist.obj, hclust.obj)
  elbow.obj <- elbow.batch(css.obj)
  k <- elbow.obj$k
  return(k)
}

Running the elbow method in parallel

library(parallel)                      # for detectCores, makeCluster, parSapply
no_cores <- detectCores()
cl <- makeCluster(no_cores)
clusterEvalQ(cl, library(GMD))
clusterExport(cl, list("elbow.k", "data.clustering"))  # data.clustering is your dataset
start.time <- Sys.time()
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering))
end.time <- Sys.time()
stopCluster(cl)
cat("Time to find k using the elbow method:", end.time - start.time,
    "seconds; k =", k.clusters, "\n")

It works well.


#5

Splendid answer from Ben. However, I'm surprised that the affinity propagation (AP) method has been suggested here just to find the number of clusters for the k-means method, when in general AP does a better job of clustering the data. Please see the scientific paper supporting this method in Science:

Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." Science 315.5814 (2007): 972-976.

So if you are not biased toward k-means, I suggest using AP directly, which will cluster the data without requiring you to know the number of clusters:

library(apcluster)
apclus = apcluster(negDistMat(r=2), data)
show(apclus)

If negative euclidean distances are not appropriate, then you can use other similarity measures provided in the same package. For example, for similarities based on Spearman correlations, this is what you need:

sim = corSimMat(data, method="spearman")
apclus = apcluster(s=sim)

Please note that those functions for similarities in the AP package are just provided for simplicity. In fact, the apcluster() function in R will accept any matrix of correlations. The same as before with corSimMat() can be done with this:

sim = cor(data, method="spearman")

or

sim = cor(t(data), method="spearman")

depending on what you want to cluster on your matrix (rows or columns).


#6

The answers are great. If you want to give another clustering method a chance, you can use hierarchical clustering and see how the data splits.

> set.seed(2)
> x=matrix(rnorm(50*2), ncol=2)
> hc.complete = hclust(dist(x), method="complete")
> plot(hc.complete)

(figure: complete-linkage dendrogram)

Depending on how many classes you need, you can cut your dendrogram as:

> cutree(hc.complete,k = 2)
 [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
[26] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2

If you type ?cutree you will see the definitions. If your data set has three classes it will be simply cutree(hc.complete, k = 3). The equivalent of cutree(hc.complete, k = 2) is cutree(hc.complete, h = 4.9).
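
A quick, assumed sanity check that the two calls really are equivalent for this tree:

identical(cutree(hc.complete, k = 2), cutree(hc.complete, h = 4.9))
# should be TRUE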
