Algorithm Notes: Clustering

Original link: https://wklchris.github.io/R-clustering.html

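"Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a data set. It can reduce a large number of observations to a handful of clusters. A cluster here is defined as a group of observations that are more similar to each other than to the observations in other groups." (R in Action, 2nd edition)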

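Two commonly used families of clustering methods are:

Hierarchical clustering: each data point starts as its own small cluster, and clusters are merged pairwise in a tree-like fashion until all data points form a single cluster. Common linkage methods:
1. Single linkage: the shortest distance between a point in cluster A and a point in cluster B. Tends to find elongated clusters.
2. Complete linkage: the longest distance between a point in cluster A and a point in cluster B. Tends to find compact clusters of roughly equal diameter; sensitive to outliers.
3. Average linkage: the average distance between a point in cluster A and a point in cluster B, also known as UPGMA. Tends to merge clusters with small variances.
4. Centroid: the distance between the centroids of clusters A and B, where a centroid is the vector of variable means of a cluster. Robust to outliers, but may perform somewhat worse than the other methods.
5. Ward's method (ward.D): the ANOVA sum of squares between the two clusters, summed over all variables. Tends to merge clusters with few observations and to produce clusters of roughly equal size.
Partitioning clustering: the number of clusters K is specified in advance, and the observations are then partitioned into K clusters.
1. K-means
2. Partitioning Around Medoids (PAM)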
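To make the linkage definitions above concrete, here is a minimal sketch that computes the single, complete, average, and centroid distances between two tiny clusters by hand. The points in A and B are made up for illustration and do not come from the article's data. Note that hclust() conventionally expects squared Euclidean distances for its "centroid" method, so this sketch only illustrates the definitions, not hclust()'s internals.

```r
# Two small, made-up clusters in the plane (illustrative data only)
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(4, 0), c(6, 0))

# All pairwise Euclidean distances between a point of A and a point of B
pair_d <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

single_d   <- min(pair_d)    # single linkage: closest pair
complete_d <- max(pair_d)    # complete linkage: farthest pair
average_d  <- mean(pair_d)   # average linkage (UPGMA): mean over all pairs
centroid_d <- sqrt(sum((colMeans(A) - colMeans(B))^2))  # distance between centroids

c(single = single_d, complete = complete_d,
  average = average_d, centroid = centroid_d)
```

For these points the four values are 3, 6, 4.5, and 4.5: the same pair of clusters can look close or far apart depending on the linkage chosen.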

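Clustering steps

Cluster analysis is a multistep process. A typical analysis has 11 steps.

1. Choose variables: if you are clustering experimental data, for example, think carefully about which variables are likely to affect the clustering and which ones should be left out of the analysis.
2. Scale the data: the most common approach is standardization, which transforms every variable to have mean $\bar{x} = 0$ and standard deviation $\mathrm{SD}(x) = 1$.
3. Screen for outliers: finding and removing outliers matters for some clustering methods; the R packages outliers and mvoutlier can help. Alternatively, switch to a method that is less affected by outliers, such as PAM.
4. Calculate distances: there are several ways to measure the distance between two data points; the next section covers them.
5. Select a clustering method: every method has strengths and weaknesses, so weigh them carefully.
6. Obtain one or more clustering solutions using the selected method(s).
7. Determine the number of clusters: the usual approach is to cluster with several different numbers of clusters and compare the results. The NbClust package in R provides the NbClust() function, which offers more than 30 indices.
8. Obtain the final solution.
9. Visualize: use a dendrogram for hierarchical clustering and a bivariate cluster plot for partitioning clustering.
10. Interpret each cluster: typically by computing summary statistics per cluster (for continuous data), or by reporting the mode or category distribution per cluster (if the data include categorical variables).
11. Validate: is the clustering meaningful? Does a different clustering method give similar results? The R packages fpc, clv, and clValid provide evaluation functions.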

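Distance computation: the dist() function

There are several ways to measure the distance between data points. The method argument of R's dist() function defaults to method="euclidean". The built-in distance methods are listed below, where $x_{ik}$ denotes the value of variable $k$ for data point $i$ in a data set with $n$ variables:

Euclidean distance (euclidean): the $L_2$ norm. The Euclidean distance between data points $i$ and $j$ is $d_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$.
Maximum distance (maximum): the $L_\infty$ norm. The largest absolute difference between the two points over all variables, $d_{ij} = \max_{k} |x_{ik} - x_{jk}|$; this is the limit of the Minkowski distance as $p \to \infty$.
Manhattan distance (manhattan): the $L_1$ norm, $d_{ij} = \sum_{k=1}^{n} |x_{ik} - x_{jk}|$.
Canberra distance (canberra): $d_{ij} = \sum_{k=1}^{n} \frac{|x_{ik} - x_{jk}|}{|x_{ik}| + |x_{jk}|}$.
Binary distance (binary): non-zero values are treated as 1 ("on") and zero values as 0 ("off"). The distance is the proportion of variables in which only one of the two points is on, among the variables in which at least one is on.
Minkowski distance (minkowski): the $L_p$ norm, $d_{ij} = \left( \sum_{k=1}^{n} |x_{ik} - x_{jk}|^p \right)^{1/p}$.

With $p = 1$ the Minkowski distance is the Manhattan distance; with $p = 2$ it is the Euclidean distance; as $p \to \infty$ it becomes the Chebyshev distance (the maximum distance above).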
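As a quick illustration of the method options above, the following sketch evaluates each built-in distance on two made-up observations, x and y, and checks a few of them against the formulas by hand. The vectors are illustrative assumptions and are not taken from the Iris data.

```r
# Two made-up observations with three variables each (illustrative only)
x <- c(1, 2, 3)
y <- c(4, 0, 3)
m <- rbind(x, y)

# One call per built-in method; each result is a single pairwise distance
d_euclidean <- as.numeric(dist(m, method = "euclidean"))
d_maximum   <- as.numeric(dist(m, method = "maximum"))
d_manhattan <- as.numeric(dist(m, method = "manhattan"))
d_canberra  <- as.numeric(dist(m, method = "canberra"))
d_binary    <- as.numeric(dist(m, method = "binary"))
d_minkowski <- as.numeric(dist(m, method = "minkowski", p = 3))

# The same quantities written out from their definitions
hand_euclidean <- sqrt(sum((x - y)^2))          # sqrt(9 + 4 + 0)
hand_manhattan <- sum(abs(x - y))               # 3 + 2 + 0
hand_maximum   <- max(abs(x - y))               # largest coordinate gap
hand_minkowski <- sum(abs(x - y)^3)^(1 / 3)     # p = 3

c(euclidean = d_euclidean, maximum = d_maximum, manhattan = d_manhattan,
  canberra = d_canberra, binary = d_binary, minkowski = d_minkowski)
```

Here only the second variable is "on" in one point and "off" in the other, while all three variables are on in at least one point, so the binary distance is 1/3.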

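This article uses the Iris data set, which records the length and width of sepals and petals. First, read in the data:

# Path to the data file (converted to CSV format beforehand)
datapath <- file.path(getwd(), "data", "iris.data.csv")
iris.raw <- read.csv(datapath, header = FALSE)
head(iris.raw)

# Drop the non-numeric fifth column (the species label)
iris <- iris.raw[, -5]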

V1	V2	V3	V4	V5
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3.0	1.4	0.2	Iris-setosa
4.7	3.2	1.3	0.2	Iris-setosa
4.6	3.1	1.5	0.2	Iris-setosa
5.0	3.6	1.4	0.2	Iris-setosa
5.4	3.9	1.7	0.4	Iris-setosa

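Example: Euclidean distance

R's built-in dist() function uses Euclidean distance by default, so the call below is equivalent to dist(iris, method='euclidean'). Let us compute the Euclidean distances for iris:

iris.e <- dist(iris)
# Show the Euclidean distances among the first 3 data points.
# The result is a symmetric matrix with zeros on the diagonal.
as.matrix(iris.e)[1:3, 1:3]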

#	1	2	3
1	0.0000000	0.5385165	0.509902
2	0.5385165	0.0000000	0.300000
3	0.5099020	0.3000000	0.000000
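The entries of this matrix can be reproduced by hand. The sketch below hard-codes the first three rows shown in the head() output earlier (so it does not need the CSV file) and recomputes the distance between rows 1 and 2 directly from the Euclidean formula; the variable names first3, d12, and d_mat are introduced here for illustration only.

```r
# First three observations, copied from the head() output above
first3 <- rbind(c(5.1, 3.5, 1.4, 0.2),
                c(4.9, 3.0, 1.4, 0.2),
                c(4.7, 3.2, 1.3, 0.2))

# Distance between rows 1 and 2 written out from the definition:
# sqrt(0.2^2 + 0.5^2 + 0^2 + 0^2) = sqrt(0.29)
d12 <- sqrt(sum((first3[1, ] - first3[2, ])^2))

# The same distances via dist(), expanded to a full symmetric matrix
d_mat <- as.matrix(dist(first3))

d12
d_mat
```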

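Hierarchical clustering example

The logic of hierarchical clustering (HC) is: repeatedly merge the two closest clusters into a new cluster, until all data points belong to a single cluster.

The R function for hierarchical clustering is hclust(d, method=), where d is usually the result of a dist() call.

We continue to use the Iris data from above.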
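Before moving to the Iris data, the merge-the-closest-pair logic can be seen on a toy example. This is a minimal sketch on five made-up one-dimensional points with single linkage; the data and the choice of linkage are assumptions made purely for illustration.

```r
# Five made-up points on a line (illustrative only)
x <- c(1, 2, 6, 7, 15)

# Single linkage: cluster distance = distance of the closest pair
hc <- hclust(dist(x), method = "single")

hc$merge    # which observations/clusters are joined at each step
hc$height   # the distance at which each merge happens

# Cutting the tree into 2 or 3 groups
groups2 <- cutree(hc, k = 2)
groups3 <- cutree(hc, k = 3)
groups2
groups3
```

The merge heights come out as 1, 1, 4, 8: first the pairs (1, 2) and (6, 7) join, then those two pairs join at distance 4 (from 2 to 6), and finally the point 15 joins at distance 8 (from 7 to 15).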

Standardization

Standardization is not always required, but it is a common preprocessing step.

iris.scaled <- scale(iris)
head(iris)
head(iris.scaled)

V1	V2	V3	V4
5.1	3.5	1.4	0.2
4.9	3.0	1.4	0.2
4.7	3.2	1.3	0.2
4.6	3.1	1.5	0.2
5.0	3.6	1.4	0.2
5.4	3.9	1.7	0.4

V1	V2	V3	V4
-0.8976739	1.0286113	-1.336794	-1.308593
-1.1392005	-0.1245404	-1.336794	-1.308593
-1.3807271	0.3367203	-1.393470	-1.308593
-1.5014904	0.1060900	-1.280118	-1.308593
-1.0184372	1.2592416	-1.336794	-1.308593
-0.5353840	1.9511326	-1.166767	-1.046525
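To confirm what scale() does, the sketch below standardizes the four numeric columns by hand and compares the result with scale(). It uses R's built-in datasets::iris rather than the CSV file read above, which is an assumption made so the example runs on its own; the built-in copy may differ from the CSV in a few values, so the numbers need not match the output shown above exactly.

```r
# Built-in copy of the Iris measurements (assumption: stands in for the CSV)
num <- as.matrix(datasets::iris[, 1:4])

# scale() centers each column on its mean and divides by its standard deviation
scaled <- scale(num)

# The same transformation written out explicitly
centered <- sweep(num, 2, colMeans(num))               # subtract column means
manual   <- sweep(centered, 2, apply(num, 2, sd), "/") # divide by column SDs

# After scaling, every column should have mean 0 and standard deviation 1
col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)
round(col_means, 10)
col_sds
```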

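Dendrogram and heatmap

A dendrogram is indispensable in hierarchical clustering. Euclidean distance is the usual choice; the linkage method is chosen separately.

The heatmap is optional.

# Choose the distance measure and the clustering (linkage) method
dist_method <- "euclidean"
cluster_method <- "ward.D"

iris.e <- dist(iris.scaled, method=dist_method)
iris.hc <- hclust(iris.e, method=cluster_method)
plot(iris.hc, hang=-1, cex=.8, main="Hierarchical Cluster Tree for Iris data")

# Heatmap of the distance matrix, with row and column labels suppressed
heatmap(as.matrix(iris.e), labRow = FALSE, labCol = FALSE)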

Determining the number of clusters

Use the NbClust package.

library(NbClust)
nc <- NbClust(iris.scaled, distance=dist_method, 
              min.nc=2, max.nc=10, method=cluster_method)

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 9 proposed 2 as the best number of clusters 
* 5 proposed 3 as the best number of clusters 
* 4 proposed 5 as the best number of clusters 
* 2 proposed 6 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 3 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
*******************************************************************

# Number of votes for each candidate number of clusters
table(nc$Best.nc[1,])
barplot(table(nc$Best.nc[1,]), xlab="Number of Clusters", ylab="Number of Supporting Indices",
        main="Determining the Number of Clusters")

 0  2  3  5  6  8 10 
 2  9  5  4  2  1  3
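The majority rule reported by NbClust can be reproduced from this vote table. The sketch below hard-codes the counts printed above (so it runs without the NbClust package) and picks the number of clusters with the most votes. The "0" entry appears to correspond to the two graphical indices (Hubert and D index), which do not propose a number; dropping it is an assumption of this sketch.

```r
# Vote counts copied from the table above: names are the proposed
# numbers of clusters, values are how many indices proposed each one
votes <- c("0" = 2, "2" = 9, "3" = 5, "5" = 4, "6" = 2, "8" = 1, "10" = 3)

# Assumption: "0" means "no proposal" (graphical indices), so exclude it
real_votes <- votes[names(votes) != "0"]

# Majority rule: the candidate with the most supporting indices
best_k <- as.integer(names(which.max(real_votes)))
best_k
```

With these counts, 2 clusters wins with 9 votes, ahead of 3 clusters with 5 votes, which matches the conclusion printed by NbClust.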

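Completing the clustering

The plot above points to two clusters, but the original data contain three species. The cutree() call below and the final plotting call therefore use 3 clusters.

Note: the number of classes is known to be 3. Given that, HC with ward.D gives the best clustering here, even though the NbClust vote still recommends 2 clusters. With complete linkage the NbClust vote recommends 3 clusters, yet the clustering itself is worse than with Ward's method.

cluster_num <- 3
clusters <- cutree(iris.hc, k=cluster_num)
table(clusters)  # number of observations in each cluster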

clusters
 1  2  3 
49 74 27

# Median of each variable within each cluster
aggregate(iris, by=list(cluster=clusters), median)

cluster	V1	V2	V3	V4
1	5.0	3.4	1.5	0.20
2	6.0	2.8	4.5	1.45
3	6.9	3.1	5.8	2.20
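Since the true species are known for this data set, one simple way to judge the clusters is to cross-tabulate them against the species labels. The sketch below repeats the Ward clustering on R's built-in datasets::iris so that it runs on its own; using the built-in copy instead of the CSV file is an assumption, and its counts may differ slightly from those shown above.

```r
# Built-in copy of the Iris data (assumption: stands in for the CSV file)
flowers <- datasets::iris

# Same pipeline as in the article: scale, Euclidean distance, Ward linkage
scaled <- scale(flowers[, 1:4])
hc     <- hclust(dist(scaled, method = "euclidean"), method = "ward.D")
groups <- cutree(hc, k = 3)

# Rows: true species; columns: cluster assigned by cutree()
confusion <- table(species = flowers$Species, cluster = groups)
confusion
```

A species that is recovered well shows up as a row with nearly all of its 50 flowers in a single column; rows spread over two columns indicate species that the clustering mixes together.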

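# Draw rectangles around the clusters on the dendrogram
plot(iris.hc, hang=-1, cex=.8, main="Hierarchical Cluster Tree for Iris data")
rect.hclust(iris.hc, k=cluster_num)

Visualizing the result with MDS

Multidimensional scaling (MDS) is used for visualization. The three species in the original data are drawn with three different point shapes, and the clustering result is shown by color.

The setosa species is clustered very successfully, but some virginica flowers are wrongly clustered together with versicolor.

# Classical MDS: project the distance matrix onto 2 dimensions
mds <- cmdscale(iris.e, k=2, eig=TRUE)
x <- mds$points[,1]
y <- mds$points[,2]

# Color = cluster from cutree(), shape = true species label
library(ggplot2)
p <- ggplot(data.frame(x,y), aes(x,y))
p + geom_point(size=3, alpha=0.8, aes(colour=factor(clusters),
               shape=iris.raw[,5]))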
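A two-dimensional MDS plot is only a projection, so it is worth checking how much of the distance structure it keeps. The sketch below, which assumes R's built-in datasets::iris in place of the CSV file, looks at the goodness-of-fit values returned by cmdscale() for k = 2 and verifies that with k = 4 (the number of variables) classical MDS reproduces the Euclidean distances essentially exactly.

```r
# Built-in copy of the Iris measurements (assumption: stands in for the CSV)
scaled <- scale(datasets::iris[, 1:4])
d <- dist(scaled)

# 2-D embedding, as used for the plot; eig = TRUE also returns GOF
mds2 <- cmdscale(d, k = 2, eig = TRUE)
gof2 <- mds2$GOF   # share of the eigenvalue mass captured by 2 dimensions

# 4-D embedding: the data have 4 variables, so nothing should be lost
mds4 <- cmdscale(d, k = 4)
max_err <- max(abs(as.vector(dist(mds4)) - as.vector(d)))

gof2
max_err
```

A goodness-of-fit value close to 1 suggests the 2-D picture is a faithful view of the distances; a noticeably lower value would mean that points which look close in the plot may in fact be further apart.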

Appendix: HC plots with complete linkage

For comparison. With complete linkage the NbClust vote gives 3 clusters; the output is not listed here.

dist_method <- "euclidean"
cluster_method <- "complete"

iris.e <- dist(iris.scaled, method=dist_method)
iris.hc <- hclust(iris.e, method=cluster_method)

cluster_num <- 3
clusters <- cutree(iris.hc, k=cluster_num)
table(clusters)  # number of observations in each cluster

# Draw rectangles around the clusters on the dendrogram
plot(iris.hc, hang=-1, cex=.8, main="Hierarchical Cluster Tree for Iris data")
rect.hclust(iris.hc, k=cluster_num)

clusters
 1  2  3 
49 24 77

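# Classical MDS on the same distances, now colored by the complete-linkage clusters
mds <- cmdscale(iris.e, k=2, eig=TRUE)
x <- mds$points[,1]
y <- mds$points[,2]

p <- ggplot(data.frame(x,y), aes(x,y))
p + geom_point(size=3, alpha=0.8, aes(colour=factor(clusters),
               shape=iris.raw[,5]))

As the plot shows, setosa is clustered very well, whereas the results for virginica and versicolor are dreadful.

This article draws heavily on:

Chapter 16 of R in Action, 2nd edition.
This web page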
