Machine Learning: K-Means Clustering Algorithm (K-Means)

Original article

2020-02-21 08:31

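1. Introduction

Clustering is a typical unsupervised learning technique, used mainly to group similar samples into the same cluster automatically, without any labels.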

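2. Algorithm Idea

Step 1: Decide how many clusters to form, denoted by the constant K, and randomly select K samples as the initial cluster centers.
Step 2: Compute the distance from each sample to each of the K cluster centers and assign the sample to the nearest center, forming K clusters.
Step 3: Compute the mean of the samples in each of the K clusters and use it as the new cluster center.
Step 4: Repeat steps 2 and 3.
Step 5: When the cluster centers stop moving (or the specified number of iterations is reached), clustering is complete.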
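To make the assignment and update steps concrete, here is a minimal sketch of a single iteration on four hand-checkable points. The helper name `kmeans_step` and the `*_demo` variables are assumptions made for this illustration only; they are not part of the article's code.

```python
import numpy as np

def kmeans_step(X, centers):
    """One assignment + update iteration of K-Means (illustrative helper)."""
    # Squared Euclidean distance from every sample to every center, shape (n_samples, K)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Assignment: index of the nearest center for each sample
    labels = d2.argmin(axis=1)
    # Update: mean of the samples assigned to each center
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members) > 0:  # an empty cluster keeps its old center
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers

# Two obvious groups: around x = 0 and around x = 10
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers_demo = X_demo[[0, 2]].copy()  # K = 2, starting from two of the samples
labels_demo, centers_demo = kmeans_step(X_demo, centers_demo)
print(labels_demo)   # [0 0 1 1]
print(centers_demo)  # centers move to (0, 1) and (10, 1)
```

Calling `kmeans_step` again on the updated centers leaves them unchanged, which is the stopping condition described in the last step.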

3. Generating Sample Data

Use sklearn's make_blobs to generate the sample data:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

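K = 4
n_samples = 100  # 100 samples
X, y = make_blobs(n_samples=n_samples,  # 100 samples
                  n_features=2,         # 2 features per sample
                  centers=K             # K centers
                  )
plt.scatter(
    X[:, 0],  # first feature as the x coordinate
    X[:, 1],  # second feature as the y coordinate
    c=y       # color by class label; samples with the same label share a color
)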
plt.show()
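Because make_blobs draws random points, every run of the snippet above produces a different data set. A small sketch, assuming you want repeatable results, is to pass `random_state` (the value 0 and the `*_demo` names are arbitrary choices for illustration) and check the shapes of what comes back:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Fixing random_state makes the generated data identical on every run
X_demo, y_demo = make_blobs(n_samples=100, n_features=2, centers=4, random_state=0)
X_again, y_again = make_blobs(n_samples=100, n_features=2, centers=4, random_state=0)

print(X_demo.shape)                     # (100, 2): one row per sample, one column per feature
print(y_demo.shape)                     # (100,): one true label per sample
print(np.unique(y_demo))                # [0 1 2 3]: labels run from 0 to centers - 1
print(np.array_equal(X_demo, X_again))  # True
```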

4. Implementation from Scratch

import numpy as np

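# Step 1: choose the initial cluster centers; here we take the first K samples of X
cluster_centers = np.copy(X[0:K])
print('cluster_centers=', cluster_centers)
print('cluster_centers_type=', y[0:K])  # true labels of the samples used as initial centers

# Cluster index of every sample in X, initialized to -1 (not yet assigned)
X_type = np.full(n_samples, -1)

# Iteration counter, capped at max_iter (an arbitrary limit) so the loop always terminates
max_iter = 100
count = 1
while count <= max_iter:
    # Step 2: compute the distance from each sample to every cluster center and assign it to the nearest
    for n in range(n_samples):
        Min_L = -1
        for i in range(K):
            # Euclidean distance from the sample to the cluster center
            L = ((X[n][0]-cluster_centers[i][0])**2 + (X[n][1]-cluster_centers[i][1])**2)**0.5
            # First center seen, or a closer one: update the cluster index and record the distance
            if Min_L == -1 or Min_L > L:
                X_type[n] = i
                Min_L = L

    # Step 3: compute the mean of the samples in each of the K clusters as the new cluster center
    # Accumulate the coordinate sums (columns 0-1) and the sample count (column 2) of each cluster
    sums = np.zeros([K, 3])
    for n in range(n_samples):
        sums[X_type[n]][0:2] += X[n]
        sums[X_type[n]][2] += 1

    # Remember the previous cluster centers
    last_cluster_centers = np.copy(cluster_centers)

    # Use the mean as the new cluster center; an empty cluster keeps its old center
    for n in range(K):
        if sums[n][2] > 0:
            cluster_centers[n] = sums[n][0:2] / sums[n][2]

    # Stop once the cluster centers no longer change
    if (last_cluster_centers == cluster_centers).all():
        print('count:', count)
        break
    count += 1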

plt.scatter(
    X[:, 0],   # first feature as the x coordinate
    X[:, 1],   # second feature as the y coordinate
    c=X_type   # color by assigned cluster; samples in the same cluster share a color
)
plt.scatter(cluster_centers[:,0],cluster_centers[:,1],s=200,marker='x',c='red')
plt.show()

Output:

cluster_centers= [[ -9.55338872  -1.07204676]
 [ -7.72118512   4.60945541]
 [ -8.87714205  -4.04370297]
 [-10.20593433  -2.53019515]]
cluster_centers_type= [3 1 3 3]
count: 5

5. Implementation with sklearn

from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=K).fit(X)

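print('cluster.cluster_centers_=', cluster.cluster_centers_)  # coordinates of each cluster center

print('cluster.inertia_=', cluster.inertia_)  # sum of squared distances from each sample to its nearest cluster center

plt.scatter(
    X[:, 0],  # first feature as the x coordinate
    X[:, 1],  # second feature as the y coordinate
    c=y       # color by the true label from make_blobs; use cluster.labels_ to color by predicted cluster
)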
plt.scatter(cluster.cluster_centers_[:,0],cluster.cluster_centers_[:,1],s=200,marker='x',c='red')
plt.show()

Output:

cluster.cluster_centers_= [[-8.41314315 -1.85636774]
 [-0.95072532  7.87723591]
 [ 0.37558634 -6.38892274]
 [-7.98962683  3.91523476]]
cluster.inertia_= 217.84359295947405
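To see what `inertia_` measures, the sketch below recomputes it by hand as the sum of squared Euclidean distances from each sample to the center of its assigned cluster. The data set, `random_state=0`, `n_init=10`, and the names `model` and `manual_inertia` are assumptions chosen for this illustration, so the printed value will differ from the one above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=100, n_features=2, centers=4, random_state=0)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# Look up, for every sample, the center of the cluster it was assigned to
assigned_centers = model.cluster_centers_[model.labels_]
# Sum of squared distances between each sample and that center
manual_inertia = ((X_demo - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)  # the two values agree up to floating-point error
```

A lower inertia means the samples sit closer to their centers, but it always decreases as K grows, so it is only comparable between runs with the same K.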

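6. Distance Formulas

Euclidean distance (the ordinary geometric distance in a coordinate system):
$d(x,\mu)=\sqrt{\sum_{i=1}^n(x_i-\mu_i)^2}$

Manhattan distance (absolute-value distance):
$d(x,\mu)=\sum_{i=1}^n|x_i-\mu_i|$

Cosine distance, based on the cosine of the angle between the two vectors (the cosine similarity); the distance itself is commonly taken as $1-\cos\theta$:
$\cos\theta=\frac{\sum_{i=1}^n x_i\mu_i}{\sqrt{\sum_{i=1}^n x_i^2}\,\sqrt{\sum_{i=1}^n \mu_i^2}}$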
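The three measures can be checked numerically with a few lines of NumPy. This is a minimal sketch; the function names and the two example vectors are assumptions made for illustration.

```python
import numpy as np

def euclidean(x, mu):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - mu) ** 2))

def manhattan(x, mu):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(x - mu))

def cosine_similarity(x, mu):
    # Dot product divided by the product of the vector lengths (undefined for a zero vector)
    return np.dot(x, mu) / (np.linalg.norm(x) * np.linalg.norm(mu))

x_demo = np.array([3.0, 4.0])
mu_demo = np.array([4.0, 3.0])

print(euclidean(x_demo, mu_demo))              # sqrt(2), about 1.414
print(manhattan(x_demo, mu_demo))              # 2.0
print(cosine_similarity(x_demo, mu_demo))      # 24 / 25 = 0.96
print(1 - cosine_similarity(x_demo, mu_demo))  # cosine distance, about 0.04
```

Note that the from-scratch implementation in section 4 uses the Euclidean distance, and sklearn's KMeans also minimizes squared Euclidean distances.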


李乾文 (Blog Expert)
