I. How the K-means Algorithm Works

Basic algorithm

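K-means is one of the most widely used clustering algorithms. Its input is a set of samples (also called a point set); the algorithm partitions the samples into clusters so that samples with similar features end up in the same cluster.
Algorithm steps:
step1: choose the number of clusters k and pick k initial center points
step2: "find a home" for every sample: assign each sample point to the nearest of the k center points (by distance)
step3: recompute each center point as the mean of the samples assigned to it
step4: check whether the center points changed; if they did, repeat from step2, otherwise stop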
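
The four steps above can be sketched compactly with NumPy. This is only a minimal illustration under stated assumptions, not the implementation walked through in section II: the function name `kmeans_sketch`, the `max_iter` safety cap, and the toy points are all made up for the example.

```python
import numpy as np


def kmeans_sketch(points, init_centers, max_iter=100):
    """Plain K-means iteration: assign, recompute, stop when centers stop moving."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()  # step1: given by the caller
    for _ in range(max_iter):
        # step2: distance from every point to every center, shape (N, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step3: each center becomes the mean of its points
        # (a center with no points keeps its previous position)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # step4: stop once no center moved
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# two obvious groups, k = 2
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans_sketch(pts, init_centers=[[0, 0], [10, 10]])
print(centers)
print(labels)
```

The first three points end up in cluster 0 and the last three in cluster 1, with each center at the mean of its group.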

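Choosing the initial center points

The choice of initial center points has a large effect on the clustering result: different initial centers can lead to different final clusters. So how should they be chosen? One guiding principle is:

The initial center points should be far apart from one another. A strategy that follows from this is: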

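step1: compute the distance between every pair of sample points, take the pair with the largest distance (two samples C1, C2) as the first 2 initial center points, and remove these two points from the sample set.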

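step2: if there are now k initial center points, stop. Otherwise pick a point C3 from the remaining samples; in line with the principle above, its objective is to lie far from both C1 and C2.

This is a two-objective optimization problem. It can be handled by constraining one objective and optimizing the other, which yields a suitable point C3 to serve as the 3rd initial center point.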

A 4th initial center point, if needed, is found the same way as the 3rd.
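
One simple way to turn the two objectives into a single one is the maximin rule: among the remaining points, take the one whose distance to its nearest already-chosen center is largest. This is a swapped-in heuristic and an assumption, not necessarily the exact objective the text has in mind; the function name `farthest_point_init` and the sample points are hypothetical.

```python
import numpy as np


def farthest_point_init(points, k):
    """Pick k initial centers that are far apart (maximin heuristic, an assumption)."""
    points = np.asarray(points, dtype=float)
    # pairwise distance matrix, shape (N, N)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # step1: the two points that are farthest from each other
    i, j = np.unravel_index(dist.argmax(), dist.shape)
    chosen = [int(i), int(j)]
    # step2: keep adding the point whose nearest chosen center is farthest away
    while len(chosen) < k:
        min_dist = dist[:, chosen].min(axis=1)
        min_dist[chosen] = -1.0  # never pick the same point twice
        chosen.append(int(min_dist.argmax()))
    return points[chosen[:k]]


pts = [[0, 0], [1, 0], [0, 1], [10, 0], [5, 8]]
initial = farthest_point_init(pts, 3)
print(initial)
```

For these five points the farthest pair is (0, 1) and (10, 0), and the third pick is (5, 8), which is far from both. The pairwise distance matrix costs O(N^2) memory, so this sketch only suits small sample sets.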

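Sum of Squared Errors (SSE)

The sum of squared errors, SSE, is used to evaluate how good a clustering result is. SSE is defined as follows.

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

Here $C_i$ is the i-th cluster and $\mu_i$ is its center point.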

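In general, the larger k is, the smaller SSE becomes. In the extreme case k = N = number of samples, every point forms its own cluster, the center of each cluster is the single point it contains, and so SSE = 0.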
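
As a small check of this behavior, the sketch below computes SSE as the sum, over all points, of the squared distance to the center of the point's cluster. The helper name `sse` and the four sample points are assumptions chosen so the numbers are easy to verify by hand.

```python
import numpy as np


def sse(points, centers, labels):
    """Sum of squared distances from each point to the center of its cluster."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = points - centers[np.asarray(labels)]  # one row per point
    return float((diffs ** 2).sum())


pts = [[0, 0], [0, 2], [4, 0], [4, 2]]

# k = 1: a single cluster centered at the mean of all points
sse_k1 = sse(pts, centers=[[2, 1]], labels=[0, 0, 0, 0])
# k = 2: left pair and right pair, each centered at its own mean
sse_k2 = sse(pts, centers=[[0, 1], [4, 1]], labels=[0, 0, 1, 1])
# k = N: every point is its own center
sse_kN = sse(pts, centers=pts, labels=[0, 1, 2, 3])

print(sse_k1, sse_k2, sse_kN)  # 20.0 4.0 0.0
```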

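Determining k

In practice k is usually not large, roughly between 2 and 10, so one can plot SSE against k over this range and pick the elbow (the point where the curve bends) as a suitable k.

As the curve for this example shows, SSE decreases only slowly once k goes past 5, so the best value is k = 5.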
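
The SSE-k curve can be produced along the following lines. This is a sketch under assumptions: the data here is synthetic (five made-up blobs, not the data behind the curve discussed above), and it relies on scikit-learn's `KMeans`, whose `inertia_` attribute is the sum of squared distances of the samples to their closest center, i.e. the SSE.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# hypothetical test data: 5 well-separated blobs of 60 points each
blob_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]])
data = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in blob_centers])

ks = list(range(2, 11))
sse_values = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse_values.append(model.inertia_)  # SSE of the fitted model

for k, value in zip(ks, sse_values):
    print(k, round(value, 1))
```

To see the elbow, plot `ks` against `sse_values` (for example with `plt.plot(ks, sse_values, 'o-')`) and look for the k after which the curve flattens.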

II. Python Implementation of the Basic Algorithm

# K-means Algorithm is a clustering algorithm
import numpy as np
import matplotlib.pyplot as plt
import random


def get_distance(p1, p2):
    diff = [x - y for x, y in zip(p1, p2)]
    distance = np.sqrt(sum(map(lambda x: x ** 2, diff)))
    return distance


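# Compute the center (coordinate-wise mean) of a list of points
def calc_center_point(cluster):
    N = len(cluster)

    m = np.array(cluster).transpose().tolist()  # shape (dims, N), i.e. (2, N) for 2-D points

    center_point = [sum(x) / N for x in m]  # average each coordinate (x and y) separately
    return center_point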


# Return True if the two points are identical, False if any coordinate differs
def check_center_diff(center, new_center):
    for c, nc in zip(center, new_center):
        if c != nc:
            return False
    return True


# Implementation of the K-means algorithm
def K_means(points, center_points):
    N = len(points)  # number of samples
    n = len(points[0])  # dimension of a single sample
    k = len(center_points)  # number of clusters

    tot = 0
    while True:  # iterate
        temp_center_points = []  # recomputed center points

        clusters = []  # clustering result: one list of samples per center
        for c in range(0, k):
            clusters.append([])  # initialize

        # For every point, find the nearest center point ("find a home")
        for i, data in enumerate(points):
            distances = []
            for center_point in center_points:
                distances.append(get_distance(data, center_point))

            index = distances.index(min(distances))  # index of the center with the smallest distance
            clusters[index].append(data)  # add the sample to the cluster that this center represents


        tot += 1
        print('Epoch:{} Clusters:{}'.format(tot, len(clusters)))
        k = len(clusters)
        colors = ['r.', 'g.', 'b.', 'k.', 'y.']  # color and marker style per cluster
        if tot <= 6:  # the 2x3 subplot grid only has room for the first 6 epochs
            plt.subplot(2, 3, tot)
            for i, cluster in enumerate(clusters):
                data_x = [x[0] for x in cluster]
                data_y = [x[1] for x in cluster]
                plt.plot(data_x, data_y, colors[i % len(colors)])
            plt.axis([0, 1000, 0, 1000])

        # Recompute the center points (this step could be swapped with the convergence check below)
        for cluster in clusters:
            temp_center_points.append(calc_center_point(cluster))

        # A cluster that received no samples in this epoch keeps its previous center point
        for j in range(0, k):
            if len(clusters[j]) == 0:
                temp_center_points[j] = center_points[j]


        # Check whether any center point changed, i.e. whether the clustering changed in this epoch
        for c, nc in zip(center_points, temp_center_points):
            if not check_center_diff(c, nc):
                center_points = temp_center_points[:]  # make a copy
                break
        else:  # nothing changed: leave the loop, clustering is finished
            break
    plt.show()
    return clusters, temp_center_points  # return the clusters and the final center points


# Generate a random sample set for testing the K-means algorithm
def get_test_data():
    N = 1000

    # Regions in which points are generated: [x_min, x_max, y_min, y_max]
    # (integer division, because random.randint needs integer bounds)
    area_1 = [0, N // 4, N // 4, N // 2]
    area_2 = [N // 2, 3 * N // 4, 0, N // 4]
    area_3 = [N // 4, N // 2, N // 2, 3 * N // 4]
    area_4 = [3 * N // 4, N, 3 * N // 4, N]
    area_5 = [3 * N // 4, N, N // 4, N // 2]

    areas = [area_1, area_2, area_3, area_4, area_5]


    # Generate some random points inside each region
    points = []
    for area in areas:
        rnd_num_of_points = random.randint(50, 200)
        for r in range(0, rnd_num_of_points):
            rnd_add = random.randint(0, 100)
            rnd_x = random.randint(area[0] + rnd_add, area[1] - rnd_add)
            rnd_y = random.randint(area[2], area[3] - rnd_add)
            points.append([rnd_x, rnd_y])

    # Hand-picked center points: the target is 5 clusters, so 5 centers are given
    # ([500, 250] is listed twice, so one of the two receives no points in the first epoch)
    center_points = [[0, 250], [500, 500], [500, 250], [500, 250], [500, 750]]

    return points, center_points


if __name__ == '__main__':

    points, center_points = get_test_data()
    clusters, temp_center_points = K_means(points, center_points)
    print('####### Final result ##########')
    # for i, cluster in enumerate(clusters):  # print all points
    #     print('cluster ', i, ' ', cluster)
    print('Final center points:')
    print(temp_center_points)

Python in practice

The data:

1.658985	4.285136
-3.453687	3.424321
4.838138	-1.151539
-5.379713	-3.362104
0.972564	2.924086
-3.567919	1.531611
0.450614	-3.302219
-3.487105	-1.724432
2.668759	1.594842
-3.156485	3.191137
3.165506	-3.999838
-2.786837	-3.099354
4.208187	2.984927
-2.123337	2.943366
0.704199	-0.479481
-0.392370	-3.963704
2.831667	1.574018
-0.790153	3.343144
2.943496	-3.357075
-3.195883	-2.283926
2.336445	2.875106
-1.786345	2.554248
2.190101	-1.906020
-3.403367	-2.778288
1.778124	3.880832
-1.688346	2.230267
2.592976	-2.054368
-4.007257	-3.207066
2.257734	3.387564
-2.679011	0.785119
0.939512	-4.023563
-3.674424	-2.261084
2.046259	2.735279
-3.189470	1.780269
4.372646	-0.822248
-2.579316	-3.497576
1.889034	5.190400
-0.798747	2.185588
2.836520	-2.658556
-3.837877	-3.253815
2.096701	3.886007
-2.709034	2.923887
3.367037	-3.184789
-2.121479	-4.232586
2.329546	3.179764
-3.284816	3.273099
3.091414	-3.815232
-3.762093	-2.432191
3.542056	2.778832
-1.736822	4.241041
2.127073	-2.983680
-4.323818	-3.938116
3.792121	5.135768
-4.786473	3.358547
2.624081	-3.260715
-4.009299	-2.978115
2.493525	1.963710
-2.513661	2.642162
1.864375	-3.176309
-3.171184	-3.572452
2.894220	2.489128
-2.562539	2.884438
3.491078	-3.947487
-2.565729	-2.012114
3.332948	3.983102
-1.616805	3.573188
2.280615	-2.559444
-2.651229	-3.103198
2.321395	3.154987
-1.685703	2.939697
3.031012	-3.620252
-4.599622	-2.185829
4.196223	1.126677
-2.133863	3.093686
4.668892	-2.562705
-2.793241	-2.149706
2.884105	3.043438
-2.967647	2.848696
4.479332	-1.764772
-4.905566	-2.911070

# coding:utf-8

import numpy as np
import matplotlib.pyplot as plt


def loadDataSet(fileName):
    '''
    Load the test data set; returns a list whose elements are coordinates
    '''
    dataList = []
    with open(fileName) as fr:
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataList.append(fltLine)
    return dataList


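def randCent(dataSet, k):
    '''
    Randomly generate k initial centroids inside the bounding box of the data
    '''
    n = np.shape(dataSet)[1]  # n is the dimension of the data set
    centroids = np.asmatrix(np.zeros((k, n)))  # np.asmatrix replaces np.mat, which NumPy 2.0 removed

    for j in range(n):
        minJ = float(dataSet[:, j].min())
        rangeJ = float(dataSet[:, j].max()) - minJ
        centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1)
    return centroids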


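def kMeans(dataSet, k):
    '''
    K-means algorithm; returns the final centroid coordinates and the cluster of every point
    '''
    m = np.shape(dataSet)[0]  # m is the number of samples in the data set
    clusterAssment = np.asmatrix(np.zeros((m, 2)))  # each row stores (cluster index, squared distance)
    # step1: choose the initial centroids
    centroids = randCent(dataSet, k)  # coordinates of the k initial centroids

    clusterChanged = True
    iterIndex = 1  # iteration counter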

    while clusterChanged:
        clusterChanged = False

        # step2: assign every sample to its nearest centroid
        for i in range(m):  # loop over every sample
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = np.linalg.norm(np.array(centroids[j, :])-np.array(dataSet[i, :]))
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j

            if clusterAssment[i, 0] != minIndex:  # did this sample's cluster change? if any sample changed, keep iterating
                clusterChanged = True

            clusterAssment[i, :] = minIndex, minDist**2
        print("Iteration %d, coordinates of the %d centroids:\n%s" % (iterIndex, k, centroids))  # in iteration 1 these are the initial centroids
        iterIndex = iterIndex + 1

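        # step3: recompute each centroid as the mean of its points
        old_centroids = centroids.copy()
        for cent in range(k):
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]  # all points in this cluster
            if len(ptsInClust) > 0:  # an empty cluster keeps its old centroid instead of becoming NaN
                centroids[cent, :] = np.mean(ptsInClust, axis=0)

        # Second check: also keep iterating if any centroid moved
        if (centroids == old_centroids).all():
            pass
        else:
            clusterChanged = True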

    return centroids, clusterAssment


def showCluster(dataSet, k, centroids, clusterAssment):
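    '''
    Visualize the result; only 2-D data can be plotted (for other dimensions, e.g. 3-D, the function returns 1 immediately)
    This plotting method is crude and will be updated later
    '''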
    numSamples, dim = dataSet.shape

    if dim != 2:
        return 1

    mark = ['or', 'ob', 'og', 'ok', 'oy', 'om', 'oc', '^r', '+r', 'sr', 'dr', '<r', 'pr']

    # draw all samples
    for i in range(numSamples):

        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    mark = ['Pr', 'Pb', 'Pg', 'Pk', 'Py', 'Pm', 'Pc', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize = 12)

    plt.show()


if __name__ == '__main__':
    dataMat = np.asmatrix(loadDataSet('./testSet'))  # np.asmatrix replaces np.mat, which NumPy 2.0 removed
    k = 4  # chosen value of k
    cent, clust = kMeans(dataMat, k)

    showCluster(dataMat, k, cent, clust)

Output:

III. Implementation with sklearn

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import Voronoi, voronoi_plot_2d

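# This data set represents a toy manufacturer's product data:
#
# The first value identifies a toy:
#    0-2: action figures
#    3-5: building blocks
#    6-8: cars
#
# The second value is the age group that buys the toy most often: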
#    0: 5 year-olds
#    1: 6 year-olds
#    2: 7 year-olds
#    3: 8 year-olds
#    4: 9 year-olds
#    5: 10 year-olds

x = np.array([[0, 4], [1, 3], [2, 5], [3, 2], [4, 0], [5, 1], [6, 4], [7, 5], [8, 3]])

# model
kmeans = KMeans(n_clusters=3, random_state=0).fit(x)

# Plot the data
sns.set_style("darkgrid")
plt.scatter(x[:, 0], x[:, 1], c=kmeans.labels_, cmap=plt.get_cmap("winter"))

# The following draws the boundary lines between clusters
# Save the axes limits of the current figure
x_axis = plt.gca().get_xlim()
y_axis = plt.gca().get_ylim()

# Draw cluster boundaries and centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], marker='x')
vor = Voronoi(centers)
voronoi_plot_2d(vor, ax=plt.gca(), show_points=False, show_vertices=False)

# Resize figure as needed
plt.gca().set_xlim(x_axis)
plt.gca().set_ylim(y_axis)

# Remove ticks from the plot
plt.xticks([])
plt.yticks([])

plt.tight_layout()
plt.show()