Kmeans

Algorithm overview

The k-means algorithm is one of the simplest and most popular clustering algorithms in machine learning. It takes the data points and the number of clusters (k) as input.

Next, it randomly places k points, called centroids, in the data space. Once the k centroids have been initialized, the following two steps are repeated until the set of k centroids no longer changes:

  • Assignment of points to centroids: Every data point is assigned to the centroid closest to it. The data points assigned to a particular centroid form a cluster, so assigning the points to k centroids produces k clusters.
  • Update of centroids: The centroid of every cluster is recomputed as the center of that cluster, that is, the average of all the points in it. The data points are then reassigned to the new centroids in the next assignment step.
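
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation given later in this post: the function names (assign_points, update_centroids, kmeans_sketch), the max_iter cap, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import numpy as np

def assign_points(X, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d); summing over d gives (n, k)
    sq_dist = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return sq_dist.argmin(axis=1)

def update_centroids(X, labels, centroids):
    """Update step: move each centroid to the mean of its cluster."""
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = X[labels == i]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

def kmeans_sketch(X, k, seed=0, max_iter=100):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign_points(X, centroids)
        new_centroids = update_centroids(X, labels, centroids)
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of three points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids_demo, labels_demo = kmeans_sketch(X_demo, k=2)
```

On this toy input the first three points should end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.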

[Figure: K-means demonstration]

Code

import numpy as np
import matplotlib.pyplot as plt

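class Kmeans:
    """K-means implemented with Python and NumPy"""
    def __init__(self, k_):
        self.k = k_         # k is the requested number of clusters
        self.threhold = 1e-10   # convergence threshold on the movement of the cluster centers
        self.last_k_cluster = None   # cluster centers from the previous iteration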

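    def fit(self, X):
        # Convert X to a float ndarray so the cluster means are not truncated to integers
        X = np.array(X, dtype=float)
        # Fix the random seed so that runs are reproducible
        np.random.seed(20)
        # Pick k distinct data points at random as the initial cluster centers
        self.k_cluster = X[np.random.choice(len(X), self.k, replace=False)]
        # Initialize the label of every point in X
        self.labels = np.zeros(len(X), dtype=int)
        times = 0   # frame counter for the optional savefig code below
        plt.scatter(X[:,0], X[:, 1], c='black')
        plt.pause(1)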
        while True:
            # Assign every point in X to a cluster
            for index, point in enumerate(X):
                # Squared Euclidean distance from point to every cluster center;
                # NumPy broadcasting subtracts each row of k_cluster from point
                distance = np.sum(np.power(point-self.k_cluster, 2), axis=1)
                self.labels[index] = distance.argmin()   # index of the nearest cluster center

            # Plot
            plt.scatter(X[:, 0], X[:, 1], c=self.labels, s=50)   # color each point by its newly assigned cluster
            plt.scatter(self.k_cluster[:,0], self.k_cluster[:,1], marker='X', c='black', s=100)
            plt.pause(0.5)

            # path = './Images/' + str(times) + '.jpg'
            # plt.savefig(path)
            # times += 1

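            # Update every cluster center to "the average of all the points in the cluster"
            self.last_k_cluster = self.k_cluster.copy() # keep the centers from the previous iteration
            for i in range(self.k):
                members = X[self.labels == i]
                if len(members) > 0:   # an empty cluster keeps its old center instead of becoming NaN
                    self.k_cluster[i] = np.mean(members, axis=0)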

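            # Compare the updated centers with the previous ones: if the Euclidean norm of their
            # difference (over all centers) is at most the threshold, leave the loop and finish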
            dist = np.sqrt(np.sum(np.power(self.last_k_cluster-self.k_cluster, 2)))
            if dist <= self.threhold:
                break

    def predict(self, X):
        # Convert X to an ndarray
        X = np.array(X)
        result = np.zeros(len(X), dtype=int)
        for index, point in enumerate(X):
            # Squared Euclidean distance from point to every cluster center (via NumPy broadcasting)
            distance = np.sum(np.power(point - self.k_cluster, 2), axis=1)
            result[index] = distance.argmin()  # index of the nearest cluster center
        return result

# Test code
# from KMeans_Shayue import *

if __name__ == '__main__':
    obj = Kmeans(3)
    np.random.seed(10)
    X = np.random.randint(1, 300, (100, 2))
    obj.fit(X)
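
The distance computation in fit and predict relies on NumPy broadcasting: a single point of shape (2,) is subtracted from every row of the (k, 2) array of cluster centers. The small sketch below walks through the shapes with made-up numbers; the variable names and the three centers are hypothetical and chosen only for illustration.

```python
import numpy as np

point = np.array([1.0, 2.0])                 # one data point, shape (2,)
centers = np.array([[0.0, 0.0],
                    [1.0, 1.0],
                    [4.0, 6.0]])             # hypothetical cluster centers, shape (3, 2)

diff = point - centers                       # point is broadcast across the rows -> shape (3, 2)
sq_dist = np.sum(np.power(diff, 2), axis=1)  # squared distance to each center -> shape (3,)
nearest = sq_dist.argmin()                   # index of the closest center

print(sq_dist)   # [ 5.  1. 25.]
print(nearest)   # 1
```

Because only the ordering of the distances matters for argmin, the square root can be skipped, which is why the code above compares squared distances.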