Kmeans

Original

2019-04-03 14:20

Algorithm overview

The k-means algorithm is one of the simplest and most popular machine learning algorithms for clustering. It takes the data points and the number of clusters (k) as input.

It first places k points, called centroids, at random positions. The following two steps are then repeated until the set of k centroids no longer changes:

Assignment of points to centroids: every data point is assigned to the centroid closest to it. The data points assigned to a particular centroid form a cluster, so assigning the points to k centroids produces k clusters.
Update of centroids: each centroid is recomputed as the center of its cluster, that is, the average of all the points in the cluster. The data points are then reassigned to the new centroids in the next iteration.
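The two steps can be sketched as a pair of small NumPy functions. This is only an illustration: the names `assign_step` and `update_step` and the four-point data set are made up for this example, and leaving an empty cluster's centroid where it was is one possible convention, not something the description above prescribes.

```python
import numpy as np

def assign_step(X, centroids):
    """Return, for each row of X, the index of its nearest centroid."""
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d); summing over d
    # gives the squared Euclidean distance of every point to every centroid.
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def update_step(X, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = X[labels == i]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

# Hand-made example: two obvious groups, one near x = 0 and one near x = 10
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[1.0, 1.0], [9.0, 1.0]])

labels = assign_step(X, centroids)
centroids = update_step(X, labels, centroids)
print(labels)     # the first two points get label 0, the last two label 1
print(centroids)  # the centroids move to (0, 1) and (10, 1)
```

Repeating the two calls until `centroids` stops changing gives the loop described above.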

K-means demo

Code

import numpy as np
import matplotlib.pyplot as plt

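class Kmeans:
    """K-means implemented with Python and NumPy."""
    def __init__(self, k_):
        self.k = k_         # k is the requested number of clusters
        self.threshold = 1e-10   # stop once the cluster centers move less than this
        self.last_k_cluster = None

    def fit(self, X):
        # Convert X to a float ndarray so the centers (means) are not truncated to integers
        X = np.array(X, dtype=float)
        # Fix the random seed so that runs are reproducible
        np.random.seed(20)
        # Pick k distinct data points as the initial cluster centers
        self.k_cluster = X[np.random.choice(len(X), self.k, replace=False)]
        # Initialize the label of every point in X
        self.labels = np.zeros(len(X), dtype=int)
        times = 0
        plt.scatter(X[:,0], X[:, 1], c='black')
        plt.pause(1)
        while True:
            # Assign every point in X to a cluster
            for index, point in enumerate(X):  # squared Euclidean distance from point to every cluster center
                distance = np.sum(np.power(point-self.k_cluster, 2), axis=1)    # NumPy broadcasting subtracts point from all centers at once
                self.labels[index] = distance.argmin()   # label = index of the center with the smallest squared distance

            # Plot
            plt.scatter(X[:, 0], X[:, 1], c=self.labels, s=50)   # color each point by the cluster it was just assigned to
            plt.scatter(self.k_cluster[:,0], self.k_cluster[:,1], marker='X', c='black', s=100)
            plt.pause(0.5)

            # path = './Images/' + str(times) + '.jpg'
            # plt.savefig(path)
            # times += 1

            # Update every cluster center to "the average of all the points in the cluster"
            self.last_k_cluster = self.k_cluster.copy() # keep the previous cluster centers
            for i in range(self.k):
                members = X[self.labels == i]
                if len(members) > 0:   # an empty cluster keeps its previous center instead of becoming NaN
                    self.k_cluster[i] = np.mean(members, axis=0)

            # Measure how far the centers moved (Euclidean norm of the difference between
            # the new and the previous centers); stop once it drops below the threshold
            dist = np.sqrt(np.sum(np.power(self.last_k_cluster-self.k_cluster, 2)))
            if dist <= self.threshold:
                break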

    def predict(self, X):
        # Convert X to an ndarray
        X = np.array(X)
        result = np.zeros(len(X), dtype=int)
        for index, point in enumerate(X):
            distance = np.sum(np.power(point - self.k_cluster, 2), axis=1)  # NumPy broadcasting subtracts point from all centers at once
            result[index] = distance.argmin()  # label = index of the center with the smallest squared distance
        return result

# Test code
# from KMeans_Shayue import *

if __name__ == '__main__':
    obj = Kmeans(3)
    np.random.seed(10)
    X = np.random.randint(1, 300, (100, 2))
    obj.fit(X)
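For scripts that run without a display, the same loop can be written without the plotting calls. The sketch below is not the class above but a compact variant under stated assumptions: the function name `kmeans_no_plot`, its `tol`, `max_iter` and `seed` parameters, and the small data set are all invented for illustration. It stops when the centers move by no more than `tol`, the same criterion `fit` uses, and then labels two new points the way `predict` does.

```python
import numpy as np

def kmeans_no_plot(X, k, tol=1e-10, max_iter=100, seed=20):
    """Plot-free k-means loop (illustrative; parameter names are assumptions)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid, shape (n, k)
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[i] = members.mean(axis=0)
        # Total movement of the centroids in this iteration
        shift = np.sqrt(((new_centroids - centroids) ** 2).sum())
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Two well-separated groups of four points each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]])
centroids, labels = kmeans_no_plot(X, k=2)

# Label new points by their nearest centroid, as predict does
new_points = np.array([[0.5, 0.2], [10.2, 10.9]])
new_sq_dist = ((new_points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
new_labels = new_sq_dist.argmin(axis=1)
print(centroids)
print(labels, new_labels)
```

Which group ends up as cluster 0 depends on the random start, so only the grouping is meaningful, not the label values themselves.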
