k-Means

Original

2020-02-26 02:28

I. The k-Means Algorithm

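The best-known point-assignment clustering algorithm is k-means. It assumes that the points lie in a Euclidean space and that the final number of clusters, k, is known in advance.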

3.1 Basics of the k-Means Algorithm

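There are several ways to choose the k initial points that represent the clusters. In the core for-loop of the algorithm, every point other than the k initial points is assigned to the nearest cluster, meaning the cluster whose centroid is closest to the point. Note that a centroid may drift when a new point is added to its cluster. However, because only points near a cluster are likely to be assigned to it, the centroid does not move far. The algorithm is as follows: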

Initially choose k points that are likely to be in different clusters;

Make these points the centroids of their clusters;

For each remaining point p DO:

Find the centroid to which p is closest;

Add p to the cluster of that centroid;

Adjust the centroid of that cluster to account for p;

END;
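The loop above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the names `sq_dist` and `assign_points` are made up for this example, points are assumed to be tuples of numbers, and the centroid is assumed to be updated as a running mean each time a point joins a cluster.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_points(points, initial_centroids):
    """One pass of point assignment.

    points: the remaining points (those that are not initial centroids).
    initial_centroids: the k points chosen to seed the clusters.
    Returns (clusters, centroids).
    """
    centroids = [list(c) for c in initial_centroids]
    # Each cluster starts out containing only its seed point.
    clusters = [[tuple(c)] for c in initial_centroids]
    for p in points:
        # Find the centroid to which p is closest.
        j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        # Add p to the cluster of that centroid.
        clusters[j].append(tuple(p))
        # Adjust the centroid: running mean over the n points now in the cluster.
        n = len(clusters[j])
        centroids[j] = [c + (x - c) / n for c, x in zip(centroids[j], p)]
    return clusters, [tuple(c) for c in centroids]
```

Because each update is a running mean, the final centroid of a cluster equals the plain average of its members, regardless of the order in which the points arrived.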

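A variant of the algorithm fixes the centroids of all clusters at the end and then reassigns every point, including the k initial points, to these k clusters.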
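A rough sketch of this reassignment step, assuming the centroids have already been computed and are held fixed. The function name `reassign` is hypothetical and points are again assumed to be tuples of numbers.

```python
def reassign(points, centroids):
    """Assign every point to its nearest centroid without moving any centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Index of the centroid with the smallest squared Euclidean distance to p.
        j = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        clusters[j].append(p)
    return clusters
```

Since the centroids never change here, the result does not depend on the order in which the points are visited.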

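In k-means, the number of clusters K is chosen by hand, guided by visualizing the data and by practical requirements. The K initial cluster centers (centroids) μ1, ..., μK are then chosen at random.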

How to improve the random choice of the K centroids:

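1: If K is small, roughly between 2 and 10, run the random initialization about 100 times, compute the cost function (also called the distortion function) for each run, and keep the run with the lowest cost:

J(c1, ..., cm, μ1, ..., μK) = (1/m) * sum_{i=1..m} ||xi - μ_ci||^2

where ci is the index of the cluster to which xi is assigned. The goal is to minimize J over c1, ..., cm and μ1, ..., μK.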

For each remaining point p DO: (x1 to xm)

Find the centroid to which p is closest; (cp = k)

Add p to the cluster of that centroid; (add xp)

Added step to accumulate J: add ||xp - μk||^2 to a running sum;

Adjust the centroid of that cluster to account for p; (update μk)

END;

J = (1/m) * (running sum of squared distances);

2: If K is very large, repeating the random initialization many times is of little use.
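The restart strategy in point 1 can be sketched as below. This is an illustration under stated assumptions: each run seeds the clusters with K distinct data points drawn at random, performs the single assignment pass described earlier, and then evaluates J against the final centroids with every point assigned to its nearest one (rather than accumulating J during the pass). The names `one_pass`, `distortion`, and `best_of_restarts` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_pass(points, seed_idx):
    """Single assignment pass; the seeds are given as indices into points."""
    centroids = [list(points[i]) for i in seed_idx]
    sizes = [1] * len(centroids)  # each cluster starts with its seed point
    seeds = set(seed_idx)
    for i, p in enumerate(points):
        if i in seeds:
            continue  # seed points already belong to their own cluster
        j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        sizes[j] += 1
        # Running-mean update of the chosen centroid.
        centroids[j] = [m + (x - m) / sizes[j] for m, x in zip(centroids[j], p)]
    return centroids


def distortion(points, centroids):
    """J = (1/m) * sum_i ||x_i - mu_{c_i}||^2, with c_i the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


def best_of_restarts(points, k, runs=100, seed=0):
    """Repeat the random initialization and keep the run with the lowest J."""
    rng = random.Random(seed)
    best_j, best_centroids = float("inf"), None
    for _ in range(runs):
        seed_idx = rng.sample(range(len(points)), k)
        centroids = one_pass(points, seed_idx)
        j = distortion(points, centroids)
        if j < best_j:
            best_j, best_centroids = j, centroids
    return best_j, best_centroids
```

For example, `best_of_restarts(points, k=2)` returns the lowest distortion found over 100 runs together with the corresponding centroids. With a large K, as point 2 notes, the extra runs tend to change the result very little.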
