【Machine Learning】 Understanding K-means Clustering Algorithm

Introduction

This is my first English blog. In order to improve my English comprehension, I will try to write more English blogs. In this blog, I will tell you some details about K-means clustering. Knowledge of machine learning is not required, but it helps if you are familiar with basic data analysis and at least one programming language.

K-means Algorithm

K-means is one of the most popular clustering algorithms. It is an unsupervised learning algorithm, which means it works on unlabeled data. It is an iterative algorithm that tries to partition the dataset into K groups (clusters). It tries to make the data points within a cluster as similar as possible, while keeping different clusters as far apart as possible. In other words, given a set of data points $X=(x_1, x_2, x_3,...,x_n)$, where each data point is a $d$-dimensional real vector, the goal of k-means clustering is to partition the $n$ data points into $k$ sets $S=\{S_1,S_2,S_3,...,S_k\}$ so as to minimize the within-cluster sum of squares. Formally, the objective is to find:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x\in S_i}\|x-\mu_i\|^2=\arg\min_{S} \sum_{i=1}^{k}|S_i|\,\mathrm{Var}(S_i)$$

where $\mu_i$ is the mean of the points in $S_i$.
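To make the objective concrete, here is a minimal sketch in plain Python that evaluates the within-cluster sum of squares for a fixed partition and checks that it matches the $|S_i|\,\mathrm{Var}(S_i)$ form. The helper names (`mean`, `sq_dist`, `wcss`, `variance`) and the toy data are my own assumptions for illustration, and the variance here is the population variance summed over coordinates.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2 between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def wcss(clusters):
    """Within-cluster sum of squares: sum over clusters of ||x - mu_i||^2."""
    total = 0.0
    for points in clusters:
        mu = mean(points)
        total += sum(sq_dist(x, mu) for x in points)
    return total


def variance(points):
    """Population variance of one cluster, summed over coordinates."""
    mu = mean(points)
    return sum(sq_dist(x, mu) for x in points) / len(points)


# Hypothetical toy partition: two clusters of 2-D points.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],
    [(8.0, 8.0), (10.0, 8.0), (9.0, 11.0)],
]

objective = wcss(clusters)
via_variance = sum(len(s) * variance(s) for s in clusters)
print(objective, via_variance)  # the two values agree up to rounding
```

K-means searches over partitions for the one that makes this value as small as possible. The sketch only scores a partition that is already given.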

The way k-means clustering works is as follows:

  1. Choose the number of clusters $K$.
  2. Randomly initialize $K$ points, which are called centroids.
  3. Assign each data point to its nearest centroid.
  4. Update each centroid to the mean of the data points assigned to it.
  5. Repeat steps 3 and 4 until the centroids no longer change.
  6. Obtain the $K$ clusters.
    Figure 1.
    This figure is from Stanford CS221. K-means clustering. (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations of k-means.
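The steps can be traced by hand on a tiny example. The sketch below uses hypothetical 1-D data and fixed starting centroids (instead of random ones, so the trace is reproducible) and records the centroids after every update.

```python
# Hypothetical 1-D data and starting centroids, chosen only for illustration.
data = [1.0, 2.0, 10.0, 11.0]
centroids = [1.0, 2.0]

history = [list(centroids)]
while True:
    # Step 3: assign each point to the index of its nearest centroid.
    labels = [
        min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
        for x in data
    ]
    # Step 4: move each centroid to the mean of the points assigned to it.
    new_centroids = []
    for j in range(len(centroids)):
        members = [x for x, c in zip(data, labels) if c == j]
        # Keep the old centroid if no point was assigned to it.
        new_centroids.append(sum(members) / len(members) if members else centroids[j])
    # Step 5: stop once the centroids no longer change.
    if new_centroids == centroids:
        break
    centroids = new_centroids
    history.append(list(centroids))

for step, c in enumerate(history):
    print(step, c)
print("labels:", labels)
```

On this data the centroids settle at 1.5 and 10.5 after two updates, even though both starting centroids were placed inside the left group.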

Implementation

In this clustering problem, we are given a training set $X=(x_1, x_2, x_3,...,x_n)$ and want to partition the data into $K$ clusters. Our goal is to find the centroid of each cluster and a label $c_i$ for each data point. The k-means algorithm is as follows:

  1. Initialize the cluster centroids $\mu_1, \mu_2, \mu_3,...,\mu_k \in \mathbb{R}^d$, where $d$ is the dimension of the data points.
  2. Repeat until the centroids no longer change: {
    For every $i$, set $c_i := \arg\min_{j} \|x_i-\mu_j\|^2$
    For every $j$, set $\mu_j := \frac{\sum_{i=1}^{n}[c_i=j]\,x_i}{\sum_{i=1}^{n}[c_i=j]}$
    }
    Here $[c_i=j]$ equals 1 if $c_i=j$ and 0 otherwise, so $\mu_j$ is the mean of the points currently labeled $j$.
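The loop above could be written in plain Python roughly as follows. This is a sketch under my own assumptions, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, initializing the centroids from randomly chosen data points, and keeping the old centroid when a cluster ends up empty are all choices made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2 between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Sketch of the k-means loop. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialization (assumption): use k distinct data points as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: c_i := argmin_j ||x_i - mu_j||^2
        labels = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j]))
            for x in points
        ]
        # Update step: mu_j := mean of the points with c_i = j
        new_centroids = []
        for j in range(k):
            members = [x for x, c in zip(points, labels) if c == j]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Empty cluster (assumption): keep the previous centroid.
                new_centroids.append(centroids[j])
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful. If the loop stops because `max_iter` is reached rather than because the centroids stopped moving, the returned labels correspond to the centroids from before the last update.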