Introduction
This is my first blog post in English. To improve my English, I will try to write more posts in English from now on. In this post, I will walk through some details of K-means clustering. Knowledge of machine learning is not required, but it helps if you are familiar with basic data analysis and at least one programming language.
K-means Algorithm
K-means is one of the most popular clustering algorithms. It is an unsupervised learning algorithm, which means it works on unlabeled data. It is an iterative algorithm that tries to partition the dataset into $K$ groups. It tries to make the data points within a group as similar as possible, while keeping each group as different as possible from the others. In other words, given a set of data points $(x_1, x_2, \dots, x_n)$, where each data point is a $d$-dimensional real vector, the goal of k-means clustering is to partition the $n$ data points into $K$ sets $S = \{S_1, S_2, \dots, S_K\}$ so as to minimize the within-cluster sum of squares. Formally, the objective is to find:
$$\underset{S}{\arg\min} \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where $\mu_i$ is the mean of points in $S_i$.
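To make the objective concrete, here is a minimal sketch that computes the quantity being minimized: the sum, over all clusters, of the squared distances from each point to the mean of its cluster. The function name `within_cluster_ss` and the toy data are my own choices for illustration, and the sketch assumes every cluster is non-empty.

```python
def within_cluster_ss(clusters):
    """Sum of squared distances from each point to the mean of its cluster.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    points, and each point is a tuple of numbers (hypothetical layout).
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Coordinate-wise mean of the points in this cluster.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # Add the squared Euclidean distance of every point to that mean.
        total += sum(
            sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points
        )
    return total


# Two ways of splitting the same four points into two groups.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(within_cluster_ss(good))  # 4.0
print(within_cluster_ss(bad))   # 100.0
```

The partition that groups nearby points together gets the much smaller score, which is exactly what k-means is looking for.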
The way k-means clustering works is as follows:
- Choose the number of clusters $K$.
- Randomly initialize $K$ points, which are called centroids.
- Assign each data point to its nearest centroid.
- Update each centroid to the mean of the data points assigned to it.
- Repeat the previous two steps until the centroids stop changing.
- Output the resulting $K$ clusters.
This figure is from Stanford CS221 (here). K-means clustering. (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations of k-means.
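The steps above can also be traced by hand on a tiny example. The sketch below uses four one-dimensional points and two deliberately poor starting centroids, all chosen by me for illustration, and records the labels and centroids after every pass. The helper names `assign` and `update` are hypothetical.

```python
points = [1.0, 2.0, 9.0, 10.0]
centroids = [1.0, 2.0]  # deliberately poor starting guesses (an assumption)


def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
        for x in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, c in zip(points, labels) if c == j]
        # Keep the old centroid if no point was assigned to it.
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids


history = []
while True:
    labels = assign(points, centroids)
    new_centroids = update(points, labels, centroids)
    history.append((labels, new_centroids))
    if new_centroids == centroids:  # centroids stopped changing
        break
    centroids = new_centroids

for step, (labels, cents) in enumerate(history, start=1):
    print(step, labels, cents)
```

On this data the first pass moves the centroids to 1.0 and 7.0, the second pass moves them to 1.5 and 9.5, and the third pass changes nothing, so the loop stops with the clusters {1, 2} and {9, 10}.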
Implementation
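In this clustering problem, we are given a training set $\{x^{(1)}, \dots, x^{(m)}\}$, and want to partition the data into $K$ clusters. Our goal is to predict the centroid $\mu_j$ of each cluster and a label $c^{(i)}$ for each data point $x^{(i)}$. The k-means clustering algorithm is as follows:
- Initialize cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^d$ randomly.
- Repeat until centroids are not changing: {
For every $i$, set $c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$
For every $j$, set $\mu_j := \dfrac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$
}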
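As a rough illustration of the loop above, here is a minimal sketch in plain Python. It is not a tuned implementation. It rests on a few assumptions of mine: the initial centroids are picked as distinct data points, an empty cluster keeps its previous centroid, and `max_iters` caps the number of passes. The names `kmeans`, `squared_distance`, `seed`, and `max_iters` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iters=100):
    """Return (centroids, labels) for a list of points given as tuples."""
    rng = random.Random(seed)
    # Assumption: initialize the centroids as k distinct data points.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [x for x, c in zip(points, labels) if c == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # centroids are not changing
            break
        centroids = new_centroids
    return centroids, labels


data = [(0, 0), (1, 1), (2, 2), (100, 100), (101, 101), (102, 102)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

On this well-separated toy data the two groups are recovered, with centroids at (1.0, 1.0) and (101.0, 101.0). Which group gets label 0 depends on the random initialization, and on harder data k-means can stop at a local optimum, so it is common to run it several times with different seeds.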