Introduction

This is my first blog post in English. To improve my English, I plan to write more posts in English. In this post, I will walk through some details of K-means clustering. Knowledge of machine learning is not required, but it helps if you are familiar with basic data analysis and at least one programming language.

K-means Algorithm

K-means is one of the most popular clustering algorithms. It is an unsupervised learning algorithm, meaning that it works on unlabeled data. It is an iterative algorithm that tries to partition the dataset into $K$ groups, making the data points within a group as similar as possible while keeping different groups as far apart as possible. In other words, given a set of data points $X=(x_1, x_2, x_3,...,x_n)$ , where each data point is a $d$-dimensional real vector, the goal of k-means clustering is to partition the $n$ data points into $k$ sets $S=\{S_1,S_2,S_3,...,S_k\}$ so as to minimize the within-cluster sum of squares. Formally, the objective is to find: $\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x\in S_i}||x-\mu_i||^2=\underset{S}{\arg\min} \sum_{i=1}^{k}|S_i|\ Var(S_i)$
where $\mu_i$ is the mean of the points in $S_i$ .
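
To make the objective concrete, here is a minimal sketch in plain Python that evaluates both sides of the equality for a fixed partition. The function names (`within_cluster_ss`, `size_times_variance`) and the toy data are hypothetical, and $Var(S_i)$ is assumed to be the population variance summed over coordinates, which is the reading under which the two sides agree.

```python
import statistics

def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def within_cluster_ss(clusters):
    """Left-hand side: sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for S in clusters:
        if not S:
            continue  # an empty cluster contributes nothing
        mu = mean(S)
        total += sum(sq_dist(x, mu) for x in S)
    return total

def size_times_variance(clusters):
    """Right-hand side: sum over clusters of |S_i| * Var(S_i).

    Assumption: Var(S_i) is the population variance, summed over coordinates.
    """
    total = 0.0
    for S in clusters:
        if not S:
            continue
        total += len(S) * sum(statistics.pvariance(coords) for coords in zip(*S))
    return total

# Toy partition of four 2-D points into two clusters (made-up data).
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 4.0)],
]
print(within_cluster_ss(clusters))    # 10.0
print(size_times_variance(clusters))  # 10.0
```

K-means searches over partitions for the one that makes this number as small as possible; the sketch only scores a single given partition.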

The way k-means clustering works is as follows:

Choose the number of clusters $K$ .
Randomly initialize $K$ points, which are called centroids.
Assign each data point to its nearest centroid.
Update each centroid to the mean of the data points assigned to it.
Repeat the previous two steps until the centroids no longer change.
Obtain the $K$ clusters.

This figure is from Stanford CS221. K-means clustering. (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations of k-means.

Implementation

In this clustering problem, we are given a training set $X=(x_1, x_2, x_3,...,x_n)$ , where each $x_i \in \mathbb{R}^d$ , and want to partition the data into $K$ clusters. Our goal is to find the centroid of each cluster and a label $c_i$ for each data point. The k-means clustering algorithm is as follows:

Initialize cluster centroids $\mu_1, \mu_2, \mu_3,...,\mu_k \in \mathbb{R}^d$
Repeat until the centroids no longer change: {
For every $i$ , set $c_i := \underset{j}{\arg\min}\ ||x_i-\mu_j||^2$
For every $j$ , set $\mu_j:={{\sum_{i=1}^{n}{[c_i=j]x_i}}\over{\sum_{i=1}^{n}{[c_i=j]}}}$
}
Here $[c_i=j]$ equals 1 if $c_i=j$ and 0 otherwise, so $\mu_j$ is simply the mean of the points currently assigned to cluster $j$ .
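
The loop above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names (`assign`, `update`, `kmeans`), the `max_iter` and `seed` parameters, initializing the centroids by sampling $K$ distinct data points, and leaving the centroid of an empty cluster unchanged are all choices made here for illustration. It uses only the standard library so it runs without extra dependencies.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(X, centroids):
    """Assignment step: c_i is the index of the centroid nearest to x_i."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
        for x in X
    ]

def update(X, labels, centroids):
    """Update step: mu_j becomes the mean of the points with c_i == j."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, c in zip(X, labels) if c == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(coords) / len(members) for coords in zip(*members))
        )
    return new_centroids

def kmeans(X, k, max_iter=100, seed=0):
    """Run k-means on a list of points (tuples); returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initialize with k distinct data points chosen at random.
    centroids = rng.sample(X, k)
    for _ in range(max_iter):
        labels = assign(X, centroids)
        new_centroids = update(X, labels, centroids)
        if new_centroids == centroids:
            break  # centroids stopped changing
        centroids = new_centroids
    # Recompute labels so they always match the returned centroids.
    return centroids, assign(X, centroids)

# Toy data: two well-separated groups of 2-D points (made-up values).
X = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

Which cluster gets index 0 depends on the random initialization, so the label values themselves carry no meaning; only which points share a label does. On less cleanly separated data, different seeds can also lead to different final partitions, because k-means only finds a local minimum of the objective.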



