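For the principles behind the KMeans and KMeans++ algorithms and the meaning of their parameters, see this article: Unsupervised Learning | KMeans and KMeans++ Principles

This article focuses on implementing KMeans, choosing the value of K, and visualizing the clustering results.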

1. KMeans in Sklearn

sklearn.cluster.KMeans

KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')

Parameters:

n_clusters: int, optional, default: 8

The number of clusters to form as well as the number of centroids to generate. [number of clusters]

init: {‘k-means++’, ‘random’ or an ndarray}

Method for initialization, defaults to ‘k-means++’:

‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details. [KMeans++: picks one initial center at random, then picks the remaining k-1 centers by roulette-wheel selection]

‘random’: choose k observations (rows) at random from data for the initial centroids. [classic KMeans: picks k initial centers at random]

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
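
The roulette-wheel selection mentioned for ‘k-means++’ can be sketched as follows. This is a simplified illustration in plain NumPy, not scikit-learn's internal code (which adds refinements such as several candidate draws per step); the function name kmeans_plus_plus_init and the synthetic X_demo data are assumptions made for this example.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, rng):
    """Pick k initial centers from the rows of X by squared-distance-weighted sampling."""
    n_samples = X.shape[0]
    # First center: one row chosen uniformly at random.
    centers = [X[rng.integers(n_samples)]]
    # Squared distance from every point to its nearest chosen center so far.
    closest_sq = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Roulette wheel: a point is picked with probability proportional
        # to its squared distance from the nearest existing center.
        probs = closest_sq / closest_sq.sum()
        idx = rng.choice(n_samples, p=probs)
        centers.append(X[idx])
        new_sq = np.sum((X - X[idx]) ** 2, axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.array(centers)

rng = np.random.default_rng(0)
# Synthetic demo data: three small blobs (not the MovieLens data).
X_demo = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])
init_centers = kmeans_plus_plus_init(X_demo, 3, rng)
print(init_centers)
```

Because an already chosen point has distance 0 to itself, it can never be drawn again, so the k centers are always distinct rows of the data.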

n_init: int, default: 10

Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. [runs with several sets of initial points and keeps the best result]

max_iter: int, default: 300

Maximum number of iterations of the k-means algorithm for a single run. [maximum number of iterations]

tol: float, default: 1e-4

Relative tolerance with regards to inertia to declare convergence. [minimum-change threshold]

precompute_distances: {‘auto’, True, False}

Precompute distances (faster but takes more memory).

‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances

verbose: int, default 0

Verbosity mode.

random_state: int, RandomState instance or None (default)

Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.

copy_x: boolean, optional

When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified, ensuring X is C-contiguous. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean, in this case it will also not ensure that data is C-contiguous which may cause a significant slowdown.

n_jobs: int or None, optional (default=None)

The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. [parallel execution]

algorithm: “auto”, “full” or “elkan”, default=”auto”

K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.

Attributes:

cluster_centers_: array, [n_clusters, n_features]

Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

labels_: array, shape (n_samples,)

Labels of each point

inertia_: float

Sum of squared distances of samples to their closest cluster center.

n_iter_: int

Number of iterations run.
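
To make the attributes above concrete, here is a small sketch that fits KMeans on made-up two-dimensional points and reads each attribute back. The X_toy array is invented for illustration, and n_init and random_state are set explicitly only to keep the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs (made-up data, not the MovieLens set).
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

print(model.cluster_centers_.shape)  # (n_clusters, n_features)
print(model.labels_)                 # one cluster index per sample
print(model.inertia_)                # sum of squared distances to the closest center
print(model.n_iter_)                 # iterations used by the best run

# inertia_ recomputed by hand from the labels and the centers.
manual_inertia = np.sum((X_toy - model.cluster_centers_[model.labels_]) ** 2)
print(manual_inertia)
```

The hand-computed value should match inertia_ up to floating-point error, which is a quick way to check what the attribute measures.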

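2. Sklearn example: k-means clustering of movie ratings

We will use the MovieLens user ratings dataset and, based on how users rate different movies, study where their tastes in movies are alike and where they differ.

2.1 Dataset overview

The dataset consists of two files. We load both into pandas dataframes: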

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.sparse import csr_matrix

# Import the Movies dataset
movies = pd.read_csv('k-means_Clustering of Movie Ratings/ml-latest-small/movies.csv')
movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy

# Import the ratings dataset
ratings = pd.read_csv('k-means_Clustering of Movie Ratings/ml-latest-small/ratings.csv')
ratings.head()

	userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

print('The dataset contains: ', len(ratings), ' ratings of ', len(movies), ' movies.')

The dataset contains:  100004  ratings of  9125  movies.

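Now that we know the structure of the dataset, we can see that it holds 100004 ratings covering 9125 movies.

The genres include comedy, romance, children's, animation…

2.2 Two-dimensional KMeans clustering

We want to see whether viewers fall into clearly separate groups by how they rate romance and science-fiction movies. We compute each user's average rating for romance movies and for sci-fi movies, then bias the dataset slightly (removing users who like both sci-fi and romance) so that clustering can label each remaining user as preferring one of the two genres.

Most of the data preprocessing is hidden inside a helper module, helper, so that we can concentrate on the clustering concepts.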
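
The helper module itself is not listed in this article. As a rough idea of what a per-user genre average could look like, here is a minimal sketch on toy data; the function name average_genre_ratings and the toy tables are hypothetical and are not the helper's actual code.

```python
import pandas as pd

def average_genre_ratings(ratings, movies, genres, column_names):
    """Per-user mean rating for each genre (hypothetical stand-in for the helper)."""
    per_genre = []
    for genre in genres:
        # regex=False so that names such as 'Sci-Fi' are matched literally.
        in_genre = movies.loc[movies['genres'].str.contains(genre, regex=False), 'movieId']
        means = (ratings[ratings['movieId'].isin(in_genre)]
                 .groupby('userId')['rating'].mean().round(2))
        per_genre.append(means)
    result = pd.concat(per_genre, axis=1)
    result.columns = column_names
    return result

# Toy tables with the same columns as movies.csv and ratings.csv.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Comedy|Romance', 'Action|Sci-Fi', 'Romance|Sci-Fi'],
})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'movieId': [1, 2, 3, 1, 3],
    'rating':  [4.0, 2.0, 3.0, 5.0, 1.0],
})
toy_genre_ratings = average_genre_ratings(
    toy_ratings, toy_movies, ['Romance', 'Sci-Fi'],
    ['avg_romance_rating', 'avg_scifi_rating'])
print(toy_genre_ratings)
```

A movie tagged with several genres counts toward each of them, which is why movie 3 contributes to both averages here.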

import helper

# Calculate the average rating of romance and scifi movies
genre_ratings = helper.get_genre_ratings(ratings, movies, ['Romance', 'Sci-Fi'], ['avg_romance_rating', 'avg_scifi_rating'])

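# get_genre_ratings computes each user's average rating over all romance and sci-fi movies. We then bias the dataset slightly by removing users who like both sci-fi and romance, so that clustering can label each user as preferring one genre.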
biased_dataset = helper.bias_genre_rating_dataset(genre_ratings, 3.2, 2.5)

print( "Number of records: ", len(biased_dataset))
biased_dataset.head()

Number of records:  183

	index	avg_romance_rating	avg_scifi_rating
0	1	3.50	2.40
1	3	3.65	3.14
2	6	2.90	2.75
3	7	2.93	3.36
4	12	2.89	2.62

We have 183 users, and for each of them we now have the average rating they gave to the romance and sci-fi movies they watched.
Let's plot the dataset:

%matplotlib inline

helper.draw_scatterplot(biased_dataset['avg_scifi_rating'],'Avg scifi rating', biased_dataset['avg_romance_rating'], 'Avg romance rating')

We can see a clear bias in this sample (one we created on purpose). How well does k-means do if we split the sample into two groups?

# Turn the two rating columns into a NumPy array
X = biased_dataset[['avg_scifi_rating','avg_romance_rating']].values

from sklearn.cluster import KMeans 
kmeans_1 = KMeans(n_clusters=2)
predictions = kmeans_1.fit_predict(X) 

# Plot
helper.draw_clusters(biased_dataset, predictions)

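The groups are split mainly by how each person rates romance movies: a user whose average romance rating is above 3 stars falls in one group, and everyone else falls in the other.

What happens if we split into three groups?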

kmeans_2 = KMeans(n_clusters=3)
predictions_2 = kmeans_2.fit_predict(X)

# Plot
helper.draw_clusters(biased_dataset, predictions_2)

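Now the average sci-fi rating starts to matter. The groups are:

users who like romance but not sci-fi
users who like sci-fi but not romance
users who like both sci-fi and romance

Let's add one more group: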

kmeans_3 = KMeans(n_clusters=4)
predictions_3 = kmeans_3.fit_predict(X)

# Plot
helper.draw_clusters(biased_dataset, predictions_3)

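The more clusters we split the dataset into, the more alike the users inside each cluster are.

3. Choosing the best K with the elbow method

We can split the data points into any number of clusters. What is the right number of clusters for this dataset?

There are several ways to choose the number of clusters k. We will look at a simple one called the "elbow method", which plots increasing values of k against the total error obtained with that k.

The idea is similar to grid search: sweep over the parameter K and keep the value that scores best. Here we evaluate clustering quality with the silhouette coefficient (the closer to 1, the better), so we look for high scores rather than low errors.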
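
As an illustration of scoring a single k with the silhouette coefficient, here is a minimal sketch on synthetic data. The function silhouette_for_k and the blobs array are assumptions made for this example; the helper.clustering_errors function used in this article is not shown, so this is not necessarily how it is written.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_for_k(k, data, seed=0):
    """Cluster data into k groups and return the mean silhouette coefficient."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
    return silhouette_score(data, labels)

rng = np.random.default_rng(0)
# Three tight, well-separated synthetic blobs (not the MovieLens data).
blobs = np.vstack([rng.normal(center, 0.2, size=(30, 2))
                   for center in ([0, 0], [4, 0], [2, 4])])

candidate_ks = range(2, 7)
scores = {k: silhouette_for_k(k, blobs) for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

The silhouette coefficient needs at least two clusters, so the sweep starts at k=2. On data this clean the score should peak at the true number of blobs; on real ratings the curve is much noisier.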

The task now is to do the same for every k (from 2 up to the number of elements in the dataset, in steps of 5).

df = biased_dataset[['avg_scifi_rating','avg_romance_rating']]

# Choose the range of k values to test.
# We added a stride of 5 to improve performance. We don't need to calculate the error for every k value
possible_k_values = range(2, len(X)+1, 5)

# Calculate error values for all k values we're interested in
errors_per_k = [helper.clustering_errors(k, X) for k in possible_k_values]

# Plot each value of K against the silhouette score at that value
fig, ax = plt.subplots(figsize=(16, 6))
plt.plot(possible_k_values, errors_per_k)

# Ticks and grid
xticks = np.arange(min(possible_k_values), max(possible_k_values)+1, 5.0)
ax.set_xticks(xticks, minor=False)
ax.set_xticks(xticks, minor=True)
ax.xaxis.grid(True, which='both')
yticks = np.arange(round(min(errors_per_k), 2), max(errors_per_k), .05)
ax.set_yticks(yticks, minor=False)
ax.set_yticks(yticks, minor=True)
ax.yaxis.grid(True, which='both')

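Looking at the plot, good choices of k include 7, 22, 27, 32 and so on (this varies slightly from run to run). Going beyond that range of cluster counts starts to give poor clusterings (according to the silhouette score).

I will pick k=7 because it is easier to visualize: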

kmeans_4 = KMeans(n_clusters=7)
predictions_4 = kmeans_4.fit_predict(X)

# Plot
helper.draw_clusters(biased_dataset, predictions_4, cmap='Accent')

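Note: when you plot larger values of k (above 10), make sure your plotting library does not reuse the same color for different clusters. For this plot we had to use the matplotlib colormap Accent, because the other colormaps either lack contrast between colors or start recycling colors beyond 8 or 10 clusters.

4. Multi-dimensional KMeans clustering

4.1 Three-dimensional KMeans clustering

So far we have only looked at how users rate romance and sci-fi movies. Let's add one more genre and see what happens when action movies are included.

The dataset now looks like this: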

biased_dataset_3_genres = helper.get_genre_ratings(ratings, movies, 
                                                     ['Romance', 'Sci-Fi', 'Action'], 
                                                     ['avg_romance_rating', 'avg_scifi_rating', 'avg_action_rating'])
biased_dataset_3_genres = helper.bias_genre_rating_dataset(biased_dataset_3_genres, 3.2, 2.5).dropna()

print( "Number of records: ", len(biased_dataset_3_genres))
biased_dataset_3_genres.head()

Number of records:  183

	index	avg_romance_rating	avg_scifi_rating	avg_action_rating
0	1	3.50	2.40	2.80
1	3	3.65	3.14	3.47
2	6	2.90	2.75	3.27
3	7	2.93	3.36	3.29
4	12	2.89	2.62	3.21

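We cluster the three-dimensional data and visualize it on a flat plot that also encodes the third dimension.

We still use the x and y axes for sci-fi and romance, and use the size of each point to roughly indicate the action rating (larger points mean an average rating above 3 stars, smaller points mean 3 stars or fewer).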

X_with_action = biased_dataset_3_genres[['avg_scifi_rating',
                                         'avg_romance_rating', 
                                         'avg_action_rating']].values

kmeans_5 = KMeans(n_clusters=7)
predictions_5 = kmeans_5.fit_predict(X_with_action)

# plot
helper.draw_clusters_3d(biased_dataset_3_genres, predictions_5)

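Adding a genre changes how the users are clustered. The more data we give k-means, the more alike the interests of the users in each group. But if we keep plotting this way, we cannot visualize anything beyond two or three dimensions. In the next part we use a different kind of chart to look at clusters with up to 50 dimensions.

4.2 High-dimensional KMeans clustering

Now that we know how k-means clusters users by their taste in genres, let's go further and look at how users rate individual movies. To do so, we reshape the dataset into userId versus the user's rating of each movie. For example, here is a subset of that dataset: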

# Merge the two tables then pivot so we have Users X Movies dataframe
ratings_title = pd.merge(ratings, movies[['movieId', 'title']], on='movieId' )
user_movie_ratings = pd.pivot_table(ratings_title, index='userId', columns= 'title', values='rating')

print('dataset dimensions: ', user_movie_ratings.shape, '\n\nSubset example:')
user_movie_ratings.iloc[:6, :10]

dataset dimensions:  (671, 9064) 

Subset example:

title	"Great Performances" Cats (1998)	$9.99 (2008)	'Hellboy': The Seeds of Creation (2004)	'Neath the Arizona Skies (1934)	'Round Midnight (1986)	'Salem's Lot (2004)	'Til There Was You (1997)	'burbs, The (1989)	'night Mother (1986)	(500) Days of Summer (2009)
userId
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4.0	NaN	NaN

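The NaN values point to a problem. Most users have not watched, and therefore have not rated, most movies. A dataset like this is called "sparse", because only a small fraction of the cells hold a value.

To work around this, we sort by the movies that received the most ratings and by the users who rated the most movies. This produces a "denser" region that we can inspect at the top of the dataset.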
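
The idea of keeping the most-rated movies and the most active users can be sketched with plain pandas. The function densest_block and the toy table are hypothetical; the helper.sort_by_rating_density function used in this article is not shown and may differ in detail.

```python
import numpy as np
import pandas as pd

def densest_block(user_movie, n_movies, n_users):
    """Keep the most-rated movies, then the users who rated most of them."""
    # Columns ordered by how many non-missing ratings each movie has.
    movie_counts = user_movie.count(axis=0).sort_values(ascending=False)
    top_movies = user_movie[movie_counts.index[:n_movies]]
    # Rows ordered by how many of those movies each user rated.
    user_counts = top_movies.count(axis=1).sort_values(ascending=False)
    return top_movies.loc[user_counts.index[:n_users]]

# Toy users-by-movies table with missing ratings.
toy = pd.DataFrame(
    {'A': [5.0, np.nan, 4.0, np.nan],
     'B': [np.nan, np.nan, 3.0, np.nan],
     'C': [4.0, 2.0, 5.0, np.nan]},
    index=pd.Index([10, 11, 12, 13], name='userId'))

dense = densest_block(toy, n_movies=2, n_users=2)
print(dense)
```

In this toy case the selected block has no missing cells at all; on the real data some gaps remain, but far fewer than in the full table.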

If we pick the 30 most-rated movies and the 18 users who rated the most movies, it looks like this:

n_movies = 30
n_users = 18
most_rated_movies_users_selection = helper.sort_by_rating_density(user_movie_ratings, n_movies, n_users)

print('dataset dimensions: ', most_rated_movies_users_selection.shape)
most_rated_movies_users_selection.head()

dataset dimensions:  (18, 30)

title	Forrest Gump (1994)	Pulp Fiction (1994)	Shawshank Redemption, The (1994)	Silence of the Lambs, The (1991)	Star Wars: Episode IV - A New Hope (1977)	Jurassic Park (1993)	Matrix, The (1999)	Toy Story (1995)	Schindler's List (1993)	Terminator 2: Judgment Day (1991)	...	Dances with Wolves (1990)	Fight Club (1999)	Usual Suspects, The (1995)	Seven (a.k.a. Se7en) (1995)	Lion King, The (1994)	Godfather, The (1972)	Lord of the Rings: The Fellowship of the Ring, The (2001)	Apollo 13 (1995)	True Lies (1994)	Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
29	5.0	5.0	5.0	4.0	4.0	4.0	3.0	4.0	5.0	4.0	...	5.0	4.0	5.0	4.0	3.0	5.0	3.0	5.0	4.0	2.0
508	4.0	5.0	4.0	4.0	5.0	3.0	4.5	3.0	5.0	2.0	...	5.0	4.0	5.0	4.0	3.5	5.0	4.5	3.0	2.0	4.0
14	1.0	5.0	2.0	5.0	5.0	3.0	5.0	2.0	4.0	4.0	...	3.0	5.0	5.0	5.0	4.0	5.0	5.0	3.0	4.0	4.0
72	5.0	5.0	5.0	4.5	4.5	4.0	4.5	5.0	5.0	3.0	...	4.5	5.0	5.0	5.0	5.0	5.0	5.0	3.5	3.0	5.0
653	4.0	5.0	5.0	4.5	5.0	4.5	5.0	5.0	5.0	5.0	...	4.5	5.0	5.0	4.5	5.0	4.5	5.0	5.0	4.0	5.0

5 rows × 30 columns

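This is much easier to analyze.

4.2.1 Heatmap visualization

We also need a good way to visualize these ratings, so that we can recognize them (and later the clusters) at a glance when looking at larger subsets.

We use color in place of the rating numbers: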

helper.draw_movies_heatmap(most_rated_movies_users_selection)

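Each column is a movie and each row is a user. The color of a cell shows how that user rated that movie, according to the scale on the right of the chart.

Notice that some cells are white? That means the user did not rate that movie. This is a problem you run into when clustering real data. Unlike the tidy example we started with, real-world datasets are often sparse, with many cells left empty. That makes it awkward to cluster users directly on their movie ratings, because k-means does not cope well with missing values.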
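
A quick way to see the missing-value problem in practice: scikit-learn's KMeans rejects dense input that contains NaN. The tiny X_missing array below is made up for the demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# A dense array with missing ratings, like the white cells in the heatmap.
X_missing = np.array([[5.0, np.nan],
                      [4.0, 3.0],
                      [np.nan, 1.0],
                      [2.0, 2.5]])

try:
    KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_missing)
    raised = False
except ValueError as err:
    # Input validation refuses NaN before any clustering happens.
    raised = True
    print('KMeans refused the data:', err)
```

This is why the missing ratings have to be handled in some way before clustering.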

4.2.2 Sparse csr matrix

To improve performance, we will only use the ratings of 1000 movies (the dataset has more than 9000).

user_movie_ratings =  pd.pivot_table(ratings_title, index='userId', columns= 'title', values='rating')
most_rated_movies_1k = helper.get_most_rated_movies(user_movie_ratings, 1000)

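To have sklearn run k-means on a dataset with missing values like this one, we first convert it to the sparse csr matrix type defined in the SciPy library.

Older pandas releases offered SparseDataFrame and its to_coo() method for this conversion. SparseDataFrame was removed in pandas 1.0, so the code below instead fills the missing ratings with 0 and passes the values to csr_matrix, which stores only the non-zero entries. Every real rating is at least 0.5, so the stored entries are the same.

We convert the dataframe to a sparse matrix and cluster it (K=20 is an arbitrary choice; a better way to choose k is the elbow method shown above, but it takes a while to run). To visualize some of the clusters, we draw each one as a heatmap:

# Missing ratings become implicit zeros in the CSR matrix
sparse_ratings = csr_matrix(most_rated_movies_1k.fillna(0).values)

# The original call passed algorithm='full'; that name was later renamed to 'lloyd', and the default setting already handles sparse input
predictions = KMeans(n_clusters=20).fit_predict(sparse_ratings)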

max_users = 70
max_movies = 50

clustered = pd.concat([most_rated_movies_1k.reset_index(), pd.DataFrame({'group':predictions})], axis=1)
helper.draw_movie_clusters(clustered, max_users, max_movies)

cluster # 2
# of users in cluster: 257. # of users in plot: 70

cluster # 13
# of users in cluster: 74. # of users in plot: 70

cluster # 14
# of users in cluster: 57. # of users in plot: 57

cluster # 18
# of users in cluster: 80. # of users in plot: 70

cluster # 3
# of users in cluster: 46. # of users in plot: 46

cluster # 16
# of users in cluster: 37. # of users in plot: 37

cluster # 5
# of users in cluster: 22. # of users in plot: 22

cluster # 12
# of users in cluster: 33. # of users in plot: 33

cluster # 9
# of users in cluster: 22. # of users in plot: 22

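A few things to note:

The more similar the ratings in a cluster are, the more vertical lines of a single color you will see in that cluster.
Some interesting patterns show up in the clusters:
Some clusters are sparser than others; their users probably watch and rate fewer movies than the users in other clusters.
Some clusters are mostly yellow and gather users who really like a particular set of movies. Other clusters are mostly green or aqua, meaning their users agree that certain movies deserve 2-3 stars.
Notice how the movies change from cluster to cluster. The chart filters the data to show only the most-rated movies, then sorts them by average rating.
It is easy to spot horizontal lines of a single color, which are users whose ratings vary little. This may be one of the reasons Netflix switched from star ratings to a like/dislike rating: a four-star rating means different things to different people.
We took a few steps (filtering, sorting, slicing) when visualizing the clusters, because a dataset like this is "sparse" and most cells are empty (most users have not watched most movies).

4.2.3 Making predictions from the clusters

Let's pick a cluster and a specific user and see what useful things the clustering lets us do.

First, pick a cluster: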

# TODO: Pick a cluster ID from the clusters above
cluster_number = 11

# Let's filter to only see the region of the dataset with the most number of values 
n_users = 75
n_movies = 300
cluster = clustered[clustered.group == cluster_number].drop(['index', 'group'], axis=1)

cluster = helper.sort_by_rating_density(cluster, n_movies, n_users)
helper.draw_movies_heatmap(cluster, axis_labels=False)

The actual ratings in the cluster look like this:

cluster.fillna('').head()

	Forrest Gump (1994)	Sixteen Candles (1984)	Wizard of Oz, The (1939)	Mummy, The (1999)	Congo (1995)	First Wives Club, The (1996)	West Side Story (1961)	Sting, The (1973)	Sound of Music, The (1965)	Stand by Me (1986)	...	What's Eating Gilbert Grape (1993)	When Harry Met Sally... (1989)	North by Northwest (1959)	Breakfast Club, The (1985)	Casablanca (1942)	Big Lebowski, The (1998)	Mr. Holland's Opus (1995)	Nightmare Before Christmas, The (1993)	Broken Arrow (1996)	Four Weddings and a Funeral (1994)
3	3.0	5.0	3.0	4.0	1.0	4.0	4.0	5.0	3.0	5.0	...	3		5	4	4	5	1	5	3	3
2	5.0	5.0	4.0	4.0	3.0	4.0	4.0	3.0	5.0	5.0	...	5	5		5	3	2		2	4
0	5.0	4.0	4.0	4.0	1.0	4.0	4.0	5.0	4.0	4.0	...	3	5	4	4		4	3			4
1	5.0	3.5	2.0	2.5	0.5	1.5	5.0	3.5	4.5	2.0	...		4	3		3		2.5	0.5	3	5

4 rows × 300 columns

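Pick a blank cell in the table. It is blank because the user has not rated that movie.

Since the user belongs to a cluster of users who seem to share similar tastes, we can compute the movie's average rating within this cluster; the result is a reasonable basis for predicting whether she would like the movie "Forrest Gump (1994)".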

# TODO: Fill in the name of the column/movie. e.g. 'Forrest Gump (1994)'
movie_name = "Forrest Gump (1994)"

cluster[movie_name].mean()

4.5

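That is our prediction of how she would rate the movie.

4.2.4 Making recommendations from the clusters

Let's recap the previous step. We used k-means to cluster users by their ratings. This gave us clusters of users with similar ratings, who therefore usually have similar taste in movies. Building on that, when a user has not rated a movie, we average the ratings of all the other users in the cluster, and that average is our guess at how much the user would like the movie.

By the same logic, if we compute the average score of every movie in the cluster, we can tell how much this "taste cluster" likes each movie in the dataset.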

# The average rating of 20 movies as rated by the users in the cluster
cluster.mean().head(20)

Forrest Gump (1994)                              4.500
Sixteen Candles (1984)                           4.375
Wizard of Oz, The (1939)                         3.250
Mummy, The (1999)                                3.625
Congo (1995)                                     1.375
First Wives Club, The (1996)                     3.375
West Side Story (1961)                           4.250
Sting, The (1973)                                4.125
Sound of Music, The (1965)                       4.125
Stand by Me (1986)                               4.000
Heathers (1989)                                  3.375
Victor/Victoria (1982)                           4.250
Sex, Lies, and Videotape (1989)                  3.375
Pulp Fiction (1994)                              4.250
Outbreak (1995)                                  2.125
Jaws (1975)                                      3.750
Who Framed Roger Rabbit? (1988)                  4.125
Big (1988)                                       4.250
Romy and Michele's High School Reunion (1997)    4.000
Forget Paris (1995)                              2.750
dtype: float64

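This becomes very useful to us, because we can now use it as a recommendation engine that helps users discover movies they are likely to enjoy.

When a user logs in to our app, we can now show them movies that match their taste. The recommendation is to pick the cluster's highest-rated movies that the user has not rated yet.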
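
The recommendation rule described above can be sketched in a few lines of pandas. The function recommend_for_user, the toy table and the movie names are hypothetical; the sketch assumes a users-by-movies table for one cluster, shaped like the cluster dataframe used earlier.

```python
import numpy as np
import pandas as pd

def recommend_for_user(cluster_ratings, user_id, top_n=3):
    """Rank the movies this user has not rated by their mean rating in the cluster."""
    user_row = cluster_ratings.loc[user_id]
    unrated = user_row[user_row.isna()].index
    # mean() skips NaN, so only users who actually rated a movie count.
    cluster_means = cluster_ratings[unrated].mean()
    return cluster_means.sort_values(ascending=False).head(top_n)

# Toy cluster: rows are users, columns are movies, NaN means not rated.
toy_cluster = pd.DataFrame(
    {'Movie A': [5.0, 4.0, np.nan],
     'Movie B': [2.0, np.nan, np.nan],
     'Movie C': [4.5, 3.5, 4.0],
     'Movie D': [np.nan, 5.0, np.nan]},
    index=[0, 1, 2])

suggestions = recommend_for_user(toy_cluster, user_id=2)
print(suggestions)
```

In practice it may also help to require a minimum number of ratings per movie, so that a single enthusiastic rating does not dominate the ranking.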

Unsupervised Learning | KMeans with Sklearn: Clustering Movie Ratings

