Machine Learning: Clustering Implementations

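Definition of clustering: given a large data set without labels, partition it into several groups according to the intrinsic similarity of the data, so that samples within a group are highly similar and samples in different groups are not.
Clustering therefore has to answer two questions:

How to define similarity

How to choose the number of clusters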

1. K-means

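Let the input samples be $S = \{x_1, x_2, x_3, \ldots, x_m\}$. The algorithm proceeds as follows:

Choose $k$ initial cluster centers $\mu_1, \mu_2, \mu_3, \ldots, \mu_k$.

For each sample $x_i$, assign it to the cluster whose center is nearest, that is:
$label_i = \arg\min_{1 \le j \le k} \lVert x_i - \mu_j \rVert$
Update each cluster center to the mean of all samples assigned to that cluster:
$\mu_j = \frac{1}{|c_j|} \sum_{i \in c_j} x_i$
Repeat the last two steps until the cluster centers move by less than some threshold.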
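Intuitively, clustering works like this: first decide how many clusters there are, and the computer then picks that many initial cluster centers. These are chosen at random, but their positions matter a great deal, and Kmeans++ is used later to choose them. Each point is assigned to its nearest center, and the mean of the samples in each cluster becomes the new center. This is iterated until the stopping condition is met.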
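The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the scikit-learn implementation: the function name `kmeans`, its parameters (`n_iter`, `tol`, `seed`) and the toy data are all assumptions made for this example.

```python
import numpy as np


def kmeans(x, k, n_iter=100, tol=1e-6, seed=0):
    """Plain K-means with randomly chosen initial centers (illustrative)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest center.
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its samples
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers have essentially stopped moving.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


# Two well-separated synthetic groups (hypothetical data for this sketch).
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])
labels, centers = kmeans(points, k=2)
print(centers)
```

On data this cleanly separated, the two returned centers should sit close to the two group means; on harder data the result depends on the random initial centers.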

K-means clustering implementation:
Cluster the following three data sets:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

# Three synthetic 2-D data sets: Gaussian blobs, two half-moons, two concentric circles.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.60, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# Fit K-means with two clusters on each data set in turn.
kmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
label = kmeans.fit_predict(x)
label2 = kmeans.fit_predict(x2)
label3 = kmeans.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

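Clustering result: on the first data set the clusters match the original labels, so K-means works well there; on the other two data sets it performs poorly.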
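The visual comparison can also be put into numbers. The sketch below scores each K-means result against the generator's true labels with the adjusted Rand index from `sklearn.metrics`; choosing this particular metric is an assumption of this example, not something the original experiment used.

```python
from sklearn import cluster, metrics
from sklearn.datasets import make_blobs, make_circles, make_moons

# The same three data sets as in the K-means script.
datasets = {
    "blobs": make_blobs(n_samples=200, centers=2, cluster_std=0.60, random_state=0),
    "moons": make_moons(n_samples=200, noise=0.05, random_state=0),
    "circles": make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4),
}

ari = {}
for name, (data, y_true) in datasets.items():
    pred = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    # ARI is 1.0 for a perfect match and close to 0 for a random labelling.
    ari[name] = metrics.adjusted_rand_score(y_true, pred)
    print(f"{name}: ARI = {ari[name]:.3f}")
```

A score near 1 for the blobs and clearly lower scores for the moons and circles would agree with what the plots show.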

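Drawbacks of K-means

The number of clusters K must be given in advance, but in practice K is very hard to estimate; often it is not known beforehand how many clusters suit a given data set.
K-means needs its initial cluster centers to be chosen, and different initial centers can lead to completely different clustering results (the Kmeans++ algorithm addresses this).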
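One common heuristic for the first drawback is to try several values of K and compare a quality score. The sketch below uses the silhouette score from scikit-learn; both the choice of score and the four-blob toy data are assumptions made for illustration and are not part of the experiments above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data generated from four blobs.
data, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(data, labels)
    print(f"K = {k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("K with the highest silhouette score:", best_k)
```

The score is only a guide: it favors compact, roughly round clusters, so it inherits the same bias as K-means itself.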

2. K-means++

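As noted above, K-means requires the number of clusters K to be fixed in advance, which is hard to estimate in practice, and it depends on the initial cluster centers: different initial centers can lead to completely different clustering results.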

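The idea behind K-means++

The basic idea of k-means++ seeding is that the initial cluster centers should be as far from each other as possible.

1. Pick one point at random from the input data as the first cluster center.
2. For every point x in the data set, compute D(x), its distance to the nearest cluster center already chosen.
3. Choose a new data point as the next cluster center, such that points with a larger D(x) are more likely to be chosen.
4. Repeat steps 2 and 3 until k cluster centers have been chosen.

For example, to choose two initial centers that are as far apart as possible: if point 6 in the figure is the first center, then point 2, the point farthest from it, is chosen as the center of the second cluster. This continues until the stopping condition is met.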
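The seeding procedure can be sketched as follows. The usual formulation samples the next center with probability proportional to D(x) squared; that weighting is an assumption beyond what the text above states, and the helper name `kmeans_pp_init` and the toy data are hypothetical.

```python
import numpy as np


def kmeans_pp_init(x, k, seed=0):
    """Pick k initial centers by D(x)^2-weighted sampling (illustrative)."""
    rng = np.random.default_rng(seed)
    # Step 1: the first center is a uniformly random sample.
    centers = [x[rng.integers(len(x))]]
    for _ in range(1, k):
        # Step 2: squared distance from each sample to its nearest chosen center.
        diffs = x[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2,
        # so far-away points are favored and already chosen points have weight 0.
        probs = d2 / d2.sum()
        centers.append(x[rng.choice(len(x), p=probs)])
    return np.array(centers)


# Three synthetic groups (hypothetical data for this sketch).
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(30, 2))
    for center in [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
])
centers = kmeans_pp_init(points, k=3)
print(centers)
```

Sampling, rather than always taking the single farthest point, makes the choice less sensitive to outliers. In scikit-learn, `KMeans` uses `init="k-means++"` by default.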

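3. MeanShift

MeanShift does not need initial cluster centers; only the size of the window (the bandwidth) has to be set. The bandwidth strongly affects the clustering result.
MeanShift implementation: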

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

# x2 is the moons data and x3 the circles data, matching the plot titles below.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0)
# With no bandwidth given, MeanShift estimates one from the data.
model = cluster.MeanShift()
label = model.fit_predict(x)
label2 = model.fit_predict(x2)
label3 = model.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

As the resulting plot shows, MeanShift performs poorly on the latter two data sets.
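Because the bandwidth drives the result, it can help to see how the number of clusters changes with it. The sketch below derives bandwidths with scikit-learn's `estimate_bandwidth` at a few quantiles; the quantile values are arbitrary choices for illustration.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)

# estimate_bandwidth derives a bandwidth from nearest-neighbor distances;
# a larger quantile gives a larger window.
n_clusters = {}
for quantile in (0.05, 0.3, 0.8):
    bandwidth = estimate_bandwidth(data, quantile=quantile, random_state=0)
    labels = MeanShift(bandwidth=bandwidth).fit_predict(data)
    n_clusters[quantile] = len(set(labels))
    print(f"quantile={quantile}: bandwidth={bandwidth:.2f}, "
          f"clusters={n_clusters[quantile]}")
```

A small window tends to split the data into many small clusters, while a large one tends to merge everything.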

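4. Hierarchical clustering

Hierarchical clustering methods:

Agglomerative hierarchical clustering: the AGNES algorithm (bottom-up)
Divisive hierarchical clustering: the DIANA algorithm (top-down)
Advantages:

Distances and similarity rules are easy to define, with few restrictions;

The number of clusters does not need to be specified in advance;

The hierarchical relationships between clusters can be discovered;

Clusters of other (non-spherical) shapes can be found;

Drawbacks:

1. The computational complexity is high;
2. Outliers can have a large effect;
3. The algorithm tends to produce chain-shaped clusters.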
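To illustrate the bottom-up approach and the point that the number of clusters need not be fixed beforehand, the sketch below builds the merge tree once with SciPy and then cuts it at several levels. Using `scipy.cluster.hierarchy` with Ward linkage is an assumption of this example; the script in this section uses scikit-learn's `AgglomerativeClustering` instead.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.5, random_state=0)

# Bottom-up (agglomerative) merging. Each row of `merges` records one merge:
# the two clusters joined, their distance, and the size of the new cluster.
merges = linkage(data, method="ward")
print(merges.shape)  # (n_samples - 1, 4)

# The same tree can be cut at different levels without redoing the merging.
counts = {}
for n in (2, 3, 4):
    labels = fcluster(merges, t=n, criterion="maxclust")
    counts[n] = len(np.unique(labels))
    print(n, "clusters requested ->", counts[n], "found")
```

The `maxclust` criterion returns at most the requested number of clusters, so the tree itself, not a fixed K, carries the structure.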

Hierarchical clustering implementation:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

# x2 is the moons data and x3 the circles data, matching the plot titles below.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0)
# Agglomerative (bottom-up) clustering, stopped when two clusters remain.
model = cluster.AgglomerativeClustering(n_clusters=2)
label = model.fit_predict(x)
label2 = model.fit_predict(x2)
label3 = model.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

5. Density-based clustering

Implementation:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

# x2 is the moons data and x3 the circles data, matching the plot titles below.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0)
# DBSCAN with its default parameters (eps=0.5, min_samples=5).
model = cluster.DBSCAN()
label = model.fit_predict(x)
label2 = model.fit_predict(x2)
label3 = model.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

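Density clustering result:
With the parameters reset as follows, DBSCAN handles the latter two data sets well:

# Replace the DBSCAN() constructor call in the script above with this line.
# eps is the neighborhood radius; min_samples=5 means a neighborhood must contain at least 5 samples.
model = cluster.DBSCAN(eps=0.3, min_samples=5)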
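The effect of `eps` can be checked without plotting by counting clusters and noise points. The sketch below compares the default `eps=0.5` with `eps=0.3` on the moons data; the bookkeeping in the `summary` dictionary is purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

data, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

summary = {}
for eps in (0.5, 0.3):  # 0.5 is the scikit-learn default
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(data)
    # DBSCAN marks noise points with the label -1, so exclude it from the count.
    n_found = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_found, n_noise)
    print(f"eps={eps}: clusters={n_found}, noise points={n_noise}")
```

If the neighborhood radius is larger than the gap between the two moons, their points become density-connected and DBSCAN merges them into one cluster, which is why a smaller `eps` can separate them.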

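6. AP clustering

The idea behind AP (affinity propagation) clustering:

All samples are treated as nodes of a network, and the cluster centers are computed by passing messages along the edges of the network. Two kinds of messages are passed between nodes during clustering: responsibility and availability. The AP algorithm iteratively updates the responsibility and availability of every point until m high-quality exemplars (similar to centroids) emerge, and the remaining data points are assigned to the corresponding clusters.

Exemplar: a cluster center, the counterpart of the centroid in K-Means. AP does not need the number of clusters to be specified in advance; instead it treats every data point as a potential cluster center.

Similarity: the similarity between data point i and point j is written s(i,j) and measures how well point j would serve as the cluster center of point i. It is usually computed from the Euclidean distance and negated, so that all similarities are negative; a larger similarity therefore means the points are closer, which simplifies the later comparisons.

Preference: the preference of data point i, written p(i) or s(i,i), measures how suitable point i is as a cluster center. The diagonal entries s(k,k) of the similarity matrix S are the criterion for whether point k can become a cluster center: the larger the value, the more likely the point is to become one. It is usually set to the median of the similarities (the default in Scikit-learn). The preference p influences the number of clusters. If every data point is considered equally likely to be a cluster center, p should take the same value for all points. Setting p to the median of the input similarities gives a moderate number of clusters; setting it to the minimum gives fewer clusters.

Responsibility: r(i,k) describes how well suited point k is to be the cluster center of data point i.

Availability: a(i,k) describes how appropriate it is for point i to choose point k as its cluster center.

Damping factor: mainly helps the iterations converge.

In practice, the two most important parameters (which also have to be set by hand) are the preference and the damping factor. The former determines the number of clusters, with larger values giving more clusters; the latter controls how well the algorithm converges.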
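The roles of the similarity matrix and the preference can be sketched as follows. The example builds s(i,j) as the negative squared Euclidean distance and runs scikit-learn's `AffinityPropagation` on the precomputed matrix with two preference values; the toy data, the damping value of 0.9 and the variable names are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

data, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.5, random_state=0)

# s(i, j): negative squared Euclidean distance, so closer points score higher.
similarity = -euclidean_distances(data, squared=True)
off_diagonal = similarity[~np.eye(len(data), dtype=bool)]

found = {}
for name, preference in (("median", np.median(off_diagonal)),
                         ("minimum", off_diagonal.min())):
    model = AffinityPropagation(affinity="precomputed", preference=preference,
                                damping=0.9, max_iter=500, random_state=0)
    labels = model.fit_predict(similarity)
    # One exemplar per cluster; an empty result means the run did not converge.
    found[name] = len(model.cluster_centers_indices_)
    print(f"preference={name} ({preference:.1f}): {found[name]} exemplars")
```

Comparing the two printed counts shows how strongly the preference steers the number of clusters.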

AP clustering implementation:

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

# x2 is the moons data and x3 the circles data, matching the plot titles below.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# Default preference (median of the similarities); a lower value gives fewer clusters:
# model = cluster.AffinityPropagation(preference=-30)
model = cluster.AffinityPropagation()
label = model.fit_predict(x)
label2 = model.fit_predict(x2)
label3 = model.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

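Note: the mathematics behind spectral clustering and Gaussian mixture models, together with the EM algorithm used to solve them, will be derived in detail in a later post.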

7. Spectral clustering

import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.datasets import make_blobs, make_circles, make_moons

# x2 is the moons data and x3 the circles data, matching the plot titles below.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# Two clusters, with the affinity graph built from nearest neighbors.
model = cluster.SpectralClustering(n_clusters=2, affinity="nearest_neighbors")
label = model.fit_predict(x)
label2 = model.fit_predict(x2)
label3 = model.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()

8. Gaussian mixture model

Clustering with a Gaussian mixture model:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_circles, make_moons
from sklearn.mixture import GaussianMixture

# x2 is the moons data and x3 the circles data, matching the plot titles below.
x, y_true = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
x2, y2_true = make_moons(n_samples=200, noise=0.05, random_state=0)
x3, y3_true = make_circles(n_samples=200, noise=0.05, random_state=0, factor=0.4)
# Fit a two-component Gaussian mixture and label each point by its most likely component.
gmm = GaussianMixture(n_components=2, random_state=0)
label = gmm.fit_predict(x)
label2 = gmm.fit_predict(x2)
label3 = gmm.fit_predict(x3)
plt.subplot(2, 2, 1)
plt.scatter(x[:, 0], x[:, 1], c=label)
plt.title("blobs")
plt.axis('off')

plt.subplot(2, 2, 2)
plt.scatter(x2[:, 0], x2[:, 1], c=label2)
plt.title("moons")
plt.axis('off')

plt.subplot(2, 2, 3)
plt.scatter(x3[:, 0], x3[:, 1], c=label3)
plt.title("circles")
plt.axis('off')
plt.show()
