Unsupervised Learning: Clustering (the k-means Algorithm)

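Unsupervised learning is a machine learning paradigm that builds models from unlabeled data.
Application areas of unsupervised learning:
- Data mining
- Medical imaging
- Stock market analysis
- Computer vision
- Market analysis
The most common form of unsupervised learning is clustering.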
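Definition of clustering: given a large dataset with no known labels, clustering partitions the data into several groups according to the intrinsic similarity of the data, so that similarity is high among items in the same group and low between items in different groups.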
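The basic idea of clustering:
Given a dataset of n objects, a partitioning method constructs k partitions of the data, each partition representing one cluster, with k ≤ n. In other words, the data is divided into k clusters, and these k partitions satisfy the following conditions:
1. Each cluster contains at least one object.
2. Each object belongs to exactly one cluster.
Basic idea: for a given k, the algorithm first produces an initial partition, then repeatedly revises the partition by iteration so that each revised partition is better than the previous one.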
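The two partition conditions can be checked mechanically. The sketch below is only an illustration: the helper name `is_valid_partition` and the convention of encoding a partition as one integer label per object are assumptions, not part of the original text.

```python
import numpy as np

def is_valid_partition(labels, k):
    """Return True if `labels` puts every object in exactly one of k non-empty clusters."""
    labels = np.asarray(labels, dtype=int)
    # One label per object means each object belongs to exactly one cluster,
    # provided the label is a legal cluster index in 0..k-1.
    in_range = (labels >= 0) & (labels < k)
    # Count how many objects fall in each cluster; every count must be positive.
    counts = np.bincount(labels[in_range], minlength=k)
    return bool(in_range.all() and (counts > 0).all())

print(is_valid_partition([0, 1, 1, 2, 0], k=3))  # True
print(is_valid_partition([0, 0, 0, 2, 2], k=3))  # False: cluster 1 is empty
```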
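The K-means algorithm
1. A classic clustering algorithm and one of the top ten classic algorithms in data mining. K-means, also called k-averages, is one of the most widely used clustering algorithms and also serves as the basis for other clustering algorithms.
2. The algorithm takes a parameter k and partitions the n input data objects into k clusters so that objects in the same cluster are highly similar, while objects in different clusters have low similarity.
3. Algorithm idea: cluster around k points in space as centers, assigning each object to the center closest to it. The centers are updated iteratively until they stop changing, which yields a locally optimal clustering (not necessarily the global best).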
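4. Algorithm description:

      (1) Choose suitable initial centers for the k clusters.
      (2) In each iteration, compute the distance from every sample to each of the k centers and assign the sample to the cluster whose center is nearest.
      (3) Update each cluster's center, for example as the mean of its samples.
      (4) If none of the k centers changes after the updates in (2) and (3), stop; otherwise continue iterating.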

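5. Algorithm flow:

        Input: k, data[n];
      (1) Choose k initial centers, e.g. c[0]=data[0], …, c[k-1]=data[k-1];
      (2) For each of data[0], …, data[n-1], compare it with c[0], …, c[k-1]; if it is closest to c[i], label it i;
      (3) For all points labeled i, recompute c[i] = {sum of all data[j] labeled i} / (number of points labeled i);
      (4) Repeat (2) and (3) until the change in every c[i] is smaller than a given threshold.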
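The flow above can be sketched directly in NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the Euclidean distance, the `tol` and `max_iter` parameters, and the rule of keeping the old center when a cluster becomes empty are all assumptions added for the example.

```python
import numpy as np

def kmeans(data, k, tol=1e-6, max_iter=100):
    """Plain k-means following steps (1)-(4). Returns (centers, labels)."""
    data = np.asarray(data, dtype=float)
    # (1) Use the first k points as the initial centers.
    centers = data[:k].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # (2) Label each point with the index of its nearest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) Recompute each center as the mean of the points labeled with it.
        new_centers = centers.copy()
        for i in range(k):
            members = data[labels == i]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                new_centers[i] = members.mean(axis=0)
        # (4) Stop once no center has moved by more than tol.
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < tol:
            break
    return centers, labels

points = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
                   [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(points, k=2)
print(labels)   # [0 1 0 0 1 1]
print(centers)
```

Taking the first k points as centers mirrors step (1) but makes the result depend on the order of the data; library implementations usually offer randomized or k-means++ initialization instead.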

The kmeans code below comes from the book 《Python機器學習經典實例》 (Python Machine Learning Cookbook):

utilities.py

import numpy as np
import matplotlib.pyplot as plt
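# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides cross_val_score
from sklearn import model_selection as cross_validation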

# Load multivar data in the input file
def load_data(input_file):
    X = []
    with open(input_file, 'r') as f:
        for line in f.readlines():
            data = [float(x) for x in line.split(',')]
            X.append(data)

    return np.array(X)

# Plot the classifier boundaries on input data
def plot_classifier(classifier, X, y, title='Classifier boundaries', annotate=False):
    # define ranges to plot the figure 
    x_min, x_max = min(X[:, 0]) - 1.0, max(X[:, 0]) + 1.0
    y_min, y_max = min(X[:, 1]) - 1.0, max(X[:, 1]) + 1.0

    # denotes the step size that will be used in the mesh grid
    step_size = 0.01

    # define the mesh grid
    x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size), np.arange(y_min, y_max, step_size))

    # compute the classifier output
    mesh_output = classifier.predict(np.c_[x_values.ravel(), y_values.ravel()])

    # reshape the array
    mesh_output = mesh_output.reshape(x_values.shape)

    # Plot the output using a colored plot 
    plt.figure()

    # Set the title
    plt.title(title)

    # choose a color scheme you can find all the options 
    # here: http://matplotlib.org/examples/color/colormaps_reference.html
    plt.pcolormesh(x_values, y_values, mesh_output, cmap=plt.cm.Set1)

    # Overlay the training points on the plot 
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black', linewidth=2, cmap=plt.cm.Set1)

    # specify the boundaries of the figure
    plt.xlim(x_values.min(), x_values.max())
    plt.ylim(y_values.min(), y_values.max())

    # specify the ticks on the X and Y axes
    plt.xticks(())
    plt.yticks(())

    if annotate:
        for x, y in zip(X[:, 0], X[:, 1]):
            # Full documentation of the function available here: 
            # http://matplotlib.org/api/text_api.html#matplotlib.text.Annotation
            plt.annotate(
                '(' + str(round(x, 1)) + ',' + str(round(y, 1)) + ')',
                xy = (x, y), xytext = (-15, 15), 
                textcoords = 'offset points', 
                horizontalalignment = 'right', 
                verticalalignment = 'bottom', 
                bbox = dict(boxstyle = 'round,pad=0.6', fc = 'white', alpha = 0.8),
                arrowprops = dict(arrowstyle = '-', connectionstyle = 'arc3,rad=0'))

# Print performance metrics
def print_accuracy_report(classifier, X, y, num_validations=5):
    accuracy = cross_validation.cross_val_score(classifier, 
            X, y, scoring='accuracy', cv=num_validations)
    print ("Accuracy: " + str(round(100*accuracy.mean(), 2)) + "%")

    f1 = cross_validation.cross_val_score(classifier, 
            X, y, scoring='f1_weighted', cv=num_validations)
    print ("F1: " + str(round(100*f1.mean(), 2)) + "%")

    precision = cross_validation.cross_val_score(classifier, 
            X, y, scoring='precision_weighted', cv=num_validations)
    print ("Precision: " + str(round(100*precision.mean(), 2)) + "%")

    recall = cross_validation.cross_val_score(classifier, 
            X, y, scoring='recall_weighted', cv=num_validations)
    print ("Recall: " + str(round(100*recall.mean(), 2)) + "%")

kmeans.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans

import utilities

# Load data
data = utilities.load_data('data_multivar.txt')
num_clusters = 4  # number of clusters

# Plot the input data
plt.figure()
plt.scatter(data[:,0], data[:,1], marker='o', 
        facecolors='none', edgecolors='k', s=30)
x_min, x_max = min(data[:, 0]) - 1, max(data[:, 0]) + 1
y_min, y_max = min(data[:, 1]) - 1, max(data[:, 1]) + 1
plt.title('Input data')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

# Train the model
kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)
kmeans.fit(data)

# Visualize the cluster boundaries
# Step size of the mesh
step_size = 0.01

# Plot the boundaries
x_min, x_max = min(data[:, 0]) - 1, max(data[:, 0]) + 1
y_min, y_max = min(data[:, 1]) - 1, max(data[:, 1]) + 1
x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size), np.arange(y_min, y_max, step_size))

# Predict labels for all points in the mesh
predicted_labels = kmeans.predict(np.c_[x_values.ravel(), y_values.ravel()])

# Plot the results
predicted_labels = predicted_labels.reshape(x_values.shape)
plt.figure()
plt.clf()
plt.imshow(predicted_labels, interpolation='nearest',
           extent=(x_values.min(), x_values.max(), y_values.min(), y_values.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.scatter(data[:,0], data[:,1], marker='o', 
        facecolors='none', edgecolors='k', s=30)

# Plot the cluster centers on the figure
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:,0], centroids[:,1], marker='o', s=200, linewidths=3,
        color='k', zorder=10, facecolors='black')
x_min, x_max = min(data[:, 0]) - 1, max(data[:, 0]) + 1
y_min, y_max = min(data[:, 1]) - 1, max(data[:, 1]) + 1
plt.title('Centroids and boundaries obtained using KMeans')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

Running the code produces the following figure:
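Besides looking at the plot, the clustering can be summarized numerically. The sketch below is an addition, not part of the book's listing: it uses synthetic blobs from `make_blobs` as a stand-in for `data_multivar.txt` (whose contents are not shown here), and reports the fitted model's `inertia_` together with the silhouette score from `sklearn.metrics`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for data_multivar.txt: four well-separated 2-D blobs.
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

model = KMeans(init='k-means++', n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(X)

# inertia_ is the within-cluster sum of squared distances to the nearest center.
print("Inertia:", round(model.inertia_, 2))

# The silhouette score lies in [-1, 1]; values near 1 mean compact, well-separated clusters.
score = silhouette_score(X, labels)
print("Silhouette:", round(score, 3))
```

Inertia always decreases as k grows, so it is mainly useful for comparing runs with the same k; the silhouette score can also be compared across different values of k.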

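Summary of the K-means clustering method
Advantages:

- A classic algorithm for clustering problems; simple and fast
- Scalable and efficient on large datasets
- Works well when the resulting clusters are dense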

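Disadvantages:

- It can only be used when the mean of a cluster is defined, which may rule out some applications (for example, purely categorical data)
- k (the number of clusters to generate) must be given in advance, and the algorithm is sensitive to initialization: different initial centers can lead to different results
- Not suitable for discovering non-convex clusters or clusters of very different sizes
- Sensitive to noise and outliers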
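The sensitivity to initial centers can be seen on a tiny hand-made dataset. The following is a sketch under stated assumptions: the helper `kmeans_from` and the four-point rectangle are invented for illustration, and the sum of squared errors (SSE) is used as the quality measure.

```python
import numpy as np

def kmeans_from(data, centers, max_iter=100):
    """Run k-means from the given initial centers; return (centers, sse)."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (unchanged if it has none).
        new_centers = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(len(centers))
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Sum of squared distances from each point to its assigned center.
    sse = float(((data - centers[labels]) ** 2).sum())
    return centers, sse

# Four corners of a 4 x 1 rectangle: the natural clusters are "left" and "right".
data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

_, sse_good = kmeans_from(data, centers=[[0.0, 0.0], [4.0, 0.0]])
_, sse_bad = kmeans_from(data, centers=[[0.0, 0.0], [0.0, 1.0]])
print(sse_good)  # 1.0  (left / right split)
print(sse_bad)   # 16.0 (top / bottom split, a poor local optimum)
```

Both runs converge, but the second is stuck in a worse solution. This is one reason the `KMeans` call in the listing uses `init='k-means++'` and `n_init=10`, which restarts from several initializations and keeps the best run.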

It can serve as a building block for other clustering methods, such as spectral clustering.
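As a rough illustration of both the non-convex limitation and the use of k-means inside another method, the sketch below compares `KMeans` with scikit-learn's `SpectralClustering`, which by default (`assign_labels='kmeans'`) runs k-means on a spectral embedding of the data. The two-moons dataset, the `nearest_neighbors` affinity, and the parameter values are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a standard non-convex test case.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Plain k-means in the original coordinates.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering: builds a nearest-neighbor graph, embeds it,
# then clusters the embedding with k-means.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0)
sc_labels = sc.fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_spectral = adjusted_rand_score(y_true, sc_labels)
print("k-means ARI: ", round(ari_kmeans, 3))
print("spectral ARI:", round(ari_spectral, 3))
```

On this kind of data k-means tends to cut each moon in half, because it separates clusters with straight boundaries, while the graph-based embedding lets the same k-means step recover the two curved groups.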
