Learning the KNN Algorithm

KNN stands for K-Nearest Neighbor. It is one of the simpler machine learning algorithms.
GitHub link: github

I. A Simple Example

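Let's start with a simple example to get a feel for the KNN algorithm.

Suppose we want to classify movies by genre. For each movie we count the number of fight scenes and the number of kissing scenes (other indicators could be counted too, but we will not list them here), as shown in the table below:

It is easy for us to tell that Wolf Warrior (《戰狼》), Operation Red Sea (《紅海行動》) and Mission: Impossible 6 (《碟中諜6》) are action movies, while The Ex-File 3 (《前任三》), Love Off the Cuff (《春嬌救志明》) and Titanic (《泰塔尼克號》) are romance movies. But is there a way for a machine to learn this classification rule, so that a new movie can be classified automatically?

We can treat the number of fights as the x-axis and the number of kisses as the y-axis, and mark these movies on a two-dimensional plane.

As shown in the figure below, for an unknown movie A with coordinates (x, y), we look at which movies are closest to A. Whichever class most of those movies belong to is the class assigned to A.

In practice we also need to choose a value K, the number of movies nearest to A that we take into account.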

Code implementation

# Import packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# film_train_data holds the movie features [fights, kisses]; film_train_labels holds the labels (0 = action, 1 = romance)
film_data =[[100,5],
            [95,3],
            [105,31],
            [2,59],
            [3,60],
            [10,80]]
film_labels = [0,0,0,1,1,1]

film_train_data = np.array(film_data)
film_train_labels = np.array(film_labels)

# Plot the training data
plt.scatter(film_train_data[film_train_labels==0,0],film_train_data[film_train_labels==0,1],color="g")

plt.scatter(film_train_data[film_train_labels==1,0],film_train_data[film_train_labels==1,1],color="r")

# Add movie A
film_data_A = np.array([5,70])

# Plot it together with the training data
plt.scatter(film_train_data[film_train_labels==0,0],film_train_data[film_train_labels==0,1],color="g")
plt.scatter(film_train_data[film_train_labels==1,0],film_train_data[film_train_labels==1,1],color="r")
plt.scatter(film_data_A[0],film_data_A[1],color="b")

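How KNN works

The proverb "one who stays near vermilion turns red, one who stays near ink turns black" sums up how KNN works. The whole computation has three steps:

1. Compute the distance between the object to be classified and every other object;

2. Find the k nearest neighbors by distance;

3. Among those k nearest neighbors, whichever class is the most common is the class assigned to the object.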

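Choosing K

Looking at the classification process, the choice of K clearly matters a lot. So which value of K is appropriate?

If K is small, the unclassified object is labeled only by the neighbors that are very close to it. The problem is that if one of those neighbors is a noise point, the object will be misclassified as well, so KNN overfits.

If K is large, points that are far away also influence the classification of the unknown object. The benefit is robustness to noise, but the drawback is just as clear: the model underfits, that is, it fails to really separate the unclassified object into its class.

So K should come out of experiments rather than being fixed in advance. In engineering practice, K is usually selected by cross-validation.

The idea of cross-validation is to use most of the samples as the training set and the remaining small part for prediction, in order to check the accuracy of the classifier. For KNN we usually try K values within a fairly small range and pick the one with the highest accuracy on the validation set.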
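
The selection procedure described above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses scikit-learn's `cross_val_score` with 5 folds on the Iris dataset (both appear later in this article in other roles), and the candidate range 1 to 10 is an arbitrary choice, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate K values (an illustrative range chosen for this sketch).
candidate_ks = range(1, 11)

mean_scores = {}
for k in candidate_ks:
    clf = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: each fold is held out once for validation.
    scores = cross_val_score(clf, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Keep the K with the highest mean validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

Averaging over several folds makes the choice of K less dependent on one particular split than a single train/validation split would be.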

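How to compute distance

Another important part of KNN is the distance metric. The distance between two sample points represents how similar the two samples are: the larger the distance, the greater the difference; the smaller the distance, the greater the similarity.

There are five common ways to compute distance:

1. Euclidean distance;

2. Manhattan distance;

3. Minkowski distance;

4. Chebyshev distance;

5. Cosine distance.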

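1 Euclidean distance

Euclidean distance is the most commonly used distance formula. In two-dimensional space, the Euclidean distance between two points (x1, y1) and (x2, y2) is:

d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 )

In three-dimensional space, the Euclidean distance between two points (x1, y1, z1) and (x2, y2, z2) is:

d = sqrt( (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 )

By the same reasoning, the Euclidean distance between two points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) in n-dimensional space is:

d = sqrt( (x11 - x21)^2 + (x12 - x22)^2 + … + (x1n - x2n)^2 )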

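2 Manhattan distance

Manhattan distance is often used in grid-like geometric settings. In the figure below, the green straight line is the Euclidean distance between two points, while the red and yellow lines are Manhattan distances between them. The Manhattan distance is the sum of the absolute differences of the two points' coordinates along each axis. As a formula:

d(i, j) = |xi - xj| + |yi - yj|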

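3 Minkowski distance

Minkowski distance is not a single distance but a family of distance definitions. In an n-dimensional vector space, the Minkowski distance between a(x11, x12, …, x1n) and b(x21, x22, …, x2n) is defined as:

d = ( |x11 - x21|^p + |x12 - x22|^p + … + |x1n - x2n|^p )^(1/p)

Here p is the order of the distance, a parameter we choose (it is not the dimension of the space). When p = 1 this is the Manhattan distance; when p = 2 it is the Euclidean distance; as p -> ∞ it becomes the Chebyshev distance.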

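4 Chebyshev distance

How is the Chebyshev distance computed? The Chebyshev distance between two points is the maximum of the absolute differences of their coordinates. In mathematical notation:

max( |x1 - x2|, |y1 - y2| )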

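5 Cosine distance

Cosine distance is based on the angle between two vectors. It measures the difference in direction and is insensitive to magnitude. When comparing interests, the angle matters more than the absolute distance, so cosine distance can be used to measure how users' interests in content differ. For example, when we search for a keyword in a search engine, it also recommends related searches; such recommended keywords can be computed using cosine distance.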
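
To make the five definitions concrete, here is a small NumPy sketch. The function names are made up for this illustration, and defining cosine distance as 1 minus the cosine similarity is one common convention that is assumed here rather than taken from the text above.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def euclidean(a, b):
    # Minkowski distance with p = 2
    return minkowski(a, b, 2)

def manhattan(a, b):
    # Minkowski distance with p = 1
    return minkowski(a, b, 1)

def chebyshev(a, b):
    # Largest absolute coordinate difference (the limit of Minkowski as p grows)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.max(np.abs(a - b)))

def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention); undefined for all-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = [0, 0], [3, 4]
print(euclidean(a, b))   # 5.0
print(manhattan(a, b))   # 7.0
print(chebyshev(a, b))   # 4.0
```

Calling `minkowski(a, b, 50)` gives a value very close to `chebyshev(a, b)`, which illustrates the limiting behaviour for large p.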

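Extensions of KNN

a. KD tree

As we saw above, KNN spends most of its time computing distances between sample points. To reduce the number of distance computations and speed up the search, the KD tree (short for K-dimensional tree; this K is the number of dimensions, not the K in KNN) was proposed. A KD tree is a data structure that partitions data points in K-dimensional space. It is a binary tree in which every node is a K-dimensional point. Because it is a binary tree, the usual binary-tree operations (insert, delete, update, search) apply, and a neighbor query can skip large parts of the data instead of scanning every point, which greatly improves search efficiency.

We do not actually need to know much about the mathematics behind KD trees. It is enough to know that it is a binary-tree data structure suited to storing K-dimensional data. sklearn lets us use a KD tree directly, which is very convenient.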
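
As a rough illustration of using the KD tree directly, the sketch below builds a `sklearn.neighbors.KDTree` on the six movies from the example and asks for the three nearest neighbors of movie A. The `leaf_size=2` value is an arbitrary choice for such a tiny dataset.

```python
import numpy as np
from sklearn.neighbors import KDTree

film_train_data = np.array([[100, 5], [95, 3], [105, 31],
                            [2, 59], [3, 60], [10, 80]])
# Query points must be 2-D: (n_queries, n_features)
film_data_A = np.array([[5, 70]])

# Build the tree once, then query it as often as needed.
tree = KDTree(film_train_data, leaf_size=2)
dist, ind = tree.query(film_data_A, k=3)

print(ind[0])   # indices of the 3 nearest training movies, closest first
print(dist[0])  # the corresponding Euclidean distances
```

With only six points the tree brings no real speed-up; the benefit shows up when the training set is large and the number of dimensions is small.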

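b. KNN for regression

KNN can do regression as well as classification.

First, what is regression? In the movie example at the beginning, assigning a genre to an unknown movie is a classification problem: we look at the K movies closest to the unknown movie, and whichever class most of them belong to is the class of that movie.

If instead we have a new movie that we already know is a romance, and we want to estimate how many fights and kisses it is likely to contain, that is a regression problem.

So how does KNN do regression?

For a new point, we find its K nearest neighbors and assign the average of those neighbors' attribute values to the point, which gives the point's attributes. The neighbors can of course be given different weights. For example, take a movie A that is known to be an action movie. With K=3, its three nearest movies are Wolf Warrior, Operation Red Sea and Mission: Impossible 6, so the estimated numbers of fights and kisses are (100+95+105)/3 = 100 and (5+3+31)/3 = 13 respectively.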
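
The arithmetic in this example can be reproduced with a few lines of NumPy. This is a minimal sketch: the three neighbors are taken as given from the example, and the weights in the second half are made-up numbers that only illustrate the idea of unequal influence.

```python
import numpy as np

# [fights, kisses] of the three nearest neighbors from the example
neighbors = np.array([[100, 5],
                      [95, 3],
                      [105, 31]])

# Unweighted KNN regression: the prediction is the mean of the neighbors' values.
prediction = neighbors.mean(axis=0)
print(prediction)  # 100 fights and 13 kisses, matching the arithmetic above

# Weighted variant: hypothetical weights, chosen only for illustration.
weights = np.array([0.5, 0.3, 0.2])
weighted_prediction = np.average(neighbors, axis=0, weights=weights)
print(weighted_prediction)
```

In practice the weights are usually derived from the distances, so that closer neighbors count more.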

The KNN procedure

Continuing the small example from above, first compute the distance from movie A to every training movie

from math import sqrt

distance = []
# Use a fresh loop variable so the film_data list defined earlier is not overwritten
for film in film_train_data:
    d = sqrt(np.sum((film_data_A - film) ** 2))
    distance.append(d)

distance

[115.10864433221339,
 112.2007130102122,
 107.33592129385204,
 11.40175425099138,
 10.198039027185569,
 11.180339887498949]

Sort the distances and output the corresponding indices

nearest = np.argsort(distance)
nearest

array([4, 5, 3, 2, 1, 0], dtype=int64)

Choose k; here we use k=3

k = 3

Pick the labels of the k nearest neighbors

topK_labels = [film_train_labels[i] for i in nearest[:k]]
topK_labels

[1, 1, 1]

Count how often each label occurs among the nearest neighbors

from collections import Counter
votes = Counter(topK_labels)
votes.most_common(1)

[(1, 3)]

Output the predicted label for movie A

predict_labels = votes.most_common(1)[0][0]
predict_labels

Wrapping the KNN algorithm in a class

import numpy as np
from math import sqrt
from collections import Counter

def accuracy_score(y_true, y_predict):
    """Compute the accuracy of y_predict against y_true."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)

class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        """Fit the kNN classifier on the training set x_train and y_train."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        assert self.k <= x_train.shape[0], \
            "the size of x_train must be at least k."

        self._x_train = x_train
        self._y_train = y_train
        return self

    def predict(self, x_predict):
        """Given the data set x_predict, return the vector of predicted labels."""
        assert self._x_train is not None and self._y_train is not None, \
                "must fit before predict!"
        assert x_predict.shape[1] == self._x_train.shape[1], \
                "the feature number of x_predict must be equal to x_train"

        y_predict = [self._predict(x) for x in x_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return its predicted label."""
        assert x.shape[0] == self._x_train.shape[1], \
            "the feature number of x must be equal to x_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._x_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def score(self, x_test, y_test):
        """Compute the accuracy of the current model on the test set x_test and y_test."""

        y_predict = self.predict(x_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k

Use our own KNN implementation on the small example above

knn_clf = KNNClassifier(k=3)
knn_clf.fit(film_train_data,film_train_labels)
# Reshape it into a 2-D array
film_data_A = film_data_A.reshape(1,-1)
predict_labels = knn_clf.predict(film_data_A)
predict_labels[0]

II. Classifying the Iris Dataset with KNN

# Import packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

Load the Iris data and explore it

iris = datasets.load_iris()

Inspect the attributes of the Iris dataset

iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Read the description of the Iris dataset

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Look at the Iris data

iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.2],
       [5. , 3.2, 1.2, 0.2],
       [5.5, 3.5, 1.3, 0.2],
       [4.9, 3.6, 1.4, 0.1],
       [4.4, 3. , 1.3, 0.2],
       [5.1, 3.4, 1.5, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4],
       [4.8, 3. , 1.4, 0.3],
       [5.1, 3.8, 1.6, 0.2],
       [4.6, 3.2, 1.4, 0.2],
       [5.3, 3.7, 1.5, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [7. , 3.2, 4.7, 1.4],
       [6.4, 3.2, 4.5, 1.5],
       [6.9, 3.1, 4.9, 1.5],
       [5.5, 2.3, 4. , 1.3],
       [6.5, 2.8, 4.6, 1.5],
       [5.7, 2.8, 4.5, 1.3],
       [6.3, 3.3, 4.7, 1.6],
       [4.9, 2.4, 3.3, 1. ],
       [6.6, 2.9, 4.6, 1.3],
       [5.2, 2.7, 3.9, 1.4],
       [5. , 2. , 3.5, 1. ],
       [5.9, 3. , 4.2, 1.5],
       [6. , 2.2, 4. , 1. ],
       [6.1, 2.9, 4.7, 1.4],
       [5.6, 2.9, 3.6, 1.3],
       [6.7, 3.1, 4.4, 1.4],
       [5.6, 3. , 4.5, 1.5],
       [5.8, 2.7, 4.1, 1. ],
       [6.2, 2.2, 4.5, 1.5],
       [5.6, 2.5, 3.9, 1.1],
       [5.9, 3.2, 4.8, 1.8],
       [6.1, 2.8, 4. , 1.3],
       [6.3, 2.5, 4.9, 1.5],
       [6.1, 2.8, 4.7, 1.2],
       [6.4, 2.9, 4.3, 1.3],
       [6.6, 3. , 4.4, 1.4],
       [6.8, 2.8, 4.8, 1.4],
       [6.7, 3. , 5. , 1.7],
       [6. , 2.9, 4.5, 1.5],
       [5.7, 2.6, 3.5, 1. ],
       [5.5, 2.4, 3.8, 1.1],
       [5.5, 2.4, 3.7, 1. ],
       [5.8, 2.7, 3.9, 1.2],
       [6. , 2.7, 5.1, 1.6],
       [5.4, 3. , 4.5, 1.5],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [6.3, 2.3, 4.4, 1.3],
       [5.6, 3. , 4.1, 1.3],
       [5.5, 2.5, 4. , 1.3],
       [5.5, 2.6, 4.4, 1.2],
       [6.1, 3. , 4.6, 1.4],
       [5.8, 2.6, 4. , 1.2],
       [5. , 2.3, 3.3, 1. ],
       [5.6, 2.7, 4.2, 1.3],
       [5.7, 3. , 4.2, 1.2],
       [5.7, 2.9, 4.2, 1.3],
       [6.2, 2.9, 4.3, 1.3],
       [5.1, 2.5, 3. , 1.1],
       [5.7, 2.8, 4.1, 1.3],
       [6.3, 3.3, 6. , 2.5],
       [5.8, 2.7, 5.1, 1.9],
       [7.1, 3. , 5.9, 2.1],
       [6.3, 2.9, 5.6, 1.8],
       [6.5, 3. , 5.8, 2.2],
       [7.6, 3. , 6.6, 2.1],
       [4.9, 2.5, 4.5, 1.7],
       [7.3, 2.9, 6.3, 1.8],
       [6.7, 2.5, 5.8, 1.8],
       [7.2, 3.6, 6.1, 2.5],
       [6.5, 3.2, 5.1, 2. ],
       [6.4, 2.7, 5.3, 1.9],
       [6.8, 3. , 5.5, 2.1],
       [5.7, 2.5, 5. , 2. ],
       [5.8, 2.8, 5.1, 2.4],
       [6.4, 3.2, 5.3, 2.3],
       [6.5, 3. , 5.5, 1.8],
       [7.7, 3.8, 6.7, 2.2],
       [7.7, 2.6, 6.9, 2.3],
       [6. , 2.2, 5. , 1.5],
       [6.9, 3.2, 5.7, 2.3],
       [5.6, 2.8, 4.9, 2. ],
       [7.7, 2.8, 6.7, 2. ],
       [6.3, 2.7, 4.9, 1.8],
       [6.7, 3.3, 5.7, 2.1],
       [7.2, 3.2, 6. , 1.8],
       [6.2, 2.8, 4.8, 1.8],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.8, 5.6, 2.1],
       [7.2, 3. , 5.8, 1.6],
       [7.4, 2.8, 6.1, 1.9],
       [7.9, 3.8, 6.4, 2. ],
       [6.4, 2.8, 5.6, 2.2],
       [6.3, 2.8, 5.1, 1.5],
       [6.1, 2.6, 5.6, 1.4],
       [7.7, 3. , 6.1, 2.3],
       [6.3, 3.4, 5.6, 2.4],
       [6.4, 3.1, 5.5, 1.8],
       [6. , 3. , 4.8, 1.8],
       [6.9, 3.1, 5.4, 2.1],
       [6.7, 3.1, 5.6, 2.4],
       [6.9, 3.1, 5.1, 2.3],
       [5.8, 2.7, 5.1, 1.9],
       [6.8, 3.2, 5.9, 2.3],
       [6.7, 3.3, 5.7, 2.5],
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])

Check the shape of the Iris data

iris.data.shape

(150, 4)

Look at the Iris feature names

iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Look at the Iris label data

iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Look at the Iris label names

iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Plot the Iris data, first using the first two feature columns (sepal length and sepal width)

# Use the first two feature columns of the Iris data
data = iris.data[:, :2]

# Load the Iris labels
labels = iris.target

plt.scatter(data[labels==0,0],data[labels==0,1],color="red",marker="o")
plt.scatter(data[labels==1,0],data[labels==1,1],color="blue",marker="+")
plt.scatter(data[labels==2,0],data[labels==2,1],color="green",marker="x")

Plot the Iris data again, this time using the last two feature columns (petal length and petal width)

data = iris.data[:,2:]
labels = iris.target

plt.scatter(data[labels==0,0],data[labels==0,1],color="red",marker="o")
plt.scatter(data[labels==1,0],data[labels==1,1],color="blue",marker="+")
plt.scatter(data[labels==2,0],data[labels==2,1],color="green",marker="x")

Splitting the data

Split the data into a training set and a test set

data = iris.data
shuffle_indexs = np.random.permutation(len(data))
shuffle_indexs

array([ 22,  94, 131,  15,  99,  62,  68,  89, 113, 114, 146, 128, 139,
        38,  50,  95,  70,  91, 123,  49, 138,  57, 117, 136,  58, 132,
        25,  60, 142,  77,  98, 141, 144,  61, 119,  40,  75,  35,   7,
        97,  16, 124,  83, 120,   6, 127,  87,  41,   0, 102, 110,  66,
       107,  84,  29,  18, 101,  21,  72, 121,  33,  14, 115,  63, 147,
        20, 116, 111,  93, 108,  52,  69, 105,  82,  39, 118,  47,  86,
        85, 137,  31,  27,  28, 140, 106,  46, 130,  80,  73,  55,  92,
        19,  88,  10, 112,  24,  36,  78,  65,  79,  74, 143, 129,  71,
       126,   9,  59,  44,   5,  45,  37,   4,  30, 125,  56,  43,  11,
       133,  51, 122, 148,  13,  81, 103, 100, 135,   3,  34,  54,  67,
        26,  53,   1,  90,  48,  32,   8,  76,  12,   2, 145,  23,  42,
       104,  64,  17, 109, 134,  96, 149])

test_ratio = 0.2
test_size = int(len(data)*test_ratio)

test_indexs = shuffle_indexs[:test_size]
train_indexs = shuffle_indexs[test_size:]

x_test = data[test_indexs]
y_test = labels[test_indexs]
x_train = data[train_indexs]
y_train = labels[train_indexs]

print(x_test.shape)
print(y_test.shape)
print(x_train.shape)
print(y_train.shape)

(30, 4)
(30,)
(120, 4)
(120,)

A reusable split function

def train_test_split(X,y,test_ratio=0.2,seed=None):
    """Split X and y into X_train, X_test, y_train, y_test according to test_ratio."""
    assert X.shape[0] == y.shape[0],"the size of X must be equal to the size of y"
    assert 0.0 <=test_ratio<=1.0,"test_ratio must be valid"
    
    # Compare against None so that seed=0 is also honoured
    if seed is not None:
        np.random.seed(seed)
    
    shuffle_indexs = np.random.permutation(len(X))
    
    test_size = int(len(X)*test_ratio)
    test_indexs = shuffle_indexs[:test_size]
    train_indexs = shuffle_indexs[test_size:]
    
    X_train = X[train_indexs]
    y_train = y[train_indexs]
    
    X_test = X[test_indexs]
    y_test = y[test_indexs]
    
    return X_train,X_test,y_train,y_test

Split data into training data and test data

x_train,x_test,y_train,y_test = train_test_split(data,labels)

Create a KNN classifier

my_knn_clf = KNNClassifier(k=3)

Train the KNN classifier

my_knn_clf.fit(x_train,y_train)

KNN(k=3)

Predict on the test data

y_predict = my_knn_clf.predict(x_test)

y_predict

array([1, 2, 0, 1, 1, 1, 2, 2, 0, 0, 2, 0, 1, 0, 0, 2, 2, 0, 0, 1, 0, 0,
       1, 1, 2, 2, 0, 1, 0, 0])

Compare with the test labels

y_test

array([1, 2, 0, 1, 1, 1, 2, 2, 0, 0, 2, 0, 1, 0, 0, 2, 2, 0, 0, 1, 0, 0,
       1, 1, 2, 2, 0, 1, 0, 0])

Compute the prediction accuracy

sum(y_predict==y_test)/len(x_test)

1.0

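III. Recognizing Handwritten Digits with KNN

Here we call the KNN algorithm in sklearn directly and use the handwritten digits dataset that ships with sklearn.

Using KNN in sklearn

The sklearn package for Python includes a KNN implementation. KNN can be used as a classifier or as a regressor. For classification, import: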

from sklearn.neighbors import KNeighborsClassifier

For regression, import:

from sklearn.neighbors import KNeighborsRegressor

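Now let's see how to create a KNN classifier in sklearn:

Use the constructor KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30)

1. n_neighbors: the k value in KNN, i.e. the number of neighbors. If k is too small the model tends to overfit; if k is too large it underfits and cannot separate unknown objects properly. The default value 5 is a common starting point.

2. weights: determines the weight of each neighbor. There are three options:

  weights='uniform': all neighbors have the same weight
  weights='distance': the weight is the reciprocal of the distance, i.e. inversely proportional to the distance
  a custom function: you can define your own weights for different distances. In most cases this is not needed.

3. algorithm: specifies how the neighbors are computed. There are four options:

   algorithm='auto': picks a suitable algorithm automatically based on the data; this is the default
   algorithm='kd_tree': the KD tree, a data structure for multi-dimensional space that makes lookups efficient. It is suited to low-dimensional data, generally no more than 20 dimensions; beyond that, efficiency actually drops;
   algorithm='ball_tree': the ball tree, which like the KD tree is a data structure for multi-dimensional space but, unlike the KD tree, is better suited to high-dimensional data;
   algorithm='brute': brute-force search. Unlike the KD tree it uses a linear scan instead of building a tree for fast lookup, so it is slow when the training set is large.

4. leaf_size: the leaf size used when building the KD tree or ball tree, roughly the number of points below which the tree stops splitting and switches to brute-force search. The default is 30. Changing leaf_size affects how fast the tree is built and queried.

In short, once the KNN classifier is created, we train it with fit(), passing the feature matrix and class labels of the training set, which gives a trained KNN classifier. We then call predict() with the feature matrix of the test set to obtain the predicted classes for the test set.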
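
The constructor parameters and the fit()/predict() calls described above can be tried on the six movies from Part I. This is only a sketch: the parameter values (`weights="distance"`, `algorithm="kd_tree"`) are picked to show the options and are not tuned.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

film_train_data = np.array([[100, 5], [95, 3], [105, 31],
                            [2, 59], [3, 60], [10, 80]])
film_train_labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = action, 1 = romance

# Closer neighbors get larger weights; neighbors are found with a KD tree.
knn_clf = KNeighborsClassifier(n_neighbors=3, weights="distance",
                               algorithm="kd_tree", leaf_size=30)
knn_clf.fit(film_train_data, film_train_labels)

film_data_A = np.array([[5, 70]])  # predict() expects a 2-D array
predicted = knn_clf.predict(film_data_A)
probabilities = knn_clf.predict_proba(film_data_A)
print(predicted[0])      # predicted class of movie A
print(probabilities[0])  # weighted vote share per class
```

All three nearest neighbors of movie A are romance movies, so the prediction agrees with the hand-written classifier from Part I.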

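Overall workflow for recognizing handwritten digits with KNN

The overall training process generally has three stages:

   1. Data loading:
   We can load the built-in handwritten digits dataset directly from sklearn;
   
   2. Preparation: in this stage we get a first impression of the dataset, such as the number of samples, what the images look like and what the recognition targets are. We can view the images through visualization. Normalizing the data puts all features on the same scale. In addition, because the training data consists of images, each an 8*8 matrix, we do not need feature selection; we simply use all of the pixel data as the feature matrix.
   
   3. Classification: train a classifier, then compute its accuracy on the test set.

1. Data loading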

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

2. Preparation

# Inspect the attributes of the digits dataset
digits.keys()

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

# Get the digits data
data = digits.data
# Check its shape
print(data.shape)
# Look at the first three rows
print(data[:3])

(1797, 64)
[[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
  15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
   0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
   0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
 [ 0.  0.  0. 12. 13.  5.  0.  0.  0.  0.  0. 11. 16.  9.  0.  0.  0.  0.
   3. 15. 16.  6.  0.  0.  0.  7. 15. 16. 16.  2.  0.  0.  0.  0.  1. 16.
  16.  3.  0.  0.  0.  0.  1. 16. 16.  6.  0.  0.  0.  0.  1. 16. 16.  6.
   0.  0.  0.  0.  0. 11. 16. 10.  0.  0.]
 [ 0.  0.  0.  4. 15. 12.  0.  0.  0.  0.  3. 16. 15. 14.  0.  0.  0.  0.
   8. 13.  8. 16.  0.  0.  0.  0.  1.  6. 15. 11.  0.  0.  0.  1.  8. 13.
  15.  1.  0.  0.  0.  9. 16. 16.  5.  0.  0.  0.  0.  3. 13. 16. 16. 11.
   5.  0.  0.  0.  0.  3. 11. 16.  9.  0.]]

# Get the label data
labels = digits.target
# Check its shape
print(labels.shape)
# Look at the first thirty labels
print(labels[:30])

(1797,)
[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]

# Display an image
plt.imshow(digits.images[0])
print(digits.target[0])

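Visualizing the first image in the original dataset, we can see that it is an 8*8 pixel matrix. The image above shows a "0", and the class label in the dataset is also "0".

The handwritten digits dataset bundled with sklearn contains 1797 samples, each an 8*8 pixel matrix. Since there is no separate test set, we need to split the dataset into a training set and a test set. Because KNN depends on the distance definition, we also normalize the data, using Z-Score normalization.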

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split the data: 20% as the test set, the rest as the training set
train_x,test_x,train_y,test_y = train_test_split(data,labels,test_size=0.2,random_state=111)

# Apply Z-Score normalization
standardScaler = StandardScaler()
train_ss_x = standardScaler.fit_transform(train_x)
test_ss_x = standardScaler.transform(test_x)
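
To see what Z-Score normalization computes, the sketch below compares `StandardScaler` with the formula (x - mean) / std applied by hand. The tiny arrays are made-up values for illustration; the point is that the test data is scaled with the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
test = np.array([[2.0, 500.0]])

scaler = StandardScaler().fit(train)

mean = train.mean(axis=0)
std = train.std(axis=0)            # population std (ddof=0), as StandardScaler uses
manual_train = (train - mean) / std
manual_test = (test - mean) / std  # the test set reuses the training statistics

print(np.allclose(manual_train, scaler.transform(train)))
print(np.allclose(manual_test, scaler.transform(test)))
```

Without this step, the feature with the larger numeric range would dominate the distance computation.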

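Next we construct a KNN classifier, fit it on the training set, predict on the test set and compare with the true test labels to obtain the accuracy of the KNN classifier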

knn_clf = KNeighborsClassifier()
knn_clf.fit(train_ss_x,train_y)
knn_clf.score(test_ss_x,test_y)

0.9694444444444444

Grid Search for the best hyperparameters

from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
data = digits.data
labels = digits.target

x_train,x_test,y_train,y_test = train_test_split(data,labels,test_size=0.2,random_state=666)

param_grid=[
    {
        "weights":["uniform"],
        "n_neighbors":[i for i in range(1,11)]
    },
    {
        "weights":["distance"],
        "n_neighbors":[i for i in range(1,11)],
        "p":[i for i in range(1,6)]
    }
] 

knn_clf = KNeighborsClassifier()

%%time
grid_search = GridSearchCV(knn_clf,param_grid,n_jobs=-1,verbose=2)
grid_search.fit(x_train,y_train)

D:\software\Anaconda\workplace\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   39.2s


Wall time: 50.6 s


[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   50.5s finished

grid_search.best_params_

{'n_neighbors': 3, 'p': 3, 'weights': 'distance'}

grid_search.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=3,
           weights='distance')

grid_search.best_score_

0.9853862212943633
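
Note that best_score_ is the mean cross-validated accuracy on the training data, not the accuracy on the held-out test set. A minimal sketch of checking the test set as well is shown below; it uses a much smaller grid than the one above (an assumption made only to keep the run short) and an explicit cv=3.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

# A deliberately small grid for illustration.
small_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}
grid_search = GridSearchCV(KNeighborsClassifier(), small_grid, cv=3)
grid_search.fit(x_train, y_train)

# Cross-validated score of the best candidate on the training data
cv_score = grid_search.best_score_
# Accuracy of the refitted best model on data it has never seen
test_score = grid_search.best_estimator_.score(x_test, y_test)
print(grid_search.best_params_, cv_score, test_score)
```

Passing cv explicitly also avoids relying on the default number of folds, which the FutureWarning in the log above says will change.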
