A Brief Introduction to KNN (K Nearest Neighbor)

Nearest Neighbor

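Before introducing KNN, let's first look at Nearest Neighbor (NN). NN is mainly used for classification. Take the CIFAR-10 dataset as an example: it has 50,000 training images, each with a corresponding label. To label the test set, we take one test image, find the most similar image in the training set, and give the test image the same label as that most similar training image. This approach has its limitations, though.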
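As a rough illustration of the procedure just described, here is a minimal NumPy sketch of a 1-nearest-neighbor classifier. The function name `nearest_neighbor_predict` and the toy data are made up for this example, and L1 distance is an assumed choice of similarity measure.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return the label of the training row closest to x (L1 distance, assumed)."""
    # Sum of absolute differences between x and every training row.
    distances = np.sum(np.abs(X_train - x), axis=1)
    # The test point simply copies the label of its single closest neighbor.
    return y_train[np.argmin(distances)]

# Toy data (hypothetical): four 2-D "images" with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])

pred_a = nearest_neighbor_predict(X_train, y_train, np.array([0.5, 0.2]))
pred_b = nearest_neighbor_predict(X_train, y_train, np.array([5.5, 4.0]))
print(pred_a, pred_b)
```

Because the prediction depends on a single training point, one mislabeled or atypical neighbor is enough to flip the result.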

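Take this figure as an example: we want to classify the green circle. At first glance the green circle looks like it should belong with the red triangles, but in practice that is often not the case, which is why we have KNN.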

K Nearest Neighbor

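If the majority of the K nearest samples to a given sample in feature space belong to a certain class, then the sample is assigned to that class as well and is assumed to share the characteristics of the samples in that class. The classification decision therefore depends only on the classes of the one or few nearest samples. Because KNN relies on a limited number of nearby samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions cross or overlap heavily.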

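KNN generally has a somewhat lower error than NN, because K is a hyperparameter we can tune: by searching on the training set for a good value of K, the error of KNN on the test set is usually lower than that of NN.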

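Using the same figure: with NN, the green circle is assigned to the red triangles, and the same holds for K=3. With K=5, however, the green circle is assigned to the blue class. So which class the green circle belongs to depends on the specific K. We should therefore choose K according to the data being classified when building the KNN classifier. And if NN does happen to work best, that is simply the special case K=1 of KNN.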

In other words, the K Nearest Neighbor algorithm includes the Nearest Neighbor algorithm as a special case.
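The dependence on K can be reproduced with a small sketch. The coordinates below are invented to mimic the situation in the figure (they are not read off the figure), and `knn_predict` is a hypothetical helper, not the classifier used later in this post.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k training points closest to x (L2 distance)."""
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]       # indices of the k closest points
    votes = np.bincount(y_train[nearest])     # votes per class label
    return int(np.argmax(votes))              # ties go to the smaller label

RED, BLUE = 0, 1
X_train = np.array([[1.0, 0.0], [0.0, 1.2],                 # two red points
                    [1.5, 0.0], [-1.8, 0.0], [0.0, -2.0]])  # three blue points
y_train = np.array([RED, RED, BLUE, BLUE, BLUE])
query = np.array([0.0, 0.0])  # plays the role of the green circle

predictions = {k: knn_predict(X_train, y_train, query, k) for k in (1, 3, 5)}
print(predictions)  # {1: 0, 3: 0, 5: 1}
```

With K=1 and K=3 the query is labeled red (0); with K=5 the three blue points outvote the two red ones.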

How to measure similarity

L1（Manhattan）distance

L2 distance
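For two images flattened into vectors $a$ and $b$, the standard definitions are $d_1(a, b) = \sum_p |a_p - b_p|$ for the L1 distance and $d_2(a, b) = \sqrt{\sum_p (a_p - b_p)^2}$ for the L2 distance. A minimal sketch, with function names chosen only for this example:

```python
import numpy as np

def l1_distance(a, b):
    """Manhattan distance: sum of absolute element-wise differences."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(l1_distance(a, b))  # 7.0
print(l2_distance(a, b))  # 5.0
```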

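NumPy tips

np.flatnonzero(): returns the indices of the nonzero elements of the flattened array

np.random.choice(): randomly samples elements from an array

plt.subplot(): draws a subplot (from matplotlib, not NumPy)

np.linalg.norm(): computes a vector or matrix norm

np.array_split(): splits an array into sub-arrays

np.vstack(): stacks arrays vertically (row-wise) into a new array

np.hstack(): stacks arrays horizontally (column-wise) into a new array

np.argsort(): returns the indices that would sort an array

np.bincount(): counts the occurrences of each non-negative integer value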
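A quick way to get a feel for these functions is to run them on tiny arrays. The values below are arbitrary toy data, not taken from CIFAR-10.

```python
import numpy as np

labels = np.array([2, 0, 2, 1, 2, 0])

idxs = np.flatnonzero(labels == 2)            # positions where the mask is True
print(idxs)                                   # [0 2 4]

# Two distinct elements of idxs, drawn at random (no expected output shown).
picked = np.random.choice(idxs, 2, replace=False)

norm = np.linalg.norm(np.array([3.0, 4.0]))   # L2 norm by default
print(norm)                                   # 5.0

parts = np.array_split(np.arange(10), 3)      # uneven splits are allowed
print([len(p) for p in parts])                # [4, 3, 3]

a = np.array([[1, 2], [3, 4]])
print(np.vstack([a, a]).shape)                # (4, 2)
print(np.hstack([a, a]).shape)                # (2, 4)

order = np.argsort(np.array([0.3, 0.1, 0.2]))
print(order)                                  # [1 2 0]

counts = np.bincount(labels)
print(counts)                                 # [2 1 3]
```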

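KNN code implementation

GitHub: https://github.com/GIGpanda/CS231n

There are two .py files in total: knn.py and k_nearest_neighbor.py

knn.py

Loading the data

Load the CIFAR-10 data and print the shapes of the training and test images and labels.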

# KNN

from __future__ import print_function
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
from cs231n.classifiers import KNearestNeighbor
import time

plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Visualizing some of the data

For each of the 10 classes in the training set, randomly pick 7 images and display them.

# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)    # indices of the training images whose label is y
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    # np.random.choice randomly samples elements from an array:
    # draw samples_per_class indices from idxs; replace=False means no index is drawn twice
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        # plt_idx is the grid position of this image, so images of one class share a column
        plt.subplot(samples_per_class, num_classes, plt_idx)
        # plt.subplot arguments describe where the subplot goes:
        # samples_per_class: rows, num_classes: columns, plt_idx: index
        # the index starts at 1 and runs from the top-left to the bottom-right
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        # off: turn off axis line and labels
        if i == 0:
            plt.title(cls) # put the class name above the first image of each column
plt.show()

Subsample

Take a subset of the training set and of the test set to use with KNN.

# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))
# range(start, stop, step) produces a sequence of integers, starting from 0 by default
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

Train a KNN classifier

Compute the classification accuracy for K=1 and for K=5.

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a noop:
# the Classifier simply remembers the data and does no further processing
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.

# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)
# (predict_labels uses np.argsort to rank the training points by distance)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f'%(num_correct, num_test, accuracy))

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred  == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f'%(num_correct, num_test, accuracy))

Use one loop and check

Compute the L2 distances with a single loop and verify the result.

# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists-dists_one, ord='fro')
print('Difference was: %f'%(difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
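The check above relies on the Frobenius norm of the difference of two matrices. As a small sanity sketch on toy matrices (not from the assignment), it equals the Euclidean norm of the flattened difference:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 4.0]])

# Frobenius norm of the difference, as computed by NumPy.
fro = np.linalg.norm(A - B, ord='fro')
# The same quantity written out: square root of the sum of squared differences.
manual = np.sqrt(np.sum((A - B) ** 2))
print(fro, manual)
```

Here the difference is [[0, 2], [3, 0]], so both values equal the square root of 13.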

Use no loop and check

Compute the L2 distances without any loop and verify the result.

# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

Comparing the running time of no_loop, one_loop, and two_loop

# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc-tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# you should see significantly faster performance with the fully vectorized implementation
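A single `time.time()` measurement can be noisy. One possible variation, sketched here with a hypothetical helper name `time_function_best_of`, uses `time.perf_counter()` (a clock intended for measuring short durations) and keeps the fastest of several runs. The `np.dot` call is only a stand-in workload.

```python
import time
import numpy as np

def time_function_best_of(f, *args, repeats=3):
    """Return the fastest of `repeats` wall-clock timings of f(*args), in seconds."""
    timings = []
    for _ in range(repeats):
        tic = time.perf_counter()
        f(*args)
        timings.append(time.perf_counter() - tic)
    return min(timings)

# Stand-in workload: multiply two 200x200 matrices.
elapsed = time_function_best_of(np.dot, np.ones((200, 200)), np.ones((200, 200)))
print('np.dot took %f seconds' % elapsed)
```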

Cross-validation

Split the training set into 5 folds: 4 folds serve as training data and 1 fold as validation data.

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
# Your code
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}
for k in k_choices:
    k_to_accuracies[k] = []

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
# Your code
for k in k_choices:
    for fold in range(num_folds):
        # Train on every fold except `fold`; validate on `fold`.
        # Separate names keep the full X_train / X_test intact for the final evaluation.
        X_tr = np.vstack(X_train_folds[:fold] + X_train_folds[fold + 1:])
        y_tr = np.hstack(y_train_folds[:fold] + y_train_folds[fold + 1:])
        X_val = X_train_folds[fold]
        y_val = y_train_folds[fold]
        classifier.train(X_tr, y_tr)
        dists = classifier.compute_distances_no_loops(X_val)
        y_val_pred = classifier.predict_labels(dists, k=k)
        num_correct = np.sum(y_val_pred == y_val)
        accuracy = float(num_correct) / len(y_val)
        k_to_accuracies[k].append(accuracy)

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
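The way the folds are split and recombined in the loop above can be checked on a toy array. This is only a sketch: the array contents and the names `X_folds`, `y_folds`, and `held_out` are invented for the example.

```python
import numpy as np

X = np.arange(10).reshape(5, 2)   # 5 toy samples with 2 features each
y = np.arange(5)                  # toy labels 0..4
num_folds = 5

# np.array_split returns a Python list of arrays, so folds can be joined with +.
X_folds = np.array_split(X, num_folds)
y_folds = np.array_split(y, num_folds)

held_out = 2  # index of the validation fold
X_tr = np.vstack(X_folds[:held_out] + X_folds[held_out + 1:])
y_tr = np.hstack(y_folds[:held_out] + y_folds[held_out + 1:])
X_val, y_val = X_folds[held_out], y_folds[held_out]

print(y_tr)   # [0 1 3 4]
print(y_val)  # [2]
```

Every sample lands in the validation fold exactly once as `held_out` runs over all fold indices.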

Visualization

Plot the cross-validation accuracy for each K.

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

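Choosing the best k and training the classifier

Using the plot of the cross-validation results for each K, choose the best K and retrain the classifier.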
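Instead of reading the best K off the plot, it can also be picked programmatically as the K with the highest mean accuracy. The sketch below assumes a dictionary shaped like `k_to_accuracies`; the numbers in it are hypothetical placeholders, not results from this experiment.

```python
import numpy as np

# Hypothetical accuracies, for illustration only; real values come from the
# cross-validation loop.
k_to_accuracies = {
    1: [0.26, 0.25, 0.27],
    5: [0.27, 0.28, 0.29],
    10: [0.28, 0.29, 0.30],
}

# Average the per-fold accuracies for each k, then take the k with the best mean.
mean_accuracy = {k: np.mean(v) for k, v in k_to_accuracies.items()}
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k)  # 10
```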

# Based on the cross-validation results above, choose the best value for k,
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 1

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

k_nearest_neighbor.py

Initialization

import numpy as np

class KNearestNeighbor(object):
  """ a kNN classifier with L2 distance """

  def __init__(self):
    pass

  def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this is just 
    memorizing the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data
      consisting of num_train samples each of dimension D.
    - y: A numpy array of shape (N,) containing the training labels, where
         y[i] is the label for X[i].
    """
    self.X_train = X
    self.y_train = y
    
  def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data consisting
         of num_test samples each of dimension D.
    - k: The number of nearest neighbors that vote for the predicted labels.
    - num_loops: Determines which implementation to use to compute distances
      between training points and testing points.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    if num_loops == 0:
      dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
      dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
      dists = self.compute_distances_two_loops(X)
    else:
      raise ValueError('Invalid value %d for num_loops' % num_loops)

    return self.predict_labels(dists, k=k)

Computing distances with two loops

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the 
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists

Computing distances with one loop

  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      dists[i, :] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

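Computing distances without loops

We have two matrices $M_a$ and $M_b$, and we want the L2 distance between every row of $M_a$ and every row of $M_b$.

For scalars, $(a-b)^2 = a^2 - 2ab + b^2$. Similarly, $(M_a-M_b)^2 = {M_a}^2 - 2 M_aM_b + {M_b}^2$, where ${M_a}^2$ and ${M_b}^2$ stand for the row-wise sums of squares and $M_aM_b$ for the matrix product $M_a M_b^T$.

By reshaping the ${M_a}^2$ and ${M_b}^2$ terms so that they line up with the shape of the product, we can carry out the whole computation without loops.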
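The identity can be checked numerically on small random matrices. This is a sketch, not the assignment code: the names `X`, `T`, `x2`, `t2`, and `cross` are chosen for the example, and the result is compared against a brute-force computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))        # 4 "test" rows
T = rng.standard_normal((6, 3))        # 6 "training" rows

x2 = np.sum(X ** 2, axis=1, keepdims=True)   # shape (4, 1): squared norm of each test row
t2 = np.sum(T ** 2, axis=1)                  # shape (6,): squared norm of each training row
cross = X @ T.T                              # shape (4, 6): all pairwise dot products
# Broadcasting expands x2 across columns and t2 across rows.
# Clamp at 0 to guard against tiny negatives from rounding before the sqrt.
dists = np.sqrt(np.maximum(x2 - 2 * cross + t2, 0.0))

# Brute-force reference: one norm per (test, train) pair.
brute = np.array([[np.linalg.norm(x - t) for t in T] for x in X])
print(np.allclose(dists, brute))  # True
```

The column vector `x2` and the row vector `t2` are the two "broadcast sums" that play the role of reshaping the squared terms.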

  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    # We want (a-b)^2 = a^2 - 2*a*b + b^2, applied row by row
    # a: X    b: self.X_train
    a2 = np.sum(np.square(X), axis=1, keepdims=True)    # (num_test, 1)
    b2 = np.sum(np.square(self.X_train), axis=1)        # (num_train,)
    ab = np.dot(X, self.X_train.T)                      # (num_test, num_train)
    # Broadcasting adds a2 down the rows and b2 across the columns;
    # clamp at 0 so rounding error cannot put a negative under the sqrt.
    dists = np.sqrt(np.maximum(a2 - 2 * ab + b2, 0))
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

Predicting labels

A sample's label is the label that occurs most often among its K nearest samples.
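The vote can be computed with `np.bincount` followed by `np.argmax`. The sketch below uses made-up neighbor labels to show that a tie is resolved in favor of the smaller label, because `np.argmax` returns the first maximum.

```python
import numpy as np

closest_y = np.array([3, 1, 3, 1, 7])   # labels of the k=5 nearest neighbors (toy values)
counts = np.bincount(closest_y)          # counts[label] = number of votes for that label
winner = np.argmax(counts)
print(counts)                            # [0 2 0 2 0 0 0 1]
print(winner)                            # 1 (labels 1 and 3 tie; the smaller wins)
```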

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      tmpdists = dists[i,: ]
      idxs = np.argsort(tmpdists)
      closest_y = self.y_train[idxs[:k]]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      count = np.bincount(closest_y)
      y_pred[i] = np.argmax(count)
      #########################################################################
      #                           END OF YOUR CODE                            # 
      #########################################################################

    return y_pred
