Model error: overfitting, cross-validation, and the bias-variance trade-off

Introduction

In this post you will get to grips with what is perhaps the most essential concept in machine learning: the bias-variance trade-off. The main idea is that you want to create models that are as good at prediction as possible while still being applicable to new data (i.e. they are generalizable). The danger is that you can easily create models that overfit to the local noise in your specific dataset, which isn’t helpful and leads to poor generalizability, since the noise is random and therefore different in every dataset. Essentially, you want to create models that capture only the useful components of a dataset. On the other hand, models that generalize very well but are too inflexible to generate good predictions are the other extreme you want to avoid (this is called underfitting).

We discuss and demonstrate these concepts using the k-nearest neighbors algorithm, which has a simple parameter k that can be varied to cleanly demonstrate the ideas of underfitting, overfitting and generalization (i.e. how well a model adapts to previously unseen samples). Together, this bundle of concepts related to the balance between underfitting and overfitting is referred to as the bias-variance trade-off. Here is a table summarizing some different but related characteristics of models that either underfit or overfit, which you can refer to throughout this post:

[Table: characteristics of underfitting vs. overfitting models – underfitting: high bias, low variance, low flexibility/complexity; overfitting: low bias, high variance, high flexibility/complexity]

We will explain what all of these terms mean and how they are interrelated. We will also discuss cross-validation, which is a good way of estimating the accuracy and generalizability of your models.

You will encounter all of these concepts in future blog posts, which will cover model optimization, random forests, Naive Bayes, logistic regression and how to combine different models into an ensembled meta-model.

Generating the dataset

Let’s start off by building an artificial dataset to play with. You can do this easily with the make_moons() function from sklearn.datasets (check out make_classification() too, which generates linearly separable data). Specifically, you will generate a relatively simple binary classification problem. To make it a bit more interesting, the data will be crescent-shaped with some random noise added. This should make it more realistic and increase the difficulty of classifying observations.

# Creating the dataset
# e.g. make_moons generates crescent-shaped data
# Check out make_classification, which generates linearly-separable data
from sklearn.datasets import make_moons
 
X, y = make_moons(
    n_samples=500, # the number of observations
    random_state=1,
    noise=0.3
)
 
# Take a peek
print(X[:10,])
print(y[:10])


The dataset you just generated looks a bit like this:
[[ 0.50316464  0.11135559]
 [ 1.06597837 -0.63035547]
 [ 0.95663377  0.58199637]
 [ 0.33961202  0.40713937]
 [ 2.17952333 -0.08488181]
 [ 2.00520942  0.7817976 ]
 [ 0.12531776 -0.14925731]
 [ 1.06990641  0.36447753]
 [-0.76391099 -0.6136396 ]
 [ 0.55678871  0.8810501 ]]
[1 1 0 0 1 1 1 0 0 0]
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
 
# for the plots to appear inline in jupyter notebooks
%matplotlib inline
 
# Plot the first feature against the other, color by class
plt.scatter(X[y == 1, 0], X[y==1, 1], color="#EE3D34", marker="x")
plt.scatter(X[y == 0, 0], X[y==0, 1], color="#4458A7", marker="o")


[Figure: scatter plot of the two crescent-shaped classes, red crosses vs. blue circles]

Next up, let’s split the dataset into a training set and a test set. The training set will be used to develop and tune our models. The test set will be completely left alone until the very end, at which point you’ll run your finished models on it. Having a test set will allow you to get a good estimate of how well your models would perform out in the wild on previously unseen data.

from sklearn.model_selection import train_test_split

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1, test_size=0.5)


You are going to try to predict the classes in our dataset with a k Nearest Neighbor (kNN) classifier. Chapter 2 of the Introduction to Statistical Learning book provides a great intro to the theory behind kNN. We are huge fans of the ISLR book, so definitely check it out if you have the time. You could also have a look at this previous post that teaches you how to implement the algorithm from scratch in Python.

Introducing the k hyperparameter in kNN

The kNN algorithm works by using information about the k-nearest neighbors of a new data point in order to assign it a class label. It simply looks at the class of other data points most similar to it (its ‘nearest neighbors’) and assigns the new data point to the most common class of these neighbors. When using kNN, you have to set the value of k that you want the algorithm to use ahead of time, and it is not trivial to know which value to use.
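To make the voting idea concrete, here is a minimal from-scratch sketch of a single kNN prediction (just an illustration of the idea, with a hypothetical helper name knn_predict_one and an arbitrary k=5; the scikit-learn classifier used below is what you would actually rely on):

import numpy as np
from collections import Counter

def knn_predict_one(x_new, X_train, y_train, k):
    # distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the neighbors' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# e.g. classify the first test point based on its 5 nearest training neighbors
print(knn_predict_one(XTest[0], XTrain, yTrain, k=5))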

If the value for k is high (e.g. k=99), then the model considers a large number of neighbors when making a decision about the class of an unknown datapoint. This means that the model is quite constrained, since it has to take a large amount of information into account when classifying instances. In other words, a high value of k gives rise to relatively “rigid” model behaviour.

By contrast, if the value for k is low (e.g. k=1 or k=2), then only a few neighbors are taken into account when making a classification decision. It is a very flexible model with a lot of complexity – it really fits very closely to the precise shape of the dataset. Hence, the predictions of the model are much more dependent on the local tendencies of the data (crucially, this includes the noise!).

Take a look at how the kNN algorithm separates the training cases when k=99 compared to when k=1. The green line is the decision boundary on the training data (i.e. the threshold at which the algorithm decides whether a data point belongs to the blue or red class).

[Figure: kNN decision boundaries (green) on the training data for k=99 (left) and k=1 (right)]

At the bottom of the post you learn how to generate these plots yourself, but let’s delve into some theory first.

When k=99 (on the left), it looks like the model fit might be a bit too smooth and could stand to fit the data a bit closer. The model has low flexibility and low complexity. It paints the decision boundary with a broad brush. It has relatively high bias because one can tell it is not modelling the data as well as it could – it models the underlying generative process of the data as something too simple, and this is highly biased away from the ground truth. But, the decision boundary would probably look very similar if you redrew it on a slightly different dataset. It is a stable model that won’t vary a lot – it has low variance.

When k=1 (on the right), you can see that the model is massively overfitting to the noise. It is technically generating perfectly correct predictions on the training set (the error in the bottom right hand corner is equal to 0.00!), but hopefully you can see how this fit is way too sensitive to individual data points. Keep in mind that you added noise to the dataset – it looks like this model fit is taking the noise too seriously and is fitting very closely to it. You can say that the k=1 model has high flexibility and high complexity because it tunes very tightly to the data. It also has low bias – if nothing else, the decision boundary certainly fits the trends you observe in the data. But, the fitted boundary would drastically change on even slightly different data – it would vary significantly, i.e. the k=1 model has high variance.
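One rough way to see this variance claim empirically is to refit both models on bootstrap resamples of the training set and check how often their test set predictions change between refits. This is only a sketch (the helper prediction_disagreement is hypothetical, and bootstrap resampling is used here as a stand-in for “slightly different data”):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(1)

def prediction_disagreement(k, n_repeats=20):
    # refit a kNN model on bootstrap resamples of the training set
    preds = []
    for _ in range(n_repeats):
        idx = rng.choice(len(XTrain), size=len(XTrain), replace=True)
        model = KNeighborsClassifier(n_neighbors=k).fit(XTrain[idx], yTrain[idx])
        preds.append(model.predict(XTest))
    preds = np.array(preds)
    # fraction of test points whose predicted class is not unanimous across refits
    return np.mean(preds.min(axis=0) != preds.max(axis=0))

print("Prediction instability for k=1: ", prediction_disagreement(1))
print("Prediction instability for k=99:", prediction_disagreement(99))

You would expect the k=1 model's predictions to flip around much more between refits than the k=99 model's – which is exactly what high variance means.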

But how well do these models generalize, i.e. how well would they perform on new data?

You have so far only looked at the training data, but quantifying training error isn’t that useful. You are not interested in how well models can recapitulate what they just learned on the training set. Let’s take a look at how they perform on test data, since that gives a better impression of whether your models are actually good or not. Try it yourself using a few different values of k:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn99 = KNeighborsClassifier(n_neighbors=99)
knn99.fit(XTrain, yTrain)
yPredK99 = knn99.predict(XTest)
print("Overall Error of k=99 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK99), 2))

knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(XTrain, yTrain)
yPredK1 = knn1.predict(XTest)
print("Overall Error of k=1 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK1), 2))


Overall Error of k=99 Model: 0.15
Overall Error of k=1 Model: 0.15

Actually, it looks like these models perform approximately equally well on the test data. Here are the decision boundaries you learned on the training set, applied to the test set. See if you can figure out where the two models are making their mistakes.

[Figure: the k=99 and k=1 decision boundaries learned on the training set, applied to the test set]

The two models are making mistakes for very different reasons. It seems that the k=99 model isn’t doing a good job at capturing the crescent shape of the data (it is underfitting), while the k=1 model is making mistakes by being horribly overfitted to the noise. Remember, the hallmark of overfitting is good training performance and bad testing performance, which is what you observe here.
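You can check that hallmark directly with the two models you already fit, by comparing their training error to their test error (a quick sanity check rather than anything new):

# training error vs. test error for the two models fit above
for name, model in [("k=99", knn99), ("k=1", knn1)]:
    train_error = 1 - metrics.accuracy_score(yTrain, model.predict(XTrain))
    test_error = 1 - metrics.accuracy_score(yTest, model.predict(XTest))
    print("{}: training error = {:.2f}, test error = {:.2f}".format(name, train_error, test_error))

The k=1 model should come out (essentially) error-free on the training set while doing no better than the k=99 model on the test set.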

Maybe intermediate values of k are where you want to be? Try a value of k between 99 and 1, e.g.:

knn50 = KNeighborsClassifier(n_neighbors=50)
knn50.fit(XTrain, yTrain)
yPredK50 = knn50.predict(XTest)
print("Overall Error of k=50 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK50), 2))

Overall Error of k=50 Model: 0.11

Looking better! Let’s check out the decision boundary for the k=50 model.

[Figure: decision boundaries for k=99, k=50 and k=1 on the training and test sets]

Much better – the model fit is similar to the actual trend in the dataset and this improvement is reflected in a lower test set error.

The bias-variance trade-off: concluding comments

Hopefully you now have a good intuition for what it means for models to underfit and overfit. See if all of the terms in the table at the beginning of this post now make sense. Basically, finding the right balance between overfitting and underfitting corresponds to the bias-variance trade-off.

To recap, when you train machine learning algorithms on a data set, what you are really interested in is how your model will perform on an independent data set. It is not enough to do a good job classifying instances on the training set. Essentially, you are only interested in building models that are generalizable – getting 100% accuracy on the training set is not impressive, and is simply an indicator of overfitting. Overfitting is the situation in which you have fitted your model too closely to the data, and have tuned to the noise instead of just to the signal.

To be clear: strictly speaking, you are not trying to model the trends in the dataset. You are trying to model the real-world process that generated the data you observe. The specific dataset you happen to be working with is just a small set of instances (i.e. a sample) of the ground truth, which brings with it its own noise and peculiarities.

Here is a summary figure showing how under-fitting (high bias, low variance), properly fitting, and over-fitting (low bias, high variance) models fare on the training compared to the test sets:

[Figure: training vs. test performance for under-fitting, properly fitting, and over-fitting models]

This idea of building generalizable models is the motivation behind splitting your dataset into a training set (on which models can be trained) and a test set (which is held out until the very end of your analysis, and provides an accurate measure of model performance).

But – BIG warning! It’s also possible to overfit to the test set. If you were to try out lots of different models and keep changing them in order to chase accuracy points on the test set, then the information from the test set can inadvertently leak into your model creation phase. You need a way around this.

Estimating model performance using k-fold cross validation

Enter k-fold cross-validation, which is a handy technique for measuring a model’s performance using only the training set. Say that you want to do 10-fold cross-validation. The process is as follows: you randomly partition the training set into 10 equal sections. Then you train an algorithm on 9/10ths (i.e. 9 out of the 10 sections) of that training set, and evaluate its performance on the remaining 1 section. This gives you some measure of the model’s performance (e.g. overall accuracy). You then train the algorithm on a different 9/10ths of the training set, and evaluate on the other (different from before) remaining section. You continue the process 10 times, get 10 different measures of model performance, and average these values to get an overall measure of performance. Of course, you could have chosen some number of folds other than 10. To keep on with the example, the process behind 10-fold CV looks like this:

[Figure: schematic of 10-fold cross-validation on the training set]
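Under the hood, this procedure is nothing more exotic than a loop over folds. Here is a hand-rolled sketch using scikit-learn's KFold splitter, assuming the k=50 model from earlier as the model being evaluated (the cross_val_score function used later wraps exactly this kind of loop):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

kf = KFold(n_splits=10, shuffle=True, random_state=1)
fold_scores = []

for train_idx, val_idx in kf.split(XTrain):
    # train on 9/10ths of the training set...
    model = KNeighborsClassifier(n_neighbors=50)
    model.fit(XTrain[train_idx], yTrain[train_idx])
    # ...and evaluate on the held-out 1/10th
    fold_scores.append(metrics.accuracy_score(yTrain[val_idx], model.predict(XTrain[val_idx])))

# average over the 10 folds to get the cross-validated accuracy estimate
print("Mean 10-fold CV accuracy:", np.mean(fold_scores))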

You can use k-fold cross-validation to get an estimate of model accuracy, and you can use these estimates to tweak your model until you are happy. This lets you leave the test data alone until the very end, thus side-stepping the danger of overfitting to it. In other words, cross-validation provides a way to simulate having more data than you actually have, so that you do not have to “spend” your test data until the very end of model building. k-fold cross-validation and its variants are extremely popular and very useful, especially if you’re trying out lots and lots of different models (e.g. if you want to test how well a load of differently parameterized models perform).
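In scikit-learn, the cross_val_score function does the partitioning, fitting, and scoring in a single call. For instance, to get a 10-fold cross-validated accuracy estimate for the k=50 model using only the training data:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 10-fold cross-validated accuracy of a k=50 model, using only the training set
cv_scores_k50 = cross_val_score(KNeighborsClassifier(n_neighbors=50), XTrain, yTrain, cv=10)
print("Mean CV accuracy for k=50:", cv_scores_k50.mean())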

Comparing training error, cross-validation error, and test error

So what would be the best value for k (the number of neighbors used by the kNN model)? Try out different values for k when building models with the training data and see how well the resulting models fare when predicting the classes of either the training set itself or the test set. Finally, see how good k-fold cross-validation is at indicating the best k.

Note: in practice, when scanning a parameter like this, it would be a bad idea to use the training set for testing the model. In equal measure, you would never scan a parameter using the test set multiple times (once for each parameter value tried). In the following plot you use these calculations just as an illustration to see what they would look like. In practice, only k-fold cross validation is a safe approach!

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score

knn = KNeighborsClassifier()

# the range of number of neighbors you want to test
n_neighbors = np.arange(1, 141, 2)

# here you store the prediction errors for each dataset used
train_scores = list()
test_scores = list()
cv_scores = list()

# loop through possible n_neighbors and try them out
for n in n_neighbors:
    knn.n_neighbors = n
    knn.fit(XTrain, yTrain)
    train_scores.append(1 - metrics.accuracy_score(yTrain, knn.predict(XTrain)))  # this will over-estimate the accuracy
    test_scores.append(1 - metrics.accuracy_score(yTest, knn.predict(XTest)))
    cv_scores.append(1 - cross_val_score(knn, XTrain, yTrain, cv=10).mean())  # you take the mean of the CV scores

# convert to numpy arrays so the boolean indexing below works
train_scores = np.array(train_scores)
test_scores = np.array(test_scores)
cv_scores = np.array(cv_scores)


So what would be the best value to pick for k? When multiple values give the same prediction error, you just pick the smallest value for k.

# what do these different datasets think is the best value of k?
print(
    'The best values of k are:\n'
    '{} according to the Training Set\n'
    '{} according to the Test Set and\n'
    '{} according to Cross-Validation'.format(
        min(n_neighbors[train_scores == min(train_scores)]),
        min(n_neighbors[test_scores == min(test_scores)]),
        min(n_neighbors[cv_scores == min(cv_scores)])
    )
)

The best values of k are: 
1 according to the Training Set
23 according to the Test Set and
11 according to Cross-Validation

Rather than just collecting the best ks, have a peek at what the prediction error is over the range of k values tested.

# let's plot the error you get with different values of k
plt.figure(figsize=(10,7.5))
plt.plot(n_neighbors, train_scores, c="black", label="Training Set")
plt.plot(n_neighbors, test_scores, c="black", linestyle="--", label="Test Set")
plt.plot(n_neighbors, cv_scores, c="green", label="Cross-Validation")
plt.xlabel('Number of K Nearest Neighbors')
plt.ylabel('Classification Error')
plt.gca().invert_xaxis()
plt.legend(loc = "lower left")
plt.show()


[Figure: training, test, and cross-validation error as a function of the number of nearest neighbors k]

Let’s talk about the classification error on the training set. The fewer neighbors you consider, the lower the prediction error when evaluating the models on the training set (black solid line). This makes sense, since you approach the scenario where each point only considers itself when making new classifications, which leads to perfect “predictions”. The test error follows a similar trajectory, but increases again after a certain point because of local overfitting. This behaviour indicates that the specific test set sample is no longer modelled well by the fit that was built on the training data.

In this plot, you see that k-fold cross-validation highlights a region in the parameter space (i.e. very low values of k) that is very prone to overfitting. Even though cross-validation and the test set evaluation lead to somewhat different optima, they are both pretty decent and you are clearly in the right ballpark. You can also see that cross-validation is a reasonable estimator of test error. This type of plot is nice to study in order to get a good feeling for how a certain parameter influences model performance, and it helps build an intuition about your dataset.
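As an aside (this is not part of the original walkthrough, just a pointer), scikit-learn's GridSearchCV automates exactly this kind of cross-validated parameter scan over k, so you don't have to write the loop yourself:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# cross-validated scan over the same range of k, using only the training data
param_grid = {"n_neighbors": np.arange(1, 141, 2)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid.fit(XTrain, yTrain)

print("Best k according to 10-fold cross-validation:", grid.best_params_["n_neighbors"])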

Just show me the code!

Here is the code for generating the above plots, and for doing the training and testing of the different kNN algorithms. This code is an adapted version of this scikit-learn example, and most of it deals with the finicky details of calculating decision boundaries and making the plots look nice. The meaty machine learning parts of splitting the dataset, fitting the algorithm, and testing it were covered above.

def detect_plot_dimension(X, h=0.02, b=0.05):
    x_min, x_max = X[:, 0].min() - b, X[:, 0].max() + b
    y_min, y_max = X[:, 1].min() - b, X[:, 1].max() + b
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    dimension = xx, yy
    return dimension

def detect_decision_boundary(dimension, model):
    xx, yy = dimension  # unpack the dimensions
    boundary = model.predict(np.c_[xx.ravel(), yy.ravel()])
    boundary = boundary.reshape(xx.shape)  # put the result into a color plot
    return boundary

def plot_decision_boundary(panel, dimension, boundary, colors=['#DADDED', '#FBD8D8']):
    xx, yy = dimension  # unpack the dimensions
    panel.contourf(xx, yy, boundary, cmap=ListedColormap(colors), alpha=1)
    panel.contour(xx, yy, boundary, colors="g", alpha=1, linewidths=0.5)  # the decision boundary in green

def plot_dataset(panel, X, y, colors=["#EE3D34", "#4458A7"], markers=["x", "o"]):
    panel.scatter(X[y == 1, 0], X[y == 1, 1], color=colors[0], marker=markers[0])
    panel.scatter(X[y == 0, 0], X[y == 0, 1], color=colors[1], marker=markers[1])

def calculate_prediction_error(model, X, y):
    yPred = model.predict(X)
    score = 1 - round(metrics.accuracy_score(y, yPred), 2)
    return score

def plot_prediction_error(panel, dimension, score, b=.3):
    xx, yy = dimension  # unpack the dimensions
    panel.text(xx.max() - b, yy.min() + b, ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')
def explore_fitting_boundaries(model, n_neighbors, datasets, width):
    # determine the height of the plot given that the aspect ratio of each panel should be equal
    height = float(width) / len(n_neighbors) * len(datasets.keys())

    nrows = len(datasets.keys())
    ncols = len(n_neighbors)

    # set up the plot
    figure, axes = plt.subplots(
        nrows,
        ncols,
        figsize=(width, height),
        sharex=True,
        sharey=True
    )

    dimension = detect_plot_dimension(X, h=0.02)  # the dimension of each subplot, based on the data

    # plotting the dataset and decision boundaries
    i = 0
    for n in n_neighbors:
        model.n_neighbors = n
        model.fit(datasets["Training Set"][0], datasets["Training Set"][1])
        boundary = detect_decision_boundary(dimension, model)
        j = 0
        for d in datasets.keys():
            try:
                panel = axes[j, i]
            except (TypeError, IndexError):
                if (nrows * ncols) == 1:
                    panel = axes
                elif nrows == 1:  # if you only have one dataset
                    panel = axes[i]
                elif ncols == 1:  # if you only try one number of neighbors
                    panel = axes[j]
            plot_decision_boundary(panel, dimension, boundary)  # plot the decision boundary
            plot_dataset(panel, X=datasets[d][0], y=datasets[d][1])  # plot the observations
            score = calculate_prediction_error(model, X=datasets[d][0], y=datasets[d][1])
            plot_prediction_error(panel, dimension, score, b=0.2)  # plot the score

            # make compacted layout
            panel.set_frame_on(False)
            panel.set_xticks([])
            panel.set_yticks([])

            # format the axis labels
            if i == 0:
                panel.set_ylabel(d)
            if j == 0:
                panel.set_title('k={}'.format(n))
            j += 1
        i += 1

    plt.subplots_adjust(hspace=0, wspace=0)  # make compacted layout


You can then run the code like this:
# specify the model and settings
model = KNeighborsClassifier()
n_neighbors = [200, 99, 50, 23, 11, 1]
datasets = {
    "Training Set": [XTrain, yTrain],
    "Test Set": [XTest, yTest]
}
width = 20

# explore the fitting boundaries for each k on both datasets
explore_fitting_boundaries(model=model, n_neighbors=n_neighbors, datasets=datasets, width=width)


[Figure: grid of kNN decision boundaries on the training and test sets for k = 200, 99, 50, 23, 11 and 1]

Conclusion

The bias-variance trade-off appears in a lot of different areas of machine learning. All algorithms can be considered to have a certain degree of flexibility and this is certainly not specific to kNN. The goal of finding the sweet spot of flexibility that describes the patterns in the data well but is still generalizable to new data applies to basically all algorithms.
