Using Learning Curves to Diagnose Model State: Underfitting or Overfitting

Bias and Variance of a Model

Every model has strengths and weaknesses. A model's generalization error can be decomposed into bias, variance, and noise. Bias is the average error over different training sets, variance measures how sensitive the model is to the particular training set it was fitted on, and noise is a property of the data itself. To keep both bias and variance small (that is, to maximize generalization ability), the usual approach is to choose the algorithm (linear or nonlinear) and its hyperparameters carefully. Another way to reduce variance is to increase the number of training samples; in the limit, if every possible sample were in the training set, there would be no variance at all. In real projects, however, the amount of training data usually cannot be increased.
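As a minimal sketch of this decomposition (the polynomial degree, noise level, and number of resampled training sets below are illustrative assumptions), bias and variance can be estimated empirically by refitting the same estimator on many independently drawn training sets and inspecting its predictions at fixed test points:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

def true_fun(x):
    # The target function used throughout this post: cos(3*pi*x/2).
    return np.cos(1.5 * np.pi * x)

x_test = np.linspace(0, 1, 50)
n_repeats, n_samples, noise = 200, 30, 0.1
degree = 4  # try 1 (high bias) or 15 (high variance) to see the trade-off

# Fit the same model on many independent training sets and store its predictions.
preds = np.empty((n_repeats, x_test.size))
for i in range(n_repeats):
    x = rng.rand(n_samples)
    y = true_fun(x) + noise * rng.randn(n_samples)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x[:, None], y)
    preds[i] = model.predict(x_test[:, None])

# Bias^2: squared gap between the average prediction and the true function.
# Variance: spread of the predictions across training sets.
bias2 = np.mean((preds.mean(axis=0) - true_fun(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 ~ {bias2:.4f}   variance ~ {variance:.4f}   noise ~ {noise ** 2:.4f}")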
To fit the function f(x) = cos(3*pi*x/2), three polynomial estimators of different degrees are used. The first underfits: the model is too simple and the bias is too large. The second fits well. The third overfits: the model is too complex and highly sensitive to the particular training set, so if the training set were replaced, its score would drop sharply.
[Figure: polynomial fits of three different degrees to cos(3*pi*x/2), showing underfitting, a good fit, and overfitting]
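A minimal sketch of this comparison (the specific degrees 1, 4, and 15 are assumptions; the post does not state which degrees the figure uses), scoring each polynomial fit by cross-validated mean squared error:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples = 30
X = np.sort(rng.rand(n_samples))
y = np.cos(1.5 * np.pi * X) + 0.1 * rng.randn(n_samples)

# Degrees chosen to illustrate underfitting, a good fit, and overfitting.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # A large cross-validated error signals a model that is either too simple
    # or too sensitive to the particular training set.
    scores = cross_val_score(model, X[:, None], y,
                             scoring="neg_mean_squared_error", cv=10)
    print(f"degree={degree:2d}   CV MSE = {-scores.mean():.3f} (+/- {scores.std():.3f})")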
One-dimensional data can be visualized directly, but in practice, for high-dimensional data that cannot be visualized, two tools are available:

  • validation curve
  • learning curve

Validation Curve

A good model is one with good generalization ability. To evaluate a model we need a metric, i.e. a scoring function such as accuracy or precision. Hyperparameters can be selected with GridSearchCV or RandomizedSearchCV, but the selection is driven by scores computed on a validation set. If we tune hyperparameters according to the validation score, that score becomes biased and is no longer a good estimate of generalization ability. In principle, to obtain a correct estimate of generalization, the score must be computed on a separate, held-out test set.
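A minimal sketch of that workflow (the parameter grid below is only an illustrative assumption), tuning gamma on the training portion and reporting the score on a held-out test set:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Hold out a test set that the hyperparameter search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Illustrative grid; the post does not specify one.
param_grid = {"gamma": np.logspace(-6, -1, 6)}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

print("best gamma:", search.best_params_["gamma"])
print("validation score (biased by the search):", search.best_score_)
print("test score (unbiased estimate):", search.score(X_test, y_test))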
That said, plotting the training score and validation score against a single hyperparameter is sometimes enough to judge whether the model is overfitting or underfitting.
The figure below comes from a multi-class handwritten-digit classification problem using SVC. The x-axis is the hyperparameter gamma, the y-axis is the score, and there are separate curves for the training score and the cross-validation score. The model state can be read off by region: A and B indicate underfitting, C a good fit, and D overfitting.
[Figure: validation curve of SVC on the digits dataset, score vs. gamma, with regions A-D marked]

Learning Curve

A learning curve shows how the scores on the training set and the validation set evolve as the number of training samples grows. From it we can see how much the model would benefit from more data, and what state the model is in.
If both the validation score and the training score stay at a low level as samples are added, the model is underfitting; adding more samples will not raise its score. For example, handwritten-digit recognition with a Naive Bayes model:
[Figure: learning curve of GaussianNB on the digits dataset]
If, at the largest training-set size, the training score is much higher than the validation score, the model is overfitting. In the figure below, the training score stays high throughout while the validation score keeps rising. Although the training-set score remains close to 1, the cross-validation score keeps increasing and the two curves nearly coincide at the last points, so the model is not overfitting here.
[Figure: learning curve of SVC (RBF kernel) on the digits dataset]
Again using SVC, the learning curves for different gamma values correspond to points A, B, C, and D on the validation curve: A and B are underfitting, C is a good fit, and D is overfitting. A and D are easy to tell apart, while B and C are harder to distinguish: with little data, B's training score never gets close to 1, whereas C's does. What they share is that the two curves essentially coincide once the training-set size reaches its maximum. See below:

[Figures: learning curves of SVC for the gamma values corresponding to points A, B, C, and D]
Code:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve, ShuffleSplit

# Load the digits dataset and define a cross-validation strategy.
X, y = load_digits(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Sweep gamma over several orders of magnitude.
param_range = np.array([1.0e-6, 1.0e-5, 1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1])
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=cv, scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SVM")
plt.xlabel(r"$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(param_range, train_scores_mean, label="Training score",
             color="darkorange", lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color="darkorange", lw=lw)
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
             color="navy", lw=lw)
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2,
                 color="navy", lw=lw)
plt.legend(loc="best")
plt.grid()
plt.show()

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit



def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


digits = load_digits()
X, y = digits.data, digits.target


title = "Learning Curves (Naive Bayes)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = GaussianNB()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)

plt.show()

ylim_min = 0.7

# SVC is more expensive than Naive Bayes, so use fewer CV iterations.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.00001$)"
plot_learning_curve(SVC(gamma=1.0e-5), title, X, y, (ylim_min, 1.01), cv=cv, n_jobs=4)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.0001$)"
plot_learning_curve(SVC(gamma=1.0e-4), title, X, y, (ylim_min, 1.01), cv=cv, n_jobs=4)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.0005$)"
plot_learning_curve(SVC(gamma=5.0e-4), title, X, y, (ylim_min, 1.01), cv=cv, n_jobs=4)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
plot_learning_curve(SVC(gamma=1.0e-3), title, X, y, (ylim_min, 1.01), cv=cv, n_jobs=4)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.0015$)"
plot_learning_curve(SVC(gamma=1.5e-3), title, X, y, (ylim_min, 1.01), cv=cv, n_jobs=4)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.008$)"
plot_learning_curve(SVC(gamma=8.0e-3), title, X, y, (ylim_min, 1.01), cv=cv, n_jobs=4)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.01$)"
plot_learning_curve(SVC(gamma=1.0e-2), title, X, y, (ylim_min, 1.01), cv=cv, n_jobs=4)

plt.show()

Reference: https://scikit-learn.org/stable/modules/learning_curve.html
