Overfitting and Its Avoidance


Main contents:

  1. Overfitting (the problem)
  2. How to detect and prevent overfitting

——————————————————————————————————

 

An overfitted model generally fails to generalize to new data.

 

Fitting a model involves a trade-off between two concerns: fitting the training data closely and generalizing well to unseen data.

 


 

Note that evaluating a model on the same data it was trained on is meaningless: a "Table Model" that simply memorizes the training data would score perfectly on such a test.

 

In practice we split the dataset into training and testing (holdout) portions.

 

Note: in Python this can be done as follows (the old sklearn.cross_validation module has been removed; train_test_split now lives in sklearn.model_selection). Here X is the feature matrix and y the label vector:

from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

 


How to detect overfitting

  • Fitting graph (a plot of model error against model complexity)
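A minimal fitting-graph sketch, assuming synthetic data from scikit-learn's make_classification (hypothetical, for illustration only), with tree depth as the complexity axis:

```python
# Build a fitting graph: training vs. holdout error as complexity grows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

train_err, test_err = [], []
depths = list(range(1, 15))            # tree depth as the complexity axis
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0)
    tree.fit(Xtrain, ytrain)
    train_err.append(1 - tree.score(Xtrain, ytrain))
    test_err.append(1 - tree.score(Xtest, ytest))
# Plotting train_err and test_err against depths gives the fitting graph:
# training error keeps falling with complexity, while holdout error
# eventually stops improving -- the widening gap signals overfitting.
```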

 


 

Overfitting in two model families

  1. Tree induction
  • Model complexity grows with the number of nodes in the tree

 

 

  2. Numeric models (mathematical functions)
  • Model complexity grows with the number of variables
  • An intuitive example:

In two dimensions, any two points can be fit exactly by a line.

In three dimensions, any three points can be fit exactly by a plane.

……

  • As the number of dimensions increases, we can fit any number of points exactly (the model gains more and more parameters).

In that regime overfitting comes easily, and reducing the number of attributes is one way to prevent it.
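The dimensionality argument can be made concrete in a few lines of NumPy (a sketch with randomly generated data): with as many parameters as data points, a linear model fits any targets exactly, even though the targets here are pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, n))   # n points in n dimensions
y = rng.normal(size=n)        # arbitrary (noise) targets

w = np.linalg.solve(X, y)     # exact solution of X @ w = y
residual = float(np.max(np.abs(X @ w - y)))
# residual is (numerically) zero: a "perfect" but entirely spurious fit
```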

 

 

SVM versus logistic regression

  • Sensitivity to individual examples

An SVM is less sensitive to individual examples than a logistic regression model.
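One simplified way to see this (a sketch, not the full story) is to compare the two models' loss functions on a single correctly classified example lying far beyond the decision boundary, here with an assumed margin score of f = 5:

```python
import numpy as np

f = 5.0
hinge_loss = max(0.0, 1.0 - f)               # SVM: exactly zero past the margin
logistic_loss = float(np.log1p(np.exp(-f)))  # logistic regression: never zero
```

Because the hinge loss is exactly zero for examples beyond the margin, such examples exert no pull on the SVM's solution, whereas every example retains some influence on a logistic regression fit.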

 

 

Drawbacks of overfitting

  • A model that merely memorizes the training data is useless: it cannot generalize.
  • The more complex a model becomes, the more readily it exploits spurious relationships that look meaningful but are not.

 

 

 

Detecting overfitting: further analysis

  • The fitting graph discussed above relies on holdout evaluation, which is only a single test.
  • Building on that idea, cross-validation provides a stronger guard against overfitting.
  • Cross-validation is essentially holdout evaluation repeated over several different splits of the data.
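A minimal cross-validation sketch with scikit-learn, assuming synthetic data: cv=5 means five different splits, and hence five separate holdout evaluations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
# scores holds one holdout accuracy per split; their mean and spread
# are more trustworthy than any single holdout evaluation.
```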

 

  • Cross-validation schematic

 


 

 

Further idea: building a modeling "laboratory"

 

 

Learning curves

  • A plot of model generalization performance against the amount of training data

 

Learning curves of logistic regression and decision trees

 

 

  • As the plot shows, a learning curve is steep at first, then its growth rate slows, and eventually it flattens out (diminishing returns).
  • Judge where your current amount of data sits on the learning curve for the model you are using, and use that to decide whether further investment in data is worthwhile.
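A learning-curve sketch with scikit-learn's learning_curve helper, assuming synthetic data, measuring generalization score at five training-set sizes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
mean_test = test_scores.mean(axis=1)   # generalization vs. amount of data
# Plotting mean_test against sizes gives the learning curve described above.
```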

 

 

Avoiding overfitting in tree induction

  1. Stop growing the tree before it becomes too large
  • Require a minimum number of instances in each leaf (a threshold)
  • The key question is how to choose that threshold (hypothesis tests / p-values are one option)

  2. Let the tree grow to full size, then prune it back
  • Prune leaves until any further change would reduce accuracy
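Both strategies can be sketched with scikit-learn's DecisionTreeClassifier, assuming synthetic data: min_samples_leaf implements the "stop early" threshold, and ccp_alpha (cost-complexity pruning) implements "grow, then prune".

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

node_counts = (full.tree_.node_count,
               stopped.tree_.node_count,
               pruned.tree_.node_count)
# Both constrained trees end up markedly smaller than the fully grown one.
```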

 

Nested cross-validation

 

 

  • Split the training set again internally in order to tune the model (for example, choosing parameters or the number of features)
  • Cull the feature set (example: SFS, sequential forward selection)
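A nested cross-validation sketch with scikit-learn, assuming synthetic data: the inner grid search re-splits each training fold to choose max_depth, while the outer loop scores the whole tuning procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     {"max_depth": [2, 4, 8]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
# outer_scores estimates how the tuned model generalizes, without
# letting parameter selection peek at the outer test folds.
```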

 

 

Important: Regularization

  • The key idea is to construct an objective function that includes a penalty term, then optimize it
  • In general form: w* = argmin_w [ fit(w; data) + λ · penalty(w) ]

 

  • Extensions:
    1. Combining an L2-norm penalty with least squares yields ridge regression.
    2. Combining an L1-norm penalty with least squares yields the lasso.
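A sketch of the two penalties with scikit-learn, assuming synthetic data in which only the first of 30 features actually matters:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))
y = X[:, 0] + 0.1 * rng.normal(size=50)   # only feature 0 is informative

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)        # L2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)        # L1 penalty: zeroes many weights

n_zero = int(np.sum(lasso.coef_ == 0.0))  # sparsity from the L1 penalty
```

The L1 penalty drives many coefficients exactly to zero (feature selection), while the L2 penalty shrinks the whole coefficient vector without zeroing it.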

 

Important: the problem of multiple comparisons is the underlying reason for overfitting

Sidebar: Beware of “multiple comparisons”

Consider the following scenario. You run an investment firm. Five years ago, you wanted to have some marketable small-cap mutual fund products to sell, but your analysts had been awful at picking small-cap stocks. So you undertook the following procedure. You started 1,000 different mutual funds, each including a small set of stocks randomly chosen from those that make up the Russell 2000 index (the main index for small-cap stocks). Your firm invested in all 1,000 of these funds, but told no one about them. Now, five years later, you look at their performance. Since they have different stocks in them, they will have had different returns. Some will be about the same as the index, some will be worse, and some will be better. The best one might be a lot better. Now, you liquidate all the funds but the best few, and you present these to the public. You can “honestly” claim that their 5-year return is substantially better than the return of the Russell 2000 index.

So, what’s the problem? The problem is that you randomly chose the stocks! You have no idea whether the stocks in these “best” funds performed better because they indeed are fundamentally better, or because you cherry-picked the best from a large set that simply varied in performance. If you flip 1,000 fair coins many times each, one of them will have come up heads much more than 50% of the time. However, choosing that coin as the “best” of the coins for later flipping obviously is silly. These are instances of “the problem of multiple comparisons,” a very important statistical phenomenon that business analysts and data scientists should always keep in mind. Beware whenever someone does many tests and then picks the results that look good. Statistics books will warn against running multiple statistical hypothesis tests, and then looking at the ones that give “significant” results. These usually violate the assumptions behind the statistical tests, and the actual significance of the results is dubious.

 
