Training set, validation set, test set

Reposted from the original blog post (source links below).

In supervised machine learning, the data are usually split into two or three parts: a training set, a validation set, and a test set.

http://blog.sina.com.cn/s/blog_4d2f6cf201000cjx.html

In general, the samples need to be divided into three independent parts: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set is used to determine the network structure or the parameters that control model complexity, and the test set is used to assess how well the finally selected model performs. A typical split assigns 50% of the samples to the training set and 25% to each of the other two, with all three parts drawn at random from the full sample.
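
To make that typical split concrete, here is a minimal sketch of a random 50%/25%/25% split in Python (NumPy only; the data and array names are purely illustrative):

import numpy as np

X = np.random.rand(1000, 20)                     # illustrative features
y = np.random.randint(0, 2, size=1000)           # illustrative labels

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))                    # shuffle so all three parts are drawn at random

n_train = int(0.50 * len(X))                     # 50% training
n_val = int(0.25 * len(X))                       # 25% validation, remainder test

X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_val, y_val = X[idx[n_train:n_train + n_val]], y[idx[n_train:n_train + n_val]]
X_test, y_test = X[idx[n_train + n_val:]], y[idx[n_train + n_val:]]
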
When there are few samples, that split is no longer appropriate. A common practice is to hold out a small portion as the test set and apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K parts, train on K-1 parts and validate on the remaining part in turn, compute the prediction sum of squared errors each time, and finally average the K results as the criterion for choosing the best model structure. In the special case K = N, this is the leave-one-out method.
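
A rough sketch of K-fold cross-validation on the remaining samples, using scikit-learn (the estimator, K, and data are illustrative; setting K equal to N would give the leave-one-out method):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X = np.random.rand(100, 5)                       # the N remaining samples (illustrative)
y = np.random.rand(100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)            # shuffle, split evenly into K parts
sse_per_fold = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])    # train on K-1 parts
    pred = model.predict(X[val_idx])                            # validate on the remaining part
    sse_per_fold.append(np.sum((pred - y[val_idx]) ** 2))       # prediction sum of squared errors

print(np.mean(sse_per_fold))   # averaged over the K folds; basis for choosing the model structure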

http://www.cppblog.com/guijie/archive/2008/07/29/57407.html

These three terms appear constantly in machine learning papers, yet many people are not entirely clear about what they mean, and the latter two in particular are often confused. Ripley, B.D. (1996) gives definitions of all three in his classic monograph Pattern Recognition and Neural Networks:
Training set: A set of examples used for learning, that is to fit the parameters [i.e., weights] of the classifier.
Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network. 
Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.
Clearly, the training set is used to train the model, i.e. to determine its parameters, such as the weights in an ANN; the validation set is used for model selection, i.e. the final tuning and determination of the model, such as the structure of an ANN; and the test set is used purely to assess the generalization ability of the trained model. Of course, the test set cannot guarantee that the model is correct; it only shows that similar data will give similar results under this model. In practice, the data are often split into only two parts, a training set and a test set, and most papers do not involve a validation set.
Ripley also discusses why we need separate test and validation sets:
1. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model.
2. After assessing the final model with the test set, YOU MUST NOT tune the model any further.
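
To illustrate the distinction, the following sketch (using scikit-learn's MLPClassifier purely as an example) fits the weights on the training set, chooses the number of hidden units on the validation set, and touches the test set only once, with no further tuning afterwards:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_score, best_model = -1.0, None
for hidden_units in (5, 10, 20):                            # candidate architectures
    model = MLPClassifier(hidden_layer_sizes=(hidden_units,), max_iter=500, random_state=0)
    model.fit(X_train, y_train)                             # training set: fits the weights
    score = model.score(X_val, y_val)                       # validation set: selects the architecture
    if score > best_score:
        best_score, best_model = score, model

print(best_model.score(X_test, y_test))                     # test set: assessed once, then left alone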

http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Step 1) Training: Each type of algorithm has its own parameter options (the number of layers in a Neural Network, the number of trees in a Random Forest, etc). For each of your algorithms, you must pick one option. That’s why you have a validation set.

Step 2) Validating: You now have a collection of algorithms. You must pick one algorithm. That’s why you have a test set. Most people pick the algorithm that performs best on the validation set (and that's ok). But, if you do not measure your top-performing algorithm’s error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the “best possible scenario” for the “most likely scenario.” That's a recipe for disaster.

Step 3) Testing: I suppose that if your algorithms did not have any parameters then you would not need a third step. In that case, your validation step would be your test step. Perhaps Matlab does not ask you for parameters or you have chosen not to use them and that is the source of your confusion.
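
A minimal sketch of these three steps, reusing the X_train/X_val/X_test split from the sketch above and two illustrative algorithm families:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Step 1) Training: fit each candidate algorithm, with its chosen parameter option, on the training set.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# Step 2) Validating: pick the algorithm that performs best on the validation set.
best_name = max(candidates, key=lambda name: candidates[name].score(X_val, y_val))

# Step 3) Testing: measure the chosen algorithm's error rate once on the held-out test set.
test_error = 1.0 - candidates[best_name].score(X_test, y_test)
print(best_name, test_error)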

My idea is that those options in the neural network toolbox are there to avoid overfitting. In that situation the weights are fit to the training data only and do not reflect the global trend. By having a validation set, the iterations can continue as long as decreases in the training error are accompanied by decreases in the validation error, and can stop once the validation error starts to increase while the training error keeps decreasing, which is exactly the overfitting phenomenon.

http://blog.sciencenet.cn/blog-397960-666113.html

http://stackoverflow.com/questions/2976452/whats-is-the-difference-between-train-validation-and-test-set-in-neural-networ

for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training

Once you're finished training, then you run against your testing set and verify that the accuracy is sufficient.
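
A rough, runnable rendering of that pseudocode, assuming the X_train/X_val/X_test split from the earlier sketches and using scikit-learn's MLPClassifier with partial_fit so that one call plays the role of one epoch (the threshold and epoch budget are arbitrary):

import numpy as np
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(10,), random_state=0)
classes = np.unique(y_train)
threshold = 0.95                                        # illustrative validation-accuracy threshold

for epoch in range(200):                                # for each epoch
    clf.partial_fit(X_train, y_train, classes=classes)  # propagate error and adjust the weights
    train_acc = clf.score(X_train, y_train)             # accuracy over the training data
    val_acc = clf.score(X_val, y_val)                   # accuracy over the validation data
    if val_acc >= threshold:                            # threshold validation accuracy met
        break                                           # exit training; otherwise continue

print(clf.score(X_test, y_test))                        # once finished, verify accuracy on the test set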

Training Set: this data set is used to adjust the weights on the neural network.

Validation Set: this data set is used to minimize overfitting. You're not adjusting the weights of the network with this data set; you're just verifying that any increase in accuracy over the training data set actually yields an increase in accuracy over a data set the network has not been shown before, or at least has not trained on (i.e. the validation data set). If the accuracy over the training data set increases but the accuracy over the validation data set stays the same or decreases, then you're overfitting your neural network and you should stop training.

Testing Set: this data set is used only for testing the final solution in order to confirm the actual predictive power of the network.

The validation set is used in the process of training; the testing set is not. The testing set allows you

1) to see whether the training set was enough, and
2) whether the validation set did its job of preventing overfitting.

If you use the testing set in the process of training, it becomes just another validation set and will no longer show what happens when new data is fed into the network.


The error surface will be different for different subsets of your data (batch learning). Therefore, if you find a very good local minimum for your test set data, it may not be a good point at all, and may even be a very bad point, on the surface generated by some other set of data for the same problem. Therefore you need to find a model that not only reaches a good weight configuration for the training set but is also able to predict new data (not in the training set) with low error. In other words, the network should generalize from the examples: it should learn the data rather than simply memorize the training set by overfitting it.

The validation data set is a set of data for the function you want to learn that you are not using directly to train the network. You train the network with the training data set. If you use a gradient-based algorithm to train the network, the error surface and the gradient at any point depend entirely on the training data set; thus the training data set is used directly to adjust the weights. To make sure you don't overfit the network, you feed the validation data set to the network and check that its error stays within some range. Because the validation set is not used directly to adjust the weights of the network, a good error on the validation set (and also on the test set) indicates that the network not only predicts the training examples well but can also be expected to perform well on new examples that were not used in the training process.

Early stopping is a way to stop training. There are different variations, but the main outline is this: both the training set and validation set errors are monitored; the training error decreases at each iteration (backpropagation and its relatives), and at first the validation error decreases as well. Training is stopped the moment the validation error starts to rise. The weight configuration at that point yields a model that predicts the training data well and should also predict data the network has not seen. However, because the validation data indirectly influence which weight configuration is selected, they cannot give an unbiased performance estimate. This is where the test set comes in: this set of data is never used in the training process. Once a model has been selected based on the validation set, the test set is applied to the model and its error is measured. This error is representative of the error we can expect on absolutely new data for the same problem.
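
As a small, hedged sketch of that stopping rule (again reusing the split and MLPClassifier from the earlier sketches; a fuller implementation would also keep a copy of the best weights seen so far and restore them):

import numpy as np
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(10,), random_state=0)
classes = np.unique(y_train)
best_val_err, patience, bad_epochs = float("inf"), 5, 0      # stop after 5 epochs with no improvement

for epoch in range(500):
    clf.partial_fit(X_train, y_train, classes=classes)       # training error driven down each iteration
    val_err = 1.0 - clf.score(X_val, y_val)                   # validation error monitored alongside it
    if val_err < best_val_err:
        best_val_err, bad_epochs = val_err, 0
    else:
        bad_epochs += 1                                       # validation error has started to rise
        if bad_epochs >= patience:
            break                                             # stop training at this point

print(1.0 - clf.score(X_test, y_test))    # test error: never used above, estimates error on new data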

