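K-fold cross-validation: sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None) (signature from older scikit-learn releases; since version 0.22 the default is n_splits=5)
Idea: split the dataset into n_splits mutually exclusive subsets (folds). In each round, one fold serves as the validation set and the remaining n_splits-1 folds form the training set. Training and testing are repeated n_splits times, yielding n_splits results.
Note: when the dataset cannot be divided evenly, the first n_samples % n_splits folds contain n_samples // n_splits + 1 samples each, and the remaining folds contain n_samples // n_splits samples each.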
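The fold-size rule can be checked directly. The following is a minimal sketch, assuming 12 samples and n_splits=5 (the same sizes used in the examples further down); the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 12, 5
X = np.arange(n_samples * 2).reshape(n_samples, 2)

# Sizes predicted by the rule: the first n_samples % n_splits folds
# receive one extra sample.
expected_sizes = [
    n_samples // n_splits + 1 if i < n_samples % n_splits else n_samples // n_splits
    for i in range(n_splits)
]

# Sizes actually produced by KFold (the test fold of each iteration).
actual_sizes = [len(test_index) for _, test_index in KFold(n_splits=n_splits).split(X)]

print(expected_sizes)  # [3, 3, 2, 2, 2]
print(actual_sizes)
```

Here 12 % 5 = 2, so the first two folds hold 3 samples and the other three hold 2.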
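Parameters:
n_splits: the number of folds to split the data into
shuffle: whether to shuffle the samples before splitting them into folds
① If False, the samples are split in their original order, so every run produces the same folds; random_state has no effect in this case
② If True and random_state is not set, the samples are shuffled first, so the folds differ from run to run
random_state: seed for the random number generator; it only matters when shuffle=True, where an integer makes the shuffled folds reproducible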
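Methods:
① get_n_splits(X=None, y=None, groups=None): returns the number of splitting iterations, i.e. the value of n_splits
② split(X, y=None, groups=None): splits the dataset into training and test sets; returns a generator that yields (train_index, test_index) pairs of index arrays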
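To show how the indices from split() feed the train/validate loop described in the idea above, here is a minimal sketch. The synthetic data and the choice of LogisticRegression are assumptions made purely for illustration; any estimator with fit and score would do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic, illustrative data: 60 samples, 2 features,
# label is 1 when the feature sum is positive.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kf.split(X):
    # split() yields integer index arrays, so use them to index the data.
    model = LogisticRegression()
    model.fit(X[train_index], y[train_index])
    scores.append(model.score(X[test_index], y[test_index]))

# One score per fold, so len(scores) equals n_splits.
print(len(scores), kf.get_n_splits())
print(np.mean(scores))
```

Averaging the per-fold scores is a common way to summarise the n_splits results, though other summaries (for example the standard deviation) are equally valid.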
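The following example uses a dataset that cannot be divided evenly and varies the parameter values to observe the results.
① With shuffle=False, running the code twice gives identical results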
- In [1]: from sklearn.model_selection import KFold
- ...: import numpy as np
- ...: X = np.arange(24).reshape(12,2)
- ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
- ...: kf = KFold(n_splits=5,shuffle=False)
- ...: for train_index, test_index in kf.split(X):
- ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
- ...:
- ...:
- train_index:[ 3 4 5 6 7 8 9 10 11] , test_index: [0 1 2]
- train_index:[ 0 1 2 6 7 8 9 10 11] , test_index: [3 4 5]
- train_index:[ 0 1 2 3 4 5 8 9 10 11] , test_index: [6 7]
- train_index:[ 0 1 2 3 4 5 6 7 10 11] , test_index: [8 9]
- train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]
-
- In [2]: from sklearn.model_selection import KFold
- ...: import numpy as np
- ...: X = np.arange(24).reshape(12,2)
- ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
- ...: kf = KFold(n_splits=5,shuffle=False)
- ...: for train_index, test_index in kf.split(X):
- ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
- ...:
- ...:
- train_index:[ 3 4 5 6 7 8 9 10 11] , test_index: [0 1 2]
- train_index:[ 0 1 2 6 7 8 9 10 11] , test_index: [3 4 5]
- train_index:[ 0 1 2 3 4 5 8 9 10 11] , test_index: [6 7]
- train_index:[ 0 1 2 3 4 5 6 7 10 11] , test_index: [8 9]
- train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]
② With shuffle=True, running the code twice gives different results
- In [3]: from sklearn.model_selection import KFold
- ...: import numpy as np
- ...: X = np.arange(24).reshape(12,2)
- ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
- ...: kf = KFold(n_splits=5,shuffle=True)
- ...: for train_index, test_index in kf.split(X):
- ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
- ...:
- ...:
- train_index:[ 0 1 2 4 5 6 7 8 10] , test_index: [ 3 9 11]
- train_index:[ 0 1 2 3 4 5 9 10 11] , test_index: [6 7 8]
- train_index:[ 2 3 4 5 6 7 8 9 10 11] , test_index: [0 1]
- train_index:[ 0 1 3 4 5 6 7 8 9 11] , test_index: [ 2 10]
- train_index:[ 0 1 2 3 6 7 8 9 10 11] , test_index: [4 5]
-
- In [4]: from sklearn.model_selection import KFold
- ...: import numpy as np
- ...: X = np.arange(24).reshape(12,2)
- ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
- ...: kf = KFold(n_splits=5,shuffle=True)
- ...: for train_index, test_index in kf.split(X):
- ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
- ...:
- ...:
- train_index:[ 0 1 2 3 4 5 7 8 11] , test_index: [ 6 9 10]
- train_index:[ 2 3 4 5 6 8 9 10 11] , test_index: [0 1 7]
- train_index:[ 0 1 3 5 6 7 8 9 10 11] , test_index: [2 4]
- train_index:[ 0 1 2 3 4 6 7 9 10 11] , test_index: [5 8]
- train_index:[ 0 1 2 4 5 6 7 8 9 10] , test_index: [ 3 11]
③ With shuffle=True and random_state set to an integer, every run gives the same results
- In [5]: from sklearn.model_selection import KFold
- ...: import numpy as np
- ...: X = np.arange(24).reshape(12,2)
- ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
- ...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
- ...: for train_index, test_index in kf.split(X):
- ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
- ...:
- ...:
- train_index:[ 0 1 2 3 5 7 8 9 10] , test_index: [ 4 6 11]
- train_index:[ 0 1 3 4 5 6 7 9 11] , test_index: [ 2 8 10]
- train_index:[ 0 2 3 4 5 6 8 9 10 11] , test_index: [1 7]
- train_index:[ 0 1 2 4 5 6 7 8 10 11] , test_index: [3 9]
- train_index:[ 1 2 3 4 6 7 8 9 10 11] , test_index: [0 5]
-
- In [6]: from sklearn.model_selection import KFold
- ...: import numpy as np
- ...: X = np.arange(24).reshape(12,2)
- ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
- ...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
- ...: for train_index, test_index in kf.split(X):
- ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
- ...:
- ...:
- train_index:[ 0 1 2 3 5 7 8 9 10] , test_index: [ 4 6 11]
- train_index:[ 0 1 3 4 5 6 7 9 11] , test_index: [ 2 8 10]
- train_index:[ 0 2 3 4 5 6 8 9 10 11] , test_index: [1 7]
- train_index:[ 0 1 2 4 5 6 7 8 10 11] , test_index: [3 9]
- train_index:[ 1 2 3 4 6 7 8 9 10 11] , test_index: [0 5]
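Instead of comparing printed output by eye, the reproducibility claims can be checked programmatically. This is a minimal sketch; collect_test_folds is a hypothetical helper introduced here for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(24).reshape(12, 2)

def collect_test_folds(kf, X):
    """Return the test indices of every fold as a list of lists."""
    return [test_index.tolist() for _, test_index in kf.split(X)]

# Two independent splitters with the same seed produce identical folds.
seeded_a = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=0), X)
seeded_b = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=0), X)

# Without shuffling, the folds are simply consecutive blocks of indices.
unshuffled = collect_test_folds(KFold(n_splits=5, shuffle=False), X)

print(seeded_a == seeded_b)  # True
print(unshuffled)            # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9], [10, 11]]
```

The unseeded shuffle=True case is deliberately left out of the comparison: two such runs will usually differ, but nothing guarantees it, so it is not a reliable thing to assert.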
④ Ways to obtain the value of n_splits
- In [8]: kf.split(X)
- Out[8]: <generator object _BaseKFold.split at 0x00000000047FF990>
-
- In [9]: kf.get_n_splits()
- Out[9]: 5
-
- In [10]: kf.n_splits
- Out[10]: 5
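Because split() returns a lazy generator, the number of folds is only visible once it has been consumed. A minimal sketch, reusing the same 12-sample array as above, showing that all three ways of counting agree:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(24).reshape(12, 2)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Materialize the generator to count the (train_index, test_index) pairs.
folds = list(kf.split(X))

print(len(folds))         # 5
print(kf.get_n_splits())  # 5
print(kf.n_splits)        # 5

# Within each pair, training and test indices never overlap.
disjoint = all(
    len(set(train_index) & set(test_index)) == 0
    for train_index, test_index in folds
)
print(disjoint)           # True
```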