Dataset splitting: sklearn.model_selection.train_test_split(*arrays, **options)
Main parameters:
*arrays: lists, numpy arrays, scipy sparse matrices, or pandas DataFrames, all with the same number of samples
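test_size: float, int, or None; default None
①If a float, it must lie between 0.0 and 1.0 and gives the proportion of the samples placed in the test set
②If an int, it gives the absolute number of test samples
③If None, it is set to the complement of train_size; if train_size is also None, it defaults to 0.25
train_size: float, int, or None; default None
①If a float, it must lie between 0.0 and 1.0 and gives the proportion of the samples placed in the training set
②If an int, it gives the absolute number of training samples
③If None, it is set to the complement of test_size (0.75 when test_size is also None)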
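random_state: int, RandomState instance, or None; default None
①If None, the shuffle is not seeded, so the split can differ from run to run
②If an int, the same split is produced on every run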
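stratify: array-like or None; default None
①If None, the class-label proportions in the training and test sets are left to chance
②If not None, the array is used as class labels and the split is stratified: the training and test sets keep the same class proportions as that array, which helps with imbalanced datasets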
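The size rules above can be checked directly. The following is a minimal sketch, assuming a toy array of 20 samples (the array and the variable names are illustrative only); it compares a float test_size, an integer test_size, a lone train_size, and the fallback when both are left as None.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)  # toy data: 20 samples (an assumption for illustration)

# Float test_size: fraction of the samples held out for testing.
train_a, test_a = train_test_split(X, test_size=0.25, random_state=0)

# Integer test_size: absolute number of test samples.
train_b, test_b = train_test_split(X, test_size=5, random_state=0)

# Only train_size given: test_size becomes its complement.
train_c, test_c = train_test_split(X, train_size=0.5, random_state=0)

# Neither given: test_size falls back to 0.25.
train_d, test_d = train_test_split(X, random_state=0)

sizes = {
    "test_size=0.25": (len(train_a), len(test_a)),
    "test_size=5": (len(train_b), len(test_b)),
    "train_size=0.5": (len(train_c), len(test_c)),
    "both None": (len(train_d), len(test_d)),
}
for name, (n_train, n_test) in sizes.items():
    print(f"{name}: train={n_train}, test={n_test}")
```

With 20 samples, the first, second, and last calls should each give 15 training and 5 test samples, while train_size=0.5 gives 10 and 10.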
A few simple examples show what each parameter does:
①test_size sets the ratio between the test set and the training set
- In [1]: import numpy as np
- ...: from sklearn.model_selection import train_test_split
- ...: X = np.arange(20)
- ...: y = ['A','B','A','A','A','B','A','B','B','A',
- ...:      'A','B','B','A','A','B','A','B','A','A']
- ...: X_train, X_test, y_train, y_test = train_test_split(
- ...:     X, y, test_size=0.25, random_state=0)
- ...:
- In [2]: X_test.shape
- Out[2]: (5,)
- In [3]: X_train.shape
- Out[3]: (15,)
- In [4]: X_test ,y_test
- Out[4]: (array([18, 1, 19, 8, 10]), ['A', 'B', 'A', 'B', 'A'])
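Because X here is simply the indices 0 to 19, the output can be used to confirm two properties of the split. The sketch below, which reuses the same toy data (the helper names `aligned` and `covers_all` are introduced only for illustration), checks that each test label still belongs to its sample and that the two subsets together contain every sample exactly once.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)
y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# X holds the original positions, so y[i] must equal the label paired with i.
aligned = all(y[i] == label for i, label in zip(X_test, y_test))

# Every sample lands in exactly one of the two subsets.
combined = sorted(np.concatenate([X_train, X_test]).tolist())
covers_all = combined == list(range(20))

print(aligned, covers_all)
```

All arrays passed in one call are shuffled with the same permutation, which is why the labels stay attached to their samples.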
Running it again with random_state=0 gives the same result as above
- In [5]: import numpy as np
- ...: from sklearn.model_selection import train_test_split
- ...: X = np.arange(20)
- ...: y = ['A','B','A','A','A','B','A','B','B','A',
- ...:      'A','B','B','A','A','B','A','B','A','A']
- ...: X_train, X_test, y_train, y_test = train_test_split(
- ...:     X, y, test_size=0.25, random_state=0)
- ...: X_test ,y_test
- ...:
- Out[5]: (array([18, 1, 19, 8, 10]), ['A', 'B', 'A', 'B', 'A'])
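Instead of comparing the printed output by eye, the reproducibility can be asserted in code. This is a small sketch under the same toy setup; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)

# Two independent calls that share the same integer seed.
train_first, test_first = train_test_split(X, test_size=0.25, random_state=0)
train_second, test_second = train_test_split(X, test_size=0.25, random_state=0)

same_test = np.array_equal(test_first, test_second)
same_train = np.array_equal(train_first, train_second)
print(same_test, same_train)
```

Both comparisons should come out True, since an integer seed fixes the shuffle.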
Without random_state, repeated runs can return different splits:
- In [6]: import numpy as np
- ...: from sklearn.model_selection import train_test_split
- ...: X = np.arange(20)
- ...: y = ['A','B','A','A','A','B','A','B','B','A',
- ...:      'A','B','B','A','A','B','A','B','A','A']
- ...: X_train, X_test, y_train, y_test = train_test_split(
- ...:     X, y, test_size=0.25)
- ...: X_test ,y_test
- ...:
- Out[6]: (array([ 3, 18, 14, 7, 4]), ['A', 'A', 'A', 'B', 'A'])
- In [7]: import numpy as np
- ...: from sklearn.model_selection import train_test_split
- ...: X = np.arange(20)
- ...: y = ['A','B','A','A','A','B','A','B','B','A',
- ...:      'A','B','B','A','A','B','A','B','A','A']
- ...: X_train, X_test, y_train, y_test = train_test_split(
- ...:     X, y, test_size=0.25)
- ...: X_test ,y_test
- ...:
- Out[7]: (array([18, 6, 3, 14, 8]), ['A', 'A', 'A', 'A', 'B'])
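The two unseeded runs above can be generalized with a short loop. The sketch below is an illustration rather than a fixed recipe: it repeats the unseeded split 20 times (an arbitrary count) and collects the distinct test sets it sees.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)

# random_state is left unset, so each call draws a fresh shuffle.
test_sets = set()
for _ in range(20):
    _, X_test = train_test_split(X, test_size=0.25)
    test_sets.add(tuple(X_test.tolist()))

print(len(test_sets), "distinct test sets out of 20 runs")
```

The size of every test set stays at 5; only the membership changes between runs.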
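With stratify=y, both subsets keep the class proportions of y (12 'A' and 8 'B' here, so a test set of 5 gets 3 'A' and 2 'B'):
- In [8]: import numpy as np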
- ...: from sklearn.model_selection import train_test_split
- ...: X = np.arange(20)
- ...: y = ['A','B','A','A','A','B','A','B','B','A',
- ...:      'A','B','B','A','A','B','A','B','A','A']
- ...: X_train, X_test, y_train, y_test = train_test_split(
- ...:     X, y, test_size=0.25, stratify=y)
- ...: X_test ,y_test
- ...:
- Out[8]: (array([18, 8, 3, 10, 11]), ['A', 'B', 'A', 'A', 'B'])
- In [9]: import numpy as np
- ...: from sklearn.model_selection import train_test_split
- ...: X = np.arange(20)
- ...: y = ['A','B','A','A','A','B','A','B','B','A',
- ...:      'A','B','B','A','A','B','A','B','A','A']
- ...: X_train, X_test, y_train, y_test = train_test_split(
- ...:     X, y, test_size=0.25, stratify=y)
- ...: X_test ,y_test
- ...:
- Out[9]: (array([ 6, 19, 8, 17, 0]), ['A', 'A', 'B', 'B', 'A'])
- In [10]: X_train,y_train
- Out[10]:
- (array([ 7, 1, 11, 10, 15, 2, 3, 5, 4, 13, 12, 16, 18, 14, 9]),
- ['B', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A'])
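Counting the labels makes the effect of stratify easier to see than reading the lists. The following sketch uses collections.Counter on the same toy data and tries a handful of seeds (the range of five seeds is an arbitrary choice for illustration).

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)
y = ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A',
     'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A']

print("full data:", Counter(y))  # 12 'A' and 8 'B', a 60/40 ratio

train_counts = []
test_counts = []
for seed in range(5):
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    train_counts.append(Counter(y_train))
    test_counts.append(Counter(y_test))
    print(seed, "train:", dict(train_counts[-1]), "test:", dict(test_counts[-1]))
```

Which samples are chosen changes with the seed, but the counts should stay at 9 'A' and 6 'B' for training and 3 'A' and 2 'B' for testing, matching the 60/40 ratio of the full label list.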