Evaluating Algorithms 2020/5/28
1. Splitting the dataset
Methods:
1) Train/test (holdout) split
2) K-fold cross-validation
3) Leave-one-out cross-validation
4) Repeated random train/test splits
2. Examples
2.1. Train/test split
A common ratio is train : test = 0.67 : 0.33.
When to use:
Simple and fast; best suited to large datasets that are fairly balanced, so that both subsets remain representative of the problem.
# Example 1:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

np.set_printoptions(precision=3)

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)
array = data.values  # split the data into inputs and outputs
X = array[:, 0:8]
Y = array[:, 8]

from sklearn.model_selection import train_test_split

def test_train_test_split(X=X, Y=Y):
    test_size = 0.33
    seed = 4
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    model = LogisticRegression(max_iter=10000)  # default max_iter=100
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Evaluation result: %.3f%%" % (result * 100))

test_train_test_split()
# Evaluation result: 80.315%
2.2. K-fold cross-validation
A statistical evaluation method, also called rotation estimation: the data are partitioned by some rule, with one part used for training and another for evaluation.
K-fold cross-validation splits the original data into K groups. K is generally greater than 2 (K = 2 is only tried when the dataset is very small); common choices are 3, 5, or 10.
It effectively guards against both overfitting and underfitting.
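The partitioning described above can be seen directly by inspecting the fold indices. A minimal sketch on toy data (6 sample indices, assumed here purely for illustration): KFold partitions the samples into K folds, and each sample lands in a test fold exactly once.

```python
# Illustrative toy data: 6 samples, 3 folds (not the pima dataset).
from sklearn.model_selection import KFold

X_toy = list(range(6))                    # 6 toy samples
kfold = KFold(n_splits=3, shuffle=False)  # no shuffling: contiguous folds
splits = list(kfold.split(X_toy))
for train_idx, test_idx in splits:
    # each test fold holds 2 samples; the other 4 are used for training
    print("train:", train_idx, "test:", test_idx)
```

Together the three test folds cover all six samples, which is what lets every sample contribute to the evaluation once.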
# Example 2:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

def test_KFold(X=X, Y=Y):
    num_folds = 10
    seed = 7
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    model = LogisticRegression(max_iter=10000)  # default max_iter=100
    result = cross_val_score(model, X, Y, cv=kfold)
    print("Evaluation result: %.3f%% (%.3f%%)" % (result.mean() * 100, result.std() * 100))

test_KFold()
# Evaluation result: 77.216% (4.968%)
2.3. Leave-one-out cross-validation
With N samples in the original data, this is N-fold cross-validation: each sample in turn serves as the validation set while the remaining N-1 samples form the training set.
Advantages over K-fold cross-validation: in every round almost all of the samples are used to train the model, so training stays closest to the original sample distribution and the evaluation result is comparatively reliable;
and since no random factor affects the experiment, the procedure is fully reproducible.
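The N-fold structure and the absence of randomness can both be checked on toy data. A minimal sketch (4 sample indices, assumed here purely for illustration): LeaveOneOut yields exactly N splits, each with a single-sample test set, with no random state to set at all.

```python
# Illustrative toy data: 4 samples (not the pima dataset).
from sklearn.model_selection import LeaveOneOut

X_toy = list(range(4))
splits = list(LeaveOneOut().split(X_toy))
print(len(splits))  # one split per sample: 4
for train_idx, test_idx in splits:
    # test set is always a single sample; the other 3 train the model
    print("train:", train_idx, "test:", test_idx)
```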
# Example 3:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score

def test_LeaveOneOut(X=X, Y=Y):
    loocv = LeaveOneOut()
    model = LogisticRegression(max_iter=1000)  # default max_iter=100
    result = cross_val_score(model, X, Y, cv=loocv)
    print("Evaluation result: %.3f%% (%.3f%%)" % (result.mean() * 100, result.std() * 100))

test_LeaveOneOut()
# Evaluation result: 77.604% (41.689%)
2.4. Repeated random train/test splits
Randomly split the data into a training set and an evaluation set, then repeat the process many times, aggregating the scores much as in cross-validation.
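The key difference from K-fold is visible in the test indices. A minimal sketch on toy data (10 sample indices, assumed here purely for illustration): ShuffleSplit redraws the split independently on each repetition, so a given sample may appear in several test sets, or in none.

```python
# Illustrative toy data: 10 samples, 5 repetitions with 30% held out each time.
from sklearn.model_selection import ShuffleSplit

X_toy = list(range(10))
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=7)
splits = list(ss.split(X_toy))
for train_idx, test_idx in splits:
    # 3 of 10 samples, re-drawn each repetition (overlap across rows is possible)
    print("test:", sorted(test_idx))
```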
# Example 4: split the data 67:33 and repeat 10 times
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

def test_ShuffleSplit(X=X, Y=Y):
    n_splits = 10
    test_size = 0.33
    seed = 7
    shuffle_split = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    model = LogisticRegression(max_iter=1000)  # default max_iter=100
    result = cross_val_score(model, X, Y, cv=shuffle_split)
    print("Evaluation result: %.3f%% (%.3f%%)" % (result.mean() * 100, result.std() * 100))

test_ShuffleSplit()
# Evaluation result: 76.969% (2.631%)