sklearn - Splitting Data and Evaluating Algorithms (Lecture 6)

Evaluating Algorithms   2020/5/28


1. Splitting the Dataset

Methods:
1) Train/test split
2) K-fold cross-validation
3) Leave-one-out cross-validation
4) Repeated random train/test splits
2. Examples

2.1. Train/Test Split
A common split is train:test = 0.67:0.33.
When to use:
    Simple and fast; suited to large datasets that are fairly balanced, i.e. where the data represents the problem fairly evenly.

# Example 1:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

np.set_printoptions(precision=3)

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)

array = data.values  # split the data into input features and the output target
X = array[:, 0:8]
Y = array[:, 8]

from sklearn.model_selection import train_test_split

def test_train_test_split(X=X, Y=Y):
    test_size = 0.33
    seed = 4
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    model = LogisticRegression(max_iter=10000)  # the default is max_iter=100

    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Algorithm evaluation result: %.3f%%" % (result * 100))

test_train_test_split()
# Algorithm evaluation result: 80.315%
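As noted above, a plain random split works best when the data is fairly balanced. When the classes are imbalanced, the stratify parameter of train_test_split preserves the class ratio in both halves. A minimal sketch, reusing the X and Y loaded above:

from sklearn.model_selection import train_test_split

# Stratified variant: Y's class ratio is kept (approximately) identical
# in the training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=4, stratify=Y)
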
2.2. K-Fold Cross-Validation
    A statistical analysis method, also known as rotation estimation: the data is divided into groups by some rule, with one part used as the training set and another as the evaluation set.
    K-fold cross-validation splits the original data into K groups. K is generally greater than 2 (K=2 is only tried when the dataset is very small); common choices are 3, 5, and 10.
    It is an effective guard against both overfitting and underfitting.
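Before the full example below, a minimal sketch (on a toy array of my own, not the Pima data) of what KFold actually produces: each call to split() yields a pair of train/test index arrays, and every sample lands in the test set exactly once.

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)  # 6 toy samples, 2 features each

kfold_demo = KFold(n_splits=3, shuffle=True, random_state=7)
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(X_demo)):
    # Each of the 6 samples appears in exactly one test fold
    print("fold %d: train=%s test=%s" % (fold, train_idx, test_idx))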

# Example 2:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

def test_KFold(X=X, Y=Y):
    num_folds = 10
    seed = 7
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    model = LogisticRegression(max_iter=10000)  # the default is max_iter=100
    result = cross_val_score(model, X, Y, cv=kfold)
    print("Algorithm evaluation result: %.3f%% (%.3f%%)" % (result.mean() * 100, result.std() * 100))

test_KFold()
# Algorithm evaluation result: 77.216% (4.968%)
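A side note beyond the original example: for classification, sklearn also provides StratifiedKFold, which keeps each fold's class ratio close to that of the full dataset, something plain KFold does not guarantee. A minimal sketch reusing X and Y from above (the function name test_StratifiedKFold is my own):

from sklearn.model_selection import StratifiedKFold, cross_val_score

def test_StratifiedKFold(X=X, Y=Y):
    # Each fold keeps roughly the same class proportions as the full dataset
    skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    model = LogisticRegression(max_iter=10000)
    result = cross_val_score(model, X, Y, cv=skfold)
    print("Algorithm evaluation result: %.3f%% (%.3f%%)" % (result.mean() * 100, result.std() * 100))
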
2.3. Leave-One-Out Cross-Validation

With N samples in the original data there are N rounds of cross-validation: each sample serves as the validation set exactly once, with the remaining N-1 samples forming the training set.
Advantages over K-fold cross-validation: in each round almost all of the samples are used to train the model, so the training sets come closest to the original sample distribution and the evaluation result is fairly reliable;
and since no random factor influences the procedure, the experiment is fully reproducible.
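One practical caveat, not in the original text: leave-one-out fits one model per sample, so it becomes expensive on large datasets. A minimal sketch showing that the number of splits equals the number of samples:

from sklearn.model_selection import LeaveOneOut

loocv_demo = LeaveOneOut()
# For the Pima dataset (768 rows) this prints 768: one model fit per sample
print(loocv_demo.get_n_splits(X))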

# Example 3:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score

def test_LeaveOneOut(X=X, Y=Y):
    loocv = LeaveOneOut()
    model = LogisticRegression(max_iter=1000)  # the default is max_iter=100
    result = cross_val_score(model, X, Y, cv=loocv)
    print("Algorithm evaluation result: %.3f%% (%.3f%%)" % (result.mean() * 100, result.std() * 100))

test_LeaveOneOut()
# Algorithm evaluation result: 77.604% (41.689%)
Note the much larger standard deviation here: each fold holds a single sample, so every individual fold score is either 0% or 100%.
2.4. Repeated Random Train/Test Splits

Randomly split the data into a training set and an evaluation set, then repeat the process multiple times, as in cross-validation. Unlike K-fold, the randomly drawn test sets may overlap across repetitions.

# Example 4: split the data 67:33 and repeat 10 times
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

def test_ShuffleSplit(X=X, Y=Y):
    n_splits = 10
    test_size = 0.33
    seed = 7
    shuffle_split = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    model = LogisticRegression(max_iter=1000)  # the default is max_iter=100
    result = cross_val_score(model, X, Y, cv=shuffle_split)
    print("Algorithm evaluation result: %.3f%% (%.3f%%)" % (result.mean() * 100, result.std() * 100))

test_ShuffleSplit()
# Algorithm evaluation result: 76.969% (2.631%)
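To tie the approaches together, a small closing sketch (my own wrapper; the helper name compare_splitters is hypothetical) that evaluates the same model under each reusable splitter. train_test_split is a one-off function rather than a splitter object, so it is left out.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

def compare_splitters(X=X, Y=Y):
    splitters = {
        'KFold(10)': KFold(n_splits=10, shuffle=True, random_state=7),
        'LeaveOneOut': LeaveOneOut(),
        'ShuffleSplit(10, 0.33)': ShuffleSplit(n_splits=10, test_size=0.33, random_state=7),
    }
    model = LogisticRegression(max_iter=10000)
    for name, cv in splitters.items():
        result = cross_val_score(model, X, Y, cv=cv)
        print("%s: %.3f%% (%.3f%%)" % (name, result.mean() * 100, result.std() * 100))

compare_splitters()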