Implementing sklearn's train_test_split
sklearn.model_selection provides a function, train_test_split,
which splits a dataset into a training set and a test set, in preparation for later tuning (the "alchemy" stage).
First, how it is used in sklearn:
# Jupyter Notebook
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
print(np.sum(y_predict == y_test) / len(y_test))
Output:
The method is simple: split the dataset at a given ratio. Two details to watch:
* The data must be shuffled first, because the original dataset may be sorted (e.g. ordered by label).
* Sometimes you need the same shuffle to be reproducible across runs, so the function takes a NumPy random seed; by default no seed is set.
See the code in train_test_split.py:
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    assert X.shape[0] == y.shape[0], 'number of samples and labels must match'
    assert 0 <= test_ratio < 1, 'invalid test ratio'
    if seed is not None:  # `if seed:` would silently ignore seed=0
        np.random.seed(seed)
    shuffled_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    train_index = shuffled_indexes[test_size:]
    test_index = shuffled_indexes[:test_size]
    return X[train_index], X[test_index], y[train_index], y[test_index]
Now swap the sklearn split above for our own implementation and try it:
# Jupyter Notebook
%run train_test_split.py  # load the file
X_train, X_test, y_train, y_test = train_test_split(X, y)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
print(np.sum(y_predict == y_test) / len(y_test))
Output:
Because the iris dataset is small, kNN does not reach 100% accuracy here.
Since each split draws different random indices for the training and test sets,
the accuracy fluctuates a little from run to run, but stays within a normal range.