sklearn.cross_validation.train_test_split隨機劃分訓練集和測試集
一般形式:
train_test_split是交叉驗證中常用的函數,功能是從樣本中隨機的按比例選取train data和testdata,形式爲:
X_train,X_test, y_train, y_test =
cross_validation.train_test_split(train_data,train_target,test_size=0.4, random_state=0)
參數解釋:
train_data:所要劃分的樣本特徵集
train_target:所要劃分的樣本結果
test_size:樣本佔比,如果是整數的話就是樣本的數量
random_state:是隨機數的種子。
隨機數種子:其實就是該組隨機數的編號,在需要重複試驗的時候,保證得到一組一樣的隨機數。比如你每次都填1,其他參數一樣的情況下你得到的隨機數組是一樣的。但填0或不填,每次都會不一樣。隨機數的產生取決於種子,隨機數和種子之間的關係遵從以下兩個規則:種子不同,產生不同的隨機數;種子相同,即使實例不同也產生相同的隨機數。
示例
fromsklearn.cross_validation import train_test_split
train= loan_data.iloc[0: 55596, :]
test= loan_data.iloc[55596:, :]
# 避免過擬合,採用交叉驗證,驗證集佔訓練集20%,固定隨機種子(random_state)
train_X,test_X, train_y, test_y = train_test_split(train,
target,
test_size = 0.2,
random_state = 0)
train_y= train_y['label']
test_y= test_y['label']
plot_confusion_matrix.py(混淆矩陣實現實例)
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(iris.target_names))
plt.xticks(tick_marks, iris.target_names, rotation=45)
plt.yticks(tick_marks, iris.target_names)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
print(cm)
plt.figure()
plot_confusion_matrix(cm)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()