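Packages such as KFold and cross_validation provide good ways to split a dataset by proportion, but what should we do when we want a subset of a specified size?
I ran into this problem in a course lab and wrote the following function: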
import numpy as np

# Split the data into a training set of a given size and a test set
def trainTestSplit(X, Y, train_num_of_X):
    '''
    Randomly split the data into a training set with the desired number of
    rows and a test set holding the remaining rows.
    Variables:
    X: DataFrame of features, without the label
    Y: data labels, in the same row order as X
    train_num_of_X: number of rows in the training set
    '''
    X_num = X.shape[0]
    test_index = list(range(X_num))  # row positions not yet chosen
    train_index = []
    train_num = train_num_of_X
    for i in range(train_num):
        # Pick one of the remaining positions at random
        randomIndex = np.random.randint(len(test_index))
        train_index.append(test_index[randomIndex])
        del test_index[randomIndex]
    # Use the same positions for X and Y so labels stay aligned with rows
    train = X.iloc[train_index]
    label_train = Y.iloc[train_index]
    test = X.iloc[test_index]
    label_test = Y.iloc[test_index]
    return train, test, label_train, label_test
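The loop above draws one position at a time. As a rough alternative, the same kind of split can be sketched with a single shuffle of the row positions. The helper name `split_by_count`, its `seed` parameter, and the toy data are assumptions made for illustration and are not part of the original function.

```python
import numpy as np
import pandas as pd

def split_by_count(X, Y, train_num, seed=None):
    """Hypothetical helper: split X and Y into train_num training rows and the rest."""
    if not 0 <= train_num <= len(X):
        raise ValueError("train_num must be between 0 and the number of rows")
    rng = np.random.default_rng(seed)   # seed is optional, for repeatable splits
    order = rng.permutation(len(X))     # shuffled row positions 0..len(X)-1
    train_index, test_index = order[:train_num], order[train_num:]
    # The same positions index X and Y, so labels stay aligned with their rows
    return (X.iloc[train_index], X.iloc[test_index],
            Y.iloc[train_index], Y.iloc[test_index])

# Toy data, made up for illustration
X = pd.DataFrame({"a": range(10), "b": range(10, 20)})
Y = pd.Series(range(10), name="label")
train, test, label_train, label_test = split_by_count(X, Y, 7, seed=0)
print(train.shape, test.shape)  # (7, 2) (3, 2)
```

Passing a fixed `seed` makes the split repeatable, which can help when comparing models on the same partition.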
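The function above assumes that X stores the features and Y stores the labels. If your DataFrame also contains the label column, try the following function instead: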
# Split the data into a training set of a given size and a test set
def trainTestSplit(X, train_num_of_X):
    '''
    Randomly split the data into a training set with the desired number of
    rows and a test set holding the remaining rows.
    Variables:
    X: DataFrame that contains both the features and the label column
    train_num_of_X: number of rows in the training set
    '''
    X_num = X.shape[0]
    test_index = list(range(X_num))  # row positions not yet chosen
    train_index = []
    train_num = train_num_of_X
    for i in range(train_num):
        # Pick one of the remaining positions at random
        randomIndex = np.random.randint(len(test_index))
        train_index.append(test_index[randomIndex])
        del test_index[randomIndex]
    # Labels travel with their rows because they live in the same DataFrame
    train = X.iloc[train_index]
    test = X.iloc[test_index]
    return train, test
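When the label lives in the same DataFrame, pandas can also draw a fixed number of rows directly. The following is a minimal sketch, assuming the DataFrame has a unique index; the helper name `split_frame_by_count`, the `seed` parameter, and the toy data are hypothetical and not from the original post.

```python
import pandas as pd

def split_frame_by_count(df, train_num, seed=None):
    """Hypothetical helper; assumes df has a unique index."""
    # Draw exactly train_num rows without replacement
    train = df.sample(n=train_num, random_state=seed)
    # Everything that was not drawn becomes the test set
    test = df.drop(index=train.index)
    return train, test

# Toy data, made up for illustration
df = pd.DataFrame({"feature": range(8), "label": [0, 1] * 4})
train, test = split_frame_by_count(df, 5, seed=42)
print(len(train), len(test))  # 5 3
```

If the index contains duplicates, `drop` would remove every row sharing a sampled label, so resetting the index first would be needed in that case.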