Previous post: [Kaggle] Intermediate Machine Learning (missing values + categorical feature handling)
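4. Pipelines
A pipeline bundles data preprocessing and modeling into a single workflow.
Benefits:
- Cleaner code: keeping track of the data at each preprocessing step can get messy. With a pipeline, you don't need to manually track the training and validation data at every step.
- Fewer bugs: there are fewer chances to misapply a step or to forget a preprocessing step.
- Easier to deploy to production
- Also helps with model validation (for example cross-validation, covered in section 5)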
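Step 1: Define the preprocessing steps
- Impute missing values in the numerical data
- Impute missing values in the categorical features, then one-hot encode them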
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# numerical_cols and categorical_cols are assumed to be lists of column
# names prepared beforehand (they are not defined in this snippet).

# Preprocessing for numerical data: fill missing values with a constant
# (for numeric data the default fill value is 0)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute the most frequent value,
# then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
# into one complete preprocessing step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
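The snippet above uses numerical_cols and categorical_cols without defining them. As a rough sketch, one way to build them is to select columns by dtype. The toy DataFrame and its column names below are made up for illustration and stand in for the real housing data; the actual course notebook may select columns differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical), standing in for the housing features.
X_toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "YearBuilt": [2003, 1976, np.nan, 1915],
    "Street": ["Pave", "Grvl", np.nan, "Pave"],
})

# Assumption: numeric dtypes are numerical features, everything else is categorical.
numerical_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), numerical_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X_prepared = preprocessor.fit_transform(X_toy)
# The result may be sparse or dense depending on its density; normalize to dense.
X_dense = X_prepared.toarray() if hasattr(X_prepared, "toarray") else np.asarray(X_prepared)

print(numerical_cols, categorical_cols)
print(X_dense.shape)
```

After fit_transform, the missing values are gone and the single text column has been expanded into one column per category.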
Step 2: Define the model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
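Step 3: Create and evaluate the pipeline
We use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps together.
The pipeline automatically preprocesses the data before generating predictions (without a pipeline, we would have to preprocess the data ourselves before every prediction).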
import pandas as pd

# Bundle preprocessing and modeling code in a pipeline:
# the preprocessing step and the model are stacked into one new pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Predict on the test set with the same pipeline for submission;
# the code stays short and is hard to get wrong
preds_test = my_pipeline.predict(X_test)

# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
You advanced 5,020 places on the leaderboard!
Your submission scored 16459.13640, which is an improvement of your previous score of 16619.07644. Great job!
The error dropped a little, haha. Keep going! 🚀
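The code above computes validation predictions but never scores them. Here is a minimal sketch of the "evaluate" half of Step 3, using mean_absolute_error on a held-out validation split. The data is synthetic (generated with NumPy), so the printed MAE says nothing about the housing competition; the names X_demo, y_demo and demo_pipeline are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic regression data (assumption: stands in for the real features/target).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + rng.normal(scale=0.1, size=200)
X_demo.loc[::10, "c"] = np.nan  # inject some missing values

X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# fit() imputes the training data and trains the model;
# predict() applies the already-fitted imputer to the validation data first.
demo_pipeline.fit(X_train, y_train)
preds = demo_pipeline.predict(X_valid)

mae = mean_absolute_error(y_valid, preds)
print("MAE:", mae)
```

Note that the raw validation data, missing values included, goes straight into predict(); the pipeline handles the preprocessing.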
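5. Cross-Validation
Cross-validation gives a more reliable measure of model quality: the data is split into several folds, and each fold in turn serves as the validation set while the remaining folds are used for training. Naturally, cross-validation takes more time, since the model is trained once per fold.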
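To make the fold mechanics concrete, here is a small sketch using sklearn's KFold on ten dummy rows. It only prints which row indices land in the training and validation parts of each fold; X_demo is a made-up array, not the competition data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows with two features each (hypothetical data).
X_demo = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5)  # 5 folds, no shuffling
seen = []
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X_demo)):
    # Each fold holds out a different slice of rows for validation.
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
    seen.extend(valid_idx.tolist())
```

Across the five folds, every row is used for validation exactly once and for training four times.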
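When to use it:
- For smaller datasets, where the extra computation is not much of a burden, you should run cross-validation.
- For larger datasets, a single validation set is enough: there is plenty of data, and the time cost of cross-validation grows.
- There is no simple rule. If the model takes a couple of minutes or less to run, go ahead and use cross-validation.
- You can also run cross-validation and check whether the scores of the individual experiments are close. If every experiment gives roughly the same result, a single validation set is probably sufficient.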
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

# X (features) and y (target) are assumed to be loaded beforehand.
my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)
print("Average MAE score (across experiments):")
print(scores.mean())
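One of the rules of thumb above is to check whether the per-fold scores are close to each other. A rough way to do that is to compare the standard deviation of the fold scores with their mean. The sketch below uses synthetic data, and what counts as "close" is a judgment call, not something the course defines.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption: stands in for X and y).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.2, size=150)

model = RandomForestRegressor(n_estimators=50, random_state=0)
scores = -1 * cross_val_score(model, X_demo, y_demo,
                              cv=5,
                              scoring="neg_mean_absolute_error")

# Relative spread: standard deviation of the fold MAEs divided by their mean.
spread = scores.std() / scores.mean()
print("Fold MAEs:", scores)
print("Relative spread:", spread)
```

A small relative spread suggests the folds agree and a single validation split would tell a similar story.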
# Average cross-validation score for different numbers of trees
def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.

    Keyword argument:
    n_estimators -- the number of trees in the forest
    """
    my_pipeline = Pipeline(steps=[
        ('preprocessing', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y, cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

results = {}
for i in range(1, 9):  # evaluate the model with 50, 100, ..., 400 trees
    results[50*i] = get_score(50*i)
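# Visualize model performance for the different parameter values
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(list(results.keys()), list(results.values()))
plt.xlabel('n_estimators')
plt.ylabel('Average MAE')
plt.show()

# Best parameter value: the n_estimators with the lowest average MAE
n_estimators_best = min(results, key=results.get)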
You can also use sklearn.model_selection.GridSearchCV to run a grid search for the best parameters.
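As a sketch of what that could look like, the manual loop over n_estimators can be handed to GridSearchCV. Parameters of a pipeline step are addressed as step name, two underscores, parameter name. The data here is synthetic and the grid is shortened to three values to keep it quick; neither is taken from the course.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with a few missing values (assumption: stands in for X and y).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.2, size=120)
X_demo[::15, 2] = np.nan

pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(random_state=0)),
])

# "model__n_estimators" targets the n_estimators parameter of the "model" step.
param_grid = {"model__n_estimators": [50, 100, 150]}

search = GridSearchCV(pipeline, param_grid,
                      cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best average MAE:", -search.best_score_)
```

Because the imputer sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation fold leaks into preprocessing.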