[Kaggle] Intermediate Machine Learning (Pipelines + Cross-Validation)

Original

Michael阿明

2020-05-13 03:47

Previous post: [Kaggle] Intermediate Machine Learning (Missing Values + Categorical Features)

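4. Pipelines

A pipeline bundles data preprocessing and modeling into a single object.

Benefits:

Cleaner code: keeping track of the data at every preprocessing step can get messy. With a pipeline, you do not need to manually track the training and validation data at each step.
Fewer bugs: there are fewer opportunities to misapply a step or to forget a preprocessing step.
Easier to deploy to production.
More options for model validation, such as cross-validation.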

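Step 1: Define the preprocessing steps

Impute missing values in the numerical data
One-hot encode the categorical features (after imputing their missing values)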

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

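# Preprocessing for numerical data: impute missing values
# (strategy='constant' fills with 0 for numeric columns when fill_value is not set)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute missing values, then one-hot encode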
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
# numerical_cols and categorical_cols are lists of column names, assumed to be defined earlier
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
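
The snippet above relies on `numerical_cols` and `categorical_cols`, which are not defined in this post. As a minimal sketch of what the preprocessor does, the example below runs it on a small made-up DataFrame. The column names `LotArea` and `Street` and all values are assumptions chosen for illustration; they are not taken from the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numerical and one categorical column, each with a gap
X_toy = pd.DataFrame({
    'LotArea': [8450.0, np.nan, 11250.0, 9550.0],
    'Street': pd.Series(['Pave', 'Grvl', np.nan, 'Pave'], dtype=object),
})
numerical_cols = ['LotArea']
categorical_cols = ['Street']

# Same transformers as in the post
numerical_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols),
])

X_processed = preprocessor.fit_transform(X_toy)
# ColumnTransformer can return a sparse matrix; convert to a dense array if so
if hasattr(X_processed, 'toarray'):
    X_processed = X_processed.toarray()

# One imputed numerical column followed by one column per street category
print(X_processed.shape)
print(X_processed)
```

The missing `LotArea` value is filled with 0 and the missing `Street` value with the most frequent category before encoding, so the output contains no NaN.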

Step 2: Define the model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

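Step 3: Create and evaluate the pipeline

We use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps.

The pipeline automatically preprocesses the data before generating predictions. Without a pipeline, we would have to preprocess the data ourselves before every prediction.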

# Bundle preprocessing and modeling code in a pipeline:
# the preprocessing step and the model step are chained into one new pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Use the fitted pipeline to predict on the test set for submission;
# the code stays short and is hard to get wrong
import pandas as pd

preds_test = my_pipeline.predict(X_test)
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

You advanced 5,020 places on the leaderboard!
Your submission scored 16459.13640, which is an improvement of your previous score of 16619.07644. Great job!
The error improved slightly (a lower score is better). Keep going! 🚀
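
To make the "automatic preprocessing" point concrete, here is a rough sketch that fits the same imputer and model twice, once through a pipeline and once by hand, and checks that the predictions agree. The data is synthetic and the names `X_demo`, `y_demo`, `X_tr`, and `X_va` are placeholders introduced for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; not the Kaggle housing data)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=200)
X_demo[rng.rand(200, 3) < 0.1] = np.nan  # knock out roughly 10% of the values

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

# With a pipeline: imputation and model fitting happen in one call
pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipeline.fit(X_tr, y_tr)
preds_pipeline = pipeline.predict(X_va)

# Without a pipeline: every step has to be repeated by hand,
# and the imputer must be fitted on the training data only
imputer = SimpleImputer()
X_tr_imputed = imputer.fit_transform(X_tr)
X_va_imputed = imputer.transform(X_va)
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_tr_imputed, y_tr)
preds_manual = model.predict(X_va_imputed)

same_predictions = np.allclose(preds_pipeline, preds_manual)
mae = mean_absolute_error(y_va, preds_pipeline)
print("Pipeline matches manual steps:", same_predictions)
print("Validation MAE:", mae)
```

The manual version is where mistakes tend to creep in, for example calling `fit_transform` on the validation data instead of `transform`.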

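5. Cross-Validation

Cross-validation gives a more reliable measure of model quality. The data is split into several parts (folds); each fold is used in turn as the validation set while the remaining folds are used for training. The trade-off is that cross-validation takes longer to run, because the model is trained once per fold.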
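
The fold rotation described above can be sketched with `KFold` on ten placeholder samples. This is only an illustration of how the indices are split; `X_demo` is a made-up array, and the splitter is used without shuffling so the folds come out contiguous.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples
kf = KFold(n_splits=5)  # no shuffling, so each fold is a contiguous block

validation_folds = []
for fold, (train_idx, valid_idx) in enumerate(kf.split(X_demo)):
    # Each sample lands in the validation set exactly once across the 5 folds
    validation_folds.append(valid_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```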

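When to use it:

For small datasets, where the extra computation is not a burden, run cross-validation.
For larger datasets, a single validation set is usually sufficient: there is enough data, and the time cost of cross-validation grows with it.
There is no simple threshold, but if the model takes a couple of minutes or less to run, cross-validation is probably worth it.
You can also run cross-validation and check whether the scores of the individual experiments are close. If every experiment yields roughly the same result, a single validation set is probably enough.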

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
print("Average MAE score (across experiments):")
print(scores.mean())
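
One way to act on the "are the fold scores close?" check is to compare the spread of the scores with their mean. The sketch below does this on synthetic data from `make_regression`; the dataset, the name `relative_spread`, and the idea of summarizing closeness as standard deviation over mean are assumptions made for this example, not part of the course.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; replace with your own X and y)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)

pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Multiply by -1 since sklearn reports *negative* MAE
fold_scores = -1 * cross_val_score(pipeline, X_demo, y_demo,
                                   cv=5,
                                   scoring='neg_mean_absolute_error')

# Spread of the per-fold MAE relative to its mean
relative_spread = fold_scores.std() / fold_scores.mean()
print("MAE per fold:", fold_scores)
print("Relative spread:", relative_spread)
```

A small relative spread suggests the folds agree and a single validation split may be adequate; what counts as "small" depends on the problem.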

# Average cross-validation score for a given number of trees
def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.
    Keyword argument:
    n_estimators -- the number of trees in the forest
    """
    my_pipeline = Pipeline(steps=[
        ('preprocessing', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y, cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

results = {}
for i in range(1, 9):  # evaluate the model with 50, 100, ..., 400 trees
    results[50*i] = get_score(50*i)

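# Visualize model performance for each parameter value
import matplotlib.pyplot as plt
# The next line is a Jupyter/IPython magic; remove it when running as a plain Python script
%matplotlib inline

plt.plot(list(results.keys()), list(results.values()))
plt.show()
n_estimators_best = min(results, key=results.get)  # value with the lowest average MAE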

You can also use sklearn.model_selection.GridSearchCV to grid-search for the best parameters.
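
As a rough sketch of that idea, the manual loop over `n_estimators` could be replaced with `GridSearchCV` along these lines. The data comes from `make_regression` and the candidate values are deliberately small so the example runs quickly; both are assumptions for illustration. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; replace with your own X and y)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)

pipeline = Pipeline(steps=[
    ('preprocessing', SimpleImputer()),
    ('model', RandomForestRegressor(random_state=0)),
])

# 'model__n_estimators' targets the n_estimators parameter of the 'model' step
param_grid = {'model__n_estimators': [10, 20, 40]}

search = GridSearchCV(pipeline, param_grid,
                      cv=3,
                      scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)

best_n = search.best_params_['model__n_estimators']
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print("Best n_estimators:", best_n)
print("Best average MAE:", best_mae)
```

After fitting, `search` refits the pipeline on all of the data with the best parameters by default, so `search.predict` can be used directly.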
