Kaggle Tutorial: Intermediate Machine Learning 4, Pipelines

Original post

李乾文

2019-10-25 18:30

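Original article: https://www.kaggle.com/alexisbcook/pipelines

Please credit the source when reposting: https://leytton.blog.csdn.net/article/details/101351814
If this post helped you, please leave a like to let me know 😃

In this tutorial, you will learn how to use pipelines to clean up your modeling code.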

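1. Introduction

A pipeline is a simple way to bundle your data preprocessing and modeling steps so that they run as a single step.

Many data scientists build models without pipelines, but pipelines have several important benefits, including:

Cleaner code: keeping track of the data at every preprocessing step gets messy. With a pipeline, you don't need to manually track your training and validation data at each step.
Fewer bugs: there are fewer opportunities to misapply a step or to forget a preprocessing step.
Easier to productionize: moving a model from a prototype to something deployable at scale can be hard. We won't go into the details here, but pipelines help.
More options for model validation: you will see an example in the next lesson, which covers cross-validation.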
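
As a small preview of that last point, the sketch below passes a whole pipeline to cross_val_score, so the imputer is refitted on each training fold. The data is synthetic and the column names "a" and "b" are hypothetical; this is an illustration under those assumptions, not the code from the next lesson.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical numeric-only data with a few missing values
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=60), "b": rng.normal(size=60)})
y = 3 * X["a"] + rng.normal(scale=0.1, size=60)
X.loc[::7, "b"] = np.nan

# Preprocessing and model bundled into one estimator
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negated MAE for this scorer, so flip the sign back
scores = -cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
```

Because the pipeline is a single estimator, no preprocessing code has to be repeated per fold.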

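2. Example

As in the previous tutorials, we will use the Melbourne Housing dataset.

We won't focus on the data-loading step. Assume you already have the training and validation data in X_train, X_valid, y_train, and y_valid.

We first take a quick look at the training data with the head() method. Note that the data contains both categorical data and missing values. Pipelines make it convenient to handle both.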

X_train.head()

Output:

	Type 	Method 	Regionname 	Rooms 	Distance 	Postcode 	Bedroom2 	Bathroom 	Car 	Landsize 	BuildingArea 	YearBuilt 	Lattitude 	Longtitude 	Propertycount
12167 	u 	S 	Southern Metropolitan 	1 	5.0 	3182.0 	1.0 	1.0 	1.0 	0.0 	NaN 	1940.0 	-37.85984 	144.9867 	13240.0
6524 	h 	SA 	Western Metropolitan 	2 	8.0 	3016.0 	2.0 	2.0 	1.0 	193.0 	NaN 	NaN 	-37.85800 	144.9005 	6380.0
8413 	h 	S 	Western Metropolitan 	3 	12.6 	3020.0 	3.0 	1.0 	1.0 	555.0 	NaN 	NaN 	-37.79880 	144.8220 	3755.0
2919 	u 	SP 	Northern Metropolitan 	3 	13.0 	3046.0 	3.0 	1.0 	1.0 	265.0 	NaN 	1995.0 	-37.70830 	144.9158 	8870.0
6043 	h 	S 	Western Metropolitan 	3 	13.3 	3020.0 	3.0 	1.0 	2.0 	673.0 	673.0 	1970.0 	-37.76230 	144.8272 	4217.0

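We build the full pipeline in three steps:

Step 1: Define the preprocessing steps

Just as a pipeline bundles preprocessing and modeling steps together, we use the ColumnTransformer class to bundle different preprocessing steps together. (The column lists numerical_cols and categorical_cols, like the data itself, are assumed to come from the data-loading step.) The code below does two things:

It imputes missing values in the numerical data (with strategy='constant', SimpleImputer fills them with a constant, which defaults to 0 for numerical columns).
It imputes missing values in the categorical data with the most frequent value, then applies one-hot encoding.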

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
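
The preprocessor is easier to follow when run on a tiny frame. The sketch below uses a hypothetical three-column stand-in for X_train (made-up values, not the real Melbourne data) and one assumed way of building the two column lists, by dtype. The selection rule used in the original data-loading step may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for X_train (hypothetical values)
X_toy = pd.DataFrame({
    "Type": ["u", "h", np.nan, "h"],
    "Rooms": [1, 2, 3, 3],
    "BuildingArea": [np.nan, 120.0, np.nan, 673.0],
})

# Assumed rule: numeric dtypes are numerical, everything else is categorical
numerical_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

# Same transformers as in the tutorial code
numerical_transformer = SimpleImputer(strategy="constant")  # NaN -> 0
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

X_prepared = preprocessor.fit_transform(X_toy)
# The result can be a sparse matrix, so densify it for display
X_dense = X_prepared.toarray() if hasattr(X_prepared, "toarray") else np.asarray(X_prepared)
print(numerical_cols, categorical_cols)
print(X_dense)
```

The output columns are the numerical columns first, then one indicator column per category of Type. The missing BuildingArea values become 0, and the missing Type is filled with the most frequent value, "h", before encoding.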

Step 2: Define the model

Next, we define a random forest model with the familiar RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

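Step 3: Create and evaluate the pipeline

Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

With the pipeline, we preprocess the training data and fit the model in a single line of code. (Without a pipeline, we would have to do imputation, one-hot encoding, and model training in separate steps. This gets especially messy when we have to deal with both numerical and categorical variables!)
We pass the unprocessed features in X_valid to predict(), and the pipeline automatically preprocesses them before generating predictions. (Without a pipeline, we would have to remember to preprocess the validation data before making predictions.)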

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocess the validation data and get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

Output:

MAE: 160679.18917034855
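
For contrast, here is a rough sketch of the same workflow done by hand next to the pipeline version. The data is a hypothetical toy split, not the Melbourne data, and the names X_tr, X_va, num_cols, and cat_cols are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the training/validation split
X_tr = pd.DataFrame({"Type": ["u", "h", np.nan, "h", "u", "h"],
                     "Rooms": [1, 2, 3, 3, 2, 4],
                     "Landsize": [0.0, 193.0, np.nan, 265.0, 120.0, 673.0]})
y_tr = pd.Series([300.0, 500.0, 650.0, 700.0, 420.0, 900.0])
X_va = pd.DataFrame({"Type": ["h", "t"], "Rooms": [3, 2],
                     "Landsize": [np.nan, 150.0]})
num_cols, cat_cols = ["Rooms", "Landsize"], ["Type"]

# Manual route: fit every transformer on the training data by hand
num_imp = SimpleImputer(strategy="constant")
cat_imp = SimpleImputer(strategy="most_frequent")
onehot = OneHotEncoder(handle_unknown="ignore")
train_num = num_imp.fit_transform(X_tr[num_cols])
train_cat = onehot.fit_transform(cat_imp.fit_transform(X_tr[cat_cols])).toarray()
manual_model = RandomForestRegressor(n_estimators=100, random_state=0)
manual_model.fit(np.hstack([train_num, train_cat]), y_tr)

# The validation data needs the same steps, using transform() only
valid_num = num_imp.transform(X_va[num_cols])
valid_cat = onehot.transform(cat_imp.transform(X_va[cat_cols])).toarray()
manual_preds = manual_model.predict(np.hstack([valid_num, valid_cat]))

# Pipeline route: the same steps, declared once
pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="constant"), num_cols),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ])),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
pipe.fit(X_tr, y_tr)
pipe_preds = pipe.predict(X_va)

# Check whether both routes produce the same predictions
print(np.allclose(manual_preds, pipe_preds))
```

In the manual version, forgetting one transform() call on the validation data, or calling fit_transform() on it by mistake, is an easy slip. The pipeline removes that bookkeeping.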

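3. Conclusion

Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.

4. Your turn

In the next exercise, use pipelines to try out advanced data preprocessing techniques and improve your predictions!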

Original article:
https://www.kaggle.com/alexisbcook/pipelines
