Machine Learning Project in Practice - Energy Efficiency 2 - Modeling

* Importing the preprocessed data

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 50)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.size'] = 24
sns.set(font_scale = 2)

train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')
print(train_features.shape, '\t', test_features.shape)

train_features.head(8)

(6622, 64) (2839, 64)

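*.1 Missing value imputation

In sklearn, missing values can be filled with the scikit-learn SimpleImputer object. The imputer is fitted on the training set only, and the test set is filled with the statistics (here the median) learned from the training set. The purpose is to avoid data leakage from the test set into the model.

from sklearn.impute import SimpleImputer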

imputer = SimpleImputer(strategy = 'median')
imputer.fit(train_features)

X = imputer.transform(train_features)
X_test = imputer.transform(test_features)

print('Missing values in training features: ', np.sum(np.isnan(X)))
print('Missing values in testing features: ', np.sum(np.isnan(X_test)))

print(np.where(~np.isfinite(X)))
print(np.where(~np.isfinite(X_test)))

Missing values in training features: 0
Missing values in testing features: 0
(array([], dtype=int64), array([], dtype=int64))
(array([], dtype=int64), array([], dtype=int64))
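
To make the leakage point concrete, here is a minimal sketch on made-up toy data (the arrays and the `toy_` names are illustrative assumptions, not part of the project data). It shows that the value used to fill the test set comes from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): one feature, one missing value per split
toy_train = np.array([[1.0], [3.0], [np.nan], [100.0]])
toy_test = np.array([[np.nan], [2.0]])

toy_imputer = SimpleImputer(strategy='median')
toy_imputer.fit(toy_train)  # the median is learned from the training rows only

toy_train_filled = toy_imputer.transform(toy_train)
toy_test_filled = toy_imputer.transform(toy_test)

print(toy_imputer.statistics_)   # median of the observed training values
print(toy_test_filled.ravel())   # the test NaN is replaced by the training median
```

Fitting the imputer on the test set as well would let test-set statistics influence preprocessing, which is the leakage the text warns about.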

*.2 Feature scaling (normalization)

from sklearn.preprocessing import MinMaxScaler  # StandardScaler
minmax_scaler = MinMaxScaler()
minmax_scaler.fit(X)
X = minmax_scaler.transform(X)
X_test = minmax_scaler.transform(X_test)

y = np.array(train_labels).reshape((-1, ))
y_test = np.array(test_labels).reshape((-1, ))
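
As a rough illustration of what min-max scaling computes, the following NumPy sketch applies the formula (x - min) / (max - min) with the minimum and maximum taken from the training data only. The toy arrays and the helper `min_max_scale` are assumptions made up for this example. With its default settings, MinMaxScaler behaves the same way, so test values outside the training range can land outside [0, 1].

```python
import numpy as np

# Toy data (made up for illustration): two features
toy_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
toy_test = np.array([[2.5, 40.0]])

# Per-column statistics come from the training set only
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)

def min_max_scale(data):
    # Hypothetical helper: rescale each column with the training min and max
    return (data - col_min) / (col_max - col_min)

scaled_train = min_max_scale(toy_train)
scaled_test = min_max_scale(toy_test)

print(scaled_train)
print(scaled_test)  # the second feature exceeds 1 because 40 > training max of 30
```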

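4. Building baseline models and trying several algorithms

4.1 Establishing a baseline

Before modeling, we need a worst-case reference point: a naive guess that any useful model must at least beat. Here the guess is the median of the training labels, scored by mean absolute error (MAE).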

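def mae(y_true, y_pred):
    # Mean absolute error between true values and predictions
    return np.mean(abs(y_true - y_pred))

# Naive baseline: always predict the median of the training labels
baseline_guess = np.median(y)

print('The baseline guess is a score of %.2f' % baseline_guess)
print('Baseline Performance on the test set: MAE = %.4f' % mae(y_test, baseline_guess))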

The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164
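
One reason the median is a natural baseline for MAE is that, among all constant predictions, the median minimizes the mean absolute error on the data it was computed from. The sketch below checks this on made-up toy labels (the values and the name `mean_abs_error` are illustrative assumptions).

```python
import numpy as np

def mean_abs_error(y_true, y_pred):
    # Same metric as in the text: average absolute difference
    return np.mean(np.abs(y_true - y_pred))

# Toy labels (made up for illustration), with one large value
toy_y = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

median_guess = np.median(toy_y)  # 30.0
mean_guess = np.mean(toy_y)      # 40.0

mae_median = mean_abs_error(toy_y, median_guess)
mae_mean = mean_abs_error(toy_y, mean_guess)

print('MAE with median guess:', mae_median)
print('MAE with mean guess:  ', mae_mean)
```

On these toy labels the median guess gives the lower MAE, which is why a constant-median prediction is a sensible floor for an MAE-scored problem.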

4.2 Machine learning algorithms chosen (regression problem)

Linear Regression
Logistic Regression
Support Vector Machine Regression
Random Forest Regression
Decision Tree Regressor
Gradient Boosting Regression
SGDRegressor
K-Nearest Neighbors Regression

# Train a model on the training set and return its MAE on the test set
# (mae() is the helper defined in section 4.1)
def fit_and_evaluate(model):
    model.fit(X, y)
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)
    return model_mae

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr_mae = fit_and_evaluate(lr)
print('Linear Regression Performance on the test set: MAE = %.4f' % lr_mae)

from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic_mae = fit_and_evaluate(logistic)
print('Logistic Regression Performance on the test set: MAE = %.4f' % logistic_mae)

from sklearn.svm import SVR
svm = SVR(C = 1000, gamma = 0.1)
svm_mae = fit_and_evaluate(svm)
print('Support Vector Machine Regression Performance on the test set: MAE = %.4f' % svm_mae)

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state = 42)
rfr_mae = fit_and_evaluate(rfr)
print('Random Forest Regressor Performance on the test set: MAE = %.4f' % rfr_mae)

from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr_mae = fit_and_evaluate(dtr)
print('Decision Tree Regressor Performance on the test set: MAE = %.4f' % dtr_mae)

from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr_mae = fit_and_evaluate(gbr)
print('Gradient Boosting Regressor Performance on the test set: MAE = %.4f' % gbr_mae)

from sklearn.linear_model import SGDRegressor
sgdr = SGDRegressor(random_state = 42)
sgdr_mae = fit_and_evaluate(sgdr)
print('SGDRegressor Performance on the test set: MAE = %.4f' % sgdr_mae)

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors = 10)
knn_mae = fit_and_evaluate(knn)
print('K-Nearest Neighbors Regressor Performance on the test set: MAE = %.4f' % knn_mae)

Linear Regression Performance on the test set: MAE = 13.4651
Logistic Regression Performance on the test set: MAE = 19.6823
Support Vector Machine Regression Performance on the test set: MAE = 10.9337
Random Forest Regressor Performance on the test set: MAE = 9.9025
Decision Tree Regressor Performance on the test set: MAE = 12.8373
Gradient Boosting Regressor Performance on the test set: MAE = 10.0130
SGDRegressor Performance on the test set: MAE = 19.0039
K-Nearest Neighbors Regressor Performance on the test set: MAE = 13.0131

plt.style.use('fivethirtyeight')
plt.figure(figsize = (12, 8))

model_comparison = pd.DataFrame({'model': ['Linear Regression', 'Logistic Regression',
                                           'Support Vector Machine', 'Random Forest',
                                           'Decision Tree', 'Gradient Boosting',
                                           'SGDRegressor', 'K-Nearest Neighbors'],
                                'mae': [lr_mae, logistic_mae, svm_mae, rfr_mae,
                                       dtr_mae, gbr_mae, sgdr_mae, knn_mae]})
model_comparison.sort_values('mae',ascending=False).plot(x='model',y='mae',kind='barh',
                                                        color='red', edgecolor='black')
plt.ylabel(''); plt.xlabel('Mean Absolute Error')
plt.yticks(size = 14); plt.xticks(size = 14)
plt.title('Model Comparison on Test MAE', size = 20);

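Random forest and the ensemble method (gradient boosting) appear to have the edge. The comparison is somewhat unfair, because the models were run with essentially default parameters, and for SVM the parameters may matter considerably more. Note also that LogisticRegression is a classifier: it treats each distinct score as a separate class, so its MAE should be read with caution next to the true regressors.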

5. Hyperparameter tuning

5.1 RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV

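# Candidate values for each hyperparameter.
# Note: 'ls' and 'lad' are the loss names used by older scikit-learn releases;
# newer releases (1.0+) call them 'squared_error' and 'absolute_error'.
loss = ['ls', 'lad', 'huber']
n_estimators = [100, 500, 1000, 1500]
max_depth = [2, 3, 5, 10]
min_samples_leaf = [1, 2, 4, 6, 8]
min_samples_split = [2, 4, 6, 10]
max_features = ['auto', 'sqrt', 'log2', None]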

hyperparameter_grid = {'loss':loss, 'min_samples_split':min_samples_split,
                       'max_depth':max_depth,'min_samples_leaf':min_samples_leaf,
                       'n_estimators':n_estimators,'max_features':max_features}
model = GradientBoostingRegressor(random_state = 42)
random_cv = RandomizedSearchCV(estimator = model, cv = 3, n_iter = 30, verbose = 1,
                              param_distributions = hyperparameter_grid,
                              scoring = 'neg_mean_absolute_error', n_jobs = -1,
                              return_train_score = True, random_state = 42)
random_cv.fit(X, y)

random_cv.best_estimator_

random_cv.best_estimator_.fit(X, y)
random_cv_pred = random_cv.best_estimator_.predict(X_test)
mae(y_test, random_cv_pred)

9.122027188485426
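
To see why a randomized search is used first, it helps to count how many model fits an exhaustive search over the same grid would need. The sketch below uses only the list sizes from the grid above and the `cv` and `n_iter` values passed to RandomizedSearchCV. The variable names are illustrative, and the final refit of the best estimator is not counted.

```python
from math import prod

# Same candidate lists as in the hyperparameter grid above
grid = {
    'loss': ['ls', 'lad', 'huber'],
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [2, 3, 5, 10],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'min_samples_split': [2, 4, 6, 10],
    'max_features': ['auto', 'sqrt', 'log2', None],
}

cv_folds = 3  # cv = 3 in the search above
n_iter = 30   # n_iter = 30 in the search above

# Every combination of one value per hyperparameter
total_combinations = prod(len(values) for values in grid.values())

exhaustive_fits = total_combinations * cv_folds  # what a full grid search would cost
random_fits = n_iter * cv_folds                  # what the randomized search costs

print('Combinations in the full grid:', total_combinations)
print('Fits for an exhaustive search:', exhaustive_fits)
print('Fits for the randomized search:', random_fits)
```

The randomized search samples only a small fraction of the grid, which is why it serves as a coarse first pass before the narrower grid search.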

5.2 GridSearchCV

from sklearn.model_selection import GridSearchCV

trees_grid = {'min_samples_split': [6, 10], 'min_samples_leaf': [4, 6],
              'max_depth': [5, 6], 'loss': ['huber', 'lad']}
model = GradientBoostingRegressor(max_features=None, n_estimators=500, random_state=42)
grid_search = GridSearchCV(estimator = model, param_grid = trees_grid, cv = 3, verbose = 1,
                          scoring = 'neg_mean_absolute_error', n_jobs = -1, return_train_score = True)
grid_search.fit(X, y)

grid_search.best_estimator_

# Once more, this time searching over n_estimators alone
trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}
model = GradientBoostingRegressor(loss = 'lad', max_features = None, max_depth = 6,
                                 min_samples_leaf = 4, min_samples_split = 10, random_state = 42)
grid_search = GridSearchCV(estimator = model, param_grid = trees_grid, cv = 3, verbose = 1,
                          scoring = 'neg_mean_absolute_error', n_jobs = -1, return_train_score = True)
grid_search.fit(X, y)

results = pd.DataFrame(grid_search.cv_results_)

plt.figure(figsize = (8, 8))
plt.plot(results['param_n_estimators'], -1*results['mean_test_score'], label='Testing Error')
plt.plot(results['param_n_estimators'], -1*results['mean_train_score'], label='Training Error')
plt.xlabel('Number of Trees'); plt.ylabel('Mean Absolute Error')
plt.title('Performance vs Number of Trees')
plt.legend()

results.sort_values('mean_test_score', ascending = False).head()
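
The plotting code multiplies the scores by -1 because scikit-learn scorers follow a "higher is better" convention, so 'neg_mean_absolute_error' reports the MAE with its sign flipped. The sketch below shows this on synthetic data. The data, the LinearRegression model and the `_toy` names are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# One score per fold; each is the negated MAE on that fold
neg_mae_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=3,
                                 scoring='neg_mean_absolute_error')

# Flip the sign to get ordinary, non-negative MAE values
mae_scores = -neg_mae_scores

print(neg_mae_scores)
print(mae_scores)
```

This is also why sorting by 'mean_test_score' with ascending = False lists the lowest-MAE settings first.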

6. Evaluation and testing

6.1 Testing the model

default_model = GradientBoostingRegressor(random_state = 42)
final_model = grid_search.best_estimator_
final_model

%%timeit -n 1 -r 5
default_model.fit(X, y)

805 ms ± 20.4 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

%%timeit -n 1 -r 5
final_model.fit(X, y)

10.5 s ± 159 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)

print('Default model performance on the test set:MAE = %.4f.' % mae(y_test, default_pred))
print('Final model performance on the test set:  MAE = %.4f.' % mae(y_test, final_pred))

Default model performance on the test set:MAE = 10.0130.
Final model performance on the test set: MAE = 9.0963.

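Comparing the two results, training time differs substantially (about 0.8 s versus 10.5 s per fit), but the tuned model lowers the test MAE from 10.0130 to 9.0963, an improvement of roughly 9%. In general, a longer training time is acceptable as long as it is tolerable, and the gain in model performance is valuable.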
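
The trade-off can be checked with a few lines of arithmetic on the numbers printed above (the variable names are illustrative):

```python
# Figures copied from the outputs above
default_mae = 10.0130
final_mae = 9.0963
default_fit_seconds = 0.805  # 805 ms per fit
final_fit_seconds = 10.5

# Relative reduction in MAE and ratio of training times
relative_improvement = (default_mae - final_mae) / default_mae
slowdown = final_fit_seconds / default_fit_seconds

print('Relative MAE improvement: %.1f%%' % (100 * relative_improvement))
print('Training time ratio: %.1fx' % slowdown)
```

So the tuned model is roughly 9% better in MAE at the cost of about 13 times the training time.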

6.2 Plot of predictions versus true values

plt.figure(figsize = (8, 8))

sns.kdeplot(default_pred, label = 'Default Predictions')
sns.kdeplot(final_pred, label = 'Predictions')
sns.kdeplot(y_test, label = 'Values')
plt.xlabel('Energy Star Score'); plt.ylabel('Density')
plt.title('Test Values and Predictions')
plt.legend()

6.3 Distribution of residuals

residuals = final_pred - y_test
plt.hist(residuals, color = 'red', bins = 40, edgecolor = 'black')
plt.xlabel('Error'); plt.ylabel('Count')
plt.title("Distribution of Residuals")
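
Alongside the histogram, a couple of summary numbers can help when reading the residuals. The sketch below uses made-up toy predictions (not the project's results) to show how the residuals relate to the MAE used throughout: the MAE is the mean of the absolute residuals, while the plain mean of the residuals indicates whether the model tends to over- or under-predict.

```python
import numpy as np

# Toy predictions and true values (made up for illustration)
toy_pred = np.array([60.0, 75.0, 40.0, 90.0])
toy_true = np.array([65.0, 70.0, 40.0, 100.0])

# Same sign convention as in the text: prediction minus true value
toy_residuals = toy_pred - toy_true

mean_residual = np.mean(toy_residuals)     # negative means under-prediction on average
toy_mae = np.mean(np.abs(toy_residuals))   # MAE is the mean absolute residual

print('Residuals:', toy_residuals)
print('Mean residual:', mean_residual)
print('MAE:', toy_mae)
```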

To be continued:
[ Machine Learning Project in Practice - Energy Efficiency 3 - Analysis ]
