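Datawhale Data Mining from Scratch - Task4 Modeling and Hyperparameter Tuning


Tip: This is the Task4 (modeling and hyperparameter tuning) part of Data Mining from Scratch. It introduces a range of models together with strategies for evaluating and tuning them. Feedback and discussion are welcome.

Competition: Data Mining from Scratch - Used Car Transaction Price Prediction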


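5.1 Learning Objectives

  • Understand the commonly used machine learning models and master the workflow for building and tuning them
  • Complete the corresponding check-in task

5.2 Contents

  1. Linear regression model:
    • Feature requirements of linear regression;
    • Handling long-tailed distributions;
    • Understanding the linear regression model;
  2. Model performance validation:
    • Evaluation function vs. objective function;
    • Cross-validation;
    • Leave-one-out validation;
    • Validation for time series problems;
    • Plotting learning curves;
    • Plotting validation curves;
  3. Embedded feature selection:
    • Lasso regression;
    • Ridge regression;
    • Decision trees;
  4. Model comparison:
    • Common linear models;
    • Common nonlinear models;
  5. Model tuning:
    • Greedy tuning;
    • Grid search;
    • Bayesian optimization;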

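5.3 Background Theory and Recommended Reading


5.3.1 Linear Regression Model


5.3.2 Decision Tree Model


5.3.3 GBDT Model


5.3.4 XGBoost Model


5.3.5 LightGBM Model


5.3.6 Recommended textbooks:

5.4 Code Examples

5.4.1 Reading the Data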

import pandas as pd
import numpy as np
import warnings

The reduce_mem_usage function reduces the memory footprint of the data by downcasting each column to a smaller data type.

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum()  # total size in bytes
    print('Memory usage of dataframe is {:.2f} bytes'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the smallest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # same idea for floating-point columns
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns are stored as categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum()
    print('Memory usage after optimization is: {:.2f} bytes'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))
Memory usage of dataframe is 57322784.00 bytes
Memory usage after optimization is: 14755178.00 bytes
Decreased by 74.3%
sample_feature.head()
name model brand bodyType fuelType gearbox power kilometer notRepairedDamage price ... used_time city brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average power_bin
0 736 30 6 1.0 0.0 0.0 60 12.5 0.0 1850.0 ... 4384.0 1.0 10192.0 35990.0 1800.0 13.0 36457520.0 4564.0 3576.0 5.0
1 2262 40 1 2.0 0.0 0.0 0 15.0 MISSING 3600.0 ... 4756.0 4.0 13656.0 84000.0 6400.0 15.0 124044600.0 8992.0 9080.0 NaN
2 14874 115 15 1.0 0.0 0.0 163 12.5 0.0 6222.0 ... 4384.0 2.0 1458.0 45000.0 8496.0 100.0 14373814.0 5424.0 9848.0 16.0
3 71865 109 10 0.0 0.0 1.0 193 15.0 0.0 2400.0 ... 7124.0 NaN 13992.0 92900.0 5200.0 15.0 113034208.0 8248.0 8076.0 19.0
4 111080 110 5 1.0 0.0 0.0 68 5.0 0.0 5200.0 ... 1531.0 6.0 4664.0 31500.0 2300.0 20.0 15414322.0 3344.0 3306.0 6.0

5 rows × 36 columns
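
To make the effect of downcasting concrete, here is a minimal NumPy sketch on a made-up array (not the competition data). It shows the memory saved by storing small integers as int8, and one caveat of the float16 branch: float16 cannot represent every integer above 2048 exactly, so larger values may be rounded.

```python
import numpy as np

# Hypothetical toy column: 100 small integers, stored as int64.
values = np.arange(100, dtype=np.int64)

# Every value fits into the int8 range, so this downcast is lossless.
small = values.astype(np.int8)

bytes_before = values.nbytes   # 8 bytes per element
bytes_after = small.nbytes     # 1 byte per element
saving = 1 - bytes_after / bytes_before
print(f'{bytes_before} -> {bytes_after} bytes, saved {saving:.1%}')

# Caveat: float16 has an 11-bit significand, so 2049 is rounded to 2048.
rounded = float(np.float16(2049.0))
print(rounded)  # 2048.0
```

If exact values matter for a feature, it may be safer to stop at float32 for that column.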

continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model','name', 'bodyType', 'fuelType', 'notRepairedDamage']]

5.4.2 Linear Regression & Five-Fold Cross-Validation & Simulating a Real Business Scenario

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature = sample_feature.replace('MISSING', 0)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']
     name model  brand bodyType fuelType gearbox  power  kilometer  \
0     736    30      6      1.0      0.0     0.0     60       12.5   
1   14874   115     15      1.0      0.0     0.0    163       12.5   
2  111080   110      5      1.0      0.0     0.0     68        5.0   
3  137642    24     10      0.0      1.0     0.0    109       10.0   
4    2402    13      4      0.0      0.0     1.0    150       15.0   

  notRepairedDamage   price  ...  used_time  city  brand_amount  \
0               0.0  1850.0  ...     4384.0   1.0       10192.0   
1               0.0  6222.0  ...     4384.0   2.0        1458.0   
2               0.0  5200.0  ...     1531.0   6.0        4664.0   
3               0.0  8000.0  ...     2482.0   3.0       13992.0   
4               0.0  3500.0  ...     6184.0   3.0       16576.0   

   brand_price_max  brand_price_median  brand_price_min  brand_price_sum  \
0          35990.0              1800.0             13.0       36457520.0   
1          45000.0              8496.0            100.0       14373814.0   
2          31500.0              2300.0             20.0       15414322.0   
3          92900.0              5200.0             15.0      113034208.0   
4          99999.0              6000.0             12.0      138279072.0   

   brand_price_std  brand_price_average  power_bin  
0           4564.0               3576.0        5.0  
1           5424.0               9848.0       16.0  
2           3344.0               3306.0        6.0  
3           8248.0               8076.0       10.0  
4           8088.0               8344.0       14.0  

[5 rows x 36 columns]

5.4.2 - 1 Simple Modeling

from sklearn.linear_model import LinearRegression
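# `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2; drop the argument on newer versions
model = LinearRegression(normalize=True)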
model = model.fit(train_X, train_y)


print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

[('v_6', 3418891.8661930403),
 ('v_5', 2297424.5932109547),
 ('v_8', 1219155.1076147154),
 ('v_9', 689318.2798646946),
 ('v_7', 441213.76855152246),
 ('v_11', 33695.21532491009),
 ('v_12', 12751.353677724243),
 ('v_10', 11940.051759841095),
 ('gearbox', 915.6123144165624),
 ('v_14', 232.73304565903075),
 ('city', 38.480371412194856),
 ('power', 35.6062576498682),
 ('brand_price_median', 0.45503872387225264),
 ('brand_price_std', 0.45190937612308135),
 ('brand_amount', 0.18847929674706093),
 ('brand_price_max', 0.005052224552260477),
 ('train', 3.818422555923462e-07),
 ('brand_price_sum', -3.0058302585516643e-05),
 ('used_time', -0.02462480273046145),
 ('brand_price_average', -0.34643690063272586),
 ('brand_price_min', -2.5319297835051704),
 ('power_bin', -109.76386262213285),
 ('v_13', -150.7399294462697),
 ('kilometer', -351.36103146189294),
 ('v_3', -1344.0675611153686),
 ('v_0', -2798.561647500838),
 ('v_4', -3853.612550388567),
 ('v_2', -39428.5597862382),
 ('v_1', -44599.82891284008)]
from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)


plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
The predicted price is obviously different from the true price



import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])
It is clear to see the price shows a typical exponential distribution


Here we apply a log(x+1) transform to the label so that its distribution is closer to a normal distribution.

train_y_ln = np.log(train_y + 1)
import seaborn as sns
print('The transformed price seems like normal distribution')
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
The transformed price seems like normal distribution
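
As a rough illustration of why the log(x+1) transform helps with a long-tailed label, the sketch below compares the sample skewness before and after the transform and checks that `np.expm1` inverts it. The lognormal "prices" are synthetic and only an assumption for the demo. The `skewness` helper is hypothetical, not part of the tutorial code.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Synthetic long-tailed "prices" (illustration only, not the real data).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

log_prices = np.log1p(prices)    # equivalent to np.log(prices + 1)
restored = np.expm1(log_prices)  # inverse transform: exp(x) - 1

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f'skewness before: {skew_raw:.2f}, after: {skew_log:.2f}')
```

A skewness close to zero after the transform suggests the label is now roughly symmetric. Remember to apply the inverse transform to predictions before reporting prices.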



model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

[('v_9', 6.73837719656898),
 ('v_1', 1.8764010743138013),
 ('v_12', 1.5490066205584243),
 ('v_5', 1.3684828986478352),
 ('v_13', 0.9381007016475442),
 ('v_11', 0.8601076136541934),
 ('v_3', 0.6908662876168846),
 ('v_7', 0.07176605184338732),
 ('power_bin', 0.009208120503260045),
 ('gearbox', 0.005832463904491905),
 ('power', 0.0004532988300577831),
 ('brand_price_min', 2.958178448217908e-05),
 ('used_time', 7.25708186886524e-06),
 ('brand_amount', 3.2317294039114087e-06),
 ('brand_price_median', 1.2316687308102237e-06),
 ('brand_price_max', 7.440945604426392e-07),
 ('brand_price_average', 6.077520449532623e-07),
 ('train', -1.2789769243681803e-11),
 ('brand_price_sum', -2.437300649289914e-10),
 ('brand_price_std', -4.2133697033978156e-07),
 ('v_14', -0.0003021985128370997),
 ('city', -0.003247381687803047),
 ('kilometer', -0.012962886393843829),
 ('v_0', -0.031397921158372956),
 ('v_2', -0.698190677552077),
 ('v_4', -0.8159958185074844),
 ('v_10', -1.5348138603344603),
 ('v_8', -42.38488913963534),
 ('v_6', -253.24942729281895)]


plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # expm1 inverts log(x + 1)
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
The predicted price seems normal after np.log transforming


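5.4.2 - 2 Five-Fold Cross-Validation


Because the validation set takes no part in training, setting aside a large validation set is wasteful when training data is scarce. K-fold cross-validation is one remedy. We split the original training set into K non-overlapping subsets and then train and validate the model K times. Each time, one subset is used to validate the model and the other K-1 subsets are used to train it, and a different subset serves as the validation set in each round. Finally, we average the K training errors and the K validation errors separately.


In practice, a trained model usually fits its training set quite well (it is sensitive to initial conditions), but its fit on data outside the training set is often less satisfactory. So we normally do not train on all of the data. Instead we hold out a portion that takes no part in training and use it to test the parameters learned from the training set, which gives a relatively objective measure of how well those parameters fit unseen data. This idea is called cross-validation.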
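
The K-fold procedure described above can be sketched by hand with NumPy. This is only an illustrative implementation on synthetic data, using ordinary least squares as the model. The function name `k_fold_mae` and the toy data are assumptions, and the tutorial itself relies on scikit-learn's `cross_val_score`.

```python
import numpy as np

def k_fold_mae(X, y, k=5, seed=0):
    """Average training and validation MAE of least squares over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # k disjoint index sets
    train_errors, val_errors = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Append a column of ones so the model has an intercept.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        train_errors.append(np.mean(np.abs(A_train @ coef - y[train_idx])))
        val_errors.append(np.mean(np.abs(A_val @ coef - y[val_idx])))
    return float(np.mean(train_errors)), float(np.mean(val_errors))

# Synthetic regression problem (illustration only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

train_mae, val_mae = k_fold_mae(X_demo, y_demo, k=5)
print(f'train MAE: {train_mae:.3f}, validation MAE: {val_mae:.3f}')
```

Every sample is used for validation exactly once, which is why K-fold makes better use of scarce data than a single hold-out split.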

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer ## MAE
def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.6s finished

Five-fold cross-validation of the linear regression model on the data with untransformed labels (error 1.36)

print('AVG:', np.mean(scores))
AVG: 1.369013918691876

Five-fold cross-validation of the linear regression model on the data with log-transformed labels (error 0.19)

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.5s finished
print('AVG:', np.mean(scores))
AVG: 0.19576915594425995
scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
          cv1       cv2       cv3       cv4       cv5
MAE  0.194274  0.195956  0.195945  0.194693  0.197977

5.4.2 - 3 Simulating a Real Business Scenario


import datetime
sample_feature = sample_feature.reset_index(drop=True)
name model brand bodyType fuelType gearbox power kilometer notRepairedDamage price ... used_time city brand_amount brand_price_max brand_price_median brand_price_min brand_price_sum brand_price_std brand_price_average power_bin
0 736 30 6 1.0 0.0 0.0 60 12.5 0.0 1850.0 ... 4384.0 1.0 10192.0 35990.0 1800.0 13.0 36457520.0 4564.0 3576.0 5.0
1 14874 115 15 1.0 0.0 0.0 163 12.5 0.0 6222.0 ... 4384.0 2.0 1458.0 45000.0 8496.0 100.0 14373814.0 5424.0 9848.0 16.0
2 111080 110 5 1.0 0.0 0.0 68 5.0 0.0 5200.0 ... 1531.0 6.0 4664.0 31500.0 2300.0 20.0 15414322.0 3344.0 3306.0 6.0
3 137642 24 10 0.0 1.0 0.0 109 10.0 0.0 8000.0 ... 2482.0 3.0 13992.0 92900.0 5200.0 15.0 113034208.0 8248.0 8076.0 10.0
4 2402 13 4 0.0 0.0 1.0 150 15.0 0.0 3500.0 ... 6184.0 3.0 16576.0 99999.0 6000.0 12.0 138279072.0 8088.0 8344.0 14.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96262 43073 42 1 1.0 0.0 0.0 122 3.0 0.0 14780.0 ... 1538.0 5.0 13656.0 84000.0 6400.0 15.0 124044600.0 8992.0 9080.0 12.0
96263 163978 121 10 4.0 0.0 1.0 163 15.0 0.0 5900.0 ... 5772.0 4.0 13992.0 92900.0 5200.0 15.0 113034208.0 8248.0 8076.0 16.0
96264 184535 116 11 0.0 0.0 0.0 125 10.0 0.0 9500.0 ... 2322.0 2.0 2944.0 34500.0 2900.0 30.0 13398006.0 4724.0 4548.0 12.0
96265 147587 60 11 1.0 1.0 0.0 90 6.0 0.0 7500.0 ... 2003.0 3.0 2944.0 34500.0 2900.0 30.0 13398006.0 4724.0 4548.0 8.0
96266 45907 34 10 3.0 1.0 0.0 156 15.0 0.0 4999.0 ... 3672.0 1.0 13992.0 92900.0 5200.0 15.0 113034208.0 8248.0 8076.0 15.0

96267 rows × 36 columns

split_point = len(sample_feature) // 5 * 4
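# positional slicing: label-based .loc is inclusive on both ends, so the row at split_point would land in both sets
train = sample_feature.iloc[:split_point].dropna()
val = sample_feature.iloc[split_point:].dropna()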

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))
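
The single chronological split above can be extended to several folds in which the model is always trained on earlier rows and validated on later ones. The following pure-Python sketch shows one possible expanding-window scheme. The function name and the fold layout are assumptions for illustration, and the rows are assumed to be sorted by time already.

```python
def expanding_window_splits(n_samples, n_splits):
    """Return (train_indices, val_indices) pairs where training data always precedes validation data."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError('not enough samples for the requested number of splits')
    splits = []
    for i in range(1, n_splits + 1):
        train_end = fold * i
        # The last validation block absorbs any leftover rows.
        val_end = fold * (i + 1) if i < n_splits else n_samples
        splits.append((list(range(train_end)), list(range(train_end, val_end))))
    return splits

splits = expanding_window_splits(n_samples=10, n_splits=4)
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```

Unlike shuffled K-fold, no fold ever validates on data that is older than its training data, which mimics predicting the future from the past.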

5.4.2 - 4 Plotting Learning Curves and Validation Curves

from sklearn.model_selection import learning_curve, validation_curve
? learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training example')
    plt.ylabel('score')
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")  # shade training mean +- std
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")  # shade validation mean +- std
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)
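
To show what a learning curve measures without any plotting, the sketch below fits least squares on growing prefixes of a synthetic data set and records the training and validation MAE for each size. The helper `learning_curve_mae` and the data are assumptions for illustration. scikit-learn's `learning_curve` additionally averages over cross-validation folds.

```python
import numpy as np

def learning_curve_mae(X, y, train_sizes, n_val=100):
    """Train on the first `size` rows, validate on the last `n_val` rows."""
    X_val, y_val = X[-n_val:], y[-n_val:]
    A_val = np.column_stack([X_val, np.ones(n_val)])
    curve = []
    for size in train_sizes:
        A_train = np.column_stack([X[:size], np.ones(size)])
        coef, *_ = np.linalg.lstsq(A_train, y[:size], rcond=None)
        train_mae = np.mean(np.abs(A_train @ coef - y[:size]))
        val_mae = np.mean(np.abs(A_val @ coef - y_val))
        curve.append((size, float(train_mae), float(val_mae)))
    return curve

# Synthetic data: 8 features, linear signal plus noise (illustration only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 8))
y_demo = X_demo @ rng.normal(size=8) + 0.5 * rng.normal(size=500)

curve = learning_curve_mae(X_demo, y_demo, train_sizes=[10, 50, 200, 400])
for size, train_mae, val_mae in curve:
    print(f'n={size:3d}  train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}')
```

With very few samples the model nearly interpolates the training set, so the training error is low and the validation error is high. As the training set grows, the two curves typically approach each other.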


5.4.3 Comparing Multiple Models

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

5.4.3 - 1 Linear Models & Embedded Feature Selection



from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
models = [LinearRegression(),
          Ridge(),
          Lasso()]
result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
Ridge is finished
Lasso is finished


result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
     LinearRegression     Ridge     Lasso
cv1          0.194274  0.199028  0.392064
cv2          0.195956  0.200631  0.389369
cv3          0.195945  0.200816  0.391919
cv4          0.194693  0.199294  0.386594
cv5          0.197977  0.202830  0.392358
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions



model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions



model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions
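
Why Lasso can act as an embedded feature selector while Ridge cannot is easiest to see in the special case of an orthonormal design (XᵀX = I), where both estimators have closed forms in terms of the least-squares coefficients. The sketch below assumes the objectives ½‖y − Xw‖² + (λ/2)‖w‖² for Ridge and ½‖y − Xw‖² + λ‖w‖₁ for Lasso. The coefficient values are made up, and real data rarely has an orthonormal design.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    """Ridge under an orthonormal design: every coefficient is scaled down."""
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding by lam."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

# Made-up least-squares coefficients: two strong and two weak features.
w_ols = np.array([3.0, -0.5, 0.2, -2.0])

ridge_w = ridge_shrink(w_ols, lam=1.0)
lasso_w = lasso_shrink(w_ols, lam=1.0)

print('ridge non-zero coefficients:', np.count_nonzero(ridge_w))  # 4
print('lasso non-zero coefficients:', np.count_nonzero(lasso_w))  # 2
```

Ridge shrinks every coefficient but keeps all of them, whereas Lasso sets the weak ones exactly to zero, which is what makes the L1 penalty usable for feature selection.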




5.4.3 - 2 Nonlinear Models


from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators = 100)]
result = dict()
del train_X['gearbox']
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
     LinearRegression  DecisionTreeRegressor  RandomForestRegressor  GradientBoostingRegressor  MLPRegressor  XGBRegressor  LGBMRegressor
cv1          0.194302               0.197484               0.147026                   0.177392    116.598407      0.143866       0.147577
cv2          0.196005               0.205765               0.148228                   0.179418     67.139661      0.147769       0.150367
cv3          0.195943               0.200433               0.149695                   0.179679     26.075124      0.146292       0.148571
cv4          0.194723               0.197947               0.147885                   0.176486    248.037378      0.144756       0.148429
cv5          0.197991               0.202071               0.151757                   0.181362    244.320036      0.148483       0.152119


5.4.4 Model Tuning


## LightGBM parameter sets:

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

5.4.4 - 1 Greedy Tuning

best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score
sns.lineplot(x=['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
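
The greedy strategy used above tunes one parameter at a time and freezes each winner before moving on. The generic sketch below shows that pattern on a cheap toy objective instead of a cross-validated LightGBM model. `greedy_search`, `toy_score`, its default values and its optimum are all hypothetical, chosen only so the example runs instantly.

```python
def greedy_search(score_fn, param_grid):
    """Tune parameters one by one, keeping the best value found so far (lower score is better)."""
    best = {}
    history = []
    for name, candidates in param_grid.items():
        scores = {value: score_fn({**best, name: value}) for value in candidates}
        best[name] = min(scores, key=scores.get)
        history.append((name, best[name], scores[best[name]]))
    return best, history

def toy_score(params):
    """Hypothetical stand-in for a cross-validated MAE."""
    leaves = params.get('num_leaves', 31)  # assumed default while not yet tuned
    depth = params.get('max_depth', 10)    # assumed default while not yet tuned
    return (leaves - 40) ** 2 + (depth - 15) ** 2

param_grid = {
    'num_leaves': [3, 5, 10, 15, 20, 40, 55],
    'max_depth': [3, 5, 10, 15, 20, 40, 55],
}
best_params, history = greedy_search(toy_score, param_grid)
n_evaluations = sum(len(values) for values in param_grid.values())
print(best_params, n_evaluations)
```

Greedy tuning needs only 7 + 7 = 14 evaluations here instead of the 49 of a full grid, but it can miss the best combination when parameters interact, because earlier choices are never revisited.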


5.4.4 - 2 Grid Search Tuning

from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
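clf = clf.fit(train_X, train_y)
clf.best_params_
{'max_depth': 40, 'num_leaves': 55, 'objective': 'regression'}
# rebuild the model with the best parameters found by the grid search
model = LGBMRegressor(objective='regression',
                      num_leaves=clf.best_params_['num_leaves'],
                      max_depth=clf.best_params_['max_depth'])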
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
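
Grid search is exhaustive, so its cost grows with the product of the list lengths. The sketch below counts the candidates for the three parameter lists used in this section and runs an exhaustive search over them against a toy objective. `toy_score` and its optimum are hypothetical stand-ins for a cross-validated error.

```python
from itertools import product

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
cv_folds = 5

# Every combination of the three lists is one candidate.
candidates = [
    {'objective': o, 'num_leaves': l, 'max_depth': d}
    for o, l, d in product(objective, num_leaves, max_depth)
]
n_candidates = len(candidates)    # 5 * 7 * 7 = 245
n_fits = n_candidates * cv_folds  # one model fit per candidate per fold

def toy_score(params):
    """Hypothetical stand-in for a cross-validated MAE (lower is better)."""
    penalty = 0.0 if params['objective'] == 'regression' else 1.0
    return penalty + (params['num_leaves'] - 40) ** 2 + (params['max_depth'] - 15) ** 2

best_params = min(candidates, key=toy_score)
print(n_candidates, n_fits, best_params)
```

With 245 candidates and five folds, the search above already requires 1225 model fits, which is why grid search is usually reserved for a small number of coarse parameter lists.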

5.4.4 - 3 Bayesian Optimization

from bayes_opt import BayesianOptimization
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves = int(num_leaves),
            max_depth = int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    # BayesianOptimization maximizes its target, so return 1 - MAE
    return 1 - val
rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)
rf_bo.maximize()
|   iter    |  target   | max_depth | min_ch... | num_le... | subsample |
|  1        |  0.863    |  22.84    |  34.83    |  99.34    |  0.8796   |
|  2        |  0.8522   |  83.66    |  17.52    |  31.27    |  0.8104   |
|  3        |  0.8401   |  12.26    |  70.64    |  14.1     |  0.9035   |
|  4        |  0.8622   |  52.82    |  42.12    |  87.04    |  0.3299   |
|  5        |  0.8508   |  72.6     |  68.95    |  28.94    |  0.3273   |
|  6        |  0.863    |  99.96    |  94.19    |  99.23    |  0.5421   |
|  7        |  0.8631   |  90.84    |  2.475    |  99.34    |  0.8307   |
|  8        |  0.863    |  16.05    |  97.82    |  99.56    |  0.6616   |
|  9        |  0.8015   |  2.946    |  6.104    |  99.57    |  0.4299   |
|  10       |  0.7949   |  97.79    |  95.25    |  3.469    |  0.5979   |
|  11       |  0.8628   |  98.38    |  92.14    |  96.48    |  0.7543   |
|  12       |  0.8629   |  98.71    |  53.87    |  97.57    |  0.1395   |
|  13       |  0.7657   |  38.46    |  2.306    |  2.199    |  0.6655   |
|  14       |  0.8015   |  2.42     |  73.93    |  64.9     |  0.8561   |
|  15       |  0.7657   |  7.479    |  99.23    |  2.366    |  0.5894   |
|  16       |  0.7657   |  99.9     |  8.24     |  2.117    |  0.4889   |
|  17       |  0.863    |  59.63    |  96.77    |  98.46    |  0.6084   |
|  18       |  0.8585   |  99.67    |  35.61    |  55.5     |  0.7627   |
|  19       |  0.7657   |  54.1     |  52.56    |  2.834    |  0.7866   |
|  20       |  0.856    |  98.43    |  98.37    |  44.02    |  0.8604   |
|  21       |  0.8585   |  39.19    |  2.257    |  55.38    |  0.9486   |
|  22       |  0.8015   |  2.655    |  17.16    |  25.65    |  0.9142   |
|  23       |  0.8575   |  96.75    |  2.662    |  48.79    |  0.5475   |
|  24       |  0.8628   |  49.3     |  2.516    |  95.63    |  0.3689   |
|  25       |  0.8565   |  47.02    |  97.36    |  45.15    |  0.892    |
|  26       |  0.8632   |  40.13    |  66.03    |  99.89    |  0.1812   |
|  27       |  0.8572   |  42.57    |  48.11    |  48.67    |  0.9011   |
|  28       |  0.8597   |  74.9     |  10.61    |  64.42    |  0.9671   |
|  29       |  0.8632   |  66.48    |  28.57    |  99.3     |  0.9887   |
|  30       |  0.8592   |  74.97    |  71.13    |  59.46    |  0.9376   |
1 - rf_bo.max['target']



sns.lineplot(x=['0_origin','1_log_transfer','2_L1_&_L2','3_change_model','4_parameter_tuning'], y=[1.36 ,0.19, 0.19, 0.14, 0.13])


Task4 Modeling and Hyperparameter Tuning END.

