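Datawhale Zero-Basics Introduction to Data Mining - Task 4: Modeling and Parameter Tuning

4. Modeling and Parameter Tuning

Tip: This is Task 4 (modeling and parameter tuning) of the Zero-Basics Introduction to Data Mining series. It walks through several models, how to evaluate them, and strategies for tuning them. Feedback and discussion are welcome.

Competition: Zero-Basics Introduction to Data Mining - Used Car Transaction Price Prediction

Link: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX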

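4.1 Learning Objectives

Understand the commonly used machine learning models and master the workflow for building and tuning them
Complete the corresponding study check-in task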

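4.2 Overview of Contents

Linear regression model:
- Feature requirements of linear regression;
- Handling long-tailed distributions;
- Understanding the linear regression model;
Model performance validation:
- Evaluation functions and objective functions;
- Cross-validation;
- Leave-one-out validation;
- Validation for time-series problems;
- Plotting learning curves;
- Plotting validation curves;
Embedded feature selection:
- Lasso regression;
- Ridge regression;
- Decision trees;
Model comparison:
- Common linear models;
- Common nonlinear models;
Model tuning:
- Greedy tuning;
- Grid search tuning;
- Bayesian tuning;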

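4.3 Principles and Recommended Reading

Because the underlying algorithms take considerable space to explain, this article recommends some blogs and textbooks for beginners to study.

4.4 Code Examples

4.4.1 Reading the Data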

import pandas as pd

import numpy as np

import warnings

warnings.filterwarnings('ignore')

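The reduce_mem_usage function reduces the memory a DataFrame occupies by downcasting each column to a narrower data type that can still hold its values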

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # memory_usage() reports bytes, so convert to MB before printing
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # note: float16 keeps only about 3 significant decimal digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns are stored as categoricals
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))

Memory usage of dataframe is 57.70 MB

Memory usage after optimization is: 15.00 MB

Decreased by 74.0%
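
To see why downcasting saves memory, here is a minimal sketch on a made-up column. The Series below is hypothetical and is not taken from the competition data; it only assumes NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 1000 small integers stored as int64.
s = pd.Series(np.arange(1000) % 100, dtype=np.int64)

# All values lie in [0, 99], which fits inside the int8 range [-128, 127].
assert np.iinfo(np.int8).min < s.min() and s.max() < np.iinfo(np.int8).max

small = s.astype(np.int8)

bytes_before = s.memory_usage(index=False)     # 8 bytes per value
bytes_after = small.memory_usage(index=False)  # 1 byte per value
print(bytes_before, bytes_after)

# The downcast is lossless here because every value fits in int8.
assert (small == s).all()
```

The same check against np.iinfo or np.finfo is what reduce_mem_usage performs for each numeric column.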

continuous_feature_names = [x for x in sample_feature.columns if x not in ['price', 'brand', 'model']]

4.4.2 Linear Regression & Five-Fold Cross-Validation & Simulating a Real Business Scenario

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)

sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)

train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]

train_y = train['price']

4.4.2 - 1 Simple Modeling

from sklearn.linear_model import LinearRegression

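# normalize=True was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions use LinearRegression()
model = LinearRegression(normalize=True)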

model = model.fit(train_X, train_y)

Inspect the intercept and weights (coef) of the trained linear regression model

'intercept:'+ str(model.intercept_)

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

[('v_6', 3342612.384537345),
 ('v_8', 684205.534533214),
 ('v_9', 178967.94192530424),
 ('v_7', 35223.07319016895),
 ('v_5', 21917.550249749802),
 ('v_3', 12782.03250792227),
 ('v_12', 11654.925634146672),
 ('v_13', 9884.194615297649),
 ('v_11', 5519.182176035517),
 ('v_10', 3765.6101415594258),
 ('gearbox', 900.3205339198406),
 ('fuelType', 353.5206495542567),
 ('bodyType', 186.51797317460046),
 ('city', 45.17354204168846),
 ('power', 31.163045441455335),
 ('brand_price_median', 0.535967111869784),
 ('brand_price_std', 0.4346788365040235),
 ('brand_amount', 0.15308295553300566),
 ('brand_price_max', 0.003891831020467389),
 ('seller', -1.2684613466262817e-06),
 ('offerType', -4.759058356285095e-06),
 ('brand_price_sum', -2.2430642281682917e-05),
 ('name', -0.00042591632723759166),
 ('used_time', -0.012574429533889028),
 ('brand_price_average', -0.414105722833381),
 ('brand_price_min', -2.3163823428971835),
 ('train', -5.392535065078232),
 ('power_bin', -59.24591853031839),
 ('v_14', -233.1604256172217),
 ('kilometer', -372.96600915402496),
 ('notRepairedDamage', -449.29703564695365),
 ('v_0', -1490.6790578168238),
 ('v_4', -14219.648899108111),
 ('v_2', -16528.55239086934),
 ('v_1', -42869.43976200439)]

from matplotlib import pyplot as plt

subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

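Plot the value of feature v_9 against the label as a scatter plot. The plot shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predicted values are below 0, which indicates that the model has problems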

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')

plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')

plt.xlabel('v_9')

plt.ylabel('price')

plt.legend(['True Price','Predicted Price'],loc='upper right')

print('The predicted price is obviously different from the true price')

plt.show()

The predicted price is obviously different from the true price

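Plotting the data shows that the label (price) follows a long-tailed distribution, which is unfavorable for modeling. The reason is that many models assume the error term is normally distributed, and long-tailed data violates this assumption. Reference blog: https://blog.csdn.net/Noob_daniel/article/details/76087829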

import seaborn as sns

print('It is clear to see the price shows a typical exponential distribution')

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)

sns.distplot(train_y)

plt.subplot(1,2,2)

sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])

It is clear to see the price shows a typical exponential distribution


Here we apply a log(x + 1) transform to the label so that its distribution is closer to normal

train_y_ln = np.log(train_y + 1)

import seaborn as sns

print('The transformed price seems like normal distribution')

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)

sns.distplot(train_y_ln)

plt.subplot(1,2,2)

sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])

The transformed price seems like normal distribution

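
The effect of the log(x + 1) transform can also be checked numerically. The sketch below uses synthetic log-normal "prices" (an assumption, not the competition data) and a small hand-written skewness helper. It also shows that the inverse of log(x + 1) is exp(x) - 1, available as np.expm1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical long-tailed prices: log-normal samples, not the competition data.
price = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

price_ln = np.log1p(price)     # same as np.log(price + 1)
restored = np.expm1(price_ln)  # inverse transform: exp(x) - 1

print('skewness before:', skewness(price))
print('skewness after :', skewness(price_ln))
```

A skewness close to 0 after the transform indicates a roughly symmetric label, which is closer to what a least-squares model expects.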

model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))

sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:23.515920686637713

[('v_9', 6.043993029165403),
 ('v_12', 2.0357439855551394),
 ('v_11', 1.3607608712255672),
 ('v_1', 1.3079816298861897),
 ('v_13', 1.0788833838535354),
 ('v_3', 0.9895814429387444),
 ('gearbox', 0.009170812023421397),
 ('fuelType', 0.006447089787635784),
 ('bodyType', 0.004815242907679581),
 ('power_bin', 0.003151801949447194),
 ('power', 0.0012550361843629999),
 ('train', 0.0001429273782925814),
 ('brand_price_min', 2.0721302299502698e-05),
 ('brand_price_average', 5.308179717783439e-06),
 ('brand_amount', 2.8308531339942507e-06),
 ('brand_price_max', 6.764442596115763e-07),
 ('offerType', 1.6765966392995324e-10),
 ('seller', 9.308109838457312e-12),
 ('brand_price_sum', -1.3473184925468486e-10),
 ('name', -7.11403461065247e-08),
 ('brand_price_median', -1.7608143661053008e-06),
 ('brand_price_std', -2.7899058266986454e-06),
 ('used_time', -5.6142735899344175e-06),
 ('city', -0.0024992974087053223),
 ('v_14', -0.012754139659375262),
 ('kilometer', -0.013999175312751872),
 ('v_0', -0.04553774829634237),
 ('notRepairedDamage', -0.273686961116076),
 ('v_7', -0.7455902679730504),
 ('v_4', -0.9281349233755761),
 ('v_2', -1.2781892166433606),
 ('v_5', -1.5458846136756323),
 ('v_10', -1.8059217242413748),
 ('v_8', -42.611729973490604),
 ('v_6', -241.30992120503035)]

Visualizing again, the predictions are now fairly close to the true values, and no anomalies appear

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')

# the label was transformed with log(x + 1), so invert it with expm1 (exp(x) - 1)
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')

plt.xlabel('v_9')

plt.ylabel('price')

plt.legend(['True Price','Predicted Price'],loc='upper right')

print('The predicted price seems normal after np.log transforming')

plt.show()

The predicted price seems normal after np.log transforming

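4.4.2 - 2 Five-Fold Cross-Validation

When fitting a model, people usually split the full dataset into three parts (as with the MNIST handwritten digit dataset): a training set (train_set), a validation set (valid_set), and a test set (test_set). This is done deliberately to make the evaluation trustworthy. The test set is easy to understand: it takes no part in training at all and is used only to measure the final result. The training and validation sets involve the ideas below.

In practice, a trained model usually fits the training set quite well (the result is sensitive to initial conditions), but its fit to data outside the training set is often less satisfactory. So we usually do not train on all the data. Instead we hold out a part that does not take part in training and use it to test the parameters learned from the training set, which gives a relatively objective estimate of how well those parameters match unseen data. This idea is called cross-validation (Cross Validation). In k-fold cross-validation the data is split into k folds, and each fold serves once as the held-out part while the other k - 1 folds are used for training.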
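
As a minimal sketch of what five-fold cross-validation does internally, the loop below splits a synthetic dataset by hand with scikit-learn's KFold. The data and the coefficients are made up for illustration and are not the competition data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Hypothetical data: y is a noisy linear function of three features.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = []
for train_idx, valid_idx in kf.split(X):
    # Fit on four folds, evaluate on the fold that was held out.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    fold_mae.append(mean_absolute_error(y[valid_idx], pred))

print('MAE per fold:', np.round(fold_mae, 4))
print('AVG:', np.mean(fold_mae))
```

cross_val_score wraps this loop: it returns one score per fold, and the mean of those scores is the figure usually reported.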

from sklearn.model_selection import cross_val_score

from sklearn.metrics import mean_absolute_error,  make_scorer

def log_transfer(func):

    def wrapper(y, yhat):

        result = func(np.log(y), np.nan_to_num(np.log(yhat)))

        return result

    return wrapper

scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s finished

Five-fold cross-validation of the linear regression model on the untransformed label (Error 1.36)

print('AVG:', np.mean(scores))

AVG: 1.3641908155886227

Five-fold cross-validation of the linear regression model on the log-transformed label (Error 0.19)

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s finished

print('AVG:', np.mean(scores))

AVG: 0.19382863663604424

scores = pd.DataFrame(scores.reshape(1,-1))

scores.columns = ['cv' + str(x) for x in range(1, 6)]

scores.index = ['MAE']

scores

        cv1       cv2       cv3       cv4       cv5
MAE     0.191642  0.194986  0.192737  0.195329  0.19445

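4.4.2 - 3 Simulating a Real Business Scenario

In reality, however, we cannot see into the future, so five-fold cross-validation can give an unrealistic picture on time-dependent datasets. Using used-car prices from 2018 to predict used-car prices in 2017 is clearly unreasonable, so we can also split the dataset in chronological order. In this example we take the earliest 4/5 of the samples as the training set and the latest 1/5 as the validation set. The final result differs little from five-fold cross-validation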

import datetime

sample_feature = sample_feature.reset_index(drop=True)

split_point = len(sample_feature) // 5 * 4

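# .loc slicing includes both endpoints, so use positional .iloc to keep the boundary row out of the training set
train = sample_feature.iloc[:split_point].dropna()

val = sample_feature.iloc[split_point:].dropna()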

train_X = train[continuous_feature_names]

train_y_ln = np.log(train['price'] + 1)

val_X = val[continuous_feature_names]

val_y_ln = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_ln)

mean_absolute_error(val_y_ln, model.predict(val_X))


0.19443858353490887
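
The single chronological split above can be extended to several forward-chaining splits. The sketch below uses scikit-learn's TimeSeriesSplit on a toy index, under the assumption that the rows are already sorted from oldest to newest (the toy array is hypothetical).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical: 20 samples already sorted from oldest to newest.
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))
for train_idx, valid_idx in splits:
    # Every training index precedes every validation index.
    print('train up to', train_idx.max(),
          '| validate', valid_idx.min(), 'to', valid_idx.max())
```

Passing cv=TimeSeriesSplit(n_splits=4) to cross_val_score would evaluate a model on these splits, so that it is never trained on samples that come after the ones it is validated on.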

4.4.2 - 4 Plotting Learning Curves and Validation Curves

from sklearn.model_selection import learning_curve, validation_curve

help(learning_curve)  # in IPython/Jupyter the shorthand is: learning_curve?

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):

    plt.figure()

    plt.title(title)

    if ylim is not None:

        plt.ylim(*ylim)

    plt.xlabel('Training example')

    plt.ylabel('score')

    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))

    train_scores_mean = np.mean(train_scores, axis=1)

    train_scores_std = np.std(train_scores, axis=1)

    test_scores_mean = np.mean(test_scores, axis=1)

    test_scores_std = np.std(test_scores, axis=1)

    plt.grid()  # grid lines; the shaded bands below show mean +/- one standard deviation

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,

                     train_scores_mean + train_scores_std, alpha=0.1,

                     color="r")

    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,

                     test_scores_mean + test_scores_std, alpha=0.1,

                     color="g")

    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',

             label="Training score")

    plt.plot(train_sizes, test_scores_mean,'o-',color="g",

             label="Cross-validation score")

    plt.legend(loc="best")

    return plt

plot_learning_curve(LinearRegression(), 'Linear_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)

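
The section title also mentions validation curves, which show how the score changes as one hyperparameter varies. The following is a minimal sketch, assuming a Ridge model and synthetic data (both chosen only for illustration), that computes such a curve with scikit-learn's validation_curve.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

rng = np.random.default_rng(0)
# Hypothetical data with two irrelevant features.
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

alphas = np.logspace(-3, 3, 7)
train_scores, valid_scores = validation_curve(
    Ridge(), X, y,
    param_name='alpha', param_range=alphas,
    cv=5, scoring='neg_mean_absolute_error',
)

# Scores are negated MAE, so flip the sign to report MAE (lower is better).
train_mae = -train_scores.mean(axis=1)
valid_mae = -valid_scores.mean(axis=1)
for a, tr, va in zip(alphas, train_mae, valid_mae):
    print(f'alpha={a:g}  train MAE={tr:.3f}  valid MAE={va:.3f}')
```

Plotting train_mae and valid_mae against alphas on a log-scaled x axis gives the validation curve. A validation error that rises as alpha grows indicates underfitting from too much regularization.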

4.4.3 Comparing Multiple Models

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]

train_y = train['price']

train_y_ln = np.log(train_y + 1)

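4.4.3 - 1 Linear Models & Embedded Feature Selection

This section assumes the reader already understands concepts such as overfitting, model complexity, and regularization. If not, please look up related material or see the links below:

Explaining "overfitting" in plain language: https://www.zhihu.com/question/32246256/answer/55320482
Model complexity and generalization ability: http://yangyingming.com/article/434/
An intuitive understanding of regularization: https://blog.csdn.net/jinping_shi/article/details/52433975

In filter and wrapper feature selection methods, the feature selection process is clearly separate from the training of the learner. Embedded feature selection, by contrast, performs feature selection automatically while the learner is being trained. The most common embedded approaches are L1 regularization and L2 regularization. Adding them to a linear regression model gives Lasso regression (L1) and ridge regression (L2), respectively.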
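
The difference between the two penalties is easiest to see on data where only some features matter. The sketch below is an illustration on synthetic data (the coefficients and alpha values are assumptions): ridge shrinks every weight but keeps them nonzero, while Lasso sets the weights of the irrelevant features exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
# Hypothetical data: only the first two of six features influence y.
X = rng.normal(size=(500, 6))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=500)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty

print('ridge:', np.round(ridge.coef_, 3))
print('lasso:', np.round(lasso.coef_, 3))
print('lasso zero coefficients:', int(np.sum(lasso.coef_ == 0)))
```

Features whose Lasso weight is exactly zero can be dropped, which is what makes L1 regularization usable as an embedded feature selector.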

from sklearn.linear_model import LinearRegression

from sklearn.linear_model import Ridge

from sklearn.linear_model import Lasso

models = [LinearRegression(),

          Ridge(),

          Lasso()]

result = dict()

for model in models:

    model_name = str(model).split('(')[0]

    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))

    result[model_name] = scores

    print(model_name + ' is finished')

LinearRegression is finished

Ridge is finished

Lasso is finished

Comparing the results of the three methods

result = pd.DataFrame(result)

result.index = ['cv' + str(x) for x in range(1, 6)]

result

     LinearRegression  Ridge     Lasso
cv1  0.191642          0.195665  0.382708
cv2  0.194986          0.198841  0.383916
cv3  0.192737          0.196629  0.380754
cv4  0.195329          0.199255  0.385683
cv5  0.194450          0.198173  0.383555

model = LinearRegression().fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))

sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions

intercept:23.515984499017883


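L2 regularization tends to make the weights as small as possible during fitting, producing a model in which all parameters are fairly small. A model with small parameter values is generally considered simpler, adapts to different datasets, and avoids overfitting to some extent. Consider a linear regression equation: if the parameters are large, a small shift in the data can change the result a lot; if the parameters are small enough, even a larger shift in the data has little effect on the result. The technical term is "strong robustness to perturbation".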

model = Ridge().fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))

sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions

intercept:5.901527844424091


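L1 regularization helps produce a sparse weight matrix, which can then be used for feature selection. As the plot below shows, the power and used_time features are very important.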

model = Lasso().fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))

sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments are required by recent seaborn versions

intercept:8.674427764003347


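In addition, when a decision tree chooses split nodes by information entropy or the Gini index, the features it prefers to split on are also the more important ones, so this is likewise a form of feature selection. The feature importance scores reported by XGBoost and LightGBM (feature_importances_ in their scikit-learn wrappers) are computed on this basis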
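
A minimal sketch of tree-based importance, using scikit-learn's DecisionTreeRegressor on synthetic data (the features and target are made up). Note one difference from the description above: for regression, scikit-learn trees split on squared-error reduction by default, while entropy and Gini are the classification criteria.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
# Hypothetical target: depends strongly on feature 0, weakly on feature 1,
# and not at all on features 2 and 3.
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Importances are normalized impurity reductions and sum to 1.
importances = tree.feature_importances_
for name, imp in zip(['f0', 'f1', 'f2', 'f3'], importances):
    print(name, round(imp, 3))
```

Features with importance near zero were rarely or never chosen for a split and are candidates for removal.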

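4.4.3 - 2 Nonlinear Models

Besides linear models, there are many commonly used nonlinear models, listed below. Space is limited, so their principles are not explained one by one here. We selected some common models and compared their results with the linear model.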

from sklearn.linear_model import LinearRegression

from sklearn.svm import SVR  # regression counterpart of the SVC classifier; not used below

from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor

from sklearn.ensemble import GradientBoostingRegressor

from sklearn.neural_network import MLPRegressor

from xgboost.sklearn import XGBRegressor

from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),

          DecisionTreeRegressor(),

          RandomForestRegressor(),

          GradientBoostingRegressor(),

          MLPRegressor(solver='lbfgs', max_iter=100),

          XGBRegressor(n_estimators = 100, objective='reg:squarederror'),

          LGBMRegressor(n_estimators = 100)]

result = dict()

for model in models:

    model_name = str(model).split('(')[0]

    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))

    result[model_name] = scores

    print(model_name + ' is finished')

LinearRegression is finished

DecisionTreeRegressor is finished

RandomForestRegressor is finished

GradientBoostingRegressor is finished

MLPRegressor is finished

XGBRegressor is finished

LGBMRegressor is finished

result = pd.DataFrame(result)

result.index = ['cv' + str(x) for x in range(1, 6)]

result

     LinearRegression  DecisionTreeRegressor  RandomForestRegressor  GradientBoostingRegressor  MLPRegressor  XGBRegressor  LGBMRegressor
cv1  0.191642          0.184566               0.136266               0.168626                   124.299426    0.168698      0.141159
cv2  0.194986          0.187029               0.139693               0.171905                   257.886236    0.172258      0.143363
cv3  0.192737          0.184839               0.136871               0.169553                   236.829589    0.168604      0.142137
cv4  0.195329          0.182605               0.138689               0.172299                   130.197264    0.172474      0.143461
cv5  0.194450          0.186626               0.137420               0.171206                   268.090236    0.170898      0.141921

The random forest model achieves the lowest MAE in every fold

4.4.4 Model Tuning

Here we introduce three commonly used tuning methods:

Greedy algorithm: https://www.jianshu.com/p/ab89df9759c8
Grid search tuning: https://blog.csdn.net/weixin_43172660/article/details/83032029
Bayesian tuning: https://blog.csdn.net/linxid/article/details/81189154

## LightGBM parameter candidates:

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]

max_depth = [3,5,10,15,20,40, 55]

bagging_fraction = []

feature_fraction = []

drop_rate = []

4.4.4 - 1 Greedy Tuning

best_obj = dict()

for obj in objective:

    model = LGBMRegressor(objective=obj)

    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

    best_obj[obj] = score

best_leaves = dict()

for leaves in num_leaves:

    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)

    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

    best_leaves[leaves] = score

best_depth = dict()

for depth in max_depth:

    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],

                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],

                          max_depth=depth)

    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

    best_depth[depth] = score

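# 0.143 is the hard-coded starting score (roughly the default LGBMRegressor MAE in the comparison table above)
sns.lineplot(x=['0_initial','1_tuning_obj','2_tuning_leaves','3_tuning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])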

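
The three loops above follow one pattern: tune a parameter, freeze its best value, then move to the next. The sketch below writes that pattern as a single loop. It is an illustration under assumptions: a DecisionTreeRegressor stands in for LightGBM so that it runs with scikit-learn alone, and the data and search space are made up.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

def cv_mae(params):
    """Mean five-fold MAE of a decision tree built with the given parameters."""
    model = DecisionTreeRegressor(random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    return -scores.mean()

# Hypothetical search space, tuned one parameter at a time in this order.
search_space = {
    'max_depth': [2, 4, 6, 8],
    'min_samples_leaf': [1, 5, 10, 20],
}

best_params = {}
history = [cv_mae({})]  # baseline with default parameters
for name, candidates in search_space.items():
    scores = {v: cv_mae({**best_params, name: v}) for v in candidates}
    best_params[name] = min(scores, key=scores.get)  # freeze the best value
    history.append(scores[best_params[name]])

print(best_params)
print(np.round(history, 4))
```

Greedy tuning evaluates the sum of the candidate counts (4 + 4 here) instead of their product (4 x 4 for a full grid), at the cost of possibly missing combinations that only work well together.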

4.4.4 - 2 Grid Search Tuning

from sklearn.model_selection import GridSearchCV

parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}

model = LGBMRegressor()

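# no scoring is given, so candidates are ranked by the estimator's default score (R^2)
clf = GridSearchCV(model, parameters, cv=5)

# note: this search is fitted on the raw price (train_y), not the log-transformed train_y_ln
clf = clf.fit(train_X, train_y)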

clf.best_params_


{'max_depth': 15, 'num_leaves': 55, 'objective': 'regression'}

model = LGBMRegressor(objective='regression',

                          num_leaves=55,

                          max_depth=15)

np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))


0.13626164479243302
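
GridSearchCV always keeps the candidate with the highest score, so an error metric such as MAE has to be passed in negated form. The sketch below shows one way to do that with the built-in 'neg_mean_absolute_error' scorer. The decision tree and the synthetic data are stand-ins chosen for illustration, not the LightGBM setup used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Hypothetical grid: 3 x 3 = 9 combinations, each evaluated with 5 folds.
param_grid = {'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',  # higher is better, so MAE is negated
)
search.fit(X, y)

print(search.best_params_)
print('best CV MAE:', -search.best_score_)
print('combinations tried:', len(search.cv_results_['params']))
```

A scorer built with make_scorer(mean_absolute_error) and no other arguments treats larger values as better, which is fine for reporting with cross_val_score but would make a grid search pick the worst candidate; make_scorer(mean_absolute_error, greater_is_better=False) avoids that.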

4.4.4 - 3 Bayesian Tuning

from bayes_opt import BayesianOptimization

def rf_cv(num_leaves, max_depth, subsample, min_child_samples):

    val = cross_val_score(

        LGBMRegressor(objective = 'regression_l1',

            num_leaves=int(num_leaves),

            max_depth=int(max_depth),

            subsample = subsample,

            min_child_samples = int(min_child_samples)

),

        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)

    ).mean()

    return 1 - val

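# the search bounds are passed as a dict of (lower, upper) tuples
rf_bo = BayesianOptimization(
    rf_cv,
    {
        'num_leaves': (2, 100),
        'max_depth': (2, 100),
        'subsample': (0.1, 1),
        'min_child_samples': (2, 100)
    }
)

# maximize() searches for the largest target, which is why rf_cv returns 1 - MAE
rf_bo.maximize()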

|   iter    |  target   | max_depth | min_ch... | num_le... | subsample |

-------------------------------------------------------------------------

|  1        |  0.8649   |  89.57    |  47.3     |  55.13    |  0.1792   |

|  2        |  0.8477   |  99.86    |  60.91    |  15.35    |  0.4716   |

|  3        |  0.8698   |  81.74    |  83.32    |  92.59    |  0.9559   |

|  4        |  0.8627   |  90.2     |  8.754    |  43.34    |  0.7772   |

|  5        |  0.8115   |  10.07    |  86.15    |  4.109    |  0.3416   |

|  6        |  0.8701   |  99.15    |  9.158    |  99.47    |  0.494    |

|  7        |  0.806    |  2.166    |  2.416    |  97.7     |  0.224    |

|  8        |  0.8701   |  98.57    |  97.67    |  99.87    |  0.3703   |

|  9        |  0.8703   |  99.87    |  43.03    |  99.72    |  0.9749   |

|  10       |  0.869    |  10.31    |  99.63    |  99.34    |  0.2517   |

|  11       |  0.8703   |  52.27    |  99.56    |  98.97    |  0.9641   |

|  12       |  0.8669   |  99.89    |  8.846    |  66.49    |  0.1437   |

|  13       |  0.8702   |  68.13    |  75.28    |  98.71    |  0.153    |

|  14       |  0.8695   |  84.13    |  86.48    |  91.9     |  0.7949   |

|  15       |  0.8702   |  98.09    |  59.2     |  99.65    |  0.3275   |

|  16       |  0.87     |  68.97    |  98.62    |  98.93    |  0.2221   |

|  17       |  0.8702   |  99.85    |  63.74    |  99.63    |  0.4137   |

|  18       |  0.8703   |  45.87    |  99.05    |  99.89    |  0.3238   |

|  19       |  0.8702   |  79.65    |  46.91    |  98.61    |  0.8999   |

|  20       |  0.8702   |  99.25    |  36.73    |  99.05    |  0.1262   |

|  21       |  0.8702   |  85.51    |  85.34    |  99.77    |  0.8917   |

|  22       |  0.8696   |  99.99    |  38.51    |  89.13    |  0.9884   |

|  23       |  0.8701   |  63.29    |  97.93    |  99.94    |  0.9585   |

|  24       |  0.8702   |  93.04    |  71.42    |  99.94    |  0.9646   |

|  25       |  0.8701   |  99.73    |  16.21    |  99.38    |  0.9778   |

|  26       |  0.87     |  86.28    |  58.1     |  99.47    |  0.107    |

|  27       |  0.8703   |  47.28    |  99.83    |  99.65    |  0.4674   |

|  28       |  0.8703   |  68.29    |  99.51    |  99.4     |  0.2757   |

|  29       |  0.8701   |  76.49    |  73.41    |  99.86    |  0.9394   |

|  30       |  0.8695   |  37.27    |  99.87    |  89.87    |  0.7588   |

=========================================================================

1 - rf_bo.max['target']


0.1296693644053145
