[Python Machine Learning] Machine Learning Algorithms: CatBoost

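Contents:
1. Background
2. Introduction to CatBoost
3. Advantages of CatBoost
4. Installing and Using CatBoost
5. CatBoost Regression in Practice
6. Tuning CatBoost Hyperparameters
7. CatBoost Parameters in Detail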

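1. Background

In 2017 the Russian search giant Yandex open-sourced the CatBoost framework. CatBoost (Categorical Features + Gradient Boosting) adopts a strategy that reduces overfitting while still letting the entire dataset be used for training. Yandex describes it as high-performing, more robust and more general, easy to use, and practical, and claims that its performance can rival other state-of-the-art machine learning algorithms.
In fact, XGBoost, LightGBM, and CatBoost are all implementations of GBDT that aim to improve model quality and training speed. Compared with XGBoost, LightGBM is better suited to larger datasets. Along the line GBDT -> XGBoost -> LightGBM -> CatBoost, one cannot say with certainty before training that LightGBM will beat GBDT or XGBoost, because the size of the data also determines which model is workable. XGBoost, LightGBM, and CatBoost are the three most representative GBDT-based libraries, and each claims excellent performance, efficiency, and accuracy, so which one comes out ahead? In practice, it is best to try each of them on your own data before choosing.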

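2. Introduction to CatBoost
The name CatBoost comes from the two words "Category" and "Boosting". As noted above, it is a gradient boosting library that handles categorical features particularly well.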

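3. Advantages of CatBoost
High performance: its authors report results competitive with other state-of-the-art machine learning algorithms
Robustness: it reduces the need for extensive hyperparameter tuning and lowers the risk of overfitting, which also makes the resulting models more general
Easy to use: provides a Python interface that integrates with scikit-learn, as well as R and command-line interfaces
Practical: handles both categorical and numerical features
Extensible: supports custom loss functions
Multi-GPU support: the GPU implementation in CatBoost can use multiple GPUs. Distributed tree learning can be parallelized over data or over features. CatBoost computes the statistics of categorical features during training with a scheme based on multiple permutations of the training dataset.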
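The permutation-based statistics mentioned in the last point can be illustrated with a simplified sketch. It encodes each categorical value using only the targets of rows that appear earlier in one random permutation, smoothed toward a prior. This is an illustration of the idea under stated assumptions (a single permutation, a smoothing weight a, and a helper name ordered_target_stats that is hypothetical), not CatBoost's actual implementation.

```python
import random

def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode each row from the targets of earlier rows in a random order.

    encoded[i] = (sum of earlier targets with the same category + a * prior)
                 / (count of earlier rows with the same category + a)
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # one random permutation of the rows

    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        n = counts.get(c, 0)
        # Only rows visited before row i contribute, so a row's own target
        # never leaks into its own encoding.
        encoded[i] = (s + a * prior) / (n + a)
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1.0, 0.0, 1.0, 0.0, 1.0]
prior = sum(ys) / len(ys)  # global target mean used as the prior
enc = ordered_target_stats(cats, ys, prior)
print(enc)
```

The first row visited for each category has no history and therefore receives exactly the prior; later rows move toward the running mean of their category.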

4. Installing and Using CatBoost
Installing CatBoost is simple: just run pip install catboost.

5. CatBoost Regression in Practice

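# -*- coding: utf-8 -*-
# Import only the metrics that are actually used instead of a wildcard import
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import pandas as pd

from catboost import CatBoostRegressor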



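# Path to the training data
data_path = 'train_data.txt'
# Load the data (read_table expects tab-separated values by default)
data = pd.read_table(data_path)

# Select the independent variables (every column after the first)
X = data.iloc[:, 1:]
# Select the dependent variable (the first column)
y = data.iloc[:, 0]
# Extract the feature names
feature_names = list(X.columns)
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=33)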


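# Model hyperparameters
params = {
    'iterations': 330,
    'learning_rate': 0.1,
    'depth': 10,
    'loss_function': 'RMSE',
}


# Train the model; verbose=3 prints the training log every 3 iterations
clf = CatBoostRegressor(**params)
clf.fit(X_train, y_train, verbose=3)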


# Evaluate on the test set: MAE, MSE, RMSE, and R2
y_predict = clf.predict(X_test)
mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
rmse = round(mse ** 0.5, 4)
r2 = round(r2_score(y_test, y_predict), 4)


# Print the evaluation metrics
print("MAE: %.4f" % mae)
print("MSE: %.4f" % mse)
print("RMSE: %.4f" % rmse)
print("R2: %.4f" % r2)
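To clarify what the four metrics above measure, the following sketch computes them by hand on four made-up target and prediction values (not taken from train_data.txt). It uses only the standard library, so the arithmetic can be followed step by step.

```python
import math

# Hypothetical true values and predictions, chosen for easy arithmetic.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]  # [0.5, 0.0, -1.0, -1.0]

mae = sum(abs(e) for e in errors) / n   # mean absolute error   -> 0.625
mse = sum(e * e for e in errors) / n    # mean squared error    -> 0.5625
rmse = math.sqrt(mse)                   # root mean squared err -> 0.75

# R2 compares the residual sum of squares with the total sum of squares
# around the mean of the true values.
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1.0 - ss_res / ss_tot              # about 0.8475

print("MAE: %.4f" % mae)
print("MSE: %.4f" % mse)
print("RMSE: %.4f" % rmse)
print("R2: %.4f" % r2)
```

RMSE is in the same unit as the target, which is why it is often easier to interpret than MSE.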

6. Tuning CatBoost Hyperparameters


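from sklearn.model_selection import GridSearchCV

# Parameter grid to search: candidate tree depths
cv_params = {'depth': [8, 9, 10, 11, 12, 13, 14]}


# Parameters held fixed during the search
other_params = {
    'iterations': 800,
    'learning_rate': 0.09,
    # 'depth': 10,
    'loss_function': 'RMSE',
}


cat_model_ = CatBoostRegressor(**other_params)
# The iid argument was removed in scikit-learn 0.24, so it is not passed here
cat_search = GridSearchCV(cat_model_,
                          param_grid=cv_params,
                          scoring='neg_mean_squared_error',
                          n_jobs=-1,
                          cv=5)

cat_search.fit(X_train, y_train)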


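# Mean cross-validated score (negative MSE) of each candidate, and the candidates themselves
means = cat_search.cv_results_['mean_test_score']
tested_params = cat_search.cv_results_['params']

print(means)
print(tested_params)
# Best parameter combination and its score
print(cat_search.best_params_)
print(cat_search.best_score_)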

7. CatBoost Parameters in Detail

General parameters:

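loss_function: The loss function. Supported values include RMSE, Logloss, MAE, CrossEntropy, Quantile, LogLinQuantile, MultiClass, MultiClassOneVsAll, MAPE, and Poisson. Default: RMSE.
custom_metric: Metric values to output during training. These metrics are not optimized and are displayed for informational purposes only. Default: None.
eval_metric: The metric used for overfitting detection (if enabled) and for best-model selection (if enabled). If it is not set, the optimized loss function is used.
iterations: The maximum number of trees. Default: 1000.
learning_rate: The learning rate. Default: 0.03.
random_seed: The random seed used for training.
l2_leaf_reg: The L2 regularization coefficient. Default: 3.
bootstrap_type: Defines how the weights of objects are sampled. Options: Poisson (supported for GPU only), Bayesian, Bernoulli, No. Default: Bayesian.
bagging_temperature: Controls the strength of Bayesian bagging. The value must be non-negative, and larger values make the bagging more aggressive; a typical tuning interval is [0, 1]. Default: 1.
subsample: The sampling rate for bagging, used when bootstrap_type is Poisson or Bernoulli. Default: 0.66.
sampling_frequency: How often weights and objects are sampled while building trees. Options: PerTree, PerTreeLevel. Default: PerTreeLevel.
random_strength: Multiplier for the standard deviation of the randomness added to split scores. Default: 1.
use_best_model: When this is set, validation data must be provided; the number of trees kept in the final model is the one that gives the best value of the evaluation metric on that data. Default: False when no validation set is given.
best_model_min_trees: The minimum number of trees the best model should have.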
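depth: Tree depth. The maximum is 16; values between 1 and 10 are recommended. Default: 6.
ignored_features: Features in the dataset to ignore. Default: None.
one_hot_max_size: Categorical features with at most this many distinct values are one-hot encoded; features with more distinct values are converted to numerical features. Default: 2, matching the default values listed further down.
has_time: Use the order of objects in the input data, rather than random permutations, when converting categorical features to numerical features and when choosing the tree structure. Default: False (random permutations).
rsm: Random subspace method, the fraction of features considered at each split selection. Default: 1.
nan_mode: How missing values in the input data are handled: Forbidden (missing values are not allowed), Min (missing values are treated as smaller than every other value of the feature), Max (missing values are treated as larger than every other value). Default: Min.
fold_permutation_block_size: Objects in the dataset are grouped into blocks before the random permutations. This parameter sets the block size. The smaller the value, the slower the training; large values may reduce quality.
leaf_estimation_method: The method used to calculate leaf values, Newton or Gradient. Default: Gradient.
leaf_estimation_iterations: The number of gradient steps taken when calculating leaf values.
leaf_estimation_backtracking: The type of backtracking to use during gradient descent.
fold_len_multiplier: The coefficient for changing the length of folds. It must be greater than 1, and the best results are obtained with small values. Default: 2.
approx_on_full_history: How approximated values are calculated. False: use only a 1/fold_len_multiplier fraction of the fold; True: use all preceding rows in the fold. Default: False.
class_weights: The weights of the classes. Default: None.
scale_pos_weight: The weight of class 1 in binary classification. The value is used as a multiplier for the weights of objects from class 1.
boosting_type: The boosting scheme (Ordered or Plain).
allow_const_label: Allows training a model on a dataset in which all objects have the same label value. Default: False.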
Default CatBoost parameters:

'iterations': 1000,
'learning_rate': 0.03,
'l2_leaf_reg': 3,
'bagging_temperature': 1,
'subsample': 0.66,
'random_strength': 1,
'depth': 6,
'rsm': 1,
'one_hot_max_size': 2,
'leaf_estimation_method': 'Gradient',
'fold_len_multiplier': 2,
'border_count': 128,
Value ranges for CatBoost parameters:

'learning_rate': Log-uniform distribution [e^{-7}, 1]
'random_strength': Discrete uniform distribution over a set {1, 20}
'one_hot_max_size': Discrete uniform distribution over a set {0, 25}
'l2_leaf_reg': Log-uniform distribution [1, 10]
'bagging_temperature': Uniform [0, 1]
'gradient_iterations': Discrete uniform distribution over a set {1, 10}
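These ranges are the kind of distributions used for random search. The sketch below draws candidate settings from them using only the standard library. It assumes that each set {a, b} denotes the integers from a to b inclusive, and the helper name sample_params is hypothetical.

```python
import math
import random

def sample_params(rng):
    """Draw one candidate parameter setting from the ranges listed above."""
    return {
        # Log-uniform on [e^-7, 1]: sample the exponent uniformly.
        'learning_rate': math.exp(rng.uniform(-7.0, 0.0)),
        # Assumption: {1, 20} is read as the integers 1..20 inclusive.
        'random_strength': rng.randint(1, 20),
        # Assumption: {0, 25} is read as the integers 0..25 inclusive.
        'one_hot_max_size': rng.randint(0, 25),
        # Log-uniform on [1, 10].
        'l2_leaf_reg': math.exp(rng.uniform(0.0, math.log(10.0))),
        'bagging_temperature': rng.uniform(0.0, 1.0),
        # Assumption: {1, 10} is read as the integers 1..10 inclusive.
        'gradient_iterations': rng.randint(1, 10),
    }

rng = random.Random(42)  # fixed seed so the draw is reproducible
candidates = [sample_params(rng) for _ in range(5)]
for c in candidates:
    print(c)
```

Sampling the exponent uniformly spends as many trials between 0.001 and 0.01 as between 0.1 and 1, which is usually what is wanted for a learning rate.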