XGBoost + Python: a brief introduction to parameters and basic usage

Official parameter documentation (English):

http://xgboost.readthedocs.io/en/latest/how_to/param_tuning.html
http://xgboost.readthedocs.io/en/latest/parameter.html

Partially translated Chinese version:

http://blog.csdn.net/zc02051126/article/details/46711047

1. Introduction to XGBoost parameters

  1. Controlling overfitting
    • Directly control model complexity
      • max_depth, min_child_weight, gamma
    • Add randomness to tree construction
      • subsample, colsample_bytree
      • eta, num_round (reduce eta, and remember to increase num_round when you do so)
  2. Handling imbalanced datasets
    • If you care about the ranking of predictions (AUC)
      • scale_pos_weight
    • If you care about the reliability of the predicted probabilities
      • max_delta_step
  3. Parameters in detail (a configuration sketch combining several of them follows this list)
    • booster: [default=gbtree], either gbtree or gblinear; gbtree boosts tree-based models, gblinear boosts linear models
    • silent: [default=0], whether to print run-time messages; 0 prints them, 1 runs silently
    • nthread: [default: maximum number of threads available], number of parallel threads used at runtime
    • num_pbuffer: [set automatically by XGBoost, no need to set it yourself], size of the prediction buffer, normally the number of training instances
    • num_feature: [set automatically by XGBoost, no need to set it yourself], feature dimension
    • eta: [default=0.3], range [0,1], learning rate, i.e. the step-size shrinkage applied at each boosting iteration
    • gamma: [default=0], range [0,$\infty$], minimum loss reduction required to make a further split
    • max_depth: [default=6], range [0,$\infty$], maximum depth of a tree
    • min_child_weight: [default=1], range [0,$\infty$], minimum sum of instance weights required in a child; if the sum of instance weights in a node falls below this threshold, the node is not split further
    • max_delta_step: [default=0], range [0,$\infty$], maximum delta step allowed for each tree's weight estimation; 0 means no constraint
    • subsample: [default=1], range (0,1], fraction of the training instances randomly sampled to grow each tree
    • colsample_bytree: [default=1], range (0,1], fraction of features (columns) sampled when constructing each tree
    • colsample_bylevel: [default=1], range (0,1], fraction of features (columns) sampled for each split, at each depth level
    • lambda: [default=1], L2 regularization weight
    • alpha: [default=0], L1 regularization weight
    • tree_method: string, [default='auto'], the tree construction algorithm used by XGBoost; one of 'auto', 'exact', 'approx', 'hist'
    • lambda_bias: L2 regularization on the bias term (gblinear booster)
    • sketch_eps: [default=0.03], only used by the approximate greedy algorithm (tree_method='approx')
    • scale_pos_weight: [default=1], controls the balance between positive and negative instance weights
    • updater: [default='grow_colmaker,prune'], a comma-separated list of tree updaters, providing a modular way to construct trees; normally it does not need to be set by the user
    • refresh_leaf: [default=1], refresh parameter: if 1, both leaf values and tree node statistics are refreshed, otherwise only the node statistics are refreshed
    • process_type: [default='default'], the type of boosting process to run
    • grow_policy: string [default='depthwise'], controls how new nodes are added to the tree; 'depthwise' splits the nodes closest to the root, 'lossguide' splits the nodes with the largest change in the loss function
    • max_leaves: [default=0], maximum number of nodes to add; only relevant for the 'lossguide' grow policy
    • max_bin: [default=256], only used when tree_method is 'hist'
    • objective: [default=reg:linear], defines the learning task and the corresponding learning objective; the available objectives include:
      • "reg:linear", linear regression.
      • "reg:logistic", logistic regression.
      • "binary:logistic", logistic regression for binary classification; the output is a probability.
      • "binary:logitraw", logistic regression for binary classification; the output is the raw score wTx before the logistic transformation.
      • "count:poisson", Poisson regression for count data; the output is the mean of the Poisson distribution.
        In Poisson regression the default value of max_delta_step is 0.7 (used to safeguard optimization).
      • "multi:softmax", multiclass classification with the softmax objective; the parameter num_class (number of classes) must also be set.
      • "multi:softprob", same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped into an ndata-by-nclass matrix. Each row gives the probability of the sample belonging to each class.
      • "rank:pairwise", set XGBoost to do ranking tasks by minimizing the pairwise loss.
    • base_score: [default=0.5], the initial prediction score of all instances (global bias).
    • eval_metric: [default depends on the objective], evaluation metric(s) for the validation data; a default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). Users can add multiple evaluation metrics. Python users should pass them as a list of parameter pairs rather than a map, so that later metrics do not overwrite earlier ones (see the sketch after this list).
      • the available values for 'eval_metric' are:
        • "rmse": root mean square error
        • "logloss": negative log-likelihood
        • "error": binary classification error rate, computed as (wrongly classified samples) / (all samples)
        • "merror": multiclass classification error rate, computed in the same way
        • "auc": area under the curve, for ranking evaluation
        • "ndcg": Normalized Discounted Cumulative Gain
        • "map": Mean Average Precision
        • "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation
        • "ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1; by adding "-" to the metric name, XGBoost evaluates these scores as 0, to be consistent under some conditions
    • seed: [default=0], random number seed
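
As a concrete illustration of the native (non-sklearn) training API, the sketch below puts several of the parameters above together: complexity and randomness controls against overfitting, scale_pos_weight for an imbalanced binary problem, and eval_metric supplied twice as (key, value) pairs so that both metrics are reported. The data variables X and y, the train/validation split, and all parameter values are assumptions chosen for illustration, not recommendations.

import xgboost as xgb

# X, y are assumed to be a numeric feature matrix and binary labels (illustrative only)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eta': 0.1,                  # smaller step size; increase num_boost_round to compensate
    'max_depth': 4,              # limit model complexity
    'min_child_weight': 5,
    'gamma': 1.0,
    'subsample': 0.8,            # add randomness per tree
    'colsample_bytree': 0.8,
    'scale_pos_weight': 10,      # e.g. roughly 10x more negatives than positives
    'seed': 0,
}
# Pass eval_metric as repeated (key, value) pairs so the second metric
# does not overwrite the first (see the eval_metric note above).
plst = list(params.items()) + [('eval_metric', 'auc'), ('eval_metric', 'logloss')]
bst = xgb.train(plst, dtrain, num_boost_round=200, evals=[(dvalid, 'valid')])
pred = bst.predict(dvalid)       # predicted probabilities for binary:logistic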

2. Basic usage of XGBoost

import xgboost as xgb
# Set the parameters you need here
gbm = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
# Fit on the training data
gbm.fit(train_X, train_y)
# Predict
predictions = gbm.predict(test_X)
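
The sklearn-style wrapper can also report class probabilities and monitor a held-out set while fitting. The following is a minimal sketch assuming valid_X and valid_y hold a validation split (these names are illustrative and not part of the snippet above):

# Minimal sketch: monitor a validation set and obtain probabilities.
# valid_X / valid_y are assumed to be a held-out split (hypothetical names).
gbm = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
gbm.fit(train_X, train_y, eval_set=[(valid_X, valid_y)], verbose=False)
proba = gbm.predict_proba(test_X)   # shape: (n_samples, n_classes)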

An example from a Kaggle competition:

https://www.kaggle.com/cbrogan/xgboost-example-python/code/code

# This script shows you how to make a submission using a few
# useful Python libraries.
# It gets a public leaderboard score of 0.76077.
# Maybe you can tweak it and do better...?

import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Load the data
train_df = pd.read_csv('../input/train.csv', header=0)
test_df = pd.read_csv('../input/test.csv', header=0)

# We'll impute missing values using the median for numeric columns and the most
# common value for string columns.
# This is based on some nice code by 'sveitser' at http://stackoverflow.com/a/25562948
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].median() for c in X],
            index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

feature_columns_to_use = ['Pclass','Sex','Age','Fare','Parch']
nonnumeric_columns = ['Sex']

# Join the features from train and test together before imputing missing values,
# in case their distribution is slightly different
big_X = pd.concat([train_df[feature_columns_to_use], test_df[feature_columns_to_use]])
big_X_imputed = DataFrameImputer().fit_transform(big_X)

# XGBoost doesn't (yet) handle categorical features automatically, so we need to change
# them to columns of integer values.
# See http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing for more
# details and options
le = LabelEncoder()
for feature in nonnumeric_columns:
    big_X_imputed[feature] = le.fit_transform(big_X_imputed[feature])

# Prepare the inputs for the model
train_X = big_X_imputed[0:train_df.shape[0]].values
test_X = big_X_imputed[train_df.shape[0]::].values
train_y = train_df['Survived']

# You can experiment with many other options here, using the same .fit() and .predict()
# methods; see http://scikit-learn.org
# This example uses the current build of XGBoost, from https://github.com/dmlc/xgboost
gbm = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05).fit(train_X, train_y)
predictions = gbm.predict(test_X)

# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("submission.csv", index=False)
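
As a possible follow-up (not part of the original Kaggle script), the number of boosting rounds could be chosen by cross-validation with the native xgb.cv helper. The parameter values below are illustrative assumptions rather than tuned settings:

# Sketch: choose the number of boosting rounds by cross-validation (illustrative values).
dtrain = xgb.DMatrix(train_X, label=train_y)
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.05,
          'eval_metric': 'logloss'}
cv_results = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                    early_stopping_rounds=20, seed=0)
print(len(cv_results))   # number of rounds kept after early stopping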