Complete Guide to XGBoost Parameter Tuning in Python - Introduction

I came across a very well-written article on Analytics Vidhya, <Complete Guide to Parameter Tuning in XGBoost in Python>, so I decided to translate it, both to share it and to leave myself a deeper impression. The content below mainly conveys the key points of the article.

The original article:

http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Following the sections of the original, this guide is divided into three parts; this post covers the first part.

1. Introduction to XGBoost
2. Understanding the parameters
3. Parameter tuning


Introduction

If things don't go your way in predictive modeling, use XGBoost. The XGBoost algorithm has become the ultimate weapon of many data scientists. It's a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data.

If you run into trouble with your predictive model, give XGBoost a try. XGBoost is a refined and powerful model that can handle all kinds of irregular data.

Building a model using XGBoost is easy. But improving the model using XGBoost is difficult (at least I struggled a lot). This algorithm uses multiple parameters, and to improve the model, parameter tuning is a must. It is very difficult to get answers to practical questions like: Which set of parameters should you tune? What is the ideal value of these parameters to obtain the optimal output?

Building a model with XGBoost is simple, but pushing the model further with XGBoost is hard. The reason is that XGBoost has many tunable parameters, and choosing which parameters to tune and finding suitable values for them is not an easy task.

This article is best suited to people who are new to XGBoost. In this article, we'll learn the art of parameter tuning along with some useful information about XGBoost. Also, we'll practice this algorithm using a data set in Python.

What should you know?

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. Since I covered Gradient Boosting Machine in detail in my previous article, Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python, I highly recommend going through that before reading further. It will help you bolster your understanding of boosting in general and parameter tuning for GBM.

Special Thanks: Personally, I would like to acknowledge the timeless support provided by Mr. Sudalai Rajkumar (aka SRK), currently AV Rank 2. This article wouldn't be possible without his help. He is helping us guide thousands of data scientists. A big thanks to SRK!

 

Table of Contents

  1. The XGBoost Advantage
  2. Understanding XGBoost Parameters
  3. Tuning Parameters (with Example)

 

1. The XGBoost Advantage

I’ve always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored more about its performance and the science behind its high accuracy, I discovered many advantages:

  1. Regularization:
    • A standard GBM implementation has no regularization like XGBoost, so the regularization also helps to reduce overfitting.
    • In fact, XGBoost is also known as a ‘regularized boosting’ technique.
    • Because standard GBM lacks this regularization, XGBoost tends to overfit less.
  2. Parallel Processing:
    • XGBoost implements parallel processing and is blazingly fast compared to GBM.
    • This parallel execution is a large part of why XGBoost is so much faster than GBM.
    • But hang on, we know that boosting is a sequential process, so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I'm coming from. Check this link out to explore further.
    • Boosting is a sequential process; the article at the link above explains how XGBoost still manages to parallelize it.
    • XGBoost also supports implementation on Hadoop.

  3. High Flexibility:
    • XGBoost allows users to define custom optimization objectives and evaluation criteria.
    • This adds a whole new dimension to the model and there is no limit to what we can do.
    • User-defined objectives and evaluation metrics give the model far greater possibilities (see the first sketch after this list).
  4. Handling Missing Values:
    • XGBoost has an in-built routine to handle missing values.
    • In other words, missing-value handling comes built in (see the second sketch after this list).
    • The user is required to supply a value different from other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in the future.
    • That is, the user passes a sentinel value (distinct from all other observations) as a parameter, and at each node XGBoost learns which branch missing values should follow in the future.
  5. Tree Pruning:
    • A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
    • XGBoost, on the other hand, makes splits up to the specified max_depth and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
    • In other words, XGBoost first grows all the subtrees it can from top to bottom and then prunes from the bottom up, which makes it less likely than GBM to get stuck in a locally optimal solution.
    • Another advantage is that sometimes a split of negative loss, say -2, may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper, see a combined effect of +8 for the splits, and keep both.
  6. Built-in Cross-Validation:
    • XGBoost allows users to run a cross-validation at each iteration of the boosting process, so it is easy to get the exact optimum number of boosting iterations in a single run.
    • Cross-validating inside the boosting loop makes it easy to find the optimal number of rounds in one run (see the third sketch after this list).
    • This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
  7. Continue on Existing Model:
    • Users can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain specific applications.
    • The GBM implementation of sklearn also has this feature, so the two are even on this point.
    • Both sklearn's GBM and XGBoost support picking up training from where a previous run stopped (see the last sketch after this list).
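
To make the flexibility point (item 3) concrete, here is a minimal sketch of passing a custom objective and a custom evaluation metric to xgb.train. The toy data, the logistic-loss gradient/hessian and the 'my-error' metric are illustrative choices of mine rather than anything from the original article, and newer xgboost releases may prefer the custom_metric argument over the older feval used here.

import numpy as np
import xgboost as xgb

# Toy binary-classification data, just to have something to train on
rng = np.random.RandomState(0)
X = rng.randn(500, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

def logloss_obj(preds, dtrain):
    # Custom objective: gradient and hessian of the logistic loss,
    # evaluated on the raw margin scores XGBoost passes in.
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))
    return p - labels, p * (1.0 - p)

def error_eval(preds, dtrain):
    # Custom evaluation metric: plain classification error on raw margins.
    labels = dtrain.get_label()
    return 'my-error', float(np.sum((preds > 0.0) != labels)) / len(labels)

bst = xgb.train({'max_depth': 3, 'eta': 0.1}, dtrain, num_boost_round=50,
                obj=logloss_obj, feval=error_eval, evals=[(dtrain, 'train')])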
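
Likewise, a small sketch of the missing-value handling in item 4; the -999.0 sentinel is an arbitrary value chosen only for illustration (by default XGBoost treats np.nan as missing).

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(1)
X = rng.randn(500, 10)
y = (X[:, 0] > 0).astype(int)

# Mark roughly 10% of the entries as "missing" with a sentinel value
X[rng.rand(*X.shape) < 0.1] = -999.0

# Telling DMatrix which value means "missing" lets XGBoost learn a
# default direction for missing values at every split.
dtrain = xgb.DMatrix(X, label=y, missing=-999.0)
bst = xgb.train({'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1},
                dtrain, num_boost_round=50)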
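
The built-in cross-validation of item 6 is easiest to see with xgb.cv; this is a rough sketch assuming an AUC metric and early stopping after 20 rounds without improvement.

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(2)
X = rng.randn(1000, 10)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1,
          'eval_metric': 'auc'}

# Cross-validation at every boosting iteration; with early stopping, the
# number of rows returned is (roughly) the optimal num_boost_round.
cv_results = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                    early_stopping_rounds=20, seed=0)
print('best number of rounds:', len(cv_results))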
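
Finally, for item 7, continuing from a previous run is a one-argument change in xgb.train (the sklearn GBM counterpart is its warm_start option); again, this is just a sketch on toy data.

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(3)
X = rng.randn(500, 10)
y = (X[:, 0] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1}

# First run: 50 boosting rounds
bst = xgb.train(params, dtrain, num_boost_round=50)

# Later run: 50 more rounds, starting from the existing booster
bst = xgb.train(params, dtrain, num_boost_round=50, xgb_model=bst)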


Did I whet your appetite? Good. You can refer to the following web pages for a deeper understanding:

