XGBoost: A Scalable Tree Boosting System

ABSTRACT

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Keywords

Large-scale Machine Learning

1. INTRODUCTION

Machine learning and data-driven approaches are becoming very important in many areas. Smart spam classifiers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event detection systems help experimental physicists to find events that lead to new physics. There are two important factors that drive these successful applications: usage of effective (statistical) models that capture the complex data dependencies and scalable learning systems that learn the model of interest from large datasets.

Among the machine learning methods used in practice, gradient tree boosting [10] is one technique that shines in many applications. Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks [16]. LambdaMART [5], a variant of tree boosting for ranking, achieves state-of-the-art results for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click-through rate prediction [15]. Finally, it is the de-facto choice of ensemble method and is used in challenges such as the Netflix prize [3].


In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example. Among the 29 challenge winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top 10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount [1].


These results demonstrate that our system gives state-of-the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detection; ad click-through rate prediction; malware classification; product categorization; hazard risk prediction; massive online course dropout rate prediction. While domain-dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of our system and of tree boosting.


The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm for handling sparse data, and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster, which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundreds of millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources. The major contributions of this paper are listed as follows:


We design and build a highly scalable end-to-end tree boosting system.

We propose a theoretically justified weighted quantile sketch for efficient proposal calculation.

We introduce a novel sparsity-aware algorithm for parallel tree learning.

We propose an effective cache-aware block structure for out-of-core tree learning.

While there are some existing works on parallel tree boosting [22, 23, 19], directions such as out-of-core computation, cache-aware and sparsity-aware learning have not been explored. More importantly, an end-to-end system that combines all of these aspects gives a novel solution for real-world use-cases. This enables data scientists as well as researchers to build powerful variants of tree boosting algorithms [7, 8]. Besides these major contributions, we also make additional improvements in proposing a regularized learning objective, which we include for completeness.

The remainder of the paper is organized as follows. We first review tree boosting and introduce a regularized objective in Sec. 2. We then describe the split finding methods in Sec. 3 as well as the system design in Sec. 4, including experimental results, when relevant, to provide quantitative support for each optimization we describe. Related work is discussed in Sec. 5. Detailed end-to-end evaluations are included in Sec. 6. Finally, we conclude the paper in Sec. 7.


2. TREE BOOSTING IN A NUTSHELL

We review gradient tree boosting algorithms in this section. The derivation follows the same idea as the existing literature on gradient boosting. Specifically, the second-order method originates from Friedman et al. [12]. We make minor improvements in the regularized objective, which were found helpful in practice.


2.1 Regularized Learning Objective

Figure 1: Tree ensemble model. The final prediction for a given example is the sum of the predictions from each tree.

A tree ensemble model uses K additive functions to predict the output, ŷ_i = φ(x_i) = Σ_{k=1}^{K} f_k(x_i), where each f_k is a regression tree defined by a structure q, which maps an example to a leaf index, and leaf weights w over its T leaves. For a given example, the decision rules in the trees (given by q) classify it into the leaves, and the final prediction is computed by summing up the scores in the corresponding leaves (given by w). To learn the set of functions used in the model, we minimize the following regularized objective.

L(φ) = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k),   where Ω(f) = γT + (1/2) λ ‖w‖².    (2)

Here l is a differentiable convex loss function that measures the difference between the prediction ŷ_i and the target y_i. The second term Ω penalizes the complexity of the model (i.e., the regression tree functions). The additional regularization term helps to smooth the final learnt weights to avoid overfitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions. A similar regularization technique has been used in the Regularized Greedy Forest (RGF) [25] model. Our objective and the corresponding learning algorithm are simpler than RGF and easier to parallelize. When the regularization parameter is set to zero, the objective falls back to traditional gradient tree boosting.
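To make the two penalty terms concrete, here is a minimal sketch (not from the XGBoost codebase; the function name and example values are illustrative) that evaluates Ω(f) = γT + (1/2)λ‖w‖² for a single tree:

```python
import numpy as np

def tree_regularization(leaf_weights, gamma=1.0, reg_lambda=1.0):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    w = np.asarray(leaf_weights, dtype=float)
    T = w.size                                  # number of leaves
    return gamma * T + 0.5 * reg_lambda * np.dot(w, w)

# A tree with three leaves: more leaves and larger leaf scores cost more.
print(tree_regularization([0.4, -0.3, 0.1], gamma=1.0, reg_lambda=1.0))
```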


2.2 Gradient Tree Boosting

Let ŷ_i^(t) denote the prediction of the i-th instance at the t-th iteration. At each iteration we greedily add the tree f_t that most improves the objective, L^(t) = Σ_i l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t). Taking a second-order Taylor approximation of the loss gives the simplified objective L̃^(t) = Σ_i [ g_i f_t(x_i) + (1/2) h_i f_t(x_i)^2 ] + Ω(f_t), where g_i and h_i are the first- and second-order gradients of the loss with respect to the previous prediction ŷ_i^(t−1). Defining I_j = {i | q(x_i) = j} as the instance set of leaf j, the objective can be rewritten as a sum over the leaves, which yields a closed-form optimal weight for each leaf and a corresponding optimal objective value.

Figure 2: Structure score calculation. We only need to sum up the gradient and second-order gradient statistics on each leaf, then apply the scoring formula to get the quality score.

For a fixed structure q, the optimal weight of leaf j and the corresponding optimal objective value are

w_j* = − (Σ_{i∈I_j} g_i) / (Σ_{i∈I_j} h_i + λ),    (5)

L̃^(t)(q) = − (1/2) Σ_{j=1}^{T} (Σ_{i∈I_j} g_i)^2 / (Σ_{i∈I_j} h_i + λ) + γT.    (6)

Eq (6) can be used as a scoring function to measure the quality of a tree structure q. This score is like the impurity score for evaluating decision trees, except that it is derived for a wider range of objective functions. Fig. 2 illustrates how this score can be calculated.
Since it is usually impossible to enumerate all possible tree structures q, a greedy algorithm that starts from a single leaf and iteratively adds branches is used instead. Letting I_L and I_R be the instance sets of the left and right nodes after a split, with I = I_L ∪ I_R, the loss reduction of the split is

L_split = (1/2) [ (Σ_{i∈I_L} g_i)^2 / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)^2 / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)^2 / (Σ_{i∈I} h_i + λ) ] − γ.    (7)

This formula is usually used in practice for evaluating the split candidates.
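As an illustration of Eq (6) and Eq (7), the following sketch scores a candidate split from gradient statistics; the helper names are invented for this example and are not part of the XGBoost implementation:

```python
import numpy as np

def leaf_score(g_sum, h_sum, reg_lambda=1.0):
    # The per-leaf term of Eq (6): (sum g)^2 / (sum h + lambda).
    return g_sum ** 2 / (h_sum + reg_lambda)

def split_gain(g, h, left_mask, reg_lambda=1.0, gamma=0.0):
    """Eq (7): loss reduction of splitting instance set I into I_L and I_R."""
    g, h = np.asarray(g, float), np.asarray(h, float)
    left_mask = np.asarray(left_mask, bool)
    gl, hl = g[left_mask].sum(), h[left_mask].sum()
    gr, hr = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (leaf_score(gl, hl, reg_lambda)
                  + leaf_score(gr, hr, reg_lambda)
                  - leaf_score(gl + gr, hl + hr, reg_lambda)) - gamma
```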

2.3 Shrinkage and Column Subsampling

Besides the regularized objective mentioned in Sec. 2.1, two additional techniques are used to further prevent overfitting. The first technique is shrinkage, introduced by Friedman [11]. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model. The second technique is column (feature) subsampling. This technique is used in RandomForest [4, 13] and is implemented in the commercial software TreeNet for gradient boosting, but is not implemented in existing open-source packages. According to user feedback, using column subsampling prevents overfitting even more so than the traditional row subsampling (which is also supported). The usage of column subsamples also speeds up computations of the parallel algorithm described later.
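In the open-source Python package, shrinkage and column subsampling are exposed as ordinary training parameters; a minimal, self-contained sketch on synthetic data (parameter values chosen only for illustration):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 10)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eta": 0.1,               # shrinkage applied to newly added weights
    "colsample_bytree": 0.8,  # column (feature) subsampling per tree
    "subsample": 0.9,         # traditional row subsampling, also supported
}
bst = xgb.train(params, dtrain, num_boost_round=50)
```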


3. SPLIT FINDING ALGORITHMS

3.1 Basic Exact Greedy Algorithm

One of the key problems in tree learning is to find the best split as indicated by Eq (7). In order to do so, a split finding algorithm enumerates over all the possible splits on all the features. We call this the exact greedy algorithm. Most existing single machine tree boosting implementations, such as scikit-learn [20], R's gbm [21] as well as the single machine version of XGBoost, support the exact greedy algorithm. The exact greedy algorithm is shown in Alg. 1. It is computationally demanding to enumerate all the possible splits for continuous features. In order to do so efficiently, the algorithm must first sort the data according to feature values and visit the data in sorted order to accumulate the gradient statistics for the structure score in Eq (7).
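The scan described above can be sketched for a single continuous feature as follows; this is an illustration of the idea behind Alg. 1, not the actual C++ implementation:

```python
import numpy as np

def best_split_one_feature(x, g, h, reg_lambda=1.0, gamma=0.0):
    """Scan pre-sorted feature values, accumulating g/h to score each split."""
    x, g, h = (np.asarray(a, float) for a in (x, g, h))
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    gl = hl = 0.0
    best_gain, best_threshold = -np.inf, None
    for i in range(len(x) - 1):
        gl += g[i]; hl += h[i]
        if x[i] == x[i + 1]:            # cannot split between equal values
            continue
        gr, hr = G - gl, H - hl
        gain = 0.5 * (gl**2 / (hl + reg_lambda) + gr**2 / (hr + reg_lambda)
                      - G**2 / (H + reg_lambda)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_threshold
```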


3.2 Approximate Algorithm

The exact greedy algorithm is very powerful since it enumerates over all possible splitting points greedily. However, it is impossible to do so efficiently when the data does not fit entirely into memory. The same problem also arises in the distributed setting. To support effective gradient tree boosting in these two settings, an approximate algorithm is needed.

We summarize an approximate framework, which resembles the ideas proposed in past literature [17, 2, 22], in Alg. 2. To summarize, the algorithm first proposes candidate splitting points according to percentiles of the feature distribution (a specific criterion will be given in Sec. 3.3). The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.
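A naive single-feature sketch of this framework, assuming unweighted percentile proposals for simplicity (Sec. 3.3 replaces these with the weighted variant); the function names are illustrative only:

```python
import numpy as np

def propose_candidates(x, num_buckets=32):
    # Candidate split points from percentiles of the feature distribution.
    qs = np.linspace(0, 100, num_buckets + 1)[1:-1]
    return np.unique(np.percentile(x, qs))

def bucket_statistics(x, g, h, candidates):
    # Map values into buckets split by the candidates and aggregate g, h.
    bucket = np.searchsorted(candidates, x, side="right")
    n = len(candidates) + 1
    G = np.bincount(bucket, weights=g, minlength=n)
    H = np.bincount(bucket, weights=h, minlength=n)
    return G, H   # the best split is then found by scanning these aggregates
```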

There are two variants of the algorithm, depending on when the proposal is given. The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split. The global method requires fewer proposal steps than the local method. However, usually more candidate points are needed for the global proposal because candidates are not refined after each split. The local proposal refines the candidates after splits, and can potentially be more appropriate for deeper trees. A comparison of different algorithms on a Higgs boson dataset is given in Fig. 3. We find that the local proposal indeed requires fewer candidates. The global proposal can be as accurate as the local one given enough candidates.

Most existing approximate algorithms for distributed tree learning also follow this framework. Notably, it is also possible to directly construct approximate histograms of gradient statistics [22]. It is also possible to use other variants of binning strategies instead of quantiles [17]. The quantile strategy benefits from being distributable and recomputable, which we will detail in the next subsection. From Fig. 3, we also find that the quantile strategy can get the same accuracy as exact greedy given a reasonable approximation level.

Our system efficiently supports exact greedy for the single machine setting, as well as the approximate algorithm with both local and global proposal methods for all settings. Users can freely choose between the methods according to their needs.
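In the released package this choice is exposed through training parameters; the names below reflect the package around the time of the paper and may differ in later releases, so treat them as an assumption:

```python
# Single-machine exact greedy algorithm (Alg. 1).
params_exact = {"tree_method": "exact"}

# Approximate algorithm (Alg. 2) with a sketch accuracy of roughly
# 1/sketch_eps candidate points per feature.
params_approx = {"tree_method": "approx", "sketch_eps": 0.03}
```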


3.3 Weighted Quantile Sketch

One important step in the approximate algorithm is to propose candidate split points. Usually percentiles of a feature are used to make candidates distribute evenly on the data. Formally, let the multi-set D_k = {(x_{1k}, h_1), (x_{2k}, h_2), ..., (x_{nk}, h_n)} represent the k-th feature values and second-order gradient statistics of each training instance. We can define a rank function r_k : R → [0, +∞) as


r_k(z) = (1 / Σ_{(x,h)∈D_k} h) · Σ_{(x,h)∈D_k, x<z} h,    (8)

which represents the proportion of instances whose k-th feature value is smaller than z. The goal is to find candidate split points {s_{k1}, s_{k2}, ..., s_{kl}} such that adjacent candidates differ in rank by less than an approximation factor ε, so that there are roughly 1/ε candidate points. Here each data point is weighted by h_i; to see why h_i represents the weight, note that the second-order objective can be rewritten as Σ_i (1/2) h_i (f_t(x_i) − g_i/h_i)^2 + Ω(f_t) + constant,

which is exactly a weighted squared loss with labels g_i/h_i and weights h_i. For large datasets, it is non-trivial to find candidate splits that satisfy the criteria. When every instance has equal weight, an existing algorithm called quantile sketch [14, 24] solves the problem. However, there is no existing quantile sketch for weighted datasets. Therefore, most existing approximate algorithms either resorted to sorting on a random subset of data, which has a chance of failure, or to heuristics that do not have a theoretical guarantee.

To solve this problem, we introduce a novel distributed weighted quantile sketch algorithm that can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level. A detailed description of the algorithm as well as proofs are given in the supplementary material (link in the footnote).
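To illustrate the quantity being approximated, the sketch below computes the h-weighted rank of Eq (8) exactly and picks roughly 1/ε evenly spaced candidates; it is a naive reference computation, not the merge/prune sketch data structure from the supplementary material:

```python
import numpy as np

def weighted_rank(xk, h, z):
    """r_k(z): fraction of total h-weight with feature value strictly below z."""
    xk, h = np.asarray(xk, float), np.asarray(h, float)
    return h[xk < z].sum() / h.sum()

def weighted_candidates(xk, h, eps=0.1):
    """Pick candidates near each eps-quantile of the h-weighted distribution."""
    xk, h = np.asarray(xk, float), np.asarray(h, float)
    order = np.argsort(xk)
    xs, hs = xk[order], h[order]
    cum = np.cumsum(hs) / hs.sum()          # weighted rank at each sorted value
    targets = np.arange(eps, 1.0, eps)      # roughly 1/eps candidate points
    idx = np.clip(np.searchsorted(cum, targets), 0, len(xs) - 1)
    return np.unique(xs[idx])
```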


3.4 Sparsity-aware Split Finding


In many real-world problems, it is quite common for the input x to be sparse. There are multiple possible causes of sparsity: 1) presence of missing values in the data; 2) frequent zero entries in the statistics; and 3) artifacts of feature engineering such as one-hot encoding. It is important to make the algorithm aware of the sparsity pattern in the data. In order to do so, we propose to add a default direction in each tree node, as shown in Fig. 4. When a value is missing in the sparse matrix x, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default directions are learnt from the data. The algorithm is shown in Alg. 3. The key improvement is to only visit the non-missing entries I_k. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values. The same algorithm can also be applied when the non-presence corresponds to a user-specified value, by limiting the enumeration only to consistent solutions.

To the best of our knowledge, most existing tree learning algorithms are either only optimized for dense data, or need specific procedures to handle limited cases such as categorical encoding. XGBoost handles all sparsity patterns in a unified way. More importantly, our method exploits the sparsity to make computation complexity linear in the number of non-missing entries in the input. Fig. 5 shows the comparison of the sparsity-aware and a naive implementation on an Allstate-10K dataset (description of the dataset given in Sec. 6). We find that the sparsity-aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity-aware algorithm.
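The core idea of Alg. 3 can be sketched for one feature as follows: only non-missing entries are scanned, and the statistics of the missing entries implicitly follow whichever default direction yields the larger gain (an illustrative simplification, not the production code):

```python
import numpy as np

def best_split_with_default(x, g, h, reg_lambda=1.0):
    """x may contain np.nan; missing entries follow the learnt default direction."""
    x, g, h = (np.asarray(a, float) for a in (x, g, h))
    present = ~np.isnan(x)
    G, H = g.sum(), h.sum()                      # totals include missing entries
    order = np.argsort(x[present])
    xp, gp, hp = x[present][order], g[present][order], h[present][order]

    def score(gs, hs):
        return gs * gs / (hs + reg_lambda)

    best_gain, best_thr, best_default_left = -np.inf, None, None

    # Default direction = right: scan ascending, missing mass stays on the right.
    gl = hl = 0.0
    for i in range(len(xp) - 1):
        gl += gp[i]; hl += hp[i]
        if xp[i] == xp[i + 1]:
            continue
        gain = 0.5 * (score(gl, hl) + score(G - gl, H - hl) - score(G, H))
        if gain > best_gain:
            best_gain, best_thr, best_default_left = gain, 0.5 * (xp[i] + xp[i + 1]), False

    # Default direction = left: scan descending, missing mass stays on the left.
    gr = hr = 0.0
    for i in range(len(xp) - 1, 0, -1):
        gr += gp[i]; hr += hp[i]
        if xp[i] == xp[i - 1]:
            continue
        gain = 0.5 * (score(G - gr, H - hr) + score(gr, hr) - score(G, H))
        if gain > best_gain:
            best_gain, best_thr, best_default_left = gain, 0.5 * (xp[i] + xp[i - 1]), True

    return best_gain, best_thr, best_default_left
```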


4. SYSTEM DESIGN

4.1 Column Block for Parallel Learning

The most time consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we call blocks. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout only needs to be computed once before training, and can be reused in later iterations.

In the exact greedy algorithm, we store the entire dataset in a single block and run the split search algorithm by linearly scanning over the pre-sorted entries. We do the split finding of all leaves collectively, so one scan over the block will collect the statistics of the split candidates in all leaf branches. Fig. 6 shows how we transform a dataset into the format and find the optimal split using the block structure.

The block structure also helps when using the approximate algorithms. Multiple blocks can be used in this case, with each block corresponding to a subset of rows in the dataset. Different blocks can be distributed across machines, or stored on disk in the out-of-core setting. Using the sorted structure, the quantile finding step becomes a linear scan over the sorted columns. This is especially valuable for local proposal algorithms, where candidates are generated frequently at each branch. The binary search in histogram aggregation also becomes a linear time merge style algorithm. Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split finding. Importantly, the column block structure also supports column subsampling, as it is easy to select a subset of columns in a block.
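A much-simplified sketch of the presorted column layout: each column is argsorted once before training, and every later scan is a linear pass that looks up g and h by row index (the real system stores this in compressed CSC blocks):

```python
import numpy as np

class SortedColumns:
    """Presorted feature columns, computed once and reused every iteration."""
    def __init__(self, X):
        self.X = np.asarray(X, float)
        # For each column: row indices in ascending order of that feature.
        self.order = [np.argsort(self.X[:, j], kind="stable")
                      for j in range(self.X.shape[1])]

    def scan(self, j, g, h):
        """Yield (feature_value, g_i, h_i) in sorted order for column j."""
        for i in self.order[j]:
            yield self.X[i, j], g[i], h[i]
```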

Figure 7: Impact of cache-aware prefetching in the exact greedy algorithm. We find that the cache-miss effect impacts performance on the large dataset (10 million instances). Using cache-aware prefetching improves performance by a factor of two when the dataset is large.

Figure 8: Short-range data dependency pattern that can cause a stall due to cache misses.
Figure 9: The impact of block size in the approximate algorithm. We find that overly small blocks result in inefficient parallelization, while overly large blocks slow down training due to cache misses.

Time Complexity Analysis

Let d be the maximum depth of the tree and K be the total number of trees. For the exact greedy algorithm, the time complexity of the original sparsity-aware algorithm is O(K d ‖x‖₀ log n), where ‖x‖₀ denotes the number of non-missing entries in the training data. Tree boosting on the block structure only costs O(K d ‖x‖₀ + ‖x‖₀ log n), where the second term is the one-time preprocessing cost that can be amortized. For the approximate algorithm, the block structure reduces the cost from O(K d ‖x‖₀ log q) to O(K d ‖x‖₀ + ‖x‖₀ log B), where q is the number of proposal candidates and B is the maximum number of rows in each block. In both cases the block structure saves an additional log factor during training.

4.2 Cache-aware Access

While the proposed block structure helps optimize the computation complexity of split finding, the new algorithm requires indirect fetches of gradient statistics by row index, since these values are accessed in order of feature. This is a non-continuous memory access. A naive implementation of split enumeration introduces an immediate read/write dependency between the accumulation and the non-continuous memory fetch operation (see Fig. 8). This slows down split finding when the gradient statistics do not fit into the CPU cache and cache misses occur.

For the exact greedy algorithm, we can alleviate the problem with a cache-aware prefetching algorithm. Specifically, we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner. This prefetching changes the direct read/write dependency to a longer dependency and helps to reduce the runtime overhead when the number of rows is large. Figure 7 gives the comparison of the cache-aware vs. non-cache-aware algorithm on the Higgs and the Allstate datasets. We find that the cache-aware implementation of the exact greedy algorithm runs twice as fast as the naive version when the dataset is large.

For approximate algorithms, we solve the problem by choosing a correct block size. We define the block size to be the maximum number of examples contained in a block, as this reflects the cache storage cost of gradient statistics. Choosing an overly small block size results in a small workload for each thread and leads to inefficient parallelization. On the other hand, overly large blocks result in cache misses, as the gradient statistics do not fit into the CPU cache. A good choice of block size balances these two factors. We compared various choices of block size on two data sets. The results are given in Fig. 9. This result validates our discussion and shows that choosing 2^16 examples per block balances the cache property and parallelization.
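The prefetching idea can be sketched as gathering gradient statistics for a mini-batch into a contiguous buffer before accumulating, which turns the short read/write dependency of Fig. 8 into a longer one (illustrative only; the actual implementation does this per thread in C++):

```python
import numpy as np

def cache_aware_scan(row_order, g, h, batch=4096):
    """Accumulate g/h in sorted feature order, prefetching in mini-batches."""
    G = H = 0.0
    for start in range(0, len(row_order), batch):
        idx = row_order[start:start + batch]
        g_buf = g[idx]          # gather (non-contiguous fetch) into a buffer
        h_buf = h[idx]
        G += g_buf.sum()        # accumulation now reads contiguous memory
        H += h_buf.sum()
    return G, H
```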


4.3 Blocks for Out-of-core Computation

One goal of our system is to fully utilize a machine's resources to achieve scalable learning. Besides processors and memory, it is important to utilize disk space to handle data that does not fit into main memory. To enable out-of-core computation, we divide the data into multiple blocks and store each block on disk. During computation, it is important to use an independent thread to pre-fetch the block into a main memory buffer, so computation can happen concurrently with disk reading. However, this does not entirely solve the problem, since the disk reading takes most of the computation time. It is important to reduce the overhead and increase the throughput of disk IO. We mainly use two techniques to improve the out-of-core computation.

Block Compression. The first technique we use is block compression. The block is compressed by columns, and decompressed on the fly by an independent thread when loading into main memory. This helps to trade some of the computation in decompression for the disk reading cost. We use a general purpose compression algorithm for compressing the feature values. For the row index, we subtract the beginning index of the block from the row index and use a 16-bit integer to store each offset. This requires 2^16 examples per block, which is confirmed to be a good setting. In most of the datasets we tested, we achieve roughly a 26% to 29% compression ratio.

Block Sharding. The second technique is to shard the data onto multiple disks in an alternating manner. A pre-fetcher thread is assigned to each disk and fetches the data into an in-memory buffer. The training thread then alternately reads the data from each buffer. This helps to increase the throughput of disk reading when multiple disks are available.
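The row-index trick can be shown directly: each index is stored as a 16-bit offset from the first row of its block, which is why a block holds at most 2^16 examples (a sketch of the idea, not the actual on-disk format):

```python
import numpy as np

BLOCK_ROWS = 1 << 16          # 2^16 examples per block

def encode_block(row_indices, block_start):
    offsets = np.asarray(row_indices) - block_start
    assert offsets.min() >= 0 and offsets.max() < BLOCK_ROWS
    return offsets.astype(np.uint16)   # 2 bytes per index instead of 4 or 8

def decode_block(offsets, block_start):
    return offsets.astype(np.int64) + block_start
```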


5. RELATED WORKS

Our system implements gradient boosting [10], which performs additive optimization in functional space. Gradient tree boosting has been successfully used in classification [12], learning to rank [5], structured prediction [8] as well as other fields. XGBoost incorporates a regularized model to prevent overfitting. This resembles previous work on regularized greedy forest [25], but simplifies the objective and algorithm for parallelization. Column sampling is a simple but effective technique borrowed from RandomForest [4]. While sparsity-aware learning is essential in other types of models such as linear models [9], few works on tree learning have considered this topic in a principled way. The algorithm proposed in this paper is the first unified approach to handle all kinds of sparsity patterns. There are several existing works on parallelizing tree learning [22, 19]. Most of these algorithms fall into the approximate framework described in this paper. Notably, it is also possible to partition data by columns [23] and apply the exact greedy algorithm. This is also supported in our framework, and techniques such as cache-aware prefetching can be used to benefit this type of algorithm. While most existing works focus on the algorithmic aspect of parallelization, our work improves in two unexplored system directions: out-of-core computation and cache-aware learning. This gives us insights on how the system and the algorithm can be jointly optimized and provides an end-to-end system that can handle large scale problems with very limited computing resources. We also summarize the comparison between our system and existing open-source implementations in Table 1.

Quantile summary (without weights) is a classical problem in the database community [14, 24]. However, the approximate tree boosting algorithm reveals a more general problem: finding quantiles on weighted data. To the best of our knowledge, the weighted quantile sketch proposed in this paper is the first method to solve this problem. The weighted quantile summary is also not specific to tree learning and can benefit other applications in data science and machine learning in the future.


6. END TO END EVALUATIONS

6.1 System Implementation

We implemented XGBoost as an open source package. The package is portable and reusable. It supports various weighted classification and rank objective functions, as well as user-defined objective functions. It is available in popular languages such as Python, R and Julia, and integrates naturally with language-native data science pipelines such as scikit-learn. The distributed version is built on top of the rabit library for allreduce. The portability of XGBoost makes it available in many ecosystems, instead of only being tied to a specific platform. The distributed XGBoost runs natively on Hadoop, MPI and Sun Grid Engine. Recently, we also enabled distributed XGBoost on JVM big data stacks such as Flink and Spark. The distributed version has also been integrated into the cloud platform Tianchi of Alibaba. We believe that there will be more integrations in the future.
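For example, the Python package exposes a scikit-learn compatible wrapper; a minimal sketch on synthetic data (parameter values arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = XGBClassifier(n_estimators=100, max_depth=8, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```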


6.2 Dataset and Setup

We used four datasets in our experiments. A summary of these datasets is given in Table 2. In some of the experiments, we use a randomly selected subset of the data either due to slow baselines or to demonstrate the performance of the algorithm with varying dataset size. We use a suffix to denote the size in these cases. For example, Allstate-10K means a subset of the Allstate dataset with 10K instances.

The first dataset we use is the Allstate insurance claim dataset. The task is to predict the likelihood and cost of an insurance claim given different risk factors. In the experiment, we simplified the task to only predict the likelihood of an insurance claim. This dataset is used to evaluate the impact of the sparsity-aware algorithm in Sec. 3.4. Most of the sparse features in this data come from one-hot encoding. We randomly select 10M instances as the training set and use the rest as the evaluation set.

The second dataset is the Higgs boson dataset from high energy physics. The data was produced using Monte Carlo simulations of physics events. It contains 21 kinematic properties measured by the particle detectors in the accelerator. It also contains seven additional derived physics quantities of the particles. The task is to classify whether an event corresponds to the Higgs boson. We randomly select 10M instances as the training set and use the rest as the evaluation set.

The third dataset is the Yahoo! learning to rank challenge dataset [6], which is one of the most commonly used benchmarks in learning to rank algorithms. The dataset contains 20K web search queries, with each query corresponding to a list of around 22 documents. The task is to rank the documents according to relevance to the query. We use the official train/test split in our experiment.

The last dataset is the Criteo terabyte click log dataset. We use this dataset to evaluate the scaling property of the system in the out-of-core and the distributed settings. The data contains 13 integer features and 26 ID features of user, item and advertiser information. Since a tree based model is better at handling continuous features, we preprocess the data by calculating the statistics of average CTR and counts of the ID features on the first ten days, replacing the ID features by the corresponding count statistics during the next ten days for training. The training set after preprocessing contains 1.7 billion instances with 67 features (13 integer, 26 average CTR statistics and 26 counts). The entire dataset is more than one terabyte in LibSVM format.


We use the first three datasets for the single machine parallel setting, and the last dataset for the distributed and out-of-core settings. All the single machine experiments are conducted on a Dell PowerEdge R420 with two eight-core Intel Xeon (E5-2470) (2.3GHz) processors and 64GB of memory. If not specified, all the experiments are run using all the available cores in the machine. The machine settings of the distributed and the out-of-core experiments are described in the corresponding sections. In all the experiments, we boost trees with a common setting of maximum depth equal to 8, shrinkage equal to 0.1 and no column subsampling unless explicitly specified. We find similar results when we use other settings of maximum depth.

Figure 10: Comparison between XGBoost and pGBRT on the Yahoo! LTRC dataset.

Table 4: Comparison of learning to rank with 500 trees on the Yahoo! LTRC dataset.

6.3 Classification

In this section, we evaluate the performance of XGBoost on a single machine using the exact greedy algorithm on the Higgs-1M data, by comparing it against two other commonly used exact greedy tree boosting implementations. Since scikit-learn only handles non-sparse input, we choose the dense Higgs dataset for a fair comparison. We use the 1M subset so that scikit-learn finishes running in a reasonable time. Among the methods in comparison, R's gbm uses a greedy approach that only expands one branch of a tree, which makes it faster but can result in lower accuracy, while both scikit-learn and XGBoost learn a full tree. The results are shown in Table 3. Both XGBoost and scikit-learn give better performance than R's gbm, while XGBoost runs more than 10x faster than scikit-learn. In this experiment, we also find that column subsampling gives slightly worse performance than using all the features. This could be due to the fact that there are few important features in this dataset and we benefit from greedily selecting from all the features.


6.4 Learning to Rank

We next evaluate the performance of XGBoost on the learning to rank problem. We compare against pGBRT [22], the best previously published system on this task. XGBoost runs the exact greedy algorithm, while pGBRT only supports an approximate algorithm. The results are shown in Table 4 and Fig. 10. We find that XGBoost runs faster. Interestingly, subsampling columns not only reduces running time, but also gives slightly higher performance for this problem. This could be due to the fact that the subsampling helps prevent overfitting, which is observed by many of the users.
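The ranking experiments rely on a ranking objective with per-query document groups; a minimal sketch of how this is expressed in the Python package, using synthetic data and arbitrary group sizes:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(300, 30)
y = np.random.randint(0, 5, size=300)        # graded relevance labels
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([22] * 10 + [20] * 4)       # documents per query, summing to 300

params = {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 8}
bst = xgb.train(params, dtrain, num_boost_round=50)
```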


Figure 11: Comparison of out-of-core methods on different subsets of the criteo data. The missing data points are due to running out of disk space. We find that the basic algorithm can only handle 200M examples. Adding compression gives a 3x speedup, and sharding into two disks gives another 2x speedup. The system runs out of file cache starting from 400M examples; after that, the algorithm really has to rely on disk. The compression+shard method has a less dramatic slowdown when the file cache is exhausted, and exhibits a linear trend afterwards.

Figure 12: Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subsets of the criteo data. XGBoost runs more than 10x faster than Spark per iteration and 2.2x faster than H2O's optimized version (however, H2O is slow in loading the data, giving it a worse end-to-end time). Note that Spark suffers from a drastic slowdown when running out of memory. XGBoost runs faster and scales smoothly to the full 1.7 billion examples with the given resources by utilizing out-of-core computation.


6.5 Out-of-core Experiment

We also evaluate our system in the out-of-core setting on the criteo data. We conducted the experiment on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSDs, 60 GB RAM). The results are shown in Figure 11. We find that compression helps to speed up computation by a factor of three, and sharding into two disks further gives a 2x speedup. For this type of experiment, it is important to use a very large dataset to drain the system file cache for a real out-of-core setting. This is indeed our setup. We can observe a transition point when the system runs out of file cache. Note that the transition in the final method is less dramatic. This is due to larger disk throughput and better utilization of computation resources. Our final method is able to process 1.7 billion examples on a single machine.
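In the released package, the out-of-core mode is enabled by pointing the DMatrix at a data file with a cache prefix, roughly as below; the file path is hypothetical and the exact syntax has evolved across versions, so treat this as a sketch:

```python
import xgboost as xgb

# '#dtrain.cache' asks XGBoost to build external-memory cache files on disk
# instead of loading the whole LibSVM file into RAM ('train.libsvm' is a
# placeholder path for illustration).
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
bst = xgb.train({"eta": 0.1, "max_depth": 8}, dtrain, num_boost_round=10)
```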


6.6 Distributed Experiment

Finally, we evaluate the system in the distributed setting. We set up a YARN cluster on EC2 with m3.2xlarge machines, which is a very common choice for clusters. Each machine contains 8 virtual cores, 30GB of RAM and two 80GB SSD local disks. The dataset is stored on AWS S3 instead of HDFS to avoid purchasing persistent storage.

We first compare our system against two production-level distributed systems: Spark MLlib [18] and H2O. We use 32 m3.2xlarge machines and test the performance of the systems with various input sizes. Both of the baseline systems are in-memory analytics frameworks that need to store the data in RAM, while XGBoost can switch to the out-of-core setting when it runs out of memory. The results are shown in Fig. 12. We find that XGBoost runs faster than the baseline systems. More importantly, it is able to take advantage of out-of-core computing and smoothly scale to all 1.7 billion examples with the given limited computing resources. The baseline systems are only able to handle a subset of the data with the given resources. This experiment shows the advantage of bringing all the system improvements together to solve a real-world scale problem. We also evaluate the scaling property of XGBoost by varying the number of machines. The results are shown in Fig. 13. We find that XGBoost's performance scales linearly as we add more machines. Importantly, XGBoost is able to handle the entire 1.7 billion data with only four machines. This shows the system's potential to handle even larger data.

Figure 13: Scaling of XGBoost with different numbers of machines on the criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super-linear. XGBoost can process the entire dataset using as little as four machines, and scales smoothly by utilizing more available resources.


7. CONCLUSION

In this paper, we described the lessons we learnt when building XGBoost, a scalable tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems. We proposed a novel sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate learning. Our experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well. By combining these insights, XGBoost is able to solve real-world scale problems using a minimal amount of resources.


Acknowledgments

We would like to thank Tyler B. Johnson, Marco Tulio Ribeiro, Sameer Singh, Arvind Krishnamurthy for their valuable feedback. We also sincerely thank Tong He, Bing Xu, Michael Benesty, Yuan Tang, Hongliang Liu, Qiang Kou, Nan Zhu and all other con-tributors in the XGBoost community. This work was supported in part by ONR (PECASE) N000141010672, NSF IIS 1258741 and the TerraSwarm Research Center sponsored by MARCO and DARPA.


8. REFERENCES

[1] R. Bekkerman. The present and the future of the KDD Cup competition: an outsider's perspective.

[2] R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011.

[3] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of the KDD Cup Workshop 2007, pages 3-6, New York, Aug. 2007.

[4] L. Breiman. Random forests. Machine Learning, 45(1):5-32, Oct. 2001.

[5] C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23-581, 2010.

[6] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. Journal of Machine Learning Research - W & CP, 14:1-24, 2011.

[7] T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436-444, 2013.

[8] T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015.

[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

[10] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232, 2001.

[11] J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.

[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.

[13] J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles, 2003.

[14] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58-66, 2001.

[15] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Q. Candela. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD'14, 2014.

[16] P. Li. Robust LogitBoost and adaptive base class (ABC) LogitBoost. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI'10), pages 302-311, 2010.

[17] P. Li, Q. Wu, and C. J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20, pages 897-904, 2008.

[18] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1-7, 2016.

[19] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proceedings of the VLDB Endowment, 2(2):1426-1437, Aug. 2009.

[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[21] G. Ridgeway. Generalized Boosted Models: A guide to the gbm package.

[22] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387-396. ACM, 2011.

[23] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09.

[24] Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007.

[25] T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.
