[xgboost] python

References: https://xgboost.apachecn.org/

https://xgboost.readthedocs.io


Python Package Introduction

This document gives a basic walkthrough of the xgboost Python package.

List of other helpful links

Installing XGBoost

To install XGBoost, do the following steps:

  • Run the make command in the root directory of the project
  • Run the following command in the python-package directory:
python setup.py install

Then import the package in Python:
import xgboost as xgb

Data Interface

The XGBoost Python module is able to load data from:

  • LibSVM text format file,
  • NumPy 2D array,
  • SciPy 2D sparse array, and
  • XGBoost binary buffer file.

The data is stored in an object called DMatrix.

  • To load a LibSVM text file or an XGBoost binary file into a DMatrix object:
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
  • To load a NumPy array into a DMatrix object:
import numpy as np

data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
  • To load a scipy.sparse array into a DMatrix object:
import scipy.sparse

# dat, row and col hold the nonzero values and their coordinates
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)
  • Saving a DMatrix into an XGBoost binary file will make loading faster next time:
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary("train.buffer")
  • To handle missing values in a DMatrix, initialize the DMatrix with the parameter that marks missing values:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
  • Weights can be set when needed:
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)

Setting Parameters

XGBoost accepts parameters either as a dict or as a list of (key, value) pairs. For example:

  • Booster parameters
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'
  • You can also specify multiple evaluation metrics:
param['eval_metric'] = ['auc', 'ams@0']

# alternatively:
# plst = list(param.items())
# plst += [('eval_metric', 'ams@0')]
  • Specify validation sets to watch performance
evallist = [(dtest, 'eval'), (dtrain, 'train')]

Training

With the parameter list and data ready, you can now train a model.

  • Training
num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)
  • Saving the model: after training, you can save the model and dump it out.
bst.save_model('0001.model')
  • Dumping the model and feature map: you can dump the model to a txt file and review the meaning of the model
# dump the model
bst.dump_model('dump.raw.txt')
# dump the model with the feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')
  • Loading a model: after you have saved a model, you can load the model file at any time as follows
bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model('model.bin')  # load model data

Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there is more than one, it will use the last one.

train(..., evals=evals, early_stopping_rounds=10)

The model will train until the validation score stops improving. The validation error needs to decrease at least once every early_stopping_rounds rounds for training to continue.

If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. Note that train() will return a model from the last iteration, not the best one.

This works with both metrics to minimize (RMSE, log loss, etc.) and metrics to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric, the last one in param['eval_metric'] is used for early stopping.
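The stopping rule described above can be sketched in plain Python. This is an illustrative simplification, not xgboost's actual implementation, and the function name best_round is hypothetical:

```python
def best_round(val_scores, early_stopping_rounds, maximize=False):
    """Return (best_iteration, best_score) under the early-stopping rule:
    stop once the validation score has not improved for
    `early_stopping_rounds` consecutive rounds."""
    best, best_i, stale = None, 0, 0
    for i, score in enumerate(val_scores):
        improved = best is None or (score > best if maximize else score < best)
        if improved:
            best, best_i, stale = score, i, 0
        else:
            stale += 1
            if stale >= early_stopping_rounds:
                break  # training would stop here
    return best_i, best

# Log loss (minimized): improvement stalls after round 2.
print(best_round([0.9, 0.7, 0.6, 0.65, 0.64, 0.66], 2))  # (2, 0.6)
```

Note that, as with xgboost's train(), the caller still has to look at best_i rather than the last round, since training runs past the best iteration before it notices the stall.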

Prediction

After you have trained/loaded a model and prepared the data, you can start making predictions.

# 7 samples, each containing 10 features
data = np.random.rand(7, 10)
dtest = xgb.DMatrix(data)
ypred = bst.predict(dtest)

If early stopping occurred during training, you can get predictions from the best iteration with bst.best_ntree_limit:

ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)

Plotting

You can use the plotting module to plot feature importance and the output tree.

To plot importance, use plot_importance. This function requires matplotlib to be installed.

xgb.plot_importance(bst)

The output tree is rendered through matplotlib; use plot_tree and specify the ordinal number of the target tree. This function requires graphviz and matplotlib.

xgb.plot_tree(bst, num_trees=2)

When you use IPython, you can use the to_graphviz function, which converts the target tree to a graphviz instance. The graphviz instance is automatically rendered in IPython.

xgb.to_graphviz(bst, num_trees=2)

 


Notes on Parameter Tuning

Parameter tuning is a dark art in machine learning; the optimal parameters of a model can depend on many scenarios, so it is impossible to create comprehensive guidance.

This document tries to provide some guidance for the parameters in xgboost.

Understanding the Bias-Variance Tradeoff

If you have taken a machine learning or statistics course, this is probably one of the most important concepts. When we allow the model to get more complicated (e.g. deeper trees), the model has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit.

Most of the parameters in xgboost are about the bias-variance tradeoff. The best model should trade model complexity against its predictive power carefully. The parameter documentation will tell you whether each parameter makes the model more conservative or not. This can help you move flexibly between complex and simple models.

Control Overfitting

When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.

There are in general two ways to control overfitting in xgboost.

  • The first way is to directly control model complexity
    • This includes max_depth, min_child_weight and gamma
  • The second way is to add randomness to make training robust to noise
    • This includes subsample and colsample_bytree
    • You can also reduce the step size eta, but remember to increase num_round when you do so.
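As a concrete illustration of the two approaches above, the following parameter dicts show which knobs each approach touches. The values are arbitrary starting points for illustration, not recommendations:

```python
# Option 1: directly constrain model complexity.
conservative = {
    'max_depth': 4,         # shallower trees
    'min_child_weight': 5,  # require more evidence before splitting
    'gamma': 1.0,           # minimum loss reduction needed for a split
}

# Option 2: add randomness to make training robust to noise.
randomized = {
    'subsample': 0.8,         # sample 80% of rows per boosting round
    'colsample_bytree': 0.8,  # sample 80% of columns per tree
    'eta': 0.1,               # smaller step size ...
}
num_round = 500               # ... compensated by more boosting rounds
```

The two options can also be combined; in practice you would tune these values against a validation set rather than copy them verbatim.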

Handle Imbalanced Dataset

For common cases such as ad click-through logs, the dataset is extremely imbalanced. This can affect the training of the xgboost model, and there are two ways to improve it.

  • If you care only about the ranking order (AUC) of your prediction
    • Balance the positive and negative weights via scale_pos_weight
    • Use AUC for evaluation
  • If you care about predicting the right probability
    • In such a case, you cannot re-balance the dataset
    • In such a case, set the parameter max_delta_step to a finite number (say 1) to help convergence
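For the first case, a common heuristic is to set scale_pos_weight to the ratio of negative to positive instances. A minimal sketch on toy labels (the exact value is a tuning knob, not a fixed rule):

```python
# Toy labels: 2 positives, 8 negatives.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
n_pos = sum(1 for y in labels if y == 1)
n_neg = len(labels) - n_pos

param = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    # heuristic: weight positives by the negative/positive ratio
    'scale_pos_weight': n_neg / n_pos,
}
print(param['scale_pos_weight'])  # 4.0
```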

Text Input Format of DMatrix

Basic Input Format

XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See this Wikipedia article for a description of the CSV format.)

For training or predicting, XGBoost takes an instance file with the format as below:

train.txt

1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 0:1.3 1:0.3
1 0:0.01 1:0.3
0 0:0.2 1:0.3

Each line represents a single instance. In the first line, ‘1’ is the instance label, ‘101’ and ‘102’ are feature indices, and ‘1.2’ and ‘0.03’ are feature values. In the binary classification case, ‘1’ is used to indicate positive samples and ‘0’ to indicate negative samples. We also support probability values in [0,1] as labels, to indicate the probability of the instance being positive.
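The format can be made concrete with a few lines of Python. This parser is only an illustration; real code should load such files through xgb.DMatrix:

```python
def parse_libsvm_line(line):
    """Split one LibSVM-format line into (label, {feature_index: value})."""
    label, *features = line.split()
    pairs = (f.split(':') for f in features)
    return float(label), {int(i): float(v) for i, v in pairs}

print(parse_libsvm_line('1 101:1.2 102:0.03'))
# (1.0, {101: 1.2, 102: 0.03})
```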

Auxiliary Files for Additional Information

Note: all information below is applicable only to single-node version of the package. If you’d like to perform distributed training with multiple nodes, skip to the section Embedding additional information inside LibSVM file.

Group Input Format

For ranking tasks, XGBoost supports the group input format. In ranking tasks, instances are categorized into query groups in real-world scenarios. For example, in the learning-to-rank web pages scenario, the web page instances are grouped by their queries. XGBoost requires a file that indicates the group information. For example, if the instance file is the train.txt shown above, the group file should be named train.txt.group and be of the following format:

train.txt.group

2
3

This means that the data set contains 5 instances: the first two instances are in one group and the other three are in another group. The numbers in the group file indicate the number of instances in each group in the instance file, in order. At configuration time, you do not have to indicate the path of the group file. If the instance file name is xxx, XGBoost will check whether there is a file named xxx.group in the same directory.
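The relationship between the group file and the instance file amounts to a simple consistency check. The helper below is a hypothetical sketch, shown only to make the format concrete:

```python
def groups_consistent(group_sizes, num_instances):
    """Group sizes must be positive and partition the instances exactly,
    in the order in which they appear in the instance file."""
    return all(g > 0 for g in group_sizes) and sum(group_sizes) == num_instances

# train.txt above has 5 instances; train.txt.group holds the sizes 2 and 3.
print(groups_consistent([2, 3], 5))  # True
```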

Instance Weight File

Instances in the training data may be assigned weights to differentiate relative importance among them. For example, if we provide an instance weight file for the train.txt file in the example as below:

train.txt.weight

1
0.5
0.5
1
0.5

This means that XGBoost will put more emphasis on the first and fourth instances (i.e. the positive instances) while training. The configuration is similar to configuring the group information. If the instance file name is xxx, XGBoost will look for a file named xxx.weight in the same directory. If the file exists, the instance weights will be extracted and used at the time of training.

Note

Binary buffer format and instance weights

If you choose to save the training data as a binary buffer (using save_binary()), keep in mind that the resulting binary buffer file will include the instance weights. To update the weights, use the set_weight() function.

Initial Margin File

XGBoost supports providing each instance with an initial margin prediction. For example, if we have an initial prediction from logistic regression for the train.txt file, we can create the following file:

train.txt.base_margin

-0.4
1.0
3.4

XGBoost will take these values as the initial margin prediction and boost from there. An important note about base_margin is that it should be the margin prediction before transformation, so if you are using logistic loss, you will need to supply the value before the logistic transformation. If you are using the XGBoost predictor, use pred_margin=1 to output margin values.
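Since base_margin expects the value before the logistic transformation, a probability from an existing logistic-regression model has to be converted to log-odds first. A small sketch (helper names are illustrative):

```python
import math

def prob_to_margin(p):
    """Log-odds of a probability: the raw margin *before* the logistic
    transformation, which is what base_margin expects."""
    return math.log(p / (1.0 - p))

def margin_to_prob(m):
    """The logistic transformation applied on top of the margin."""
    return 1.0 / (1.0 + math.exp(-m))

# A 90% initial prediction corresponds to a margin of log(9) ≈ 2.197.
print(prob_to_margin(0.5))  # 0.0
```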

Embedding additional information inside LibSVM file

This section is applicable to both single- and multiple-node settings.

Query ID Columns

This is most useful for ranking tasks, where the instances are grouped into query groups. You may embed the query group ID for each instance in the LibSVM file by adding a token of the form qid:xx to each row:

train.txt

1 qid:1 101:1.2 102:0.03
0 qid:1 1:2.1 10001:300 10002:400
0 qid:2 0:1.3 1:0.3
1 qid:2 0:0.01 1:0.3
0 qid:3 0:0.2 1:0.3
1 qid:3 3:-0.1 10:-0.3
0 qid:3 6:0.2 10:0.15

Keep in mind the following restrictions:

  • You are not allowed to specify query IDs for some instances but not for others. Either every row is assigned a query ID or none at all.
  • The rows have to be sorted in ascending order by query ID. So, for instance, you may not have a row with a larger query ID than any of the rows that follow it.
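The two restrictions can be checked mechanically; the helper below is a hypothetical sketch of such a validation, with a missing ID represented as None:

```python
def qids_valid(qids):
    """Either every row has a query ID (no None entries) or the file is
    invalid, and the IDs must be in non-decreasing order."""
    if any(q is None for q in qids):
        return False
    return all(a <= b for a, b in zip(qids, qids[1:]))

# The qid column of the train.txt example above:
print(qids_valid([1, 1, 2, 2, 3, 3, 3]))  # True
```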

Instance weights

You may specify instance weights in the LibSVM file by appending the corresponding weight to each instance label, in the form [label]:[weight], as shown in the following example:

train.txt

1:1.0 101:1.2 102:0.03
0:0.5 1:2.1 10001:300 10002:400
0:0.5 0:1.3 1:0.3
1:1.0 0:0.01 1:0.3
0:0.5 0:0.2 1:0.3

where the negative instances are assigned half the weight of the positive instances.
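Parsing the [label]:[weight] token can be sketched as follows; this hypothetical helper only illustrates the convention, including the case where no weight is attached:

```python
def parse_weighted_label(token):
    """Split a '[label]:[weight]' token; a bare label defaults to weight 1.0."""
    if ':' in token:
        label, weight = token.split(':')
        return float(label), float(weight)
    return float(token), 1.0

print(parse_weighted_label('0:0.5'))  # (0.0, 0.5)
print(parse_weighted_label('1'))      # (1.0, 1.0)
```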
