Playing with xgboost from scratch | official demo | available objective functions | evaluation metrics | feature importance visualization

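After many years in the field, here is how I think about machine learning algorithms.
For structured data, classification and regression problems can generally be solved with xgboost. In my view, NLP problems can all be framed as classification problems, which means they can be approached with BERT, including named entity recognition, entity relation extraction, and entity linking (which Baidu calls 實體鏈指). If you are interested, see my post on BERT-based entity relation extraction.

Here is how to get started with xgboost.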


See also the official xgboost blog post.

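1. Installation

On macOS, running pip install xgboost directly failed with an error for me (installing on a Mac has quite a few pitfalls). Installing with conda instead, via conda install py-xgboost, worked without problems. The dependencies are installed along with it, which saves effort.

If you want to use xgboost together with matplotlib for plotting, install matplotlib with conda install matplotlib as well. If xgboost is installed with conda and matplotlib is then installed with pip, the environment gets polluted: conda list will show two versions of numpy and scipy. This happens because pip and conda install from different sources with different default versions, so do not mix the two.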
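To confirm which package versions the active interpreter actually sees, one option is to query them from Python. This is a minimal sketch using the standard-library importlib.metadata; the helper name installed_version is my own, and it complements conda list rather than replacing it, since it reports only one version per package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Packages mentioned in this section; None means "not visible to this interpreter".
    for name in ("xgboost", "matplotlib", "numpy", "scipy"):
        print(name, installed_version(name))
```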

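2. Runtime error: AttributeError: module 'xgboost' has no attribute 'DMatrix'

Rename your script. It is most likely named xgboost.py, so import xgboost loads your own file instead of the installed package.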
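Python searches the script's own directory before the installed packages, which is why a local xgboost.py wins. A quick check is to look for a same-named file or folder next to your script. The sketch below uses a hypothetical helper, shadowing_candidates, which is not part of any library.

```python
from pathlib import Path
from typing import List


def shadowing_candidates(module_name: str, directory: str = ".") -> List[Path]:
    """List files or folders in `directory` that would shadow `import module_name`."""
    base = Path(directory)
    candidates = [base / f"{module_name}.py", base / module_name]
    return [p for p in candidates if p.exists()]


if __name__ == "__main__":
    # An empty list means nothing in the current directory hides the real package.
    print(shadowing_candidates("xgboost"))
```

Printing xgboost.__file__ right after the import also shows which file was actually loaded.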

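3. Input data format: libsvm

Training and test data files for libsvm use the following format:

[label] [index1]:[value1] [index2]:[value2] …
[label] [index1]:[value1] [index2]:[value2] …

label is the target value, i.e. the class the sample belongs to. It is usually an integer.
index is the feature number. Indices are integers, usually consecutive, and must appear in ascending order.
value is the feature value used for training, usually a real number.

For details, see the libsvm documentation.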
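To make the format concrete, here is a small parser for a single line. It is only an illustrative sketch: parse_libsvm_line is a hypothetical helper and the sample line is made up. xgboost reads these files itself, so you would not need this in practice.

```python
from typing import Dict, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one '[label] [index]:[value] ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features: Dict[int, float] = {}
    previous_index = -1
    for item in parts[1:]:
        index_text, value_text = item.split(":", 1)
        index = int(index_text)
        # The format expects feature indices in ascending order.
        if index <= previous_index:
            raise ValueError(
                f"indices must be ascending, got {index} after {previous_index}"
            )
        features[index] = float(value_text)
        previous_index = index
    return label, features


if __name__ == "__main__":
    # A made-up sample line: label 1 and three listed features.
    print(parse_libsvm_line("1 3:1 10:1 11:0.5"))
```

Features that do not appear on a line are simply omitted, which keeps sparse data compact.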

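4. The mushroom dataset

Each sample describes 22 attributes of a mushroom, such as shape and odor (the 22 raw features were processed into 126 features and saved in libsvm format), and records whether the mushroom is edible. 6513 samples are used for training and 1611 for testing.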

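Download the data in libsvm format (training and test sets are in the same place). The files also ship with the xgboost repository under demo/data as agaricus.txt.train and agaricus.txt.test; the demo in the next section assumes they are saved as agaricus_train.txt and agaricus_test.txt.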

5. Running the simplest demo

import xgboost as xgb

if __name__ == '__main__':
    # Load the data; the text format is stated explicitly in the URI,
    # which newer xgboost releases expect for text input
    data_train = xgb.DMatrix('agaricus_train.txt?format=libsvm')
    data_test = xgb.DMatrix('agaricus_test.txt?format=libsvm')

    # Set the parameters and train the model ('verbosity': 0 silences log output)
    param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'binary:logistic'}
    n_round = 10
    model = xgb.train(param, data_train, num_boost_round=n_round)

    # Predicted probabilities
    y_predict = model.predict(data_test)
    # True labels
    y_true = data_test.get_label()

    print('y_predict:', y_predict)
    print('y_true:', y_true)

Terminal output (the predictions agree with the labels once a threshold of 0.5 is applied):
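Applying the 0.5 threshold can be sketched as below. The helper to_labels and the sample probabilities are made up for illustration; they stand in for y_predict from the demo above.

```python
from typing import Iterable, List


def to_labels(probabilities: Iterable[float], threshold: float = 0.5) -> List[int]:
    """Map predicted probabilities to 0/1 labels using a fixed threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


if __name__ == "__main__":
    # Made-up probabilities standing in for the model output.
    sample_probabilities = [0.02, 0.97, 0.51, 0.49]
    print(to_labels(sample_probabilities))
```

With a NumPy array such as the one returned by model.predict, the equivalent one-liner is (y_predict > 0.5).astype(int).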

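6. Available objective functions (objective)

For binary classification, binary:logistic is the usual choice.
For multi-class classification, multi:softmax is the usual choice.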

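"reg:linear": linear regression (renamed reg:squarederror in newer xgboost versions).
"reg:logistic": logistic regression.
"binary:logistic": logistic regression for binary classification; the output is a probability.
"binary:logitraw": logistic regression for binary classification; the output is the raw score wTx before the logistic transformation.
"count:poisson": Poisson regression for count data; the output is the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
"multi:softmax": makes XGBoost use the softmax objective for multi-class classification; you also need to set num_class (the number of classes).
"multi:softprob": same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the probability that the sample belongs to each class.
"rank:pairwise": sets XGBoost to do a ranking task by minimizing the pairwise loss.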
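The relationships between these outputs can be sketched in plain Python. The logistic transform links binary:logitraw to binary:logistic, and taking the most probable class per row of multi:softprob gives the label multi:softmax reports. The function names and sample numbers below are my own, not part of xgboost.

```python
import math
from typing import List


def sigmoid(margin: float) -> float:
    """Logistic transform: turns a raw margin into a probability."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    # Rewritten form avoids overflow for large negative margins.
    e = math.exp(margin)
    return e / (1.0 + e)


def reshape_rows(flat: List[float], num_class: int) -> List[List[float]]:
    """Split a flat ndata * nclass vector into ndata rows of nclass probabilities."""
    if len(flat) % num_class != 0:
        raise ValueError("length of flat vector must be a multiple of num_class")
    return [flat[i:i + num_class] for i in range(0, len(flat), num_class)]


def argmax(row: List[float]) -> int:
    """Index of the largest probability, i.e. the predicted class."""
    return max(range(len(row)), key=lambda k: row[k])


if __name__ == "__main__":
    # binary:logitraw gives a margin; binary:logistic gives sigmoid(margin).
    print(sigmoid(0.0))  # 0.5

    # Made-up multi:softprob output for 2 samples and 3 classes.
    flat = [0.7, 0.2, 0.1, 0.1, 0.3, 0.6]
    rows = reshape_rows(flat, num_class=3)
    print([argmax(r) for r in rows])
```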

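7. Printing evaluation metrics

If you are not sure how these metrics are computed, this post explains them:
Understanding how evaluation metrics for regression and classification models are computed: confusion matrix, ROC, AUC, KS, SSE, R-square, Adjusted R-Square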

from sklearn import metrics

# For binary:logistic, model.predict returns probabilities; threshold at 0.5 to get labels
y_predict_prob = model.predict(data_test)
y_predict = (y_predict_prob > 0.5).astype(int)

print('AUC: %.4f' % metrics.roc_auc_score(y_true, y_predict_prob))  # AUC needs probabilities
print('ACC: %.4f' % metrics.accuracy_score(y_true, y_predict))
print('Recall: %.4f' % metrics.recall_score(y_true, y_predict))
print('F1-score: %.4f' % metrics.f1_score(y_true, y_predict))
print('Precision: %.4f' % metrics.precision_score(y_true, y_predict))
print(metrics.confusion_matrix(y_true, y_predict))
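To show what these numbers mean, the sketch below computes accuracy, precision, recall, and F1 by hand from the four confusion-matrix counts. The helper binary_metrics and the sample labels are made up for illustration. For 0/1 labels, metrics.confusion_matrix lays the same counts out as [[TN, FP], [FN, TP]].

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    predicted = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(truth, predicted))
```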

8. Visualizing feature importance

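import os
# Workaround for the duplicate OpenMP runtime error that can appear on macOS
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
from xgboost import plot_importance
from matplotlib import pyplot as plt

# Plot the importance score of each feature used by the trained model
plot_importance(model)
plt.show()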

Result: a bar chart ranking the features by importance score.

9. Printing the top-n most important features

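# get_fscore() counts how often each feature is used to split; sort highest first
feature_score = model.get_fscore()
feature_score = dict(sorted(feature_score.items(), key=lambda x: x[1], reverse=True))
print(feature_score)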

Output:

{'f29': 48, 'f6': 44, 'f2': 36, 'f4': 20, 'f179': 18, 'f133': 16, 'f95': 12, 'f7': 10, 'f71': 10, 'f12': 10, 'f5': 10, 'f96': 8, 'f21': 6, 'f81': 6, 'f43': 6, 'f141': 6, 'f115': 6, 'f27': 4, 'f132': 4, 'f74': 4, 'f1': 4, 'f83': 4, 'f121': 4, 'f178': 4, 'f84': 4, 'f93': 4, 'f99': 4, 'f124': 4, 'f118': 2, 'f88': 2, 'f46': 2, 'f201': 2, 'f199': 2, 'f151': 2, 'f101': 2, 'f54': 2, 'f0': 2, 'f41': 2, 'f144': 2, 'f36': 2, 'f225': 2, 'f117': 2, 'f98': 2, 'f80': 2, 'f17': 2, 'f25': 2, 'f55': 2, 'f146': 2, 'f128': 2, 'f44': 2}
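The dictionary above contains every feature the model used. To keep only the top n, one option is heapq.nlargest from the standard library. This is a sketch with a hypothetical helper, top_n_features, applied to a few entries copied from the output above.

```python
import heapq
from typing import Dict, List, Tuple


def top_n_features(scores: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Return the n (feature, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # A few entries copied from the output above.
    sample_scores = {"f29": 48, "f6": 44, "f2": 36, "f4": 20, "f179": 18}
    print(top_n_features(sample_scores, 3))
```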

10. How to tune parameters

See this blog post.
