Feature importance from xgboost's native interface and the sklearn interface

1. The native xgboost interface and the sklearn wrapper expose feature_importance differently:
bst = xgb.train(param, d1_train, num_boost_round=100, evals=watch_list)
xgc = xgb.XGBClassifier(objective='binary:logistic', seed=10086, **bst_params)

xgc.feature_importances_ is equivalent to xgc.get_booster().get_fscore(), which in turn equals xgc.get_booster().get_score(importance_type="weight").
With the native interface you call bst.get_fscore() or bst.get_score(importance_type="weight") directly.
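The sketch below makes the two call paths concrete. It is a minimal example on synthetic data; the names param, d1_train and bst_params follow the snippets above, but the values here are placeholders rather than the original settings.

import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, 200)

# Native interface: train a Booster, then read the weight-based scores directly.
param = {'objective': 'binary:logistic', 'seed': 10086}
d1_train = xgb.DMatrix(X, label=y)
bst = xgb.train(param, d1_train, num_boost_round=100)
print(bst.get_fscore())                              # e.g. {'f0': 31, 'f2': 27, ...}, split counts per feature
print(bst.get_score(importance_type='weight'))       # same dict as get_fscore()

# sklearn wrapper: fit XGBClassifier, then go through the underlying Booster.
bst_params = {'n_estimators': 100}                   # placeholder hyper-parameters
xgc = xgb.XGBClassifier(objective='binary:logistic', seed=10086, **bst_params)
xgc.fit(X, y)
print(xgc.feature_importances_)                      # normalized array, one value per feature
print(xgc.get_booster().get_fscore())                # raw split counts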

2. Taking the sklearn interface as an example, there are two routes to xgboost feature importance:
xgc.feature_importances_ and xgb.plot_importance(xgc, max_num_features=10)
However, the two produce different-looking output:
(screenshots of the feature_importances_ array and the plot_importance chart omitted)
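For reference, the two calls look like this (continuing the hypothetical xgc fitted in the first sketch; plot_importance requires matplotlib):

import matplotlib.pyplot as plt
import xgboost as xgb

print(xgc.feature_importances_)                 # normalized array, one value per feature
xgb.plot_importance(xgc, max_num_features=10)   # horizontal bar chart of raw F scores
plt.show()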
Some posts online claim that one of them uses importance_type 'gain' and the other 'weight'. That is not the case; look at the source code:

    @property
    def feature_importances_(self):
        """
        Returns
        -------
        feature_importances_ : array of shape = [n_features]

        """
        b = self.get_booster()
        fs = b.get_fscore()
        all_features = [fs.get(f, 0.) for f in b.feature_names]
        all_features = np.array(all_features, dtype=np.float32)
        return all_features / all_features.sum()
def plot_importance(booster, ax=None, height=0.2,
                    xlim=None, ylim=None, title='Feature importance',
                    xlabel='F score', ylabel='Features',
                    importance_type='weight', max_num_features=None,
                    grid=True, show_values=True, **kwargs):

    """Plot importance based on fitted trees.

    Parameters
    ----------
    booster : Booster, XGBModel or dict
        Booster or XGBModel instance, or dict taken by Booster.get_fscore()
    ax : matplotlib Axes, default None
        Target axes instance. If None, new figure and axes will be created.
    grid : bool, Turn the axes grids on or off.  Default is True (On).
    importance_type : str, default "weight"
        How the importance is calculated: either "weight", "gain", or "cover"
        "weight" is the number of times a feature appears in a tree
        "gain" is the average gain of splits which use the feature
        "cover" is the average coverage of splits which use the feature
            where coverage is defined as the number of samples affected by the split
    max_num_features : int, default None
        Maximum number of top features displayed on plot. If None, all features will be displayed.
    height : float, default 0.2
        Bar height, passed to ax.barh()
    xlim : tuple, default None
        Tuple passed to axes.xlim()
    ylim : tuple, default None
        Tuple passed to axes.ylim()
    title : str, default "Feature importance"
        Axes title. To disable, pass None.
    xlabel : str, default "F score"
        X axis title label. To disable, pass None.
    ylabel : str, default "Features"
        Y axis title label. To disable, pass None.
    show_values : bool, default True
        Show values on plot. To disable, pass False.
    kwargs :
        Other keywords passed to ax.barh()

    Returns
    -------
    ax : matplotlib Axes
    """
    # TODO: move this to compat.py
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError('You must install matplotlib to plot importance')

    if isinstance(booster, XGBModel):
        importance = booster.get_booster().get_score(importance_type=importance_type)
    elif isinstance(booster, Booster):
        importance = booster.get_score(importance_type=importance_type)
    elif isinstance(booster, dict):
        importance = booster
    else:
        raise ValueError('tree must be Booster, XGBModel or dict instance')

    if len(importance) == 0:
        raise ValueError('Booster.get_score() results in empty')

The source shows that both use importance_type 'weight', i.e. the number of times a feature is used for splitting; feature_importances_ simply normalizes these counts so that they sum to 1.
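To see the normalization concretely, here is a small check (reusing the hypothetical xgc fitted in the earlier sketch):

import numpy as np

booster = xgc.get_booster()
fscore = booster.get_fscore()                    # raw split counts per feature
weights = np.array([fscore.get(f, 0.0) for f in booster.feature_names],
                   dtype=np.float32)
manual = weights / weights.sum()                 # same normalization as the property
print(np.allclose(manual, xgc.feature_importances_))   # expected: True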
3. Comparing different values of importance_type
(screenshot comparing the rankings under different importance_type values omitted)
We find that the importance rankings produced by the two methods differ.
In fact, feature importance can be judged along three dimensions, and in practice each of the three options can give a very different ranking (the sketch after the list below queries all three):

  1. Weight: the number of times a feature is used to split the data across all trees.

  2. Cover: the number of times a feature is used to split the data across all trees, weighted by the number of data points that go through those splits (i.e. the coverage of the splits that use the feature).

  3. Gain: the average reduction in training loss when the feature is used for splitting.
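A minimal sketch of such a comparison (reusing the hypothetical xgc from the first sketch):

booster = xgc.get_booster()
for imp_type in ('weight', 'gain', 'cover'):
    scores = booster.get_score(importance_type=imp_type)
    ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    print(imp_type, ranking)      # the orderings typically differ across the three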

Given this ambiguity, SHAP is recommended for measuring feature importance; see the blog post referenced in the original, as well as the earlier explanation of how feature_importance is computed.
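A minimal SHAP sketch (assumes the shap package is installed; reuses the hypothetical xgc and X from the first sketch):

import shap

explainer = shap.TreeExplainer(xgc)           # tree-model explainer for the fitted classifier
shap_values = explainer.shap_values(X)        # per-sample, per-feature attributions
shap.summary_plot(shap_values, X)             # global summary of feature impact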

The initial draft of this article was written with reference to another blog post (linked in the original).
