Visualizing feature selection scores: drawing "sideways" (horizontal) bar charts with Matplotlib in Python

Original post

2018-09-04 21:42

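A common approach to feature selection is to measure how strongly each feature is related to the label, give each feature a score, and then select features by ranking those scores. For example, in scikit-learn the default score function of SelectKBest and SelectPercentile, "f_classif", scores features with an analysis of variance (the ANOVA F-test). We often just call SelectKBest to do the selection, but sometimes we want to understand how valuable each kind of feature is, so we know which aspects of the data deserve more feature engineering effort. In that case, visualizing the feature scores is very convenient.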
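To make the scoring step concrete, here is a minimal sketch on the iris dataset. It assumes a reasonably recent scikit-learn; the variable names (`f_values`, `all_scores`, `top2`) are only illustrative. It calls `f_classif` directly, checks that `SelectKBest` reports the same values in `scores_`, and shows which features survive when only the two best are kept.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
x, y = iris.data, iris.target

# f_classif returns one ANOVA F-value and one p-value per feature
f_values, p_values = f_classif(x, y)

# SelectKBest uses f_classif by default; k="all" keeps every feature
# and simply exposes the scores
all_scores = SelectKBest(k="all").fit(x, y).scores_
print(np.allclose(f_values, all_scores))

# With k=2 the selector keeps only the two highest-scoring features;
# get_support() returns a boolean mask over the original columns
top2 = SelectKBest(k=2).fit(x, y)
for name, kept in zip(iris.feature_names, top2.get_support()):
    print(name, "kept" if kept else "dropped")
```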

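Here we use the iris dataset as an example and visualize the features with a "sideways" (horizontal) bar chart. A vertical one works just as well, of course.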

I. A simple visualization function:

import numpy as np
from sklearn.feature_selection import SelectKBest
from matplotlib import pyplot as plt


def plot_feature_scores(x, y, names=None):
    """Score each feature with f_classif and plot the scores as a horizontal bar chart."""
    # Fall back to column indices; "is None" also works when names is a NumPy array
    if names is None:
        names = list(range(len(x[0])))

    # 1. Score the features with sklearn.feature_selection.SelectKBest
    #    (its default score function is f_classif)
    slct = SelectKBest(k="all")
    slct.fit(x, y)
    scores = slct.scores_

    # 2. Sort the features by score, highest first
    named_scores = zip(names, scores)
    sorted_named_scores = sorted(named_scores, key=lambda z: z[1], reverse=True)

    sorted_scores = [each[1] for each in sorted_named_scores]
    sorted_names = [each[0] for each in sorted_named_scores]

    y_pos = np.arange(len(names))           # bar positions; the y axis is inverted below so rank 1 is on top

    # 3. Draw the chart
    fig, ax = plt.subplots()
    ax.barh(y_pos, sorted_scores, height=0.7, align='center', color='#AAAAAA')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(sorted_names)        # label each bar with its feature name
    ax.set_xlabel('Feature Score')
    ax.set_ylabel('Feature Name')
    ax.invert_yaxis()
    ax.margins(x=0.1)                       # leave room on the right for the numeric labels
    ax.set_title('F_classif scores of the features.')

    # 4. Add a numeric label at the end of each bar
    #    (anchored to the bar end instead of a fixed "+ 20" offset, so it scales with the data)
    for score, pos in zip(sorted_scores, y_pos):
        ax.text(score, pos, ' %.1f' % score, ha='left', va='center', fontsize=8)

    plt.show()

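The function takes a list of feature vectors, a list of labels, and a list of feature names. If no feature names are passed, each feature is labeled with its column index.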

II. Load the iris dataset and enter the feature names by hand

from sklearn import datasets
def load_named_x_y_data():
    """ Load the data and also return the feature names for plotting (iris dataset as the example) """
    # x_names = ["花萼長度", "花萼寬度", "花瓣長度", "花瓣寬度"]  # the same names in Chinese
    x_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    return x, y, x_names

III. Running it and the result

def main():
    x, y, names = load_named_x_y_data()
    plot_feature_scores(x, y, names)
    return


if __name__ == "__main__":
    main()

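The scores show that the "petal length" feature is the most important, and "petal width" is also very important. The two sepal features, by contrast, score quite low.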

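Let's take a quick look at what the three kinds of iris in this dataset look like.

1. Setosa iris (setosa)

2. Versicolor iris (versicolor)

3. Virginia iris (virginica)

Not that you can really tell much just by looking, haha.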

IV. Finally, the complete code:

# coding:utf-8
from sklearn import datasets
import numpy as np
from sklearn.feature_selection import SelectKBest
from matplotlib import pyplot as plt


def load_named_x_y_data():
    """ Load the data and also return the feature names for plotting (iris dataset as the example) """
    # x_names = ["花萼長度", "花萼寬度", "花瓣長度", "花瓣寬度"]  # the same names in Chinese
    x_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    return x, y, x_names


def plot_feature_scores(x, y, names=None):
    """Score each feature with f_classif and plot the scores as a horizontal bar chart."""
    # Fall back to column indices; "is None" also works when names is a NumPy array
    if names is None:
        names = list(range(len(x[0])))

    # 1. Score the features with sklearn.feature_selection.SelectKBest
    #    (its default score function is f_classif)
    slct = SelectKBest(k="all")
    slct.fit(x, y)
    scores = slct.scores_

    # 2. Sort the features by score, highest first
    named_scores = zip(names, scores)
    sorted_named_scores = sorted(named_scores, key=lambda z: z[1], reverse=True)

    sorted_scores = [each[1] for each in sorted_named_scores]
    sorted_names = [each[0] for each in sorted_named_scores]

    y_pos = np.arange(len(names))           # bar positions; the y axis is inverted below so rank 1 is on top

    # 3. Draw the chart
    fig, ax = plt.subplots()
    ax.barh(y_pos, sorted_scores, height=0.7, align='center', color='#AAAAAA')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(sorted_names)        # label each bar with its feature name
    ax.set_xlabel('Feature Score')
    ax.set_ylabel('Feature Name')
    ax.invert_yaxis()
    ax.margins(x=0.1)                       # leave room on the right for the numeric labels
    ax.set_title('F_classif scores of the features.')

    # 4. Add a numeric label at the end of each bar
    #    (anchored to the bar end instead of a fixed "+ 20" offset, so it scales with the data)
    for score, pos in zip(sorted_scores, y_pos):
        ax.text(score, pos, ' %.1f' % score, ha='left', va='center', fontsize=8)

    plt.show()


def main():
    x, y, names = load_named_x_y_data()
    plot_feature_scores(x, y, names)
    return


if __name__ == "__main__":
    main()

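The iris dataset has only 4 features, so simply printing the scores is probably enough to see which ones matter and which don't. In practice, though (in data mining competitions, for instance), you may be completely at a loss and start by cranking out one or two hundred features without a second thought. That is when a plot becomes really useful.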
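For a handful of features, the "just print it" route could look like the sketch below. The helper name `rank_features` is hypothetical (it is not part of scikit-learn), and the feature names come from the `feature_names` attribute that `load_iris()` provides rather than the hand-typed list used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif


def rank_features(x, y, names=None):
    """Return (name, score) pairs sorted by F-score, highest first."""
    scores, _ = f_classif(x, y)
    if names is None:
        names = list(range(len(scores)))    # fall back to column indices
    order = np.argsort(scores)[::-1]        # indices from highest to lowest score
    return [(names[i], float(scores[i])) for i in order]


iris = load_iris()
ranked = rank_features(iris.data, iris.target, iris.feature_names)
for name, score in ranked:
    print("%-20s %8.1f" % (name, score))
```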

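Also, there is often no need to plot the scores of every feature. Once the feature dimension gets high, both the plotting time and the resulting chart become painful:

Showing only the top 30 features makes everything much cleaner right away (if you need to look at the other features, you can plot them in segments).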
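The original post does not show the code for the top-30 version, so the following is only a sketch of one way to do it. The function name `plot_top_feature_scores` and its `top_k` and `start` parameters are assumptions made for illustration. `start` lets you plot the ranking in segments (ranks 1-30, then 31-60, and so on). NaN scores, which `f_classif` can produce for constant features, are replaced by 0 here so that they sort last; that is a design choice of this sketch.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.feature_selection import f_classif


def plot_top_feature_scores(x, y, names=None, top_k=30, start=0):
    """Plot the F-scores of the features ranked start+1 .. start+top_k."""
    scores, _ = f_classif(x, y)
    scores = np.nan_to_num(scores)          # assumption: treat NaN scores as 0
    if names is None:
        names = [str(i) for i in range(len(scores))]

    # Rank all features, then keep only the requested segment
    order = np.argsort(scores)[::-1][start:start + top_k]
    shown_scores = scores[order]
    shown_names = [names[i] for i in order]
    y_pos = np.arange(len(order))

    # Grow the figure height with the number of bars so labels stay readable
    fig, ax = plt.subplots(figsize=(6, 0.25 * len(order) + 1))
    ax.barh(y_pos, shown_scores, height=0.7, align='center', color='#AAAAAA')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(shown_names)
    ax.invert_yaxis()                       # best-ranked feature on top
    ax.set_xlabel('Feature Score')
    ax.set_ylabel('Feature Name')
    ax.set_title('F_classif scores, ranks %d-%d' % (start + 1, start + len(order)))
    return fig, ax
```

With a feature matrix `x` and labels `y`, calling `plot_top_feature_scores(x, y, top_k=30)` followed by `plt.show()` would draw the first segment, and `start=30` would draw the next one. The function returns the figure instead of showing it, so the caller can also save it with `fig.savefig(...)`.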
