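On the role of feature selection, this post reuses the description given in the "Watermelon Book" (《西瓜書》):
Three feature selection methods are in common use (note: the code below was written in a Jupyter notebook, so its layout differs slightly from a conventional script):
1. Filter-based feature selection
Put simply, filter-based feature selection keeps only those features whose relevance to the response (target) variable, measured by a statistic such as the correlation coefficient, mutual information, or a chi-squared test, exceeds a set threshold.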
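As a rough illustration of the filter idea, the sketch below scores each column by its absolute Pearson correlation with the target and keeps the columns above a threshold. The helper names `pearson` and `filter_by_correlation`, the toy data, and the threshold of 0.5 are all assumptions made for this example; they are not part of scikit-learn.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship with y
    return cov / math.sqrt(var_x * var_y)

def filter_by_correlation(X, y, threshold):
    """Return indices of columns whose |correlation| with y exceeds threshold."""
    columns = list(zip(*X))
    return [j for j, col in enumerate(columns) if abs(pearson(col, y)) > threshold]

# Hypothetical toy data: column 0 tracks y, column 1 is constant, column 2 is weakly related
X_toy = [[1.0, 5.0, 2.0], [2.0, 5.0, 1.0], [3.0, 5.0, 2.5], [4.0, 5.0, 1.5]]
y_toy = [1.1, 1.9, 3.2, 3.9]

selected = filter_by_correlation(X_toy, y_toy, threshold=0.5)
print(selected)
```

Only column 0 passes here. The same pattern applies to other relevance measures: compute one score per feature, then compare it with a threshold.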
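scikit-learn provides the following main filter-based feature selection methods:
1) Removing features whose variance is below a given threshold
A feature with low variance has a highly concentrated distribution and little diversity, so it carries little information and contributes little to the model. For example, a feature whose value is 0 for every sample cannot help during model training.
For the data below, setting threshold filters out the features whose variance is below threshold.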
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# The threshold is the variance of a binary feature in which one value makes up 80% of the samples: p * (1 - p) = 0.8 * 0.2 = 0.16
var_selection = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_selection = var_selection.fit_transform(X)
X_selection
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
The variance computed for each feature column is as follows (threshold=0.16):
# Variance computed for each feature column
var_selection.variances_
array([0.13888889, 0.22222222, 0.25 ])
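The variances above can be checked by hand. The sketch below recomputes the population variance (dividing by n) of each column with the standard library only and applies the same threshold; it is a verification aid under that assumption, not a replacement for VarianceThreshold.

```python
from statistics import pvariance

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
threshold = 0.8 * (1 - 0.8)  # variance of a binary feature with p = 0.8

# Population variance of each column (zip(*X) transposes rows into columns)
variances = [pvariance(col) for col in zip(*X)]

# Keep the columns whose variance is above the threshold
kept = [j for j, v in enumerate(variances) if v > threshold]

print([round(v, 8) for v in variances])
print(kept)
```

The first column is 1 in only one of six samples, so its variance is (1/6) * (5/6) ≈ 0.139, which falls below 0.16. That is why only the second and third columns survive.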
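2) Univariate feature selection
Univariate feature selection picks the best features by applying a univariate statistical test to each feature. It can be viewed as a preprocessing step for an estimator.
Example:
i. SelectKBest keeps the k features with the highest scores.
# Univariate feature selection with the chi-squared test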
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.datasets import load_iris
"""
Parameters
| ----------
| score_func : callable
| Function taking two arrays X and y, and returning a pair of arrays
| (scores, pvalues) or a single array with scores.
| Default is f_classif (see below "See also"). The default function only
| works with classification tasks.
|
| k : int or "all", optional, default=10
| Number of top features to select.
| The "all" option bypasses selection, for use in a parameter search.
"""
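X, y = load_iris(return_X_y=True)
# Score each feature against the target with the chi-squared test and keep the k=2 highest-scoring features
kBest = SelectKBest(chi2, k=2)
# Store the transformed data so that it can be inspected below
X_new = kBest.fit_transform(X, y)
print(kBest.scores_, kBest.pvalues_)
print(X.shape, X_new.shape)
print(X[:5, 2:], "\n", X_new[:5, :])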
# chi2(X, y)
[ 10.81782088 3.7107283 116.31261309 67.0483602 ] [4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]
(150, 4) (150, 2)
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]]
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]]
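To make the score behind SelectKBest less of a black box, here is a minimal pure-Python sketch of a chi-squared score for non-negative features followed by a top-k pick. It follows my understanding of how the chi2 scorer works (observed per-class feature totals compared with the totals expected under independence). The function names `chi2_scores` and `select_k_best` and the toy data are assumptions for illustration, and the sketch assumes no feature column sums to zero.

```python
def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the class labels."""
    classes = sorted(set(y))
    n_samples, n_features = len(X), len(X[0])
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Observed: total of feature j accumulated within class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected under independence: class frequency times the feature total
            class_prob = sum(1 for label in y if label == c) / n_samples
            expected = class_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def select_k_best(scores, k):
    """Indices of the k highest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Hypothetical toy data: feature 0 is concentrated in class 1, feature 1 is identical everywhere
X_toy = [[3, 1], [4, 1], [0, 1], [1, 1]]
y_toy = [1, 1, 0, 0]

scores = chi2_scores(X_toy, y_toy)
print(scores)
print(select_k_best(scores, k=1))
```

Feature 1 scores 0 because its per-class totals match the expected totals exactly, so a top-1 selection keeps feature 0.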
ii. SelectPercentile keeps the given top percentage of features, ranked by score.
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2
from sklearn.datasets import load_iris
"""
score_func : callable
| Function taking two arrays X and y, and returning a pair of arrays
| (scores, pvalues) or a single array with scores.
| Default is f_classif (see below "See also"). The default function only
| works with classification tasks.
|
| percentile : int, optional, default=10
| Percent of features to keep.
"""
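X, y = load_iris(return_X_y=True)
# Score each feature against the target with the chi-squared test and keep the top 50% of features
# Note: percentile is given on a 0-100 scale, so 50 (not 0.5) keeps half of the features
percentile = SelectPercentile(chi2, percentile=50)
X_new = percentile.fit_transform(X, y)
print(percentile.scores_, percentile.pvalues_)
print(X.shape, X_new.shape)
print(X[:5, 2:], "\n", X_new[:5, :])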
# chi2(X, y)
[ 10.81782088 3.7107283 116.31261309 67.0483602 ] [4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]
(150, 4) (150, 2)
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]]
[[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]]
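Because univariate selection acts as a preprocessing step for an estimator, it can be chained with a model so that the selector is fitted only on training data. The sketch below is one possible arrangement; the step names "select" and "clf", the choice of LogisticRegression, and max_iter=1000 are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Selection and classification chained into a single estimator
pipe = Pipeline([
    ("select", SelectKBest(chi2, k=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The selector is re-fitted on the training part of every fold
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(cv_scores.mean())

# Fit on the full data to inspect which features were kept
pipe.fit(X, y)
support = pipe.named_steps["select"].get_support()
print(support)
```

Putting the selector inside the pipeline keeps the held-out fold out of the scoring step, which avoids leaking information about the test data into the feature choice.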
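2. Wrapper-based feature selection
Put simply, wrapper-based feature selection trains a model repeatedly and, in each round, removes the features that contribute least, until the minimum number of features is reached or model performance drops sharply.
Parameter description (for RFE, recursive feature elimination):
Parameters
estimator: the model used for feature selection; it must be able to express feature importance
n_features_to_select: the number of features to select
step: the number or percentage of features dropped in each iteration
Attributes
n_features_: the number of selected features
support_: Boolean mask showing whether each feature is selected (True or False); True wherever ranking_ equals 1
ranking_: the rank of each feature; selected features have rank 1
estimator_: the fitted estimator used to select the features
Example: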
# Feature selection with recursive feature elimination (RFE)
%matplotlib inline
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from sklearn.svm import SVC
digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)
y = digits.target
print(X.shape)
# A simple model used for training
svc = SVC(kernel="linear", C=1)
# n_features_to_select: number of features to select; step: number of features removed per iteration
rfe = RFE(estimator=svc, n_features_to_select=16, step=1)
rfe.fit(X, y)
# Feature ranking, reshaped to the 8x8 image grid
ranking = rfe.ranking_.reshape(digits.images[0].shape)
print(ranking.shape)
# ranking_ gives the final rank of each feature; selected features have ranking_ equal to 1
print("Final rank of each feature:", rfe.ranking_)
# support_ shows whether each feature is selected (True or False); positions where ranking_ is 1 are True
print("Whether each feature is selected:", rfe.support_)
# n_features_ is the number of features finally selected
print("Number of selected features:", rfe.n_features_)
print("Model:\n", rfe.estimator_)
# Pick the feature columns according to the selection result
X_selected = X[:, rfe.support_]
print("Shape of the selected feature data:", X_selected.shape)
# Plot pixel ranking; ranking -> (8, 8) gives the importance rank of each pixel
plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
(1797, 64)
(8, 8)
Final rank of each feature: [49 35 16 8 1 2 19 36 42 22 15 28 1 17 29 37 39 26 4 1 13 1 24 38
40 30 1 3 5 23 1 44 48 27 10 20 14 1 1 47 46 25 1 1 1 1 1 43
41 32 11 21 9 1 7 33 45 34 1 12 18 6 1 31]
Whether each feature is selected: [False False False False True False False False False False False False
True False False False False False False True False True False False
False False True False False False True False False False False False
False True True False False False True True True True True False
False False False False False True False False False False True False
False False True False]
Number of selected features: 16
Model:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
Shape of the selected feature data: (1797, 16)
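The elimination loop behind this section can be sketched in a few lines. The version below is a simplified stand-in, not scikit-learn's RFE: it assumes an ordinary least-squares fit (via NumPy) as the model, uses the absolute coefficient as the importance, and drops one feature per round. The function name `recursive_elimination` and the synthetic data are hypothetical.

```python
import numpy as np

def recursive_elimination(X, y, n_features_to_select):
    """Drop the weakest feature one at a time until n_features_to_select remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []  # feature indices in the order they were dropped
    while len(remaining) > n_features_to_select:
        # Refit the model on the surviving columns only
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # The feature with the smallest absolute coefficient contributes least
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated

# Hypothetical synthetic data: only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.1 * rng.normal(size=200)

selected, eliminated = recursive_elimination(X_demo, y_demo, n_features_to_select=2)
print(selected, eliminated)
```

Refitting after every removal is what separates this from a one-shot ranking: a feature's coefficient can change once a correlated feature has been dropped.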
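3. Embedded feature selection
Principles of embedded feature selection:
a. Feature selection based on L1 regularization
1) For regression problems, use Lasso
2) For classification problems, use LR (logistic regression) or LinearSVC
3) L1-regularization-based methods select features according to coef_
b. Feature selection based on tree models
1) Tree-based methods select features according to feature_importances_
Example:
1) Embedded feature selection with a LassoCV model
Load the diabetes dataset: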
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names
print(feature_names)
X[:2]
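Train a LassoCV estimator on the diabetes dataset:
clf = LassoCV().fit(X, y)
# The coefficients learned by LassoCV can be positive or negative: a positive coefficient raises the predicted target, a negative one lowers it
# Both directions count as influence, so take the absolute value of the coefficients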
importance = np.abs(clf.coef_)
print(importance)
[ 0. 226.2375274 526.85738059 314.44026013 196.92164002 1.48742026 151.78054083 106.52846989 530.58541123 64.50588257]
Select the features whose LassoCV coefficients have the largest absolute values:
idx_third = importance.argsort()[-3]
# Set the threshold just above the third-largest importance so that only the top 2 features pass
threshold = importance[idx_third] + 0.01
# Indices of the top-2 features
idx_features = (-importance).argsort()[:2]
# Names of the top-2 features
name_features = np.array(feature_names)[idx_features]
print('Selected features: {}'.format(name_features))
sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X, y)
X_transform = sfm.transform(X)
# Number of features after selection
n_features = X_transform.shape[1]
X_transform.shape, n_features
Selected features: ['s5' 'bmi']
((442, 2), 2)
Inspect the feature selection result:
print("Feature selection mask:", sfm.get_support())
print("Model coefficients:", sfm.estimator_.coef_)
print("Feature selection threshold:", sfm.threshold_)
Feature selection mask: [False False True False False False False False True False]
Model coefficients: [ -0. -226.2375274 526.85738059 314.44026013 -196.92164002 1.48742026 -151.78054083 106.52846989 530.58541123 64.50588257]
Feature selection threshold: 314.450260129206
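The threshold trick used in this example (placing the cut just above the third-largest importance so that exactly two features pass) can be replayed without scikit-learn. The sketch below reuses the absolute coefficients printed earlier in this section and the standard feature names of the diabetes dataset; it assumes, as I understand SelectFromModel to behave, that a feature is kept when its importance is greater than or equal to the threshold.

```python
# Absolute LassoCV coefficients as printed earlier in this section
importance = [0.0, 226.2375274, 526.85738059, 314.44026013, 196.92164002,
              1.48742026, 151.78054083, 106.52846989, 530.58541123, 64.50588257]
# Feature names of the diabetes dataset, in column order
feature_names = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

# Place the threshold just above the third-largest importance
third_largest = sorted(importance)[-3]
threshold = third_largest + 0.01

# Keep every feature whose importance reaches the threshold
mask = [value >= threshold for value in importance]
selected = [name for name, keep in zip(feature_names, mask) if keep]

print(threshold)
print(selected)
```

An alternative worth considering is SelectFromModel's max_features parameter: passing max_features=2 together with threshold=-np.inf selects the top two features directly, without computing a threshold by hand.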
2) Embedded feature selection with LR (logistic regression)
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
X = [[ 0.87, -1.34, 0.31 ],
[-2.79, -0.02, -0.85 ],
[-1.34, -0.48, -2.55 ],
[ 1.92, 1.48, 0.65 ]]
y = [0, 1, 0, 1]
selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)
print("Model coefficients:", selector.estimator_.coef_)
# With the default L2 penalty, the selection threshold defaults to the mean of the absolute coefficients
print("Feature selection threshold:", selector.threshold_, np.mean(np.abs(selector.estimator_.coef_)))
print("Feature selection mask:", selector.get_support())
X_transformed = selector.transform(X)
X_transformed.shape
Model coefficients: [[-0.32857694 0.83411609 0.46668853]]
Feature selection threshold: 0.5431271870420732 0.5431271870420732
Feature selection mask: [False True False]
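The default rule in this example is simple arithmetic and can be checked by hand. The sketch below recomputes the mean absolute coefficient from the rounded values printed above and compares each feature with it; the last digits differ slightly from the scikit-learn output because the printed coefficients are rounded.

```python
coef = [-0.32857694, 0.83411609, 0.46668853]  # coefficients printed above (rounded)

# Importance of each feature is the absolute value of its coefficient
importances = [abs(c) for c in coef]

# Default threshold for this model: the mean importance
threshold = sum(importances) / len(importances)

# A feature is kept when its importance reaches the threshold
mask = [value >= threshold for value in importances]

print(threshold)
print(mask)
```

Only the second feature has an absolute coefficient above the mean, which matches the selection mask shown above.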
3) Feature selection based on L1 regularization (regression: Lasso; classification: LR/LinearSVC)
# Feature selection on the iris dataset
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
print(X.shape)
# LinearSVC classifier with an L1 regularization term
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
print(model.get_support())
(150, 4)
(150, 3)
[ True True True False]
4) Feature selection based on tree models
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
print(X.shape)
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
print(clf.feature_importances_ )
# The default threshold is the mean of clf.feature_importances_; it is passed explicitly here
model = SelectFromModel(clf, prefit=True, threshold=np.mean(clf.feature_importances_))
X_new = model.transform(X)
print(X_new.shape)
print(model.get_support())
(150, 4)
[0.10608772 0.0658854 0.43061022 0.39741666]
(150, 2)
[False False True True]
Feature selection model and its parameters:
model
SelectFromModel(estimator=ExtraTreesClassifier(bootstrap=False,
class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=50, n_jobs=None,
oob_score=False,
random_state=None, verbose=0,
warm_start=False),
max_features=None, norm_order=1, prefit=True,
threshold=0.24999999999999994)
Model attributes:
model.estimator, model.threshold, model.max_features, model.prefit, clf.feature_importances_
(ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
0.24999999999999994,
None,
True,
array([0.10608772, 0.0658854 , 0.43061022, 0.39741666]))