sklearn - Data Features (Lecture 5)


1.1. Purpose:
Data and features determine the upper bound of machine learning; models and algorithms merely approximate that bound. Feature engineering is the core of the work.
Good features improve the accuracy of prediction results.

1.2. Requirements:
Reduce overfitting by removing redundant data; improve algorithm accuracy; reduce training time.

1.3. Feature processing methods in scikit-learn:
Univariate feature selection; recursive feature elimination; principal component analysis; feature importance.

2.1. Univariate feature selection

SelectKBest() selects data features using statistical tests.
1) Chi-squared test: tests the dependence between a feature and the target variable by measuring the deviation between the observed values and the theoretically expected values; the larger the chi-squared statistic, the larger the deviation, while a value of 0 means the observations match the theoretical values exactly.

Example 1: use the chi-squared test to select the 4 features with the greatest influence on the result.
# Select data features via the chi-squared test
import pandas as pd
import numpy as np

np.set_printoptions(precision=3)

# Load the data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)

# Split the data into input features (X) and the output target (Y)
array = data.values
X = array[:, 0:8]
Y = array[:, 8]


# Select the features with the highest chi-squared scores
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


def getFetures(X=X, Y=Y):
    # Feature selection: keep the 4 features with the highest chi-squared scores
    test = SelectKBest(score_func=chi2, k=4)
    fit = test.fit(X, Y)

    # Rank the per-feature chi-squared scores
    scores = pd.DataFrame(fit.scores_, names[0:8])
    scores = scores.sort_values(by=0, ascending=True)
    print(scores)

    # Reduce X to the selected columns
    features = fit.transform(X)
    print(features)

getFetures()
"""
              0
pedi     5.3927
pres    17.6054
skin    53.1080
preg   111.5197
mass   127.6693
age    181.3037
plas  1411.8870
test  2175.5653

[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 ...
 [121.  112.   26.2  30. ]
 [126.    0.   30.1  47. ]
 [ 93.    0.   30.4  23. ]]
 """
# Correlation coefficients
# Mutual information method
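As a variation on Example 1, the mutual information method noted above plugs into the same SelectKBest interface. A minimal sketch, assuming X, Y, names and pandas (pd) are already defined as in Example 1; the function name getFeaturesMI is illustrative:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

def getFeaturesMI(X=X, Y=Y, k=4):
    # Score each feature by its estimated mutual information with the class label
    selector = SelectKBest(score_func=mutual_info_classif, k=k)
    fit = selector.fit(X, Y)
    scores = pd.DataFrame(fit.scores_, names[0:8]).sort_values(by=0)
    print(scores)            # per-feature mutual information estimates
    return fit.transform(X)  # keep only the k best-scoring columns

getFeaturesMI()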
2.2. Feature selection via recursive feature elimination
Recursive feature elimination (RFE) trains a base model over multiple rounds; after each round it removes the features with the smallest weight coefficients and starts the next round on the reduced feature set. The accuracy of each base model is used to find the features with the greatest influence on the final prediction.
 
Example 2: select features via recursive elimination - find the 3 most influential features
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def getFeturesRFE(X=X, Y=Y):
    # Feature selection: recursive elimination with a logistic regression base model
    model = LogisticRegression()
    rfe = RFE(model, n_features_to_select=3)  # find the 3 most important features
    fit = rfe.fit(X, Y)

    print("Number of features:", fit.n_features_)
    print("Selected features:", fit.support_)
    print("Feature ranking:", fit.ranking_)

getFeturesRFE()

# Number of features: 3
# Selected features: [ True False False False False  True  True False]
# Feature ranking: [1 2 4 5 6 1 1 3]

 

2.3. Principal Component Analysis (PCA)
Uses linear algebra to transform and compress the data; this is known as dimensionality reduction.
Types:
PCA (principal component analysis): reduces dimensionality so that the projected samples have maximum variance (unsupervised); used for clustering and similar tasks.
LDA (linear discriminant analysis): reduces dimensionality so that the projected samples are better separated by class (supervised).

Example 3:
# Select data features via principal component analysis
from sklearn.decomposition import PCA

def getFeturesPCA(X=X, Y=Y):
    # Feature selection: project onto the 3 principal components
    pca = PCA(n_components=3)
    fit = pca.fit(X)
    print("Explained variance: %s" % fit.explained_variance_ratio_)
    print(fit.components_)

getFeturesPCA()

# Explained variance: [0.889 0.062 0.026]
#[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
#   5.372e-04 -3.565e-03]
# [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
#  -8.168e-04 -1.402e-01]
# [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
#  -6.400e-04 -1.255e-01]]
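LDA, mentioned in 2.3 as the supervised counterpart of PCA, can be sketched the same way. A minimal illustration using scikit-learn's LinearDiscriminantAnalysis, assuming X and Y are defined as above; this sketch is not part of the original lecture examples, and with two classes LDA yields at most one component:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def getFeturesLDA(X=X, Y=Y):
    # Supervised dimensionality reduction: LDA uses the class labels, unlike PCA
    lda = LinearDiscriminantAnalysis(n_components=1)
    X_lda = lda.fit_transform(X, Y)
    print(X_lda.shape)                    # one column per retained discriminant direction
    print(lda.explained_variance_ratio_)  # share of between-class variance explained

getFeturesLDA()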
2.4. Feature importance
Bagged decision trees, random forests, and extra trees (extremely randomized trees) can all compute feature importance.

Example 4:
# Compute feature importance with extremely randomized trees
from sklearn.ensemble import ExtraTreesClassifier

def test_ExtraTreesClassifier(X=X, Y=Y):
    # Fit the tree ensemble and read the per-feature importance scores
    model = ExtraTreesClassifier()
    fit = model.fit(X, Y)
    print(fit.feature_importances_)

test_ExtraTreesClassifier()

#[0.11  0.23  0.1   0.082 0.076 0.142 0.118 0.142]
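Random forests, also listed in 2.4, expose the same feature_importances_ attribute. A minimal sketch under the same X and Y; the hyperparameter values here are illustrative:

from sklearn.ensemble import RandomForestClassifier

def test_RandomForestClassifier(X=X, Y=Y):
    # Fit a random forest and read the per-feature importance scores
    model = RandomForestClassifier(n_estimators=100, random_state=7)
    fit = model.fit(X, Y)
    print(fit.feature_importances_)

test_RandomForestClassifier()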
3. Notes:


1. Univariate feature selection

1.1. Description:
Univariate feature selection works by selecting the best features based on univariate statistical tests.
It can be seen as a preprocessing step for an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

A common univariate statistical test is applied to each feature:
false positive rate (SelectFpr), false discovery rate (SelectFdr), or family-wise error (SelectFwe).
GenericUnivariateSelect allows univariate feature selection with a configurable strategy,
which makes it possible to select the best univariate selection strategy with a hyperparameter search estimator.
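A minimal sketch of GenericUnivariateSelect with its configurable mode; the mode='k_best' and param=2 values are illustrative assumptions, not prescribed by this note:

from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2

X_iris, y_iris = load_iris(return_X_y=True)
# mode can be 'percentile', 'k_best', 'fpr', 'fdr' or 'fwe'; param is its count/threshold
transformer = GenericUnivariateSelect(chi2, mode='k_best', param=2)
X_reduced = transformer.fit_transform(X_iris, y_iris)
print(X_iris.shape, X_reduced.shape)  # (150, 4) -> (150, 2)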

For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif

The F-test based methods estimate the degree of linear dependence between two random variables.
Mutual information methods can capture any kind of statistical dependence, but being nonparametric, they require more samples for an accurate estimate.

Feature selection with sparse data:
If you use sparse data, chi2, mutual_info_regression, and mutual_info_classif will handle the data without making it dense.
Warning: beware not to use a regression scoring function with a classification problem; you will get useless results.
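To make the F-test vs. mutual information contrast concrete, here is a minimal sketch on synthetic regression data (the data generation is an illustrative assumption): the first feature depends linearly on the target, the second only nonlinearly.

import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(1000, 2))
y_demo = X_demo[:, 0] + np.sin(6 * np.pi * X_demo[:, 1])  # linear + nonlinear signal

F, _ = f_regression(X_demo, y_demo)          # large F only for the linearly related feature
mi = mutual_info_regression(X_demo, y_demo)  # high for both features
print(F)
print(mi)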
=======================================================================================
1.2. Functions:
SelectKBest(score_func=f_classif, *, k=10)    Removes all features except the k highest scoring ones.
 Parameters:
 k: int or "all", optional, default=10. Number of top features to select. The "all" option bypasses selection, for use in a parameter search.
 score_func: callable; a function taking two arrays X and y and returning either a pair of arrays (scores, pvalues) or a single array of scores.
       The default is f_classif, which is only suitable for classification tasks.

Score functions available for score_func:
f_classif:               ANOVA F-value between label/feature for classification tasks.
mutual_info_classif:     Mutual information for a discrete target.
chi2:                    Chi-squared statistics of non-negative features for classification tasks.
f_regression:            F-value between label/feature for regression tasks.
mutual_info_regression:  Mutual information for a continuous target.

Related univariate selectors:
SelectPercentile:        Select features based on a percentile of the highest scores.
SelectFpr:               Select features based on a false positive rate test.
SelectFdr:               Select features based on an estimated false discovery rate.
SelectFwe:               Select features based on the family-wise error rate.
GenericUnivariateSelect: Univariate feature selector with configurable mode.
 
	
Attributes:
scores_:  array-like, shape (n_features,)  Scores of the features.
pvalues_: array, shape (n_features,)  p-values of the feature scores; None if score_func returns only scores.
	
Example:
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
X, y = load_digits(return_X_y=True)
X.shape      # (1797, 64)
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
X_new.shape  # (1797, 20)
	
Note:
Ties between features with equal scores will be broken in an unspecified way.
=======================================================================================	
SelectPercentile(score_func=<function f_classif>, *, percentile=10)
# Select features according to a percentile of the highest scores; removes all but the user-specified highest-scoring percentage of features.
Parameters:
score_func:
    A function taking two arrays X and y and returning either a pair of arrays (scores, pvalues) or a single array of scores.
    The default is f_classif, which is only suitable for classification tasks.

percentile: int, optional, default=10. Percent of features to keep.

Attributes:
scores_:  array-like, shape (n_features,)  Scores of the features.
pvalues_: array-like, shape (n_features,)  p-values of the feature scores; None if score_func returns only scores.
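A minimal sketch of SelectPercentile on the digits data, mirroring the SelectKBest example above; the percentile value and the use of chi2 are illustrative choices:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

X_dig, y_dig = load_digits(return_X_y=True)
# Keep only the top 10% of features by chi-squared score
X_top = SelectPercentile(chi2, percentile=10).fit_transform(X_dig, y_dig)
print(X_dig.shape, X_top.shape)  # (1797, 64) -> (1797, 7)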
=======================================================================================
Methods:
fit(X, y)                 Run the score function on (X, y) and get the appropriate features.
fit_transform(X[, y])     Fit to the data, then transform it.
get_params([deep])        Get the parameters of this estimator.
get_support([indices])    Get a mask, or integer index, of the selected features.
inverse_transform(X)      Reverse the transformation operation.
set_params(**params)      Set the parameters of this estimator.
transform(X)              Reduce X to the selected features.
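A minimal sketch of get_support() and inverse_transform() from the method table above, on the iris data; variable names are illustrative:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X_iris, y_iris = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2).fit(X_iris, y_iris)

print(selector.get_support())               # boolean mask over the 4 original features
print(selector.get_support(indices=True))   # the same selection as column indices
X_sel = selector.transform(X_iris)          # (150, 2): only the selected columns
X_back = selector.inverse_transform(X_sel)  # (150, 4): dropped columns filled with zeros
print(X_back.shape)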



For instance, we can perform a χ² test on the samples to retrieve only the two best features:

Example:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X, y = load_iris(return_X_y=True)
X.shape      # (150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape  # (150, 2)
These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):

=======================================================================================
sklearn.feature_selection.f_classif(X, y)
Compute the ANOVA F-value for the provided sample.

Parameters
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    The set of regressors that will be tested sequentially.

y : array of shape (n_samples,)
    The target vector (class labels).

Returns
F : array, shape = [n_features,]  The set of F values.

pval : array, shape = [n_features,]  The set of p-values.
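A minimal sketch calling f_classif directly; iris is used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X_iris, y_iris = load_iris(return_X_y=True)
F, pval = f_classif(X_iris, y_iris)
print(F)     # one ANOVA F statistic per feature
print(pval)  # corresponding p-values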
=======================================================================================
sklearn.feature_selection.chi2(X, y)
Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, 
which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features 
that are the most likely to be independent of class and therefore irrelevant for classification.

Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Sample vectors.

y : array-like of shape (n_samples,)
    Target vector (class labels).

Returns
chi2 : array, shape = (n_features,)
    Chi-squared statistics of each feature.

pval : array, shape = (n_features,)
    p-values of each feature.
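A minimal sketch calling chi2 directly; iris is used for illustration since all of its features are non-negative:

from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X_iris, y_iris = load_iris(return_X_y=True)
chi2_stats, pval = chi2(X_iris, y_iris)
print(chi2_stats)  # larger value = stronger dependence on the class label
print(pval)        # small p-value = unlikely to be independent of the class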

=======================================================================================

Univariate Feature Selection
An example showing univariate feature selection.

Noisy (non informative) features are added to the iris data and univariate feature selection is applied. For each feature, we plot the p-values for the univariate feature selection and the corresponding weights of an SVM. We can see that univariate feature selection selects the informative features and that these have larger SVM weights.

In the total set of features, only the 4 first ones are significant. We can see that they have the highest score with univariate feature selection. The SVM assigns a large weight to one of these features, but also selects many of the non-informative features. Applying univariate feature selection before the SVM increases the SVM weight attributed to the significant features, and will thus improve classification.

[Figure: sphx_glr_plot_feature_selection_001.png - comparing univariate scores and SVM weights per feature]
Out:
Classification accuracy without selecting features: 0.789
Classification accuracy after univariate feature selection: 0.868

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# #############################################################################
# Import some data to play with

# The iris dataset
X, y = load_iris(return_X_y=True)

# Some noisy data not correlated
E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))

# Add the noisy data to the informative features
X = np.hstack((X, E))

# Split dataset to select feature and evaluate the classifier
X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
)

plt.figure(1)
plt.clf()

X_indices = np.arange(X.shape[-1])

# #############################################################################
# Univariate feature selection with F-test for feature scoring
# We use the default selection function to select the four
# most significant features
selector = SelectKBest(f_classif, k=4)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)')

# #############################################################################
# Compare to the weights of an SVM
clf = make_pipeline(MinMaxScaler(), LinearSVC())
clf.fit(X_train, y_train)
print('Classification accuracy without selecting features: {:.3f}'
      .format(clf.score(X_test, y_test)))

svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
svm_weights /= svm_weights.sum()

plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight')

clf_selected = make_pipeline(
        SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC()
)
clf_selected.fit(X_train, y_train)
print('Classification accuracy after univariate feature selection: {:.3f}'
      .format(clf_selected.score(X_test, y_test)))

svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
svm_weights_selected /= svm_weights_selected.sum()

plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection')


plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()

 

 
