特徵工程--剔除共線特徵

原創

2020-03-06 17:18

所謂共線性，指的是自變量之間存在較強甚至完全的線性相關關係。這會導致模型預測能力下降，增加對於模型結果的解釋成本。

如:

plot_data = data[['A', 'B']].dropna()

plt.plot(plot_data['A'], plot_data['B'], 'bo')
plt.xlabel('Site EUI'); plt.ylabel('Weather Norm EUI')
plt.title('Weather Norm EUI vs Site EUI, R = %.4f' % np.corrcoef(data[['A', 'B']].dropna(), rowvar=False)[0][1])

剔除共線特徵函數

def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model
        to generalize and improves the interpretability of the model.
        
    Inputs: 
        threshold: any features with correlations greater than this value are removed
    
    Output: 
        dataframe that contains only the non-highly-collinear features
    '''
    
    y = x['score']
    x = x.drop(columns = ['score'])
    
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []
    
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j: (j+1), (i+1): (i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)
            
            if val >= threshold:
                # print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])
        
    drops = set(drop_cols)
    x = x.drop(columns = drops)
    x['score'] = y
    return x

features = remove_collinear_features(features, 0.6)
features = features.dropna(axis = 1, how = 'all')
print(features.shape)
features.head()

(11319, 65)

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

特徵工程--剔除共線特徵

剔除共線特徵函數

數據挖掘之房價預測任務

協同過濾與隱語義模型推薦系統實例2: 基於相似度的推薦

ARIMA 時間序列2: 評估和參數選擇

時間處理date_range,truncate,Timestamp,Period,Timedelta,resample,rolling

HMM隱馬爾科夫模型與實例2: 預測股票走勢

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

特徵工程--剔除共線特徵

剔除共線特徵 函數

剔除共線特徵函數