I. Introduction
Common feature selection methods fall into three categories: filter, wrapper, and embedded.
(1) Filter methods
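The best-known filter method is Relief. The idea is as follows: for each sample $x_i$, first find its nearest neighbor among samples of the same class, called the "near-hit" $x_{i,nh}$; then find its nearest neighbor among samples of a different class, called the "near-miss" $x_{i,nm}$. These are used to compute the relevance statistic of attribute $j$:

$$\delta^j = \sum_i \left( -\mathrm{diff}(x_i^j, x_{i,nh}^j)^2 + \mathrm{diff}(x_i^j, x_{i,nm}^j)^2 \right)$$

where $x_i^j$ is the value of sample $x_i$ on attribute $j$. The larger $\delta^j$ is, the better attribute $j$ separates the classes.
For a discrete attribute:

$$\mathrm{diff}(x_a^j, x_b^j) = \begin{cases} 0, & x_a^j = x_b^j \\ 1, & \text{otherwise} \end{cases}$$

For a continuous attribute, the values must first be normalized to $[0, 1]$, and then:

$$\mathrm{diff}(x_a^j, x_b^j) = \left| x_a^j - x_b^j \right|$$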
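As a small illustration, the sketch below computes the Relief relevance statistic (minus the squared difference to the near-hit, plus the squared difference to the near-miss, summed over all samples) on a four-sample toy dataset. The names `diff` and `relief_statistic` and the toy data are assumptions made for this example, and the features are assumed to be continuous and already scaled to [0, 1].

```python
import numpy as np


def diff(a, b, discrete=False):
    """Per-attribute difference: 0/1 for discrete values, |a - b| for continuous ones."""
    if discrete:
        return (a != b).astype(float)
    return np.abs(a - b)


def relief_statistic(X, y):
    """Sum -diff(near-hit)^2 + diff(near-miss)^2 over all samples (continuous attributes)."""
    m, n = X.shape
    delta = np.zeros(n)
    for i in range(m):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                                 # never pick the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        delta += -diff(X[i], X[hit]) ** 2 + diff(X[i], X[miss]) ** 2
    return delta


# Toy data (hypothetical): attribute 0 separates the classes, attribute 1 does not.
X = np.array([[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]])
y = np.array([0, 0, 1, 1])
delta = relief_statistic(X, y)
print(delta)  # approximately [3.2, -1.92]
```

Attribute 0 receives a positive statistic and attribute 1 a negative one, which matches the intuition that only attribute 0 helps tell the classes apart in this toy set.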
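Relief was designed for binary classification. For multi-class problems, Relief-F is used instead: for a sample $x_i$ belonging to class $c$, first find its $K$ near-hits $x_{i,nh_r}$ ($r = 1, \dots, K$) within class $c$, then find $K$ near-misses $x_{i,l,nm_r}$ in each of the other classes $l \neq c$. The relevance statistic of attribute $j$ becomes:

$$\delta^j = \sum_i \left( -\frac{1}{K} \sum_{r=1}^{K} \mathrm{diff}(x_i^j, x_{i,nh_r}^j)^2 + \sum_{l \neq c} p_l \cdot \frac{1}{K} \sum_{r=1}^{K} \mathrm{diff}(x_i^j, x_{i,l,nm_r}^j)^2 \right)$$

where $p_l$ is the proportion of class-$l$ samples in the dataset.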
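(2) Wrapper methods
Wrapper feature selection takes the performance of the learner that will ultimately be used as the evaluation criterion for a feature subset. LVW (Las Vegas Wrapper) is a typical example: it searches over feature subsets with a randomized strategy under the Las Vegas framework. This randomized strategy is introduced next.
There are two well-known families of randomized algorithms: the Las Vegas method mentioned above, and the Monte Carlo method.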
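Monte Carlo algorithm: the more samples are drawn, the closer the result gets to the optimum. For example, suppose a basket holds 100 apples and I have to pick out the largest, taking one at a time with my eyes closed. I take one at random, then take another at random and compare the two, keeping the larger, then take another, and so on. After every draw, the apple I keep is at least as large as the one before. The more draws I make, the larger the apple I end up with, but unless I have examined all 100 apples I cannot be sure it is the largest. This apple-picking procedure is a Monte Carlo algorithm: it tries to find a good solution but does not guarantee the best one.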
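The apple example can be sketched in a few lines of Python. The function name `pick_large_apple`, the apple sizes, and the choice to draw with replacement are assumptions for illustration only.

```python
import random


def pick_large_apple(sizes, n_draws, seed=0):
    """Monte Carlo style search: keep the largest apple seen in n_draws random draws."""
    rng = random.Random(seed)
    best = rng.choice(sizes)
    for _ in range(n_draws - 1):
        candidate = rng.choice(sizes)  # draw at random, with replacement
        best = max(best, candidate)    # the kept apple never gets smaller
    return best


apples = list(range(1, 101))  # hypothetical apple sizes 1..100
few = pick_large_apple(apples, n_draws=5)
many = pick_large_apple(apples, n_draws=50)
print(few, many)
```

Because both calls use the same seed, the 50-draw run sees the same first five apples as the 5-draw run, so its result can only be equal or larger. Neither run is guaranteed to return the size-100 apple.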
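Las Vegas algorithm: the more samples are drawn, the more likely it is that the optimal solution is found. For example, suppose there is a lock and 100 keys, only one of which fits. I pick a key at random and try it; if it does not open the lock, I pick another. The more keys I try, the better my chance of opening the lock (the optimal solution), but until the lock opens, every wrong key is useless. This key-trying procedure is a Las Vegas algorithm: it aims for the best solution but does not guarantee finding it.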
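Returning to LVW, the following is a minimal sketch of how random subset search under the Las Vegas framework might look. The names `lvw` and `loo_1nn_error`, the stopping parameter `T`, the choice to start from the full feature set, and the toy data are all assumptions; the evaluator is a leave-one-out 1-nearest-neighbour error standing in for whichever learner would actually be used.

```python
import numpy as np


def loo_1nn_error(X, y):
    """Leave-one-out error of a 1-nearest-neighbour classifier (stand-in evaluator)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not vote for itself
    return float(np.mean(y[np.argmin(d, axis=1)] != y))


def lvw(X, y, evaluate, T=30, seed=0):
    """Random subset search that stops after T consecutive rounds without improvement."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    best_subset = np.arange(n)               # start from the full feature set
    best_err, best_size = evaluate(X, y), n
    t = 0
    while t < T:
        mask = rng.random(n) < 0.5           # random candidate subset
        if not mask.any():
            continue                         # skip the empty subset
        subset = np.flatnonzero(mask)
        err = evaluate(X[:, subset], y)
        # Accept a lower error, or the same error with fewer features
        if err < best_err or (err == best_err and len(subset) < best_size):
            best_subset, best_err, best_size = subset, err, len(subset)
            t = 0                            # improvement found: reset the counter
        else:
            t += 1
    return best_subset, best_err


# Toy data (hypothetical): column 0 tracks the label, columns 1-4 are pure noise.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
informative = y[:, None] + 0.1 * rng.standard_normal((60, 1))
noise = rng.standard_normal((60, 4))
X = np.hstack([informative, noise])
subset, err = lvw(X, y, loo_1nn_error)
print(subset, err)
```

Every candidate subset requires retraining and evaluating the learner, which is why wrapper methods tend to cost far more than filter methods. As with the keys, the search may stop before it has found the best subset.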
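(3) Embedded methods
Embedded feature selection merges the feature selection process and the learner's training process into a single procedure, so that features are selected automatically while the learner is trained.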
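Consider the simplest case, a linear regression model with squared error, whose optimization objective is:

$$\min_w \sum_{i=1}^{m} (y_i - w^T x_i)^2$$

To prevent overfitting, we add a regularization term. With an L2-norm penalty the objective is:

$$\min_w \sum_{i=1}^{m} (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$$

With $\lambda > 0$, this is called ridge regression. With an L1-norm penalty the objective is:

$$\min_w \sum_{i=1}^{m} (y_i - w^T x_i)^2 + \lambda \|w\|_1$$

With $\lambda > 0$, this is called LASSO regression.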
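To find the optimum we would normally differentiate the objective. The L1 term is not differentiable at zero, however, so direct differentiation is difficult. Writing the problem generically as $\min_x f(x) + \lambda \|x\|_1$ with $f$ differentiable, we expand $f$ around the current iterate $x_k$ with a second-order Taylor expansion:

$$f(x) \approx f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k)$$

If $\nabla f$ satisfies the L-Lipschitz condition $\|\nabla f(x') - \nabla f(x)\|_2 \le L \|x' - x\|_2$, the second derivative can be replaced by $L$:

$$\hat{f}(x) \approx f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{L}{2} \|x - x_k\|_2^2$$

Simplifying the expansion by completing the square:

$$\hat{f}(x) \approx \frac{L}{2} \left\| x - \left( x_k - \frac{1}{L} \nabla f(x_k) \right) \right\|_2^2 + \mathrm{const}$$

where $\mathrm{const}$ does not depend on $x$. Adding the L1 term back, each iteration becomes:

$$x_{k+1} = \arg\min_x \frac{L}{2} \left\| x - \left( x_k - \frac{1}{L} \nabla f(x_k) \right) \right\|_2^2 + \lambda \|x\|_1$$

Let:

$$z = x_k - \frac{1}{L} \nabla f(x_k)$$

Since the components of $x$ do not interact, each component $x^i$ can be solved separately, giving the closed-form solution:

$$x_{k+1}^i = \begin{cases} z^i - \lambda / L, & z^i > \lambda / L \\ 0, & |z^i| \le \lambda / L \\ z^i + \lambda / L, & z^i < -\lambda / L \end{cases}$$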
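The iteration described here (a gradient step followed by a component-wise closed-form shrinkage) can be sketched with NumPy as follows. This is a minimal sketch, not a tuned solver: the names `soft_threshold` and `lasso_pgd`, the iteration count, the value of `lam`, and the synthetic data are assumptions for illustration. `L` is taken to be twice the largest eigenvalue of $X^T X$, which is a Lipschitz constant for the gradient of the squared-error term.

```python
import numpy as np


def soft_threshold(z, tau):
    """Component-wise closed-form solution: shrink z toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def lasso_pgd(X, y, lam, n_iter=500):
    """Proximal gradient descent for min_w sum_i (y_i - w^T x_i)^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    # The gradient of the squared loss is 2 X^T (Xw - y); its Lipschitz constant
    # is twice the largest eigenvalue of X^T X.
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        z = w - grad / L                # gradient step: z = x_k - (1/L) grad f(x_k)
        w = soft_threshold(z, lam / L)  # closed-form step with threshold lam / L
    return w


def objective(X, y, w, lam):
    """LASSO objective value, used only to check progress."""
    return float(np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w)))


# Synthetic data (hypothetical): only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_w + 0.01 * rng.standard_normal(100)
w = lasso_pgd(X, y, lam=20.0)
print(np.round(w, 3))
```

Coefficients whose `z` value falls inside the band of width `lam / L` around zero are set exactly to zero, which is how the L1 penalty performs feature selection during training.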
II. Code implementation
Relief algorithm:
import numpy as np
from random import randrange
from sklearn.datasets import make_classification
from sklearn.preprocessing import normalize


def distanceNorm(Norm, D_value):
    """Return the L1, L2, or infinity norm of the difference vector D_value."""
    if Norm == '1':
        counter = np.sum(np.absolute(D_value))
    elif Norm == '2':
        counter = np.sqrt(np.sum(np.power(D_value, 2)))
    elif Norm == 'Infinity':
        counter = np.max(np.absolute(D_value))
    else:
        raise ValueError('Unsupported norm: %s' % Norm)
    return counter


def Relief(features, labels, iter_ratio):
    """Return Relief feature weights; int(iter_ratio * m) random samples are drawn."""
    (m, n) = np.shape(features)
    distance = np.zeros((m, m))
    weight = np.zeros(n)
    if iter_ratio >= 0.5:
        # Many samples will be drawn, so precompute the full distance matrix
        for index_i in range(m):
            for index_j in range(index_i + 1, m):
                D_value = features[index_i] - features[index_j]
                distance[index_i, index_j] = distanceNorm('2', D_value)  # Euclidean distance
        distance += distance.T  # mirror the upper triangle into a symmetric matrix

    for iter_num in range(int(iter_ratio * m)):
        nearHit = None
        nearMiss = None
        distance_sort = list()
        # Pick a random sample
        index_i = randrange(0, m, 1)
        self_features = features[index_i]
        # Distances from the chosen sample to every sample
        if iter_ratio >= 0.5:
            row = distance[index_i]
        else:
            # Few samples are drawn, so compute distances for this sample only
            row = np.zeros(m)
            for index_j in range(m):
                D_value = features[index_i] - features[index_j]
                row[index_j] = distanceNorm('2', D_value)
        for index in range(m):
            if index != index_i:  # exclude the sample itself
                distance_sort.append([row[index], index, labels[index]])
        distance_sort.sort(key=lambda x: x[0])  # sort by distance
        # Walk outward until both the near-hit and the near-miss are found
        for dist, index, label in distance_sort:
            if nearHit is None and label == labels[index_i]:
                nearHit = features[index]  # near-hit
            elif nearMiss is None and label != labels[index_i]:
                nearMiss = features[index]  # near-miss
            if nearHit is not None and nearMiss is not None:
                break
        # Update the weights
        weight = weight - np.power(self_features - nearHit, 2) + np.power(self_features - nearMiss, 2)
    print(weight)
    return weight


if __name__ == '__main__':
    features, labels = make_classification(n_samples=500)  # random 500 x 20 dataset, binary classification
    features = normalize(X=features, norm='l2', axis=0)  # scale each feature column to unit L2 norm
    for x in range(1, 10):
        weight = Relief(features, labels, 1)
Relief-F algorithm:
import numpy as np
from random import randrange
from sklearn.datasets import make_classification
from sklearn.preprocessing import normalize


def distanceNorm(Norm, D_value):
    """Return the L1, L2, or infinity norm of the difference vector D_value."""
    if Norm == '1':
        counter = np.sum(np.absolute(D_value))
    elif Norm == '2':
        counter = np.sqrt(np.sum(np.power(D_value, 2)))
    elif Norm == 'Infinity':
        counter = np.max(np.absolute(D_value))
    else:
        raise ValueError('Unsupported norm: %s' % Norm)
    return counter


def Relief(features, labels, iter_ratio, k=5):
    """Return Relief-F feature weights using k near-hits and k near-misses per other class.

    Class labels are assumed to be the integers 0..C-1, since they index `termination`.
    """
    (m, n) = np.shape(features)
    distance = np.zeros((m, m))
    weight = np.zeros(n)
    if iter_ratio >= 0.5:
        # Many samples will be drawn, so precompute the full distance matrix
        for index_i in range(m):
            for index_j in range(index_i + 1, m):
                D_value = features[index_i] - features[index_j]
                distance[index_i, index_j] = distanceNorm('2', D_value)  # Euclidean distance
        distance += distance.T  # mirror the upper triangle into a symmetric matrix

    for iter_num in range(int(iter_ratio * m)):
        # Pick a random sample
        index_i = randrange(0, m, 1)
        self_features = features[index_i]
        nearHit = list()
        nearMiss = dict()
        n_labels = list(set(labels))
        termination = np.zeros(len(n_labels))  # 1 once a class has its k neighbors
        del n_labels[n_labels.index(labels[index_i])]
        for label in n_labels:
            nearMiss[label] = list()
        distance_sort = list()
        # Distances from the chosen sample to every sample
        if iter_ratio >= 0.5:
            row = distance[index_i]
        else:
            # Few samples are drawn, so compute distances for this sample only
            row = np.zeros(m)
            for index_j in range(m):
                D_value = features[index_i] - features[index_j]
                row[index_j] = distanceNorm('2', D_value)
        for index in range(m):
            if index != index_i:  # exclude the sample itself
                distance_sort.append([row[index], index, labels[index]])
        distance_sort.sort(key=lambda x: x[0])  # sort by distance
        for dist, index, label in distance_sort:
            if label == labels[index_i]:  # near-hit candidate
                if len(nearHit) < k:
                    nearHit.append(features[index])
                else:
                    termination[label] = 1
            else:  # near-miss candidate
                if len(nearMiss[label]) < k:
                    nearMiss[label].append(features[index])
                else:
                    termination[label] = 1
            if termination.all():  # every class has its k neighbors: stop searching
                break
        # Update the weights
        nearHit_term = np.zeros(n)
        for x in nearHit:
            nearHit_term += np.power(self_features - x, 2)
        for label in nearMiss:
            nearMiss_term = np.zeros(n)
            for x in nearMiss[label]:
                nearMiss_term += np.power(self_features - x, 2)
            # Each other class is weighted equally here, not by its proportion p_l
            weight += nearMiss_term / (k * len(nearMiss))
        weight -= nearHit_term / k
    print(weight)
    return weight


if __name__ == '__main__':
    # Random 500 x 20 dataset with 4 classes
    features, labels = make_classification(n_samples=500, n_classes=4, n_informative=3)
    features = normalize(X=features, norm='l2', axis=0)  # scale each feature column to unit L2 norm
    for x in range(1, 10):
        weight = Relief(features, labels, 1)
Feature selection with LASSO:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older
# release; another bundled regression dataset such as load_diabetes can be swapped in.
from sklearn.datasets import load_boston
import numpy as np

boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])  # standardize features so coefficients are comparable
Y = boston["target"]
names = boston["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)


def pretty_print_linear(coefs, names=None, sort=False):
    """Format the fitted model as 'coef * name + ...', optionally sorted by |coef|."""
    if names is None or len(names) == 0:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)


print("Lasso model coefficients: ", pretty_print_linear(lasso.coef_, names, sort=True))