Original example: Classification / Example 2: Normal and Shrinkage Linear Discriminant Analysis for classification
"""
Summary:
1. Use the score method to measure the model's classification accuracy for the current number of features.
2. Compare accuracy with and without shrinkage; only some solvers support shrinkage.
http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
"""
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# LinearDiscriminantAnalysis: a classic pattern-recognition algorithm and feature-extraction
# method that minimizes within-class scatter while maximizing between-class scatter
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
def generate_data(n_samples, n_features):
    # Generate two classes of data; only the first feature is
    # discriminative (blobs centred at -2 and +2)
    X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])
    # add non-discriminative noise features
    if n_features > 1:
        X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
    return X, y
X, y = generate_data(10, 5)
import pandas as pd
pd.set_option('display.precision', 2)
df = pd.DataFrame(np.hstack([y.reshape(10, 1), X]))
df.columns = ['y', 'X0', 'X1', 'X2', 'X3', 'X4']
print(df)
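As a quick check that only X0 carries class information, one can compare the per-class column means on a larger sample. This is a self-contained sketch re-creating the generator above; the sample size of 500 is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_blobs

def generate_data(n_samples, n_features):
    # first feature comes from two blobs centred at -2 and +2
    X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])
    if n_features > 1:
        X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
    return X, y

X, y = generate_data(500, 5)
# absolute difference of class means, column by column
gap = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
print(gap)  # first entry is near 4, the noise columns are near 0
```

The first column separates the classes by about 4 standard deviations; the remaining columns are pure noise, which is what makes high-dimensional covariance estimation hard later on.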
# Vary the number of features and test the effect of shrinkage
n_train = 20 # samples for training
n_test = 200 # samples for testing
n_averages = 50 # how often to repeat classification
n_features_max = 75 # maximum number of features
step = 4 # step size for the calculation
acc_clf1, acc_clf2 = [], []
n_features_range = range(1, n_features_max + 1, step)
for n_features in n_features_range:
    score_clf1, score_clf2 = 0, 0
    for _ in range(n_averages):
        X, y = generate_data(n_train, n_features)
        # Linear discriminant classifiers; the 'lsqr' solver uses
        # least squares (QR decomposition).
        # Compare shrinkage='auto' against no shrinkage.
        clf1 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
        clf2 = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=None).fit(X, y)
        X, y = generate_data(n_test, n_features)
        score_clf1 += clf1.score(X, y)
        score_clf2 += clf2.score(X, y)
    acc_clf1.append(score_clf1 / n_averages)
    acc_clf2.append(score_clf2 / n_averages)
# Plot the LDA results
# Use the feature/sample ratio on the x-axis for a more meaningful comparison
features_samples_ratio = np.array(n_features_range) / n_train
# figsize sets the width/height of the figure
fig = plt.figure(figsize=(4, 3), dpi=150)
plt.plot(features_samples_ratio, acc_clf1, linewidth=2,
         label="Linear Discriminant Analysis with shrinkage", color='r')
plt.plot(features_samples_ratio, acc_clf2, linewidth=2,
         label="Linear Discriminant Analysis", color='g')
plt.xlabel('n_features / n_samples')
plt.ylabel('Classification accuracy')
plt.legend(loc=1, prop={'size': 5})
plt.show()
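The effect can also be checked at a single point without plotting: with far more features (75) than training samples (20), the shrinkage estimator should keep its accuracy while the plain covariance estimate degrades. A minimal self-contained sketch; the round count of 20 is an arbitrary choice for averaging:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def generate_data(n_samples, n_features):
    X, y = make_blobs(n_samples=n_samples, n_features=1, centers=[[-2], [2]])
    if n_features > 1:
        X = np.hstack([X, np.random.randn(n_samples, n_features - 1)])
    return X, y

scores = {'auto': 0.0, None: 0.0}
n_rounds = 20
for _ in range(n_rounds):
    X_tr, y_tr = generate_data(20, 75)    # 75 features, only 20 training samples
    X_te, y_te = generate_data(200, 75)
    for sh in scores:
        clf = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=sh).fit(X_tr, y_tr)
        scores[sh] += clf.score(X_te, y_te) / n_rounds

print(scores)  # shrinkage='auto' scores clearly higher
```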
**The figure shows the comparison: using shrinkage greatly improves prediction accuracy once the number of features approaches the number of training samples**
Matrix form of least squares: Ax = b, where A is an n×k matrix, x is a k×1 column vector, and b is an n×1 column vector. If n > k (more equations than unknowns), the system is called overdetermined; if n < k (fewer equations than unknowns), it is underdetermined.
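The overdetermined case (n > k) can be sketched with NumPy's np.linalg.lstsq; the matrix and right-hand side below are made-up illustrative values:

```python
import numpy as np

# overdetermined: 5 equations, 2 unknowns (n=5, k=2)
A = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
x_true = np.array([2., 0.5])
b = A @ x_true  # consistent system, so the residual is zero

# least-squares solution minimizes ||Ax - b||
x_hat, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x_hat)  # recovers [2. , 0.5]
```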
QR decomposition is one of the three main ways to factor a matrix: it expresses a matrix as the product of an orthogonal matrix and an upper triangular matrix.
QR decomposition is frequently used to solve linear least-squares problems, and it is also the basis of a particular eigenvalue algorithm, the QR algorithm.
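Solving least squares through QR can be sketched as follows: factor A = QR, then solve the triangular system Rx = Qᵀb. The data here are made-up illustrative values:

```python
import numpy as np

A = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
b = np.array([1.9, 2.6, 3.0, 3.4, 4.1])

# A = QR: Q has orthonormal columns, R is upper triangular
Q, R = np.linalg.qr(A)
# least-squares solution from the triangular system R x = Q^T b
x_qr = np.linalg.solve(R, Q.T @ b)

# agrees with the direct least-squares solver
print(x_qr)
```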