PCA code (wine dataset)
(Note: np.linalg.eig gives NO guarantee that the eigenvalues come back sorted from largest to smallest, so you must sort them yourself; and the eigenvector matching each eigenvalue is a COLUMN of the returned matrix, not a row!!!!!)
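A minimal sketch of that behavior, using a made-up 2x2 diagonal matrix (not the wine data): the eigenvalues can come back in any order, so sort them descending yourself, and the eigenvector for vals[i] is the column vecs[:, i]:

```python
import numpy as np

# A small symmetric matrix whose eigenvalues (2 and 5) are easy to verify.
A = np.array([[2.0, 0.0],
              [0.0, 5.0]])

vals, vecs = np.linalg.eig(A)
# eig gives no ordering guarantee: here vals comes back as [2., 5.],
# i.e. NOT largest-first, so we sort descending ourselves.
order = np.argsort(vals)[::-1]
vals_sorted = vals[order]
vecs_sorted = vecs[:, order]        # eigenvectors are COLUMNS, so reorder columns

print(vals_sorted)                  # [5. 2.]
# Check the defining relation A @ v = lambda * v, column by column.
for lam, v in zip(vals_sorted, vecs_sorted.T):
    assert np.allclose(A @ v, lam * v)
```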
PCA without data standardization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
'''***************************************************************
* @Fun_Name : def getSample(fileName)
* @Function : read the samples from the file into arrays
* @Parameter : file name
* @Return : labels and sample features
* @Creed : Talk is cheap , show me the code
***********************xieqinyu creates in 16:08 2020/5/17***'''
def getSample(fileName):
    # header=None: the file has no header row; otherwise pandas would treat
    # the first data row as the header and that row would be lost
    dataSet = pd.read_csv(fileName, header=None).values
    labels = dataSet[:, 0]        # column 0 is the class label
    feature = dataSet[:, 1:14]    # columns 1..13 are the 13 wine features
    return labels, feature
'''***************************************************************
* @Fun_Name : def reduceMean(feature):
* @Function : center the features (subtract the mean)
* @Parameter : feature matrix
* @Return : centered feature matrix
* @Creed : Talk is cheap , show me the code
***********************xieqinyu creates in 19:54 2020/5/17***'''
def reduceMean(feature):
    featureMean = np.mean(feature, axis=0)   # per-feature mean
    featureDeal = feature - featureMean      # features after removing the mean
    return featureDeal
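A quick check of what reduceMean does, on a tiny made-up 3x2 matrix: subtracting the per-column mean broadcasts across the rows and leaves every column with zero mean:

```python
import numpy as np

# Tiny 3-sample, 2-feature matrix to sanity-check reduceMean-style centering.
feature = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

featureMean = np.mean(feature, axis=0)   # per-column mean, shape (2,)
featureDeal = feature - featureMean      # broadcasts over the rows

# After centering, every column averages to zero.
print(featureDeal.mean(axis=0))          # [0. 0.]
```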
'''***************************************************************
* @Fun_Name : def getC(featureDeal):
* @Function : compute the matrix C
* @Parameter : centered feature matrix
* @Return : C
* @Creed : Talk is cheap , show me the code
***********************xieqinyu creates in 20:03 2020/5/17***'''
def getC(featureDeal):
    m, n = np.shape(featureDeal)
    featureDeal = np.mat(featureDeal)
    C = (featureDeal.T * featureDeal) / m    # (1/m) * X^T X, the covariance matrix of centered X
    return C
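getC computes (1/m)·XᵀX, which for centered data is the population covariance matrix. A sketch checking that against np.cov, whose bias=True flag selects the same 1/m normalization (the random data here is just for illustration, not the wine file):

```python
import numpy as np

rng = np.random.default_rng(0)
featureDeal = rng.standard_normal((50, 4))
featureDeal -= featureDeal.mean(axis=0)   # centered, as getC expects

m = featureDeal.shape[0]
C = (featureDeal.T @ featureDeal) / m     # same formula as getC, without np.mat

# np.cov divides by m-1 by default; bias=True switches to 1/m,
# and rowvar=False says each COLUMN is a variable.
C_np = np.cov(featureDeal, rowvar=False, bias=True)
assert np.allclose(C, C_np)
```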
'''***************************************************************
* @Fun_Name : def getFeatureValuesVector(C,n):
* @Function : get the eigenvalues and eigenvectors of C
* @Parameter : C, and n = the target dimension
* @Return : the eigenvectors for the n largest eigenvalues (here n=2)
* @Creed : Talk is cheap , show me the code
***********************xieqinyu creates in 20:14 2020/5/17***'''
def getFeatureValuesVector(C, n):
    featureValues, featureVector = np.linalg.eig(C)
    order = np.argsort(featureValues)[::-1]  # eig gives no ordering guarantee, so sort descending
    return featureVector[:, order[:n]]       # eigenvectors are the columns
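Since C is symmetric, np.linalg.eigh is an alternative worth knowing: unlike eig it guarantees real output and ASCENDING eigenvalue order, so the top-n directions are the last n columns in reverse. A sketch on random data (not the wine file):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
X -= X.mean(axis=0)
C = (X.T @ X) / X.shape[0]        # symmetric covariance matrix

# eigh documents an ascending eigenvalue order for symmetric input,
# so no manual argsort is needed: just take the last columns, reversed.
w, V = np.linalg.eigh(C)
top2 = V[:, ::-1][:, :2]          # eigenvectors of the 2 largest eigenvalues

assert np.all(np.diff(w) >= 0)    # eigh's ascending-order guarantee
```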
label, feature = getSample('wine.txt')
featureDeal = reduceMean(feature)
C = getC(featureDeal)
featureVector = getFeatureValuesVector(C, 2)
Coord = np.mat(featureDeal) * np.mat(featureVector)   # project onto the top-2 directions
# the three wine classes occupy rows 0-58, 59-129 and 130-177
plt.scatter(Coord[0:59, 0].tolist(), Coord[0:59, 1].tolist(), color="b")
plt.scatter(Coord[59:130, 0].tolist(), Coord[59:130, 1].tolist(), color="r")
plt.scatter(Coord[130:178, 0].tolist(), Coord[130:178, 1].tolist(), color="g")
plt.show()
# print(Coord)
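One way to sanity-check the projection step, sketched on random data rather than the wine file: the variance of the projected coordinates along each new axis should equal the corresponding eigenvalue of C:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 6))
X -= X.mean(axis=0)
C = (X.T @ X) / X.shape[0]

w, V = np.linalg.eigh(C)
order = np.argsort(w)[::-1]
W = V[:, order[:2]]               # top-2 eigenvectors as columns

Coord = X @ W                     # the same projection step, in plain ndarray form

# Along each principal axis, the (population) variance of the projected
# coordinates equals that axis's eigenvalue: W.T @ C @ W is diagonal.
var = (Coord ** 2).mean(axis=0)   # columns of Coord are zero-mean
assert np.allclose(var, w[order[:2]])
```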
Result: (scatter plot of the three wine classes)
PCA with standardized data:
Why and how to standardize data:
https://www.cnblogs.com/fonttian/p/9162822.html
Replace reduceMean in the program above with this:
def reduceMean(feature):
    # standardize: center, then divide by the per-feature standard deviation
    featureMean = np.mean(feature, axis=0)
    featureStd = np.std(feature, axis=0)
    featureDeal = (feature - featureMean) / featureStd
    return featureDeal
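A property that helps interpret the standardized version: the (1/m) covariance of z-scored data is exactly the correlation matrix of the original features, so this variant is PCA on the correlation matrix. A sketch on random data with wildly different feature scales (just for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Three features on deliberately different scales.
feature = rng.standard_normal((80, 3)) * np.array([1.0, 5.0, 100.0])

z = (feature - feature.mean(axis=0)) / feature.std(axis=0)

# The (1/m) covariance of the z-scored data equals the correlation
# matrix of the original features (the ddof choices cancel in corrcoef).
C = (z.T @ z) / z.shape[0]
R = np.corrcoef(feature, rowvar=False)
assert np.allclose(C, R)
```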
Result: (scatter plot of the three wine classes, after standardization)
There is one theoretical point I am still fuzzy on; I hope someone passing by can explain it:
Coord = np.mat(featureDeal)*np.mat(featureVector)
This step projects the vectors onto the 2-D space. Why is the vector being projected not the original vector, but the standardized one?