[Python Data Mining] Dimensionality Reduction with PCA

[Problem Background]

Suppose we have the following dataset, encoded in Python as a dictionary:

data = [[3, 5, 3, 6],
        [4, 3, 5, 8],
        [5, 1, 4, 10],
        [6, 3, 2, 13],
        [19, 23, 32, 101],
        [20, 23, 45, 106],
        [23, 6, 7, 69],
        [24, 11, 44, 73],
        [25, 2, 3, 129],
        [26, 3, 2, 133],
        [21, 1, 23, 110],
        [22, 12, 11, 115],
        [23, 2, 43, 120],
        [24, 7, 9, 124],
        [15, 5, 4, 43],
        [16, 6, 7, 46],
        [17, 1, 4, 49],
        [18, 2, 3, 53],
        [27, 4, 4, 138],
        [29, 5, 6, 143],
        [7, 2, 4, 15],
        [8, 14, 8, 17],
        [9, 22, 33, 20],
        [10, 43, 57, 22],
        [11, 1, 32, 24],
        [12, 2, 34, 27],
        [19, 4, 6, 56],
        [20, 3, 5, 59],
        [21, 3, 4, 63],
        [22, 3, 22, 66]
        ]

target = [0, 0, 0, 0, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

Data = {'data':data, 'target':target}

The dictionary Data contains data and target: data is a list of 30 rows, each a small four-element list (a four-dimensional data point), and target is the list of the 30 corresponding labels, which take one of three values {0, 1, 2}.
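As a quick sanity check of this structure (a minimal sketch, not part of the original walkthrough), you can print the sizes and the label set:

print(len(Data['data']))       # 30 rows
print(len(Data['data'][0]))    # 4 values per row
print(set(Data['target']))     # {0, 1, 2}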

How can this dataset be reduced to two dimensions?

[Problem Analysis]

Import PCA from sklearn.decomposition (decomposition here meaning breaking the data apart into components), so that we can perform PCA (Principal Component Analysis) dimensionality reduction.

Import matplotlib.pyplot so that we can plot the points.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

Build the dictionary dataset in Python; the code was already given above in [Problem Background].

Copy the data part into X and the label part into y:

X = Data['data']
y = Data['target']

Create a pca reducer with PCA(), setting the target number of dimensions n_components=2, i.e. the data will be reduced to two dimensions:

pca = PCA(n_components=2)
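As an aside (an optional variant, not used in this walkthrough), n_components can also be given as a fraction between 0 and 1, in which case sklearn keeps however many components are needed to explain at least that share of the variance:

pca95 = PCA(n_components=0.95)   # keep enough components to explain 95% of the variance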

Pass the data X to the reducer pca's fit_transform() method; the reduced data is returned as reduced_X:

reduced_X = pca.fit_transform(X)

Now print(reduced_X) and you can see that the original 30 four-dimensional rows have been transformed into 30 two-dimensional rows.
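If you want to check the result programmatically, reduced_X is a NumPy array, and the fitted pca object exposes explained_variance_ratio_; a short optional check might look like this:

print(reduced_X.shape)                 # (30, 2): 30 rows, 2 columns each
print(pca.explained_variance_ratio_)   # share of the total variance carried by each of the 2 components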

Next, plot the three classes of points.

Create x- and y-coordinate lists for the three classes:

A_x, A_y = [], []
B_x, B_y = [], []
C_x, C_y = [], []

Use the label information to fill the two-dimensional points into the x/y coordinate lists of the corresponding class:

for i in range(len(reduced_X)):
    if (y[i] == 0):
        A_x.append(reduced_X[i][0])
        A_y.append(reduced_X[i][1])
    elif (y[i] == 1):
        B_x.append(reduced_X[i][0])
        B_y.append(reduced_X[i][1])
    elif (y[i] == 2):
        C_x.append(reduced_X[i][0])
        C_y.append(reduced_X[i][1])
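Equivalently, since reduced_X is a NumPy array, the same split can be written with boolean masks. This is only an alternative sketch (it assumes import numpy as np) and is not used in the code below:

import numpy as np

y_arr = np.array(y)
A = reduced_X[y_arr == 0]   # all class-0 points, shape (n0, 2)
B = reduced_X[y_arr == 1]
C = reduced_X[y_arr == 2]
# A[:, 0] and A[:, 1] would then play the roles of A_x and A_y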

Finally, use plt.scatter() to draw the scatter plot on the plt object, and plt.show() to display it.

The first two arguments of plt.scatter() are the coordinate lists; the keyword argument c sets the colour ('y' yellow, 'b' blue, 'g' green, 'r' red, and so on), and the keyword argument marker sets the point shape ('s' square, 'x' cross, '.' small dot, 'D' diamond, and so on).

plt.scatter(A_x, A_y, c='y', marker='s')
plt.scatter(B_x, B_y, c='b', marker='x')
plt.scatter(C_x, C_y, c='g', marker='.')
plt.show()
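Optionally (this is not part of the original code), axis labels and a legend make the figure easier to read:

plt.scatter(A_x, A_y, c='y', marker='s', label='class 0')
plt.scatter(B_x, B_y, c='b', marker='x', label='class 1')
plt.scatter(C_x, C_y, c='g', marker='.', label='class 2')
plt.xlabel('1st principal component')
plt.ylabel('2nd principal component')
plt.legend()
plt.show()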

Finally, the complete code:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

data = [[3, 5, 3, 6],
        [4, 3, 5, 8],
        [5, 1, 4, 10],
        [6, 3, 2, 13],
        [19, 23, 32, 101],
        [20, 23, 45, 106],
        [23, 6, 7, 69],
        [24, 11, 44, 73],
        [25, 2, 3, 129],
        [26, 3, 2, 133],
        [21, 1, 23, 110],
        [22, 12, 11, 115],
        [23, 2, 43, 120],
        [24, 7, 9, 124],
        [15, 5, 4, 43],
        [16, 6, 7, 46],
        [17, 1, 4, 49],
        [18, 2, 3, 53],
        [27, 4, 4, 138],
        [29, 5, 6, 143],
        [7, 2, 4, 15],
        [8, 14, 8, 17],
        [9, 22, 33, 20],
        [10, 43, 57, 22],
        [11, 1, 32, 24],
        [12, 2, 34, 27],
        [19, 4, 6, 56],
        [20, 3, 5, 59],
        [21, 3, 4, 63],
        [22, 3, 22, 66]
        ]

target = [0, 0, 0, 0, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

Data = {'data':data, 'target':target}

# Copy the data part into X and the labels into y
X = Data['data']
y = Data['target']

# Reduce the 4-dimensional data to 2 dimensions with PCA
pca = PCA(n_components=2)
reduced_X = pca.fit_transform(X)


# x/y coordinate lists for the three classes
A_x, A_y = [], []
B_x, B_y = [], []
C_x, C_y = [], []

# Split the reduced points into the three lists according to their labels
for i in range(len(reduced_X)):
    if (y[i] == 0):
        A_x.append(reduced_X[i][0])
        A_y.append(reduced_X[i][1])
    elif (y[i] == 1):
        B_x.append(reduced_X[i][0])
        B_y.append(reduced_X[i][1])
    elif (y[i] == 2):
        C_x.append(reduced_X[i][0])
        C_y.append(reduced_X[i][1])

plt.scatter(A_x, A_y, c='y', marker='s')   # class 0: yellow squares
plt.scatter(B_x, B_y, c='b', marker='x')   # class 1: blue crosses
plt.scatter(C_x, C_y, c='g', marker='.')   # class 2: green dots
plt.show()

The code in this article was written and explained with reference to the dimensionality-reduction part of the course 《Python機器學習應用》 taught by 禮欣 on 中國MOOC大學. Using the iris dataset load_iris from sklearn.datasets directly would make it harder for readers to see the internal structure of the object being preprocessed, and the iris dataset is in fact just a dictionary containing data and labels, so a dataset was built by hand here to highlight the preprocessing step.
