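For the theory behind principal component analysis, see: http://blog.sina.com.cn/s/blog_14154cb430102xjcc.html
Principal component analysis (PCA) is a dimensionality-reduction technique that converts many variables into a few principal components that retain most of the information in the original variables.
The workflow has six steps:
1. Data preprocessing: keep numeric variables only and remove missing values.
2. Principal component calculation.
3. Deciding how many principal components to keep.
4. Selecting and interpreting the principal components.
5. Computing the principal component scores.
6. Visualizing the results.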
Detailed workflow
1. Data preprocessing
# Load the package and the data
> library(ggplot2) # plotting with ggplot2
> data("mtcars") # use R's built-in mtcars data set
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
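Step 1 asks for numeric variables without missing values, but the listing above only prints the data. A minimal sketch of that check could look like the following; the names car.num and car.clean are hypothetical and are not used elsewhere in this post. For mtcars both steps are no-ops, since all 11 columns are numeric and there are no missing values.

```r
data("mtcars")

# Flag numeric columns (for mtcars, all 11 columns are numeric)
is_num <- vapply(mtcars, is.numeric, logical(1))
car.num <- mtcars[, is_num, drop = FALSE]   # hypothetical name

# Drop rows that contain any missing value (mtcars has none)
car.clean <- na.omit(car.num)               # hypothetical name

dim(car.clean)     # rows and columns left after cleaning
anyNA(car.clean)   # FALSE when no missing values remain
```

On other data sets this step may remove columns or rows, so it is worth inspecting dim() before and after.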
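2. Principal component calculation
R ships with two PCA functions, princomp and prcomp. princomp works from an eigendecomposition of the covariance (or correlation) matrix and uses the divisor n, while prcomp uses a singular value decomposition of the centered data and uses the divisor n - 1; the returned objects also name their elements differently. Both are shown here. Note that the two calls also differ in scaling: princomp is run on the correlation matrix (cor=TRUE), whereas prcomp is run with its default scale.=FALSE, that is, on the unscaled covariance matrix. This accounts for most of the difference between the two sets of results.
# PCA with princomp (correlation matrix)
car.pr1 <- princomp(mtcars,cor=TRUE)
# PCA with prcomp (default: centered, not scaled)
car.pr2 <- prcomp(mtcars)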
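To see how much of the difference between the two results comes from scaling rather than from the functions themselves, one can rerun prcomp with scale. = TRUE. The sketch below is an illustration only; car.pr2s is a hypothetical name and is not used in the rest of the post. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```r
data("mtcars")

car.pr1  <- princomp(mtcars, cor = TRUE)    # correlation-based, as in the text
car.pr2s <- prcomp(mtcars, scale. = TRUE)   # hypothetical name: scaled prcomp

# Both sets of standard deviations are the square roots of the
# eigenvalues of cor(mtcars), so they should agree
all.equal(as.numeric(car.pr1$sdev), as.numeric(car.pr2s$sdev))

# Coefficients of the first two components agree up to sign
all.equal(abs(unclass(car.pr1$loadings)[, 1:2]),
          abs(car.pr2s$rotation[, 1:2]),
          check.attributes = FALSE)
```

With scaling switched on, the two functions give the same variance decomposition for this data set; the unscaled prcomp call in the text answers a different question, because variables with large variances dominate.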
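3. Deciding how many principal components to keep
# Scree plot - princomp
screeplot(car.pr1,type="lines")
# Scree plot - prcomp
screeplot(car.pr2,type="lines")
## Use summary() to view the variance explained by each component
# Standard deviation: standard deviation of each component
# Proportion of Variance: share of the total variance explained by that component
# Cumulative Proportion: cumulative share of the variance explained
# Variance explained - princomp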
> summary(car.pr1)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Standard deviation 2.5706809 1.6280258 0.79195787 0.51922773 0.47270615 0.45999578
Proportion of Variance 0.6007637 0.2409516 0.05701793 0.02450886 0.02031374 0.01923601
Cumulative Proportion 0.6007637 0.8417153 0.89873322 0.92324208 0.94355581 0.96279183
Comp.7 Comp.8 Comp.9 Comp.10 Comp.11
Standard deviation 0.36777981 0.35057301 0.277572792 0.228112781 0.148473587
Proportion of Variance 0.01229654 0.01117286 0.007004241 0.004730495 0.002004037
Cumulative Proportion 0.97508837 0.98626123 0.993265468 0.997995963 1.000000000
# Variance explained - prcomp
> summary(car.pr2)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Standard deviation 136.533 38.14808 3.07102 1.30665 0.90649 0.66354 0.3086 0.286 0.2507
Proportion of Variance 0.927 0.07237 0.00047 0.00008 0.00004 0.00002 0.0000 0.000 0.0000
Cumulative Proportion 0.927 0.99937 0.99984 0.99992 0.99996 0.99998 1.0000 1.000 1.0000
PC10 PC11
Standard deviation 0.2107 0.1984
Proportion of Variance 0.0000 0.0000
Cumulative Proportion 1.0000 1.0000
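Keep the first two principal components: together they explain 84.2% of the variance with princomp (correlation matrix) and 99.9% with prcomp (unscaled).
# Extract the proportion of variance - princomp
> car.pv1 <- eigen(cor(mtcars))$values
> car.pv1 <- car.pv1/sum(car.pv1)
> car.pv1[1:2] # show the first two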
[1] 0.6007637 0.2409516
# Extract the proportion of variance - prcomp; it can be read directly from summary()
car.pv2 <- summary(car.pr2)$importance
> car.pv2[2,1:2] # show the first two (row 2 is "Proportion of Variance")
PC1 PC2
0.92700 0.07237
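The choice of two components can also be made with an explicit rule instead of reading it off the scree plot. The sketch below applies two common rules of thumb to the correlation-based result: a cumulative-variance threshold (the 80% cut-off is an assumption chosen for illustration) and the Kaiser criterion (keep components whose eigenvalue exceeds 1). The names eig, pv, k_cum and k_kaiser are hypothetical.

```r
data("mtcars")
car.pr1 <- princomp(mtcars, cor = TRUE)

eig <- car.pr1$sdev^2     # eigenvalues of the correlation matrix
pv  <- eig / sum(eig)     # proportion of variance per component

# Rule 1 (assumed 80% threshold): smallest k whose cumulative share reaches 0.80
k_cum <- unname(which(cumsum(pv) >= 0.80)[1])

# Rule 2 (Kaiser criterion): number of components with eigenvalue > 1
k_kaiser <- sum(eig > 1)

k_cum
k_kaiser
```

Both rules give 2 here, in line with the summary(car.pr1) output above (cumulative proportion 0.8417 at Comp.2, and only the first two standard deviations exceed 1). These rules are heuristics; a different threshold can lead to a different choice.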
4. Selecting and interpreting the principal components (loadings matrix)
# Loadings matrix - princomp
car.pr1$loadings[,1:2]
Comp.1 Comp.2
mpg 0.3625305 0.01612440
cyl -0.3739160 0.04374371
disp -0.3681852 -0.04932413
hp -0.3300569 0.24878402
drat 0.2941514 0.27469408
wt -0.3461033 -0.14303825
qsec 0.2004563 -0.46337482
vs 0.3065113 -0.23164699
am 0.2349429 0.42941765
gear 0.2069162 0.46234863
carb -0.2140177 0.41357106
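# Loadings matrix - prcomp (without scaling, PC1 and PC2 are dominated by disp and hp, the two variables with the largest variances)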
> car.pr2$rotation[,1:2]
PC1 PC2
mpg -0.038118199 0.009184847
cyl 0.012035150 -0.003372487
disp 0.899568146 0.435372320
hp 0.434784387 -0.899307303
drat -0.002660077 -0.003900205
wt 0.006239405 0.004861023
qsec -0.006671270 0.025011743
vs -0.002729474 0.002198425
am -0.001962644 -0.005793760
gear -0.002604768 -0.011272462
carb 0.005766010 -0.027779208
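The coefficients returned in loadings (princomp) and rotation (prcomp) are unit-length eigenvectors. For the correlation-based PCA, multiplying each column by the component's standard deviation gives the correlation between each original variable and that component, which is often easier to interpret. The following is a sketch of that relationship, with hypothetical names V, var_cor and direct.

```r
data("mtcars")
car.pr1 <- princomp(mtcars, cor = TRUE)

# Eigenvector coefficients as a plain matrix; each column has unit length
V <- unclass(car.pr1$loadings)
colSums(V^2)[1:2]

# Scale each column by the standard deviation of its component
var_cor <- sweep(V, 2, car.pr1$sdev, "*")

# Same quantity computed directly: correlation of variables with the scores
direct <- cor(mtcars, car.pr1$scores)
all.equal(var_cor, direct, check.attributes = FALSE)

round(var_cor[, 1:2], 3)   # variable-component correlations for Comp.1 and Comp.2
```

This identity holds for PCA on the correlation matrix; for the unscaled prcomp result the coefficients would additionally need to be divided by each variable's standard deviation.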
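5. Computing the principal component scores
# Scores - princomp: read them from the scores element of the result, or use predict()
> car.pca1 <- car.pr1$scores[,1:2] # take the first two columns of scores directly
> car.pca1 <- predict(car.pr1)[,1:2] # same result via predict(), first two columns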
> car.pca1
Comp.1 Comp.2
Mazda RX4 0.6572132031 1.7354457
Mazda RX4 Wag 0.6293955058 1.5500334
Datsun 710 2.7793970426 -0.1464566
Hornet 4 Drive 0.3117707086 -2.3630190
Hornet Sportabout -1.9744889419 -0.7544022
Valiant 0.0561375337 -2.7859996
Duster 360 -3.0026742880 0.3348874
Merc 240D 2.0553287289 -1.4651808
Merc 230 2.2874083842 -1.9835265
Merc 280 0.5263812077 -0.1620126
Merc 280C 0.5092054932 -0.3238945
Merc 450SE -2.2478104359 -0.6834740
Merc 450SL -2.0478227622 -0.6832207
Merc 450SLC -2.1485421615 -0.8017395
Cadillac Fleetwood -3.8997903717 -0.8279481
Lincoln Continental -3.9541231097 -0.7333815
Chrysler Imperial -3.5929719882 -0.4211349
Fiat 128 3.8562837567 -0.2967519
Honda Civic 4.2540325032 0.6884140
Toyota Corolla 4.2342207436 -0.2792875
Toyota Corona 1.9041678566 -2.1198383
Dodge Challenger -2.1848507430 -1.0142171
AMC Javelin -1.8633834347 -0.9064645
Camaro Z28 -2.8889945733 0.6808260
Pontiac Firebird -2.2459189274 -0.8738121
Fiat X1-9 3.5739682964 -0.1212038
Porsche 914-2 2.6512550541 2.0463709
Lotus Europa 3.3857059882 1.3785993
Ford Pantera L -1.3729574238 3.4999996
Ferrari Dino 0.0009899207 3.2190722
Maserati Bora -2.6691258658 4.3796772
Volvo 142E 2.4205931001 0.2336399
# Scores - prcomp: predict() returns them; they are also stored in car.pr2$x
> car.pca2 <- predict(car.pr2)[,1:2]
> car.pca2
PC1 PC2
Mazda RX4 -79.596425 2.132241
Mazda RX4 Wag -79.598570 2.147487
Datsun 710 -133.894096 -5.057570
Hornet 4 Drive 8.516559 44.985630
Hornet Sportabout 128.686342 30.817402
Valiant -23.220146 35.106518
Duster 360 159.309025 -32.259197
Merc 240D -112.615805 39.702195
Merc 230 -103.534591 7.513104
Merc 280 -67.046877 -6.208536
Merc 280C -66.997514 -6.206387
Merc 450SE 55.211672 -10.373509
Merc 450SL 55.173910 -10.361893
Merc 450SLC 55.251602 -10.370934
Cadillac Fleetwood 242.814893 52.501758
Lincoln Continental 236.369886 38.280788
Chrysler Imperial 224.737944 16.111941
Fiat 128 -172.363654 6.575522
Honda Civic -181.066911 17.783639
Toyota Corolla -179.697852 4.188212
Toyota Corona -121.224099 -3.345362
Dodge Challenger 80.159386 34.983214
AMC Javelin 67.572431 28.894067
Camaro Z28 150.354631 -36.633575
Pontiac Firebird 164.652522 48.239880
Fiat X1-9 -171.897231 6.643746
Porsche 914-2 -123.804988 2.033356
Lotus Europa -137.082789 -28.675647
Ford Pantera L 159.413222 -53.318347
Ferrari Dino -64.762396 -62.954280
Maserati Bora 145.361703 -139.049149
Volvo 142E -115.181783 -13.826313
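The scores are the centered data projected onto the component coefficients. The sketch below reproduces the prcomp scores by hand under the defaults used in this post (center = TRUE, scale. = FALSE); X_centered and scores_manual are hypothetical names.

```r
data("mtcars")
car.pr2 <- prcomp(mtcars)

# Center each column; no scaling, matching the prcomp defaults
X_centered <- scale(mtcars, center = TRUE, scale = FALSE)

# Project the centered data onto the rotation matrix
scores_manual <- X_centered %*% car.pr2$rotation

# Matches the scores stored in the result and those returned by predict()
all.equal(scores_manual, car.pr2$x, check.attributes = FALSE)
all.equal(predict(car.pr2), car.pr2$x)
```

The same projection applied to new observations, centered with car.pr2$center, is what predict(car.pr2, newdata = ...) computes.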
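6. Visualizing the results
# Assemble the plotting data
set.seed(1) # fix the seed so the random grouping is reproducible
type <- sample(1:5,nrow(mtcars),replace = TRUE) # mtcars has no ready-made grouping factor, so assign 5 random groups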
car.pdata1 <- data.frame(name=rownames(car.pca1),car.pca1)
car.pdata1$type <- factor(type)
car.pdata2 <- data.frame(name=rownames(car.pca2),car.pca2)
car.pdata2$type <- factor(type)
Plot the principal component scores with a confidence ellipse per group - princomp
pca_plot1 <- ggplot(car.pdata1, aes(Comp.1, Comp.2 ,color = type,shape=type)) +
geom_point(size=2)+
# confidence ellipse for each group (stat_ellipse default level: 95%)
stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # dashed vertical line at x = 0
geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # dashed horizontal line at y = 0
theme(legend.title=element_blank())+ # no legend title
labs(x= paste0("Comp.1(", round(car.pv1[1]*100,2), "%)"),
y= paste0("Comp.2(", round(car.pv1[2]*100,2), "%)"),title = "Individuals-PCA1")
pca_plot1
Plot the principal component scores with a confidence ellipse per group - prcomp
pca_plot2 <- ggplot(car.pdata2, aes(PC1, PC2 ,color = type,shape=type)) +
geom_point(size=2)+
# confidence ellipse for each group (stat_ellipse default level: 95%)
stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # dashed vertical line at x = 0
geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # dashed horizontal line at y = 0
theme(legend.title=element_blank())+ # no legend title
labs(x= paste0("PC1(", round(car.pv2[2,1]*100,2), "%)"),
y= paste0("PC2(", round(car.pv2[2,2]*100,2), "%)"),title = "Individuals-PCA2")
pca_plot2
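The random groups above only demonstrate the plotting code, so the ellipses carry no meaning. As a variation, the sketch below colors the princomp scores by the number of cylinders (cyl), treated as a factor. This is an assumption made for illustration: cyl is also one of the variables that entered the PCA, so the separation it shows is partly built in. The names pv, car.pdata and pca_plot_cyl are hypothetical.

```r
library(ggplot2)
data("mtcars")

car.pr1 <- princomp(mtcars, cor = TRUE)
pv <- car.pr1$sdev^2 / sum(car.pr1$sdev^2)   # proportion of variance

# Scores of the first two components plus a real grouping factor
car.pdata <- data.frame(name = rownames(mtcars),
                        car.pr1$scores[, 1:2],
                        cyl = factor(mtcars$cyl))

pca_plot_cyl <- ggplot(car.pdata, aes(Comp.1, Comp.2, color = cyl, shape = cyl)) +
  geom_point(size = 2) +
  stat_ellipse(aes(group = cyl)) +           # default 95% ellipse per group
  labs(x = paste0("Comp.1 (", round(pv[1] * 100, 2), "%)"),
       y = paste0("Comp.2 (", round(pv[2] * 100, 2), "%)"),
       title = "Individuals-PCA by cyl")
pca_plot_cyl
```

Any other factor with at least a few observations per level can be substituted for cyl in the same way.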