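For the theory behind principal component analysis, see: http://blog.sina.com.cn/s/blog_14154cb430102xjcc.html
Principal component analysis (PCA) is a dimensionality-reduction technique that replaces many variables with a few principal components that retain most of the information in the original variables.
The workflow is:
1. Data preprocessing: keep numeric variables and remove missing values.
2. Compute the principal components.
3. Decide how many principal components to keep.
4. Select and interpret the principal components.
5. Compute the principal component scores.
6. Visualize the results.
Detailed workflow
1. Data preprocessing
# Load the package and the data
> library(ggplot2) # plotting with ggplot
> data("mtcars") # use R's built-in mtcars dataset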
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
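mtcars is already fully numeric and has no missing values, so the preprocessing step is a no-op here. For other datasets, the "numeric only, no missing values" rule could be sketched as below. The names is_num, car.num and car.clean are made up for this illustration and are not used elsewhere in the post.

```r
# Minimal preprocessing sketch: keep numeric columns, then drop incomplete rows.
data("mtcars")

# TRUE for every column that is numeric (is_num is a hypothetical helper name)
is_num <- vapply(mtcars, is.numeric, logical(1))

# Keep only the numeric columns; drop = FALSE keeps a data frame even for one column
car.num <- mtcars[, is_num, drop = FALSE]

# Remove rows that contain at least one NA
car.clean <- na.omit(car.num)

dim(car.clean)
```

For mtcars the cleaned data frame has the same shape as the original, so the rest of the walkthrough can keep using mtcars directly.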
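2. Computing the principal components
R ships with two PCA functions, princomp and prcomp. They differ slightly in how they compute the components and in the structure of the object they return, so both are shown below. Note that the two calls here are not equivalent: princomp is run with cor=TRUE (correlation matrix, i.e. standardized variables), whereas prcomp is run with its default scale.=FALSE (covariance matrix of the raw variables), so most of the difference in the results below comes from scaling rather than from the functions themselves.
# Principal components - princomp
car.pr1 <- princomp(mtcars,cor=TRUE)
# Principal components - prcomp
car.pr2 <- prcomp(mtcars)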
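To see how much of the gap between the two results is due to scaling, one can also run prcomp on standardized variables. This is only a side check, not part of the original workflow; car.pr2s is a name introduced here for illustration.

```r
# Side check: prcomp with scale. = TRUE works on the correlation structure,
# like princomp(cor = TRUE).
data("mtcars")

car.pr1  <- princomp(mtcars, cor = TRUE)
car.pr2s <- prcomp(mtcars, center = TRUE, scale. = TRUE)  # hypothetical extra object

# Component standard deviations should agree up to numerical error
all.equal(unname(car.pr1$sdev), car.pr2s$sdev)

# Proportion of variance explained by each component of the scaled prcomp fit
prop.pr2s <- car.pr2s$sdev^2 / sum(car.pr2s$sdev^2)
round(prop.pr2s[1:2], 4)
```

The signs of individual loadings and scores may still differ between the two functions, since the direction of a principal component is arbitrary.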
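3. Deciding how many principal components to keep
# Scree plot - princomp
screeplot(car.pr1,type="lines")
# Scree plot - prcomp
screeplot(car.pr2,type="lines")
## Use summary() to inspect how much variance each component explains
# Standard deviation: standard deviation of each component
# Proportion of Variance: share of the total variance explained by each component
# Cumulative Proportion: running total of the proportions
# Proportion of variance - princomp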
> summary(car.pr1)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Standard deviation 2.5706809 1.6280258 0.79195787 0.51922773 0.47270615 0.45999578
Proportion of Variance 0.6007637 0.2409516 0.05701793 0.02450886 0.02031374 0.01923601
Cumulative Proportion 0.6007637 0.8417153 0.89873322 0.92324208 0.94355581 0.96279183
Comp.7 Comp.8 Comp.9 Comp.10 Comp.11
Standard deviation 0.36777981 0.35057301 0.277572792 0.228112781 0.148473587
Proportion of Variance 0.01229654 0.01117286 0.007004241 0.004730495 0.002004037
Cumulative Proportion 0.97508837 0.98626123 0.993265468 0.997995963 1.000000000
# Proportion of variance - prcomp
> summary(car.pr2)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Standard deviation 136.533 38.14808 3.07102 1.30665 0.90649 0.66354 0.3086 0.286 0.2507
Proportion of Variance 0.927 0.07237 0.00047 0.00008 0.00004 0.00002 0.0000 0.000 0.0000
Cumulative Proportion 0.927 0.99937 0.99984 0.99992 0.99996 0.99998 1.0000 1.000 1.0000
PC10 PC11
Standard deviation 0.2107 0.1984
Proportion of Variance 0.0000 0.0000
Cumulative Proportion 1.0000 1.0000
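We keep the first two principal components. For princomp they account for about 84.2% of the total variance; for the unscaled prcomp, about 99.9%.
# Extract the proportions of variance - princomp
> car.pv1 <- eigen(cor(mtcars))$values
> car.pv1 <- car.pv1/sum(car.pv1)
> car.pv1[1:2] # show the first two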
[1] 0.6007637 0.2409516
# Extract the proportions of variance - prcomp: they can be taken directly from summary()
car.pv2 <- summary(car.pr2)$importance
> car.pv2[2,1:2] # show the first two
PC1 PC2
0.92700 0.07237
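The choice of two components can also be made by an explicit rule instead of by eye. The sketch below applies two common rules of thumb to the princomp result: a cumulative-variance threshold (the 80% cut-off is an assumption chosen for this example) and the "eigenvalue greater than 1" rule for correlation-based PCA. The names eig, cum.prop, k.cum and k.kaiser are illustrative.

```r
# Sketch: pick the number of components with two rules of thumb.
data("mtcars")
car.pr1 <- princomp(mtcars, cor = TRUE)

eig      <- car.pr1$sdev^2            # eigenvalues of the correlation matrix
cum.prop <- cumsum(eig) / sum(eig)    # cumulative proportion of variance

# Rule 1: smallest number of components reaching 80% cumulative variance
# (0.80 is an assumed threshold, not something fixed by the method)
k.cum <- unname(which(cum.prop >= 0.80)[1])

# Rule 2: keep components whose eigenvalue exceeds 1
k.kaiser <- sum(eig > 1)

c(cumulative = k.cum, kaiser = k.kaiser)
```

On this dataset both rules point to two components, which agrees with the choice made above.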
4. Selecting and interpreting the principal components (loading matrix)
# Loading matrix - princomp
car.pr1$loadings[,1:2]
Comp.1 Comp.2
mpg 0.3625305 0.01612440
cyl -0.3739160 0.04374371
disp -0.3681852 -0.04932413
hp -0.3300569 0.24878402
drat 0.2941514 0.27469408
wt -0.3461033 -0.14303825
qsec 0.2004563 -0.46337482
vs 0.3065113 -0.23164699
am 0.2349429 0.42941765
gear 0.2069162 0.46234863
carb -0.2140177 0.41357106
# Loading matrix - prcomp
> car.pr2$rotation[,1:2]
PC1 PC2
mpg -0.038118199 0.009184847
cyl 0.012035150 -0.003372487
disp 0.899568146 0.435372320
hp 0.434784387 -0.899307303
drat -0.002660077 -0.003900205
wt 0.006239405 0.004861023
qsec -0.006671270 0.025011743
vs -0.002729474 0.002198425
am -0.001962644 -0.005793760
gear -0.002604768 -0.011272462
carb 0.005766010 -0.027779208
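A quick way to read a loading matrix is to rank the variables by the absolute size of their loading on a component. The sketch below does this for PC1 of the unscaled prcomp fit and prints the raw variances for comparison; load1 and top.vars are names introduced here.

```r
# Sketch: which variables drive PC1 of the unscaled prcomp result?
data("mtcars")
car.pr2 <- prcomp(mtcars)

load1 <- car.pr2$rotation[, "PC1"]   # loadings of all variables on PC1

# Rank by absolute loading; the sign of a component is arbitrary
top.vars <- names(sort(abs(load1), decreasing = TRUE))[1:2]
top.vars

# Raw variances: the variables with the largest variance dominate an unscaled PCA
round(sapply(mtcars, var), 2)
```

disp and hp are measured on much larger scales than the other variables, which is why they dominate PC1 here while the correlation-based princomp loadings are spread more evenly.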
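5. Computing the principal component scores
# Principal component scores - princomp: take them directly from the scores element of the result, or use predict()
> car.pca1 <- car.pr1$scores[,1:2] # take the first two columns of scores from the PCA result
> car.pca1 <- predict(car.pr1)[,1:2] # same scores via predict(), first two columns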
> car.pca1
Comp.1 Comp.2
Mazda RX4 0.6572132031 1.7354457
Mazda RX4 Wag 0.6293955058 1.5500334
Datsun 710 2.7793970426 -0.1464566
Hornet 4 Drive 0.3117707086 -2.3630190
Hornet Sportabout -1.9744889419 -0.7544022
Valiant 0.0561375337 -2.7859996
Duster 360 -3.0026742880 0.3348874
Merc 240D 2.0553287289 -1.4651808
Merc 230 2.2874083842 -1.9835265
Merc 280 0.5263812077 -0.1620126
Merc 280C 0.5092054932 -0.3238945
Merc 450SE -2.2478104359 -0.6834740
Merc 450SL -2.0478227622 -0.6832207
Merc 450SLC -2.1485421615 -0.8017395
Cadillac Fleetwood -3.8997903717 -0.8279481
Lincoln Continental -3.9541231097 -0.7333815
Chrysler Imperial -3.5929719882 -0.4211349
Fiat 128 3.8562837567 -0.2967519
Honda Civic 4.2540325032 0.6884140
Toyota Corolla 4.2342207436 -0.2792875
Toyota Corona 1.9041678566 -2.1198383
Dodge Challenger -2.1848507430 -1.0142171
AMC Javelin -1.8633834347 -0.9064645
Camaro Z28 -2.8889945733 0.6808260
Pontiac Firebird -2.2459189274 -0.8738121
Fiat X1-9 3.5739682964 -0.1212038
Porsche 914-2 2.6512550541 2.0463709
Lotus Europa 3.3857059882 1.3785993
Ford Pantera L -1.3729574238 3.4999996
Ferrari Dino 0.0009899207 3.2190722
Maserati Bora -2.6691258658 4.3796772
Volvo 142E 2.4205931001 0.2336399
# Principal component scores - prcomp: the scores are stored in car.pr2$x; predict() without new data returns the same matrix
> car.pca2 <- predict(car.pr2)[,1:2]
> car.pca2
PC1 PC2
Mazda RX4 -79.596425 2.132241
Mazda RX4 Wag -79.598570 2.147487
Datsun 710 -133.894096 -5.057570
Hornet 4 Drive 8.516559 44.985630
Hornet Sportabout 128.686342 30.817402
Valiant -23.220146 35.106518
Duster 360 159.309025 -32.259197
Merc 240D -112.615805 39.702195
Merc 230 -103.534591 7.513104
Merc 280 -67.046877 -6.208536
Merc 280C -66.997514 -6.206387
Merc 450SE 55.211672 -10.373509
Merc 450SL 55.173910 -10.361893
Merc 450SLC 55.251602 -10.370934
Cadillac Fleetwood 242.814893 52.501758
Lincoln Continental 236.369886 38.280788
Chrysler Imperial 224.737944 16.111941
Fiat 128 -172.363654 6.575522
Honda Civic -181.066911 17.783639
Toyota Corolla -179.697852 4.188212
Toyota Corona -121.224099 -3.345362
Dodge Challenger 80.159386 34.983214
AMC Javelin 67.572431 28.894067
Camaro Z28 150.354631 -36.633575
Pontiac Firebird 164.652522 48.239880
Fiat X1-9 -171.897231 6.643746
Porsche 914-2 -123.804988 2.033356
Lotus Europa -137.082789 -28.675647
Ford Pantera L 159.413222 -53.318347
Ferrari Dino -64.762396 -62.954280
Maserati Bora 145.361703 -139.049149
Volvo 142E -115.181783 -13.826313
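The scores are simply the centered data projected onto the loading vectors. As a sanity check, the sketch below rebuilds the prcomp scores by hand and compares them with the stored values; centered and manual.scores are illustrative names.

```r
# Sketch: reproduce the prcomp scores from the centered data and the rotation matrix.
data("mtcars")
car.pr2 <- prcomp(mtcars)

# Center each column with the means stored in the fit (no scaling, as in the fit)
centered <- scale(mtcars, center = car.pr2$center, scale = FALSE)

# Project the centered data onto the loading vectors
manual.scores <- centered %*% car.pr2$rotation

# Both comparisons should return TRUE
all.equal(manual.scores, car.pr2$x, check.attributes = FALSE)
all.equal(predict(car.pr2), car.pr2$x, check.attributes = FALSE)
```

The same projection applies to princomp with cor=TRUE, except that the variables are also divided by their standard deviations before projecting.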
6. Visualizing the results
# Combine the scores with a grouping variable
type <- sample(1:5,nrow(mtcars),replace = TRUE) # no grouping variable is used from mtcars, so the cars are randomly assigned to 5 groups
car.pdata1 <- data.frame(name=rownames(car.pca1),car.pca1)
car.pdata1$type <- factor(type)
car.pdata2 <- data.frame(name=rownames(car.pca2),car.pca2)
car.pdata2$type <- factor(type)
Plot the first two components with a confidence ellipse per group - princomp
pca_plot1 <- ggplot(car.pdata1, aes(Comp.1, Comp.2 ,color = type,shape=type)) +
  geom_point(size=2)+
  # confidence ellipses
  stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
  geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # vertical line at x=0
  geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # horizontal line at y=0
  theme(legend.title=element_blank())+ # no legend title
  labs(x= paste0("Comp.1(", round(car.pv1[1]*100,2), "%)"),
       y= paste0("Comp.2(", round(car.pv1[2]*100,2), "%)"),title = "Individuals-PCA1")
pca_plot1
Plot the first two components with a confidence ellipse per group - prcomp
pca_plot2 <- ggplot(car.pdata2, aes(PC1, PC2 ,color = type,shape=type)) +
  geom_point(size=2)+
  # confidence ellipses
  stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
  geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # vertical line at x=0
  geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # horizontal line at y=0
  theme(legend.title=element_blank())+ # no legend title
  labs(x= paste0("PC1(", round(car.pv2[2,1]*100,2), "%)"),
       y= paste0("PC2(", round(car.pv2[2,2]*100,2), "%)"),title = "Individuals-PCA2")
pca_plot2
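Because the groups above are random, the ellipses carry no meaning and change on every run. As a variation, one could group by a column that already exists in mtcars. The sketch below uses the number of cylinders (cyl); this choice, and the names car.pdata3 and pca_plot3, are assumptions made for illustration. The plotting part only runs if ggplot2 is installed.

```r
# Sketch: group the princomp scores by number of cylinders instead of random groups.
data("mtcars")
car.pr1 <- princomp(mtcars, cor = TRUE)

# First two score columns plus a grouping factor taken from the data itself
car.pdata3 <- data.frame(name = rownames(mtcars),
                         car.pr1$scores[, 1:2],
                         type = factor(mtcars$cyl))

table(car.pdata3$type)   # number of cars per cylinder group

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  pca_plot3 <- ggplot(car.pdata3, aes(Comp.1, Comp.2, color = type, shape = type)) +
    geom_point(size = 2) +
    stat_ellipse(aes(group = type)) +   # one ellipse outline per cylinder group
    labs(title = "Individuals-PCA1, grouped by cyl")
  # print(pca_plot3) draws the figure
}
```

With a real grouping variable the ellipses become interpretable, for example whether cars with different cylinder counts separate along Comp.1.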