Data Analysis: Principal Component Analysis Workflow (R)

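For the theory behind principal component analysis, see: http://blog.sina.com.cn/s/blog_14154cb430102xjcc.html
Principal component analysis (PCA) is a dimensionality-reduction technique that turns many variables into a small number of principal components that retain most of the information in the original variables.
The workflow consists of the following steps:
1. Data preprocessing: keep only numeric variables and remove missing values.
2. Compute the principal components.
3. Decide how many principal components to keep.
4. Select and interpret the principal components.
5. Compute the principal component scores.
6. Visualize the results.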

Detailed workflow
1. Data preprocessing

# Load the package and the data
> library(ggplot2)  # plotting with ggplot
> data("mtcars")   # use R's built-in mtcars dataset
> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
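
mtcars is already fully numeric and has no missing values, so the preprocessing step needs no code here. For other data, a minimal sketch of that step could look like the following; the name car.clean is only illustrative and does not appear in the rest of the workflow.

```r
data("mtcars")

# Keep only the numeric columns (PCA needs numeric input)
num_cols <- vapply(mtcars, is.numeric, logical(1))
car.clean <- mtcars[, num_cols, drop = FALSE]

# Drop every row that contains at least one missing value
car.clean <- na.omit(car.clean)

dim(car.clean)  # number of rows and columns that remain
```

For mtcars nothing is removed, so car.clean has the same dimensions as the original data.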

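2. Compute the principal components
R has two built-in PCA functions, princomp and prcomp (both in the stats package). They differ in how they compute the components and in how they name their results, so both are shown side by side. Note that the two calls below are not equivalent: princomp(cor=TRUE) analyses the correlation matrix (standardized variables), whereas prcomp(mtcars) uses its default scale. = FALSE and analyses the covariance matrix of the raw variables, so large-scale variables such as disp and hp dominate its result.

# PCA with princomp (correlation matrix)
car.pr1 <- princomp(mtcars,cor=TRUE)
# PCA with prcomp (centered but not scaled, i.e. covariance matrix)
car.pr2 <- prcomp(mtcars)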
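
To see how the two functions relate when both work on standardized data, here is a minimal sketch. The object names pr_cor and pc_scaled are illustrative only; the scale. = TRUE variant is not the call used in the rest of this post.

```r
data("mtcars")

# princomp on the correlation matrix
pr_cor <- princomp(mtcars, cor = TRUE)

# prcomp with centering and scaling to unit variance
pc_scaled <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Both decompose the correlation matrix, so the component
# standard deviations should agree up to numerical tolerance
all.equal(unname(pr_cor$sdev), pc_scaled$sdev)
```

The loadings and scores of the two fits can still differ in sign, and the scores can differ by a constant factor because princomp standardizes with divisor n while prcomp uses n - 1.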

3. Decide how many principal components to keep

# Scree plot (princomp)
screeplot(car.pr1,type="lines")

[Figure: scree plot of car.pr1 (princomp)]

# Scree plot (prcomp)
screeplot(car.pr2,type="lines")

[Figure: scree plot of car.pr2 (prcomp)]

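## Use summary() to inspect how much variance each component explains
# Standard deviation: standard deviation of each component
# Proportion of Variance: share of the total variance explained by that component
# Cumulative Proportion: running total of the explained variance
# Explained variance (princomp)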
> summary(car.pr1)
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5     Comp.6
Standard deviation     2.5706809 1.6280258 0.79195787 0.51922773 0.47270615 0.45999578
Proportion of Variance 0.6007637 0.2409516 0.05701793 0.02450886 0.02031374 0.01923601
Cumulative Proportion  0.6007637 0.8417153 0.89873322 0.92324208 0.94355581 0.96279183
                           Comp.7     Comp.8      Comp.9     Comp.10     Comp.11
Standard deviation     0.36777981 0.35057301 0.277572792 0.228112781 0.148473587
Proportion of Variance 0.01229654 0.01117286 0.007004241 0.004730495 0.002004037
Cumulative Proportion  0.97508837 0.98626123 0.993265468 0.997995963 1.000000000

# Explained variance (prcomp)
> summary(car.pr2)
Importance of components:
                           PC1      PC2     PC3     PC4     PC5     PC6    PC7   PC8    PC9
Standard deviation     136.533 38.14808 3.07102 1.30665 0.90649 0.66354 0.3086 0.286 0.2507
Proportion of Variance   0.927  0.07237 0.00047 0.00008 0.00004 0.00002 0.0000 0.000 0.0000
Cumulative Proportion    0.927  0.99937 0.99984 0.99992 0.99996 0.99998 1.0000 1.000 1.0000
                         PC10   PC11
Standard deviation     0.2107 0.1984
Proportion of Variance 0.0000 0.0000
Cumulative Proportion  1.0000 1.0000

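We keep the first two principal components. For princomp they explain about 84.2% of the total variance; for the unscaled prcomp fit they explain about 99.9%.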
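
The choice of two components can also be checked numerically. The sketch below applies two common rules of thumb to the correlation matrix: the Kaiser rule (keep eigenvalues greater than 1) and a cumulative-variance cutoff. The 80% cutoff and the variable names are assumptions made for illustration, not part of the original workflow.

```r
data("mtcars")

# Eigenvalues of the correlation matrix = variances of the components
eig  <- eigen(cor(mtcars))$values
prop <- eig / sum(eig)            # proportion of variance per component

# Kaiser rule: keep components whose eigenvalue exceeds 1
n_kaiser <- sum(eig > 1)

# Smallest number of components reaching at least 80% cumulative variance
n_cum <- which(cumsum(prop) >= 0.80)[1]

c(kaiser = n_kaiser, cumulative80 = n_cum)
```

On mtcars both rules point to two components, which matches the princomp summary shown earlier.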

# Extract the explained variance (princomp), from the eigenvalues of the correlation matrix
> car.pv1 <- eigen(cor(mtcars))$values
> car.pv1 <- car.pv1/sum(car.pv1)
> car.pv1[1:2] # show the first two
[1] 0.6007637 0.2409516

# Extract the explained variance (prcomp): it can be read directly from summary()
> car.pv2 <- summary(car.pr2)$importance
> car.pv2[2,1:2] # row 2 is "Proportion of Variance"; show the first two
    PC1     PC2 
0.92700 0.07237 
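
As a cross-check, the same proportions can be derived from the sdev element that both fitted objects carry. This is a minimal sketch; pv1 and pv2 are illustrative names, and the models are refit so that the block runs on its own.

```r
data("mtcars")

car.pr1 <- princomp(mtcars, cor = TRUE)
car.pr2 <- prcomp(mtcars)

# Variance of each component is the squared standard deviation;
# dividing by the total gives the proportion of variance explained
pv1 <- car.pr1$sdev^2 / sum(car.pr1$sdev^2)
pv2 <- car.pr2$sdev^2 / sum(car.pr2$sdev^2)

round(pv1[1:2], 4)  # princomp, first two components
round(pv2[1:2], 4)  # prcomp, first two components
```

This works for both functions, so it avoids having one extraction method per function.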

4. Select and interpret the principal components (loading matrix)

# Loading matrix (princomp)
> car.pr1$loadings[,1:2]
         Comp.1      Comp.2
mpg   0.3625305  0.01612440
cyl  -0.3739160  0.04374371
disp -0.3681852 -0.04932413
hp   -0.3300569  0.24878402
drat  0.2941514  0.27469408
wt   -0.3461033 -0.14303825
qsec  0.2004563 -0.46337482
vs    0.3065113 -0.23164699
am    0.2349429  0.42941765
gear  0.2069162  0.46234863
carb -0.2140177  0.41357106


# Loading matrix (prcomp)
> car.pr2$rotation[,1:2]
              PC1          PC2
mpg  -0.038118199  0.009184847
cyl   0.012035150 -0.003372487
disp  0.899568146  0.435372320
hp    0.434784387 -0.899307303
drat -0.002660077 -0.003900205
wt    0.006239405  0.004861023
qsec -0.006671270  0.025011743
vs   -0.002729474  0.002198425
am   -0.001962644 -0.005793760
gear -0.002604768 -0.011272462
carb  0.005766010 -0.027779208
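
When interpreting a component, it helps to rank the variables by the absolute size of their loadings. The sketch below does this for the princomp fit; the helper top_vars and its argument k are hypothetical names introduced for illustration.

```r
data("mtcars")

fit   <- princomp(mtcars, cor = TRUE)
load2 <- unclass(fit$loadings)[, 1:2]   # plain matrix of the first two loading vectors

# Hypothetical helper: names of the k variables with the largest absolute loading
top_vars <- function(v, k = 3) {
  names(sort(abs(v), decreasing = TRUE))[seq_len(k)]
}

top1 <- top_vars(load2[, "Comp.1"])
top2 <- top_vars(load2[, "Comp.2"])
top1
top2
```

Using absolute values makes the ranking independent of the sign of a component, which is arbitrary in PCA. In the table above, Comp.1 is driven mainly by cyl, disp and mpg, and Comp.2 by qsec, gear and am.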

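5. Compute the principal component scores

# Scores (princomp): either read the scores element of the fitted object or call predict()
> car.pca1 <- car.pr1$scores[,1:2]  # first two columns of the stored scores
> car.pca1 <- predict(car.pr1)[,1:2] # equivalent: predict() returns the same scores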
> car.pca1
                           Comp.1     Comp.2
Mazda RX4            0.6572132031  1.7354457
Mazda RX4 Wag        0.6293955058  1.5500334
Datsun 710           2.7793970426 -0.1464566
Hornet 4 Drive       0.3117707086 -2.3630190
Hornet Sportabout   -1.9744889419 -0.7544022
Valiant              0.0561375337 -2.7859996
Duster 360          -3.0026742880  0.3348874
Merc 240D            2.0553287289 -1.4651808
Merc 230             2.2874083842 -1.9835265
Merc 280             0.5263812077 -0.1620126
Merc 280C            0.5092054932 -0.3238945
Merc 450SE          -2.2478104359 -0.6834740
Merc 450SL          -2.0478227622 -0.6832207
Merc 450SLC         -2.1485421615 -0.8017395
Cadillac Fleetwood  -3.8997903717 -0.8279481
Lincoln Continental -3.9541231097 -0.7333815
Chrysler Imperial   -3.5929719882 -0.4211349
Fiat 128             3.8562837567 -0.2967519
Honda Civic          4.2540325032  0.6884140
Toyota Corolla       4.2342207436 -0.2792875
Toyota Corona        1.9041678566 -2.1198383
Dodge Challenger    -2.1848507430 -1.0142171
AMC Javelin         -1.8633834347 -0.9064645
Camaro Z28          -2.8889945733  0.6808260
Pontiac Firebird    -2.2459189274 -0.8738121
Fiat X1-9            3.5739682964 -0.1212038
Porsche 914-2        2.6512550541  2.0463709
Lotus Europa         3.3857059882  1.3785993
Ford Pantera L      -1.3729574238  3.4999996
Ferrari Dino         0.0009899207  3.2190722
Maserati Bora       -2.6691258658  4.3796772
Volvo 142E           2.4205931001  0.2336399

# Scores (prcomp): they are stored in car.pr2$x; predict() without new data returns the same matrix
> car.pca2 <- predict(car.pr2)[,1:2]
> car.pca2
                            PC1         PC2
Mazda RX4            -79.596425    2.132241
Mazda RX4 Wag        -79.598570    2.147487
Datsun 710          -133.894096   -5.057570
Hornet 4 Drive         8.516559   44.985630
Hornet Sportabout    128.686342   30.817402
Valiant              -23.220146   35.106518
Duster 360           159.309025  -32.259197
Merc 240D           -112.615805   39.702195
Merc 230            -103.534591    7.513104
Merc 280             -67.046877   -6.208536
Merc 280C            -66.997514   -6.206387
Merc 450SE            55.211672  -10.373509
Merc 450SL            55.173910  -10.361893
Merc 450SLC           55.251602  -10.370934
Cadillac Fleetwood   242.814893   52.501758
Lincoln Continental  236.369886   38.280788
Chrysler Imperial    224.737944   16.111941
Fiat 128            -172.363654    6.575522
Honda Civic         -181.066911   17.783639
Toyota Corolla      -179.697852    4.188212
Toyota Corona       -121.224099   -3.345362
Dodge Challenger      80.159386   34.983214
AMC Javelin           67.572431   28.894067
Camaro Z28           150.354631  -36.633575
Pontiac Firebird     164.652522   48.239880
Fiat X1-9           -171.897231    6.643746
Porsche 914-2       -123.804988    2.033356
Lotus Europa        -137.082789  -28.675647
Ford Pantera L       159.413222  -53.318347
Ferrari Dino         -64.762396  -62.954280
Maserati Bora        145.361703 -139.049149
Volvo 142E          -115.181783  -13.826313
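
To make clear what a score is, the following sketch recomputes the prcomp scores by hand: center each variable, then project onto the loading vectors. The name manual is illustrative only.

```r
data("mtcars")

car.pr2 <- prcomp(mtcars)          # centered, not scaled
X <- as.matrix(mtcars)

# Subtract the column means stored in the fit, then multiply by the rotation matrix
manual <- sweep(X, 2, car.pr2$center) %*% car.pr2$rotation

# The hand-computed scores should match the stored ones
all.equal(unname(manual), unname(car.pr2$x))

# predict() without newdata simply returns the stored scores
identical(predict(car.pr2), car.pr2$x)
```

If the fit had been made with scale. = TRUE, the centered columns would also have to be divided by car.pr2$scale before the projection.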

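6. Visualize the results

# Combine the scores with a grouping variable
set.seed(123)  # arbitrary seed so that the random grouping is reproducible
type <- sample(1:5,nrow(mtcars),replace = TRUE) # mtcars has no ready-made grouping variable, so assign 5 random groups
car.pdata1 <- data.frame(name=rownames(car.pca1),car.pca1)
car.pdata1$type <- factor(type)
car.pdata2 <- data.frame(name=rownames(car.pca2),car.pca2)
car.pdata2$type <- factor(type)

Plot the principal components with per-group confidence ellipses (princomp)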

pca_plot1 <- ggplot(car.pdata1, aes(Comp.1, Comp.2 ,color = type,shape=type)) + 
  geom_point(size=2)+ 
  # confidence ellipse for each group
  stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
  geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # dashed vertical line at x = 0
  geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # dashed horizontal line at y = 0
  theme(legend.title=element_blank())+ # no legend title
  labs(x= paste0("Comp.1(", round(car.pv1[1]*100,2), "%)"),
       y= paste0("Comp.2(", round(car.pv1[2]*100,2), "%)"),title = "Individuals-PCA1")
pca_plot1

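[Figure: PCA score plot with group ellipses (princomp)]

Plot the principal components with per-group confidence ellipses (prcomp)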

pca_plot2 <- ggplot(car.pdata2, aes(PC1, PC2 ,color = type,shape=type)) + 
  geom_point(size=2)+ 
  # confidence ellipse for each group
  stat_ellipse(aes(group = type,fill=type),show.legend = FALSE,geom = "polygon",alpha = 0.2) +
  geom_vline(xintercept = 0, size = 0.2,linetype = 2) + # dashed vertical line at x = 0
  geom_hline(yintercept = 0, size = 0.2,linetype = 2) + # dashed horizontal line at y = 0
  theme(legend.title=element_blank())+ # no legend title
  labs(x= paste0("PC1(", round(car.pv2[2,1]*100,2), "%)"),
       y= paste0("PC2(", round(car.pv2[2,2]*100,2), "%)"),title = "Individuals-PCA2")

pca_plot2

[Figure: PCA score plot with group ellipses (prcomp)]
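
Random groups carry no meaning, so the ellipses in the plots above mostly overlap. As a possible variation, the sketch below colors the points by the number of cylinders instead. Treating cyl as the grouping variable is an assumption made for illustration; because cyl is also one of the PCA inputs, some of the separation is built in. The names fit, pv, pdata and p are illustrative.

```r
library(ggplot2)
data("mtcars")

fit <- princomp(mtcars, cor = TRUE)
pv  <- fit$sdev^2 / sum(fit$sdev^2)   # proportion of variance per component

# Scores of the first two components plus the grouping factor
pdata <- data.frame(name = rownames(mtcars),
                    fit$scores[, 1:2],
                    cyl = factor(mtcars$cyl))

p <- ggplot(pdata, aes(Comp.1, Comp.2, color = cyl, shape = cyl)) +
  geom_point(size = 2) +
  # one ellipse per cylinder group
  stat_ellipse(aes(group = cyl, fill = cyl), geom = "polygon",
               alpha = 0.2, show.legend = FALSE) +
  labs(x = paste0("Comp.1(", round(pv[1] * 100, 2), "%)"),
       y = paste0("Comp.2(", round(pv[2] * 100, 2), "%)"),
       title = "Individuals-PCA, grouped by cyl")
p
```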
