Day 8.0: Dimensionality Reduction with PCA

# PCA and SVD
from sklearn.decomposition import PCA
# PCA(n_components=None
#     , copy=True
#     , whiten=False
#     , svd_solver='auto'
#     , tol=0.0
#     , iterated_power='auto'
#     , random_state=None)

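# todo: key parameters
# n_components is the number of dimensions to keep after reduction.
# To visualize a dataset and inspect its distribution, we usually reduce it to
# three dimensions or fewer, most often two, i.e. n_components=2.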
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# todo: 1 load the dataset
data = load_iris()
x = data.data
y = data.target
import pandas as pd

# print(pd.DataFrame(x))
# print(y)

# todo: 2 build the model
# instantiate PCA
pca = PCA(n_components=2)
pca.fit(x)  # fit the model
x_dr = pca.transform(x)  # get the reduced matrix
# print(x_dr.shape)
# print(pd.DataFrame(x_dr))
# print(x_dr)
# or do it in one step
# x_dr = PCA(2).fit_transform(x)
# x_dr[y == 0, 0]  # first component of the samples whose label is 0
# print(data.target_names) # ['setosa' 'versicolor' 'virginica']
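# A minimal sketch checking that the two-step fit/transform route and the one-step
# fit_transform route give the same reduced matrix. The names x_two_step and
# x_one_step are illustrative only and are not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data  # 150 samples, 4 features

# Two-step version: learn the projection, then apply it.
pca = PCA(n_components=2)
pca.fit(x)
x_two_step = pca.transform(x)

# One-step version on a fresh estimator.
x_one_step = PCA(n_components=2).fit_transform(x)

print(x_two_step.shape)
print(np.allclose(x_two_step, x_one_step))
```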

# todo: 3 visualization
# # plt.figure()
# # plt.scatter(x_dr[y == 0, 0], x_dr[y == 0, 1], c='red', label=data.target_names[0])
# # plt.scatter(x_dr[y == 1, 0], x_dr[y == 1, 1], c='black', label=data.target_names[1])
# # plt.scatter(x_dr[y == 2, 0], x_dr[y == 2, 1], c='orange', label=data.target_names[2])
# # plt.legend()
# # plt.title("PCA of IRIS dataset")
# # plt.show()
# colors = ['red', 'black', 'orange']
# plt.figure()
# for i in [0, 1, 2]:
#     plt.scatter(x_dr[y == i, 0]
#                ,x_dr[y == i, 1]
#                ,alpha=.7
#                ,c=colors[i]
#                ,label=data.target_names[i]
#                )
# plt.legend()
# plt.title('PCA of IRIS dataset')
# plt.show()

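# todo: 4 explore the reduced data
# Attribute explained_variance_: the amount of information (explained variance) carried by each new feature after reduction
# pca.explained_variance_
# Attribute explained_variance_ratio_: the share of the original data's total variance carried by each new feature,
# also called the explained variance ratio
# pca.explained_variance_ratio_
# Most of the information ends up concentrated in the first component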
# print(pca.explained_variance_ratio_.sum())
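# A small sketch of how the two attributes relate: dividing explained_variance_ by the
# total sample variance of the original features should reproduce explained_variance_ratio_.
# The variable total_variance is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data
pca = PCA(n_components=2).fit(x)

variances = pca.explained_variance_     # variance along each kept component
ratios = pca.explained_variance_ratio_  # the same, as a share of the total variance

# Total variance of the original data: sum of per-feature sample variances (ddof=1).
total_variance = x.var(axis=0, ddof=1).sum()

print(ratios)
print(variances / total_variance)  # should match the ratios above
print(ratios.sum())                # share of information kept by the two components
```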

# todo 5: choose the best n_components
# with no limit on the number of components
# pca_line = PCA().fit(x)
# print(pca_line.explained_variance_ratio_)  # [0.92461872 0.05306648 0.01710261 0.00521218]
# import numpy as np
#
# # cumulative sum with NumPy
# # print(np.cumsum(pca_line.explained_variance_ratio_)) # [0.92461872 0.97768521 0.99478782 1.        ]
# plt.plot([1, 2, 3, 4], np.cumsum(pca_line.explained_variance_ratio_))
# plt.xticks([1, 2, 3, 4])  # force integer ticks on the x axis
# plt.xlabel("number of components after dimension reduction")
# plt.ylabel("cumulative explained variance ratio")
# plt.show()   # 2 or 3 components look best here; in general pick the elbow where the curve starts to level off
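# The elbow can also be read off numerically. This sketch picks the smallest number of
# components whose cumulative ratio reaches a target; the 0.97 threshold and the names
# threshold and n_keep are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data
pca_line = PCA().fit(x)  # keep all 4 components
cumulative = np.cumsum(pca_line.explained_variance_ratio_)

threshold = 0.97  # hypothetical target share of variance to keep
# argmax returns the index of the first True, so add 1 to get a component count.
n_keep = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(n_keep)
```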

# todo 6: let maximum likelihood estimation pick n_components
# PCA can choose this hyperparameter itself via maximum likelihood estimation (MLE):
# pass "mle" as the value of n_components
# pca_mle = PCA(n_components="mle")
# pca_mle = pca_mle.fit(x)
# X_mle = pca_mle.transform(x)
# # print(X_mle)
# # mle automatically selected 3 features here
#
# print(pca_mle.explained_variance_ratio_.sum())  # 0.9947878161267247
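# A short sketch that reads the number of components chosen by "mle" from the fitted
# attribute n_components_ instead of inspecting the transformed matrix by eye.
# The notes above report 3 components for iris; the exact count is not asserted here.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data
pca_mle = PCA(n_components="mle").fit(x)
x_mle = pca_mle.transform(x)

print(pca_mle.n_components_)  # number of components chosen by MLE
print(x_mle.shape)            # one column per chosen component
print(pca_mle.explained_variance_ratio_.sum())
```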

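# todo 7: choose n_components by the share of information to keep
# Pass a float strictly between 0 and 1 and set svd_solver="full".
# PCA then keeps the smallest number of components whose total explained variance
# ratio exceeds n_components, i.e. the fraction of information you want to retain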
# pca_f = PCA(n_components=0.97, svd_solver="full")
# pca_f = pca_f.fit(x)
# X_f = pca_f.transform(x)
# print(X_f)
# print(pca_f.explained_variance_ratio_.sum())
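# A sketch comparing a few target shares. The targets 0.90, 0.97 and 0.99 and the dict
# named kept are illustrative assumptions; given the cumulative ratios printed in todo 5,
# they should map to 1, 2 and 3 components respectively.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data

kept = {}      # target share -> number of components PCA decided to keep
retained = {}  # target share -> share of variance actually retained
for target in (0.90, 0.97, 0.99):
    pca_f = PCA(n_components=target, svd_solver="full").fit(x)
    kept[target] = int(pca_f.n_components_)
    retained[target] = float(pca_f.explained_variance_ratio_.sum())

for target in kept:
    print(target, kept[target], retained[target])
```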

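# todo 8: key parameter svd_solver
# svd_solver: defaults to "auto", which is usually fine
#               randomized: suited to very large feature matrices where a full SVD is too costly
#               full: runs an exact, complete SVD; suited to cases where run time is not a concern
#               arpack: truncated SVD that can speed things up on large, sparse feature matrices;
#                       requires 0 < n_components < min(x.shape)
# random_state: seed for the randomness used by the "arpack" and "randomized" solvers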
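# A sketch running the three explicit solvers on iris. On a matrix this small they are
# expected to agree closely, so the choice mainly matters for speed on large data.
# random_state=0 is an arbitrary seed chosen for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data

ratios = {}  # solver name -> explained variance ratios of the 2 kept components
for solver in ("full", "arpack", "randomized"):
    pca = PCA(n_components=2, svd_solver=solver, random_state=0).fit(x)
    ratios[solver] = pca.explained_variance_ratio_

for solver, ratio in ratios.items():
    print(solver, ratio)
```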

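# todo 9: key attribute components_
#  #### inspect the attributes stored by PCA
#  components_ holds the principal axes, one row per component: shape (n_components, n_features)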
# # print(PCA(2).fit(x).components_)
# # print(PCA(2).fit(x).components_.shape)
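# A sketch of what components_ is used for: with the default whiten=False, transform
# amounts to centering the data with mean_ and projecting it onto the rows of components_.
# The name manual is illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

x = load_iris().data
pca = PCA(n_components=2).fit(x)

components = pca.components_  # shape (2, 4): one principal axis per row
print(components.shape)

# The rows are orthonormal, so this product should be close to the 2x2 identity.
print(components @ components.T)

# Manual projection: center with the fitted mean, then project onto the axes.
manual = (x - pca.mean_) @ components.T
print(np.allclose(manual, pca.transform(x)))
```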
