Classification

I. MNIST

The MNIST dataset: 70,000 small images of handwritten digits.

II. Getting the data

1. Downloading from the network

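# Note: fetch_mldata was removed in scikit-learn 0.22; newer versions provide
# fetch_openml('mnist_784') instead, which returns the data under different keys.
from sklearn.datasets import fetch_mldata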

mnist = fetch_mldata('MNIST original')
print(mnist)

Output

{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Sun Mar 30 03:19:02 2014', '__version__': '1.0', '__globals__': [], 'mldata_descr_ordering': array([[array(['label'], dtype='<U5'), array(['data'], dtype='<U4')]],
      dtype=object), 'data': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'label': array([[0., 0., 0., ..., 9., 9., 9.]])}

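Datasets loaded through sklearn generally share a similar dictionary-like structure. This one contains the keys __header__, __version__, __globals__, mldata_descr_ordering, data (the pixel values) and label (the digit shown in each image).

2. Reading from a local file

The download sometimes fails, so this blog provides a copy of the dataset:
Link: https://pan.baidu.com/s/1VLD1CmMqWIoDotqf-9umUA Extraction code: exw9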

import scipy.io as sio
import numpy as np

mnist = sio.loadmat('./datasets/mnist/mnist-original.mat')
print(mnist)

X, y = mnist["data"].T, mnist["label"].T
print(X.shape)
print(y.shape)

import matplotlib.pyplot as plt
import matplotlib

## View a sample
some_digit = X[39000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
print(y[39000])


## Create the test set
#  The first 60000 images form the training set
#  The last 10000 images form the test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

## Shuffle the training set
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Output

{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Sun Mar 30 03:19:02 2014', '__version__': '1.0', '__globals__': [], 'mldata_descr_ordering': array([[array(['label'], dtype='<U5'), array(['data'], dtype='<U4')]],
      dtype=object), 'data': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'label': array([[0., 0., 0., ..., 9., 9., 9.]])}
(70000, 784)
(70000, 1)
[[6.]]

III. Training a binary classifier

Binary classification: the two classes are "6" and "not 6".

from sklearn.linear_model import SGDClassifier

## Create the target labels
y_train_6 = (y_train == 6)
y_test_6 = (y_test == 6)

## Stochastic gradient descent (SGD) classifier
#  SGD processes one training instance at a time ==> well suited to online learning
sgd_clf = SGDClassifier(random_state=2019)
sgd_clf.fit(X_train, y_train_6)

## Predict
print(sgd_clf.predict([some_digit]))

Output

[ True]

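IV. Performance evaluation

1. Cross-validation: accuracy

K-fold cross-validation: split the training set into K folds, then train a model on K-1 of the folds and predict on the remaining one, repeating so that each fold is held out once.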
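The fold bookkeeping described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the helper name k_fold_indices is hypothetical, the folds are contiguous and unshuffled, and no stratification is done (sklearn's splitters take care of those details).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Hypothetical helper for illustration only: no shuffling, no stratification.
    """
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every index lands in exactly one test fold, so each instance is predicted once by a model that never saw it during training.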

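(1) Hand-rolled version of `cross_val_score()`

Process: the StratifiedKFold class performs stratified sampling, so that each fold contains a representative ratio of each class. Note that the code below uses StratifiedShuffleSplit instead, which is also stratified but draws random train/test splits rather than partitioning the data into K disjoint folds. At each iteration the code creates a clone of the classifier, trains that clone on the training folds, and tests it on the test fold. Finally it computes the accuracy.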

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.base import clone

skfolds = StratifiedShuffleSplit(n_splits=3, random_state=2019)
for train_index, test_index in skfolds.split(X_train, y_train_6):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train_6[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train_6[test_index])
    clone_clf.fit(X_train_folds, y_train_folds.ravel())
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold.ravel())
    print(n_correct / len(y_pred))

Output

0.982
0.9753333333333334
0.976

(2) Using the `cross_val_score()` function

from sklearn.model_selection import cross_val_score
score = cross_val_score(sgd_clf, X_train, y_train_6, cv=3, scoring="accuracy")
print(score)  ## accuracy
# [0.9820009  0.97985    0.98024901]

Output

[0.97780111 0.982      0.98429921]

(3) A dumb classifier

from sklearn.base import BaseEstimator
class Never6Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
never_6_clf = Never6Classifier()
score6 = cross_val_score(never_6_clf, X_train, y_train_6, cv=3, scoring="accuracy")
print(score6)

Output

[0.90045 0.9028  0.90085]

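Because only about 10% of the images are 6s, even this dumb classifier reaches about 90% accuracy.
==> Accuracy is generally not a good performance measure, especially on skewed datasets, e.g. imbalanced data in which some classes are much more frequent than others.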
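The effect can be reproduced without any library. The sketch below uses made-up labels (one positive in every ten instances, roughly the share of 6s among the digits) and a predictor that always answers "negative"; the data is illustrative, not taken from MNIST.

```python
# Toy labels (made up for illustration): 1 positive out of every 10 instances.
y_true = [i % 10 == 0 for i in range(1000)]

# A "classifier" that always answers "negative", in the spirit of Never6Classifier.
y_pred = [False] * len(y_true)

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
positives_found = sum(1 for p, t in zip(y_pred, y_true) if p and t)

print(accuracy)         # 0.9
print(positives_found)  # 0
```

The accuracy looks respectable even though not a single positive instance is detected, which is why the sections that follow turn to the confusion matrix, precision and recall.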

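2. Confusion matrix

The cross_val_predict() function also performs K-fold cross-validation, but returns the prediction made for each instance while it was in the held-out fold;
the confusion_matrix() function builds a confusion matrix from the ground-truth labels and the predictions.

Idea: count the number of times instances of class A are classified as class B. E.g. to find out how often the classifier mistook 5s for 3s, look at the row for class 5 and the column for class 3 of the confusion matrix.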

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3)
print(confusion_matrix(y_train_6, y_train_pred))

Output

[[53419   663]
 [  529  5389]]

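Interpretation:

Each row of the confusion matrix represents an actual class, and each column represents a predicted class;
- The first row covers the "not 6" images (the negative class): 53419 of them were correctly classified as "not 6" (true negatives), while the remaining 663 were wrongly classified as "6" (false positives). The second row covers the "6" images (the positive class): 529 were wrongly classified as "not 6" (false negatives), and the remaining 5389 were correctly classified as "6" (true positives).
A perfect classifier would produce only true negatives and true positives, so its confusion matrix would have nonzero values only on the main diagonal (top left to bottom right).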
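To make the row/column layout concrete, here is a minimal pure-Python sketch that tallies a binary confusion matrix in the same [[TN, FP], [FN, TP]] arrangement. The function name binary_confusion_matrix and the labels are invented for illustration; in practice confusion_matrix() does this for you.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # int(False) == 0 selects the negative row/column, int(True) == 1 the positive one.
        matrix[int(actual)][int(predicted)] += 1
    return matrix


# Made-up labels, for illustration only.
y_true = [False, False, False, True, True, True, True, False]
y_pred = [False, True, False, True, False, True, True, False]

(tn, fp), (fn, tp) = binary_confusion_matrix(y_true, y_pred)
print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]
```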

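(1) Precision

$precision=\frac{TP}{TP+FP}$
where TP is the number of true positives and FP is the number of false positives.

(2) Recall

Recall, also called sensitivity or the true positive rate (TPR), is the ratio of positive instances that the classifier correctly detects.
$recall=\frac{TP}{TP+FN}$
where FN is the number of false negatives.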
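As a quick worked example of the two formulas, the sketch below plugs in hypothetical counts (they are not taken from the MNIST run above). The helper name precision_recall is likewise only illustrative.

```python
def precision_recall(tp, fp, fn):
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
print(precision)  # 0.8
print(recall)     # 0.5
```

Here the classifier is right 80% of the time when it claims a positive, yet it finds only half of the actual positives. Note that the sketch does not guard against a zero denominator (no predicted positives, or no actual positives).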

3. Precision and recall

from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_train_6, y_train_pred)
recall = recall_score(y_train_6, y_train_pred)
print("The precision is ", precision)
print("The recall is ", recall)

Output

The precision is  0.8694196428571429
The recall is  0.921426157485637

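4. F1 score

The F1 score is the harmonic mean of precision and recall. The harmonic mean gives more weight to low values, so the F1 score is high only when both recall and precision are high.
$F1=\frac{2}{\frac{1}{precision}+\frac{1}{recall}}=2\times\frac{precision\times recall}{precision+recall}=\frac{TP}{TP+\frac{FN+FP}{2}}$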
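The equivalent forms of the formula can be checked numerically. The following sketch uses hypothetical counts and two illustrative helper functions (both names are made up); it also compares the result with the plain arithmetic mean to show how the harmonic mean is pulled toward the smaller value.

```python
import math


def f1_from_counts(tp, fp, fn):
    """F1 = TP / (TP + (FN + FP) / 2)."""
    return tp / (tp + (fn + fp) / 2)


def f1_from_rates(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen for illustration.
tp, fp, fn = 8, 2, 8
precision = tp / (tp + fp)  # 0.8
recall = tp / (tp + fn)     # 0.5

f1_a = f1_from_counts(tp, fp, fn)
f1_b = f1_from_rates(precision, recall)
arithmetic_mean = (precision + recall) / 2

print(round(f1_a, 4), round(f1_b, 4), round(arithmetic_mean, 4))
```

Both forms give 8/13 (about 0.615), which is below the arithmetic mean of 0.65 because the low recall weighs more heavily.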

Calling f1_score() returns the F1 score.

from sklearn.metrics import f1_score

f1 = f1_score(y_train_6,y_train_pred)
print("The F1 score is ", f1)

Output

The F1 score is  0.8615836283567914

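5. The precision/recall trade-off: the PR curve

Depending on the use case you may care more about recall or about precision. Increasing precision generally reduces recall, and vice versa.
==> The precision/recall trade-off

How a prediction is made: the classifier computes a score for each instance and compares it with a threshold; instances scoring above the threshold are assigned to the positive class, the rest to the negative class. Lowering the threshold increases recall and generally reduces precision.

sklearn works with decision scores here: call the decision_function() method, which returns a score for each instance, then make predictions from those scores using any threshold you choose.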
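The thresholding step itself is simple enough to sketch in plain Python. The scores and labels below are invented for illustration, the helper names are hypothetical, and the convention of reporting a precision of 1.0 when nothing is predicted positive is an assumption of this sketch.

```python
def predict_with_threshold(scores, threshold):
    """Label an instance positive when its score exceeds the threshold."""
    return [score > threshold for score in scores]


def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# Made-up decision scores and labels, listed in increasing score order.
scores = [-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0]
y_true = [False, False, True, False, True, True, False, True]

results = []
for threshold in (-1.0, 1.0, 3.0):
    y_pred = predict_with_threshold(scores, threshold)
    precision, recall = precision_recall(y_true, y_pred)
    results.append((threshold, precision, recall))
    print(threshold, round(precision, 3), round(recall, 3))
```

On this toy data recall falls from 1.0 to 0.5 to 0.25 as the threshold rises, while precision goes from about 0.667 to 0.667 to 1.0: recall can only fall as the threshold rises, whereas precision usually rises but is not guaranteed to at every step.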

y_scores = sgd_clf.decision_function([some_digit])
print(y_scores)

## Threshold 1
threshold = 0
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

## Threshold 2
threshold = 200000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

Output

[97250.73888009]
[ True]
[False]

==> Raising the threshold lowers recall.
==> Choosing the threshold

from sklearn.metrics import precision_recall_curve

## Return decision scores instead of predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_6, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], 'b--', label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

# Aim for about 90% precision
y_train_pred_90 = (y_scores > 70000)
precision = precision_score(y_train_6, y_train_pred_90)
recall = recall_score(y_train_6, y_train_pred_90)
print("The precision is ", precision)
print("The recall is ", recall)

Output

The precision is  0.9231185706551164
The recall is  0.8643122676579925

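6. The ROC curve

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR).

This requires the TPR and FPR at various thresholds, which the roc_curve() function computes.

## ROC curve
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_6, y_scores)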
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc="lower right")
plot_roc_curve(fpr, tpr)
plt.show()

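One way to compare classifiers is to measure the area under the ROC curve (AUC): a perfect classifier has a ROC AUC of 1, whereas a purely random classifier has a ROC AUC of 0.5.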
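One way to build intuition for those two reference values is the ranking interpretation of AUC: it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below implements that definition directly; the function name roc_auc_pairwise and the toy scores are invented for illustration, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Runs in O(n_pos * n_neg), so only for small inputs.
    """
    pos = [s for s, t in zip(scores, y_true) if t]
    neg = [s for s, t in zip(scores, y_true) if not t]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [False, False, True, True]
print(roc_auc_pairwise(y_true, [0.1, 0.2, 0.8, 0.9]))   # perfect ranking: 1.0
print(roc_auc_pairwise(y_true, [0.5, 0.5, 0.5, 0.5]))   # uninformative scores: 0.5
print(roc_auc_pairwise(y_true, [0.1, 0.4, 0.35, 0.8]))  # one pair misordered: 0.75
```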

from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_train_6, y_scores)
print(auc)

Output

0.9859523237334556

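7. PR curve vs. ROC curve

Prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives; otherwise use the ROC curve.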

from sklearn.ensemble import RandomForestClassifier
## RandomForestClassifier has no decision_function() method; it provides predict_proba() instead
forest_clf = RandomForestClassifier(random_state=2019)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_6, cv=3, method="predict_proba")

# Use the probability of the positive class as the score
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_6, y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

auc_forest = roc_auc_score(y_train_6, y_scores_forest)
print("The AUC is ", auc_forest)

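Output

The AUC is  0.9956826118210167

Analysis: the RandomForestClassifier's ROC curve looks better than the SGDClassifier's: it comes closer to the top-left corner, and its AUC is higher as well.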

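V. Multiclass classification

Algorithms that can handle multiple classes directly: random forests, naive Bayes
Strictly binary classifiers: SVMs, linear classifiers

1. Binary classifiers ==> multiclass classifier

(e.g. classifying into 10 classes)

One-versus-all (OvA) strategy: train 10 classifiers, one per class (class 1 vs. the rest, class 2 vs. the rest, ...)
One-versus-one (OvO) strategy: train a binary classifier for every pair of classes. With N classes, this requires N*(N-1)/2 classifiers.
- Advantage: each classifier only needs to be trained on the part of the training set that contains the two classes it must distinguish.

Some algorithms (e.g. SVMs) scale poorly with the size of the training set ==> OvO (training many classifiers on small training sets is faster)
Large datasets ==> OvA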
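The number of binary problems each strategy creates for the ten digit classes can be enumerated with the standard library. This is just a counting sketch (the variable names are illustrative); it does not train anything.

```python
from itertools import combinations

classes = list(range(10))  # the ten digit classes 0..9

# OvA: one "this class vs. the rest" problem per class.
ova_tasks = [(c, "rest") for c in classes]

# OvO: one problem per unordered pair of classes.
ovo_tasks = list(combinations(classes, 2))

print(len(ova_tasks))  # 10
print(len(ovo_tasks))  # 45, i.e. 10 * 9 / 2
print(ovo_tasks[:3])   # [(0, 1), (0, 2), (0, 3)]
```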

When a binary classification algorithm is used for a multiclass task, sklearn automatically runs OvA (OvO for SVMs).

sgd_clf.fit(X_train, y_train)
print(sgd_clf.predict([some_digit]))

some_digit_scores = sgd_clf.decision_function([some_digit])
print(some_digit_scores)
## Class with the highest score
print(np.argmax(some_digit_scores))
## Look up the target classes
print(sgd_clf.classes_)
print(sgd_clf.classes_[6])

Output

[6.]
[[-493795.59394766 -316594.71495827  -59032.18005876 -300444.77319706
  -434956.4672297  -292368.411729    276453.49558919 -750703.98392662
  -296673.25971762 -565079.84324395]]
6
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
6.0

To force the OvO or OvA strategy, use OneVsOneClassifier or OneVsRestClassifier.

## Create an OvO classifier based on SGDClassifier
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=2019))
ovo_clf.fit(X_train, y_train)
ovo = ovo_clf.predict([some_digit])
print(ovo)
print(len(ovo_clf.estimators_))  # number of underlying binary classifiers

## Train a RandomForestClassifier
forest_clf.fit(X_train, y_train)
forest = forest_clf.predict([some_digit])
print(forest)
# Get the list of class probabilities for the sample
forest_proba = forest_clf.predict_proba([some_digit])
print(forest_proba)
# Evaluate the classifier with cross-validation
forest_score = cross_val_score(forest_clf, X_train, y_train, cv=3, scoring="accuracy")
print(forest_score)

## Add preprocessing: standardize the inputs
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
score_std = cross_val_score(forest_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
print(score_std)

Output

[6.]
45
[6.]
[[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
[0.940012   0.93944697 0.94039106]
[0.940012   0.93949697 0.94034105]

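VI. Error analysis

Once you have a promising model and want to improve it, analyze the types of errors it makes.

1. Inspecting the confusion matrix

Make predictions with cross_val_predict() => compute the confusion matrix with confusion_matrix() => display it with matshow()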

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
print(conf_mx)
## Display the confusion matrix as an image
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

Output: the printed confusion matrix and its grayscale image (not reproduced here).

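2. Analyzing the confusion matrix

Analysis: the cell for digit 5 is noticeably darker than those of the other digits.
Possible causes: 1. the dataset contains fewer images of 5s; 2. the classifier does not perform as well on 5s as on the other digits.

Compare error rates rather than absolute numbers of errors. Method: divide each value in the confusion matrix by the number of images in the corresponding actual class (the row total).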
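A small made-up example shows why the normalization matters. In the sketch below (a hypothetical 3-class matrix, not the MNIST result) the rarest class has the fewest misclassified images in absolute terms but the highest error rate once each row is divided by its total.

```python
# Made-up 3-class confusion matrix: rows are actual classes, columns are predictions.
conf = [
    [90, 5, 5],     # 100 images of class 0, 10 of them misclassified
    [2, 16, 2],     # only 20 images of class 1, 4 of them misclassified
    [10, 10, 180],  # 200 images of class 2, 20 of them misclassified
]

# Divide each row by its total to turn counts into per-class rates.
norm = [[count / sum(row) for count in row] for row in conf]

# Zero the diagonal so that only the error rates remain.
errors = [[0.0 if i == j else rate for j, rate in enumerate(row)]
          for i, row in enumerate(norm)]

for row in errors:
    print([round(rate, 2) for rate in row])
```

Class 1 ends up with a total error rate of 0.2, twice that of the other two classes, even though it contributes only 4 errors.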

rows_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / rows_sums
## Fill the diagonal with zeros
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

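Note: rows represent actual classes and columns represent predicted classes. The matrix is not strictly symmetric.
The columns for 8 and 9 are bright, which means that many images are misclassified as 8s or 9s; a very dark row means that most images of that digit are classified correctly; more 5s are misclassified as 8s than 8s are misclassified as 5s.
==> Work on improving the classifier's performance on 8s and 9s, and on fixing the 3/5 confusion.
==> Gather more data, engineer new features, or preprocess the inputs (e.g. preprocess the images so that they are well centered and not overly rotated)

VII. Multilabel classification

A classification system that outputs multiple labels per instance is called a multilabel classification system.

1. Training and prediction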

from sklearn.neighbors import KNeighborsClassifier

# Create the y_multilabel array, which contains two target labels.
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
pred_knn = knn_clf.predict([some_digit])
print(pred_knn)

Output
The digit 6 is not large (it is less than 7) and it is not odd

[[False False]]

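2. Evaluation

Evaluating the classifier means choosing the right metric.

One approach: measure the F1 score for each individual label, then compute the average.

# Note: as written, this scores the 10-class labels y_train; to evaluate the
# two-label targets defined above, pass y_multilabel in place of y_train.
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)
f1_score_knn = f1_score(y_train, y_train_knn_pred, average="macro")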
print(f1_score_knn)

Output

0.9684568539645069

To weight the labels, e.g. to give each label a weight equal to its support (the number of instances with that label), set average="weighted".
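The difference between the two averaging modes can be sketched by hand. The helper average_f1 and the per-label numbers below are hypothetical; the sketch only mirrors the arithmetic of a plain mean versus a support-weighted mean.

```python
def average_f1(f1_per_label, support, average="macro"):
    """Combine per-label F1 scores.

    "macro": plain mean, every label counts equally.
    "weighted": mean weighted by support (number of true instances per label).
    """
    if average == "macro":
        return sum(f1_per_label) / len(f1_per_label)
    if average == "weighted":
        return sum(f * s for f, s in zip(f1_per_label, support)) / sum(support)
    raise ValueError("average must be 'macro' or 'weighted'")


# Hypothetical per-label scores: a common label scored well, a rare one scored poorly.
f1_per_label = [0.9, 0.5]
support = [900, 100]

macro = average_f1(f1_per_label, support, "macro")
weighted = average_f1(f1_per_label, support, "weighted")
print(round(macro, 3), round(weighted, 3))
```

The macro average (0.7) is dragged down by the rare label, whereas the weighted average (0.86) mostly reflects the common one, so the choice depends on whether all labels should matter equally.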

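VIII. Multioutput classification

Multioutput-multiclass classification, or multioutput classification for short.

Example: removing noise from images. The output is multilabel (one label per pixel) and each label can take multiple values (pixel intensities range from 0 to 255), so this is a multioutput classification system.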

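import numpy as np
## Add noise (np.random.randint takes the output shape as a tuple)
noise_train = np.random.randint(0, 100, (len(X_train), 784))
noise_test = np.random.randint(0, 100, (len(X_test), 784))
X_train_mod = X_train + noise_train
X_test_mod = X_test + noise_test
y_train_mod = X_train
y_test_mod = X_test

## Train and predict
knn_clf.fit(X_train_mod, y_train_mod)
some_index = 0  # arbitrary test image; the original does not define this index
clean_digit = knn_clf.predict([X_test_mod[some_index]])

# plot_digit() is not defined in the original; a minimal helper is assumed here
def plot_digit(data):
    plt.imshow(data.reshape(28, 28), cmap=matplotlib.cm.binary, interpolation="nearest")
    plt.axis("off")

plot_digit(clean_digit)
plt.show()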

Output: image of the denoised digit (not reproduced here).

Machine Learning with Scikit-learn and TensorFlow (3)
