机器学习——集成学习(一)

一. 集成学习概述

什么是集成学习？

它本身不是一个单独的机器学习算法，而是通过构建并结合多个机器学习器来完成学习任务。
一般说的集成学习方法都是指同质个体学习器，而同质个体学习器使用最多的模型是CART决策树和神经网络。

集成学习需要解决的问题是什么？

a. 如何得到若干个个体学习器
b. 如何选择一种结合策略，将这些个体学习器集合成一个强学习器

如何得到若干个个体学习器

a. 所有个体学习器都是同一种类的或者说是同质的
b. 所有的个体学习器不全是一个种类的，或者说是异质的。

二. 集成学习算法简介

bagging 算法

bagging的个体弱学习器的训练集是通过随机采样得到的。通过T次的随机采样，我们就可以得到T个采样集，对于这T个采样集，我们可以分别独立的训练出T个弱学习器，再对这T个弱学习器通过集合策略来得到最终的强学习器。
对于这里的随机采样有必要做进一步的介绍，这里一般采用的是自助采样法（Bootstrap sampling）,即对于m个样本的原始训练集，我们每次先随机采集一个样本放入采样集，接着把该样本放回，也就是说下次采样时该样本仍有可能被采集到，这样采集m次，最终可以得到m个样本的采样集，由于是随机采样，这样每次的采样集是和原始训练集不同的，和其他采样集也是不同的，这样得到多个不同的弱学习器。

bagging 原理图
bagging 代码实现

# bagging.py
#
import numpy as np
from sklearn import ensemble, neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

np.random.RandomState(0)

# 加载数据
wine = datasets.load_wine()

# 划分训练集与测试集
x, y = wine.data, wine.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# 数据预处理
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# 创建模型
""" 
参数n_estimators：表示创建多少个同质模型
max_samples: 随机采样（自助采样法）
max_features： 随机选择特征
"""
clf = ensemble.BaggingClassifier(neighbors.KNeighborsClassifier(), n_estimators=10, max_samples=0.5, max_features=0.5)

# 模型拟合
clf.fit(x_train, y_train)

# 预测
y_pred = clf.predict(x_test)

# 评估
print(accuracy_score(y_test, y_pred))

随机森林

随机森林(Random Forest, 简称RF)是bagging的一个特化进阶版，所谓的特化是因为随机森林的弱学习器都是决策树。所谓的进阶是随机森林在bagging的样本随机采样基础上，又加上了特征的随机选择，其基本思想没有脱离bagging的范畴。

首先，RF使用了CART决策树作为弱学习器。第二，在使用决策树的基础上，RF对决策树的建立做了改进，对于普通的决策树，我们会在节点上所有的 $n$ 个样本特征中选择一个最优的特征来做决策树的左右子树划分，但是RF通过随机选择节点上的一部分样本特征，这个数字小于 $n$ ，假设为 $n_{sub}$ ，然后在这些随机选择的 $n_{sub}$ 个样本特征中，选择一个最优的特征来做决策树的左右子树划分。这样进一步增强了模型的泛化能力。
除了上面两点，RF和普通的bagging算法没有什么不同.

随机森林代码实现

# 随机深林.py
#
import numpy as np
from sklearn import tree,ensemble, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

np.random.RandomState(0)

# 加载数据
wine = datasets.load_wine()

# 划分训练集与测试集
x, y = wine.data, wine.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# 数据预处理
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# 创建模型
clf = ensemble.RandomForestClassifier(n_estimators=25,max_depth=3)

# 模型拟合
clf.fit(x_train, y_train)

# 预测
y_pred = clf.predict(x_test)

# 评估
print(accuracy_score(y_test, y_pred))

# 查看特征权重
print(wine.feature_names)
print(clf.feature_importances_)

boosting 算法

Bagging一般使用强学习器，其个体学习器之间不存在强依赖关系，容易并行。Boosting则使用弱分类器，其个体学习器之间存在强依赖关系，是一种序列化方法。Bagging主要关注降低方差，而Boosting主要关注降低偏差。Boosting是一族算法，其主要目标为将弱学习器“提升”为强学习器，大部分Boosting算法都是根据前一个学习器的训练效果对样本分布进行调整，再根据新的样本分布训练下一个学习器，如此迭代M次，最后将一系列弱学习器组合成一个强学习器。而这些Boosting算法的不同点则主要体现在每轮样本分布的调整方式上。

*** AdaBoost后面单独进行讲解。

AdaBoost的具体流程为先对每个样本赋予相同的初始权重，每一轮学习器训练过后都会根据其表现对每个样本的权重进行调整，增加分错样本的权重，这样先前做错的样本在后续就能得到更多关注，按这样的过程重复训练出M个学习器，最后进行加权组合

adaboost代码实现

# bagging.py
#
import numpy as np
from sklearn import ensemble, neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

np.random.RandomState(0)

# 加载数据
wine = datasets.load_wine()

# 划分训练集与测试集
x, y = wine.data, wine.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# 数据预处理
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# 创建模型
clf = ensemble.AdaBoostClassifier(n_estimators=50)

# 模型拟合
clf.fit(x_train, y_train)

# 预测
y_pred = clf.predict(x_test)

# 评估
print(accuracy_score(y_test, y_pred))

Stacking 概述

Stacking 与 bagging 和 boosting 主要存在两方面的差异。首先，Stacking
通常考虑的是异质弱学习器（不同的学习算法被组合在一起），而bagging 和 boosting
主要考虑的是同质弱学习器。其次，stacking 学习用元模型组合基础模型，而bagging 和 boosting则根据确定性算法组合弱学习器。

Stacking 原理

将训练好的所有基模型对整个训练集进行预测，第j个基模型对第i个训练样本的预测值将作为新的训练集中第i个样本的第j个特征值，最后基于新的训练集进行训练。同理，预测的过程也要先经过所有基模型的预测形成新的测试集，最后再对测试集进行预测：

10. Stacking实现代码
介绍一款功能强大的stacking利器，mlxtend库，它可以很快地完成对sklearn模型地stacking。

from sklearn import datasets
 
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
 
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
import numpy as np
 
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)
 
print('3-fold cross validation:\n')
 
for clf, label in zip([clf1, clf2, clf3, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Naive Bayes',
                       'StackingClassifier']):
 
    scores = model_selection.cross_val_score(clf, X, y, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

GBRT算法
*** 后面单独对GBRT进行讲解

Gradient Tree Boosting 或梯度提升回归树(GBRT)是对于任意的可微损失函数的提升算法的泛化。 GBRT是一个准确高效的现有程序, 它既能用于分类问题也可以用于回归问题。梯度树提升模型被应用到各种领域,包括网页搜索排名和生态领域。
GBRT 的优点: 对混合型数据的自然处理(异构特征) 强大的预测能力在输出空间中对异常点的鲁棒性(通过具有鲁棒性的损失函数实现) Gradient
Boost与传统的Boost有着很大的区别,它的每一次计算都是为了减少上一次的残差 (residual),而为了减少这些残差,可以在残差减少的梯度(Gradient)方向上建立一个新模型。所以说, 在Gradient Boost中,每个新模型的建立是为了使得先前模型残差往梯度方向减少,与传统的Boost算法对正确、错误的样本进行加权有着极大的区别。
它主要的思想是,每一次建立模型是在之前建立模型损失函数的梯度下降方向。损失函数(loss function)描述的是模型的不靠谱程度,损失函数越大,则说明模型越容易出错(其实这里有一个方差、偏差均衡的问题,但是这里假设损失函数越大,模型越容易出错)。如果我们的模型能够让损失函数持续的下降,则说明我们的模型在不停的改进,而最好的方式就是让损失函数在其梯度(Gradient)的方向上下降。

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn import linear_model
from sklearn import ensemble
# Data set
x = np.array(list(range(1, 11))).reshape(-1, 1)
y = np.array([5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]).ravel()

# Fit regression model
"""分类"""
# model1 = ensemble.GradientBoostingClassifier(max_depth=1,n_estimators=20)

"""回归"""
model1 = ensemble.GradientBoostingRegressor(max_depth=1,n_estimators=20)
model1.fit(x, y)

# Predict
X_test = np.arange(0.0, 10.0, 0.01)[:, np.newaxis]
y_1 = model1.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(x, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=1", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

机器学习——集成学习(一)

一. 集成学习概述

二. 集成学习算法简介

深度學習中的網絡設計技術(一) ——理論概述

python數據結構——鏈表及其實現

matplotlib實戰二——利用matplotlib畫激活函數曲線

分組卷積和深度可分離卷積

python數據結構——棧、隊列、雙端隊列

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結