AdaBoost实战

AdaBoost原理

Ensemble Learning（集成学习）

集成学习是指集合多个弱分类器以形成一个集成的强分类器。集成学习的框架可以通过下图来反映

也就是说，集成学习是一种集合了多个机器学习模型的“意见”，已完成最后决策的机制。
常见的集成学习策略有三种：

Bagging
Boosting
Stacking

Bagging和Boosting的不同可以通过下面这张图来理解

Bagging对数据集进行随机采样，构成 $N$ 组，然后每组使用模型单独训练，最后进行表决，是一种类似于串联的结构。而Boosting则是不改变训练集的情况下，不断调整样本权重来调优弱分类器性能，实际上是一种串行的思路。
而我们接下来要探讨的AdaBoost是属于Boosting策略型集成算法的一种。

AdaBoost工作机制

分类器实现

准备工作

以下是我们需要用的工具

import numpy as np
import pandas as pd
from random import seed
from random import randrange
from math import sqrt
from math import exp
from math import pi
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import operator
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import plotly.graph_objects as go
from plotly.subplots import make_subplots

基本接口调用

从最简单的开始，我们调用sklearn的AdaBoost分类器接口，完成最基本的实现。

# 随机创建一个预测问题
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# 调用AdaBoost分类策略
model = AdaBoostClassifier()
# 交叉检验
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# 计算准确率
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('准确率: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

该问题的输出结果类似以下：

准确率: 0.806 (0.041)

更换元分类器

最常见的AdaBoost元分类器是决策树，这也是sklearn中默认的元分类器，如果我们不传入任何参数进入AdaBoostClassifier，那么它就会默认使用一层的决策树来作为它的元分类器。但是实际上，AdaBoost接受所有能够样本赋权的分类器作为其的元分类器，而我们最常见的赋权分类器有两种，即决策树和支持向量机（SVM）。在这个部分我们尝试在一个分类问题中找到一个较优的元分类器。
首先，我们先尝试以不同层数的决策树作为元分类器来看看AdaBoost的效果

def get_dataset(iris=True):
    """
    获取研究的数据集
    @是否用鸢尾花数据集进行测试
    """
    if iris == True:
        data = load_iris()
        X = data.data
        y = data.target
    else:
        X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
    return X, y
 
def get_models():
    """
    产生弱分类器(基元)
    """
    models = dict()
    for i in range(1,11):
        models[str(i)] = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=i))
    return models
 ![在这里插入图片描述](https://img-blog.csdnimg.cn/20200516110831264.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MTY3Nzg3Ng==,size_16,color_FFFFFF,t_70)
def evaluate_model(model):
    """
    使用交叉检验评估模型
    @model: 模型
    """
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores
 
X, y = get_dataset(iris=False)
models = get_models()
results, names = [], []
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# 画图
fig = go.Figure()
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, len(results))]
for result, i in zip(results, [j for j in range(len(results))]):
    fig.add_trace(go.Box(
    y=result,
    name=names[i],
    jitter=0.3,
    pointpos=-1.8,
    boxpoints='all',
    marker_color=c[i],
    line_color=c[i],
    boxmean='sd'))
fig.update_layout(template='none')
fig.show()

以不同层数的决策树作为元分类器得到的精确度结果：

>1 0.806 (0.041)
>2 0.863 (0.028)
>3 0.866 (0.030)
>4 0.890 (0.030)
>5 0.917 (0.026)
>6 0.924 (0.021)
>7 0.923 (0.021)
>8 0.930 (0.026)
>9 0.933 (0.021)
>10 0.925 (0.027)

可视化一下这个结果：

接着，我们将元分类器换为支持向量机：

# 随机创建一个预测问题
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
# 建立支持向量机元分类器
svc = SVC(probability=True, kernel='linear')
# 调用AdaBoost分类策略
model = AdaBoostClassifier(base_estimator=svc)
# 交叉检验
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# 计算准确率
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('准确率: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

结果是

准确率: 0.719 (0.101)

对比之下，决策树似乎更适合和AdaBoost搭配（当然，具体问题具体分析），后面我就先使用决策树作为元分类器来进行实验。

确定元分类器数量

默认情况下，我们是将50个决策树进行串联组合，当然我们也可以按照需求调整分类器数量，下面我们做一个模拟：

def get_models():
	models = dict()
	models['10'] = AdaBoostClassifier(n_estimators=10)
	models['50'] = AdaBoostClassifier(n_estimators=50)
	models['100'] = AdaBoostClassifier(n_estimators=100)
	models['500'] = AdaBoostClassifier(n_estimators=500)
	models['1000'] = AdaBoostClassifier(n_estimators=1000)
	models['5000'] = AdaBoostClassifier(n_estimators=5000)
	return models

我们将get_models()函数更换一下，来测试元分类器数量和最后的总体精确率之间有什么关系。结果如下

>10 0.773 (0.039)
>50 0.806 (0.041)
>100 0.801 (0.032)
>500 0.793 (0.028)
>1000 0.791 (0.032)
>5000 0.782 (0.031)

可见，并不是说分类器越多，AdaBoost效果就越好，而且随着分类器数量增加，AdaBoost的执行效率也会大大下降，复杂度急剧上升。

确定学习率

学习率是最基本的机器学习超参数，我们关注学习率变化对结果准确度的影响。

def get_models():
    models = dict()
    for i in np.arange(0.1, 2.1, 0.1):
        key = '%.3f' % i
        models[key] = AdaBoostClassifier(learning_rate=i)
    return models

还是更换get_models()函数。
结果

>0.100 0.767 (0.049)
>0.200 0.786 (0.042)
>0.300 0.802 (0.040)
>0.400 0.798 (0.037)
>0.500 0.805 (0.042)
>0.600 0.795 (0.031)
>0.700 0.799 (0.035)
>0.800 0.801 (0.033)
>0.900 0.805 (0.032)
>1.000 0.806 (0.041)
>1.100 0.801 (0.037)
>1.200 0.800 (0.030)
>1.300 0.799 (0.041)
>1.400 0.793 (0.041)
>1.500 0.790 (0.040)
>1.600 0.775 (0.034)
>1.700 0.767 (0.054)
>1.800 0.768 (0.040)
>1.900 0.736 (0.047)
>2.000 0.682 (0.048)

使用鸢尾花数据集检验

完成了上面的一些基本操作后，我使用鸢尾花数据集来测试AdaBoost的集成学习策略效果，并合单个决策树分类器的效果进行对比。
AdaBoost（决策树）

# 随机创建一个预测问题
# X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
X, y = get_dataset(iris=True)
# 调用AdaBoost分类策略
model = AdaBoostClassifier()
# 交叉检验
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# 计算准确率
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('准确率: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

结果:

准确率: 0.947 (0.056)

AdaBoost（SVC）

# 随机创建一个预测问题
# X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)
X, y = get_dataset(iris=True)
# 建立支持向量机元分类器
svc = SVC(probability=True, kernel='linear')
# 调用AdaBoost分类策略
model = AdaBoostClassifier(base_estimator=svc)
# 交叉检验
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# 计算准确率
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('准确率: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

结果：

准确率: 0.964 (0.037)

决策树
具体代码见机器学习算法——手动搭建决策树分类器(代码+作图)。
结果：

算法的平均准确率: 94.000%

机器学习算法——使用Adaboost进行分类性能提升

AdaBoost实战

AdaBoost原理

Ensemble Learning（集成学习）

AdaBoost工作机制

分类器实现

准备工作

基本接口调用

更换元分类器

确定元分类器数量

确定学习率

使用鸢尾花数据集检验

使用c#强大的表达式树实现对象的深克隆之解决循环引用的问题

GPT-4o 引领人机交互新风向，向量数据库赛道沸腾了

free AI online tools All In One

痞子衡嵌入式：恩智浦i.MX RT1xxx系列MCU启动那些事（12.A）- uSDHC eMMC启动时间(RT1170)

基于Ubuntu-22.04安装K8s-v1.28.2实验（二）使用kube-vip实现集群VIP访问

企业大模型如何成为自己数据的“百科全书”？

本地SSL证书过期输入命令在IIS自动生成

.NET周刊【5月第2期 2024-05-12】

基于Ubuntu-22.04安装K8s-v1.28.2实验（一）部署K8s

基于Ubuntu-22.04安装K8s-v1.28.2实验（三）数据卷挂载NFS（网络文件系统）

機器學習算法——手動搭建樸素貝葉斯分類器(附代碼)

新冠數據整理和簡單分析

新冠數據整理和簡單分析（三）—— 使用Anylogic進行仿真實驗

新冠數據整理和簡單分析（二）——SIR及其變種

分類變量回歸——Probit和Logit（附代碼）

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結