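Machine Learning: The AdaBoost Algorithm

Table of contents

I. Algorithm principles

II. Hands-on analysis

I. Algorithm principles

1. Basic idea of the algorithm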

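AdaBoost is short for adaptive boosting. The basic idea is to raise the weights of the samples that the previous weak classifier misclassified, so that the next weak classifier concentrates on them. This is repeated until the error rate drops below a preset threshold or the maximum number of iterations is reached. The overall flow is:

Initialize the weight distribution over the training data: with $N$ samples, every sample starts with weight $\frac{1}{N}$.
Train a weak classifier on the initial training set. Based on its predictions, increase the weights of the misclassified samples and decrease the weights of the correctly classified ones, then train the next weak classifier on the reweighted training set, and keep iterating.
Combine the weak classifiers into a strong classifier. Each weak classifier has its own weight: one with a low classification error rate gets a larger weight and therefore more say in the final decision function, while one with a high error rate gets a smaller weight and less say.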
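To make the last point concrete, the sketch below evaluates the coefficient formula $\alpha=\frac{1}{2}\log\frac{1-e}{e}$ (given in section 2) at a few error rates. The helper name `classifier_weight` and the sample error rates are illustrative choices, not part of the original code.

```python
import math

def classifier_weight(error_rate):
    """Coefficient alpha = 0.5 * ln((1 - e) / e) of a weak classifier (hypothetical helper)."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

# Illustrative error rates; 0.5 corresponds to random guessing on two classes
for e in (0.1, 0.2, 0.4, 0.5, 0.6):
    print(f"error rate {e:.1f} -> alpha {classifier_weight(e):+.4f}")
```

The lower the error rate, the larger the coefficient. At an error rate of 0.5 the coefficient is zero, and above 0.5 it becomes negative.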

2. Algorithm procedure

Assume a binary classification training set $T=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$,
where $x_i\in \chi \subseteq R^n$ and the label $y_i\in \{-1,1\}$.
(1) Initialize the weight distribution of the training set
$D_1=(w_{11},\cdots,w_{1i},\cdots,w_{1N}),\ w_{1i}=\frac{1}{N},\ i=1,2,\cdots,N$
(2) Train the weak classifiers. Assuming there are $M$ of them, for $m=1,2,\cdots,M$:
(a) Learn from the training set weighted by $D_m$ to obtain the base classifier
$G_m(x):\chi\to\{-1,1\}$
(b) Compute the classification error rate of the base classifier on the training set
$e_m=P(G_m(x_i)\ne y_i)=\sum\limits_{i=1}^{N}w_{mi}I(G_m(x_i)\ne y_i)$
As the formula shows, the error rate of each base classifier is simply the sum of the weights of the samples it misclassifies.
(c) Compute the coefficient of the base classifier $G_m(x)$
$\alpha_m=\frac{1}{2}\log\frac{1-e_m}{e_m}$
The logarithm here is the natural logarithm.
Base classifiers should be "good and diverse". "Good" means that each base classifier performs somewhat better than random guessing, that is, $e_m<0.5$:
when $e_m\geq0.5$, $\frac{1-e_m}{e_m}\leq1$, so $\alpha_m\leq 0$;
when $e_m<0.5$, $\frac{1-e_m}{e_m}>1$, so $\alpha_m> 0$.
Since $\alpha_m$ grows as $e_m$ shrinks, a classifier with better performance gets a larger weight and plays a bigger role in the final classification function.
(d) Update the weight distribution of the training set
$D_{m+1}=(w_{m+1,1},\cdots,w_{m+1,i},\cdots,w_{m+1,N})$, $w_{m+1,i}=\frac{w_{mi}\ e^{-\alpha_m y_i G_m(x_i)}}{Z_m},\ i=1,2,\cdots,N$
where
$Z_m=\sum\limits_{i=1}^{N}w_{mi}\ e^{-\alpha_m y_i G_m(x_i)}$
is a normalization factor that makes $D_{m+1}$ sum to 1.
Looking at the weight update formula, we can see that
when $y_iG_m(x_i)=1$, i.e. sample $i$ is classified correctly, $w_{m+1,i}=\frac{w_{mi}\ e^{-\alpha_m}}{Z_m}$;
when $y_iG_m(x_i)=-1$, i.e. sample $i$ is misclassified,
$w_{m+1,i}=\frac{w_{mi}\ e^{\alpha_m}}{Z_m}$.
In short, for $\alpha_m>0$ the weights of misclassified samples are increased and the weights of correctly classified samples are decreased.
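As a worked instance of steps (b) to (d), the sketch below runs one round by hand on the five-point toy data set from the hands-on section. The round-1 predictions are copied from the `ClassEst` line of the training log shown there; the variable names are illustrative.

```python
import numpy as np

# Labels of the five toy samples and the round-1 stump predictions
# (taken from the first ClassEst line of the training log)
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
pred = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])

w = np.full(5, 1.0 / 5)                        # D_1: uniform weights
e = w[pred != y].sum()                         # (b) weighted error e_1
alpha = 0.5 * np.log((1.0 - e) / e)            # (c) coefficient alpha_1
unnormalized = w * np.exp(-alpha * y * pred)   # (d) numerator of the update
Z = unnormalized.sum()                         # normalization factor Z_1
w_next = unnormalized / Z                      # D_2

print("e_1 =", round(e, 4), "alpha_1 =", round(alpha, 4), "Z_1 =", round(Z, 4))
print("D_2 =", np.round(w_next, 3))
```

Only the first sample is misclassified, so its weight rises from 0.2 to 0.5 while the other four drop to 0.125 each, which agrees with the second `D:` line of the training log.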
(3) Build the linear combination of the base classifiers
$f(x)=\sum\limits_{m=1}^{M}\alpha_m G_m(x)$
The final classifier is
$G(x)=\operatorname{sign}(f(x))=\operatorname{sign}\left(\sum\limits_{m=1}^{M}\alpha_m G_m(x)\right)$
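The combination step can be checked numerically. The sketch below plugs in the three coefficients and the per-round predictions that appear in the training log of the hands-on section and evaluates $f(x)$ and $G(x)$ on the five training points. The array layout is an illustrative choice.

```python
import numpy as np

y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])   # true labels of the toy data set

# Row m holds the predictions of G_m on the five training points
# (the three ClassEst lines of the training log)
G = np.array([[-1.0, 1.0, -1.0, -1.0,  1.0],
              [ 1.0, 1.0, -1.0, -1.0, -1.0],
              [ 1.0, 1.0,  1.0,  1.0,  1.0]])

# Coefficients alpha_1..alpha_3 as printed at the end of the training log
alphas = np.array([0.6931471805599453, 0.9729550745276565, 0.8958797346140273])

f = alphas @ G          # f(x_i) = sum_m alpha_m * G_m(x_i)
final = np.sign(f)      # G(x_i) = sign(f(x_i))

print("f(x) =", np.round(f, 4))
print("G(x) =", final)
print("matches labels:", bool(np.array_equal(final, y)))
```

None of the three stumps classifies all five points correctly on its own, but the sign of their weighted sum does.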

II. Hands-on analysis

1. Building a weak classifier from a decision stump

Load and visualize the data:

import numpy as np
import matplotlib.pyplot as plt

def loadSimData():
    # Five 2-D sample points and their class labels (+1 / -1)
    dataMat = np.matrix([[1.0,2.1],
                        [2.0,1.1],
                        [1.3,1.0],
                        [1.0,1.0],
                        [2.0,1.0]])
    classlabels = [1.0,1.0,-1.0,-1.0,1.0]
    return dataMat,classlabels
dataMat,classlabels = loadSimData()

def plotDecisionStu(dataMat,classlabels):
    n = np.shape(dataMat)[0]
    # Coordinates of the +1 samples (xcord1, ycord1) and the -1 samples (xcord2, ycord2)
    xcord1=[];ycord1=[]
    xcord2=[];ycord2=[]
    for i in range(n):
        if classlabels[i]==1:
            xcord1.append(dataMat[i,0]);ycord1.append(dataMat[i,1])
        else:
            xcord2.append(dataMat[i,0]);ycord2.append(dataMat[i,1])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1,ycord1,s=30,c='red',marker = 's')
    ax.scatter(xcord2,ycord2,s=30,c='blue',marker = '8')
    plt.title('Test data for the decision stump')
    plt.show()
# The function only draws the figure and returns nothing, so call it directly
plotDecisionStu(dataMat,classlabels)

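Result (figure not reproduced here): a scatter plot of the five sample points, with the +1 class drawn as red squares and the -1 class as blue octagons.

Decision stump generation function:

Pseudocode:
Set the minimum error rate minError to $+\infty$
For each feature in the data set (first loop):
    For each step (second loop):
        For each inequality (third loop):
            Build a decision stump and evaluate it on the weighted data set
            If its error rate is lower than minError, make this stump the best stump
Return the best decision stump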
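To see what the second loop iterates over, the sketch below lists the candidate thresholds for the first feature of the toy data. It assumes 10 steps and a step index running from -1 to 10, as in the `buildStump` implementation that follows; the variable names are illustrative.

```python
import numpy as np

# First feature of the five toy sample points
feature = np.array([1.0, 2.0, 1.3, 1.0, 2.0])
num_steps = 10   # assumed to match numSteps in buildStump

range_min, range_max = feature.min(), feature.max()
step_size = (range_max - range_min) / num_steps

# j = -1 puts the first threshold one step below the minimum value
thresholds = [range_min + j * step_size for j in range(-1, num_steps + 1)]

print(np.round(thresholds, 2))
print(len(thresholds) * 2, "candidate stumps for this feature ('lt' and 'gt')")
```

The threshold below the minimum lets the search include a stump that assigns every sample to the same class.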

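# Decision stump generation
# Threshold-based classification function of a decision stump
# Parameters: dataMatrix - training data set
#                  dimen - index of the feature to split on
#              threshVal - threshold
#             threshIneq - inequality to apply ('lt' or 'gt')
def  stumpClassify(dataMatrix,dimen,threshVal,threshIneq):
    # Start with class label +1 for every sample
    reArray = np.ones((np.shape(dataMatrix)[0],1))
    # Inequality: lt - less than or equal to the threshold
    #             gt - greater than the threshold
    if threshIneq == 'lt':
        # 'lt': samples whose feature value is <= threshold get label -1
        reArray[dataMatrix[:,dimen] <= threshVal] = -1.0
    else:
        # otherwise: samples whose feature value is > threshold get label -1
        reArray[dataMatrix[:,dimen] > threshVal] = -1.0
    return  reArray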
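# Return the best decision stump for the data set under the weight vector D
def buildStump(dataMatrix,labelMat,D):
    # Accept a list or a matrix of labels and turn it into an m x 1 column
    labelMat = np.asmatrix(labelMat).reshape(-1,1)
    # Size of the training set: m samples, n features
    m,n = np.shape(dataMatrix)
    # Number of steps, info of the best stump, predictions of the best stump
    numSteps = 10.0;bestStemp = {};bestClassEst = np.zeros((m,1))
    # The initial error rate is infinity
    minError = np.inf
    # Loop over every feature
    for i in range(n):
        # Minimum and maximum value of this feature
        rangeMin = dataMatrix[:,i].min();rangeMax = dataMatrix[:,i].max()
        # Step size
        StepSize = (rangeMax - rangeMin)/numSteps
        # Loop over every step
        for j in range(-1,int(numSteps)+1):
            # Loop over both inequalities
            for inequal in ['lt','gt']:
                # Threshold for this step
                threshVal = (rangeMin + float(j)*StepSize)
                # Predictions of the candidate stump
                predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)
                # Error indicator vector, initialised to 1 (misclassified)
                errArr = np.asmatrix(np.ones((m,1)))
                # Set the entries where prediction and label agree to 0
                errArr[labelMat == predictedVals] = 0
                # The weighted error is the sum of the weights of the misclassified samples
                weightedError = float((D.T*errArr)[0,0])
                #print("split: dim %d, thresh %.2f, thresh inequal %s, the weighted error is %.3f"%(i,threshVal,inequal,weightedError))
                # Update the minimum error rate
                if weightedError < minError:
                    minError = weightedError
                    # Store the threshold under the key threshVal
                    bestStemp['threshVal'] = threshVal
                    # Store the inequality under the key threshIneq
                    bestStemp['threshIneq']= inequal
                    # Keep the predictions of the best stump
                    bestClassEst = predictedVals.copy()
                    # Feature used by the best stump
                    bestStemp['dim'] = i
    return  bestStemp,bestClassEst,minError

D = np.asmatrix(np.ones((5,1))/5)
# bestStemp, bestClassEst, minError = buildStump(dataMat,classlabels,D)
print(buildStump(dataMat,classlabels,D))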

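Result: with uniform weights, the best stump splits on feature 0 at threshold 1.3 with the 'lt' inequality. It predicts [-1, 1, -1, -1, 1] and has a weighted error of 0.2, consistent with the first round of the training log in the next subsection.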

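2. The complete AdaBoost algorithm

**The program above only produces a single base classifier. Here we generate several weak classifiers to build the complete AdaBoost algorithm:**

The pseudocode for the whole implementation is:
For each iteration:
    Use buildStump() to find the best decision stump
    Add the stump to the list of stumps
    Compute alpha
    Compute the new weight vector
    Update the accumulated class estimate
    If the error rate equals 0, exit the loop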

# The complete AdaBoost algorithm
def adaBoostTrainDs(dataArr,classLabels,numIt=40):
    # List holding the information of every weak classifier
    weakClassArr = []
    # Number of samples
    m = np.shape(dataArr)[0]
    # Labels as a 1 x m matrix (use the argument, not the global variable)
    labelMat = np.asmatrix(classLabels)
    # Initialise the sample weight vector
    D = np.asmatrix(np.ones((m,1))/m)
    # Accumulated weighted predictions of the ensemble
    aggClassEst = np.asmatrix(np.zeros((m,1)))
    # Start iterating
    for  i in range(numIt):
        # Find the best decision stump under the current weights
        bestStemp,ClassEst,Error = buildStump(dataArr,labelMat,D)
        print('D:',D.T)
        # Coefficient of this stump; max() guards against division by zero
        alpha = float(0.5*np.log((1-Error)/max(Error,1e-16)))
        # Store the coefficient alpha in the dictionary bestStemp
        bestStemp['alpha'] = alpha
        # Keep the information of this stump
        weakClassArr.append(bestStemp)
        print('ClassEst:',ClassEst.T)
        # Update the weight of every sample
        expon = np.multiply(-alpha*labelMat.T,ClassEst)
        D = np.multiply(D,np.exp(expon))
        D = D/D.sum()
        # Add the weighted predictions of the current stump
        aggClassEst += alpha*ClassEst
        print('aggClassEst:',aggClassEst)
        # m x 1 matrix: 0 where the ensemble is correct, 1 where it is wrong
        aggErrors = np.multiply(np.sign(aggClassEst)!=labelMat.T,np.ones((m,1)))
        # Training error rate of the ensemble
        errorRate = aggErrors.sum()/m
        print('total error',errorRate)
        # Stop as soon as the error rate reaches 0
        if errorRate == 0.0:
            break
    return  weakClassArr
print(adaBoostTrainDs(dataMat,classlabels,numIt=40))

Result:

D: [[0.2 0.2 0.2 0.2 0.2]]
ClassEst: [[-1.  1. -1. -1.  1.]]
aggClassEst: [[-0.69314718]
 [ 0.69314718]
 [-0.69314718]
 [-0.69314718]
 [ 0.69314718]]
total error 0.2
D: [[0.5   0.125 0.125 0.125 0.125]]
ClassEst: [[ 1.  1. -1. -1. -1.]]
aggClassEst: [[ 0.27980789]
 [ 1.66610226]
 [-1.66610226]
 [-1.66610226]
 [-0.27980789]]
total error 0.2
D: [[0.28571429 0.07142857 0.07142857 0.07142857 0.5       ]]
ClassEst: [[1. 1. 1. 1. 1.]]
aggClassEst: [[ 1.17568763]
 [ 2.56198199]
 [-0.77022252]
 [-0.77022252]
 [ 0.61607184]]
total error 0.0
[{'threshVal': 1.3, 'threshIneq': 'lt', 'dim': 0, 'alpha': 0.6931471805599453}, {'threshVal': 1.0, 'threshIneq': 'lt', 'dim': 1, 'alpha': 0.9729550745276565}, {'threshVal': 0.9, 'threshIneq': 'lt', 'dim': 0, 'alpha': 0.8958797346140273}]

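The last line of the output records, for every base learner, the threshold, the inequality, the feature and the weight it uses.

3. Classification with AdaBoost

The previous code trains the weak classifiers on the training set. Here we take the trained weak classifiers and test them on two new points: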

# Classify data with the trained base learners
def adaClassifiy(dataToClass,classifierArr):
    dataMatrix = np.asmatrix(dataToClass)
    m = np.shape(dataMatrix)[0]
    # Accumulated weighted predictions, one row per sample
    aggClassEst = np.asmatrix(np.zeros((m,1)))
    for i in range(len(classifierArr)):
        # Prediction of the i-th stump
        classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],classifierArr[i]['threshVal'],
                                 classifierArr[i]['threshIneq'])
        # Weight the prediction by the stump's alpha
        aggClassEst += classifierArr[i]['alpha']*classEst
    # The sign of the weighted sum is the final class
    AggClassEst = np.sign(aggClassEst)
    return AggClassEst
classifierArr = adaBoostTrainDs(dataMat,classlabels,numIt=40)
AggClassEst = adaClassifiy([[5,5],[0,0]],classifierArr)
print(AggClassEst)

Result:

D: [[0.2 0.2 0.2 0.2 0.2]]
ClassEst: [[-1.  1. -1. -1.  1.]]
aggClassEst: [[-0.69314718]
 [ 0.69314718]
 [-0.69314718]
 [-0.69314718]
 [ 0.69314718]]
total error 0.2
D: [[0.5   0.125 0.125 0.125 0.125]]
ClassEst: [[ 1.  1. -1. -1. -1.]]
aggClassEst: [[ 0.27980789]
 [ 1.66610226]
 [-1.66610226]
 [-1.66610226]
 [-0.27980789]]
total error 0.2
D: [[0.28571429 0.07142857 0.07142857 0.07142857 0.5       ]]
ClassEst: [[1. 1. 1. 1. 1.]]
aggClassEst: [[ 1.17568763]
 [ 2.56198199]
 [-0.77022252]
 [-0.77022252]
 [ 0.61607184]]
total error 0.0
[[ 1.]
 [-1.]]
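The two predictions can be traced by hand. The sketch below hard-codes the three stumps printed at the end of the training log (all of them use the 'lt' inequality) and computes the weighted vote for both test points. The tuple layout and the helper `weighted_vote` are illustrative and not part of the original code.

```python
# Stumps from the training log as (dim, threshVal, alpha); all use 'lt'
stumps = [(0, 1.3, 0.6931471805599453),
          (1, 1.0, 0.9729550745276565),
          (0, 0.9, 0.8958797346140273)]

def weighted_vote(point):
    """Return f(x) = sum of alpha_m * G_m(x) over the 'lt' stumps (hypothetical helper)."""
    total = 0.0
    for dim, thresh, alpha in stumps:
        # 'lt': a feature value <= threshold votes -1, otherwise +1
        vote = -1.0 if point[dim] <= thresh else 1.0
        total += alpha * vote
    return total

for p in ([5, 5], [0, 0]):
    f = weighted_vote(p)
    label = 1 if f > 0 else -1
    print(p, "f(x) =", round(f, 4), "-> class", label)
```

For [5, 5] all three stumps vote +1 and for [0, 0] all three vote -1, so the weighted sums have the same magnitude with opposite signs, matching the predicted classes 1 and -1 above.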

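References:
1. Li Hang, "Statistical Learning Methods" (《统计学习方法》)
2. Peter Harrington, "Machine Learning in Action" (《机器学习实战》)
3. matplotlib.markers: https://matplotlib.org/api/markers_api.html
4. Fixing garbled Chinese titles and axis labels in matplotlib plots: https://blog.csdn.net/anmo1221/article/details/77746528
5. LaTeX notation for common mathematical symbols: http://mohu.org/info/symbols/symbols.htm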
