Boosting Method: Model Implementation


Data Preparation

This post implements the binary-classification AdaBoost algorithm, using the two-class dataset mnist_binary.csv. Since the original feature values range from 0 to 255, the thresholds of AdaBoost's basic classifiers would be spread over a wide range; the data is therefore further binarized to 0-1, so that each threshold only has to be chosen from the three values [-0.5, 0.5, 1.5]. The binarization step is done inside the code, so no separate binarized dataset is generated.
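
As a quick illustration of that step (a minimal sketch on a made-up 2x2 pixel array, not the actual dataset), the same sklearn Binarizer used later in the code maps raw grayscale values to 0/1:

import numpy as np
from sklearn.preprocessing import Binarizer

pixels = np.array([[0, 200], [30, 255]])                 # made-up grayscale values in 0-255
binary = Binarizer(threshold=127).fit(pixels).transform(pixels)
print(binary)                                            # values > 127 become 1, the rest become 0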


The AdaBoost Algorithm

The idea behind AdaBoost is fairly simple: it combines multiple weak classifiers into one strong classifier, and the weak classifiers are learned by repeatedly changing the weights (i.e. the distribution) over the training data. This kind of additive-model approach is called the boosting method.
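
Written out, the final strong classifier is a weighted vote of the basic classifiers, which is exactly what the predict method in the code below evaluates:

f(x) = α_1·G_1(x) + α_2·G_2(x) + … + α_M·G_M(x),    G(x) = sign(f(x))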

The book gives a clear account of AdaBoost's divide-and-conquer idea and of the boosting process:
(Figure: the idea of AdaBoost)

The steps of the AdaBoost algorithm are clear and the underlying principle is fairly simple; the detailed procedure is as follows:
(Figure: AdaBoost algorithm steps)
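The key quantities computed in each round m, written out here from the algorithm above so they can be matched against the code below (I(·) is the indicator function and the sums run over the N training samples): the weighted classification error of the basic classifier G_m, its coefficient α_m, and the weight-distribution update.

e_m = Σ_i w_{m,i} · I(G_m(x_i) ≠ y_i)
α_m = ½ · ln((1 − e_m) / e_m)
w_{m+1,i} = (w_{m,i} / Z_m) · exp(−α_m · y_i · G_m(x_i)),   with  Z_m = Σ_i w_{m,i} · exp(−α_m · y_i · G_m(x_i))
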
The concrete code implementation is as follows:

# @Author: phd
# @Date: 2019-11-08
# @Site: github.com/phdsky
# @Description: NULL

import time
import logging
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Binarizer


def log(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        ret = func(*args, **kwargs)

        end_time = time.time()
        logging.debug("%s() cost %s seconds" % (func.__name__, end_time - start_time))

        return ret

    return wrapper


def calc_accuracy(y_pred, y_truth):
    assert len(y_pred) == len(y_truth)
    n = len(y_pred)

    hit_count = 0
    for i in range(0, n):
        if y_pred[i] == y_truth[i]:
            hit_count += 1

    print("Accuracy %f\n" % (hit_count / n))

    return hit_count / n


def sign(x):
    # Returns +1 for non-negative input, -1 otherwise
    if x >= 0:
        return 1
    return -1


class AdaBoost(object):
    def __init__(self, X_train, y_train, max_classifiers):
        self.X = X_train
        self.Y = y_train

        self.sample_num = len(X_train)  # sample num
        self.feature_num = len(X_train[0])  # feature num
        self.D = np.full(self.sample_num, (1./self.sample_num))  # weight distribution

        self.M = max_classifiers  # max number of basic classifiers
        self.axis = np.full(self.M, -1)  # min ei axis selected
        self.alpha = np.zeros(self.M)
        self.Gm = np.zeros(self.M)  # basic classifier

        self.thresh_array = np.arange(np.min(self.X)-0.5, np.max(self.X)+0.51, 1)
        self.direction = np.full(self.M, -1)

    def basic_classifier(self, threshold, value, direction):
        if direction == 0:
            if value < threshold:
                return 1
            else:
                return -1
        elif direction == 1:
            if value > threshold:
                return 1
            else:
                return -1
        else:
            print("WTF the operation direction is?")

    def train_basic_classifier(self, classifier):
        # After binarization, the value is 0 ~ 1, so the
        # threshold should be [-0.5, 0.5, 1.5]
        # For multi dimensional data, choose the axis which
        # has the min ei value to take part in decision
        min_ei = self.sample_num  # safe upper bound: weights sum to 1, so any real weighted error is smaller
        selected_axis = -1
        threshold = self.thresh_array[-1] + 1

        direction_array = [0, 1]
        direction = -1

        for axis in range(self.feature_num):
            for th in self.thresh_array:
                axis_vector = self.X[:, axis]
                thresh_vector = np.full(self.sample_num, th)

                for direct in direction_array:
                    # Use vector format calculation for accelerating
                    if direct == 0:
                        compare_vector = np.asarray([axis_vector < thresh_vector], dtype=int) * 2 - 1
                    elif direct == 1:
                        compare_vector = np.asarray([axis_vector > thresh_vector], dtype=int) * 2 - 1

                    calc_ei = np.sum((compare_vector != self.Y)*self.D)

                    # calc_ei = 0.
                    # for sample in range(self.sample_num):
                    #     calc_ei += self.D[sample]*\
                    #                int(self.basic_classifier(thresh, self.X[sample][axis]) != self.Y[sample])

                    if calc_ei < min_ei:
                        min_ei = calc_ei
                        selected_axis = axis
                        threshold = th
                        direction = direct

        self.axis[classifier] = selected_axis
        self.Gm[classifier] = threshold
        self.direction[classifier] = direction

        return min_ei

    @log
    def train(self):
        m = 0
        while m < self.M:
            print("Training %d classifier..." % m)

            # Train basic classifier and classify error
            ei = self.train_basic_classifier(classifier=m)

            # Calculate alpha value
            self.alpha[m] = 0.5*np.log((1 - ei) / ei)

            # Validate training
            train_label = self.predict(X_test=self.X, classifier_number=(m + 1))
            accuracy = calc_accuracy(train_label, self.Y)

            if accuracy == 1.:
                print("Fitting perfect on training set!")
                return m + 1

            # Calculate regulator
            Zm = 0.
            for i in range(self.sample_num):
                Zm += self.D[i] * np.exp(-self.alpha[m]*self.Y[i] *
                                         self.basic_classifier(self.Gm[m], self.X[i][self.axis[m]], self.direction[m]))

            # Update weight distribution
            for i in range(self.sample_num):
                self.D[i] = self.D[i] * np.exp(-self.alpha[m]*self.Y[i] *
                                               self.basic_classifier(self.Gm[m], self.X[i][self.axis[m]], self.direction[m])) / Zm

            m += 1

        return m

    # @log
    def predict(self, X_test, classifier_number):
        n = len(X_test)
        predict_label = np.full(n, -1)

        for i in range(n):
            to_predict = X_test[i]
            result = 0.

            for m in range(classifier_number):
                result += self.alpha[m] * self.basic_classifier(self.Gm[m], to_predict[self.axis[m]], self.direction[m])

            predict_label[i] = sign(result)

        return predict_label


def example_large():
    mnist_data = pd.read_csv("../data/mnist_binary.csv")
    mnist_values = mnist_data.values

    images = mnist_values[::, 1::]
    labels = mnist_values[::, 0]

    X_train, X_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.33, random_state=42
    )

    # Binarize the images so the AdaBoost stump thresholds stay simple
    binarizer_train = Binarizer(threshold=127).fit(X_train)
    X_train_binary = binarizer_train.transform(X_train)

    binarizer_test = Binarizer(threshold=127).fit(X_test)
    X_test_binary = binarizer_test.transform(X_test)

    adaboost = AdaBoost(X_train=X_train_binary, y_train=y_train, max_classifiers=233)

    print("AdaBoost training...")
    classifier_trained = adaboost.train()
    print("\nTraining done...")
    print("\nTraining done with %d classifiers!" % classifier_trained)

    print("Testing on %d samples..." % len(X_test))
    y_predicted = adaboost.predict(X_test=X_test_binary, classifier_number=classifier_trained)

    calc_accuracy(y_pred=y_predicted, y_truth=y_test)


def example_small():
    X_train = np.asarray([[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]])
    y_train = np.asarray([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

    adaboost = AdaBoost(X_train=X_train, y_train=y_train, max_classifiers=5)

    print("Adaboost training...")
    classifier_trained = adaboost.train()
    print("\nTraining done with %d classifiers!" % classifier_trained)


if __name__ == "__main__":
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    # example_large()
    example_small()

Some points worth noting from the implementation:

  1. Note that the weak classifiers in the book carry a sign direction. I missed this at first, so when I ran AdaBoost after finishing the algorithm, the accuracy just sat at a fixed, moderately high level and the expected boosting never appeared. I then reproduced the book's toy example for comparison and found that the numbers in the last step did not match; looking closer, the sign of the last weak classifier was reversed. After adding a direction flag to each classifier and rerunning, the results matched expectations.
  2. The code involves a lot of comparison operations, so they are done in vectorized form to speed things up; the naive per-sample loops are far too slow. (A small sketch illustrating both points follows this list.)
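
As a quick check of both points (a standalone sketch using the toy data from example_small with uniform weights; not part of the training code): for the stump with threshold 2.5, the two directions give very different weighted errors, and each error is obtained with one vectorized comparison instead of a per-sample loop.

import numpy as np

X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
D = np.full(10, 0.1)                        # uniform initial weight distribution

threshold = 2.5
pred_lt = np.where(X < threshold, 1, -1)    # direction 0: x < threshold -> +1
pred_gt = np.where(X > threshold, 1, -1)    # direction 1: x > threshold -> +1

print(np.sum((pred_lt != y) * D))           # ~0.3, misclassifies samples 6, 7, 8
print(np.sum((pred_gt != y) * D))           # ~0.7, the flipped stump is much worse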

Output:

/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/boosting/adaboost.py
AdaBoost training...
Training 0 classifier...
Accuracy 0.661194

Training 1 classifier...
Accuracy 0.661194

Training 2 classifier...
Accuracy 0.682836

Training 3 classifier...
Accuracy 0.702807

Training 4 classifier...
Accuracy 0.706681

Training 5 classifier...
Accuracy 0.719829

Training 6 classifier...
Accuracy 0.718728

Training 7 classifier...
Accuracy 0.727505

Training 8 classifier...
Accuracy 0.741294

Training 9 classifier...
Accuracy 0.741649

Training 10 classifier...
Accuracy 0.755508

Training 11 classifier...
Accuracy 0.761443

Training 12 classifier...
Accuracy 0.762864

Training 13 classifier...
Accuracy 0.763291

Training 14 classifier...
Accuracy 0.765601

Training 15 classifier...
Accuracy 0.763042

Training 16 classifier...
Accuracy 0.770860

Training 17 classifier...
Accuracy 0.771357

Training 18 classifier...
Accuracy 0.769581

Training 19 classifier...
Accuracy 0.774982

Training 20 classifier...
Accuracy 0.777967

Training 21 classifier...
Accuracy 0.776048

Training 22 classifier...
Accuracy 0.778820

Training 23 classifier...
Accuracy 0.778216

Training 24 classifier...
Accuracy 0.777683

Training 25 classifier...
Accuracy 0.781414

Training 26 classifier...
Accuracy 0.780242

Training 27 classifier...
Accuracy 0.778536

Training 28 classifier...
Accuracy 0.783440

Training 29 classifier...
Accuracy 0.781485

Training 30 classifier...
Accuracy 0.784790

Training 31 classifier...
Accuracy 0.784009

Training 32 classifier...
Accuracy 0.787456

Training 33 classifier...
Accuracy 0.785679

Training 34 classifier...
Accuracy 0.789197

Training 35 classifier...
Accuracy 0.785537

Training 36 classifier...
Accuracy 0.792502

Training 37 classifier...
Accuracy 0.787918

Training 38 classifier...
Accuracy 0.793888

Training 39 classifier...
Accuracy 0.789552

Training 40 classifier...
Accuracy 0.792324

Training 41 classifier...
Accuracy 0.791720

Training 42 classifier...
Accuracy 0.795060

Training 43 classifier...
Accuracy 0.794812

Training 44 classifier...
Accuracy 0.797477

Training 45 classifier...
Accuracy 0.797157

Training 46 classifier...
Accuracy 0.798436

Training 47 classifier...
Accuracy 0.798045

Training 48 classifier...
Accuracy 0.800391

Training 49 classifier...
Accuracy 0.799112


Training done...

Training done with 50 classifiers!
Testing on 13860 samples...
DEBUG:root:train() cost 266.6678547859192 seconds
Accuracy 0.794444


Process finished with exit code 0

From the results we can see that AdaBoost really does exhibit a "boosting" process, and the algorithm's speed is acceptable. On this binarized binary-classification problem the accuracy is roughly on par with the earlier models, which is passable; toward the end, however, it is clear that training more weak classifiers brings only marginal further improvement.


Boosting Trees

Both AdaBoost and boosting trees are essentially instances of the same scheme: the algorithm is the forward stagewise algorithm and the model is an additive model. Moreover, the basic classifier in AdaBoost can be regarded as a simple decision tree with a root node directly connected to two leaf nodes (a decision stump), so the procedures of the two are very similar.
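
In formula form, forward stagewise additive modeling adds one basic function b(x; γ) per step while keeping everything learned so far fixed:

f_m(x) = f_{m−1}(x) + β_m · b(x; γ_m)
(β_m, γ_m) = argmin over (β, γ) of  Σ_i L(y_i, f_{m−1}(x_i) + β · b(x_i; γ))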

Depending on the loss function, what a boosting tree fits at each step is different. Concretely:

  1. If the model's loss is the squared loss (regression) or the exponential loss (classification), it is enough to fit the residual at each step; the steps of the regression boosting tree that uses the squared loss are as follows (also written out as formulas after this list):
    (Figure: regression boosting tree algorithm)
  2. If the model's loss is a general loss function (general decision problems), the optimization has to rely on gradient descent, using the negative gradient of the loss at the current model to approximate the residual; the steps of the gradient boosting tree algorithm are as follows:
    (Figure: gradient boosting tree algorithm)
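
Concretely, with the squared loss L(y, f(x)) = (y − f(x))², adding a new tree T(x; Θ_m) to the current model gives

L(y, f_{m−1}(x) + T(x; Θ_m)) = (r − T(x; Θ_m))²,   where  r = y − f_{m−1}(x),

so the new tree only needs to fit the residual r. For a general loss, this residual is approximated by the negative gradient of the loss at the current model:

r_{mi} = −[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{m−1}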

Summary

(Figure: chapter summary)


References

  1. 《統計學習方法》