Data Preparation
This post implements the binary-classification version of AdaBoost, using the two-class dataset mnist_binary.csv. Since the original feature values range from 0 to 255, the thresholds of AdaBoost's base classifiers would be spread over a wide range; the data are therefore further binarized to 0/1, so that each threshold is chosen from only three values, [-0.5, 0.5, 1.5]. The binarization is done in the code; no separate dataset is generated.
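A minimal sketch of that preprocessing step, on toy values rather than the real dataset:

import numpy as np

# Toy pixel row standing in for one MNIST image (values 0-255)
pixels = np.array([0, 30, 128, 255])

# Binarize with the same threshold the code below uses (values > 127 become 1)
binary = (pixels > 127).astype(int)  # -> [0, 0, 1, 1]

# Candidate stump thresholds, built the same way as in AdaBoost.__init__
thresholds = np.arange(binary.min() - 0.5, binary.max() + 0.51, 1)
print(thresholds)  # [-0.5  0.5  1.5]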
The AdaBoost Algorithm
The idea behind AdaBoost is fairly simple: combine several weak classifiers into one strong classifier, where each weak classifier is trained under a different weighting (distribution) of the training data. This kind of additive model is known as a boosting method.
The book gives a clear account of AdaBoost's guiding idea and of the boosting process. The algorithm's steps are clearly defined and its rationale is fairly simple; a condensed version of the steps is given below:
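1. Initialize a uniform weight distribution over the N training samples: D_1 = (w_11, …, w_1N) with w_1i = 1/N.
2. For m = 1, 2, …, M:
   - train a basic classifier G_m(x) on the data weighted by D_m;
   - compute its weighted error e_m = Σ_i w_mi · I(G_m(x_i) ≠ y_i);
   - compute its coefficient α_m = ½ · ln((1 − e_m) / e_m);
   - update the weights: w_{m+1,i} = (w_mi / Z_m) · exp(−α_m · y_i · G_m(x_i)), where Z_m normalizes the weights to sum to 1.
3. Output the final classifier G(x) = sign(Σ_m α_m G_m(x)).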
The full implementation is given below:
# @Author: phd
# @Date: 2019-11-08
# @Site: github.com/phdsky
# @Description: AdaBoost (binary classification) on binarized MNIST
import time
import logging
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Binarizer
def log(func):
    def wrapper(*args, **kwargs):  # fixed typo: "warpper" -> "wrapper"
        start_time = time.time()
        ret = func(*args, **kwargs)
        end_time = time.time()
        logging.debug("%s() cost %s seconds" % (func.__name__, end_time - start_time))
        return ret
    return wrapper
def calc_accuracy(y_pred, y_truth):
assert len(y_pred) == len(y_truth)
n = len(y_pred)
hit_count = 0
for i in range(0, n):
if y_pred[i] == y_truth[i]:
hit_count += 1
print("Accuracy %f\n" % (hit_count / n))
return hit_count / n
def sign(x):
    # Map a real-valued score to a class label; ties (x == 0) go to +1
    return 1 if x >= 0 else -1
class AdaBoost(object):
    def __init__(self, X_train, y_train, max_classifiers):
        self.X = X_train
        self.Y = y_train
        self.sample_num = len(X_train)      # number of training samples
        self.feature_num = len(X_train[0])  # number of features
        self.D = np.full(self.sample_num, 1. / self.sample_num)  # sample weight distribution
        self.M = max_classifiers            # maximum number of weak classifiers
        self.axis = np.full(self.M, -1)     # feature axis picked by each classifier (min error)
        self.alpha = np.zeros(self.M)       # classifier coefficients
        self.Gm = np.zeros(self.M)          # threshold of each basic classifier
        self.thresh_array = np.arange(np.min(self.X) - 0.5, np.max(self.X) + 0.51, 1)  # candidate thresholds
        self.direction = np.full(self.M, -1)  # inequality direction of each classifier
    def basic_classifier(self, threshold, value, direction):
        # direction 0: predict +1 when value < threshold, else -1
        # direction 1: predict +1 when value > threshold, else -1
        if direction == 0:
            return 1 if value < threshold else -1
        elif direction == 1:
            return 1 if value > threshold else -1
        else:
            print("Unknown classifier direction!")
def train_basic_classifier(self, classifier):
# After binarization, the value is 0 ~ 1, so the
# threshold should be [-0.5, 0.5, 1.5]
# For multi dimensional data, choose the axis which
# has the min ei value to take part in decision
        min_ei = self.sample_num  # safe upper bound: the weighted error never exceeds 1
selected_axis = -1
threshold = self.thresh_array[-1] + 1
direction_array = [0, 1]
direction = -1
for axis in range(self.feature_num):
for th in self.thresh_array:
axis_vector = self.X[:, axis]
thresh_vector = np.full(self.sample_num, th)
for direct in direction_array:
                    # Vectorized stump evaluation for speed
                    if direct == 0:
                        compare_vector = np.asarray(axis_vector < thresh_vector, dtype=int) * 2 - 1
                    elif direct == 1:
                        compare_vector = np.asarray(axis_vector > thresh_vector, dtype=int) * 2 - 1
                    # Weighted error: total weight of the misclassified samples
                    calc_ei = np.sum((compare_vector != self.Y) * self.D)
                    # Equivalent but much slower per-sample loop:
                    # calc_ei = 0.
                    # for sample in range(self.sample_num):
                    #     calc_ei += self.D[sample] * int(self.basic_classifier(
                    #         th, self.X[sample][axis], direct) != self.Y[sample])
if calc_ei < min_ei:
min_ei = calc_ei
selected_axis = axis
threshold = th
direction = direct
self.axis[classifier] = selected_axis
self.Gm[classifier] = threshold
self.direction[classifier] = direction
return min_ei
@log
def train(self):
m = 0
while m < self.M:
print("Training %d classifier..." % m)
# Train basic classifier and classify error
ei = self.train_basic_classifier(classifier=m)
# Calculate alpha value
self.alpha[m] = 0.5*np.log((1 - ei) / ei)
# Validate training
train_label = self.predict(X_test=self.X, classifier_number=(m + 1))
accuracy = calc_accuracy(train_label, self.Y)
if accuracy == 1.:
print("Fitting perfect on training set!")
return m + 1
# Calculate regulator
Zm = 0.
for i in range(self.sample_num):
Zm += self.D[i] * np.exp(-self.alpha[m]*self.Y[i] *
self.basic_classifier(self.Gm[m], self.X[i][self.axis[m]], self.direction[m]))
# Update weight distribution
for i in range(self.sample_num):
self.D[i] = self.D[i] * np.exp(-self.alpha[m]*self.Y[i] *
self.basic_classifier(self.Gm[m], self.X[i][self.axis[m]], self.direction[m])) / Zm
m += 1
return m
# @log
def predict(self, X_test, classifier_number):
n = len(X_test)
predict_label = np.full(n, -1)
for i in range(n):
to_predict = X_test[i]
result = 0.
for m in range(classifier_number):
result += self.alpha[m] * self.basic_classifier(self.Gm[m], to_predict[self.axis[m]], self.direction[m])
predict_label[i] = sign(result)
return predict_label
def example_large():
    mnist_data = pd.read_csv("../data/mnist_binary.csv")
    mnist_values = mnist_data.values

    images = mnist_values[:, 1:]
    labels = mnist_values[:, 0]

    X_train, X_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.33, random_state=42
    )

    # Binarize the images so the AdaBoost thresholds stay simple
    # (Binarizer is stateless, so fit() only validates the data)
    binarizer_train = Binarizer(threshold=127).fit(X_train)
    X_train_binary = binarizer_train.transform(X_train)

    binarizer_test = Binarizer(threshold=127).fit(X_test)
    X_test_binary = binarizer_test.transform(X_test)

    adaboost = AdaBoost(X_train=X_train_binary, y_train=y_train, max_classifiers=233)

    print("AdaBoost training...")
    classifier_trained = adaboost.train()
    print("\nTraining done...")
    print("\nTraining done with %d classifiers!" % classifier_trained)

    print("Testing on %d samples..." % len(X_test))
    y_predicted = adaboost.predict(X_test=X_test_binary, classifier_number=classifier_trained)
    calc_accuracy(y_pred=y_predicted, y_truth=y_test)
def example_small():
X_train = np.asarray([[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]])
y_train = np.asarray([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
    adaboost = AdaBoost(X_train=X_train, y_train=y_train, max_classifiers=5)

    print("AdaBoost training...")
classifier_trained = adaboost.train()
print("\nTraining done with %d classifiers!" % classifier_trained)
if __name__ == "__main__":
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
# example_large()
example_small()
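For reference, the toy data in example_small() is Example 8.1 from the book; per the book's walkthrough, training should stop after three weak classifiers, approximately:

- G_1(x) = +1 if x < 2.5 else −1, with α_1 ≈ 0.4236;
- G_2(x) = +1 if x < 8.5 else −1, with α_2 ≈ 0.6496;
- G_3(x) = +1 if x > 5.5 else −1, with α_3 ≈ 0.7514 (note the reversed direction).

The final classifier is sign(f(x)) with f(x) = 0.4236·G_1(x) + 0.6496·G_2(x) + 0.7514·G_3(x), which misclassifies none of the ten training points.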
A few points to note about the implementation:
- The weak classifiers in the book carry a direction: the inequality sign can point either way. I missed this at first, so after finishing the algorithm the AdaBoost accuracy stayed stuck at a merely decent level and the expected boosting behaviour never appeared. I then reproduced the book's example for comparison and found the last step didn't match; on closer inspection, the final weak classifier's sign was reversed. After adding a direction flag to every classifier and rerunning, the results matched expectations (see the worked example above).
- The code performs a very large number of comparisons, so they are essentially all implemented in vectorized form to speed things up; a plain per-sample loop is far too slow. A minimal illustration follows this list.
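The sketch below mirrors the vectorized weighted-error computation used in train_basic_classifier, on made-up toy values:

import numpy as np

X_axis = np.array([0, 1, 0, 1])   # one binarized feature column
y = np.array([1, -1, 1, 1])       # labels
D = np.full(4, 0.25)              # current sample weights
th = 0.5                          # candidate threshold

# Stump prediction for direction 0 ("+1 when x < threshold"),
# evaluated on all samples at once instead of one per loop iteration
pred = (X_axis < th).astype(int) * 2 - 1   # -> [ 1 -1  1 -1]

# Weighted error e_m: the total weight of the misclassified samples
e_m = np.sum((pred != y) * D)              # -> 0.25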
Output:
/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/boosting/adaboost.py
AdaBoost training...
Training 0 classifier...
Accuracy 0.661194
Training 1 classifier...
Accuracy 0.661194
Training 2 classifier...
Accuracy 0.682836
Training 3 classifier...
Accuracy 0.702807
Training 4 classifier...
Accuracy 0.706681
Training 5 classifier...
Accuracy 0.719829
Training 6 classifier...
Accuracy 0.718728
Training 7 classifier...
Accuracy 0.727505
Training 8 classifier...
Accuracy 0.741294
Training 9 classifier...
Accuracy 0.741649
Training 10 classifier...
Accuracy 0.755508
Training 11 classifier...
Accuracy 0.761443
Training 12 classifier...
Accuracy 0.762864
Training 13 classifier...
Accuracy 0.763291
Training 14 classifier...
Accuracy 0.765601
Training 15 classifier...
Accuracy 0.763042
Training 16 classifier...
Accuracy 0.770860
Training 17 classifier...
Accuracy 0.771357
Training 18 classifier...
Accuracy 0.769581
Training 19 classifier...
Accuracy 0.774982
Training 20 classifier...
Accuracy 0.777967
Training 21 classifier...
Accuracy 0.776048
Training 22 classifier...
Accuracy 0.778820
Training 23 classifier...
Accuracy 0.778216
Training 24 classifier...
Accuracy 0.777683
Training 25 classifier...
Accuracy 0.781414
Training 26 classifier...
Accuracy 0.780242
Training 27 classifier...
Accuracy 0.778536
Training 28 classifier...
Accuracy 0.783440
Training 29 classifier...
Accuracy 0.781485
Training 30 classifier...
Accuracy 0.784790
Training 31 classifier...
Accuracy 0.784009
Training 32 classifier...
Accuracy 0.787456
Training 33 classifier...
Accuracy 0.785679
Training 34 classifier...
Accuracy 0.789197
Training 35 classifier...
Accuracy 0.785537
Training 36 classifier...
Accuracy 0.792502
Training 37 classifier...
Accuracy 0.787918
Training 38 classifier...
Accuracy 0.793888
Training 39 classifier...
Accuracy 0.789552
Training 40 classifier...
Accuracy 0.792324
Training 41 classifier...
Accuracy 0.791720
Training 42 classifier...
Accuracy 0.795060
Training 43 classifier...
Accuracy 0.794812
Training 44 classifier...
Accuracy 0.797477
Training 45 classifier...
Accuracy 0.797157
Training 46 classifier...
Accuracy 0.798436
Training 47 classifier...
Accuracy 0.798045
Training 48 classifier...
Accuracy 0.800391
Training 49 classifier...
Accuracy 0.799112
Training done...
Training done with 50 classifiers!
Testing on 13860 samples...
DEBUG:root:train() cost 266.6678547859192 seconds
Accuracy 0.794444
Process finished with exit code 0
The results show that AdaBoost does exhibit a genuine "boosting" process, and the running time is acceptable. On this binarized binary-classification task the accuracy is roughly on par with the earlier models, which is passable; note, however, that towards the end of training, adding more weak classifiers improves the result only marginally.
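As a rough, optional sanity check (not part of the original experiment), scikit-learn's AdaBoostClassifier with depth-1 trees uses the same kind of decision stump as its base learner; the numbers won't match exactly, because its stump search and AdaBoost variant differ:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train_binary / y_train / X_test_binary / y_test from example_large()
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=50,  # same budget as the run above
)  # use base_estimator= instead of estimator= on scikit-learn < 1.2
clf.fit(X_train_binary, y_train)
print(clf.score(X_test_binary, y_test))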
Boosting Trees
Both AdaBoost and the boosting tree are essentially instances of the same recipe: the algorithm is the forward stagewise algorithm and the model is an additive model. Moreover, the base classifier used in AdaBoost here can be viewed as a simple decision tree whose root node connects directly to two leaf nodes (a decision stump), so the two methods follow similar flows.
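In symbols, the additive model is

$$f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m),$$

and forward stagewise fitting optimizes one term at a time, keeping the already-fitted part f_{m-1} fixed:

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\ f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big), \qquad f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m).$$

AdaBoost corresponds to taking the exponential loss and letting b(x; γ) be a weak classifier.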
What a boosting tree fits at each step depends on its loss function:
- With the squared-error loss (regression problems) or the exponential loss (classification problems), it suffices to fit the residuals at each step; the regression boosting tree with squared-error loss is sketched after this list.
- With a general loss function (general decision problems), the optimization resorts to gradient descent: the negative gradient of the loss at the current model serves as an approximation of the residuals; the gradient boosting tree is also sketched below.
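Condensed from the book, the two procedures are:

Regression boosting tree (squared-error loss): initialize f_0(x) = 0; at each step m compute the residuals

$$r_{mi} = y_i - f_{m-1}(x_i), \quad i = 1, \dots, N,$$

fit a regression tree T(x; \Theta_m) to them, and update f_m(x) = f_{m-1}(x) + T(x; \Theta_m); the final model is f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m).

Gradient boosting tree (general loss L): replace the residuals by the negative gradient of the loss at the current model,

$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}},$$

fit a regression tree to the r_{mi} to obtain leaf regions R_{mj}, set each leaf value to c_{mj} = arg min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c), and update f_m(x) = f_{m-1}(x) + \sum_j c_{mj} I(x \in R_{mj}).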
Summary
References
- 《統計學習方法》 (Li Hang, Statistical Learning Methods)