An aggregation model combines many models so that the combined model classifies better than its parts.

aggregation models: mix or combine hypotheses (for better performance)

Here is an example.
You have $T$ friends whose predictions of whether a stock will rise are $g_1,\cdots ,g_T$. Common aggregation methods are:

Select the most trustworthy friend, judged by their usual performance:
$G(\mathbf{x})=g_{t_{*}}(\mathbf{x}) \text { with } t_{*}=\operatorname{argmin}_{t \in\{1,2, \ldots, T\}} E_{\text {val }}\left(g_{t}^{-}\right)$
Mix the predictions from all your friends uniformly (one vote each):
$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} 1 \cdot g_{t}(\mathbf{x})\right)$
Mix the predictions from all your friends non-uniformly (a weighted vote):
$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} \alpha_{t} \cdot g_{t}(\mathbf{x})\right) \text { with } \alpha_{t} \geq 0$
Combine the predictions conditionally, with weights that depend on the current input $\mathbf{x}$:
$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} q_{t}(\mathbf{x}) \cdot g_{t}(\mathbf{x})\right) \text { with } q_{t}(\mathbf{x}) \geq 0$
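
The four rules above can be sketched in a few lines of Python. This is only an illustration: the three "friends" `g1`, `g2`, `g3`, their validation errors, the weights and the gating functions are all made-up assumptions, and ties in `sign` are sent to +1 by choice.

```python
def sign(v):
    """Return +1 for non-negative scores and -1 otherwise (tie rule is an assumption)."""
    return 1 if v >= 0 else -1

def select(gs, val_errors):
    """Selection: keep the single hypothesis with the smallest validation error."""
    best = min(range(len(gs)), key=lambda t: val_errors[t])
    return gs[best]

def uniform(gs):
    """Uniform mix: every hypothesis gets one vote."""
    return lambda x: sign(sum(g(x) for g in gs))

def non_uniform(gs, alphas):
    """Non-uniform mix: hypothesis t gets alpha_t votes."""
    return lambda x: sign(sum(a * g(x) for a, g in zip(alphas, gs)))

def conditional(gs, qs):
    """Conditional combination: the weight q_t(x) depends on the input x."""
    return lambda x: sign(sum(q(x) * g(x) for q, g in zip(qs, gs)))

# Three hypothetical friends predicting on a 1-D input.
g1 = lambda x: 1 if x > 0 else -1
g2 = lambda x: 1 if x > 2 else -1
g3 = lambda x: -1                      # always pessimistic
gs = [g1, g2, g3]

G_sel = select(gs, val_errors=[0.2, 0.3, 0.5])       # made-up validation errors
G_uni = uniform(gs)
G_lin = non_uniform(gs, alphas=[2.0, 0.5, 0.5])      # made-up weights
G_cond = conditional(gs, qs=[lambda x: 1.0 if x < 1 else 0.0,   # trust g1 for x < 1
                             lambda x: 1.0 if x >= 1 else 0.0,  # trust g2 otherwise
                             lambda x: 0.0])                    # never trust g3
```

At `x = 1` the friends disagree (`g1` says +1, the other two say -1), so the four rules give different answers: selection and the weighted vote follow `g1`, while the uniform vote and the conditional rule return -1.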

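At this point aggregation may feel close to model selection, and intuition suggests that an averaged classifier must be worse than the best one and better than the worst one, so aggregation may seem of little use. So what is aggregation really good for?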

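Take the following figures as an example:

In the first figure on the left, the binary classification is achieved with three vertical or horizontal lines. A single vertical or horizontal line is a very weak classifier, but combined in this way the lines form a stronger classifier whose result is better than that of any one of them alone.

The first figure on the right is obtained by averaging many lines. This situation arises when there are few data samples, and it gives an effect similar to SVM: all of these lines classify the training samples (the sampled data) equally well, but their average may classify test samples (the whole population) better.

So real aggregation is more than simple averaging. It can compensate for the weaknesses of the individual classifiers, whether that is weak classification power or weak generalization. In other words, proper aggregation means better performance.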

Blending

Uniform Blending

For classification:

The mathematical expression is:

$G(\mathbf{x})=\operatorname{sign} \left( \sum_{t=1}^{T} g_{t}(\mathbf{x}) \right)$

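There are $T$ people with one vote each. When the $g_{t}$ make nearly the same predictions, the performance is unchanged. When the $g_{t}$ are diverse, the majority can correct the minority.

For multi-class classification the expression is: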

$G(\mathbf{x})=\underset{1 \leq k \leq K}{\operatorname{argmax}} \sum_{t=1}^{T}\left[\kern-0.15em\left[g_{t}(\mathbf{x})=k\right]\kern-0.15em\right]$
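
As a rough sketch of this voting rule, the function below counts the votes of $g_1(\mathbf{x}),\ldots,g_T(\mathbf{x})$ and returns the most frequent class. The function name `uniform_vote` and the tie-breaking rule (smallest label wins) are assumptions for illustration; the formula itself does not specify how ties are broken.

```python
from collections import Counter

def uniform_vote(predictions):
    """Return the class with the most votes among the T predictions.

    Ties are broken in favor of the smallest class label (an arbitrary choice).
    """
    counts = Counter(predictions)
    return min(counts, key=lambda k: (-counts[k], k))

# Multi-class: five hypotheses vote on one input.
winner = uniform_vote([1, 2, 2, 3, 2])

# Binary: each of three hypotheses errs on a different point (true label is +1 everywhere),
# so the majority corrects the minority on every point.
votes_per_point = [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]
blended = [uniform_vote(v) for v in votes_per_point]
```

In the binary example each individual hypothesis is wrong on one of the three points, yet the blended prediction is correct on all of them, which is the "majority can correct minority" effect.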

For regression:

$G(\mathbf{x})=\frac{1}{T} \sum_{t=1}^{T} g_{t}(\mathbf{x})$

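When the $g_{t}$ make nearly the same predictions, the performance is unchanged. When the $g_{t}$ are diverse, some predictions satisfy $g_{t}(\mathbf{x})>f(\mathbf{x})$ and others satisfy $g_{t}(\mathbf{x})<f(\mathbf{x})$, so in the ideal case the errors cancel and the average can be close to the best solution.

In both cases, diverse hypotheses make it more likely that the blended model performs better.

Now a theoretical analysis of whether performance improves, using regression as the example:
Here avg is taken over all hypotheses, that is, over the $T$ functions $g_t$, at a single arbitrary input $\mathbf{x}$.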
$\begin{aligned} \operatorname{avg}\left(\left(g_{t}(\mathrm{x})-f(\mathrm{x})\right)^{2}\right) &=\operatorname{avg}\left(g_{t}^{2}-2 g_{t} f+f^{2}\right) \\ &=\operatorname{avg}\left(g_{t}^{2}\right)-2 G f+f^{2} \\ &=\operatorname{avg}\left(g_{t}^{2}\right)-G^{2}+(G-f)^{2} \\ &=\operatorname{avg}\left(g_{t}^{2}\right)-2 G^{2}+G^{2}+(G-f)^{2} \\ &=\operatorname{avg}\left(g_{t}^{2}\right)-2\operatorname{avg}\left(g_{t}\right)G+G^{2}+(G-f)^{2} \\ &=\operatorname{avg}\left(g_{t}^{2}-2 g_{t} G+G^{2}\right)+(G-f)^{2} \\ &=\operatorname{avg}\left(\left(g_{t}-G\right)^{2}\right)+(G-f)^{2} \end{aligned}$
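
A quick numeric check of this identity at one input can make it more concrete. The three predictions and the target value below are made-up numbers; `decompose` is a hypothetical helper name.

```python
def decompose(preds, f):
    """Return (avg squared error of the g_t, avg squared spread around G, squared error of G)."""
    T = len(preds)
    G = sum(preds) / T                                   # uniform blend at this input
    avg_err = sum((g - f) ** 2 for g in preds) / T       # avg((g_t - f)^2)
    spread = sum((g - G) ** 2 for g in preds) / T        # avg((g_t - G)^2)
    blend_err = (G - f) ** 2                             # (G - f)^2
    return avg_err, spread, blend_err

# Made-up predictions g_1(x), g_2(x), g_3(x) and target f(x).
avg_err, spread, blend_err = decompose([1.0, 2.0, 6.0], f=2.5)
```

With these numbers $G = 3$, the spread is $14/3$, the error of $G$ is $0.25$, and their sum matches the average error $14.75/3$. Since the spread term is never negative, the error of $G$ cannot exceed the average error.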

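Next, take the average of this error over all samples $\mathbf{x}_n$, writing $\mathcal{E}$ for that average. For example: $\frac{1}{N}\sum_{n = 1}^{N}\left(g_{t}(\mathrm{x}_n)-f(\mathrm{x}_n)\right)^{2} = \mathcal{E}\left(g_{t}-f\right)^{2}$ . When the average is taken over the input distribution instead of a finite sample, $\mathcal{E}\left(g_{t}-f\right)^{2}$ is $E_{\text{out}}(g_t)$.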

$\begin{aligned} \operatorname{avg}\left(\mathcal{E}\left(g_{t}-f\right)^{2}\right) &=\operatorname{avg}\left(\mathcal{E}\left(g_{t}-G\right)^{2}\right)+\mathcal{E}(G-f)^{2}\\ \operatorname{avg}\left(E_{\text {out }}\left(g_{t}\right)\right) &=\operatorname{avg}\left(\mathcal{E}\left(g_{t}-G\right)^{2}\right)+E_{\text {out }}(G) \\ & \geq E_{\text {out }}(G) \end{aligned}$

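That is, $G$ performs at least as well as the $g_t$ do on average: $E_{\text{out}}(G) \leq \operatorname{avg}\left(E_{\text{out}}(g_t)\right)$.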

Now suppose each data set $\mathcal{D}_{t}$ of size $N$ is drawn i.i.d. from $P^{N}$, and $g_{t}$ is obtained from $\mathcal{A}\left(\mathcal{D}_{t}\right)$. Repeating this infinitely many times gives $\bar g$:

$\bar{g}=\lim _{T \rightarrow \infty} G=\lim _{T \rightarrow \infty} \frac{1}{T} \sum_{t=1}^{T} g_{t}=\underset{\mathcal{D}}{\mathcal{E}} \mathcal{A}(\mathcal{D})$

Replacing $G$ with $\bar{g}$, the earlier result still holds:

$\begin{aligned} \operatorname{avg}\left(E_{\text {out }}\left(g_{t}\right)\right) &=\operatorname{avg}\left(\mathcal{E}\left(g_{t}-\bar{g}\right)^{2}\right) &+E_{\text {out }}(\bar{g}) \\ \end{aligned}$

where

$\operatorname{avg}\left(E_{\text {out }}\left(g_{t}\right)\right)$ is the expected performance of the algorithm $\mathcal{A}$.
$E_{\text {out }}(\bar{g})$ is the performance of the consensus, also called the bias.
$\operatorname{avg}\left(\mathcal{E}\left(g_{t}-\bar{g}\right)^{2}\right)$ is the expected deviation from the consensus, also called the variance.

Linear Blending

For classification:

The mathematical expression is:

$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} \alpha_{t} \cdot g_{t}(\mathbf{x})\right) \text { with } \alpha_{t} \geq 0$

As in uniform blending there are $T$ people, but person $t$ has $\alpha_t$ votes instead of one vote each.

For regression:
$\min _{\alpha_{t} \geq 0} \frac{1}{N} \sum_{n=1}^{N}\left(y_{n}-\sum_{t=1}^{T} \alpha_{t} g_{t}\left(\mathbf{x}_{n}\right)\right)^{2}$

Recall the model that combines linear regression with a nonlinear transform, which is written as:

$\min _{w_{i}} \frac{1}{N} \sum_{n=1}^{N}\left(y_{n}-\sum_{i=1}^{\tilde{d}} w_{i} \phi_{i}\left(\mathbf{x}_{n}\right)\right)^{2}$

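The two are clearly very similar.

So linear blending is linear regression that uses the hypotheses as the nonlinear transform, plus a constraint.

The optimization problem can therefore be written as: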
$\min _{\alpha_{t} \geq 0} \frac{1}{N} \sum_{n=1}^{N} \operatorname{err}\left(y_{n}, \sum_{t=1}^{T} \alpha_{t} g_{t}\left(\mathbf{x}_{n}\right)\right)$

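In practice the constraint $\alpha_t \geq 0$ is often dropped, because:
$\text { if } \alpha_{t}<0 \Rightarrow \alpha_{t} g_{t}(\mathbf{x})=\left|\alpha_{t}\right|\left(-g_{t}(\mathbf{x})\right)$
That is, a negative $\alpha_t$ means that $g_t$ has a high error and usually predicts the opposite of the truth, so negating it gives a classifier with good performance.

As in model selection, the $g_t$ are obtained from the training set, but $\alpha_t$ is best obtained from the validation set.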
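
A minimal sketch of unconstrained linear blending for regression with squared error: transform the validation inputs with the hypotheses, then solve the least-squares normal equations for $\alpha$. The helper names `solve` and `linear_blend`, the two hypotheses and the validation data are assumptions for illustration, and the sketch assumes the normal equations are non-singular.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]       # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c])) # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def linear_blend(gs, X_val, y_val):
    """Return alpha minimizing the squared error of sum_t alpha_t g_t(x) on the validation set."""
    Z = [[g(x) for g in gs] for x in X_val]              # z_n = (g_1(x_n), ..., g_T(x_n))
    T = len(gs)
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(T)] for i in range(T)]
    Zty = [sum(z[i] * y for z, y in zip(Z, y_val)) for i in range(T)]
    return solve(ZtZ, Zty)                               # normal equations: Z^T Z alpha = Z^T y

# Hypothetical hypotheses; g2 is deliberately "inverted".
g1 = lambda x: x
g2 = lambda x: -x * x
X_val = [0.0, 1.0, 2.0, 3.0]
y_val = [2 * x + 0.5 * x * x for x in X_val]             # target equals 2*g1 - 0.5*g2
alphas = linear_blend([g1, g2], X_val, y_val)
```

Here the recovered coefficients are approximately `[2, -0.5]`: the negative weight on `g2` illustrates why dropping the non-negativity constraint is harmless, since it simply uses `-g2`.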

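Stacking (Any Blending)

The uniform and linear blending described above act like a filter: each prediction is multiplied by a coefficient before being output. Viewed as a model, this is $\tilde g(g_1,g_2,\cdots,g_T) = \alpha_1 g_1 + \alpha_2 g_2 + \cdots + \alpha_T g_T$ . The general form of blending need not be restricted to a linear combination of its inputs: $\tilde g$ can itself be any hypothesis.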

The learning steps are:

From the training set $\mathcal{D}_{\text {train}}$ , obtain $g_{1}^{-}, g_{2}^{-}, \ldots, g_{T}^{-}$ , then map each validation example $\left(\mathbf{x}_{n}, y_{n}\right)$ in $\mathcal{D}_{\text {val }}$ to the $\mathcal Z$ space as $\left(\mathbf{z}_{n}=\Phi^{-}\left(\mathbf{x}_{n}\right), y_{n}\right)$ , where the mapping is $\Phi^{-}(\mathbf{x})=\left(g_{1}^{-}(\mathbf{x}), \ldots, g_{T}^{-}(\mathbf{x})\right)$
In the $\mathcal{Z}$ space, train the model (function) that blends the individual models: $\tilde{g}$ $=$ AnyModel $\left(\left\{\left(\mathbf{z}_{n}, y_{n}\right)\right\}\right)$
The final stacking model is $G_{\mathrm{ANYB}}(\mathbf{x})=\tilde{g}(\Phi(\mathbf{x}))$ , where $\Phi(\mathbf{x})=\left(g_{1}(\mathbf{x}), \ldots, g_{T}(\mathbf{x})\right)$ .
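
The steps above can be sketched as follows. As a stand-in for AnyModel this sketch uses a 1-nearest-neighbor rule in the $\mathcal Z$ space, which is only one possible choice. It also reuses the same hypotheses for $\Phi^{-}$ and $\Phi$ (no retraining on the full data), which is a simplification. All names and data are made up.

```python
def phi(gs, x):
    """Map x to z = (g_1(x), ..., g_T(x))."""
    return tuple(g(x) for g in gs)

def fit_1nn(Z, y):
    """Hypothetical AnyModel: 1-nearest-neighbor in the Z space."""
    def g_tilde(z):
        nearest = min(range(len(Z)),
                      key=lambda n: sum((a - b) ** 2 for a, b in zip(Z[n], z)))
        return y[nearest]
    return g_tilde

def any_blend(gs, X_val, y_val, fit=fit_1nn):
    """Train g_tilde on the transformed validation set and return G(x) = g_tilde(phi(x))."""
    Z = [phi(gs, x) for x in X_val]
    g_tilde = fit(Z, y_val)
    return lambda x: g_tilde(phi(gs, x))

# Two hypothetical stumps; the target is +1 only between the two thresholds.
g1 = lambda x: 1 if x > 1 else -1
g2 = lambda x: 1 if x > 3 else -1
G = any_blend([g1, g2], X_val=[0.0, 2.0, 4.0], y_val=[-1, +1, -1])
```

No choice of $\alpha_1, \alpha_2$ in $\operatorname{sign}(\alpha_1 g_1 + \alpha_2 g_2)$ fits this target, because the inputs $(-1,-1)$ and $(1,1)$ would need the same sign. The nonlinear second-level model handles it, which hints at both the power and the overfitting risk of stacking.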

Pros and cons:

Powerful: it can achieve conditional blending.
Prone to overfitting (the model complexity is too high).

Blending in Practice

On top of any blending, combine the original $g_t$ with $G$ and blend once more.

Bagging

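blending: aggregate after the $g_t$ have been obtained;
learning: obtain the $g_t$ during the aggregation (learning) process itself.

Ways to obtain diverse $g_t$ :

diversity by different models
diversity by different parameters: for example, varying the step size of gradient descent (GD)
diversity by algorithmic randomness
diversity by data randomness

Below we start from the data to make the hypotheses diverse.

How can this be done? As noted earlier, the consensus is the expected hypothesis that the algorithm produces: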

$\text { consensus } \bar { g } = \text { expected } g _ { t } \text { from } \mathcal { D } _ { t } \sim P ^ { N }$

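On average it performs at least as well as an individual $g_t$ .

The consensus has two ingredients: infinitely many $g_t$ , and a fresh data set for each of them. For the first, we settle for a finite but fairly large number of $g_t$ . For the second, we can only work from the data at hand to create diverse data sets $\mathcal{D}_t$ .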

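Bootstrap Aggregation

Bagging stands for Bootstrap Aggregation. Bootstrapping re-samples the data at hand to obtain a simulated $\mathcal{D}_t$ . It works as follows:

From the original data set $\mathcal{D}$ of size $N$ , sample $N^\prime$ times with replacement to obtain a simulated data set $\tilde{\mathcal{D}}_t$ ; this step is the bootstrap operation.
Obtain $g_t$ from $\mathcal A (\tilde{\mathcal{D}}_t)$ , then use uniform blending: $G = \operatorname {Uniform}(\{g_t\})$ .

Bootstrap aggregation is a simple meta algorithm built on top of a base algorithm $\mathcal A$ . It works reasonably well when the bootstrapped data sets are diverse and the base algorithm $\mathcal A$ is sensitive to data randomness.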
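
A minimal sketch of the two steps, for regression. The base algorithm here is a 1-nearest-neighbor regressor, chosen only because it reacts strongly to which points land in the sample; it and the toy data are assumptions, not part of the original notes.

```python
import random

def bootstrap(data, n_prime, rng):
    """Draw n_prime examples from data with replacement."""
    return [rng.choice(data) for _ in range(n_prime)]

def fit_1nn(sample):
    """Hypothetical base algorithm A: 1-nearest-neighbor regression on 1-D inputs."""
    def g(x):
        _, nearest_y = min(sample, key=lambda p: abs(p[0] - x))
        return nearest_y
    return g

def bagging(base_algorithm, data, T, rng, n_prime=None):
    """Train T hypotheses on bootstrapped data sets and blend them uniformly."""
    n_prime = len(data) if n_prime is None else n_prime
    gs = [base_algorithm(bootstrap(data, n_prime, rng)) for _ in range(T)]
    return lambda x: sum(g(x) for g in gs) / T

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]   # made-up (x, y) pairs
rng = random.Random(0)                                     # fixed seed for repeatability
sample = bootstrap(data, len(data), rng)
G = bagging(fit_1nn, data, T=25, rng=rng)
```

Each bootstrapped sample typically repeats some examples and omits others, so the 25 nearest-neighbor regressors differ, and `G` averages their predictions.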

Adaptive Boosting

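Adaptive Boosting (AdaBoost) is an aggregation algorithm that starts from bootstrap, the core of Bagging. The details are as follows:

Weighted Base Algorithm

Constructing a bootstrapped data set amounts to giving different examples different weights. In other words, re-sampling is equivalent to re-weighting:

Suppose the re-sampling is: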

$\begin{aligned} \mathcal { D } = \left\{ \left( \mathbf { x } _ { 1 } , y _ { 1 } \right) , \left( \mathbf { x } _ { 2 } , y _ { 2 } \right) , \left( \mathbf { x } _ { 3 } , y _ { 3 } \right) , \left( \mathbf { x } _ { 4 } , y _ { 4 } \right) \right\} \\ \stackrel { \text { bootstrap } } { \Longrightarrow } \tilde { \mathcal { D } } _ { t } = \left\{ \left( \mathbf { x } _ { 1 } , y _ { 1 } \right) , \left( \mathbf { x } _ { 1 } , y _ { 1 } \right) , \left( \mathbf { x } _ { 2 } , y _ { 2 } \right) , \left( \mathbf { x } _ { 4 } , y _ { 4 } \right) \right\} \end{aligned}$

The error on the bootstrapped data set is computed as:
$E _ { \mathrm { in } } ^ { 0 / 1 } ( h ) = \frac { 1 } { 4 } \sum _ { ( \mathbf { x } , y ) \in \tilde { D } _ { t } } \left[\kern-0.15em\left[ y \neq h ( \mathbf { x } ) \right]\kern-0.15em\right]$

Equivalently, as a weighted error on the original data set:

$E _ { \mathrm { in } } ^ { \mathrm { u }^{(t)} } ( h ) = \frac { 1 } { 4 } \sum _ { n = 1 } ^ { 4 } u _ { n } ^ { ( t ) } \cdot \left[\kern-0.15em\left[ y _ { n } \neq h \left( \mathbf { x } _ { n } \right) \right]\kern-0.15em\right]$

where $u_1 = 2,u_2 = 1,u_3 = 0,u_4 = 1$ .

Each $g_t$ in the bag is therefore obtained by minimizing the bootstrap-weighted error.

So the weighted base algorithm minimizes:

$E _ { \mathrm { in } } ^ { \mathrm { u } } ( h ) = \frac { 1 } { N } \sum _ { n = 1 } ^ { N } u _ { n } \cdot \operatorname { err } \left( y _ { n } , h \left( \mathbf { x } _ { n } \right) \right)$
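
The equivalence between re-sampling and re-weighting can be checked on the four-example case above. The labels and the hypothesis `h` below are made up; only the sampling pattern (the first example twice, the third never) and the weights $u = (2, 1, 0, 1)$ come from the text.

```python
def error_01(h, sample):
    """Plain 0/1 error: fraction of examples in the sample that h gets wrong."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def weighted_error_01(h, data, u):
    """Weighted 0/1 error: (1/N) * sum of u_n over the examples that h gets wrong."""
    return sum(u_n for (x, y), u_n in zip(data, u) if h(x) != y) / len(data)

data = [(1, +1), (2, -1), (3, +1), (4, -1)]     # hypothetical D with made-up labels
boot = [data[0], data[0], data[1], data[3]]     # bootstrap sample: x1 twice, x3 never
u = [2, 1, 0, 1]                                # how often each example was drawn

h = lambda x: 1 if x == 3 else -1               # hypothetical h, wrong only on x1
```

Because `h` errs only on the first example, which was drawn twice, both the bootstrap error and the weighted error are 2/4, while the plain error on the original data is 1/4.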

Re-weighting is therefore another viable way to obtain diverse $g_t$ :

Suppose two consecutive hypotheses are obtained as:
$\begin{aligned} g _ { t } & \leftarrow \underset { h \in \mathcal { H } } { \operatorname { argmin } } \left( \sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t ) } \left[ \kern-0.15em \left[ y _ { n } \neq h \left( \mathbf { x } _ { n } \right) \right]\kern-0.15em\right] \right) \\ g _ { t + 1 } & \leftarrow \underset { h \in \mathcal { H } } { \operatorname { argmin } } \left( \sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t + 1 ) } \left[ \kern-0.15em \left[ y _ { n } \neq h \left( \mathbf { x } _ { n } \right) \right]\kern-0.15em\right] \right) \end{aligned}$

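When will the two classifiers be very different? When $g_t$ performs well under the weights $u _ { n } ^ { ( t ) }$ but poorly under the weights $u _ { n } ^ { ( t + 1) }$ . The target is to make $g_t$ no better than random guessing under $u _ { n } ^ { ( t + 1) }$ , that is, to give it a 50% chance of predicting correctly. So the desired effect is: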

$\begin{aligned} \frac {\sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t + 1 ) } \left[ \kern-0.15em \left[y _ { n } \neq g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] } { \sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t + 1 ) } } &= \frac { \square_ { t + 1 } } { \square_ { t + 1 } + \bigcirc_{ t + 1 } } = \frac { 1 } { 2 } , \text { where } \\ \square_ { t + 1 } &= \sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t + 1 ) } \left[ \kern-0.15em \left[y _ { n } \neq g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right]\\ \bigcirc_{ t + 1 } &= \sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t + 1 ) } \left[ \kern-0.15em \left[y _ { n } = g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] \end{aligned}$

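This can be achieved by re-scaling (multiplying) the weights. Here $\square_t$ and $\bigcirc_t$ denote the total $u^{(t)}$ weight of the examples that $g_t$ classifies incorrectly and correctly, respectively:

For examples that $g_t$ classifies incorrectly:

$u _ { n } ^ { ( t + 1 ) } = \bigcirc_{ t } \cdot u _ { n } ^ { ( t ) }$

For examples that $g_t$ classifies correctly:

$u _ { n } ^ { ( t + 1 ) } = \square_{ t } \cdot u _ { n } ^ { ( t ) }$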
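
To see why this re-scaling gives the desired ratio of $\frac{1}{2}$, sum the new weights over each group. This short check takes $\square_t$ and $\bigcirc_t$ to be the total $u^{(t)}$ weight of the examples that $g_t$ gets wrong and right, which is an assumption about the notation:

$\begin{aligned} \square_{t+1} &= \sum_{n:\, y_n \neq g_t(\mathbf{x}_n)} \bigcirc_t \cdot u_n^{(t)} = \bigcirc_t \cdot \square_t \\ \bigcirc_{t+1} &= \sum_{n:\, y_n = g_t(\mathbf{x}_n)} \square_t \cdot u_n^{(t)} = \square_t \cdot \bigcirc_t \\ \frac{\square_{t+1}}{\square_{t+1}+\bigcirc_{t+1}} &= \frac{\bigcirc_t \square_t}{2\, \bigcirc_t \square_t} = \frac{1}{2} \end{aligned}$

The last step requires $\square_t$ and $\bigcirc_t$ to both be non-zero, meaning $g_t$ is neither perfect nor always wrong.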

How is this done in practice? This is where the scaling factor comes in.

Scaling Factor

The error rate $\epsilon _ { t }$ is defined as:
$\epsilon _ { t } = \frac {\sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t ) } \left[ \kern-0.15em \left[y _ { n } \neq g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] } { \sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t) } }$

The scaling factor is defined as:
$\mathbf { \star } _ { t } = \sqrt { \frac { 1 - \epsilon _ { t } } { \epsilon _ { t } } }$

Then:

$\begin{aligned} \left[ \kern-0.15em \left[y _ { n } \neq g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] \quad u ^ { ( t + 1 ) } _ n &\leftarrow u ^ { ( t ) } _ n \cdot \mathbf { \star } _ { t } \\ \left[ \kern-0.15em \left[y _ { n } = g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] \quad u ^ { ( t + 1 ) } _ n &\leftarrow u ^ { ( t ) } _n / \mathbf { \star } _ { t } \end{aligned}$
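
A small numeric check of this update rule: after multiplying the weights of misclassified examples by the scaling factor and dividing the others by it, the weighted error rate of the same $g_t$ becomes one half. The starting weights and the mistake pattern are made up, and the sketch assumes $0 < \epsilon_t < 1$.

```python
import math

def weighted_error_rate(u, wrong):
    """epsilon: total weight of misclassified examples divided by total weight."""
    return sum(w for w, bad in zip(u, wrong) if bad) / sum(u)

def rescale(u, wrong):
    """Multiply weights of mistakes by the scaling factor, divide the rest by it.

    Assumes 0 < epsilon < 1, otherwise the scaling factor is undefined or zero.
    """
    eps = weighted_error_rate(u, wrong)
    star = math.sqrt((1 - eps) / eps)
    return [w * star if bad else w / star for w, bad in zip(u, wrong)]

u_t = [0.25, 0.25, 0.25, 0.25]          # made-up weights u^(t)
wrong = [True, False, False, False]     # g_t misclassifies only the first example
u_next = rescale(u_t, wrong)            # weights u^(t+1)
```

Here $\epsilon_t = 0.25$ and the scaling factor is $\sqrt{3}$, so the single mistake ends up carrying as much weight as the three correct examples together.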

Linear Aggregation on the Fly

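With the above in place, we can design an aggregation algorithm driven by data diversity. AdaBoost adds one more step, linear aggregation on the fly, which obtains the linear blending coefficients $\alpha_t$ during learning: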

$\alpha_t = \ln(\mathbf { \star } _ { t })$

When $\epsilon _ { t } \rightarrow 0$ , $\mathbf { \star } _ { t } \rightarrow \infty, \ln(\mathbf { \star } _ { t }) \rightarrow \infty$ : with no errors the weight is infinite, because the current $g_t$ can do the job on its own.
When $\epsilon _ { t } = \frac{1}{2}$ , $\mathbf { \star } _ { t } = 1, \ln(\mathbf { \star } _ { t }) = 0$ : with an error rate of 1/2 the weight is zero, because $g_t$ is as useless as a random guess.
When $\epsilon _ { t } \rightarrow 1$ , $\mathbf { \star } _ { t } \rightarrow 0, \ln(\mathbf { \star } _ { t }) \rightarrow -\infty$ : when every prediction is wrong the weight tends to negative infinity, because negating the output of $g_t$ gives a highly accurate classifier.
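
The three cases can be checked numerically with a one-line function. The name `alpha` and the sample error rates are chosen for illustration; the function assumes $0 < \epsilon_t < 1$.

```python
import math

def alpha(eps):
    """Blending coefficient ln(sqrt((1 - eps) / eps)) for a weighted error rate 0 < eps < 1."""
    return math.log(math.sqrt((1 - eps) / eps))

a_good = alpha(0.1)     # better than random: positive weight
a_random = alpha(0.5)   # as good as random: zero weight
a_bad = alpha(0.9)      # worse than random: negative weight
```

The coefficient is antisymmetric around $\epsilon_t = \frac{1}{2}$: a hypothesis with error rate 0.9 gets the same magnitude as one with error rate 0.1, with the opposite sign.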

AdaBoost Steps

$u^{(1)} = \left[\frac{1}{N},\cdots,\frac{1}{N}\right]$
for $t = 1,\cdots,T$

Obtain $g _ { t }$ from $\mathcal { A } \left( \mathcal { D } , \mathbf { u } ^ { ( t ) } \right)$ , where $\mathcal { A }$ minimizes the error weighted by $\mathbf { u } ^ { ( t ) }$ .
Update $\mathbf { u } ^ { ( t ) }$ to $\mathbf { u } ^ { ( t+1 ) }$ :
$\begin{aligned} \left[ \kern-0.15em \left[y _ { n } \neq g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] \quad u ^ { ( t + 1 ) } _ n &\leftarrow u ^ { ( t ) } _ n \cdot \mathbf { \star } _ { t } \\ \left[ \kern-0.15em \left[y _ { n } = g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] \quad u ^ { ( t + 1 ) } _ n &\leftarrow u ^ { ( t ) } _n / \mathbf { \star } _ { t } \end{aligned}$

where, computed from the current weights before the update, $\epsilon _ { t } = \frac {\sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t ) } \left[ \kern-0.15em \left[y _ { n } \neq g _ { t } \left( \mathbf { x } _ { n } \right) \right] \kern-0.15em \right] } { \sum _ { n = 1 } ^ { N } u _ { n } ^ { ( t ) } }$ and $\mathbf { \star } _ { t } = \sqrt { \frac { 1 - \epsilon _ { t } } { \epsilon _ { t } } }$
Compute the linear blending coefficient $\alpha_t = \ln(\mathbf { \star } _ { t })$
Return the final hypothesis: $G ( \mathbf { x } ) = \operatorname { sign } \left( \sum _ { t = 1 } ^ { T } \alpha _ { t } g _ { t } ( \mathbf { x } ) \right)$
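
The loop above can be sketched in plain Python. The base algorithm `stump_learner` is a brute-force one-dimensional decision stump, which is an assumption for illustration (any weighted base algorithm would do). The early return for $\epsilon_t = 0$ is also a choice of this sketch, since $\alpha_t$ would be infinite. The toy data are made up.

```python
import math

def stump_learner(X, y, u):
    """Hypothetical weighted base algorithm: best 1-D stump s * sign(x - theta)."""
    xs = sorted(set(X))
    thetas = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for theta in thetas:
        for s in (+1, -1):
            err = sum(w for x, t, w in zip(X, y, u)
                      if s * (1 if x > theta else -1) != t)
            if best is None or err < best[0]:
                best = (err, s, theta)
    _, s, theta = best
    return lambda x: s * (1 if x > theta else -1)

def adaboost(X, y, T, base=stump_learner):
    N = len(X)
    u = [1.0 / N] * N                                   # u^(1)
    gs, alphas = [], []
    for _ in range(T):
        g = base(X, y, u)                               # g_t = A(D, u^(t))
        wrong = [g(x) != t for x, t in zip(X, y)]
        eps = sum(w for w, bad in zip(u, wrong) if bad) / sum(u)
        if eps == 0:                                    # perfect g_t: alpha_t would be infinite
            return g
        star = math.sqrt((1 - eps) / eps)               # scaling factor
        u = [w * star if bad else w / star for w, bad in zip(u, wrong)]
        gs.append(g)
        alphas.append(math.log(star))                   # alpha_t
    return lambda x: 1 if sum(a * g(x) for a, g in zip(alphas, gs)) >= 0 else -1

# Made-up 1-D data that no single stump can fit: negative only in the middle.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [+1, +1, -1, -1, +1, +1]
G = adaboost(X, y, T=3)
```

On this toy data every single stump misclassifies at least two of the six points, but three rounds are enough for the weighted vote to classify all six correctly.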

Theoretical Guarantee

The VC bound for AdaBoost is:
$E _ { \mathrm { out } } ( G ) \leq E _ { \mathrm { in } } ( G ) + O ( \sqrt { \underbrace { O \left( d _ { \mathrm { vc } } ( \mathcal { H } ) \cdot T \log T \right) }_{d_{\mathbf{vc}} \text{ of all possible } G} \cdot \frac { \log N } { N } } )$

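The original authors proved that $E_{\text{in}}(G) = 0$ is reached after only $T= O(\log N)$ iterations, provided the base model is always better than random guessing: $\epsilon _ { t } \leq \epsilon < \frac { 1 } { 2 }$ .

In other words, if the base model $g$ is weak but always better than random, then the $G$ obtained from AdaBoost + $\mathcal A$ is strong.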

Decision Stump

The mathematical expression is:
$h _ { s , i , \theta } ( \mathbf { x } ) = s \cdot \operatorname { sign } \left( x _ { i } - \theta \right)$

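There are three parameters: feature $i$ , threshold $\theta$ , and direction $s$ . A stump is a single cut point: the feature $i$ says which dimension is cut, the threshold $\theta$ says where the cut lies along that dimension, and the direction $s$ decides which side of the cut is labeled positive.

It is a weak model, but using it as the base model of AdaBoost can give high-accuracy predictions, and it is efficient to train: the time complexity is $O(d \cdot N \log N)$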
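
One way to reach the $O(d \cdot N \log N)$ cost is to sort each dimension once and then sweep the threshold from left to right, updating the weighted error in constant time per step. The sketch below follows that idea. The function name `best_stump`, the choice of midpoints as thresholds and the toy data are assumptions.

```python
def best_stump(X, y, u):
    """Return (weighted_error, s, i, theta) minimizing the total weight of mistakes
    of h(x) = s * sign(x[i] - theta), with thresholds placed between distinct values."""
    N, d = len(X), len(X[0])
    total = sum(u)
    best = None
    for i in range(d):
        order = sorted(range(N), key=lambda n: X[n][i])          # O(N log N) per dimension
        # Threshold below every point: s = +1 predicts +1 everywhere.
        err_pos = sum(u[n] for n in range(N) if y[n] != +1)
        candidates = [(err_pos, X[order[0]][i] - 1.0)]
        for k in range(N - 1):
            n = order[k]
            # Moving theta past point n flips its prediction from +1 to -1 (for s = +1).
            err_pos += u[n] if y[n] == +1 else -u[n]
            lo, hi = X[n][i], X[order[k + 1]][i]
            if lo < hi:                                          # skip ties in feature value
                candidates.append((err_pos, (lo + hi) / 2))
        for err, theta in candidates:
            # Direction s = -1 makes exactly the complementary mistakes.
            for e, s in ((err, +1), (total - err, -1)):
                if best is None or e < best[0]:
                    best = (e, s, i, theta)
    return best

# Made-up 2-D data, separable along the first feature only.
X = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
y = [-1, -1, +1, +1]
u = [0.25, 0.25, 0.25, 0.25]
result = best_stump(X, y, u)
```

For the toy data the sweep finds a zero-error stump on feature 0 with threshold 2.5 and direction +1. The weights `u` play the role of the AdaBoost weights $\mathbf{u}^{(t)}$, so the same routine can serve as a weighted base algorithm.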

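If Decision Stump is used as the base model of AdaBoost, suppose a simple data set is distributed as follows:

Then one possible state of the learning process looks like this: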

Machine Learning Techniques: Aggregation Model
