Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

M. Courbariaux, et al., Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1, (2016).

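Abstract

Binarized Neural Networks (BNNs): at run-time, both the weights and the activations are binary; at train-time, the binary weights and activations are used to compute the parameter gradients.

A method for training BNNs.

In the forward pass, BNNs reduce memory size and accesses, and most arithmetic operations become bit-wise operations, which can effectively reduce power consumption.

Introduction

1 Binarized Neural Networks

1.1 Deterministic vs Stochastic Binarization

When training a BNN, both the weights and the activations are constrained to $\pm 1$.

Binarization functions:

(1) Deterministic binarization function: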

$x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if} \ x \geq 0 \\ -1 & \text{otherwise} \end{cases} \tag{1}$

(2) Stochastic binarization function:

$x^b = \begin{cases} +1 & \text{with probability} \ p = \sigma(x) \\ -1 & \text{with probability} \ 1 - p \end{cases} \tag{2}$

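where $\sigma$ is the "hard sigmoid" function:

$\sigma(x) = \mathrm{clip}(\frac{x + 1}{2}, 0, 1) = \max(0, \min(1, \frac{x + 1}{2})) \tag{3}$

The stochastic binarization function is more appealing than the deterministic one, but it requires hardware that generates random bits, which makes it harder to apply in practice.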
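The two binarization functions and the hard sigmoid can be sketched in a few lines of Python. This is a minimal scalar illustration of equations (1) and (2), not the authors' implementation; the function names and the `rng` parameter are assumptions introduced here.

```python
import random

def hard_sigmoid(x):
    """Hard sigmoid: clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))

def binarize_deterministic(x):
    """Equation (1): +1 if x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1

def binarize_stochastic(x, rng=random):
    """Equation (2): +1 with probability hard_sigmoid(x), otherwise -1.

    `rng` is a hypothetical parameter; any object with a random() method
    returning a float in [0, 1) works.
    """
    return 1 if rng.random() < hard_sigmoid(x) else -1
```

For inputs with $|x| \geq 1$ the stochastic version is no longer random, because the hard sigmoid saturates at 0 or 1.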

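1.2 Gradient Computation and Accumulation

The gradients of the weights are real-valued and are accumulated in real-valued variables.

Stochastic Gradient Descent (SGD) explores the parameter space in small, noisy steps; the noise is averaged out by the stochastic gradient contributions accumulated in each weight.

Adding noise to the weights and activations when computing the parameter gradients acts as a form of regularization, which helps the model generalize.

The method for training BNNs can be viewed as a variant of Dropout: Dropout randomly sets activations to zero, whereas this method binarizes both the weights and the activations.

1.3 Propagating Gradients Through Discretization

Sign function quantization: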

$q = \mathrm{Sign}(r)$

Assume that an estimator $g_q$ of the gradient $\frac{\partial C}{\partial q}$ is known. The straight-through estimator of the gradient $\frac{\partial C}{\partial r}$ is then

$g_r = g_q 1_{|r| \leq 1} \tag{4}$

This preserves the gradient information, but cancels the gradient when $r$ is too large ($|r| > 1$).
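The forward quantization and the straight-through estimator of equation (4) can be sketched as follows. This is a scalar illustration under the definitions above; the function names are assumptions, not taken from the paper's code.

```python
def sign(r):
    """Forward pass: q = Sign(r), with Sign(0) = +1 as in equation (1)."""
    return 1 if r >= 0 else -1

def ste_backward(g_q, r):
    """Backward pass: g_r = g_q * 1_{|r| <= 1}.

    The incoming gradient passes through unchanged when |r| <= 1
    and is cancelled otherwise.
    """
    return g_q if abs(r) <= 1 else 0.0
```

For example, `ste_backward(0.7, 0.3)` passes the gradient through, while `ste_backward(0.7, 2.5)` returns zero because the pre-activation is saturated.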

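Algorithm 1: Training a BNN

$C$: the loss function for the minibatch
$\lambda$: the learning rate decay factor
$L$: the number of layers
$\circ$: element-wise multiplication

Binarize(): specifies how to binarize the weights and activations (deterministically or stochastically);
Clip(): specifies how to clip the weights;
BatchNorm(): specifies how to batch-normalize the activations;
BackBatchNorm(): specifies how to backpropagate gradients through the normalization layer;
Update(): specifies how to update the parameters once their gradients are known (ADAM or AdaMax).

The derivative $1_{|r| \leq 1}$ can be viewed as propagating the gradient through the "hard tanh", expressed as the piece-wise linear activation function: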

$\mathrm{Htanh}(x) = \mathrm{Clip}(x, -1, 1) = \max(-1, \min(1, x)) \tag{5}$
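The link between hard tanh and the indicator $1_{|r| \leq 1}$ can be checked numerically. The sketch below defines Htanh as in equation (5) and estimates its slope with a central difference; the helper `slope` and its step size `h` are assumptions made for illustration.

```python
def hard_tanh(x):
    """Equation (5): Htanh(x) = max(-1, min(1, x))."""
    return max(-1.0, min(1.0, x))

def slope(f, x, h=0.25):
    """Central-difference estimate of f'(x); h is an illustrative step size."""
    return (f(x + h) - f(x - h)) / (2 * h)
```

Inside the linear region the slope is 1 (for example at `x = 0.5`), and in the saturated region it is 0 (for example at `x = 2.0`), which matches the indicator away from the kinks at $\pm 1$.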

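Hidden units use the sign function non-linearity to obtain binary activations. The weights are handled in two steps:

(1) Constrain each real-valued weight to lie between $-1$ and $+1$: when a weight update brings $w^r$ outside $[-1, +1]$, project $w^r$ onto $-1$ or $+1$, i.e., clip the weights during training.

(2) When using a weight $w^r$, binarize it: $w^b = \mathrm{Sign}(w^r)$.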
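The two steps above can be sketched as a single update on a list of real-valued weights. Plain SGD is used here as a stand-in for Update(), which is an assumption made for brevity (the notes mention ADAM and AdaMax); the function names and the `lr` parameter are likewise hypothetical.

```python
def clip(w, lo=-1.0, hi=1.0):
    """Project a real-valued weight back into [lo, hi]."""
    return max(lo, min(hi, w))

def sgd_update_and_binarize(w_real, grads, lr):
    """Update the real-valued weights, clip them, then binarize for use.

    Plain SGD stands in for Update() here (an assumption). The real-valued
    weights are kept between steps; the binary copy is what the forward
    pass would use.
    """
    w_real = [clip(w - lr * g) for w, g in zip(w_real, grads)]
    w_bin = [1 if w >= 0 else -1 for w in w_real]
    return w_real, w_bin
```

With `w_real = [0.75, -0.25, 0.0]`, `grads = [-1.0, 1.0, 0.5]` and `lr = 0.5`, the first weight would move to 1.25 and is clipped back to 1.0 before binarization.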

1.4 Shift based Batch Normalization

Algorithm 2: Shift-based batch normalization (SBN)

$AP2(x) = \mathrm{Sign}(x) \times 2^{\mathrm{round}(\log_2 |x|)}$: the approximate power-of-2 of $x$;
$\ll \gg$: binary shift, either left or right.

[Note: the formula as given by the authors is written incorrectly.]

Derivation:

$x = \mathrm{Sign}(x) 2^{\log_2(|x|)} \approx \mathrm{Sign}(x) \times 2^{\mathrm{round}(\log_2 |x|)}$

(1) BatchNorm:

Mean: $\mu_B = \frac{1}{m} \sum_{i = 1}^{m} x_i$
Variance: $\sigma_B^2 = \frac{1}{m} \sum_{i = 1}^{m} (x_i - \mu_B)^2$
Normalization: $\hat{x}_i = \frac{x_i - \mu_B}{\sigma_B}$
Scale and shift: $y_i = \gamma \hat{x}_i + \beta$

(2) Shift-based BatchNorm:

$C(x_i) = x_i - \mu_B$, $\sigma_B^2 = \frac{1}{m} \sum_{i = 1}^{m} C^2(x_i)$

Approximate the squaring operation ($C^2(x_i)$) with a shift:

$\begin{aligned} C^2(x_i) = & C(x_i) \times \mathrm{Sign}(C(x_i)) 2^{\log_2 |C(x_i)|} \\ \approx & C(x_i) \times \mathrm{Sign}(C(x_i)) 2^{\mathrm{round}(\log_2 |C(x_i)|)} \\ = & |C(x_i)| 2^{\mathrm{round}(\log_2 |C(x_i)|)} \\ = & |C(x_i)| \ll \gg \mathrm{round}(\log_2 |C(x_i)|) \\ \end{aligned}$

Approximate variance:

$\sigma_B^2 \approx \frac{1}{m} \sum_{i = 1}^{m} |C(x_i)| \ll \gg \mathrm{round}(\log_2 |C(x_i)|)$

Approximate the division ($C(x_i) / \sigma_B$) with a shift. Since $x \ll \gg k$ stands for $x \cdot 2^k$, dividing by $2^k$ is a shift by $-k$:

$\begin{aligned} \frac{C(x_i)}{\sigma_B} = & \frac{C(x_i)}{\mathrm{Sign}(\sigma_B) 2^{\log_2 |\sigma_B|}} \\ \approx & \frac{C(x_i)}{\mathrm{Sign}(\sigma_B) 2^{\mathrm{round}(\log_2 |\sigma_B|)}} \\ = & \frac{C(x_i)}{2^{\mathrm{round}(\log_2 \sigma_B)}} \\ = & C(x_i) \ll \gg (-\mathrm{round}(\log_2 \sigma_B)) \\ \end{aligned}$

Normalization:

$\hat{x}_i \approx C(x_i) \ll \gg (-\mathrm{round}(\log_2 \sigma_B))$

Scale and shift:

$y_i = \mathrm{Sign}(\gamma) \hat{x}_i \ll \gg \mathrm{round}(\log_2 |\gamma|) + \beta$
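The derivation above can be sketched in Python, with `math.ldexp(x, k)` (which computes $x \cdot 2^k$) standing in for the hardware shift $\ll \gg$. This is a minimal sketch under stated assumptions, not the authors' Algorithm 2: the function names, the `eps` stabilizer, and the handling of zero deviations are all introduced here, and `gamma` is assumed to be non-zero.

```python
import math

def ap2_exponent(x):
    """k = round(log2|x|), so that AP2(x) = Sign(x) * 2**k (x must be non-zero)."""
    return round(math.log2(abs(x)))

def shift(x, k):
    """x <<>> k: multiply by 2**k (left shift for k > 0, right shift for k < 0)."""
    return math.ldexp(x, k)

def shift_batch_norm(xs, gamma, beta, eps=1e-4):
    """Shift-based batch normalization of a list of floats.

    `eps` is a hypothetical stabilizer so that log2 never sees zero.
    """
    m = len(xs)
    mu = sum(xs) / m
    centered = [x - mu for x in xs]
    # Approximate C^2 by |C| <<>> round(log2|C|); zero deviations contribute 0.
    var = sum(shift(abs(c), ap2_exponent(c)) for c in centered if c != 0.0) / m
    sigma = math.sqrt(var + eps)
    # Division by sigma becomes a shift by -round(log2 sigma).
    k_sigma = ap2_exponent(sigma)
    x_hat = [shift(c, -k_sigma) for c in centered]
    # Multiplication by gamma becomes a sign flip plus a shift.
    k_gamma = ap2_exponent(gamma)
    sign_gamma = 1.0 if gamma >= 0 else -1.0
    return [sign_gamma * shift(xh, k_gamma) + beta for xh in x_hat]
```

For the batch `[1.0, 3.0, 5.0, 7.0]` the approximate variance is 6.5 (the exact value is 5), and the standard deviation is replaced by the power of two $2^1$, so every centered value is simply halved.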

1.5 Shift based AdaMax

Algorithm 4: Shift-based AdaMax optimizer

$g_t^2$: the element-wise square, $g_t \circ g_t$
Default settings: $\alpha = 2^{-10}$, $1 - \beta_1 = 2^{-3}$, $1 - \beta_2 = 2^{-10}$
All operations on vectors are element-wise.
$\beta_1^t$, $\beta_2^t$: $\beta_1$ and $\beta_2$ to the power $t$
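The body of the algorithm is not reproduced in these notes, so the following is only a rough sketch of the idea: it takes the standard AdaMax update rule and replaces the multiplication by the step size and the division by the running maximum with power-of-2 shifts. How the shifts are combined, the function names, and the zero-gradient guard are assumptions, and the sketch should not be read as the paper's exact Algorithm 4.

```python
import math

def ap2_exponent(x):
    """k = round(log2|x|), so that AP2(x) = Sign(x) * 2**k (x must be non-zero)."""
    return round(math.log2(abs(x)))

def shift_adamax_step(theta, g, m, u, t,
                      alpha=2**-10, beta1=1 - 2**-3, beta2=1 - 2**-10):
    """One scalar AdaMax step with shifts in place of multiply/divide.

    theta: parameter, g: gradient, m: first-moment estimate,
    u: running max of |g|, t: step counter starting at 1.
    """
    m = beta1 * m + (1 - beta1) * g
    u = max(beta2 * u, abs(g))
    if u == 0.0:
        return theta, m, u  # no gradient seen yet; skip the update
    step_size = alpha / (1 - beta1 ** t)  # bias-corrected learning rate
    # theta -= step_size * m / u, with both factors rounded to powers of 2.
    k = ap2_exponent(step_size) - ap2_exponent(u)
    theta = theta - math.ldexp(m, k)
    return theta, m, u
```

With the default settings every constant is a power of two, so at `t = 1` the bias-corrected step size is exactly $2^{-10} / 2^{-3} = 2^{-7}$ and the rounding introduces no error.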

1.6 First Layer

The output of each layer is the input of the next, so the inputs of all layers except the first are binary.

Continuous-valued inputs can be handled as fixed point numbers; an 8-bit fixed point input can be expressed as:

$s = x \cdot w^b$

$s = \sum_{n = 1}^8 2^{n - 1} (x^n \cdot w^b)$

where $x$ is a vector of 1024 8-bit inputs, $x_1^8$ is the most significant bit of the first input, $w^b$ is a vector of 1024 1-bit weights, and $s$ is the resulting weighted sum.
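The bit-plane decomposition above can be verified with a short sketch. It assumes unsigned 8-bit integer inputs and weights given as $\pm 1$; the function name and the `bits` parameter are introduced here for illustration.

```python
def bitplane_dot(x, w_b, bits=8):
    """s = sum_{n=1}^{bits} 2^(n-1) * (x^n . w^b).

    x: list of unsigned integers below 2**bits (assumed input format),
    w_b: list of +1/-1 weights. x^n is the n-th bit plane of x, where
    n = 1 is the least significant bit and n = bits the most significant.
    """
    s = 0
    for n in range(1, bits + 1):
        plane = [(xi >> (n - 1)) & 1 for xi in x]  # n-th bit of each input
        s += (1 << (n - 1)) * sum(b * w for b, w in zip(plane, w_b))
    return s
```

Each bit plane is a binary vector, so the inner dot product only needs the same 1-bit operations as the other layers, and the result equals the ordinary dot product `sum(xi * wi)`.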

Algorithm 5: BNN prediction

2 Benchmark Results

2.1 Multi-Layer Perceptron (MLP) on MNIST (Theano)

2.2 MLP on MNIST (Torch7)

2.3 ConvNet on CIFAR-10 (Theano)

2.4 ConvNet on CIFAR-10 (Torch7)

2.5 ConvNet on SVHN

3 Very Power Efficient in Forward Pass

3.1 Memory Size and Accesses

3.2 XNOR-Count

1-bit XNOR-count operations
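A 1-bit XNOR-count operation can be illustrated on two $\pm 1$ vectors packed into integers. This is a sketch under the assumption that $+1$ is encoded as bit 1 and $-1$ as bit 0; the function names are hypothetical, and real implementations would use fixed-width words and a hardware popcount.

```python
def pack_bits(v):
    """Pack a +1/-1 vector into an int: bit i is 1 when v[i] == +1."""
    word = 0
    for i, value in enumerate(v):
        if value == 1:
            word |= 1 << i
    return word

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the signs agree; each agreement
    contributes +1 and each disagreement -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    matches = bin(xnor).count("1")  # population count
    return 2 * matches - n
```

For `a = [1, -1, 1, 1, -1]` and `b = [1, 1, -1, 1, -1]` the signs agree in three of five positions, so the dot product is $2 \cdot 3 - 5 = 1$.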

3.3 Exploiting Filter Repetitions

The number of unique binary filters is bounded by the filter size. For example, a $3 \times 3$ binary filter has 9 entries, so at most $2^9 = 512$ unique 2D filters exist.


4 Seven Times Faster on GPU at Run-Time

5 Discussion and Related Work
