Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

Introduction

This work introduces Ristretto, a framework for automated neural network approximation. It is open source and built on Caffe.

Convolutional Neural Networks

Layer Types

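Convolution layer: extracts features, but requires a large number of computations, far more than the number of parameters in the layer.

Fully connected layer: extracts features, but accounts for a large share of the model's parameters.

Rectified Linear Unit (ReLU): lets the model learn nonlinear features.

Normalization layers: Local Response Normalization (LRN) normalizes the feature maps; Batch Normalization is the other common choice. The values inside these layers differ greatly in magnitude from those in other layers (on the order of $2^{14}$), so the paper mainly quantizes the parameters of the other layers.

Pooling: reduces the size of the features and encodes translation invariance. It also reduces the number of parameters and the computational complexity. The usual choice is max pooling; average pooling and L2-norm pooling also exist. Because this layer is so simple, it is not quantized.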

Computational Complexity and Memory Requirements

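The complexity of deep CNNs comes from two places: convolutional layers account for more than 90% of the arithmetic operations, and fully connected layers account for more than 90% of the network parameters. An efficient CNN accelerator therefore has to provide enough compute throughput and enough memory bandwidth to keep the processing elements from ever sitting idle. For this reason the paper focuses on how to quantize these two layer types.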
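The split between compute and parameters can be made concrete with a rough count. The sketch below uses illustrative layer sizes that are assumptions, not numbers taken from the paper, and the helper names `conv_cost` and `fc_cost` are hypothetical.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Return (parameters, MACs) of a k x k convolution layer, biases ignored."""
    params = c_in * c_out * k * k
    # Every kernel weight is reused once per output position.
    macs = params * h_out * w_out
    return params, macs


def fc_cost(n_in, n_out):
    """Return (parameters, MACs) of a fully connected layer, biases ignored."""
    params = n_in * n_out
    # Every weight is used exactly once per forward pass.
    return params, params


# Illustrative sizes only (assumed, not from the paper).
conv_params, conv_macs = conv_cost(c_in=96, c_out=256, k=5, h_out=27, w_out=27)
fc_params, fc_macs = fc_cost(n_in=9216, n_out=4096)

print("conv: params =", conv_params, "MACs =", conv_macs)
print("fc:   params =", fc_params, "MACs =", fc_macs)
```

With these sizes the convolution layer has far fewer parameters than the fully connected layer but performs many more MACs, which is the imbalance described above.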

Neural Networks With Limited Numerical Precision

For a given full-precision network, quantization proceeds in three main steps:

1: Quantization of the layer input and weights to reduced precision format (using m and n bits for number representation, respectively)

2: Perform the MAC (multiplication-and-accumulation) operations using the quantized values

3: The final result is again quantized
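The three steps above can be sketched for a single dot product. This is a minimal illustration, assuming a signed fixed-point grid with saturation; the helper names and the choice of fractional bits (`in_frac`, `w_frac`, `out_frac`) are hypothetical and not taken from Ristretto.

```python
import math


def quantize(x, frac_bits, total_bits):
    """Round x to a signed fixed-point grid with step 2**-frac_bits, saturating."""
    step = 2.0 ** -frac_bits
    q_max = (2 ** (total_bits - 1) - 1) * step
    q_min = -(2 ** (total_bits - 1)) * step
    q = math.floor(x / step + 0.5) * step
    return min(max(q, q_min), q_max)


def quantized_dot(inputs, weights, m=8, n=8, in_frac=4, w_frac=7, out_frac=4):
    # Step 1: quantize inputs to m bits and weights to n bits.
    q_in = [quantize(x, in_frac, m) for x in inputs]
    q_w = [quantize(w, w_frac, n) for w in weights]
    # Step 2: multiply-and-accumulate on the quantized values.
    acc = sum(a * b for a, b in zip(q_in, q_w))
    # Step 3: quantize the accumulated result again.
    return quantize(acc, out_frac, m)


result = quantized_dot([1.3, -0.72, 2.05], [0.11, 0.5, -0.26])
print(result)
```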

Rounding Schemes

Round nearest even:
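$\operatorname{round}(x)= \left\{\begin{array}{ll}{\lfloor x\rfloor,} & {\text { if }\lfloor x\rfloor \leq x \leq \lfloor x\rfloor+\frac{\epsilon}{2}} \\ {\lfloor x\rfloor+\epsilon,} & {\text { if }\lfloor x\rfloor+\frac{\epsilon}{2}<x \leq \lfloor x\rfloor+\epsilon}\end{array}\right. \tag{1}$
Here $\epsilon$ is the quantization step and $\lfloor x\rfloor$ is the largest multiple of $\epsilon$ not exceeding $x$. This scheme is deterministic, so the paper uses it in the inference/test phase.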

Round stochastic:
$\operatorname{round}(x)= \left\{ \begin{array}{ll} {\lfloor x\rfloor,} & {\text { w.p. }\ \ 1-\frac{x-\lfloor x\rfloor}{\epsilon}} \\ {\lfloor x\rfloor+\epsilon,} & {\text { w.p. }\ \ \frac{x-\lfloor x\rfloor}{\epsilon}} \end{array} \right. \tag{2}$
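Here "w.p." stands for "with probability". The expected rounding error of this scheme is 0, that is, $\mathbb{E}[\operatorname{round}(x)]=x$. The paper uses this rounding scheme when fine-tuning the quantized network.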
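Both rounding schemes are short enough to try out directly. The sketch below follows equations (1) and (2), with ties in the deterministic scheme going down as in equation (1); the function names and the sample values are assumptions chosen for illustration.

```python
import math
import random


def round_nearest(x, eps):
    """Deterministic rounding to a multiple of eps, as in equation (1)."""
    lo = math.floor(x / eps) * eps
    return lo if x - lo <= eps / 2 else lo + eps


def round_stochastic(x, eps, rng):
    """Round up with probability (x - floor(x)) / eps, as in equation (2)."""
    lo = math.floor(x / eps) * eps
    p_up = (x - lo) / eps
    return lo + eps if rng.random() < p_up else lo


rng = random.Random(0)
eps, x = 0.25, 0.3
samples = [round_stochastic(x, eps, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)

print(round_nearest(x, eps))  # always the same grid point
print(mean)                   # close to x, since the scheme is unbiased
```

The sample mean of the stochastic scheme approaches $x$ itself, which is why it suits fine-tuning: small gradient contributions are not systematically rounded away.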

For training, a full-precision network is trained first, then quantized, and then fine-tuned.

Related Work

Network Approximation

This section reviews several ways of obtaining an approximated network, including fixed point approximation, network pruning and shared weights, and binary networks.

Accelerator

This part mainly covers hardware acceleration methods. Since it is unrelated to the algorithm, it is not summarized here.

Fixed Point Approximation

Baseline Convolutional Neural Networks

The baselines used in the experiments are LeNet, CIFAR-10 Full, CaffeNet, GoogLeNet, and SqueezeNet.

Fixed Point Format

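A fixed point number is written as $[\operatorname{IL}.\operatorname{FL}]$, where $\operatorname{IL}$ and $\operatorname{FL}$ are the lengths of the integer and fractional parts. Representing a value therefore takes $\operatorname{IL}+\operatorname{FL}$ bits in total. With round-nearest and two's complement representation, the largest representable value is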
$x_{max}=2^{IL-1}-2^{-FL} \tag{3}$
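A small sketch makes the range in equation (3) tangible. It assumes two's complement with the sign bit counted inside IL, which is what equation (3) implies; the function names are hypothetical.

```python
import math


def fixed_point_range(il, fl):
    """Smallest and largest values of a two's complement [IL.FL] number."""
    x_max = 2.0 ** (il - 1) - 2.0 ** -fl  # equation (3)
    x_min = -(2.0 ** (il - 1))
    return x_min, x_max


def to_fixed_point(x, il, fl):
    """Round x to the nearest [IL.FL] value and saturate at the range limits."""
    step = 2.0 ** -fl
    x_min, x_max = fixed_point_range(il, fl)
    q = math.floor(x / step + 0.5) * step
    return min(max(q, x_min), x_max)


print(fixed_point_range(9, 7))        # range of Q9.7
print(to_fixed_point(300.0, 9, 7))    # too large: saturates at x_max
print(to_fixed_point(0.003, 9, 7))    # smaller than half a step: becomes 0
```

Both failure modes appear in practice with a single static format: large values saturate and small values collapse to zero.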

Dynamic Range of Parameters and Layer Outputs

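Dynamic Range in Small CNN: for LeNet, the author finds that parameter values are smaller than layer outputs. 99% of the network parameters lie between $2^0$ and $2^{-10}$, while for the fully connected layers 99% of the layer outputs lie between $2^{-4}$ and $2^{5}$.

Dynamic Range in Large CNN: for CaffeNet, parameter values are again smaller than layer outputs, and the gap between the two is even larger. The author finds that 16 bits (Q9.7) gives the best quantized performance, even though some layer outputs cannot be represented (0.46%) and 21.23% of the values are truncated to 0. The author concludes: "Similarly to the analysis with LeNet, large layer outputs are more important than small parameters."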

Results

This section presents the quantization results: the quantization schemes and accuracy comparisons for LeNet on MNIST, CIFAR-10 Full on CIFAR-10, and CaffeNet on ImageNet.

Dynamic Fixed Point Approximation

Mixed Precision Fixed Point

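Different parts of the network use quantization schemes of different precision. In the paper's figure (not reproduced here), $m$ and $n$ denote the number of bits used for a layer's outputs and weights, respectively.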

Dynamic Fixed Point

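Different parts of a CNN have different value ranges. In a large layer the outputs are the result of many accumulations, so the network parameters are much smaller than the layer outputs. Plain fixed point can only cover one limited range, so dynamic fixed point is a good way to address this. In dynamic fixed point, each number is represented as: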
$(-1)^s \cdot 2^{-FL}\sum_{i=0}^{B-2}{2^i \cdot x_i} \tag{4}$
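Here $B$ is the bit-width, $s$ is the sign bit, $FL$ is the fractional length, and $x_i$ are the mantissa bits. Each layer of the network is split into two groups, one for the layer outputs and one for the weights. The two groups are quantized differently: each group has its own fractional length, that is, its own number of bits assigned to the fraction (illustrated by a figure in the paper, not reproduced here).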
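Equation (4) can be evaluated directly. The sketch below decodes the same 8-bit pattern under two different fractional lengths, standing in for an outputs group and a weights group; the function name and the two FL values are assumptions made for illustration.

```python
def decode_dynamic_fixed_point(sign, mantissa_bits, fl):
    """Evaluate equation (4).

    mantissa_bits holds x_0 .. x_{B-2}, least significant bit first,
    so the total bit-width B is len(mantissa_bits) + 1 (the sign bit).
    """
    magnitude = sum((2 ** i) * bit for i, bit in enumerate(mantissa_bits))
    return (-1) ** sign * magnitude * 2.0 ** -fl


# B = 8: one sign bit plus seven mantissa bits encoding 1 + 4 + 8 = 13.
bits = [1, 0, 1, 1, 0, 0, 0]

as_output = decode_dynamic_fixed_point(0, bits, fl=2)  # group with large values
as_weight = decode_dynamic_fixed_point(1, bits, fl=9)  # group with small values

print(as_output)
print(as_weight)
```

Because FL is stored once per group rather than per number, the hardware still works with plain integers inside a group; only the scaling between groups changes.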

Choice of Number Format: to avoid saturation, the author allocates enough integer bits. For a data set $S$, the integer length $IL$ is:
$IL=\left\lceil\log_{2}\left(\max_{S} x+1\right)\right\rceil \tag{5}$
This $IL$ is used when quantizing layer outputs. For weights, $IL$ is reduced by one, because experiments showed this works slightly better.

Results

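Impact of Dynamic Fixed Point: in the author's experiments with CaffeNet/AlexNet, both fixed point and dynamic fixed point work reasonably well at 18 bits. As the bit-width is reduced further, fixed point accuracy drops sharply while dynamic fixed point stays relatively stable. Dynamic fixed point is therefore the better choice for large networks of this kind.

Quantization of Individual Network Parts: using the three networks mentioned above, the author quantizes one part of the network at a time (layer outputs, convolutional kernels, fully connected layers) to 8-bit dynamic fixed point and observes the accuracy drop. Quantizing layer outputs or convolutional kernels costs only 0.3%, while quantizing the fully connected layers costs 0.9%.

Fine-tuned Dynamic Fixed Point Networks: the author analyzes accuracy after fine-tuning and finds that small networks lose little accuracy while large networks lose more. In my view this is because the large networks are evaluated on ImageNet, which makes the drop larger; tests on small data sets cannot fully settle the question.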

Minifloat Approximation

Motivation

Networks are trained in floating point, so a natural question is whether representing them with smaller floating point numbers would make the model smaller.

IEEE-754 Single Precision Standard

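According to the IEEE-754 standard, single precision numbers have 1 sign bit, 8 exponent bits, and 23 mantissa bits. The leading bit of the significand is an implicit 1 that is not stored, and the exponent is stored with a bias of 127 added. An exponent field of all zeros or all ones has a special meaning. All zeros encodes either the number 0 or a denormalized number, depending on the mantissa bits. All ones encodes positive/negative infinity or NaN.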
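The field layout described above can be inspected with the standard `struct` module. This is a small sketch; the helper names `decompose_float32` and `rebuild_normal` are hypothetical.

```python
import struct


def decompose_float32(x):
    """Split a value, stored as IEEE-754 single precision, into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, stored with bias 127
    mantissa = bits & 0x7FFFFF       # 23 stored mantissa bits
    return sign, exponent, mantissa


def rebuild_normal(sign, exponent, mantissa):
    """Reconstruct a normalized number: implicit leading 1, bias of 127."""
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)


fields = decompose_float32(-6.5)     # -6.5 = -1.625 * 2**2
print(fields)
print(rebuild_normal(*fields))
print(decompose_float32(float("inf")))  # exponent field is all ones
```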

Minifloat Number Format

At lower bit-widths the IEEE-754 format no longer applies as is, so the author adjusts the exponent bias according to the number of bits allocated:
$bias=2^{bits-1}-1 \tag{6}$
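Here $bits$ is the number of bits assigned to the exponent. The format does not support denormalized numbers, positive/negative infinity, or NaN. Infinity is replaced by the saturated (largest representable) number and denormalized numbers are replaced by 0. Since the forward pass contains no division, NaN cannot occur. Finally, the number of exponent bits and mantissa bits is found automatically by search.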
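The rules above are enough for a rough minifloat quantizer. The sketch below is not Ristretto's implementation: it assumes that the all-zeros exponent field is reserved for 0 (as in IEEE-754), that the all-ones field is an ordinary exponent because infinity and NaN are dropped, and that the mantissa is rounded to nearest. The function name is hypothetical.

```python
import math


def to_minifloat(x, exp_bits, man_bits):
    """Quantize x to a minifloat with the given exponent and mantissa bits."""
    if x == 0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1        # equation (6)
    e_min = 1 - bias                      # assumed: exponent field 0 encodes zero
    e_max = (2 ** exp_bits - 1) - bias    # assumed: all-ones field is a normal exponent
    largest = (2 - 2.0 ** -man_bits) * 2.0 ** e_max

    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    e -= 1                                # now mag = (1.xxx) * 2**e
    if e < e_min:
        return 0.0                        # denormalized range is flushed to 0
    scale = 2.0 ** (e - man_bits)         # spacing of representable values
    q = math.floor(mag / scale + 0.5) * scale
    return sign * min(q, largest)         # saturate instead of producing infinity


print(to_minifloat(0.3, exp_bits=4, man_bits=3))
print(to_minifloat(1000.0, exp_bits=4, man_bits=3))
print(to_minifloat(0.001, exp_bits=4, man_bits=3))
```

With 4 exponent bits and 3 mantissa bits the bias is 7, so values above 480 saturate and values below $2^{-6}$ become 0 under these assumptions.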

Network-specific Choice of Number Format: for the exponent, the author allocates enough bits to avoid saturation.
$bits=\left\lceil\log_{2}\log_{2}\left(\max_{S} x+1\right)+1\right\rceil \tag{7}$
Here $S$ is the data set being approximated.

Data Path for Accelerator

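The network weights and inputs are combined with MAC operations. The inputs are in minifloat format, and the result of each multiplication is 3 bits wider than its inputs. Additions are carried out in full precision, and the accumulated sum is finally quantized back to minifloat. The whole process is illustrated by a figure in the paper (not reproduced here).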

Results

The three CNN models mentioned above are quantized to 12-, 8-, and 6-bit minifloat. The results are reasonable, but slightly worse than dynamic fixed point, and more bits are needed.

Turning Multiplications Into Bit shifts

Multiplier-free Arithmetic

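The author argues that multipliers take up a large chip area, and therefore replaces multiplications with integer power-of-two weights. These weights can be viewed as minifloat numbers with no mantissa bits, so each multiplication turns into a bit shift. For an ordinary convolution the output is computed as $z_i=\sum_{j}{x_{j} \cdot w_{j}}$. First, each parameter is approximated by the nearest power of two, as in equation $(8)$. Since $w_j \approx 2^{e_j}$, each product $x_j \cdot w_j$ becomes a shift of $x_j$ by $e_j$ bits, and the output can be approximated with equation $(9)$. (I did not fully follow this part.)
$e_{j}=\operatorname{round}(\log_{2}(w_j)) \tag{8}$

$z_i\approx \sum_{j}{\left(x_j \ll e_j\right)} \tag{9}$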
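A short sketch of the shift-based MAC may help here. It assumes integer (fixed-point) inputs, applies equation (8) to the magnitude of each weight, and handles the sign separately; a negative exponent is realized as a right shift. The function name and sample values are hypothetical.

```python
import math


def shift_mac(inputs, weights):
    """Approximate sum(x * w) using power-of-two weights and bit shifts only."""
    acc = 0
    for x, w in zip(inputs, weights):
        if w == 0:
            continue                          # a zero weight contributes nothing
        e = round(math.log2(abs(w)))          # equation (8) on the magnitude
        term = x << e if e >= 0 else x >> -e  # multiply by 2**e without a multiplier
        acc += term if w > 0 else -term       # sign is applied separately (assumed)
    return acc


inputs = [64, 128, 32]          # integer activations
weights = [0.26, -0.5, 0.12]    # rounded to 2**-2, -(2**-1), 2**-3

approx = shift_mac(inputs, weights)
exact = sum(x * w for x, w in zip(inputs, weights))
print(approx, exact)
```

The approximation error comes only from rounding the weights to powers of two and from the bits shifted out to the right.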

Maximal Number of Shifts

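Most weights in the network lie in $[-1,1]$, and most of those are close to 0. Quantizing to powers of two, however, has a comparatively large effect on weights close to +1 and -1. The author uses 4 bits to represent each power-of-two weight. The first bit is the sign bit, so 8 different magnitudes can be represented, the smallest being $2^{-8}$. Weights below that value have very little effect on network performance.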

Data Path for Accelerator

I did not fully understand this part, but it is similar to the earlier data path section and again describes the data flow.

Result

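Small networks lose little accuracy. Large networks lose somewhat more (3 to 4 percentage points), but the author notes that this is still a good result given that only 4 bits are used for the exponent.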

Comparison of Different Approximations

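Fixed Point Approximation: needs the least energy and development time, but also gives the worst performance.

Dynamic Fixed Point Approximation: gives the best performance and combines the advantages of fixed point and minifloat.

Minifloat Approximation: performs better than fixed point, but as the bit-width shrinks, performance degrades badly if too few bits are left for the exponent.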

Summary

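Dynamic fixed point is the best choice. It shows the best performance at low bit-widths, and although it needs more chip area than pure fixed point arithmetic, it is still well suited to accelerating neural networks in hardware.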

Ristretto: An Approximation Framework for Deep CNNs

From Caffe to Ristretto

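Wikipedia: Ristretto is ‘a short shot of espresso coffee made with the normal amount of ground coffee but extracted with about half the amount of water’. In the same spirit, the framework removes the redundant parts of a CNN while keeping its predictive power. Its input and output are a Caffe prototxt file and the model parameters.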

Quantization Flow

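Ristretto can condense any 32-bit floating point model to fixed point, minifloat, or integer power-of-two parameters. The paper then walks through the Ristretto flow, which is illustrated by a figure in the paper (not reproduced here).

The rest of the chapter describes Ristretto's forward pass, backward pass, and fine-tuning mechanism. Using it is much like using Caffe. For future work, the author proposes three directions: Network Pruning, Binary Networks, and C-Code Generation.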
