Towards Accurate Binary Convolutional Neural Network
Paper link · November 30, 2017
Video link (YouTube)
Introduction
Main contributions:
1. Approximate the full-precision weights with a linear combination of multiple binary weight bases.
2. Introduce multiple binary activations. This lifts the accuracy of BNNs on ImageNet by nearly 5%.
Related Work
The work relies on the idea of finding the best approximation of the full-precision convolution using multiple binary operations, and on employing multiple binary activations to allow more information to pass through.
Binarization methods
Weight approximation
Denote the weight tensor of a layer by $(w, h, c_{in}, c_{out})$. There are two different approximation schemes: 1) approximate the weights as a whole, and 2) approximate the weights channel-wise.
Approximate weights as a whole
$M$ binary filters $B_1, B_2, \cdots, B_M \in \{-1, +1\}^{w\times h\times c_{in}\times c_{out}}$ are used to approximate the real-valued weights $W \in \mathbb{R}^{w\times h\times c_{in}\times c_{out}}$, i.e. $W \approx \alpha_1 B_1 + \alpha_2 B_2 + \dots + \alpha_M B_M$. A straightforward way is to solve the following problem:
$$
\min_{\boldsymbol{\alpha}, \boldsymbol{B}} J(\boldsymbol{\alpha}, \boldsymbol{B}) = \|\boldsymbol{w} - \boldsymbol{B}\boldsymbol{\alpha}\|^2 \quad \text{s.t. } \boldsymbol{B}_{ij} \in \{-1, +1\} \tag{1}
$$

where $\boldsymbol{B} = \left[\operatorname{vec}(\boldsymbol{B}_1), \operatorname{vec}(\boldsymbol{B}_2), \cdots, \operatorname{vec}(\boldsymbol{B}_M)\right]$, $\boldsymbol{w} = \operatorname{vec}(\boldsymbol{W})$, $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \cdots, \alpha_M]^{\mathrm{T}}$, and $\operatorname{vec}(\cdot)$ denotes vectorization. Let $\operatorname{mean}(\boldsymbol{W})$ and $\operatorname{std}(\boldsymbol{W})$ denote the mean and standard deviation of $\boldsymbol{W}$; then $\boldsymbol{B}_i$ is instead defined as:
$$
\boldsymbol{B}_i = F_{u_i}(\boldsymbol{W}) := \operatorname{sign}\left(\overline{\boldsymbol{W}} + u_i \operatorname{std}(\boldsymbol{W})\right), \quad i = 1, 2, \cdots, M \tag{2}
$$

where $\overline{\boldsymbol{W}} = \boldsymbol{W} - \operatorname{mean}(\boldsymbol{W})$ and $u_i$ is a shift parameter. For example, $u_i$ can be fixed to $u_i = -1 + (i-1)\frac{2}{M-1}, \; i = 1, 2, \cdots, M$ so that the bases cover the whole range $[-\operatorname{std}(\boldsymbol{W}), \operatorname{std}(\boldsymbol{W})]$, or the $u_i$ can be learned by the network.
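As a concrete illustration, here is a minimal sketch of Eq. (2) with the fixed, evenly spaced $u_i$ (assuming PyTorch; `binary_weight_bases` is a name chosen for this note, not from the paper):

```python
import torch

def binary_weight_bases(W: torch.Tensor, M: int = 3):
    """Eq. (2): M binary bases sign(W_bar + u_i * std(W)) with u_i spread over [-1, 1]."""
    W_bar = W - W.mean()
    std = W.std()
    # u_i = -1 + (i - 1) * 2 / (M - 1), i = 1..M  (requires M >= 2)
    us = [-1 + (i - 1) * 2 / (M - 1) for i in range(1, M + 1)]
    # note: torch.sign maps exact zeros to 0 rather than +/-1; negligible for real-valued weights
    return [torch.sign(W_bar + u * std) for u in us]
```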
Once the $\boldsymbol{B}_i$ are fixed, the problem above reduces to a linear regression problem:
$$
\min_{\boldsymbol{\alpha}} J(\boldsymbol{\alpha}) = \|\boldsymbol{w} - \boldsymbol{B}\boldsymbol{\alpha}\|^2 \tag{3}
$$

where the $\boldsymbol{B}_i$ act as the bases in the design/dictionary matrix. The straight-through estimator (STE) is then used to back-propagate through the $\boldsymbol{B}_i$ when updating $\boldsymbol{W}$. Let $c$ denote the cost function, and $\boldsymbol{A}$ and $\boldsymbol{O}$ the input and output tensors of the convolution; the forward and backward passes are computed as follows:
$$
\begin{aligned}
\text{Forward: } & B_1, B_2, \cdots, B_M = F_{u_1}(W), F_{u_2}(W), \cdots, F_{u_M}(W) \\
& \text{Solve (3) for } \alpha \\
& O = \sum_{m=1}^{M} \alpha_m \operatorname{Conv}\left(B_m, A\right) \\
\text{Backward: } & \frac{\partial c}{\partial W} = \frac{\partial c}{\partial O}\left(\sum_{m=1}^{M} \alpha_m \frac{\partial O}{\partial B_m} \frac{\partial B_m}{\partial W}\right) \stackrel{\text{STE}}{=} \frac{\partial c}{\partial O}\left(\sum_{m=1}^{M} \alpha_m \frac{\partial O}{\partial B_m}\right) = \sum_{m=1}^{M} \alpha_m \frac{\partial c}{\partial B_m}
\end{aligned}
$$
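A rough end-to-end sketch of this forward pass, assuming PyTorch and the `binary_weight_bases` helper above; `abc_weight_forward`, the weight layout, and the padding are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def abc_weight_forward(W: torch.Tensor, A: torch.Tensor, M: int = 3):
    # W is assumed to be in PyTorch layout (c_out, c_in, kH, kW); A is (N, c_in, H, W).
    Bs = binary_weight_bases(W, M)                         # B_1..B_M from Eq. (2)
    # Solve the linear regression (3): alpha = argmin ||w - B alpha||^2
    B = torch.stack([b.reshape(-1) for b in Bs], dim=1)    # (num_weights, M)
    w = W.reshape(-1, 1)                                   # (num_weights, 1)
    alpha = torch.linalg.lstsq(B, w).solution.squeeze(1)   # (M,)
    # O = sum_m alpha_m * Conv(B_m, A)
    return sum(a * F.conv2d(A, b, padding=1) for a, b in zip(alpha, Bs))
```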
Multiple binary activations and bitwise convolution
To enable bitwise operations, the activations must be binarized as well, since they are the inputs to the convolution. First define a bounded activation function $h_v(x) \in [0, 1]$:
$$
h_v(x) = \operatorname{clip}(x + v, 0, 1) \tag{4}
$$

where $v$ is a shift parameter. The binarization function is then:
$$
H_v(\boldsymbol{R}) := 2\,\mathbb{I}_{h_v(\boldsymbol{R}) \geq 0.5} - 1 \tag{5}
$$

where $\mathbb{I}$ is the indicator function. The forward and backward passes of the activation are computed as:
$$
\begin{aligned}
\text{Forward: } & A = H_v(\boldsymbol{R}) \\
\text{Backward: } & \frac{\partial c}{\partial \boldsymbol{R}} = \frac{\partial c}{\partial \boldsymbol{A}} \circ \mathbb{I}_{0 \leq \boldsymbol{R} - v \leq 1} \quad \text{(using STE)}
\end{aligned}
$$
where $\circ$ denotes the Hadamard (element-wise) product.
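A small sketch of Eqs. (4)-(5) together with this STE backward, written as a custom autograd function (assumed PyTorch; `BinaryActivation` is an illustrative name, and $v$ is treated as a fixed constant here even though the paper trains it):

```python
import torch

class BinaryActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, R: torch.Tensor, v: float):
        ctx.save_for_backward(R)
        ctx.v = v
        h = torch.clamp(R + v, 0.0, 1.0)                # h_v(x) = clip(x + v, 0, 1), Eq. (4)
        return 2.0 * (h >= 0.5).to(R.dtype) - 1.0       # H_v(R) = 2 * I[h_v(R) >= 0.5] - 1, Eq. (5)

    @staticmethod
    def backward(ctx, grad_output):
        (R,) = ctx.saved_tensors
        # STE: pass the gradient only where the indicator I_{0 <= R - v <= 1} from the note is 1
        mask = ((R - ctx.v >= 0) & (R - ctx.v <= 1)).to(grad_output.dtype)
        return grad_output * mask, None                 # no gradient for v in this simplified sketch
```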
First, to keep the distribution of the activations relatively stable, batch normalization is applied right before the activation function. Then a linear combination of $N$ binary activations is used to approximate the real-valued activation, $R \approx \beta_1 \boldsymbol{A}_1 + \beta_2 \boldsymbol{A}_2 + \dots + \beta_N \boldsymbol{A}_N$, where
$$
\boldsymbol{A}_1, \boldsymbol{A}_2, \dots, \boldsymbol{A}_N = H_{v_1}(\boldsymbol{R}), H_{v_2}(\boldsymbol{R}), \dots, H_{v_N}(\boldsymbol{R}) \tag{6}
$$

Here both $\beta_n$ and $v_n$ are trainable (and fixed at test time), which lets them adapt to the statistics of the data.
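Continuing the sketch above, the $N$ binary activations of Eq. (6) could be produced as follows (names and the hand-picked $v_n$ values are illustrative; in ABC-Net the $v_n$ and $\beta_n$ are learned):

```python
import torch

def binary_activations(R: torch.Tensor, vs):
    """Eq. (6): A_n = H_{v_n}(R), using BinaryActivation from the sketch above."""
    return [BinaryActivation.apply(R, v) for v in vs]

R = torch.randn(1, 64, 32, 32)                    # a dummy pre-activation tensor
As = binary_activations(R, vs=[-0.5, 0.0, 0.5])   # three shifts chosen only for illustration
```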
The whole convolution then becomes:

$$
\operatorname{Conv}(\boldsymbol{W}, \boldsymbol{R}) \approx \operatorname{Conv}\left(\sum_{m=1}^{M} \alpha_m \boldsymbol{B}_m, \sum_{n=1}^{N} \beta_n \boldsymbol{A}_n\right) = \sum_{m=1}^{M} \sum_{n=1}^{N} \alpha_m \beta_n \operatorname{Conv}\left(\boldsymbol{B}_m, \boldsymbol{A}_n\right) \tag{7}
$$

This also means the $M \times N$ bitwise convolutions can be computed in parallel.
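Putting the pieces together, a sketch of Eq. (7) might look like the following (assumed PyTorch; `abc_conv` is a made-up name, and plain float convolutions stand in for the XNOR/popcount kernels that would run on real binary hardware):

```python
import torch.nn.functional as F

def abc_conv(alphas, Bs, betas, As, padding=1):
    """Conv(W, R) ~= sum_m sum_n alpha_m * beta_n * Conv(B_m, A_n)  (Eq. 7)."""
    out = 0
    for a, B in zip(alphas, Bs):          # M binary weight bases
        for b, A in zip(betas, As):       # N binary activations
            # the M*N terms are independent, so they could be dispatched in parallel
            out = out + a * b * F.conv2d(A, B, padding=padding)
    return out
```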
Training algorithm
The authors note that the common layer ordering is $\text{Conv} \rightarrow \text{BN} \rightarrow \text{Activation} \rightarrow \text{Pooling}$, but in practice applying max-pooling after the binary activation turns most outputs into $+1$, which degrades accuracy. Max-pooling is therefore moved before the BN layer. The detailed training procedure is given in the supplementary material.
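A minimal sketch of the reordered block (assumed PyTorch; the channel counts and the Hardtanh placeholder for the binary activation are illustrative):

```python
import torch.nn as nn

# Conv -> Max-pool -> BN -> Activation: pooling now runs on real-valued conv outputs,
# so it no longer collapses almost everything to +1.
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.BatchNorm2d(64),
    nn.Hardtanh(),   # stand-in for the multiple binary activations of Eq. (6)
)
```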
Experiment results
Experiment results on ImageNet dataset
ResNet is used as the base network, and images are resized to 224×224.
Effect of weight approximation
ResNet-18 is used as the base network. BWN denotes the Binary-Weights-Network and FP the full-precision network; the results are compared below.
Comparison with the state-of-the-art
Discussion
Why adding a shift parameter works
The authors argue that, much like the mean and std in a BN layer, the shift parameters let the network learn the distribution of the data.
Advantage over the fixed-point quantization scheme
The authors argue that a quantization scheme using $K$ binary bases is better than a $K$-bit fixed-point scheme because: 1) it admits bitwise operations; 2) $K$ 1-bit multipliers consume fewer resources than a single $K$-bit multiplier; 3) the impulse response is preserved.
Watching the video still helps to get a better grasp of the ideas in the paper.