Translation of "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks"

Original paper: https://arxiv.org/abs/1603.05279

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

Mohammad Rastegari†, Vicente Ordonez†, Joseph Redmon∗, Ali Farhadi†∗

Allen Institute for AI, University of Washington
{mohammadr,vicenteor}@allenai.org, {pjreddie,ali}@cs.washington.edu

Abstract. We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.


1 Introduction

Deep neural networks (DNN) have shown significant improvements in several application domains including computer vision and speech recognition. In computer vision, a particular type of DNN, known as Convolutional Neural Networks (CNN), have demonstrated state-of-the-art results in object recognition [1,2,3,4] and detection [5,6,7].

Convolutional neural networks show reliable results on object recognition and detection that are useful in real world applications. Concurrent to the recent progress in recognition, interesting advancements have been happening in virtual reality (VR by Oculus) [8], augmented reality (AR by HoloLens) [9], and smart wearable devices. Putting these two pieces together, we argue that it is the right time to equip smart portable devices with the power of state-of-the-art recognition systems. However, CNN-based recognition systems need large amounts of memory and computational power. While they perform well on expensive, GPU-based machines, they are often unsuitable for smaller devices like cell phones and embedded electronics.

For example, AlexNet [1] has 61M parameters (249MB of memory) and performs 1.5B high precision operations to classify one image. These numbers are even higher for deeper CNNs, e.g., VGG [2] (see section 4.1). These models quickly overtax the limited storage, battery power, and compute capabilities of smaller devices like cell phones.


In this paper, we introduce simple, efficient, and accurate approximations to CNNs by binarizing the weights and even the intermediate representations in convolutional neural networks. Our binarization method aims at finding the best approximations of the convolutions using binary operations. We demonstrate that our way of binarizing neural networks results in ImageNet classification accuracy numbers that are comparable to standard full precision networks while requiring significantly less memory and fewer floating point operations.

We study two approximations: Neural networks with binary weights and XNOR-Networks. In Binary-Weight-Networks all the weight values are approximated with binary values. A convolutional neural network with binary weights is significantly smaller (32×) than an equivalent network with single-precision weight values. In addition, when weight values are binary, convolutions can be estimated by only addition and subtraction (without multiplication), resulting in 2× speed up. Binary-weight approximations of large CNNs can fit into the memory of even small, portable devices while maintaining the same level of accuracy (See Section 4.1 and 4.2).


To take this idea further, we introduce XNOR-Networks where both the weights and the inputs to the convolutional and fully connected layers are approximated with binary values. Binary weights and binary inputs allow an efficient way of implementing convolutional operations. If all of the operands of the convolutions are binary, then the convolutions can be estimated by XNOR and bitcounting operations [11]. XNOR-Nets result in accurate approximation of CNNs while offering 58× speed up in CPUs (in terms of number of the high precision operations). This means that XNOR-Nets can enable real-time inference in devices with small memory and no GPUs (inference in XNOR-Nets can be done very efficiently on CPUs).

To the best of our knowledge this paper is the first attempt to present an evaluation of binary neural networks on large-scale datasets like ImageNet. Our experimental results show that our proposed method for binarizing convolutional neural networks outperforms the state-of-the-art network binarization method of [11] by a large margin (16.3%) on top-1 image classification in the ImageNet challenge ILSVRC2012. Our contribution is two-fold: First, we introduce a new way of binarizing the weight values in convolutional neural networks and show the advantage of our solution compared to state-of-the-art solutions. Second, we introduce XNOR-Nets, a deep neural network model with binary weights and binary inputs and show that XNOR-Nets can obtain similar classification accuracies compared to standard networks while being significantly more efficient. Our code is available at: http://allenai.org/plato/xnornet


2 Related Work

Deep neural networks often suffer from over-parametrization and large amounts of redundancy in their models. This typically results in inefficient computation and memory usage [12]. Several methods have been proposed to address efficient training and inference in deep neural networks.

Shallow networks: Estimating a deep neural network with a shallower model reduces the size of a network. Early theoretical work by Cybenko shows that a network with a large enough single hidden layer of sigmoid units can approximate any decision boundary [13]. In several areas (e.g., vision and speech), however, shallow networks cannot compete with deep models [14]. [15] trains a shallow network on SIFT features to classify the ImageNet dataset. They show it is difficult to train shallow networks with a large number of parameters. [16] provides empirical evidence on small datasets (e.g., CIFAR-10) that shallow nets are capable of learning the same functions as deep nets. In order to get similar accuracy, the number of parameters in the shallow network must be close to the number of parameters in the deep network. They do this by first training a state-of-the-art deep model, and then training a shallow model to mimic the deep model. These methods are different from our approach because we use the standard deep architectures not the shallow estimations.

Compressing pre-trained deep networks: Pruning redundant, non-informative weights in a previously trained network reduces the size of the network at inference time. Weight decay [17] was an early method for pruning a network. Optimal Brain Damage [18] and Optimal Brain Surgeon [19] use the Hessian of the loss function to prune a network by reducing the number of connections. Recently [20] reduced the number of parameters by an order of magnitude in several state-of-the-art neural networks by pruning. [21] proposed to reduce the number of activations for compression and acceleration. Deep compression [22] reduces the storage and energy required to run inference on large networks so they can be deployed on mobile devices. They remove the redundant connections and quantize weights so that multiple connections share the same weight, and then they use Huffman coding to compress the weights. HashedNets [23] uses a hash function to reduce model size by randomly grouping the weights, such that connections in a hash bucket use a single parameter value. Matrix factorization has been used by [24,25]. We are different from these approaches because we do not use a pretrained network. We train binary networks from scratch.

Designing compact layers: Designing compact blocks at each layer of a deep network can help to save memory and computational costs. Replacing the fully connected layer with global average pooling was examined in the Network in Network architecture [26], GoogLenet [3] and Residual-Net [4], which achieved state-of-the-art results on several benchmarks. The bottleneck structure in Residual-Net [4] has been proposed to reduce the number of parameters and improve speed. Decomposing 3 × 3 convolutions with two 1 × 1 is used in [27] and resulted in state-of-the-art performance on object recognition. Replacing 3 × 3 convolutions with 1 × 1 convolutions is used in [28] to create a very compact neural network that can achieve 50× reduction in the number of parameters while obtaining high accuracy. Our method is different from this line of work because we use the full network (not the compact version) but with binary parameters.


Quantizing parameters: High precision parameters are not very important in achieving high performance in deep networks. [29] proposed to quantize the weights of fully connected layers in a deep network by vector quantization techniques. They showed just thresholding the weight values at zero only decreases the top-1 accuracy on ILSVRC2012 by less than 10%. [30] proposed a provably polynomial time algorithm for training a sparse network with +1/0/-1 weights. A fixed-point implementation of 8-bit integer was compared with 32-bit floating point activations in [31]. Another fixed-point network with ternary weights and 3-bits activations was presented by [32]. Quantizing a network with L2 error minimization achieved better accuracy on MNIST and CIFAR-10 datasets in [33]. [34] proposed a back-propagation process by quantizing the representations at each layer of the network. To convert some of the remaining multiplications into binary shifts the neurons get restricted values of power-of-two integers. In [34] they carry the full precision weights during the test phase, and only quantize the neurons during the back-propagation process, and not during the forward-propagation. Our work is similar to these methods since we are quantizing the parameters in the network. But our quantization is the extreme scenario of +1,-1.

Network binarization: These works are the most related to our approach. Several methods attempt to binarize the weights and the activations in neural networks. The performance of highly quantized networks (e.g., binarized) was believed to be very poor due to the destructive property of binary quantization [35]. Expectation BackPropagation (EBP) in [36] showed high performance can be achieved by a network with binary weights and binary activations. This is done by a variational Bayesian approach, that infers networks with binary weights and neurons. A fully binary network at run time presented in [37] using a similar approach to EBP, showing significant improvement in energy efficiency. In EBP the binarized parameters were only used during inference. BinaryConnect [38] extended the probabilistic idea behind EBP. Similar to our approach, BinaryConnect uses the real-valued version of the weights as a key reference for the binarization process. The real-valued weights are updated using the back-propagated error by simply ignoring the binarization in the update. BinaryConnect achieved state-of-the-art results on small datasets (e.g., CIFAR-10, SVHN). Our experiments show that this method is not very successful on large-scale datasets (e.g., ImageNet). BinaryNet [11] proposes an extension of BinaryConnect, where both weights and activations are binarized. Our method is different from them in the binarization method and the network structure. We also compare our method with BinaryNet on ImageNet, and our method outperforms BinaryNet by a large margin. [39] argued that the noise introduced by weight binarization provides a form of regularization, which could help to improve test accuracy. This method binarizes weights while maintaining full precision activation. [40] proposed fully binary training and testing in an array of committee machines with randomized input. [41] retrains a previously trained neural network with binary weights and binary inputs.


3 Binary Convolutional Neural Network

We represent an L-layer CNN architecture with a triplet ⟨I, W, ∗⟩. I is a set of tensors, where each element I = I_l (l = 1, ..., L) is the input tensor for the l-th layer of the CNN (green cubes in figure 1). W is a set of tensors, where each element in this set, W = W_lk (k = 1, ..., K_l), is the k-th weight filter in the l-th layer of the CNN. K_l is the number of weight filters in the l-th layer of the CNN. ∗ represents a convolutional operation with I and W as its operands. I ∈ R^(c×w_in×h_in), where (c, w_in, h_in) represents channels, width and height respectively. W ∈ R^(c×w×h), where w ≤ w_in, h ≤ h_in. We propose two variations of binary CNN: Binary-weights, where the elements of W are binary tensors, and XNOR-Networks, where elements of both I and W are binary tensors.


3.1 Binary-Weight-Networks

In order to constrain a convolutional neural network ⟨I, W, ∗⟩ to have binary weights, we estimate the real-value weight filter W ∈ W using a binary filter B ∈ {+1, −1}^(c×w×h) and a scaling factor α ∈ R+ such that W ≈ αB. A convolutional operation can be approximated by:

I ∗ W ≈ (I ⊕ B) α    (1)

where ⊕ indicates a convolution without any multiplication. Since the weight values are binary, we can implement the convolution with additions and subtractions. The binary weight filters reduce memory usage by a factor of 32× compared to single-precision filters. We represent a CNN with binary weights by ⟨I, B, A, ⊕⟩, where B is a set of binary tensors and A is a set of positive real scalars, such that B = B_lk is a binary filter and α = A_lk is a scaling factor, so that W_lk ≈ A_lk B_lk.

Estimating binary weights: 
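The paper's closed-form estimate takes B = sign(W) and α = ‖W‖₁/n, the mean absolute value of the weights. A minimal NumPy sketch of this per-filter binarization, with a random filter as placeholder data:

```python
# Per-filter weight binarization: W is approximated by alpha * B with
# B = sign(W) and alpha = mean(|W|), following the closed-form estimate.
import numpy as np

def binarize_weights(W):
    """Approximate a real-valued filter W (c x w x h) by alpha * B."""
    B = np.sign(W)
    B[B == 0] = 1                     # treat exact zeros as +1
    alpha = np.abs(W).mean()          # optimal scaling factor, L1 norm / n
    return alpha, B

W = np.random.randn(64, 3, 3)         # illustrative 3x3 filter with 64 channels
alpha, B = binarize_weights(W)
print(np.linalg.norm(W - alpha * B))  # reconstruction error of the binary estimate
```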

Training Binary-Weights-Networks: 

Algorithm 1 demonstrates our procedure for training a CNN with binary weights. First, we binarize the weight filters at each layer by computing B and A. Then we call forward propagation using binary weights and its corresponding scaling factors, where all the convolutional operations are carried out by equation 1. Then, we call backward propagation, where the gradients are computed with respect to the estimated weight filters W. Lastly, the parameters and the learning rate get updated by an update rule, e.g., SGD update with momentum or ADAM [42].

Once the training is finished, there is no need to keep the real-value weights, because at inference we only perform forward propagation with the binarized weights.
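As a rough illustration of this training procedure, the PyTorch sketch below binarizes the convolutional weights before the forward pass, backpropagates through the binary estimates, and then applies the gradient update to the saved real-valued weights. The tiny model, dummy batch, and hyperparameters are placeholders rather than the architectures evaluated in the paper.

```python
# One training step in the spirit of Algorithm 1: forward/backward with
# binarized weights, parameter update applied to the real-valued weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()
conv_layers = [m for m in model.modules() if isinstance(m, nn.Conv2d)]

def binarize(layers):
    saved = [m.weight.data.clone() for m in layers]
    for m in layers:
        w = m.weight.data
        alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)  # per-filter scale
        m.weight.data = alpha * w.sign()
    return saved

def restore(layers, saved):
    for m, w in zip(layers, saved):
        m.weight.data = w

images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
saved = binarize(conv_layers)        # forward and backward use binary weights
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
restore(conv_layers, saved)          # gradients are applied to the real weights
optimizer.step()
```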


3.2 XNOR-Networks

So far, we managed to find binary weights and a scaling factor to estimate the real-value weights. The inputs to the convolutional layers are still real-value tensors. Now, we explain how to binarize both weights and inputs, so convolutions can be implemented efficiently using XNOR and bitcounting operations. This is the key element of our XNOR-Networks. In order to constrain a convolutional neural network ⟨I, W, ∗⟩ to have binary weights and binary inputs, we need to enforce binary operands at each step of the convolutional operation. A convolution consists of repeating a shift operation and a dot product. The shift operation moves the weight filter over the input and the dot product performs element-wise multiplications between the values of the weight filter and the corresponding part of the input. If we express the dot product in terms of binary operations, convolution can be approximated using binary operations. The dot product between two binary vectors can be implemented by XNOR-bitcounting operations [11]. In this section, we explain how to approximate the dot product between two vectors in R^n by a dot product between two vectors in {+1, −1}^n. Next, we demonstrate how to use this approximation for estimating a convolutional operation between two tensors.

Binary Dot Product:
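A small sketch of how a dot product between two {+1, −1} vectors can be computed with XNOR and bit-counting, assuming the vectors are packed into integer bitmasks (+1 → 1, −1 → 0); the packing helper and the vector length are illustrative.

```python
# XNOR + popcount dot product for {+1, -1} vectors: with m matching positions
# out of n, the dot product equals 2*m - n.
import numpy as np

def pack_bits(v):
    """Pack a {+1, -1} vector into a Python integer bitmask."""
    bits = (v > 0).astype(np.uint8)
    return int("".join(map(str, bits)), 2)

def xnor_dot(x, y):
    n = len(x)
    a, b = pack_bits(x), pack_bits(y)
    mask = (1 << n) - 1
    matches = bin(~(a ^ b) & mask).count("1")   # bitwise XNOR, then popcount
    return 2 * matches - n

x = np.random.choice([-1, 1], size=64)
y = np.random.choice([-1, 1], size=64)
assert xnor_dot(x, y) == int(np.dot(x, y))      # agrees with the real dot product
```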

Binary Convolution: 
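For a single output location, the approximation described above amounts to scaling the binary dot product by the average magnitude of the weight filter (α) and of the corresponding input patch (β, the input scaling factor referred to later in Section 4.2). A simplified NumPy sketch with a random patch and filter as placeholder data:

```python
# Estimate the real-valued dot product <X, W> for one output location by
# beta * alpha * <sign(X), sign(W)>, where beta and alpha are mean magnitudes.
import numpy as np

def approx_dot(X, W):
    beta = np.abs(X).mean()           # scaling factor of the input patch
    alpha = np.abs(W).mean()          # scaling factor of the weight filter
    return beta * alpha * np.sum(np.sign(X) * np.sign(W))

X = np.random.randn(3, 3, 3)          # one c x w x h input patch
W = np.random.randn(3, 3, 3)          # one c x w x h weight filter
print(np.sum(X * W), approx_dot(X, W))  # exact value vs. binary approximation
```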

Training XNOR-Networks:

Binary Gradient: 


4 Experiments

We evaluate our method by analyzing its efficiency and accuracy. We measure the efficiency by computing the computational speedup (in terms of number of high precision operations) achieved by our binary convolution vs. standard convolution. To measure accuracy, we perform image classification on the large-scale ImageNet dataset. This paper is the first work that evaluates binary neural networks on the ImageNet dataset. Our binarization technique is general; we can use any CNN architecture. We evaluate AlexNet [1] and two deeper architectures in our experiments. We compare our method with two recent works on binarizing neural networks: BinaryConnect [38] and BinaryNet [11]. The classification accuracy of our binary-weight-network version of AlexNet is as accurate as the full precision version of AlexNet. This classification accuracy outperforms competitors on binary neural networks by a large margin. We also present an ablation study, where we evaluate the key elements of our proposed method: computing scaling factors and our block structure for binary CNN. We show that our method of computing the scaling factors is important to reach high accuracy.


4.1 Efficiency Analysis

In a standard convolution, the total number of operations is cN_W N_I, where c is the number of channels, N_W = wh and N_I = w_in h_in. Note that some modern CPUs can fuse the multiplication and addition as a single cycle operation. On those CPUs, Binary-Weight-Networks do not deliver a speed up. Our binary approximation of convolution (equation 11) has cN_W N_I binary operations and N_I non-binary operations. With the current generation of CPUs, we can perform 64 binary operations in one clock of the CPU, therefore the speedup can be computed by:

S = cN_W N_I / ((1/64) cN_W N_I + N_I) = 64 cN_W / (cN_W + 64)

The speedup depends on the channel size and filter size but not the input size. In figure 4-(b-c) we illustrate the speedup achieved by changing the number of channels and filter size. While changing one parameter, we fix the other parameters as follows: c = 256, N_I = 14² and N_W = 3² (the majority of convolutions in the ResNet [4] architecture have this structure). Using our approximation of convolution we gain 62.27× theoretical speedup, but in our CPU implementation with all of the overheads, we achieve 58× speedup in one convolution (excluding the process for memory allocation and memory access). With the small channel size (c = 3) and filter size (N_W = 1 × 1) the speedup is not considerably high. This motivates us to avoid binarization at the first and last layer of a CNN. In the first layer the channel size is 3 and in the last layer the filter size is 1 × 1. A similar strategy was used in [11]. Figure 4-a shows the required memory for three different CNN architectures (AlexNet, VGG-19, ResNet-18) with binary and double precision weights. Binary-weight-networks are so small that they can be easily fitted into portable devices. BinaryNet [11] is in the same order of memory and computation efficiency as our method. In Figure 4, we show an analysis of computation and memory cost for a binary convolution. The same analysis is valid for BinaryNet and BinaryConnect. The key difference of our method is using a scaling-factor, which does not change the order of efficiency while providing a significant improvement in accuracy.
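The quoted 62.27× figure follows directly from the speedup expression above; a quick check in Python:

```python
# Theoretical speedup for c = 256 channels, 3x3 filters, 64 binary ops per clock.
c, n_w = 256, 3 * 3
speedup = (64.0 * c * n_w) / (c * n_w + 64)
print(round(speedup, 2))   # -> 62.27
```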


4.2 Image Classification

We evaluate the performance of our proposed approach on the task of natural image classification. So far, in the literature, binary neural network methods have presented their evaluations on either limited domain or simplified datasets, e.g., CIFAR-10, MNIST, SVHN. To compare with state-of-the-art vision, we evaluate our method on ImageNet (ILSVRC2012). ImageNet has 1.2M train images from 1K categories and 50K validation images. The images in this dataset are natural images with reasonably high resolution compared to the CIFAR and MNIST datasets, which have relatively small images. We report our classification performance using Top-1 and Top-5 accuracies. We adopt three different CNN architectures as our base architectures for binarization: AlexNet [1], Residual Networks (known as ResNet) [4], and a variant of GoogLenet [3]. We compare our Binary-weight-network (BWN) with BinaryConnect (BC) [38] and our XNOR-Networks (XNOR-Net) with BinaryNeuralNet (BNN) [11]. BinaryConnect (BC) is a method for training a deep neural network with binary weights during forward and backward propagations. Similar to our approach, they keep the real-value weights during the updating parameters step. Our binarization is different from BC. The binarization in BC can be either deterministic or stochastic. We use the deterministic binarization for BC in our comparisons because the stochastic binarization is not efficient. The same evaluation settings have been used and discussed in [11]. BinaryNeuralNet (BNN) [11] is a neural network with binary weights and activations during inference and gradient computation in training. In concept, this is a similar approach to our XNOR-Network but the binarization method and the network structure in BNN is different from ours. Their training algorithm is similar to BC and they used deterministic binarization in their evaluations.


CIFAR-10: BC and BNN showed near state-of-the-art performance on the CIFAR-10, MNIST, and SVHN datasets. BWN and XNOR-Net on CIFAR-10 using the same network architecture as BC and BNN achieve the error rate of 9.88% and 10.17% respectively. In this paper we explore the possibility of obtaining near state-of-the-art results on a much larger and more challenging dataset (ImageNet).

AlexNet: [1] is a CNN architecture with 5 convolutional layers and two fully-connected layers. This architecture was the first CNN architecture that was shown to be successful on the ImageNet classification task. This network has 61M parameters. We use AlexNet coupled with batch normalization layers [43].

Train: In each iteration of training, images are resized to have 256 pixels at their smaller dimension and then a random crop of 224 × 224 is selected for training. We run the training algorithm for 16 epochs with batch size equal to 512. We use negative-log-likelihood over the soft-max of the outputs as our classification loss function. In our implementation of AlexNet we do not use the Local-Response-Normalization (LRN) layer. We use SGD with momentum=0.9 for updating parameters in BWN and BC. For XNOR-Net and BNN we used ADAM [42]. ADAM converges faster and usually achieves better accuracy for binary inputs [11]. The learning rate starts at 0.1 and we apply a learning-rate-decay=0.01 every 4 epochs.
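A hedged PyTorch sketch of these optimizer settings; the stand-in model is a placeholder, and reading the learning-rate-decay of 0.01 every 4 epochs as a multiplicative StepLR schedule is an assumption about the exact schedule form.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 1000)   # placeholder for the actual AlexNet model
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)   # BWN and BC
adam = torch.optim.Adam(model.parameters(), lr=0.1)               # XNOR-Net and BNN
scheduler = torch.optim.lr_scheduler.StepLR(sgd, step_size=4, gamma=0.01)
```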

Test: At inference time, we use the 224 × 224 center crop for forward propagation.


Figure 5 demonstrates the classification accuracy for training and inference along the training epochs for top-1 and top-5 scores. The dashed lines represent training accuracy and the solid lines show the validation accuracy. In all of the epochs our method outperforms BC and BNN by a large margin (17%). Table 1 compares our final accuracy with BC and BNN. We found that the scaling factors for the weights (α) are much more effective than the scaling factors for the inputs (β). Removing β reduces the accuracy by a small margin (less than 1% top-1 for AlexNet).

Binary Gradient: Using XNOR-Net with binary gradient, the top-1 accuracy will drop only by 1.4%.

Residual Net: We use the ResNet-18 proposed in [4] with short-cut type B.

Train: In each training iteration, images are resized randomly between 256 and 480 pixels on the smaller dimension and then a random crop of 224 × 224 is selected for training. We run the training algorithm for 58 epochs with batch size equal to 256 images. The learning rate starts at 0.1 and we use a learning-rate-decay equal to 0.01 at epochs number 30 and 40.

Test: At inference time, we use the 224 × 224 center crop for forward propagation.

Figure 6 demonstrates the classification accuracy (Top-1 and Top-5) along the epochs for training and inference. The dashed lines represent training and the solid lines represent inference. Table 2 shows our final accuracy by BWN and XNOR-Net.

GoogLenet Variant: We experiment with a variant of GoogLenet [3] that uses a similar number of parameters and connections but only straightforward convolutions, no branching. It has 21 convolutional layers with filter sizes alternating between 1 × 1 and 3 × 3.

Train: Images are resized randomly between 256 and 320 pixels on the smaller dimension and then a random crop of 224 × 224 is selected for training. We run the training algorithm for 80 epochs with batch size of 128. The learning rate starts at 0.1 and we use polynomial rate decay, β = 4.

Test: At inference time, we use a center crop of 224 × 224.


4.3 Ablation Studies

There are two key differences between our method and the previous network binarization methods: the binarization technique and the block structure in our binary CNN.

For binarization, we find the optimal scaling factors at each iteration of training. For the block structure, we order the layers in a block in a way that decreases the quantization loss for training XNOR-Net. Here, we evaluate the effect of each of these elements on the performance of the binary networks. Instead of computing the scaling factor α using equation 6, one can consider α as a network parameter. In other words, a layer after binary convolution multiplies the output of convolution by a scalar parameter for each filter. This is similar to computing the affine parameters in batch normalization. Table 3-a compares the performance of a binary network with the two ways of computing the scaling factors. As we mentioned in section 3.2, the typical block structure in CNN is not suitable for binarization. Table 3-b compares the standard block structure C-B-A-P (Convolution, Batch Normalization, Activation, Pooling) with our structure B-A-C-P (A is binary activation).
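A hedged PyTorch sketch of the two block orderings compared in Table 3-b; the sign() activation and the plain Conv2d below are simplified stand-ins for the binary activation and binary convolution, so only the layer ordering reflects the text.

```python
import torch
import torch.nn as nn

class BinActive(nn.Module):
    def forward(self, x):
        return x.sign()              # binarize the input to {+1, -1}

def cbap_block(c_in, c_out):
    # Typical ordering: Convolution, Batch Normalization, Activation, Pooling
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2))

def bacp_block(c_in, c_out):
    # XNOR-Net ordering: Batch Normalization, Binary Activation, Convolution, Pooling
    return nn.Sequential(nn.BatchNorm2d(c_in), BinActive(),
                         nn.Conv2d(c_in, c_out, 3, padding=1), nn.MaxPool2d(2))

x = torch.randn(1, 64, 32, 32)
print(cbap_block(64, 128)(x).shape, bacp_block(64, 128)(x).shape)
```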


5 Conclusion

We introduce simple, efficient, and accurate binary approximations for neural networks. We train a neural network that learns to find binary values for weights, which reduces the size of the network by 32× and provides the possibility of loading very deep neural networks into portable devices with limited memory. We also propose an architecture, XNOR-Net, that uses mostly bitwise operations to approximate convolutions. This provides 58× speed up and enables the possibility of running the inference of state-of-the-art deep neural networks on CPU (rather than GPU) in real-time.

Acknowledgements

This work is in part supported by ONR N00014-13-1-0720, NSF IIS-1338054, the Allen Distinguished Investigator Award, and the Allen Institute for Artificial Intelligence.


