Author: 葉虎

Editor: 王抒偉

This article has 6,500 characters and 30 figures; reading time is about... minutes

Never mind

Read for as long as you like

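0 Introduction:

Convolutional neural networks (CNNs) are now widely used in computer vision and have achieved good results. Figure 0 shows how CNNs have performed in the ImageNet competition in recent years: in pursuit of classification accuracy, models have grown ever deeper and more complex. The deep residual network (ResNet), for example, has as many as 152 layers.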

Figure 0: CNN performance on ImageNet (source: CVPR 2017)

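However, in some real application scenarios, such as mobile or embedded devices, models this large and complex are hard to deploy.

First, the model is too large and runs into memory limits. Second, these scenarios demand low latency, that is, fast responses; imagine what terrible things could happen if the pedestrian detection system of a self-driving car were slow.

So studying small, efficient CNN models is crucial in these scenarios, at least for now, even though hardware will keep getting faster.

Current research falls into two directions: one compresses an already trained complex model into a small one; the other designs a small model directly and trains it. Either way, the goal is to reduce model size (parameter size) and increase model speed (low latency) while maintaining model performance (accuracy).

MobileNet, the subject of this article, belongs to the latter. It is a small, efficient CNN model recently proposed by Google that trades off accuracy against latency.

MobileNet is described in detail below.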

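1

Depthwise separable convolution:

The basic unit of MobileNet is the depthwise separable convolution (DSC), a structure that had in fact already been used in Inception models.

A depthwise separable convolution is a factorized convolution; it can be decomposed into two smaller operations:

depthwise convolution and pointwise convolution

As shown in Figure 1, a depthwise convolution differs from a standard convolution. In a standard convolution, each kernel is applied across all input channels, whereas a depthwise convolution uses a different kernel for each input channel, that is, one kernel per input channel. This is why depthwise convolution is called a depth-level operation.

A pointwise convolution is just an ordinary convolution with a 1x1 kernel. Figure 2 shows the two operations more clearly.

A DSC first applies a depthwise convolution to each input channel separately, then uses a pointwise convolution to combine those outputs. The overall effect is roughly that of a standard convolution, but with far less computation and far fewer parameters.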

Figure 1: Depthwise separable convolution

Figure 2: Depthwise convolution and pointwise convolution
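To make the two steps concrete, here is a minimal pure-Python sketch of a depthwise convolution followed by a pointwise convolution. It is only an illustration, not the article's TensorFlow code: the function names, the nested-list layout `[H][W][channels]`, and the choice of stride 1 with no padding are all assumptions made for this example.

```python
def depthwise_conv(x, dw_kernels):
    """Depthwise step. x is [H][W][M]; dw_kernels is [K][K][M], i.e. one
    KxK filter per input channel. Stride 1, no padding (assumed here)."""
    h, w, m = len(x), len(x[0]), len(x[0][0])
    k = len(dw_kernels)
    out_h, out_w = h - k + 1, w - k + 1
    out = [[[0.0] * m for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for c in range(m):  # every channel is filtered on its own
                out[i][j][c] = sum(
                    x[i + di][j + dj][c] * dw_kernels[di][dj][c]
                    for di in range(k) for dj in range(k))
    return out


def pointwise_conv(x, pw_weights):
    """Pointwise (1x1) step. pw_weights is [M][N]: it mixes the M channels
    of each pixel into N output channels."""
    m, n = len(pw_weights), len(pw_weights[0])
    return [[[sum(pixel[c] * pw_weights[c][o] for c in range(m))
              for o in range(n)]
             for pixel in row]
            for row in x]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise convolution followed by pointwise convolution."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


# Tiny example: a 3x3 input with 2 channels, 3x3 depthwise filters,
# then a 1x1 convolution from 2 to 3 channels.
x = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
dw = [[[1.0, 0.5] for _ in range(3)] for _ in range(3)]
pw = [[1.0, 0.0, 1.0],
      [0.0, 1.0, 1.0]]
y = depthwise_separable_conv(x, dw, pw)
print(y)  # [[[9.0, 9.0, 18.0]]]
```

The depthwise step never mixes channels; all cross-channel mixing happens in the 1x1 step, which is what lets the two costs be analyzed separately.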

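Here is a brief analysis of how the computational cost of a depthwise separable convolution differs from that of a standard convolution.

Assume the input feature map has size $D_F \times D_F \times M$

and the output feature map has size $D_F \times D_F \times N$,

where $D_F$ is the width and height of the feature map (the two are assumed equal) and $M$ and $N$ are the numbers of input and output channels (channels or depth).

The input and output feature maps are also assumed to have the same spatial size (width and height).

The kernel size is $D_K \times D_K$. These assumptions describe a special case, but they do not affect the generality of the analysis below.

For a standard convolution, the computational cost is: $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$

For the depthwise convolution, the cost is: $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$

For the pointwise convolution, the cost is: $M \cdot N \cdot D_F \cdot D_F$

So the total cost of a depthwise separable convolution is:

$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$

Comparing depthwise separable convolution with standard convolution gives:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$

In general $N$ is fairly large, so the $1/N$ term is small. With a 3x3 kernel, a depthwise separable convolution therefore needs roughly 8 to 9 times less computation than a standard convolution. As the comparison later shows, the number of parameters also drops considerably.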
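The cost comparison can be checked numerically with a small sketch. The function names and the example layer (3x3 kernel, 512 input and output channels, 14x14 feature map) are chosen here purely for illustration.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-adds of a standard convolution: DK*DK*M*N*DF*DF."""
    return dk * dk * m * n * df * df


def separable_conv_cost(dk, m, n, df):
    """Multiply-adds of a depthwise separable convolution."""
    depthwise = dk * dk * m * df * df   # one DKxDK filter per input channel
    pointwise = m * n * df * df         # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
dk, m, n, df = 3, 512, 512, 14
std = standard_conv_cost(dk, m, n, df)
sep = separable_conv_cost(dk, m, n, df)

ratio = sep / std
closed_form = 1 / n + 1 / dk ** 2   # 1/N + 1/DK^2
print(std, sep)
print(ratio, closed_form)
print("reduction factor:", std / sep)
```

For this layer the reduction factor comes out a little under 9, because the $1/N$ term is small but not zero.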

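2

General structure of MobileNet:

The depthwise separable convolution described above is the basic building block of MobileNet, but in practice batchnorm is added and the ReLU activation function is used, so the basic structure of a depthwise separable convolution is as shown in Figure 3.

Figure 3: Depthwise separable convolution with BN and ReLU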

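Table 1: MobileNet network structure

The MobileNet network structure is shown in Table 1.

It starts with a 3x3 standard convolution, followed by a stack of depthwise separable convolutions, some of whose depthwise convolutions downsample with strides=2.

Average pooling then reduces the feature map to 1x1, a fully connected layer sized to the number of predicted classes is added, and the last layer is a softmax.

Counting depthwise and pointwise convolutions separately, the whole network has 28 layers (Avg Pool and Softmax are not counted).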
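The layer count and the feature map sizes can be traced with a short sketch. The block list below (output channels and stride for each depthwise separable block) is taken from the TensorFlow code in this article; the helper names and the assumption of a 224x224 input with "same" padding are illustrative.

```python
# (output channels, stride) of the 13 depthwise separable blocks,
# in the order used by the TensorFlow code in this article.
BLOCKS = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2),
          (512, 1), (512, 1), (512, 1), (512, 1), (512, 1),
          (1024, 2), (1024, 1)]


def ceil_div(a, b):
    """Output size of a strided convolution with 'same' padding."""
    return -(-a // b)


def trace_shapes(input_size=224):
    """Return (layer name, spatial size, channels) after each conv stage."""
    size = ceil_div(input_size, 2)          # first 3x3 standard conv, stride 2
    shapes = [("conv_1", size, 32)]
    for idx, (channels, stride) in enumerate(BLOCKS, start=2):
        size = ceil_div(size, stride)       # only the depthwise conv strides
        shapes.append(("ds_conv_%d" % idx, size, channels))
    return shapes


shapes = trace_shapes()
for name, size, channels in shapes:
    print(name, size, size, channels)

# conv_1 + (depthwise + pointwise) per block + fully connected layer
num_layers = 1 + 2 * len(BLOCKS) + 1
print("layers:", num_layers)  # layers: 28
```

The final 7x7x1024 feature map is what the average pooling layer then reduces to 1x1.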

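We can also analyze how the parameters and computation are distributed across the network, as shown in Table 2. Almost all of the computation is concentrated in the 1x1 convolutions.

If you are familiar with low-level convolution implementations, you will know that convolution is usually implemented with the im2col approach, which requires reordering memory. When the kernel is 1x1 this step is unnecessary, so a faster low-level implementation is possible. The parameters are also concentrated mainly in the 1x1 convolutions; beyond that, the fully connected layer accounts for a share of the parameters.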
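The point about 1x1 kernels can be illustrated with a toy sketch: a 1x1 convolution is the same as flattening the feature map into an (H*W) x M matrix and multiplying it by an M x N weight matrix, with no patch extraction. This is a pure-Python illustration with made-up function names, not how any particular library implements it.

```python
def conv1x1_direct(x, weights):
    """Direct 1x1 convolution. x is [H][W][M], weights is [M][N]."""
    m, n = len(weights), len(weights[0])
    out = []
    for row in x:
        out_row = []
        for pixel in row:
            out_row.append([sum(pixel[c] * weights[c][o] for c in range(m))
                            for o in range(n)])
        out.append(out_row)
    return out


def conv1x1_as_matmul(x, weights):
    """Same result via a single matrix multiply on the flattened map."""
    h, width = len(x), len(x[0])
    m, n = len(weights), len(weights[0])
    # Flatten H x W x M into an (H*W) x M matrix; no patches are copied.
    rows = [pixel for row in x for pixel in row]
    # (H*W x M) times (M x N)
    prod = [[sum(r[c] * weights[c][o] for c in range(m)) for o in range(n)]
            for r in rows]
    # Reshape back to H x W x N.
    return [prod[i * width:(i + 1) * width] for i in range(h)]


x = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]
weights = [[1, 0, 1],
           [0, 1, 1]]
direct = conv1x1_direct(x, weights)
matmul = conv1x1_as_matmul(x, weights)
print(direct == matmul)  # True
```

For a larger kernel, each output pixel depends on a KxK neighbourhood, so the input would first have to be rearranged into patches (the im2col step) before the same matrix multiply could be used.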

Table 2: Distribution of computation and parameters in MobileNet

Table 3: Performance comparison of MobileNet with GoogLeNet and VGG16
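The distribution can be roughly reproduced from the architecture alone. The sketch below counts multiply-adds and weights per layer type for a 224x224 input and 1000 classes; it counts convolution and fully connected weights only, ignoring biases and batchnorm parameters, which is an assumption of this example. The block list follows the TensorFlow code in this article.

```python
# (output channels, stride) of the 13 depthwise separable blocks.
BLOCKS = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2),
          (512, 1), (512, 1), (512, 1), (512, 1), (512, 1),
          (1024, 2), (1024, 1)]


def mobilenet_costs(input_size=224, num_classes=1000):
    """Return (mult-adds, weights) per layer type for the baseline model."""
    size = input_size // 2      # first 3x3 standard conv uses stride 2
    channels = 32
    madds = {"conv3x3": 9 * 3 * channels * size * size,
             "depthwise": 0, "pointwise": 0, "fc": 0}
    params = {"conv3x3": 9 * 3 * channels,
              "depthwise": 0, "pointwise": 0, "fc": 0}
    for out_channels, stride in BLOCKS:
        size //= stride         # the depthwise conv does the downsampling
        madds["depthwise"] += 9 * channels * size * size
        params["depthwise"] += 9 * channels
        madds["pointwise"] += channels * out_channels * size * size
        params["pointwise"] += channels * out_channels
        channels = out_channels
    madds["fc"] = channels * num_classes
    params["fc"] = channels * num_classes
    return madds, params


madds, params = mobilenet_costs()
total_madds = sum(madds.values())
total_params = sum(params.values())
for kind in madds:
    print("%-10s mult-adds %6.2f%%  weights %6.2f%%" % (
        kind,
        100.0 * madds[kind] / total_madds,
        100.0 * params[kind] / total_params))
print("total mult-adds:", total_madds, "total weights:", total_params)
```

Under these assumptions the 1x1 convolutions account for about 95% of the multiply-adds and about 75% of the weights, with the fully connected layer holding most of the remaining weights.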

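3

Slimming down MobileNet:

What has been described so far is the baseline MobileNet model, but sometimes you need an even smaller model, and MobileNet then has to be slimmed down. Two hyperparameters are introduced for this: the width multiplier and the resolution multiplier. The first, the width multiplier, reduces the number of channels proportionally. It is denoted $\alpha$ and takes values in $(0, 1]$, so the numbers of input and output channels become $\alpha M$ and $\alpha N$. For a depthwise separable convolution with kernel size $D_K$ and feature map size $D_F$, the computational cost becomes:

$D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F$

Since most of the computation is in the second term, the width multiplier scales the computation by roughly $\alpha^2$, and the number of parameters drops as well.

The second, the resolution multiplier, reduces the size of the feature maps proportionally and is denoted $\rho$. For example, an input feature map of 224*224 can be reduced to 192*192. With the resolution multiplier added, the cost of a depthwise separable convolution is:

$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$

Note that the resolution multiplier affects only the amount of computation; it does not change the number of parameters.

Introducing these two parameters will certainly reduce MobileNet's performance; see the paper for the detailed experiments. In summary, they trade off accuracy against computation, and accuracy against model size.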
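The effect of the two multipliers on a single depthwise separable layer can be sketched as follows. The function names and the example layer are illustrative, and rounding the scaled channel counts and feature map size to integers is an assumption of this sketch.

```python
def separable_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Multiply-adds of a depthwise separable conv with width multiplier
    alpha and resolution multiplier rho."""
    m_thin = round(alpha * m)
    n_thin = round(alpha * n)
    df_small = round(rho * df)
    depthwise = dk * dk * m_thin * df_small * df_small
    pointwise = m_thin * n_thin * df_small * df_small
    return depthwise + pointwise


def separable_params(dk, m, n, alpha=1.0):
    """Weights of the same layer; rho does not appear because the feature
    map size has no effect on the number of weights."""
    m_thin = round(alpha * m)
    n_thin = round(alpha * n)
    return dk * dk * m_thin + m_thin * n_thin


# Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
base = separable_cost(3, 512, 512, 14)
half_width = separable_cost(3, 512, 512, 14, alpha=0.5)
half_res = separable_cost(3, 512, 512, 14, rho=0.5)

print("alpha=0.5 cost ratio:", base / half_width)  # close to 1 / alpha^2 = 4
print("rho=0.5 cost ratio:", base / half_res)      # 1 / rho^2 = 4
print("weights:", separable_params(3, 512, 512),
      "->", separable_params(3, 512, 512, alpha=0.5))
```

The width multiplier gives slightly less than a 4x saving at $\alpha = 0.5$ because the depthwise term scales with $\alpha$ rather than $\alpha^2$, while the resolution multiplier scales both terms by $\rho^2$.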

4

TensorFlow implementation of MobileNet:

TensorFlow's nn library provides a depthwise convolution operator, tf.nn.depthwise_conv2d, so MobileNet is easy to implement in TensorFlow:

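import tensorflow as tf  # written against the TensorFlow 1.x API (tf.variable_scope)

# conv2d, depthwise_conv2d, bacthnorm, avg_pool and fc are helper functions
# that are not shown in this listing and must be defined before use.


class MobileNet(object):
    def __init__(self, inputs, num_classes=1000, is_training=True,
                 width_multiplier=1, scope="MobileNet"):
        """
        An implementation of MobileNet (ref: https://arxiv.org/abs/1704.04861)
        :param inputs: 4-D Tensor of [batch_size, height, width, channels]
                       (the shape comments below assume 224x224 inputs
                       and width_multiplier=1)
        :param num_classes: number of classes
        :param is_training: Boolean, whether or not the model is training
        :param width_multiplier: float, controls the size of model
        :param scope: Optional scope for variables
        """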
        self.inputs = inputs
        self.num_classes = num_classes
        self.is_training = is_training
        self.width_multiplier = width_multiplier

        # construct model
        with tf.variable_scope(scope):
            # conv1
            net = conv2d(inputs, "conv_1", round(32 * width_multiplier), filter_size=3,
                         strides=2)  # ->[N, 112, 112, 32]
            net = tf.nn.relu(bacthnorm(net, "conv_1/bn", is_training=self.is_training))
            net = self._depthwise_separable_conv2d(net, 64, self.width_multiplier,
                                "ds_conv_2") # ->[N, 112, 112, 64]
            net = self._depthwise_separable_conv2d(net, 128, self.width_multiplier,
                                "ds_conv_3", downsample=True) # ->[N, 56, 56, 128]
            net = self._depthwise_separable_conv2d(net, 128, self.width_multiplier,
                                "ds_conv_4") # ->[N, 56, 56, 128]
            net = self._depthwise_separable_conv2d(net, 256, self.width_multiplier,
                                "ds_conv_5", downsample=True) # ->[N, 28, 28, 256]
            net = self._depthwise_separable_conv2d(net, 256, self.width_multiplier,
                                "ds_conv_6") # ->[N, 28, 28, 256]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_7", downsample=True) # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_8") # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_9")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_10")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_11")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 512, self.width_multiplier,
                                "ds_conv_12")  # ->[N, 14, 14, 512]
            net = self._depthwise_separable_conv2d(net, 1024, self.width_multiplier,
                                "ds_conv_13", downsample=True) # ->[N, 7, 7, 1024]
            net = self._depthwise_separable_conv2d(net, 1024, self.width_multiplier,
                                "ds_conv_14") # ->[N, 7, 7, 1024]
            net = avg_pool(net, 7, "avg_pool_15")
            net = tf.squeeze(net, [1, 2], name="SpatialSqueeze")
            self.logits = fc(net, self.num_classes, "fc_16")
            self.predictions = tf.nn.softmax(self.logits)

    def _depthwise_separable_conv2d(self, inputs, num_filters, width_multiplier,
                                    scope, downsample=False):
        """depthwise separable convolution 2D function"""
        num_filters = round(num_filters * width_multiplier)
        strides = 2 if downsample else 1

        with tf.variable_scope(scope):
            # depthwise conv2d
            dw_conv = depthwise_conv2d(inputs, "depthwise_conv", strides=strides)
            # batchnorm
            bn = bacthnorm(dw_conv, "dw_bn", is_training=self.is_training)
            # relu
            relu = tf.nn.relu(bn)
            # pointwise conv2d (1x1)
            pw_conv = conv2d(relu, "pointwise_conv", num_filters)
            # bn
            bn = bacthnorm(pw_conv, "pw_bn", is_training=self.is_training)
            return tf.nn.relu(bn)