Classic Deep Learning Networks: ResNet and Its Variants (ResNeXt)

ResNeXt paper: https://arxiv.org/pdf/1611.05431.pdf
Keras code: https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnext.py

PyTorch code: https://github.com/prlz77/ResNeXt.pytorch

1 Introduction

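Most other deep networks improve accuracy by increasing depth or width (width here means the number of filters in a convolution), but this raises model complexity and parameter count. The authors therefore build on ResNet and propose a new residual unit that improves accuracy while keeping the parameter count of the existing network unchanged. The resulting architecture is called ResNeXt. It draws mainly on ideas from VGG and Inception. VGG is built by stacking: hyperparameters such as filter_size and filter_channel are kept identical within a module. Inception uses a split-transform-merge strategy: it first maps the input into embedding spaces with 1×1 convolutions, then transforms the embedded features with convolutions such as 3×3 and 5×5, and finally merges the features of the different branches by concatenation. However, each Inception branch has many hyperparameters that must be designed by hand. The authors therefore combine the stacking idea of VGG with the split-transform-merge structure of Inception, improving accuracy while leaving model complexity essentially unchanged or even lower. The structure is shown in Figure 1.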

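Figure 1: Left: the basic residual unit. Right: the residual unit in ResNeXt.
The proposed residual unit resembles Inception, but every branch has the same hyperparameters, which amounts to stacking horizontally and avoids heavy hyperparameter tuning. The authors call the number of branches (32 in the figure) the cardinality, defined in the paper as "the size of the set of transformations", and show experimentally that increasing cardinality is more effective than increasing the depth or width of the network.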

2 The Proposed Residual Unit

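Idea:
For a simple neuron, the computation can be written as Equation (1), where the input $x$ is a $D$-dimensional vector. As Figure 2 shows, a neuron already follows a split-transform-merge strategy: it splits the input into low-dimensional embeddings (here, the individual components $x_{i}$), transforms each embedding with a simple scaling $w_{i}x_{i}$, and merges the results with $\sum_{i=1}^{D}$. Following this template, the proposed block can be written as Equation (2), where $\mathcal{T}_{i}$ can be an arbitrary function and $C$ is the cardinality; $C$ need not equal $D$ and can be any number. In the authors' design all $\mathcal{T}_{i}$ share the same topology, the bottleneck-shaped structure shown on the right of Figure 1. Adding the identity shortcut turns Equation (2) into Equation (3). Unlike Inception-ResNet, every branch here has the same structure.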
$\sum_{i=1}^{D} w_{i} x_{i} \qquad (1)$

$\mathcal{F}(\mathbf{x})=\sum_{i=1}^{C} \mathcal{T}_{i}(\mathbf{x}) \qquad (2)$

$\mathbf{y}=\mathbf{x}+\sum_{i=1}^{C} \mathcal{T}_{i}(\mathbf{x}) \qquad (3)$

Figure 2: The simple structure of a neuron
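
Equations (1) to (3) can be traced with a few lines of plain Python. This is only a minimal sketch: the `make_scale` helper is a hypothetical stand-in for a bottleneck branch (the real $\mathcal{T}_{i}$ is a stack of convolutions), and the values of `x` and the weights are made up for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def neuron(x: Sequence[float], w: Sequence[float]) -> float:
    """Equation (1): scale each component x_i by w_i, then sum over D."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def aggregate(x: Vector, transforms: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Equation (2): F(x) = sum_i T_i(x), summed element-wise over C branches."""
    outputs = [t(x) for t in transforms]
    return [sum(values) for values in zip(*outputs)]


def residual_block(x: Vector, transforms: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Equation (3): y = x + sum_i T_i(x)."""
    return [x_j + f_j for x_j, f_j in zip(x, aggregate(x, transforms))]


def make_scale(factor: float) -> Callable[[Vector], Vector]:
    # Hypothetical branch: every branch has the same form, as in ResNeXt.
    return lambda v: [factor * v_j for v_j in v]


x = [1.0, 2.0, 3.0]
branches = [make_scale(0.5) for _ in range(4)]  # cardinality C = 4

print(neuron(x, [1.0, 0.0, -1.0]))   # -2.0
print(aggregate(x, branches))        # [2.0, 4.0, 6.0]
print(residual_block(x, branches))   # [3.0, 6.0, 9.0]
```

Note that `C` (the number of branches, 4 here) is independent of `D` (the length of `x`, 3 here).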

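ResNeXt block structure
The authors show three equivalent forms of the ResNeXt block. fig3.a is the aggregated residual transformations described above. fig3.b applies two convolution layers per path, concatenates the outputs, and then applies one more convolution; this resembles Inception-ResNet, except that all paths share the same topology. fig3.c uses grouped convolutions, in which each group of filters is convolved with only its own subset of the input channels. For background on grouped convolution, see: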

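Group Convolution, Depthwise Convolution, and Global Depthwise Convolution
In practice, the fig3.c form is the one that is used. Here fig3.c uses 32 groups, each with 4 input and 4 output channels, and the group outputs are concatenated at the end. fig3.c looks much like the left side of Figure 1; the difference is that its middle layer has more filters (128 here versus 64 in Figure 1). Why more? So that in the comparison experiments the two blocks have the same parameter count, that is, the same model complexity. The explanation follows.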
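
To make the grouped-convolution idea concrete, the sketch below looks at a single spatial position, where a convolution reduces to a matrix-vector product over channels. It checks that "split the channels into groups, transform each group, concatenate" gives the same result as one dense weight matrix that is zero outside its diagonal blocks. The sizes (`GROUPS = 4`, `WIDTH = 2`) and the random weights are illustrative assumptions, not the 32 × 4 setting of fig3.c.

```python
import random


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def grouped_linear(x, group_weights):
    """Split channels into equal groups, transform each group, concatenate."""
    size = len(x) // len(group_weights)
    out = []
    for g, weight in enumerate(group_weights):
        out.extend(matvec(weight, x[g * size:(g + 1) * size]))
    return out


def block_diagonal(group_weights):
    """Build the equivalent dense matrix: group weights on the diagonal, zeros elsewhere."""
    groups = len(group_weights)
    rows, cols = len(group_weights[0]), len(group_weights[0][0])
    dense = [[0.0] * (cols * groups) for _ in range(rows * groups)]
    for g, weight in enumerate(group_weights):
        for r, row in enumerate(weight):
            for c, value in enumerate(row):
                dense[g * rows + r][g * cols + c] = value
    return dense


random.seed(0)
GROUPS, WIDTH = 4, 2  # illustrative sizes only
x = [random.uniform(-1, 1) for _ in range(GROUPS * WIDTH)]
weights = [[[random.uniform(-1, 1) for _ in range(WIDTH)] for _ in range(WIDTH)]
           for _ in range(GROUPS)]

grouped_out = grouped_linear(x, weights)
dense_out = matvec(block_diagonal(weights), x)
max_diff = max(abs(a - b) for a, b in zip(grouped_out, dense_out))

grouped_params = GROUPS * WIDTH * WIDTH   # weights actually stored
dense_params = (GROUPS * WIDTH) ** 2      # weights of an ungrouped layer
print(max_diff, grouped_params, dense_params)
```

The grouped layer stores only `1 / GROUPS` of the weights of the ungrouped layer with the same total width; a 3×3 grouped convolution applies the same channel partition at every kernel offset.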

Figure 3: Equivalent ResNeXt blocks
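In the comparison experiments, the authors study the effect of cardinality mainly by adjusting the bottleneck width d, that is, the number of middle channels in each path. The relationship between cardinality and d is shown in Figure 4: the second row, d, gives the number of middle channels per path, and the last row gives the width of the whole block, which is the product of C in the first row and d in the second. For the original residual unit (left of Figure 1), each block has 256 · 64 + 3 · 3 · 64 · 64 + 64 · 256 ≈ 70k parameters, while the parameter count of the proposed block is given by Equation (4). To keep the two approximately equal, d = 4 when C = 32.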

Figure 4: Relationship between cardinality and d

$C \cdot(256 \cdot d+3 \cdot 3 \cdot d \cdot d+d \cdot 256) \qquad (4)$
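
The arithmetic behind Equation (4) is easy to check directly. The sketch below evaluates it for several (C, d) pairs. The pairs other than (32, 4) are taken from Table 2 of the ResNeXt paper, which Figure 4 appears to reproduce, so treat them as an assumption and verify them against the figure. The function names are illustrative.

```python
def resnet_bottleneck_params(in_channels=256, width=64):
    """Parameters of the original bottleneck: 1x1 reduce, 3x3, 1x1 expand."""
    return in_channels * width + 3 * 3 * width * width + width * in_channels


def resnext_block_params(cardinality, d, in_channels=256):
    """Equation (4): C * (256*d + 3*3*d*d + d*256)."""
    return cardinality * (in_channels * d + 3 * 3 * d * d + d * in_channels)


# (C, d) pairs assumed from Table 2 of the paper.
settings = [(1, 64), (2, 40), (4, 24), (8, 14), (32, 4)]
counts = {(c, d): resnext_block_params(c, d) for c, d in settings}

print(resnet_bottleneck_params())  # 69632
for (c, d), n in counts.items():
    print(f"C={c:2d} d={d:2d} width={c * d:3d} params={n}")
# C= 1 d=64 width= 64 params=69632
# C= 2 d=40 width= 80 params=69760
# C= 4 d=24 width= 96 params=69888
# C= 8 d=14 width=112 params=71456
# C=32 d= 4 width=128 params=70144
```

Every setting stays close to 70k parameters while the block width C · d grows from 64 to 128.
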
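Finally, the overall network configuration is given in Figure 5; the two networks have essentially the same parameter count. In addition, each convolution follows the ResNet v1 form, with BN and ReLU placed after the convolution, rather than the ResNet v2 (pre-activation) structure.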

Figure 5: Network configurations of ResNet-50 and ResNeXt-50

4 Selected Experiments

Cardinality vs. Width
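The authors run experiments under the settings of Figure 4. With complexity held fixed, increasing cardinality means reducing the per-branch width d, and accuracy improves as cardinality increases. The 4d setting improves only slightly over 14d, so the authors do not reduce d any further.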

Figure 6: Comparison of cardinality settings at the same complexity

Increasing Cardinality vs. Deeper/Wider
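The results are shown in Figure 7. Increasing width simply means increasing the number of filter channels. The first model is the baseline; the third and fourth increase depth and width respectively, and reduce the error by 0.3% and 0.7%. The fifth doubles the cardinality and reduces the error by 1.3%, and the sixth raises the cardinality to 64 and reduces it by 1.6%. Increasing cardinality is clearly more effective than increasing depth or width.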

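In essence: compared with ResNet, ResNeXt introduces the group operation and at the same time widens the network (without increasing complexity, raising cardinality indirectly increases width). As Figure 1 shows, the first two convolution layers of each block are twice as wide as in the original ResNet. This extra width is probably the main source of the improvement. Widening without the group operation would clearly add a lot of computation and parameters, which is what the group operation avoids.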
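
A rough parameter count illustrates this point. The sketch below compares three bottlenecks with 256 input and output channels: the original ResNet block (width 64), the same block naively widened to 128 without groups, and the grouped form with 32 groups. The `bottleneck_params` helper is hypothetical and ignores BN parameters.

```python
def bottleneck_params(in_channels, width, groups=1):
    """Count conv weights in a 1x1 -> 3x3 (optionally grouped) -> 1x1 bottleneck."""
    per_group = width // groups
    reduce = in_channels * width                           # 1x1 reduce
    conv3x3 = 3 * 3 * per_group * per_group * groups       # grouped 3x3
    expand = width * in_channels                           # 1x1 expand
    return reduce + conv3x3 + expand


resnet = bottleneck_params(256, 64)                # original ResNet block
wide_dense = bottleneck_params(256, 128)           # doubled width, no groups
resnext = bottleneck_params(256, 128, groups=32)   # doubled width, 32 groups

print(resnet, wide_dense, resnext)  # 69632 212992 70144
```

Doubling the width without groups roughly triples the parameter count, while the grouped version stays within about 1% of the original block.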

5 ResNeXt Code

The code is implemented with the tf.keras module of TensorFlow 2.0, with reference to the corresponding Keras code.

"""
ResNeXt models for TensorFlow 2.0, tf.keras.
# Reference
- [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/pdf/1611.05431.pdf)

"""

import os

# Must be set before TensorFlow is imported to silence its C++ INFO/WARNING logs.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow.keras as keras
from tensorflow.keras import Sequential, layers


# Global hyperparameters.
# CARDINALITY is the number of groups (paths) in each block.
# DEPTH is the number of filter channels per group in conv2_x.
# BASEWIDTH is the reference bottleneck width used to scale DEPTH across stages.
# The grouped convolution in conv2_x is 128 channels wide (CARDINALITY * DEPTH);
# 64 is kept here to match the ResNet bottleneck width.
CARDINALITY = 32
DEPTH = 4
BASEWIDTH = 64


def _grouped_convolution_block(inputs, grouped_channels=4, cardinality=32, stride=1):
    """Add a grouped 3x3 convolution block, equivalent to the form in the original paper.

    Args:
        inputs: input tensor.
        grouped_channels: number of convolution channels in each group.
        cardinality: number of groups.
        stride: stride of the convolution; downsamples if > 1.

    Returns:
        the output tensor.

    """
    if cardinality == 1:
        # A single group is an ordinary convolution.
        x = Sequential([layers.Conv2D(grouped_channels, (3, 3), strides=stride, padding='same', use_bias=False),
                        layers.BatchNormalization(),
                        layers.Activation('relu')])(inputs)
        return x

    group_list = []
    for c in range(cardinality):
        # Bind c as a default argument; otherwise every Lambda would see the
        # final loop value when the model is executed later.
        x = layers.Lambda(lambda z, c=c: z[..., c*grouped_channels: (c+1)*grouped_channels])(inputs)
        x = layers.Conv2D(grouped_channels, (3, 3), strides=stride, padding='same', use_bias=False)(x)
        group_list.append(x)
    x = layers.concatenate(group_list, axis=-1)
    # BN + ReLU after the convolution, matching the cardinality == 1 branch.
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    return x


def _bottleneck(inputs, filter_channels=64, cardinality=CARDINALITY, stride=1):
    """Add the ResNeXt bottleneck block.

    Args:
        inputs: input tensor.
        filter_channels: base bottleneck filter channels; conv2_x - conv5_x use [64, 128, 256, 512] respectively.
        cardinality: number of groups.
        stride: stride of the grouped convolution; downsamples if > 1.

    Returns:
        the output tensor.

    """
    # Channels per group scale with the stage: 4, 8, 16, 32 for conv2_x - conv5_x.
    grouped_channels = DEPTH * filter_channels // BASEWIDTH
    res = Sequential([layers.Conv2D(grouped_channels*cardinality, (1, 1), strides=1, use_bias=False),
                      layers.BatchNormalization(),
                      layers.Activation('relu')
                      ])(inputs)
    res = _grouped_convolution_block(res, grouped_channels, cardinality, stride)
    res = Sequential([layers.Conv2D(filter_channels*4, (1, 1), strides=1, use_bias=False),
                      layers.BatchNormalization(),
                      ])(res)
    if stride != 1 or inputs.shape[-1] != filter_channels*4:
        shortcut = Sequential([layers.Conv2D(filter_channels*4, (1, 1), strides=stride, use_bias=False),
                               layers.BatchNormalization()
                               ])(inputs)
    else:
        shortcut = inputs
    out = layers.Activation('relu')(layers.add([res, shortcut]))
    return out


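def _make_block_layer(inputs, filter_channels, num_blocks, stride):
    """Build one ResNeXt stage (a stack of bottleneck blocks).

    Args:
        inputs: input tensor.
        filter_channels: base bottleneck channels; each block outputs filter_channels * 4.
        num_blocks: number of blocks in the stage.
        stride: stride of the first block; later blocks use stride 1.

    Returns:
        the output tensor of the stage.

    """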
    x = _bottleneck(inputs, filter_channels, stride=stride)
    for i in range(1, num_blocks):
        x = _bottleneck(x, filter_channels, stride=1)
    return x


def _create_resnext(inputs, num_blocks, num_classes=100):
    """Create the ResNeXt network.

    Args:
        inputs: input tensor.
        num_blocks: list with the number of blocks in each of the four stages.
        num_classes: number of classes.

    Returns:
        the output tensor (unnormalized class logits).

    """

    conv1_x = Sequential([layers.Conv2D(64, (7, 7), strides=2, padding='same', use_bias=False),
                          layers.BatchNormalization(),
                          layers.Activation('relu'),
                          layers.MaxPool2D((3, 3), strides=2, padding='same')
                          ])(inputs)
    conv2_x = _make_block_layer(conv1_x, 64, num_blocks[0], stride=1)
    conv3_x = _make_block_layer(conv2_x, 128, num_blocks[1], stride=2)
    conv4_x = _make_block_layer(conv3_x, 256, num_blocks[2], stride=2)
    conv5_x = _make_block_layer(conv4_x, 512, num_blocks[3], stride=2)
    avg_pool = layers.GlobalAveragePooling2D()(conv5_x)
    fc = layers.Dense(num_classes)(avg_pool)
    return fc


def resnext(inputs, num_classes=100, network_type=50):
    """Return the output tensor of a ResNeXt network (32x4d by default).

    network_type selects the depth: 50, 101 or 152.
    """
    blocks_per_type = {50: [3, 4, 6, 3], 101: [3, 4, 23, 3], 152: [3, 8, 36, 3]}
    if network_type not in blocks_per_type:
        raise ValueError('network_type must be one of 50, 101 or 152, got %r' % (network_type,))
    return _create_resnext(inputs, blocks_per_type[network_type], num_classes)


if __name__ == '__main__':
    model_inputs = keras.Input(shape=(224, 224, 3), name='inputs')
    model_outputs = resnext(model_inputs, network_type=152)
    model = keras.Model(inputs=model_inputs, outputs=model_outputs, name='outputs')
    model.summary()
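
As a quick sanity check on the listing, the channel widths it produces for each stage can be computed without TensorFlow. The sketch below repeats the `DEPTH * filter_channels / BASEWIDTH` rule from `_bottleneck` in plain Python; `stage_widths` is a hypothetical helper written for this check and is not part of the listing.

```python
CARDINALITY = 32
DEPTH = 4
BASEWIDTH = 64


def stage_widths(filter_channels, cardinality=CARDINALITY):
    """Mirror the width arithmetic of _bottleneck for one stage."""
    grouped_channels = DEPTH * filter_channels // BASEWIDTH
    return {
        'grouped_channels': grouped_channels,                   # channels per group
        'group_conv_width': grouped_channels * cardinality,     # width of the 3x3 grouped conv
        'output_channels': filter_channels * 4,                 # width after the last 1x1 conv
    }


stages = {name: stage_widths(fc) for name, fc in
          [('conv2_x', 64), ('conv3_x', 128), ('conv4_x', 256), ('conv5_x', 512)]}

for name, widths in stages.items():
    print(name, widths)
# conv2_x {'grouped_channels': 4, 'group_conv_width': 128, 'output_channels': 256}
# conv3_x {'grouped_channels': 8, 'group_conv_width': 256, 'output_channels': 512}
# conv4_x {'grouped_channels': 16, 'group_conv_width': 512, 'output_channels': 1024}
# conv5_x {'grouped_channels': 32, 'group_conv_width': 1024, 'output_channels': 2048}
```

In conv2_x the grouped convolution is 128 channels wide with 4 channels per group, which matches the 32 × 4d block described for fig3.c.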

References

1. Group Convolution, Depthwise Convolution, and Global Depthwise Convolution
2. ResNeXt Algorithm Explained
