Classic Network Architectures (6): DenseNet (Densely Connected Networks)

References for this section:《深度学习之PyTorch物体检测实战》and Dive into Deep Learning.

Function Decomposition

ResNet significantly changed the view of how to parametrize the functions in deep networks. DenseNet is to some extent the logical extension of this. To understand how to arrive at it, let us take a small detour to theory. Recall the Taylor expansion for functions. For scalars it can be written as
$$f(x) = f(0) + f'(0)\,x + \frac{1}{2} f''(0)\, x^2 + \frac{1}{6} f'''(0)\, x^3 + o(x^3)$$

The key point is that it decomposes the function into increasingly higher order terms. In a similar vein, ResNet decomposes functions into
$$f(x) = x + g(x)$$

That is, ResNet decomposes $f$ into a simple linear term and a more complex nonlinear one. What if we want to go beyond two terms? A solution was proposed by [Huang et al., 2017] in the form of DenseNet, an architecture that reported record performance on the ImageNet dataset.
[Fig. 5.10: ResNet (addition across cross-layer connections) vs. DenseNet (concatenation across cross-layer connections)]
As shown in Fig. 5.10, the key difference between ResNet and DenseNet is that in the latter case outputs are concatenated rather than added. As a result we perform a mapping from $x$ to its values after applying an increasingly complex sequence of functions.
$$x \to \left[x,\; f_1(x),\; f_2\big(x, f_1(x)\big),\; f_3\big(x, f_1(x), f_2(x, f_1(x))\big),\; \ldots\right]$$

[Figure: dense connections in DenseNet — the last layer is connected to all preceding layers]
The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers.

Building on ResNet, DenseNet maximizes the information flow between earlier and later layers. By densely connecting every preceding layer to each subsequent layer, it reuses features along the channel dimension, which lets it outperform ResNet with fewer parameters and less computation.

The main building blocks of DenseNet are the dense block and the transition layer. The former defines how inputs and outputs are concatenated, while the latter controls the number of channels so that it does not grow too large.

Dense block

DenseNet uses the "BN-ReLU-Conv" ordering from the improved ResNet. The $1\times 1$ convolution here again serves to reduce the channel count and hence the computation, i.e. it is the so-called bottleneck layer. Take a DenseBlock consisting of 32 conv_blocks as an example: the input of the 32nd conv_block is the concatenation of the outputs of the previous 31 layers, each of which outputs 32 channels (the growth rate). Without the bottleneck layer, the input to the 32nd layer's $3\times 3$ convolution would have 31 × 32 + (the number of output channels of the previous DenseBlock) channels. With the bottleneck layer, the $1\times 1$ convolution typically has 4 × growthRate = 128 output channels, which greatly reduces the computation compared with the nearly 1000 channels it would otherwise have to process.

import torch
from torch import nn


def conv_block(input_channels, output_channels):
    """BN-ReLU-Conv bottleneck block: a 1 x 1 convolution squeezes the channel
    count before the 3 x 3 convolution. output_channels is the growth rate."""
    # The 1 x 1 bottleneck usually has 4 * growthRate channels
    inter_channels = 4 * output_channels

    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(input_channels, inter_channels, kernel_size=1, bias=False),

        nn.BatchNorm2d(inter_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(inter_channels, output_channels, kernel_size=3, padding=1, bias=False),
        )
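
A quick shape check of the bottleneck, reusing the numbers from the example above (992 = 31 × 32 input channels, ignoring the previous block's output; the batch size and spatial size are arbitrary):

blk = conv_block(992, 32)
X = torch.randn(4, 992, 8, 8)
# The 1 x 1 convolution first squeezes 992 channels down to 4 * 32 = 128,
# then the 3 x 3 convolution emits 32 (= growth rate) output channels
print(blk(X).shape)    # torch.Size([4, 32, 8, 8])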

A dense block consists of multiple conv_blocks, each using the same number of output channels. In the forward pass, however, we concatenate the input and output of each block along the channel dimension.

# output_channels is also known as the growth rate (growthRate)
class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, output_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(output_channels*i+input_channels, output_channels))
        self.net = nn.Sequential(*layer)
        
    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = torch.cat((X, Y), dim=1)
        return X

In the example below, we define a dense block with 2 conv_blocks, each with 10 output channels. With a 3-channel input, we get an output with 3 + 2 × 10 = 23 channels. The number of output channels of each conv_block controls how fast the channel count grows relative to the input, which is why it is also called the growth rate.

blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape
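
The printed shape should be torch.Size([4, 23, 8, 8]): the 3 input channels plus the 2 × 10 channels contributed by the two conv_blocks.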

Transition layer

Since every dense block increases the number of channels, stacking too many of them leads to an overly complex model. A transition layer is used to control the model complexity: it reduces the number of channels with a $1\times 1$ convolutional layer and halves the height and width with an average pooling layer of stride 2, further reducing the complexity.

def transition_block(input_channels, output_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), 
        nn.ReLU(inplace=True),
        nn.Conv2d(input_channels, output_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
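
Applying a transition layer to the 23-channel output of the dense block above, and picking 10 output channels for illustration, halves the spatial size and shrinks the channel count:

blk = transition_block(23, 10)
print(blk(Y).shape)    # torch.Size([4, 10, 4, 4])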

The DenseNet model


  • DenseNet first uses the same single convolutional layer and max pooling layer as ResNet.
  • Where ResNet then stacks four modules built from residual blocks, DenseNet uses four dense blocks. As with ResNet, we can choose how many convolutional layers each dense block contains; here we use 4, matching ResNet-18. The number of channels of each convolutional layer in a dense block (the growth rate) is set to 32, so each dense block adds 128 channels. ResNet shrinks the height and width between modules with residual blocks of stride 2; here transition layers are used instead, which halve the height and width and also halve the number of channels (the resulting channel counts are worked through right after this list).
  • As in ResNet, a global pooling layer and a fully connected layer are attached at the end to produce the output.
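
Working through the channel counts: the stem outputs 64 channels, each dense block adds 4 × 32 = 128 channels, and each of the three transition layers halves the count, so the channels evolve as 64 → 192 → 96 → 224 → 112 → 240 → 120 → 248. The classifier therefore operates on 248 channels.
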
class DenseNet(nn.Module):
    def __init__(self, input_channels, class_num):
        super(DenseNet, self).__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(input_channels, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), 
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        
        num_channels, growth_rate = 64, 32
        num_convs_in_dense_blocks = [4, 4, 4, 4]
        blks = []
        for i, num_convs in enumerate(num_convs_in_dense_blocks):
            blks.append(DenseBlock(num_convs, num_channels, growth_rate))
            # This is the number of output channels in the previous dense block
            num_channels += num_convs * growth_rate
            # A transition layer that halves the number of channels is added
            # between the dense blocks
            if i != len(num_convs_in_dense_blocks) - 1:
                blks.append(transition_block(num_channels, num_channels // 2))
                num_channels = num_channels // 2
        self.blks = nn.Sequential(*blks)
        
        self.classifier = nn.Sequential(
            nn.BatchNorm2d(num_channels), 
            nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(num_channels, class_num))
                
    def forward(self, x):
        y = self.stem(x)
        y = self.blks(y)
        y = self.classifier(y)
        
        return y
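
A minimal forward-pass check (assuming single-channel 96 × 96 inputs, e.g. Fashion-MNIST resized to 96 × 96; any spatial size that survives the five downsampling steps works):

net = DenseNet(1, 10)
X = torch.randn(1, 1, 96, 96)
print(net(X).shape)    # torch.Size([1, 10])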

Why do we use average pooling rather than max-pooling in the transition layer?

Reference: https://stats.stackexchange.com/questions/413275/why-do-we-use-the-average-pooling-layers-instead-of-max-pooling-layers-in-the-de

Average pooling better represents the overall strength of a feature because gradients flow through all indices (in max pooling the gradient flows only through the index of the maximum), which mirrors DenseNet itself, where connections are built between any two layers.
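
A tiny PyTorch check of this gradient-flow claim (illustrative only, separate from the model code above):

import torch.nn.functional as F

x = torch.arange(4.).reshape(1, 1, 2, 2).requires_grad_()

F.avg_pool2d(x, kernel_size=2).sum().backward()
print(x.grad)    # tensor([[[[0.25, 0.25], [0.25, 0.25]]]]) -- every input gets a share

x.grad = None
F.max_pool2d(x, kernel_size=2).sum().backward()
print(x.grad)    # tensor([[[[0., 0.], [0., 1.]]]]) -- only the max element gets gradient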

High memory consumption

Reference: https://github.com/tensorflow/tensorflow/issues/12948

One problem for which DenseNet has been criticized is its high memory consumption.

  • Is this really the case? Try to change the input shape to 224×224 to see the actual (GPU) memory consumption.

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by batch normalization and concatenation operations grows quadratically with network depth. It is worth emphasizing that this is not a property inherent to DenseNets, but rather to the implementation.

DenseNet is an effective network design that relies on applying nn layers on recursive concatenations of data along the channel axis. Unfortunately, this has the side effect of quadratic memory growth in TensorFlow as completely new blocks of memory are allocated after each concat operation, resulting in poor performance during all phases of execution.
This is a feature request for a new allocation='shared' option for operations such as tf.concat(allocation='shared'). This would make it possible to utilize the Memory-Efficient Implementation of DenseNets, a paper which demonstrates that this memory utilization can be dramatically reduced through sharing of allocations.

This implementation uses a new strategy to reduce the memory consumption of DenseNets. We use checkpointing to compute the Batch Norm and concatenation feature maps. These intermediate feature maps are discarded during the forward pass and recomputed for the backward pass. This adds 15-20% of time overhead for training, but reduces feature map consumption from quadratic to linear.

This functionality would also be useful for any other application or future network design that employs recursive concatenations.
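
In PyTorch, one rough way to get the same effect is gradient checkpointing: wrap each conv_block in torch.utils.checkpoint so that its BN/ReLU/convolution intermediates are dropped in the forward pass and recomputed during backward. The class below is only a sketch of that idea (not the reference memory-efficient implementation), reusing the conv_block defined earlier:

import torch.utils.checkpoint as cp

class MemoryEfficientDenseBlock(nn.Module):
    """Same layout as DenseBlock above, but each conv_block runs under
    gradient checkpointing, trading extra compute for linear memory."""
    def __init__(self, num_convs, input_channels, growth_rate):
        super().__init__()
        self.net = nn.ModuleList(
            conv_block(growth_rate * i + input_channels, growth_rate)
            for i in range(num_convs))

    def forward(self, X):
        for blk in self.net:
            if self.training and X.requires_grad:
                # Recompute blk(X) in the backward pass instead of storing its
                # intermediate feature maps. Note: the BatchNorm running stats
                # inside blk get updated on both the forward and the recompute.
                Y = cp.checkpoint(blk, X)
            else:
                Y = blk(X)
            X = torch.cat((X, Y), dim=1)
        return X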

Why do we not need to concatenate terms if we are just interested in x and f(x) for ResNet? Why do we need this for more than two layers in DenseNet?

Reference: https://blog.csdn.net/u014380165/article/details/75142664

Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision.
This dense connectivity effectively gives every layer a direct connection to both the input and the loss, which alleviates the vanishing-gradient problem.

Design a DenseNet for fully connected networks and apply it to the Housing Price prediction task.

This is left as a placeholder, to be implemented later.
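
As a possible starting point (a minimal sketch; the full housing-price pipeline is still to be done), a fully connected analogue can concatenate features instead of channel maps. The input feature count (in_features) depends on how the housing data is preprocessed, and the training loop is omitted:

import torch
from torch import nn

class DenseMLPBlock(nn.Module):
    """Fully connected analogue of DenseBlock: every layer sees the
    concatenation of the original input and all previous layer outputs."""
    def __init__(self, num_layers, in_features, growth_rate):
        super().__init__()
        self.net = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm1d(in_features + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Linear(in_features + i * growth_rate, growth_rate))
            for i in range(num_layers))

    def forward(self, X):
        for layer in self.net:
            X = torch.cat((X, layer(X)), dim=1)
        return X

def dense_mlp_regressor(in_features, num_layers=4, growth_rate=32):
    # Single-output regressor for tabular features such as the house-price data
    out_features = in_features + num_layers * growth_rate
    return nn.Sequential(
        DenseMLPBlock(num_layers, in_features, growth_rate),
        nn.BatchNorm1d(out_features),
        nn.ReLU(inplace=True),
        nn.Linear(out_features, 1))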
