Improvements to FCN: dilation and DUC

Understanding Convolution for Semantic Segmentation

http://www.cnblogs.com/xiangs/p/9780895.html

https://blog.csdn.net/qq_21997625/ar


https://www.zhihu.com/question/54149221/answer/323880412

https://github.com/fregu856/deeplabv3/tree/master/visualization

Figure 4: Results of the Dense Pose texture transfer experiment. The goal of this task is to map the body-surface texture of every person in the input video frames onto a target texture. Row 1 shows target textures 1 and 2; rows 2 and 3 show, from left to right, the input image, the image transferred to texture 1, and the image transferred to texture 2.

At ECCV 2018, the three authors of paper [1] published a follow-up application of Dense Pose, namely "dense pose transfer",

HDC (hybrid dilated convolution) instead of bilinear upsampling; concat + conv instead of element-wise addition.

1. Cascading? res + dilation + F.upsample(conv(GAP))?

2. shuffleV3 + resnet + densenet + concat + conv

3. CRFs have been dropped since DeepLabv3, so they can be skipped.

4. One question: after res + RPN comes the pooler, and the pooler involves a level-selection step. If that level selection is not fine-grained enough, the later branches may suffer as well. Right now every modification to the individual branches only further processes a single RoI feature taken after the pooler; that RoI has already been accepted as a "candidate object", so shuffling the object feature around can indeed act as a kind of keypoint "refinement". But could the four feature levels before the pooler be optimised as well? If so, the RPN would get optimised along with them.

Also, might there be a slight problem with the level-assignment formula here?

The ResNet stages are at 1/4, 1/8, 1/16 and 1/32 resolution. Is there a way to add a small module here while keeping the spatial size the same ("same" padding)? With rates 1, 2, 3?

For output_stride = 8, the rates become 2 × (6, 12, 18). The feature maps from the parallel branches are then concatenated and passed through a 1×1 convolution with 256 filters (plus BN); the final output is produced by another 1×1 convolution.

The paper mainly proposes DUC (dense upsampling convolution) and HDC (hybrid dilated convolution). DUC uses extra channels to compensate for the resolution lost to convolution/pooling, while HDC is designed to remove the gridding effect that easily appears when dilated convolutions are stacked.
  DUC is learnable and can capture and recover fine details. For example, if a network downsamples by a factor of 16 and an object is less than 16 pixels long or wide, bilinear interpolation after downsampling can hardly recover it, and the final label map loses that detail. The output of DUC has the same resolution as the input image, and it can be integrated into an FCN for end-to-end segmentation.

2. Problems in the decoding stage
  Most semantic segmentation models use bilinear upsampling to obtain the output label map, but bilinear interpolation is not learnable and loses information. This paper proposes dense upsampling convolution (DUC), which recovers the full resolution of the label map in one shot by learning a set of upsampling filters that map the downsampled feature map back to the required resolution.
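As a rough sketch (my own, not the paper's code), a minimal DUC head can be written in PyTorch as a convolution that predicts d² × num_classes channels at the low resolution, followed by a PixelShuffle that rearranges them into a full-resolution score map. The class name DUCHead and the downsample_factor argument are illustrative only.

import torch
import torch.nn as nn

class DUCHead(nn.Module):
    """Minimal Dense Upsampling Convolution head (illustrative sketch).

    Instead of bilinear upsampling, a conv predicts d*d*num_classes channels
    at the low resolution, and PixelShuffle rearranges them into a
    full-resolution score map. All upsampling weights are learnable.
    """
    def __init__(self, in_channels, num_classes, downsample_factor=8):
        super().__init__()
        d = downsample_factor
        self.conv = nn.Conv2d(in_channels, num_classes * d * d, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(d)  # (N, C*d*d, h, w) -> (N, C, h*d, w*d)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# usage: an H/8 x W/8 feature map becomes an H x W score map
feat = torch.randn(1, 2048, 64, 64)            # e.g. a backbone output at stride 8
logits = DUCHead(2048, num_classes=19)(feat)   # -> (1, 19, 512, 512)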

The paper first proposes dense upsampling convolution, which can capture and decode finer details that bilinear interpolation cannot recover; it then proposes a hybrid dilated convolution framework, which enlarges the receptive field to aggregate global information and resolves the gridding problem caused by standard dilated convolution.
 

However, understanding how convolution and dilated convolution work is far from enough. To fully grasp the concept we have to revisit convolution itself and understand the design intuition behind it. The discussion below focuses on the application of dilated convolution to semantic segmentation.

I. Rethinking Convolution

In the VGG paper, which won one of the ImageNet competitions, the biggest contribution is not the VGG network itself but its clever observation about stacking convolutions.

This (stack of three 3 × 3 conv layers) can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).

In other words, a regularised 7 × 7 convolutional layer is equivalent to a stack of three 3 × 3 convolutional layers. Such a design not only greatly reduces the number of parameters; the convolution maps, carrying this built-in regularisation, also learn a generalisable, expressive feature space more easily. This is why the vast majority of deep convolutional networks today use small kernels.
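A quick parameter count makes the saving concrete (ignoring biases, assuming C input and C output channels):

$$3 \times (3 \times 3 \times C \times C) = 27C^{2} \;<\; 49C^{2} = 7 \times 7 \times C \times C,$$

while both arrangements cover the same 7 × 7 receptive field, and the stacked version inserts two extra non-linearities.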

However, deep CNNs still have some critical weaknesses for other tasks, the best known being the design of the up-sampling and pooling layers.

The main problems are:

  1. The up-sampling / pooling layer (e.g. bilinear interpolation) is deterministic, i.e. not learnable.
  2. Internal data structure is lost; hierarchical spatial information is lost.
  3. Information about small objects cannot be reconstructed (with four pooling layers, any object smaller than 2^4 = 16 pixels is, in theory, impossible to reconstruct).

With these problems in place, semantic segmentation remained stuck at a bottleneck with no obvious path to higher accuracy, and the design of dilated convolution neatly avoids them.

1. Dilated Convolution to the Rescue

The paper mentioned in the question, MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS, may (?) be the first to try dilated convolution for semantic segmentation. Both the TuSimple group and Google Brain later discussed dilated convolution in more detail; recommended reading: Understanding Convolution for Semantic Segmentation, and Rethinking Atrous Convolution for Semantic Image Segmentation.

For dilated convolution we can already see its advantages: it preserves the internal data structure and avoids down-sampling. But how to design an architecture built entirely on dilated convolution is a new problem.

Potential problem 1: The Gridding Effect

Suppose we simply stack several 3 × 3 kernels all with dilation rate 2; the following problem appears:

We find that the kernel is not contiguous: not all pixels take part in the computation, so treating the information in this checker-board fashion breaks its continuity. For pixel-level dense prediction tasks this is fatal.
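A small 1-D NumPy sketch of my own makes the gridding concrete: convolving the footprints of three stacked rate-2 kernels shows that only a checkerboard subset of input positions ever contributes to an output pixel, whereas HDC-style rates such as (1, 2, 3) leave no holes.

import numpy as np

def dilated_footprint(rate, k=3):
    """1-D footprint of a k-tap kernel with the given dilation rate."""
    f = np.zeros(rate * (k - 1) + 1, dtype=int)
    f[::rate] = 1
    return f

# which input positions feed one output pixel after three stacked rate-2 convs
support = np.array([1])
for rate in (2, 2, 2):
    support = np.convolve(support, dilated_footprint(rate))
print((support > 0).astype(int))
# -> [1 0 1 0 1 0 1 0 1 0 1 0 1] : a checkerboard, odd offsets are never used

# with HDC-style rates (1, 2, 3) the same check yields a hole-free support
support = np.array([1])
for rate in (1, 2, 3):
    support = np.convolve(support, dilated_footprint(rate))
print((support > 0).astype(int))
# -> all ones over the receptive field, no gridding holes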

Potential problem 2: Long-range information might not be relevant.

From the design motivation of dilated convolution we can infer that it is meant to capture long-range information. However, using only large dilation rates may help with segmenting large objects while being harmful for small ones. How to handle objects of different sizes at the same time is the key to designing a good dilated convolution network.

1.1 Towards a standardised design: Hybrid Dilated Convolution (HDC)

Example rate groups: [1, 2, 5], [1, 2, 3]

The advantage of HDC is that arbitrary rates can be used and the receptive field is enlarged, so it performs well on relatively large objects. The rates within a group should not share a common factor, otherwise the gridding problem persists; this is the biggest difference between HDC and ASPP.
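A minimal HDC-style group might look like the sketch below (my own illustration, not the authors' code); the rates (1, 2, 5) follow the example above and the module name HDCBlock is hypothetical.

import torch
import torch.nn as nn

class HDCBlock(nn.Module):
    """Sketch of a Hybrid Dilated Convolution group with rates (1, 2, 5).

    The rates within the group share no common factor > 1, so stacking them
    avoids the gridding effect caused by repeating a single rate.
    """
    def __init__(self, channels, rates=(1, 2, 5)):
        super().__init__()
        layers = []
        for r in rates:
            layers += [
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        # same spatial size, larger hole-free receptive field
        return self.block(x)

# usage: spatial size is preserved, so it can be dropped into a stride-keeping stage
x = torch.randn(1, 256, 64, 64)
y = HDCBlock(256)(x)  # -> (1, 256, 64, 64)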

1.2 An alternative answer to multi-scale segmentation: Atrous Spatial Pyramid Pooling (ASPP) in DeepLabV3

When segmenting objects at multiple scales, the following strategies are commonly used (panel (d) is the figure from DeepLabV2):

However, using dilated convolution (within a single convolutional branch) to capture multi-scale objects is an unorthodox approach. For example, if we use an HDC scheme tuned to capture a large (near) vehicle, it will no longer work for a small (distant) one; and if we then add another small-rate dilated convolution pathway just to recapture the small vehicles, the design becomes highly redundant.

Building on the pooling module of PSPNet from CUHK and SenseTime (whose network also achieved the state of the art that year), ASPP applies dilated convolutions with different rates on the decoder side, one independent branch per scale, merges the branches at the end of the network, and attaches a convolutional layer that outputs the predicted labels. This design avoids gathering redundant information in the encoder and focuses directly on the correlations between and within objects.

The ASPP of DeepLabv2, shown in the figure above, applies four dilated convolutions with different sampling rates on top of the feature map, which shows that sampling at different scales is effective; DeepLabv3 additionally adds BN layers to ASPP. Dilated convolutions with different rates do capture multi-scale information effectively, but as the rate increases, the number of valid filter weights (weights applied to the actual feature region rather than to the zero padding) becomes smaller and smaller.
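This shrinkage can be checked with a quick count (a sketch of my own, not from the paper): for a 3×3 kernel with dilation rate r applied at every position of an h×w feature map, count how many of the 9 taps land inside the feature map rather than on zero padding, averaged over positions.

import numpy as np

def mean_valid_taps(h, w, rate, k=3):
    """Average number of the k*k taps that fall inside an h x w feature map
    (rather than on zero padding) for a dilated conv applied at every position."""
    offsets = (np.arange(k) - k // 2) * rate          # e.g. [-r, 0, r] for k=3
    rows = np.arange(h)[:, None] + offsets[None, :]   # row index of each tap
    cols = np.arange(w)[:, None] + offsets[None, :]
    valid_r = ((rows >= 0) & (rows < h)).sum(axis=1)  # valid taps per row position
    valid_c = ((cols >= 0) & (cols < w)).sum(axis=1)  # valid taps per column position
    return (valid_r[:, None] * valid_c[None, :]).mean()

# e.g. a 33 x 33 feature map (roughly a 513 input at output_stride 16)
for rate in (6, 12, 18, 24, 30):
    print(rate, mean_valid_taps(33, 33, rate))
# as the rate approaches the feature-map size, only the centre tap stays valid,
# so the 3x3 dilated conv degenerates towards a 1x1 conv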

https://blog.csdn.net/qq_21997625/article/details/87080576 

import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPP(nn.Module):
    def __init__(self, in_channel=512, depth=256):
        super(ASPP,self).__init__()
        # global average pooling : init nn.AdaptiveAvgPool2d ;also forward torch.mean(,,keep_dim=True)
        self.mean = nn.AdaptiveAvgPool2d((1, 1))
        self.conv = nn.Conv2d(in_channel, depth, 1, 1)
        # k=1 s=1 no pad
        self.atrous_block1 = nn.Conv2d(in_channel, depth, 1, 1)
        self.atrous_block6 = nn.Conv2d(in_channel, depth, 3, 1, padding=6, dilation=6)
        self.atrous_block12 = nn.Conv2d(in_channel, depth, 3, 1, padding=12, dilation=12)
        self.atrous_block18 = nn.Conv2d(in_channel, depth, 3, 1, padding=18, dilation=18)
 
        self.conv_1x1_output = nn.Conv2d(depth * 5, depth, 1, 1)
 
    def forward(self, x):
        size = x.shape[2:]
 
        image_features = self.mean(x)
        image_features = self.conv(image_features)
        image_features = F.interpolate(image_features, size=size, mode='bilinear', align_corners=False)  # F.upsample is deprecated
 
        atrous_block1 = self.atrous_block1(x)
 
        atrous_block6 = self.atrous_block6(x)
 
        atrous_block12 = self.atrous_block12(x)
 
        atrous_block18 = self.atrous_block18(x)
 
        net = self.conv_1x1_output(torch.cat([image_features, atrous_block1, atrous_block6,
                                              atrous_block12, atrous_block18], dim=1))
        return net

Official version (TensorFlow):

# Excerpt from the linked TensorFlow repo; it assumes `tensorflow` is imported as `tf`
# and that the slim/contrib helpers (layers, layers_lib, arg_scope, resnet_utils,
# resnet_v2) used below are imported there as well.
def atrous_spatial_pyramid_pooling(inputs, output_stride, batch_norm_decay, is_training, depth=256):
  """Atrous Spatial Pyramid Pooling.
  Args:
    inputs: A tensor of size [batch, height, width, channels].
    output_stride: The ResNet unit's stride. Determines the rates for atrous convolution.
      the rates are (6, 12, 18) when the stride is 16, and doubled when 8.
    batch_norm_decay: The moving average decay when estimating layer activation
      statistics in batch normalization.
    is_training: A boolean denoting whether the input is for training.
    depth: The depth of the ResNet unit output.
  Returns:
    The atrous spatial pyramid pooling output.
  """
  with tf.variable_scope("aspp"):
    if output_stride not in [8, 16]:
      raise ValueError('output_stride must be either 8 or 16.')

    atrous_rates = [6, 12, 18]
    if output_stride == 8:
      atrous_rates = [2*rate for rate in atrous_rates]

    with tf.contrib.slim.arg_scope(resnet_v2.resnet_arg_scope(batch_norm_decay=batch_norm_decay)):
      with arg_scope([layers.batch_norm], is_training=is_training):
        inputs_size = tf.shape(inputs)[1:3]
        # (a) one 1x1 convolution and three 3x3 convolutions with rates = (6, 12, 18) when output stride = 16.
        # the rates are doubled when output stride = 8.
        conv_1x1 = layers_lib.conv2d(inputs, depth, [1, 1], stride=1, scope="conv_1x1")
        conv_3x3_1 = resnet_utils.conv2d_same(inputs, depth, 3, stride=1, rate=atrous_rates[0], scope='conv_3x3_1')
        conv_3x3_2 = resnet_utils.conv2d_same(inputs, depth, 3, stride=1, rate=atrous_rates[1], scope='conv_3x3_2')
        conv_3x3_3 = resnet_utils.conv2d_same(inputs, depth, 3, stride=1, rate=atrous_rates[2], scope='conv_3x3_3')

        # (b) the image-level features
        with tf.variable_scope("image_level_features"):
          # global average pooling
          image_level_features = tf.reduce_mean(inputs, [1, 2], name='global_average_pooling', keepdims=True)
          # 1x1 convolution with 256 filters( and batch normalization)
          image_level_features = layers_lib.conv2d(image_level_features, depth, [1, 1], stride=1, scope='conv_1x1')
          # bilinearly upsample features
          image_level_features = tf.image.resize_bilinear(image_level_features, inputs_size, name='upsample')

        net = tf.concat([conv_1x1, conv_3x3_1, conv_3x3_2, conv_3x3_3, image_level_features], axis=3, name='concat')
        net = layers_lib.conv2d(net, depth, [1, 1], stride=1, scope='conv_1x1_concat')

        return net

DeepLabV3+ 

DeepLab-v3+ extends DeepLab-v3 with a decoder module (the encoder is the same as above) that refines the segmentation result and handles object boundaries more precisely; it further applies depthwise separable convolutions to both the spatial pyramid pooling (SPP) module and the decoder, greatly improving the handling of objects of different sizes and aspect ratios, and finally yields a strong encoder-decoder network for semantic segmentation.

https://github.com/rishizek/tensorflow-deeplab-v3-plus 


def deeplab_v3_plus_generator(num_classes,
                              output_stride,
                              base_architecture,
                              pre_trained_model,
                              batch_norm_decay,
                              data_format='channels_last'):
  """Generator for DeepLab v3 plus models.
  Args:
    num_classes: The number of possible classes for image classification.
    output_stride: The ResNet unit's stride. Determines the rates for atrous convolution.
      the rates are (6, 12, 18) when the stride is 16, and doubled when 8.
    base_architecture: The architecture of base Resnet building block.
    pre_trained_model: The path to the directory that contains pre-trained models.
    batch_norm_decay: The moving average decay when estimating layer activation
      statistics in batch normalization.
    data_format: The input format ('channels_last', 'channels_first', or None).
      If set to None, the format is dependent on whether a GPU is available.
      Only 'channels_last' is supported currently.
  Returns:
    The model function that takes in `inputs` and `is_training` and
    returns the output tensor of the DeepLab v3 model.
  """
  if data_format is None:
    # data_format = (
    #     'channels_first' if tf.test.is_built_with_cuda() else 'channels_last')
    pass

  if batch_norm_decay is None:
    batch_norm_decay = _BATCH_NORM_DECAY

  if base_architecture not in ['resnet_v2_50', 'resnet_v2_101']:
    raise ValueError("'base_architecture' must be either 'resnet_v2_50' or 'resnet_v2_101'.")

  if base_architecture == 'resnet_v2_50':
    base_model = resnet_v2.resnet_v2_50
  else:
    base_model = resnet_v2.resnet_v2_101

  def model(inputs, is_training):
    """Constructs the ResNet model given the inputs."""
    if data_format == 'channels_first':
      # Convert the inputs from channels_last (NHWC) to channels_first (NCHW).
      # This provides a large performance boost on GPU. See
      # https://www.tensorflow.org/performance/performance_guide#data_formats
      inputs = tf.transpose(inputs, [0, 3, 1, 2])

    # tf.logging.info('net shape: {}'.format(inputs.shape))
    # encoder
    with tf.contrib.slim.arg_scope(resnet_v2.resnet_arg_scope(batch_norm_decay=batch_norm_decay)):
      logits, end_points = base_model(inputs,
                                      num_classes=None,
                                      is_training=is_training,
                                      global_pool=False,
                                      output_stride=output_stride)

    if is_training:
      exclude = [base_architecture + '/logits', 'global_step']
      variables_to_restore = tf.contrib.slim.get_variables_to_restore(exclude=exclude)
      tf.train.init_from_checkpoint(pre_trained_model,
                                    {v.name.split(':')[0]: v for v in variables_to_restore})

    inputs_size = tf.shape(inputs)[1:3]
    net = end_points[base_architecture + '/block4']
    encoder_output = atrous_spatial_pyramid_pooling(net, output_stride, batch_norm_decay, is_training)

    with tf.variable_scope("decoder"):
      with tf.contrib.slim.arg_scope(resnet_v2.resnet_arg_scope(batch_norm_decay=batch_norm_decay)):
        with arg_scope([layers.batch_norm], is_training=is_training):
          with tf.variable_scope("low_level_features"):
            low_level_features = end_points[base_architecture + '/block1/unit_3/bottleneck_v2/conv1']
            low_level_features = layers_lib.conv2d(low_level_features, 48,
                                                   [1, 1], stride=1, scope='conv_1x1')
            low_level_features_size = tf.shape(low_level_features)[1:3]

          with tf.variable_scope("upsampling_logits"):
            net = tf.image.resize_bilinear(encoder_output, low_level_features_size, name='upsample_1')
            net = tf.concat([net, low_level_features], axis=3, name='concat')
            net = layers_lib.conv2d(net, 256, [3, 3], stride=1, scope='conv_3x3_1')
            net = layers_lib.conv2d(net, 256, [3, 3], stride=1, scope='conv_3x3_2')
            net = layers_lib.conv2d(net, num_classes, [1, 1], activation_fn=None, normalizer_fn=None, scope='conv_1x1')
            logits = tf.image.resize_bilinear(net, inputs_size, name='upsample_2')

    return logits

  return model

Summary

Personally I find dilated convolution a simple, direct, and elegant idea that has delivered a solid accuracy improvement. It originated in semantic segmentation and most papers apply it there; whether it is valuable for other applications is not yet clear, but it is certainly a direction worth exploring. Another answer mentioned that WaveNet and ByteNet also use dilated convolution, which is an interesting observation, since sequence-to-sequence learning is likewise a problem that must attend to multi-scale relationships. How to implement and design it for sequence-to-sequence learning, and how that relates to segmentation and other applications, are questions worth revisiting.
