Deep Learning Paper: TResNet: High Performance GPU-Dedicated Architecture and Its PyTorch Implementation

TResNet: High Performance GPU-Dedicated Architecture
PDF:https://arxiv.org/abs/2003.13630.pdf
PyTorch: https://github.com/shanglianlm0525/PyTorch-Networks

1 Overview

The TResNet models deliver better accuracy and efficiency than prior designs. Using a TResNet model with GPU throughput similar to ResNet50, the authors reach 80.7% top-1 accuracy on ImageNet.

2 TResNet Design

2-1 Stem Design

[Figure: TResNet stem design based on a SpaceToDepth layer]

import torch
import torch.nn as nn
import torch.nn.functional as F


class SpaceToDepth(nn.Module):
    def __init__(self, block_size=4):
        super().__init__()
        assert block_size == 4
        self.bs = block_size

    def forward(self, x):
        N, C, H, W = x.size()
        x = x.view(N, C, H // self.bs, self.bs, W // self.bs, self.bs)  # (N, C, H//bs, bs, W//bs, bs)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # (N, bs, bs, C, H//bs, W//bs)
        x = x.view(N, C * (self.bs ** 2), H // self.bs, W // self.bs)  # (N, C*bs^2, H//bs, W//bs)
        return x
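As a quick sanity check, the sketch below shows what the stem does to a standard 224×224 RGB input: SpaceToDepth shrinks the resolution by 4× while multiplying the channel count by 16, and a following convolution sets the stage-1 width. The kernel size and the 64-channel width here are illustrative choices, not necessarily the exact ones in the official TResNet code.

stem = nn.Sequential(
    SpaceToDepth(block_size=4),
    # Illustrative follow-up convolution; the official stem may use a different kernel size / width
    nn.Conv2d(3 * 16, 64, kernel_size=1, stride=1, bias=False),
)

x = torch.randn(1, 3, 224, 224)     # (N, C, H, W)
print(SpaceToDepth()(x).shape)      # torch.Size([1, 48, 56, 56])
print(stem(x).shape)                # torch.Size([1, 64, 56, 56])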

Mark Sandler, Jonathan Baccash, Andrey Zhmoginov, and Andrew Howard. Non-discriminative data or weak model? on the relative importance of data and model resolution. arXiv preprint arXiv:1909.03205, 2019.

2-2 Anti-Alias Downsampling (AA)

[Figure: Anti-Alias Downsampling module]

class AADownsample(nn.Module):
    def __init__(self, filt_size=3, stride=2, channels=None):
        super(AADownsample, self).__init__()
        self.filt_size = filt_size
        self.stride = stride
        self.channels = channels

        assert self.filt_size == 3
        a = torch.tensor([1., 2., 1.])

        # Outer product of [1, 2, 1] gives a 3x3 binomial blur kernel, normalized to sum to 1
        filt = (a[:, None] * a[None, :])
        filt = filt / torch.sum(filt)

        # One copy of the blur kernel per channel, applied as a depthwise (grouped) convolution
        self.register_buffer('filt', filt[None, None, :, :].repeat((self.channels, 1, 1, 1)))

    def forward(self, input):
        input_pad = F.pad(input, (1, 1, 1, 1), 'reflect')
        return F.conv2d(input_pad, self.filt, stride=self.stride, padding=0, groups=input.shape[1])
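In the paper, anti-aliased downsampling replaces stride-2 convolutions: the convolution runs with stride 1 and the blur filter above then performs the stride-2 downsampling. A minimal sketch of that pattern (the channel count and input size are illustrative):

channels = 64
# Hypothetical stride-2 conv replaced by a stride-1 conv followed by blur downsampling
downsample = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1, bias=False),
    AADownsample(filt_size=3, stride=2, channels=channels),
)

x = torch.randn(1, channels, 56, 56)
print(downsample(x).shape)  # torch.Size([1, 64, 28, 28])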

Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019.

2-3 In-Place Activated BatchNorm (Inplace-ABN)

Replacing all BatchNorm + ReLU layers with Inplace-ABN substantially reduces GPU memory consumption.
At the same time, Leaky-ReLU is used in place of plain ReLU, which improves accuracy at very little extra cost.

https://github.com/mapillary/inplace_abn
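A minimal sketch of a Conv + Inplace-ABN building block, assuming the InPlaceABN layer from the mapillary/inplace_abn package linked above (argument names follow that package's README; treat the exact values as illustrative):

from inplace_abn import InPlaceABN

def conv_abn(in_ch, out_ch, kernel_size=3, stride=1):
    # BN and the (leaky) activation are fused and recomputed in the backward pass
    # instead of being stored, which is where the GPU memory saving comes from.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, stride=stride,
                  padding=kernel_size // 2, bias=False),
        InPlaceABN(out_ch, activation="leaky_relu", activation_param=1e-2),
    )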

Samuel Rota Bulo, Lorenzo Porzi, and Peter Kontschieder. In-place activated BatchNorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

2-4 Blocks Selection

The figure below shows the BasicBlock used by ResNet34 (left) and the Bottleneck used by ResNet50 (right). Bottleneck blocks consume more GPU resources but deliver higher accuracy, while BasicBlocks have a larger receptive field.
TResNet therefore uses BasicBlocks in the first two stages and Bottlenecks in the last two stages.
[Figure: ResNet34 BasicBlock (left) vs. ResNet50 Bottleneck (right)]

2-5 SE Layers

SE layers are added to the first three stages; their placement within the blocks is shown below.
[Figure: placement of the SE layer inside the residual blocks]
The proposed architecture is as follows:
[Figure: overall TResNet architecture]
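For reference, a squeeze-and-excitation module of the standard form used here looks roughly as follows (the reduction ratio and layer choices are the common SE defaults, not necessarily TResNet's exact settings):

class SEModule(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Squeeze: global average pool to a (N, C, 1, 1) channel descriptor
        s = x.mean(dim=(2, 3), keepdim=True)
        # Excite: bottleneck MLP produces per-channel gates in (0, 1)
        s = self.sigmoid(self.fc2(self.relu(self.fc1(s))))
        return x * s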

3 Code Optimizations

3-1 JIT Compilation

JIT-accelerated SpaceToDepth module

@torch.jit.script
class SpaceToDepthJit(object):
    def __call__(self, x: torch.Tensor):
        # assuming hard-coded that block_size==4 for acceleration
        N, C, H, W = x.size()
        x = x.view(N, C, H // 4, 4, W // 4, 4)  # (N, C, H//bs, bs, W//bs, bs)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # (N, bs, bs, C, H//bs, W//bs)
        x = x.view(N, C * 16, H // 4, W // 4)  # (N, C*bs^2, H//bs, W//bs)
        return x

JIT-accelerated AA downsampling module

@torch.jit.script
class AADownsampleJIT(object):
    def __init__(self, filt_size: int = 3, stride: int = 2, channels: int = 0):
        self.stride = stride
        self.filt_size = filt_size
        self.channels = channels

        assert self.filt_size == 3
        assert stride == 2
        a = torch.tensor([1., 2., 1.])

        filt = (a[:, None] * a[None, :]).clone().detach()
        filt = filt / torch.sum(filt)
        self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()

    def __call__(self, input: torch.Tensor):
        if input.dtype != self.filt.dtype:
            self.filt = self.filt.float() 
        input_pad = F.pad(input, (1, 1, 1, 1), 'reflect')
        return F.conv2d(input_pad, self.filt, stride=2, padding=0, groups=input.shape[1])

3-2 Fixed Global Average Pooling

AvgPool2d is faster than AdaptiveAvgPool2d, and a simple view + mean implementation is about 5× faster than AvgPool2d.

class FastGlobalAvgPool2d(nn.Module):
    def __init__(self, flatten=False):
        super(FastGlobalAvgPool2d, self).__init__()
        self.flatten = flatten

    def forward(self, x):
        if self.flatten:
            in_size = x.size()
            return x.view((in_size[0], in_size[1], -1)).mean(dim=2)
        else:
            return x.view(x.size(0), x.size(1), -1).mean(-1).view(x.size(0), x.size(1), 1, 1)
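A quick usage sketch (shapes chosen for illustration): the module produces the same values as nn.AdaptiveAvgPool2d((1, 1)); it just computes them with a reshape and a mean instead of the generic pooling kernel.

x = torch.randn(2, 512, 7, 7)

gap = FastGlobalAvgPool2d()
ref = nn.AdaptiveAvgPool2d((1, 1))

print(gap(x).shape)                                # torch.Size([2, 512, 1, 1])
print(torch.allclose(gap(x), ref(x), atol=1e-6))   # True
print(FastGlobalAvgPool2d(flatten=True)(x).shape)  # torch.Size([2, 512])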

3-3 Inplace Operations

In-place operations are used wherever possible, e.g. for the residual connection, the SE layers, and the blocks' final activation.
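The pattern, sketched on a generic residual block (not TResNet's exact block code): in-place variants overwrite an existing tensor instead of allocating a new one, which lowers memory usage and can allow larger batch sizes.

class SimpleResidual(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)  # in-place activation: no extra output tensor

    def forward(self, x):
        out = self.bn(self.conv(x))
        out += x                 # in-place add for the residual connection
        return self.relu(out)    # in-place final activation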

4 Experimental Results

4-1 Basic

[Figure: main ImageNet results]

4-2 Ablation

[Figure: ablation study results]
