[paper] ENet

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Paper: https://arxiv.org/abs/1606.02147
Code: https://github.com/e-lab/ENet-training

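Proposes ENet, a network for semantic segmentation that is fast without giving up much accuracy.

ENet uses bottleneck modules similar to ResNet's: each module contains three convolutional layers (a 1×1 projection that reduces dimensionality, a main convolutional layer, and a 1×1 expansion), with BN and PReLU placed between the convolutions.

Downsampling loses edge information, and the upsampling it then requires is computationally expensive. On the other hand, downsampling gives a larger receptive field and therefore more context. Weighing these, the network uses dilated convolutions.

Processing large input images is slow, so the first two blocks downsample immediately and keep only a small number of feature maps, discarding redundant parts of the image.

Uses a large encoder and a small decoder.

Using ReLU actually lowered accuracy, possibly because the network has relatively few layers.

Reducing dimensionality easily loses information, so pooling is performed in parallel with a stride-2 convolution and the two outputs are concatenated into the feature map. This makes inference of the initial block 10 times faster.

Convolution weights are redundant, so an n×n convolution is factorized into smaller n×1 and 1×n convolutions.

Spatial Dropout is used to improve accuracy.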

We propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation.

Introduction

These references propose networks with huge numbers of parameters, and long inference times.

We propose a new neural network architecture optimized for fast inference and high accuracy.

However, these networks are slow during inference due to their large architectures and numerous parameters.

Unlike in fully convolutional networks (FCN) [12], fully connected layers of VGG16 were discarded in the latest incarnation of SegNet, in order to reduce the number of floating point operations and memory footprint, making it the smallest of these networks.

Network architecture

We adopt a view of ResNets [24] that describes them as having a single main branch and extensions with convolutional filters that separate from it, and then merge back with an element-wise addition, as shown in Figure 2b.

Each block consists of three convolutional layers: a 1×1 projection that reduces the dimensionality, a main convolutional layer (conv in Figure 2b), and a 1×1 expansion. We place Batch Normalization [25] and PReLU [26] between all convolutions.

Just as in the original paper, we refer to these as bottleneck modules.

If the bottleneck is downsampling, a max pooling layer is added to the main branch.

Also, the first 1×1 projection is replaced with a 2×2 convolution with stride 2 in both dimensions.

We zero pad the activations, to match the number of feature maps.

conv is either a regular, dilated or full convolution (also known as deconvolution or fractionally strided convolution) with 3×3 filters.

Sometimes we replace it with asymmetric convolution i.e. a sequence of 5×1 and 1×5 convolutions.

For the regularizer, we use Spatial Dropout [27], with p=0.01 before bottleneck2.0, and p=0.1 afterwards.

Figure 2: (a) ENet initial block. MaxPooling is performed with non-overlapping 2×2 windows, and the convolution has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by [28]. (b) ENet bottleneck module. conv is either a regular, dilated, or full convolution (also known as deconvolution) with 3×3 filters, or a 5×5 convolution decomposed into two asymmetric ones.
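The channel arithmetic in the caption of Figure 2a can be checked with a small shape calculation. This is only a sketch: the function name and the even-size restriction are assumptions made for illustration, and no actual convolution is computed.

```python
def initial_block_shape(in_channels, height, width, conv_filters=13):
    """Output shape (channels, height, width) of the ENet initial block.

    Both branches halve the spatial size: a stride-2 convolution with
    `conv_filters` filters, and a non-overlapping 2x2 max pooling that
    keeps the input channels. The two outputs are concatenated.
    """
    if height % 2 or width % 2:
        raise ValueError("this sketch assumes even height and width")
    out_channels = conv_filters + in_channels  # concatenation along channels
    return out_channels, height // 2, width // 2


# An RGB frame of 640x360 (width x height) gives 13 + 3 = 16 feature maps.
print(initial_block_shape(3, 360, 640))  # (16, 180, 320)
```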

The initial stage contains a single block, that is presented in Figure 2a.

Stage 1 consists of 5 bottleneck blocks, while stage 2 and 3 have the same structure, with the exception that stage 3 does not downsample the input at the beginning (we omit the 0th bottleneck).

These three first stages are the encoder.

Stage 4 and 5 belong to the decoder.

In the decoder max pooling is replaced with max unpooling, and padding is replaced with spatial convolution without bias.

For performance reasons, we decided to place only a bare full convolution as the last module of the network.

Design choices

Feature map resolution

Downsampling images during semantic segmentation has two main drawbacks.

First, reducing feature map resolution implies loss of spatial information like exact edge shape.
Second, full pixel segmentation requires that the output has the same resolution as the input.

The first issue has been addressed in FCN [12] by adding the feature maps produced by the encoder, and in SegNet [10] by saving indices of elements chosen in max-pooling layers, and using them to produce sparse upsampled maps in the decoder. We followed the SegNet approach, because it reduces memory requirements.
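The index-saving idea can be illustrated on a single small feature map. The sketch below is plain Python written for these notes (the function names are hypothetical, and a real implementation would operate on batched tensors): pooling remembers where each maximum came from, and unpooling writes the values back to those positions, leaving zeros elsewhere.

```python
def max_pool_2x2_with_indices(fmap):
    """Non-overlapping 2x2 max pooling that also records argmax positions."""
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for i in range(0, h, 2):
        pooled_row, index_row = [], []
        for j in range(0, w, 2):
            window = [(fmap[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, height, width):
    """Place each pooled value back at its saved position; zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (i, j) in zip(pooled_row, index_row):
            out[i][j] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [2, 6, 3, 1]]
pooled, indices = max_pool_2x2_with_indices(fmap)
restored = max_unpool_2x2(pooled, indices, 4, 4)
print(pooled)    # [[4, 5], [6, 7]]
print(restored)  # [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [0, 6, 0, 0]]
```

Only one position per pooling window has to be stored, rather than the full encoder feature map, which is where the memory saving comes from.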

However, downsampling has one big advantage. Filters operating on downsampled images have a bigger receptive field, which allows them to gather more context.

In the end, we have found that it is better to use dilated convolutions for this purpose [30].

Early downsampling

Processing large input frames is very expensive.

ENet's first two blocks heavily reduce the input size, and use only a small set of feature maps.

Visual information is highly spatially redundant.

Our intuition is that the initial network layers should not directly contribute to classification.

Decoder size

Our architecture consists of a large encoder and a small decoder.

Nonlinear operations

We have found that removing most ReLUs in the initial layers of the network improved the results.

We replaced all ReLUs in the network with PReLUs [26], which use an additional parameter per feature map, with the goal of learning the negative slope of non-linearities.

The weights of the initial layers exhibit a large variance and are slightly biased towards positive values, while in the later portions of the encoder they settle to a recurring pattern. All layers in the main branch behave nearly exactly like regular ReLUs, while the weights inside bottleneck modules are negative, i.e. the function inverts and scales down negative values.

The decoder weights become much more positive and learn functions closer to identity.
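The three regimes described for the learned PReLU slopes can be made concrete with the standard PReLU definition. The sketch below is a scalar version for illustration only; the slope values are hypothetical and are not taken from the paper.

```python
def prelu(x, slope):
    """PReLU: identity for positive inputs, learned `slope` for negative ones."""
    return x if x > 0 else slope * x


# Hypothetical slopes chosen only to illustrate the three regimes.
regimes = {
    "ReLU-like (slope near 0)": 0.0,
    "inverting (negative slope)": -0.25,
    "identity-like (slope near 1)": 1.0,
}
for name, slope in regimes.items():
    print(name, [prelu(x, slope) for x in (-2.0, 3.0)])
```

With a negative slope, an input of -2.0 maps to +0.5: negative values are inverted and scaled down, which matches the behaviour reported for the bottleneck modules.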

Information-preserving dimensionality changes

Aggressive dimensionality reduction can also hinder the information flow.

Pooling followed by a convolution that expands the dimensionality introduces a representational bottleneck (or forces one to use a greater number of filters, which lowers computational efficiency).

As proposed in [28], we chose to perform the pooling operation in parallel with a convolution of stride 2, and concatenate the resulting feature maps.

This technique allowed us to speed up inference time of the initial block 10 times.

Additionally, we have found one problem in the original ResNet architecture. When downsampling, the first 1×1 projection of the convolutional branch is performed with a stride of 2 in both dimensions, which effectively discards 75% of the input.

Increasing the filter size to 2×2 allows the full input to be taken into consideration, and thus improves the information flow and accuracy.
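The 75% figure follows from counting which input positions a strided convolution actually reads. A minimal sketch, assuming no padding and a hypothetical 8×8 input:

```python
def input_coverage(kernel_size, stride, height, width):
    """Fraction of input positions read by at least one window (no padding)."""
    touched = set()
    for top in range(0, height - kernel_size + 1, stride):
        for left in range(0, width - kernel_size + 1, stride):
            for di in range(kernel_size):
                for dj in range(kernel_size):
                    touched.add((top + di, left + dj))
    return len(touched) / (height * width)


print(input_coverage(1, 2, 8, 8))  # 0.25 -> a 1x1 filter with stride 2 skips 75%
print(input_coverage(2, 2, 8, 8))  # 1.0  -> a 2x2 filter with stride 2 reads everything
```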

Factorizing filters

Convolutional weights have a fair amount of redundancy.

Each n×n convolution can be decomposed into two smaller ones following each other: one with an n×1 filter and the other with a 1×n filter [32].

We refer to these as asymmetric convolutions.

Asymmetric convolutions increase the variety of functions learned by blocks and increase the receptive field.
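A quick weight count shows why the decomposition is cheap. The sketch below counts weights for a single input/output channel pair and ignores biases; that simplification is an assumption made here, since the true cost also depends on the channel counts.

```python
def full_kernel_weights(n):
    """Weights in one n x n filter (per input/output channel pair)."""
    return n * n


def asymmetric_kernel_weights(n):
    """Weights in an n x 1 filter followed by a 1 x n filter."""
    return n + n


for n in (3, 5, 7):
    print(n, full_kernel_weights(n), asymmetric_kernel_weights(n))
```

For the 5×1 and 1×5 pair mentioned in the architecture notes, this gives 10 weights instead of 25, close to the 9 of a plain 3×3 filter, while the pair still covers a 5×5 window.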

A sequence of operations used in the bottleneck module (projection, convolution, projection) can be seen as decomposing one large convolutional layer into a series of smaller and simpler operations, that are its low-rank approximation.
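The saving from this decomposition can be estimated by counting weights. In the sketch below, the channel width of 128 and the reduction factor of 4 are assumptions chosen for illustration (the excerpt above does not state them), and biases, BN, and PReLU parameters are ignored.

```python
def conv_weights(in_channels, out_channels, kernel_h, kernel_w):
    """Weight count of a convolution, ignoring biases."""
    return in_channels * out_channels * kernel_h * kernel_w


def bottleneck_weights(channels, reduction=4):
    """Weights in a 1x1 projection, a 3x3 convolution, and a 1x1 expansion."""
    inner = channels // reduction
    return (conv_weights(channels, inner, 1, 1)
            + conv_weights(inner, inner, 3, 3)
            + conv_weights(inner, channels, 1, 1))


channels = 128  # hypothetical width, for illustration only
print(conv_weights(channels, channels, 3, 3))  # 147456 for one full 3x3 layer
print(bottleneck_weights(channels))            # 17408 for the bottleneck sequence
```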

Dilated convolutions

Dilated convolutions replaced the main convolutional layers inside several bottleneck modules in the stages that operate on the smallest resolutions. These gave a significant accuracy boost, by raising IoU on Cityscapes by around 4 percentage points, with no additional cost.
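Why dilation widens the receptive field at no extra cost can be seen from the usual receptive-field formula for a stack of stride-1 convolutions. The dilation rates in the sketch are illustrative assumptions, not values quoted in these notes.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (one side, in pixels) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


print(receptive_field([1, 1, 1, 1]))    # 9  -> four regular 3x3 layers
print(receptive_field([2, 4, 8, 16]))   # 61 -> four dilated 3x3 layers
```

Both stacks have the same number of weights, since dilation only spreads the 3×3 taps further apart.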

Regularization

We placed Spatial Dropout at the end of convolutional branches, right before the addition, and it turned out to work much better than stochastic depth.
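Spatial Dropout differs from ordinary dropout in that it drops whole feature maps rather than individual activations. The following is a pure-Python sketch of that idea. Rescaling the surviving maps by 1/(1-p) is a common convention assumed here, not something stated in the notes.

```python
import random


def spatial_dropout(feature_maps, p, rng):
    """Zero whole feature maps with probability p; rescale the survivors.

    `feature_maps` is a list of 2D lists (one per channel). The 1/(1-p)
    rescaling is an assumed convention that keeps the expected value unchanged.
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    out = []
    for fmap in feature_maps:
        if rng.random() < p:
            out.append([[0.0 for _ in row] for row in fmap])  # drop the channel
        else:
            out.append([[v * scale for v in row] for row in fmap])
    return out


maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]
dropped = spatial_dropout(maps, p=0.5, rng=random.Random(0))
```

Each channel in `dropped` is either entirely zero or a uniformly scaled copy of the input channel.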

Results

Performance Analysis

For inference we merge batch normalization and dropout layers into the convolutional filters, to speed up all networks.
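Merging batch normalization into a preceding convolution works because both are affine at inference time. The sketch below shows the standard folding for one output channel, with the filter flattened to a list of weights. The function name and the epsilon value are assumptions for illustration.

```python
import math


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the conv.

    Returns new (weights, bias) for one output channel such that
    conv'(x) equals batch_norm(conv(x)) with fixed statistics.
    """
    scale = gamma / math.sqrt(var + eps)
    folded_weights = [w * scale for w in weights]
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias


def dot(weights, x, bias):
    """One output value of a convolution: weighted sum plus bias."""
    return sum(w * v for w, v in zip(weights, x)) + bias


weights, bias = [0.5, -1.0, 2.0], 0.1
x = [1.0, 2.0, 3.0]
separate = 1.5 * (dot(weights, x, bias) - 0.3) / math.sqrt(4.0 + 1e-5) - 0.2
fw, fb = fold_batch_norm(weights, bias, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
print(math.isclose(dot(fw, x, fb), separate))  # True
```

Dropout needs no extra work in this scheme: at inference it is either the identity or a constant scaling, depending on the convention used, so it can be absorbed into the weights the same way.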

Inference time

Table 2: Performance comparison.

NVIDIA TX1

Model	480×320	640×360	1280×720
SegNet	757 ms, 1.3 fps	1251 ms, 0.8 fps	- ms, - fps
ENet	47 ms, 21.1 fps	69 ms, 14.6 fps	262 ms, 3.8 fps

NVIDIA Titan X

Model	640×360	1280×720	1920×1080
SegNet	69 ms, 14.6 fps	289 ms, 3.5 fps	637 ms, 1.6 fps
ENet	7 ms, 135.4 fps	21 ms, 46.8 fps	46 ms, 21.6 fps

Table 2 compares inference time for a single input frame of varying resolution. We also report the number of frames per second that can be processed. Dashes indicate that we could not obtain a measurement, due to lack of memory.
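The relative speedup is easier to read off as a ratio. The snippet below only re-uses the millisecond values from Table 2. The TX1 1280×720 column is left out because SegNet has no measurement there.

```python
# Inference times in milliseconds, copied from Table 2.
TABLE_2_MS = {
    ("TX1", "480x320"): {"SegNet": 757, "ENet": 47},
    ("TX1", "640x360"): {"SegNet": 1251, "ENet": 69},
    ("Titan X", "640x360"): {"SegNet": 69, "ENet": 7},
    ("Titan X", "1280x720"): {"SegNet": 289, "ENet": 21},
    ("Titan X", "1920x1080"): {"SegNet": 637, "ENet": 46},
}


def speedup(times_ms):
    """How many times faster ENet is than SegNet for one table cell."""
    return times_ms["SegNet"] / times_ms["ENet"]


for (device, resolution), times_ms in TABLE_2_MS.items():
    print(f"{device} {resolution}: {speedup(times_ms):.1f}x")
```

On these numbers ENet comes out roughly 10 to 18 times faster than SegNet, depending on device and resolution.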

Hardware requirements

Table 3: Hardware requirements. FLOPs are estimated for an input of 3 × 640 × 360.

Model	GFLOPs	Parameters	Model size (fp16)
SegNet	286.03	29.46M	56.2 MB
ENet	3.83	0.37M	0.7 MB

Please note that we report storage required to save model parameters in half precision floating point format.
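The model sizes in Table 3 can be reproduced from the parameter counts at 2 bytes per parameter. The figures match if "MB" is read as 2^20 bytes, which is an assumption made here rather than something the table states.

```python
BYTES_PER_FP16 = 2


def model_size_mib(num_parameters):
    """Storage for the parameters in half precision, in units of 2**20 bytes."""
    return num_parameters * BYTES_PER_FP16 / 2**20


print(round(model_size_mib(29.46e6), 1))  # 56.2 (SegNet)
print(round(model_size_mib(0.37e6), 1))   # 0.7  (ENet)
```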

Software limitations

Although factorizing convolutional layers allowed us to greatly reduce the number of floating point operations and parameters, it also increased the number of individual kernel calls, making each of them smaller.

This means that using a higher number of kernels increases the number of memory transactions, because feature maps have to be constantly saved and reloaded.

We have found that some of these operations can become so cheap, that the cost of GPU kernel launch starts to outweigh the cost of the actual computation.

In ENet, PReLUs consume more than a quarter of inference time. Since they are only simple point-wise operations and are very easy to parallelize, we hypothesize it is caused by the aforementioned data movement.

Benchmarks

We have used the Adam optimization algorithm [35] to train the network.

It allowed ENet to converge very quickly and on every dataset we have used training took only 3-6 hours, using four Titan X GPUs.

Training was performed in two stages: first we trained only the encoder to categorize downsampled regions of the input image, then we appended the decoder and trained the network to perform upsampling and pixel-wise classification.

Cityscapes

Consists of 5000 finely annotated images.

train/val/test: 2975/500/1525

We trained on the 19 classes that have been selected in the official evaluation scripts.

CamVid

train/test: 367/233

SUN RGB-D

train/test: 5285/5050

37 indoor object classes.

Conclusion

We have proposed a novel neural network architecture designed from the ground up specifically for semantic segmentation. Our main aim is to make efficient use of scarce resources available on embedded platforms, compared to fully fledged deep learning workstations.
