Semantic Flow for Fast and Accurate Scene Parsing


Abstract

In this paper, we focus on effective methods for fast and accurate scene parsing. A common practice to improve the performance is to attain high-resolution feature maps with strong semantic representation. Two widely used strategies—atrous convolutions and feature pyramid fusion—are either computation intensive or ineffective. Inspired by the Optical Flow used for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels and broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure exhibits superior performance over other real-time methods even on very light-weight backbone networks, such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. In particular, our network is the first to achieve 80.4% mIoU on Cityscapes with a frame rate of 26 FPS. The code will be available at https://github.com/donnyyou/torchcv .



Figure 1. Inference speed versus mIoU performance on the Cityscapes test set. Previous models are marked as red points, and our models are shown as blue points, which achieve the best speed/accuracy trade-off. Note that our method with light-weight ResNet-18 as the backbone even achieves accuracy comparable to all accurate models at a much faster speed.


1. Introduction

Scene parsing or semantic segmentation is a fundamental vision task which aims to classify each pixel in an image correctly. Two important factors that have prominent impacts on the performance are: detailed resolution information [42] and strong semantic representation [5, 60]. The seminal work of Long et al. [29] built a deep Fully Convolutional Network (FCN), which is mainly composed of convolutional layers, in order to carve out strong semantic representation. However, detailed object boundary information, which is also crucial to the performance, is usually missing due to the built-in down-sampling pooling and convolutional layers. To alleviate this problem, many state-of-the-art methods [13, 60, 61, 65] apply atrous convolutions [52] at the last several stages of their networks to yield feature maps with strong semantic representation while at the same time maintaining high resolution.
Nevertheless, doing so inevitably requires huge extra computation, since the feature maps in the last several layers can be up to 64 times bigger than those in FCNs. Given that a regular FCN using ResNet-18 [17] as the backbone network has a frame rate of 57.2 FPS for a 1024 × 2048 image, after applying atrous convolutions [52] to the network as done in [60, 61], the modified network only has a frame rate of 8.7 FPS. Moreover, on a single GTX 1080Ti GPU with no other running programs, the state-of-the-art model PSPNet [60] has a frame rate of only 1.6 FPS for 1024 × 2048 input images. As a consequence, this is excessively problematic for many advanced real-world applications, such as self-driving cars and robot navigation, which desperately demand real-time online data processing.
On the other hand, recent fast models still have a large accuracy gap to accurate models, e.g., DFANet [23] only achieves 71.2% mIoU though running at 112 FPS, as shown in Figure 1. In summary, fast and accurate models for scene parsing are in demand for real-time applications.
In order to not only maintain detailed resolution information but also obtain features that exhibit strong semantic representation, another direction is to build FPN-like [21, 28] models which leverage the lateral path to fuse feature maps in a top-down manner. In this way, the deep features of the last several layers strengthen the shallow features with high resolution, and therefore the refined features can satisfy the above two factors and benefit the accuracy. However, such methods [1, 42] still suffer from unsatisfying accuracy when compared to those networks holding thick and big feature maps in the last several stages.

We believe the main reason lies in the ineffective semantics delivery from deep layers to shallow layers in these methods. To mitigate this issue, we propose to learn the Semantic Flow between layers with different resolutions. Semantic Flow is inspired by the optical flow method [11] used to align pixels between adjacent frames in video processing tasks [64]. Based on Semantic Flow, we construct a novel network module called the Flow Alignment Module (FAM) for scene parsing. It takes features from adjacent levels as inputs, produces an offset field, and then warps the coarse feature to the fine feature with higher resolution according to the offset field. Because FAM effectively transmits the semantic information from deep layers to shallow layers through very simple operations, it shows superior efficacy in both improving the accuracy and keeping super efficiency. Moreover, the proposed module is end-to-end trainable, and can be plugged into any backbone network to form new networks called SFNet. As depicted in Figure 1, our method with different backbones outperforms other competitors by a large margin at the same speed. In particular, our method adopting ResNet-18 as the backbone achieves 80.4% mIoU on the Cityscapes test server with a frame rate of 26 FPS. When adopting DF2 [48] as the backbone, our method achieves 77.8% mIoU at 61 FPS, and 74.5% mIoU at 121 FPS when equipping the DF1 backbone. Moreover, when equipped with a deeper backbone, such as ResNet-101, our method achieves comparable results with the state-of-the-art model DANet [13] and only requires 33% of the computation of DANet. Besides, experiments also clearly illustrate the generality of our SFNet across various datasets: Cityscapes [8], Pascal Context [34], ADE20K [62] and CamVid [3].
To conclude, our main contributions are three-fold:
• We propose a novel flow-based alignment module (FAM) to learn Semantic Flow between feature maps of adjacent levels and broadcast high-level features to high-resolution features effectively and efficiently.
• We insert FAM into the feature pyramid framework and build a feature pyramid aligned network named SFNet for fast and accurate scene parsing.
• Detailed experiments and analysis indicate the efficacy of our proposed module in both improving the accuracy and keeping light weight. We achieve state-of-the-art results on the Cityscapes, Pascal Context, ADE20K and CamVid datasets. Specifically, our network achieves 80.4% mIoU on the Cityscapes test server while attaining a real-time speed of 26 FPS on a single GTX 1080Ti GPU.


2. Related Work

For scene parsing, there are mainly two paradigms for high-resolution semantic map prediction. One paradigm tries to keep both spatial and semantic information along the main pathway, while the other distributes spatial and semantic information to different parts of a network and then merges them back via different strategies.
The first paradigm is mostly based on atrous convolution [52], which keeps high-resolution feature maps in the latter network stages. Current state-of-the-art accurate methods [13, 60, 65] follow this paradigm and keep improving performance by designing sophisticated head networks to capture contextual information. PSPNet [60] proposes the pyramid pooling module (PPM) to model multi-scale contexts, whilst the DeepLab series [4–6] uses atrous spatial pyramid pooling (ASPP). In [13, 15, 16, 18, 25, 53, 66], the non-local operator [46] and self-attention mechanism [45] are adopted to harvest pixel-wise context from the whole image. Meanwhile, graph convolution networks [20, 26] are used to propagate information over the whole image by projecting features into an interaction space.
The second paradigm contains state-of-the-art fast methods, where high-level semantics are represented by low-resolution feature maps. A common strategy follows the same backbone networks used for image classification without atrous convolution, and fuses multi-level feature maps for both spatial detail and semantics [1, 29, 38, 42, 47]. ICNet [59] uses multi-scale images as input and a cascade network to raise efficiency. DFANet [24] utilizes a light-weight backbone to speed up and is equipped with cross-level feature aggregation to boost accuracy, while SwiftNet [38] uses lateral connections as a cost-effective solution to restore the prediction resolution while maintaining the speed. To further speed up, low-resolution images are used as input for high-level semantics [31, 59]. All these methods reduce feature maps to quite low resolution and upsample them back by a large factor, which causes inferior results, especially for small objects and object boundaries. Guided upsampling [31] is closely related to our method, where the semantic map is upsampled back to the input image size guided by the feature map from an early layer. However, the guidance is insufficient due to both the semantic and resolution gaps, which makes the model still not comparable to accurate models. In contrast, our method aligns feature maps from adjacent levels, enhances a feature pyramid towards both high resolution and strong semantics, and results in state-of-the-art models for both high accuracy and speed.



Figure 2. Visualization of feature maps and the semantic flow field in FAM. Feature maps are visualized by averaging along the channel dimension, where large values are denoted by hot colors and vice versa. For visualizing the semantic flow field, the color code proposed by [2] and shown in the top-right corner is adopted, where the orientation and magnitude of flow vectors are represented by hue and saturation respectively.


There is also a set of works focusing on designing light-weight networks for real-time scene parsing. ESPNets [32, 33] save computation by decomposing the standard convolution into point-wise convolution and a spatial pyramid of dilated convolutions. BiSeNet [50] introduces a spatial path and a semantic path to reduce computation. Recently, several methods [35, 48, 58] use auto-ML algorithms to search efficient architectures for scene parsing. Our method is complementary to these works, which will further boost the segmentation speed as demonstrated in our experiments.
Our proposed semantic flow is inspired by optical flow [11], which is widely used in video semantic segmentation for both high accuracy and speed. For accurate results, temporal information is heavily exploited by using optical flow: Gadde et al. [14] warp internal feature maps and Nilsson et al. [37] warp final semantic maps. To pursue faster speed, optical flow is used to bypass the low-level feature computation of some frames by warping features from their preceding frames [27, 64]. Our work differs from them by propagating information hierarchically in another dimension, which is orthogonal to the temporal propagation for videos.


3. Method

In this section, we first give some preliminary knowledge about scene parsing and introduce the misalignment problem therein. Then, the Flow Alignment Module (FAM) is proposed to resolve the misalignment issue by learning Semantic Flow and warping top-layer feature maps accordingly. Finally, we present the whole network architecture equipped with FAMs based on the FPN framework [28] for fast and accurate scene parsing.


3.1. Preliminary

Given an input image, a standard backbone produces a feature pyramid {F2, F3, F4, F5} with down-sampling rates of {4, 8, 16, 32} with respect to the input image. The coarsest feature map F5 comes from the deepest layer with the strongest semantics. FCN-32s directly makes predictions on it, which causes over-smoothed results without fine details; improvements are achieved by fusing predictions from lower levels [29]. FPN takes a step further and gradually fuses high-level feature maps into low-level feature maps in a top-down pathway through 2× bilinear upsampling; it was originally proposed for object detection [28] and recently used for scene parsing [21, 47].

While the whole design looks symmetric, with both a downsampling encoder and an upsampling decoder, an important issue lies in the common and simple operator, bilinear upsampling, which breaks the symmetry. Bilinear upsampling recovers the resolution of downsampled feature maps by interpolating a set of uniformly sampled positions (i.e., it can only handle one kind of fixed and predefined misalignment), while the misalignment between feature maps caused by residual connections is far more complex. Therefore, correspondence between feature maps needs to be explicitly established to resolve their true misalignment.
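To make this concrete, the following is a minimal PyTorch sketch of the plain FPN-style top-down fusion described above, in which the coarse map is upsampled by a fixed factor of 2 and added to a lateral projection of the finer map; the channel widths and tensor sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Plain FPN top-down fusion: bilinear upsampling uses a fixed, uniform sampling
# grid, so no alignment between the two levels is learned.
lateral = nn.Conv2d(256, 128, kernel_size=1)  # lateral 1x1 projection of F_{l-1}

def fuse(f_coarse, f_fine):
    up = F.interpolate(f_coarse, scale_factor=2, mode='bilinear', align_corners=False)
    return lateral(f_fine) + up

f_fine = torch.randn(1, 256, 64, 128)   # F_{l-1}: higher resolution
f_coarse = torch.randn(1, 128, 32, 64)  # F_l: lower resolution, stronger semantics
print(fuse(f_coarse, f_fine).shape)     # torch.Size([1, 128, 64, 128])
```

FAM, introduced next, replaces this fixed interpolation grid with sampling positions shifted by a learned flow field.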


Figure 3. (a) The details of the Flow Alignment Module. We combine the transformed high-resolution feature map and low-resolution feature map to generate the semantic flow field, which is utilized to warp the low-resolution feature map to the high-resolution feature map. (b) Warping procedure of the Flow Alignment Module. The value at each position of the high-resolution feature map is the bilinear interpolation of the neighboring pixels in the low-resolution feature map, where the neighborhoods are defined according to the learned semantic flow field (i.e., offsets).


Figure 4. Overview of our proposed SFNet. ResNet-18 backbone with four stages is used for exemplar illustration. FAM: Flow Alignment Module. PPM: Pyramid Pooling Module [60].


3.2. Flow Alignment Module

The task is formally similar to aligning two video frames via optical flow [11], which motivates us to design a flow-based alignment module that aligns the feature maps of two adjacent levels by predicting a flow field. We define such a flow field, generated between different levels in a feature pyramid, as Semantic Flow. Specifically, we follow the design of FlowNet-S [11] for its efficiency. Given two adjacent feature maps Fl and Fl−1, we first upsample Fl to the same size as Fl−1 via bilinear interpolation, then concatenate them together and feed them to a convolutional layer with two kernels of spatial size 3 × 3, which predicts the semantic flow field.


Guided by the predicted flow field, each position on the high-resolution grid is mapped to a corresponding point on the coarse feature map, and its warped value is obtained by bilinearly interpolating the four nearest neighbors on the coarse map, as shown in Figure 3(b). A similar strategy is used in [63] for self-supervised monodepth learning via view synthesis. The proposed module is light-weight and end-to-end trainable; Figure 3(a) gives the detailed settings of the proposed module, while Figure 3(b) shows the warping process. Figure 2 visualizes the feature maps of two adjacent levels, their learned semantic flow and the finally warped feature map. As shown in Figure 2, the warped feature is more structured than the normal bilinearly upsampled feature and leads to more consistent representations inside objects such as the bus and car.
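As an illustration of the procedure above, here is a minimal PyTorch sketch of a FAM-style module under our own assumptions (a single 3 × 3 convolution predicting the 2-channel flow, equal channel widths at both levels, and warping implemented with grid_sample); it is a sketch of the described mechanism, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlignmentModule(nn.Module):
    """Sketch of FAM: predict a semantic flow field between two adjacent
    pyramid levels and warp the coarse feature map to the fine resolution."""

    def __init__(self, in_channels, mid_channels=128):
        super().__init__()
        # 1x1 convs compress both inputs to a common width (an assumption here).
        self.down_l = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.down_h = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        # 3x3 conv on the concatenation predicts the 2-channel semantic flow field.
        self.flow_make = nn.Conv2d(mid_channels * 2, 2, kernel_size=3, padding=1, bias=False)

    def forward(self, f_fine, f_coarse):
        h, w = f_fine.shape[2:]
        coarse_up = F.interpolate(self.down_h(f_coarse), size=(h, w),
                                  mode='bilinear', align_corners=False)
        flow = self.flow_make(torch.cat([self.down_l(f_fine), coarse_up], dim=1))
        return self.flow_warp(f_coarse, flow)  # warped to the fine resolution

    @staticmethod
    def flow_warp(feature, flow):
        n = feature.size(0)
        out_h, out_w = flow.shape[2:]
        # Regular sampling grid in normalized [-1, 1] coordinates ...
        ys = torch.linspace(-1.0, 1.0, out_h, device=feature.device, dtype=feature.dtype)
        xs = torch.linspace(-1.0, 1.0, out_w, device=feature.device, dtype=feature.dtype)
        grid_y = ys.view(-1, 1).repeat(1, out_w)
        grid_x = xs.repeat(out_h, 1)
        grid = torch.stack((grid_x, grid_y), dim=2).unsqueeze(0).expand(n, -1, -1, -1)
        # ... shifted by the predicted flow, rescaled to the normalized range.
        norm = torch.tensor([out_w, out_h], device=feature.device, dtype=feature.dtype)
        grid = grid + flow.permute(0, 2, 3, 1) / norm
        # Each output value is a bilinear interpolation of its four nearest
        # neighbors on the coarse map, as in Figure 3(b).
        return F.grid_sample(feature, grid, mode='bilinear', align_corners=True)
```

Used as `fam(f_fine, f_coarse)`, the output has the spatial size of the fine map and can be fused exactly where plain bilinear upsampling was used before.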


3.3. Network Architectures

Figure 4 illustrates the whole network architecture, which contains a bottom-up pathway as the encoder and a top-down pathway as the decoder. The encoder has the same backbone as for image classification, with the fully connected layers replaced by a contextual modeling module, and the decoder is an FPN equipped with FAMs. Details of each part are described as follows.
Backbone. We choose standard networks pretrained on ImageNet [43] for image classification as our backbone network by removing the last fully connected layer. Specifically, the ResNet series [17], ShuffleNet v2 [30] and DF series [48] are used in our experiments. All backbones have four stages with residual blocks, and each stage has stride 2 in its first convolution layer to downsample the feature map for both computational efficiency and larger receptive fields.
Contextual Module. The contextual module plays an important role in scene parsing to capture long-range contextual information [54, 60], and we adopt the Pyramid Pooling Module (PPM) [60] in this work. Since PPM outputs a feature map with the same resolution as the last residual module, we treat PPM and the last residual module together as the last stage for FPN. Other modules like ASPP [5] can be readily plugged into our architecture in a similar manner, which is also verified in the Experiment section.
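For completeness, below is a compact sketch of a PPM-style contextual head in PyTorch; the pooling bins (1, 2, 3, 6) follow PSPNet [60], while the channel widths are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module sketch (bin sizes from PSPNet, widths assumed)."""

    def __init__(self, in_channels, branch_channels=128, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(output_size=b),  # pool to a b x b context map
                nn.Conv2d(in_channels, branch_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            ) for b in bins
        ])
        self.project = nn.Sequential(
            nn.Conv2d(in_channels + branch_channels * len(bins), in_channels,
                      kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample every pooled branch back to the input resolution and concatenate.
        feats = [x] + [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                     align_corners=False) for stage in self.stages]
        return self.project(torch.cat(feats, dim=1))
```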


Cascaded Deeply Supervised Learning. We use a deeply supervised loss [60] to supervise the intermediate outputs of the decoder for easier optimization. In addition, online hard example mining [44, 50] is used by only training on the 10% hardest pixels sorted by cross-entropy loss.
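A minimal sketch of the hard-pixel mining described above, assuming the standard per-pixel cross-entropy and the 10% keep ratio mentioned in the text (the ignore index of 255 is an assumption):

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, keep_ratio=0.1, ignore_index=255):
    """Keep only the hardest `keep_ratio` of valid pixels, ranked by their
    per-pixel cross-entropy loss, and average the loss over them."""
    pixel_losses = F.cross_entropy(logits, target, ignore_index=ignore_index,
                                   reduction='none').view(-1)
    valid = (target.view(-1) != ignore_index)
    pixel_losses = pixel_losses[valid]
    n_keep = max(1, int(keep_ratio * pixel_losses.numel()))
    hard_losses, _ = torch.topk(pixel_losses, n_keep)
    return hard_losses.mean()
```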


4. Experiment

We first carry out experiments on the Cityscapes [8] dataset, which comprises a large, diverse set of high-resolution (2048 × 1024) images recorded in street scenes. This dataset has 5,000 images with high-quality pixel-wise annotations for 19 classes, which are further divided into 2975, 500, and 1525 images for training, validation and testing. Note that the 20,000 coarsely labeled images provided by this dataset are not used in this work. Besides, more experiments on Pascal Context [12], ADE20K [62] and CamVid [3] are summarized to further prove the effectiveness of our method.


4.1. Experiments on Cityscapes

Implementation details: We use the PyTorch [40] framework to carry out the following experiments. All networks are trained with the same setting, where stochastic gradient descent (SGD) with a batch size of 16 is used as the optimizer, with momentum of 0.9 and weight decay of 5e-4. All models are trained for 50K iterations with an initial learning rate of 0.01. As a common practice, the "poly" learning rate policy is adopted to decay the initial learning rate by multiplying (1 − iter/total_iter)^0.9 during training. Data augmentation contains random horizontal flip, random resizing with a scale range of [0.75, 2.0], and random cropping with a crop size of 1024 × 1024.
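The optimizer and schedule above can be wired up as follows; the model is a hypothetical placeholder, and only the SGD/poly-decay configuration reflects the stated settings.

```python
import torch

model = torch.nn.Conv2d(3, 19, kernel_size=1)  # placeholder network for illustration

max_iter = 50000  # 50K iterations, as stated above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# "Poly" policy: scale the base LR by (1 - iter / total_iter) ** 0.9 each iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** 0.9)

for it in range(max_iter):
    # loss = criterion(model(images), labels); loss.backward()  # training step omitted
    optimizer.step()
    scheduler.step()
```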

During inference, we use the whole picture as input to report performance unless explicitly mentioned. For quantitative evaluation, the mean of class-wise intersection-over-union (mIoU) is used for accurate comparison, and the number of floating-point operations (FLOPs) and frames per second (FPS) are adopted for speed comparison. Most ablation studies are conducted on the validation set, and we also compare our method with other state-of-the-art methods on the test set.
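For reference, mIoU averages the per-class ratio TP / (TP + FP + FN) computed from a pixel-level confusion matrix; the helper below is our own illustrative sketch, not part of the paper's evaluation code.

```python
import numpy as np

def mean_iou(conf_matrix: np.ndarray) -> float:
    """Mean of class-wise IoU from a (num_classes x num_classes) confusion
    matrix, where conf_matrix[i, j] counts pixels of class i predicted as j."""
    tp = np.diag(conf_matrix).astype(np.float64)
    fp = conf_matrix.sum(axis=0) - tp   # predicted as the class but wrong
    fn = conf_matrix.sum(axis=1) - tp   # belonging to the class but missed
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))       # classes absent from both GT and prediction are skipped
```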
Comparison with baseline methods: Table 1 reports the comparison results against baselines on the validation set of Cityscapes [8], where ResNet-18 [17] serves as the backbone. Compared with the naive FCN, dilated FCN improves mIoU by 1.1%. By appending the FPN decoder to the naive FCN, we get 74.8% mIoU, an improvement of 3.2%. By replacing bilinear upsampling with the proposed FAM, mIoU is boosted to 77.2%, which improves over the naive FCN and the FPN decoder by 5.7% and 2.4% respectively. Finally, we append the PPM (Pyramid Pooling Module) [60] to capture global contextual information, which achieves the best mIoU of 78.7% together with FAM. Meanwhile, FAM is complementary to PPM: FAM improves PPM from 76.6% to 78.7%.
Positions to insert FAM: We insert FAM at different stage positions in the FPN decoder and report the results in Table 2. From the first three rows, FAM improves all stages and achieves the greatest improvement at the last stage, which demonstrates that misalignment exists at all stages of FPN and is more severe in coarse layers. This phenomenon is consistent with the fact that coarse layers contain stronger semantics but lower resolution, and can greatly boost segmentation performance when they are appropriately upsampled to high resolution. The best performance is achieved by adding FAM to all stages, as listed in the last row.
Ablation study on different contextual heads: Considering that current state-of-the-art contextual modules are used as heads on dilated backbone networks [5, 13, 49, 56, 60, 61], we further try different contextual heads in our method, where the coarse feature map is used for contextual modeling. Table 3 reports the comparison results, where PPM [60] delivers the best result, while more recently proposed methods such as Non-Local based heads [18, 46, 53] perform worse. Therefore, we choose PPM as our contextual head considering its better performance with lower computational cost.

Table 2. Ablation study on different positions to insert FAM, where Fl denotes the upsampling position between level l and level l − 1. ResNet-18 with FPN decoder and PPM head serves as the baseline.

Table 3. Ablation with different contextual modeling heads, where ResNet-18 with FAM serves as the baseline.

Table 4. Ablation study on different backbones, where the FPN decoder with PPM head is used as the baseline. The top part compares deep networks with sliding-window testing, and the bottom part compares light-weight networks using single-view testing.


Ablation study on different backbones: We further carry out a set of experiments with different backbone networks, including both deep and light-weight networks, where the FPN decoder with PPM head is used as a strong baseline. For heavy networks, we choose ResNet-50 and ResNet-101 [17] as representatives. For light-weight networks, ShuffleNet v2 [30] and DF1/DF2 [48] are experimented with. All these backbones are pretrained on ImageNet [43]. Table 4 reports the results, where FAM achieves significantly better mIoU on all backbones with only slightly extra computational cost.

Visualization of Semantic Flow: Figure 5 visualizes the semantic flow from FAM at different stages. Similar to traditional optical flow, semantic flow is visualized by color coding and is bilinearly interpolated to the image size for a quick overview. Besides, the vector field is also visualized for detailed inspection. From the visualization, we observe that semantic flow tends to converge to some positions inside objects, where these positions are generally near object centers and have better receptive fields to activate top-level features with pure, strong semantics. Top-level features at these positions are then propagated to appropriate high-resolution positions following the guidance of the semantic flow. In addition, semantic flows also have coarse-to-fine trends from the top level to the bottom level, which is consistent with the fact that semantic flows gradually describe offsets between gradually smaller patterns.
Improvement analysis: Table 5 compares the detailed results of each category on the validation set, where ResNet-101 is used as the backbone, and the FPN decoder with PPM head serves as the baseline. Our method improves almost all categories, especially 'truck' with more than 19% mIoU improvement. Figure 6 visualizes the prediction errors of both methods, where FAM considerably resolves ambiguities inside large objects (e.g., truck) and produces more precise boundaries for small and thin objects (e.g., poles, edges of walls).

Comparison with PSPNet: We compare our segmentation results with the previous state-of-the-art model PSPNet [60] using ResNet-101 as the backbone. We re-implement PSPNet using the open-source code provided by the authors and achieve 78.8% mIoU on the validation set. Based on the same backbone ResNet-101 without using atrous convolution, our method achieves 79.8% mIoU while being about 3 times faster than PSPNet. Figure 7 shows the comparison results, where our model produces more consistent results for large objects and keeps more detailed information, benefiting from the well-fused multi-level feature pyramid in our decoder.
Comparison with state-of-the-art real-time models: All compared methods are evaluated with single-scale inference, and input sizes are also listed for fair comparison. Our speed is tested on one GTX 1080Ti GPU with the full image resolution of 1024 × 2048 as input, and we report the speed of two versions, i.e., without and with TensorRT acceleration. As shown in Table 6, our method based on DF1 achieves a more accurate result (74.5%) than all methods faster than it. With DF2, our method outperforms all previous methods while running at 60 FPS. With ResNet-18 as the backbone, our method achieves 78.9% mIoU and even reaches the performance of accurate models, which will be discussed in the next experiment. By additionally using the Mapillary [36] dataset for pretraining, our ResNet-18 based model achieves 26 FPS with 80.4% mIoU, which sets the new state-of-the-art record on the accuracy and speed trade-off on the Cityscapes benchmark. More detailed information about Mapillary pretraining and TensorRT acceleration can be found in the supplementary file.

Comparison with state-of-the-art accurate models: State-of-the-art accurate models [13, 49, 60, 65] perform multi-scale and horizontal flip inference to achieve better results on the Cityscapes test server. Although our model can run fast in real-time scenarios with single-scale inference, for fair comparison we also report multi-scale results with flip testing, which is the common setting of previous methods [13, 60]. The number of model parameters and computation FLOPs are also listed for comparison. Table 7 summarizes the results, where our models achieve state-of-the-art accuracy while costing much less computation. In particular, our method based on ResNet-18 is 1.1% mIoU higher than PSPNet [60] while only requiring 11% of its computation. Our ResNet-101 based model achieves comparable results with DANet [13] and only requires 32% of its computation.

Figure 5. Visualization of the learned semantic flow fields. Column (a) lists three exemplar images. Columns (b)-(d) show the semantic flow of the three FAMs in ascending order of resolution during the decoding process, following the same color coding as Figure 2. Column (e) is the arrowhead visualization of the flow fields in column (d). Column (f) contains the segmentation results.

Table 5. Quantitative per-category comparison results on Cityscapes validation set, where ResNet-101 backbone with the FPN decoder and PPM head serves as the strong baseline. Sliding window crop with horizontal flip is used for testing. Obviously, FAM boosts the performance of almost all the categories.

Figure 6. Qualitative comparison in terms of errors in predictions, where correctly predicted pixels are shown as black background while wrongly predicted pixels are colored with their ground-truth label color codes.
Figure 7. Scene parsing results compared against PSPNet [60], where significantly improved regions are marked with red dashed boxes. Our method performs better on both small-scale and large-scale objects.


4.2. Experiment on More Datasets

To further prove the effectiveness of our method, we perform more experiments on three other datasets: Pascal Context [34], ADE20K [62] and CamVid [3]. The standard settings of each benchmark are used, which are summarized in the supplementary file.
PASCAL Context: provides pixel-wise segmentation annotation for 59 classes, and contains 4,998 training images and 5,105 testing images. The results are shown in Table 8; our method outperforms the corresponding baselines by 1.7% mIoU and 2.6% mIoU with ResNet-50 and ResNet-101 as backbones respectively. In addition, our method on both ResNet-50 and ResNet-101 outperforms their existing counterparts by large margins with significantly lower computational cost.
ADE20K: is a challenging scene parsing dataset annotated with 150 classes, and it contains 20K/2K images for training and validation. Images in this dataset come from different scenes with more scale variation. Table 9 reports the performance comparisons; our method improves the baselines by 1.69% mIoU and 1.59% mIoU respectively, and outperforms previous state-of-the-art methods [60, 61] with much less computation.


Table 6. Comparison on Cityscapes test set with state-of-the-art real-time models. For fair comparison, input size is also consid- ered, and all models use single scale inference.

Table 7. Comparison on Cityscapes test set with state-of-the-art accurate models. For better accuracy, all models use multi-scale inference.

CamVid: is another road scene dataset for autonomous driving. This dataset involves 367 training images, 101 validation images and 233 testing images with a resolution of 480 × 360. We apply our method with different light-weight backbones on this dataset and report the comparison results in Table 10. With DF2 as the backbone, FAM improves its baseline by 3.4% mIoU. Our method based on ResNet-18 performs best with 72.4% mIoU while running at 45.2 FPS.



Table 8. Comparison with state-of-the-art methods on the Pascal Context testing set [34]. All models use multi-scale inference with horizontal flip.

Table 9. Results on the ADE20K dataset, where our models achieve the best trade-off between speed and accuracy; all models are evaluated using multi-scale inference with horizontal flip.

Table 10. Accuracy and efficiency comparison with previous state- of-the-art real-time models on CamVid [3] test set, where the input size is 360 × 480 and single scale inference is used.


5. Conclusion

In this paper, we devise the learned Semantic Flow to align the multi-level feature maps generated by a feature pyramid for the task of scene parsing. With the proposed flow alignment module, high-level features flow well to low-level feature maps with high resolution. By discarding atrous convolutions to reduce computational overhead and employing the flow alignment module to enrich the semantic representation of low-level features, our network achieves the best trade-off between semantic segmentation accuracy and running-time efficiency. Experiments on multiple challenging datasets illustrate the efficacy of our method. Since our network is super efficient and shares the same spirit as optical flow for aligning different maps (i.e., feature maps of different video frames), it can be naturally extended to video semantic segmentation to align feature maps hierarchically and temporally. Besides, we are also interested in extending the idea of semantic flow to other related areas like panoptic segmentation, etc.


References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. PAMI, 2017.
[2] Simon Baker, Daniel Scharstein, J. P. Lewis, Stefan Roth, Michael J. Black, and Richard Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, Mar 2011.
[3] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, xx(x):xx–xx, 2008.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 2018.
[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[7] Bowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas S. Huang, Wen-Mei Hwu, and Honghui Shi. Spgnet: Semantic prediction guidance for scene parsing. In ICCV, October 2019.
[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[9] Henghui Ding, Xudong Jiang, Ai Qun Liu, Nadia Magnenat-Thalmann, and Gang Wang. Boundary-aware feature propagation for scene segmentation. 2019.
[10] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, 2018.
[11] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In CVPR, 2015.
[12] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
[13] Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983, 2018.
[14] Raghudeep Gadde, Varun Jampani, and Peter V. Gehler. Semantic video cnns through representation warping. In ICCV, Oct 2017.
[15] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In ICCV, October 2019.
[16] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In CVPR, June 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. 2019.
[19] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. ArXiv, abs/1506.02025, 2015.
[20] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ArXiv, abs/1609.02907, 2016.
[21] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In CVPR, June 2019.
[22] Shu Kong and Charless C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. In CVPR, 2018.
[23] Hanchao Li, Pengfei Xiong, Haoqiang Fan, and Jian Sun. Dfanet: Deep feature aggregation for real-time semantic segmentation. In CVPR, June 2019.
[24] Hanchao Li, Pengfei Xiong, Haoqiang Fan, and Jian Sun. Dfanet: Deep feature aggregation for real-time semantic segmentation. In CVPR, June 2019.
[25] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In ICCV, 2019.
[26] Yin Li and Abhinav Gupta. Beyond grids: Learning graph representations for visual recognition. In NIPS, 2018.
[27] Yule Li, Jianping Shi, and Dahua Lin. Low-latency video semantic segmentation. In CVPR, June 2018.
[28] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[30] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, September 2018.
[31] Davide Mazzini. Guided upsampling network for real-time semantic segmentation. In BMVC, 2018.
[32] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, September 2018.
[33] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network. In CVPR, June 2019.
[34] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[35] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In CVPR, June 2019.
[36] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
[37] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. In CVPR, June 2018.
[38] Marin Orsic, Ivan Kreso, Petra Bevandic, and Sinisa Segvic. In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In CVPR, June 2019.
[39] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation.
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
[41] Eduardo Romera, Jose M. Alvarez, Luis Miguel Bergasa, and Roberto Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intelligent Transportation Systems, pages 263–272, 2018.
[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.
[43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015.
[44] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[46] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, June 2018.
[47] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[48] Xin Li, Yiming Zhou, Zheng Pan, and Jiashi Feng. Partial order pruning: for best speed/accuracy trade-off in neural architecture search. In CVPR, 2019.
[49] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018.
[50] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
[51] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, 2018.
[52] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.
[53] Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
[54] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, June 2018.
[55] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
[56] Hang Zhang, Han Zhang, Chenguang Wang, and Junyuan Xie. Co-occurrent features in semantic segmentation. In CVPR, June 2019.
[57] Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan. Scale-adaptive convolutions for scene parsing. In ICCV, 2017.
[58] Yiheng Zhang, Zhaofan Qiu, Jingen Liu, Ting Yao, Dong Liu, and Tao Mei. Customizable architecture search for semantic segmentation. In CVPR, June 2019.
[59] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In ECCV, September 2018.
[60] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[61] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, 2018.
[62] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.
[63] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
[64] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, July 2017.
[65] Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. In CVPR, June 2019.
[66] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In ICCV, 2019.
