Deep Snake for Real-Time Instance Segmentation


Abstract

This paper introduces a novel contour-based approach named deep snake for real-time instance segmentation. Unlike some recent methods that directly regress the coordinates of the object boundary points from an image, deep snake uses a neural network to iteratively deform an initial contour to the object boundary, which implements the classic idea of snake algorithms with a learning-based approach. For structured feature learning on the contour, we propose to use circular convolution in deep snake, which better exploits the cycle-graph structure of a contour compared against generic graph convolution. Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in initial object localization. Experiments show that the proposed approach achieves state-of-the-art performances on the Cityscapes, Kins and Sbd datasets while being efficient for real-time instance segmentation, 32.3 fps for 512×512 images on a 1080Ti GPU. The code is available at https://github.com/zju3dv/snake/.


1. Introduction

Instance segmentation is the cornerstone of many computer vision tasks, such as video analysis, autonomous driving, and robotic grasping, which require both accuracy and efficiency. Most of the state-of-the-art instance segmentation methods [17, 25, 4, 18] perform pixel-wise segmentation within a bounding box given by an object detector [34], which may be sensitive to the inaccurate bounding box. Moreover, representing an object shape as dense binary pixels generally results in costly post-processing.
An alternative shape representation is the object contour, which is composed of a sequence of vertices along the object silhouette. In contrast to pixel-based representation, contour is not limited within a bounding box and has fewer parameters. Such contour-based representation has long been used in image segmentation since the seminal work by Kass et al. [20], which is well known as the snake algorithm or active contour model. Given an initial contour, the snake algorithm iteratively deforms it to the object boundary by optimizing an energy functional defined with low-level image features, such as image intensity or gradient. While many variants [5, 6, 14] have been developed in the literature, these methods tend to find local optimal solutions, as the objective functions are handcrafted and the optimization is usually nonlinear.


Some recent learning-based segmentation methods [19, 39] also represent objects as contours and try to directly regress the coordinates of object boundary points from an RGB image. Although such methods are much faster, they do not perform as well as pixel-based methods. Instead, Ling et al. [23] adopt the deformation pipeline of traditional snake algorithms and train a neural network to evolve an initial contour to the object boundary. Given a contour with image features, it regards the input contour as a graph and uses a graph convolutional network (GCN) to predict vertex-wise offsets between contour points and the target boundary points. It achieves competitive accuracy compared with pixel-based methods while being much faster. However, the method proposed in [23] is designed to help annotation and lacks a complete pipeline for automatic instance segmentation. Moreover, treating the contour as a general graph with a generic GCN does not fully exploit the special topology of a contour.


In this paper, we propose a learning-based snake algorithm, named deep snake, for real-time instance segmentation. Inspired by previous methods [20, 23], deep snake takes an initial contour as input and deforms it by regressing vertex-wise offsets. Our innovation is in introducing the circular convolution for efficient feature learning on a contour, as illustrated in Figure 1. We observe that the contour is a cycle graph that consists of a sequence of vertices connected in a closed cycle. Since every vertex has the same degree equal to two, we can apply the standard 1D convolution on the vertex features. Considering that the contour is periodic, deep snake introduces the circular convolution, which indicates that an aperiodic function (1D kernel) is convolved in the standard way with a periodic function (features defined on the contour). The kernel of circular convolution encodes not only the feature of each vertex but also the relationship among neighboring vertices. In contrast, the generic GCN performs pooling to aggregate information from neighboring vertices. The kernel function in our circular convolution amounts to a learnable aggregation function, which is more expressive and results in better performance than using a generic GCN, as demonstrated by our experimental results in Section 5.2.



Figure 1. The basic idea of deep snake. Given an initial contour, image features are extracted at each vertex (a). Since the contour is a cycle graph, circular convolution is applied for feature learning on the contour (b). The blue, yellow and green nodes denote the input features, the kernel of circular convolution, and the output features, respectively. Finally, offsets are regressed at each vertex to deform the contour to the object boundary (c).

Based on deep snake, we develop a pipeline for instance segmentation. Given an initial contour, deep snake can iteratively deform it to the object boundary and obtain the object shape. The remaining question is how to initialize a contour, whose importance has been demonstrated in classic snake algorithms. Inspired by [30, 27, 42], we propose to generate an octagon formed by object extreme points as the initial contour, which generally encloses the object tightly. Specifically, we add deep snake to a detection model. The detected box first gives a diamond contour by connecting four points centered at its borders. Then deep snake takes the diamond as input and outputs offsets that point from the four vertices to the four extreme points, which are used to construct an octagon following [42]. Finally, deep snake deforms the octagon contour to the object boundary.
Our approach exhibits state-of-the-art performances on the Cityscapes [7], Kins [33] and Sbd [15] datasets, while being efficient for real-time instance segmentation, 32.3 fps for 512 × 512 images on a GTX 1080 Ti GPU. There are two reasons why the learning-based snake is fast while being accurate. First, our approach can deal with errors in the object localization and thus allows a light detector. Second, the object contour has fewer parameters than the pixel-based representation and does not require costly post-processing, such as mask upsampling.
In summary, this work has the following contributions:


• We propose a learning-based snake algorithm for real-time instance segmentation, which deforms an initial contour to the object boundary and introduces the circular convolution for feature learning on the contour.
• We propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. Both stages can deal with errors in the initial object localization.
• We demonstrate state-of-the-art performances of our approach on Cityscapes, Kins and Sbd datasets. For 512 × 512 images, our algorithm runs at 32.3 fps, which is efficient for real-time instance segmentation.


2. Related work

Pixel-based methods. Most methods [8, 22, 17, 25] perform instance segmentation on the pixel level within a region proposal, which works particularly well with standard CNNs. A representative instantiation is Mask R-CNN [17]. It first detects objects and then uses a mask predictor to segment instances within the proposed boxes. To better exploit the spatial information inside the box, PANet [25] fuses mask predictions from fully-connected layers and convolutional layers. Such proposal-based approaches achieve state-of-the-art performance. One limitation of these methods is that they cannot resolve errors in localization, such as too small or shifted boxes. In contrast, our approach deforms the detected boxes to the object boundaries, so the spatial extension of object shapes will not be limited.
There exist some pixel-based methods [2, 29, 26, 11, 40] that are free of region proposals. In these methods, every pixel produces auxiliary information, and a clustering algorithm then groups pixels into object instances based on this information. Both the auxiliary information and the grouping algorithms vary across methods. [2] predicts the boundary-aware energy for each pixel and uses the watershed transform algorithm for grouping. [29] differentiates instances by learning instance-level embeddings. [26, 11] consider the input image as a graph and regress pixel affinities, which are then processed by a graph merge algorithm. Since the mask is composed of dense pixels, the post-clustering algorithms tend to be time-consuming.

Contour-based methods. In these methods, the object shape comprises a sequence of vertices along the object boundary. Traditional snake algorithms [20, 5, 6, 14] first introduced the contour-based representation for image segmentation. They deform an initial contour to the object boundary by optimizing a handcrafted energy with respect to the contour coordinates. To improve the robustness of these methods, [28] proposed to learn the energy function in a data-driven manner. Instead of iteratively optimizing the contour, some recent learning-based methods [19, 39] try to regress the coordinates of contour points from an RGB image, which is much faster. However, their accuracy is not competitive with state-of-the-art pixel-based methods.

In the field of semi-automatic annotation, [3, 1, 23] have tried to perform the contour labeling using other networks instead of standard CNNs. [3, 1] predict the contour points sequentially using a recurrent neural network. To avoid sequential inference, [23] follows the pipeline of snake algorithms and uses a graph convolutional network to predict vertex-wise offsets for contour deformation. This strategy significantly improves the annotation speed while being as accurate as pixel-based methods. However, [23] lacks a pipeline for instance segmentation and does not fully exploit the special topology of a contour. Instead of treating the contour as a general graph, deep snake leverages the cycle graph topology and introduces the circular convolution for efficient feature learning on a contour.


3. Proposed approach

Inspired by [20, 23], we perform object segmentation by deforming an initial contour to the object boundary. Specifically, deep snake takes a contour as input based on image features from a CNN backbone and predicts per-vertex offsets pointing to the object boundary. To fully exploit the contour topology, we introduce the circular convolution for efficient feature learning on the contour, which facilitates deep snake to learn the deformation. Based on deep snake, a pipeline is developed for instance segmentation.


3.1. Learning-based snake algorithm

Given an initial contour, traditional snake algorithms treat the coordinates of the vertices as a set of variables and optimize an energy functional with respect to these variables. By designing proper image forces at the contour coordinates, active contour models could optimize the contour to the object boundary. However, since the energy functional is typically nonconvex and handcrafted based on low-level image features, the deformation process tends to find local optimal solutions.
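For reference, the energy that classic snakes [20] minimize has roughly the following textbook form (a standard formulation, not an equation reproduced from this paper), where α and β weight the elasticity and rigidity of the contour x(s), and E_image is built from low-level cues such as the image gradient:

```latex
E(\mathbf{x}) = \int_0^1
    \alpha \,\lVert \mathbf{x}'(s) \rVert^2
  + \beta  \,\lVert \mathbf{x}''(s) \rVert^2
  + E_{\mathrm{image}}\big(\mathbf{x}(s)\big)\, ds,
\qquad \text{e.g.}\quad
E_{\mathrm{image}}\big(\mathbf{x}(s)\big) = -\,\lVert \nabla I\big(\mathbf{x}(s)\big) \rVert^2 .
```

Both terms are handcrafted, which is exactly why gradient-based optimization of this functional is prone to the local optima mentioned above.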
In contrast, deep snake directly learns to evolve the contour from data in an end-to-end manner. Given a contour with N vertices {x_i | i = 1, …, N}, we first construct feature vectors for each vertex. The input feature f_i for a vertex x_i is a concatenation of learning-based features and the vertex coordinate: [F(x_i); x′_i], where F is the feature maps and x′_i is a translation-invariant version of vertex x_i. The feature maps F is obtained by applying a CNN backbone on the input image, which deep snake shares with the detector in our instance segmentation model. The image feature F(x_i) is computed using the bilinear interpolation of features at the vertex coordinate x_i. The appended vertex coordinate is used to model the spatial relationship among contour vertices. Since the deformation should not be affected by the absolute location of the contour, we compute the translation-invariant coordinate x′_i by subtracting the minimum value along the x and y axes over all vertices, respectively.
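As a concrete illustration, the per-vertex input features can be assembled as in the following PyTorch sketch; the tensor layout, the `align_corners` choice and the function name are our assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def vertex_features(feature_map, contour):
    """Build the per-vertex inputs [F(x_i); x'_i].

    feature_map: (B, C, H, W) backbone features shared with the detector.
    contour:     (B, N, 2) vertex coordinates (x, y) in pixel units.
    Returns a (B, C + 2, N) tensor of input features.
    """
    B, C, H, W = feature_map.shape
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    grid = contour.clone()
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    # Bilinear interpolation of the feature map at each vertex: F(x_i).
    sampled = F.grid_sample(feature_map, grid[:, None],
                            mode='bilinear', align_corners=True)[:, :, 0]
    # Translation-invariant coordinates x'_i: subtract the per-contour
    # minimum along the x and y axes.
    rel = contour - contour.min(dim=1, keepdim=True).values
    return torch.cat([sampled, rel.permute(0, 2, 1)], dim=1)
```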


Given the input features defined on a contour, deep snake introduces the circular convolution for the feature learning, as illustrated in Figure 2. In general, the features of contour vertices can be treated as a 1-D discrete signal f : Z → R^D and processed by the standard convolution. But this breaks the topology of the contour. Therefore, we treat the features on the contour as a periodic signal defined as:


(f_N)(i) ≜ Σ_{j=−∞}^{+∞} f(i − jN),

and propose to encode the periodic features by the circular convolution defined as:

(f_N ∗ k)(i) = Σ_{j=−r}^{r} (f_N)(i + j) k(j),

where k : [−r, r] → R^D is a learnable kernel function and N is the number of contour vertices.

Similar to the standard convolution, we can construct a network layer based on the circular convolution for feature learning, which can be easily integrated into a modern network architecture. After the feature learning, deep snake applies three 1×1 convolution layers to the output features for each vertex and predicts vertex-wise offsets between contour points and the target points, which are used to deform the contour. In all experiments, the kernel size of circular convolution is fixed to be nine.
As discussed in the introduction, the proposed circular convolution better exploits the circular structure of the contour than the generic graph convolution. We will show the experimental comparison in Section 5.2. An alternative method is to use standard CNNs to regress a pixel-wise vector field from the input image to guide the evolution of the initial contour [35, 31, 38]. We argue that an important advantage of deep snake over the standard CNNs is the object-level structured prediction, i.e., the offset prediction at a vertex depends on other vertices of the same contour. Therefore, it is more reasonable for deep snake to predict an offset for a vertex located in the background and far from the object, which is very common in an initial contour. Standard CNNs have difficulty in outputting meaningful offsets in this case, since it is ambiguous to decide which object a background pixel belongs to.
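A circular convolution layer can be realized by wrapping the vertex sequence before a standard 1D convolution, as in the minimal PyTorch sketch below (naming and tensor layout are our assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class CircConv(nn.Module):
    """Circular convolution over contour vertex features.

    Input is a (B, D, N) tensor: batch, feature dimension, and N vertices
    ordered along the closed contour. Concatenating r wrapped-around
    vertices on each side before a standard Conv1d implements the
    periodic convolution defined above, so the output length stays N.
    """
    def __init__(self, in_dim, out_dim, kernel_size=9):
        super().__init__()
        assert kernel_size % 2 == 1, "use an odd kernel size 2r + 1"
        self.r = kernel_size // 2
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size)

    def forward(self, x):
        # Wrap the contour: vertex 0 now sees vertices N-r..N-1, and vice versa.
        x = torch.cat([x[..., -self.r:], x, x[..., :self.r]], dim=2)
        return self.conv(x)
```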



Figure 3. Proposed contour-based model for instance segmentation. (a) Deep snake consists of three parts: a backbone, a fusion block, and a prediction head. It takes a contour as input and outputs vertex-wise offsets to deform the contour. (b) Based on deep snake, we propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. The box proposed by the detector gives a diamond contour, whose four vertices are then deformed to object extreme points by deep snake. An octagon is constructed based on the extreme points. Taking the octagon as the initial contour, deep snake iteratively deforms it to the object boundary.


Network architecture. Figure 3(a) shows the detailed schematic. Following ideas from [32, 37, 21], deep snake consists of three parts: a backbone, a fusion block, and a prediction head. The backbone is comprised of 8 “CirConv-Bn-ReLU” layers and uses residual skip connections for all layers, where “CirConv” means circular convolution. The fusion block aims to fuse the information across all contour points at multiple scales. It concatenates features from all layers in the backbone and forwards them through a 1×1 convolution layer followed by max pooling. The fused feature is then concatenated with the feature of each vertex. The prediction head applies three 1×1 convolution layers to the vertex features and outputs vertex-wise offsets.
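Putting the pieces together, the backbone, fusion block and prediction head could look like the following sketch (layer widths and the fusion dimensionality are our assumptions; `CircConv` is the layer sketched above):

```python
import torch
import torch.nn as nn

class CirConvBlock(nn.Module):
    """One "CirConv-Bn-ReLU" layer with a residual skip connection."""
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.conv = CircConv(dim, dim, kernel_size)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):
        return x + torch.relu(self.bn(self.conv(x)))

class DeepSnake(nn.Module):
    def __init__(self, in_dim, dim=128, num_layers=8):
        super().__init__()
        self.head_in = CircConv(in_dim, dim)
        self.backbone = nn.ModuleList(CirConvBlock(dim) for _ in range(num_layers))
        # Fusion block: concatenate features from all backbone layers,
        # apply a 1x1 convolution, then max-pool over the contour points.
        self.fuse = nn.Conv1d(dim * (num_layers + 1), 256, 1)
        # Prediction head: three 1x1 convolutions down to (dx, dy) per vertex.
        self.head = nn.Sequential(
            nn.Conv1d(dim + 256, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 2, 1),
        )

    def forward(self, x):                      # x: (B, C_in, N) vertex features
        states = [self.head_in(x)]
        for layer in self.backbone:
            states.append(layer(states[-1]))
        fused = self.fuse(torch.cat(states, dim=1))          # (B, 256, N)
        global_feat = fused.max(dim=2, keepdim=True).values  # (B, 256, 1)
        per_vertex = torch.cat(
            [states[-1], global_feat.expand(-1, -1, x.size(2))], dim=1)
        return self.head(per_vertex)           # (B, 2, N) vertex-wise offsets
```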


3.2. Deep snake for instance segmentation

Figure 3(b) overviews the proposed pipeline for instance segmentation. We add deep snake to an object detection model. The detector first produces object boxes that are used to construct diamond contours. Then deep snake deforms the diamond vertices to object extreme points, which are used to construct octagon contours. Finally, our approach takes octagons as initial contours and performs iterative contour deformation to obtain the object shape.
Initial contour proposal. Most active contour models require precise initial contours. Since the octagon proposed in [42] generally encloses the object tightly, we choose it as the initial contour, as shown in Figure 3(b). This octagon is formed by object extreme points. Specifically, given a detected object box, we take the four points centered at its top, left, bottom and right borders, denoted by {x_i^bb | i = 1, 2, 3, 4}, and connect them to form a diamond contour. Deep snake takes this contour as input and outputs four offsets, each pointing from a vertex x_i^bb to the corresponding extreme point x_i^ex. In practice, to take in more context information, the diamond contour is uniformly upsampled to 40 points, and deep snake correspondingly outputs 40 offsets. The loss function only considers the offsets at x_i^bb.
We construct the octagon by generating four lines based on the extreme points and connecting their endpoints. Specifically, the four extreme points form a new object box. For each extreme point, a line extends from it along the corresponding box border in both directions to 1/4 of the border length, and the line is truncated if it meets a box corner. Then the endpoints of the four lines are connected to form the octagon.
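The construction above is purely geometric; a NumPy sketch (coordinate conventions and the function name are our assumptions):

```python
import numpy as np

def octagon_from_extremes(top, left, bottom, right):
    """Build the initial octagon from the four extreme points.

    Each extreme point spawns a segment along its box border, extending
    1/4 of the border length in both directions and truncated at the box
    corners; the eight endpoints connected in order form the octagon.
    Points are (x, y) with y pointing down, so the vertex order below
    traces the contour clockwise on screen.
    """
    x_min, x_max = left[0], right[0]
    y_min, y_max = top[1], bottom[1]
    w, h = x_max - x_min, y_max - y_min
    cx = lambda x: np.clip(x, x_min, x_max)  # truncate at box corners
    cy = lambda y: np.clip(y, y_min, y_max)
    return np.array([
        [cx(top[0] - w / 4), y_min], [cx(top[0] + w / 4), y_min],       # top
        [x_max, cy(right[1] - h / 4)], [x_max, cy(right[1] + h / 4)],   # right
        [cx(bottom[0] + w / 4), y_max], [cx(bottom[0] - w / 4), y_max], # bottom
        [x_min, cy(left[1] + h / 4)], [x_min, cy(left[1] - h / 4)],     # left
    ])
```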


However, regressing the offsets in one pass is challenging, especially for vertices far away from the object. Inspired by [20, 23, 36], we deal with this problem in an iterative optimization fashion. Specifically, our approach first predicts N offsets based on the current contour and then deforms this contour by adding the offsets to its vertex coordinates. The deformed contour can be used for the next deformation or directly outputted as the object shape. In experiments, the number of inference iterations is set to 3 unless otherwise stated.
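The iterative inference then reduces to a short loop; the sketch below reuses the hypothetical `vertex_features` and `DeepSnake` helpers from earlier and assumes offsets are predicted in pixel units:

```python
def deform(snake, feature_map, contour, iterations=3):
    """Iteratively deform a contour toward the object boundary.

    contour: (B, N, 2) vertices of the current contour. Each pass
    regresses per-vertex offsets from the current contour and applies
    them; the result feeds the next pass or is the final object shape.
    """
    for _ in range(iterations):
        offsets = snake(vertex_features(feature_map, contour))  # (B, 2, N)
        contour = contour + offsets.permute(0, 2, 1)            # add offsets
    return contour
```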
Note that the contour is an alternative representation for the spatial extension of an object. By deforming the initial contour to the object boundary, our approach could resolve the localization errors from the detector.



Figure 4. Given an object box, we perform RoIAlign to obtain the feature map and use a detector to detect the component boxes.

Handling multi-component objects. Due to occlusions, many instances comprise more than one connected component. However, a contour can only outline one connected component per bounding box. To overcome this problem, we propose to detect the object components within the object box. Specifically, using the detected box, our approach performs RoIAlign [17] to extract a feature map and adds a detector branch on the feature map to produce the component boxes. Figure 4 shows the basic idea. The following segmentation pipeline remains the same. Our approach obtains the final object shape by merging component contours from the same object box.


4. Implementation details


Detector. We adopt CenterNet [41] as the detector for all experiments. CenterNet reformulates the detection task as a keypoint detection problem and achieves an impressive trade-off between speed and accuracy. For the object box detector, we adopt the same setting as [41], which outputs class-specific boxes. For the component box detector, a class-agnostic CenterNet is adopted. Specifically, given an H × W × C feature map, the class-agnostic CenterNet outputs an H × W × 1 tensor representing the component center and an H × W × 2 tensor representing the box size.
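A class-agnostic head of this shape might look as follows (the hidden width and layer count are assumptions; the real component detector follows CenterNet's head design):

```python
import torch.nn as nn

class ComponentBoxHead(nn.Module):
    """Class-agnostic CenterNet-style head on an RoI feature map.

    Given (B, C, H, W) RoI features, returns an H x W x 1 center heatmap
    and an H x W x 2 box-size map (as channel-first tensors), matching
    the two output tensors described above.
    """
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        def branch(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_dim, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, out_channels, 1),
            )
        self.center = branch(1)  # component center heatmap
        self.size = branch(2)    # box width and height

    def forward(self, x):
        return self.center(x).sigmoid(), self.size(x)
```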


5. Experiments

We compare our approach with the state-of-the-art methods on the Cityscapes [7], Kins [33] and Sbd [15] datasets. Comprehensive ablation studies are conducted to analyze the importance of the proposed components in our approach.


5.1. Datasets and Metrics

Cityscapes [7] is a widely used benchmark for urban scene instance segmentation. It contains 2,975 training, 500 validation and 1,525 testing images with high quality annotations. Besides, it has 20k images with coarse annotations. This dataset is challenging due to the crowded scenes and the wide range in object scale. The performance is evaluated in terms of the average precision (AP) metric averaged over the 8 semantic classes of the dataset. We report our results on the validation and test sets.
Kins [33] was recently created by additionally annotating the Kitti [12] dataset with instance-level semantic annotation. This dataset is used for amodal instance segmentation, which is a variant of instance segmentation and aims to recover complete instance shapes even under occlusion. Kins consists of 7,474 training images and 7,517 testing images. Following its setting, we evaluate our approach on 7 object categories in terms of the AP metric.
Sbd [15] re-annotates 11,355 images from the Pascal Voc [9] dataset with instance-level boundaries and has the same 20 object categories. The reason that we do not directly perform experiments on Pascal Voc is that its annotations contain holes, which is not suitable for contour-based methods. The Sbd dataset is split into 5,623 training images and 5,732 testing images. We report our results in terms of the 2010 Voc APvol [16], AP50 and AP70 metrics. APvol is the average of AP with 9 IoU thresholds from 0.1 to 0.9.
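For clarity, APvol reduces to a simple average over the nine thresholds; a small helper, assuming the per-threshold APs have already been computed:

```python
import numpy as np

def ap_vol(ap_at_iou):
    """APvol: mean AP over the nine IoU thresholds 0.1, 0.2, ..., 0.9.

    ap_at_iou maps an IoU threshold (rounded to one decimal) to the AP
    measured at that threshold.
    """
    thresholds = [round(0.1 * t, 1) for t in range(1, 10)]
    return float(np.mean([ap_at_iou[t] for t in thresholds]))
```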


5.2. Ablation studies


Table 2. Comparison between graph and circular convolution on the Sbd val set. The results are in terms of the APvol metric. Graph and circular convolution denote the convolution operator in the network. The columns show the results of different inference iterations. Circular convolution outperforms graph convolution across all inference iterations. Furthermore, circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 AP, indicating a stronger deforming ability.


We conduct ablation studies on the Sbd dataset with the consideration that it has 20 semantic categories and could fully evaluate the ability to deform various object contours. The three proposed components are evaluated, including our network architecture, initial contour proposal, and circular convolution. In these experiments, the detector and deep snake are trained end-to-end for 160 epochs with multi-scale data augmentation. The learning rate starts from 1e−4 and decays with 0.5 at 80 and 120 epochs. Table 1 summarizes the results of the ablation studies.
The row “Baseline” lists the result of a direct combination of Curve-gcn [23] with CenterNet [41]. Specifically, the detector produces object boxes, which give ellipses around objects. Then the ellipses are deformed into object boundaries through Graph-ResNet. Note that the baseline represents the contour as a graph and uses a graph convolution network for contour deformation.
To validate the advantages of our network, the model in the second row keeps the convolution operator as graph convolution and replaces Graph-ResNet with our proposed architecture, which yields a 1.4 APvol improvement. The main difference between the two networks is that our architecture appends a global fusion block before the prediction head.
When exploring the influence of the contour initialization, we add the initial contour proposal before the contour deformation. Instead of directly using the ellipse, the proposal step generates an octagon initialization by predicting four object extreme points, which not only resolves the detection errors but also encloses the object more tightly. The comparison between the second and the third row shows a 1.3 improvement in terms of APvol.

Finally, the graph convolution is replaced with the circular convolution, which achieves a 0.8 APvol improvement. To fully validate the importance of circular convolution, we further compare models with different convolution operators and different inference iterations, as shown in Table 2. Circular convolution outperforms graph convolution across all inference iterations. And circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 APvol. Figure 5 shows qualitative results of graph and circular convolution on Sbd, where circular convolution gives a sharper boundary. The quantitative and qualitative results indicate that models with the circular convolution have a stronger ability to deform contours.



Figure 5. Comparison between graph convolution (top) and circular convolution (bottom) on Sbd. The result of circular convolution with two iterations is visually better than that of graph convolution with three iterations.


5.3. Comparison with the state-of-the-art methods

Performance on Cityscapes. Since fragmented instances are very common in Cityscapes, we adopt the proposed strategy to handle multi-component objects. Our network is trained with multi-scale data augmentation and tested at a single resolution of 1216 × 2432. No testing tricks are used. The detector is first trained alone for 140 epochs, and the learning rate starts from 1e−4 and drops by half at 80 and 120 epochs. Then the detection and snake branches are trained end-to-end for 200 epochs, and the learning rate starts from 1e−4 and drops by half at 80, 120 and 150 epochs. We choose the model that performs best on the validation set.
Table 3 compares our results with other state-of-the-art methods on the Cityscapes validation and test sets. All methods are tested without tricks. Using only the fine annotations, our approach achieves state-of-the-art performances on both validation and test sets. We outperform PANet by 0.9 AP on the validation set and 1.3 AP50 on the test set. According to the approximate timing result in [29], PANet runs at less than 1.0 fps. In contrast, our model runs at 4.6 fps on a 1080 Ti GPU for 1216 × 2432 images, which is about 5 times faster. Our approach achieves 28.2 AP on the test set when the strategy of handling multi-component objects is not adopted. Visual results are shown in Figure 6.
Performance on Kins. As a dataset for amodal instance segmentation, objects in the Kins dataset are all connected as a single component, so the strategy of handling multi-component objects is not adopted. We train the detector and snake end-to-end for 150 epochs. The learning rate starts from 1e−4 and decays with 0.5 and 0.1 at 80 and 120 epochs, respectively. We perform multi-scale training and test the model at a single resolution of 768 × 2496.



Figure 6. Qualitative results on Cityscapes test and Kins test sets. The first two rows show the results on Cityscapes, and the last row lists the results on Kins. Note that the results on Kins are for amodal instance segmentation.


Table 3. Results on Cityscapes val (“AP [val]” column) and test (remaining columns) sets. Our approach achieves the state-of-the-art performance, which outperforms PANet [25] by 0.9 AP on the val set and 1.3 AP50 on the test set. According to the timing result in [29], our approach is approximately 5 times faster than PANet.


Table 4. Results on Kins test set in terms of the AP metric. The amodal bounding box is used as the ground truth in the detection task. × means no such output in the corresponding method.


Table 5. Results on the Sbd val set. Our approach outperforms other contour-based methods by a large margin. The improvement increases with the IoU threshold, 21.4 AP50 and 36.2 AP70.


Table 4 shows the comparison with [8, 22, 10, 17, 25] on the Kins dataset in terms of the AP metric. Kins [33] indicates that tackling both amodal and inmodal segmentation simultaneously can improve the performance, as shown in the fourth and the fifth row of Table 4. Our approach learns only the amodal segmentation task and achieves the best performance across all methods. We find that the snake branch can improve the detection performance. When CenterNet is trained alone, it obtains 30.5 AP on detection. When trained with the snake branch, its performance improves by 2.3 AP. For 768 × 2496 images on the Kins dataset, our approach runs at 7.6 fps on a 1080 Ti GPU. Figure 6 shows some qualitative results on Kins.


Figure 7. Qualitative results on Sbd val set. Our approach handles errors in object localization in most cases. For example, in the first image, although the detected boxes do not fully cover the boys, our approach recovers the complete object shapes. Zoom in for details.


Table 6. Running time on the Pascal Voc dataset. “MS” represents Mask R-CNN [17], and “OURS” represents our approach. The last three methods are contour-based methods.


Performance on Sbd. Most objects on the Sbd dataset are connected as a single component, so we do not handle fragmented instances. For multi-component objects, our approach detects their components separately instead of detecting the whole object. We train the detection and snake branches end-to-end for 150 epochs with multi-scale data augmentation. The learning rate starts from 1e−4 and drops by half at 80 and 120 epochs. The network is tested at a single scale of 512 × 512.

In Table 5, we compare with other contour-based methods [19, 39] on the Sbd dataset in terms of the Voc AP metrics. [19, 39] predict the object contours by regressing shape vectors. STS [19] defines the object contour as a radial vector from the object center, and ESE [39] approximates the object contour with 20 and 50 Chebyshev polynomial coefficients. In contrast, our approach deforms an initial contour to the object boundary. We outperform these methods by a large margin of at least 19.1 APvol. Note that our approach yields 21.4 AP50 and 36.2 AP70 improvements, demonstrating that the improvement increases with the IoU threshold. This indicates that our algorithm better outlines object boundaries. For 512×512 images on the Sbd dataset, our approach runs at 32.3 fps on a 1080 Ti. Some qualitative results are illustrated in Figure 7.


5.4. Running time

Table 6 compares our approach with other methods [8, 22, 17, 19, 39] in terms of running time on the Pascal Voc dataset. Since the Sbd dataset shares images with Pascal Voc and has the same semantic categories, the running time on the Sbd dataset is technically the same as the one on Pascal Voc. We obtain the running times of other methods on Pascal Voc from [39]. For 512 × 512 images on the Sbd dataset, our algorithm runs at 32.3 fps on a desktop with an Intel i7 3.7GHz and a GTX 1080 Ti GPU, which is efficient for real-time instance segmentation. Specifically, CenterNet takes 18.4 ms, the initial contour proposal takes 3.1 ms, and each iteration of contour deformation takes 3.3 ms. Since our approach outputs the object boundary, no post-processing like upsampling is required. If the strategy of handling fragmented instances is adopted, the detector additionally takes 3.6 ms.

6. Conclusion

We introduced a new contour-based model for real-time instance segmentation. Inspired by traditional snake algorithms, our approach deforms an initial contour to the object boundary and obtains the object shape. To this end, we proposed a learning-based snake algorithm, named deep snake, which introduces the circular convolution for efficient feature learning on the contour and regresses vertex-wise offsets for the contour deformation. Based on deep snake, we developed a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. We showed that this pipeline gained a superior performance than direct regression of the coordinates of the object boundary points. We also showed that the circular convolution learns the structural information of the contour more effectively than the graph convolution. To overcome the limitation of the contour that it can only outline one connected component, we proposed to detect the object components within the object box and demonstrated the effectiveness of this strategy on Cityscapes. The proposed model achieved state-of-the-art results on the Cityscapes, Kins and Sbd datasets with real-time performance.

References

[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018. 3, 7
[2] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017. 2
[3] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In CVPR, 2017. 3
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019. 1
[5] Laurent D Cohen. On active contour models and balloons. CVGIP: Image understanding, 53(2):211–218, 1991. 1, 2
[6] Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. Active shape models - their training and application. CVIU, 61(1):38–59, 1995. 1, 2
[7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 5
[8] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016. 2, 6, 7, 8
[9] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010. 5
[10] Patrick Follmann, Rebecca König, Philipp Härtinger, Michael Klostermann, and Tobias Böttger. Learning to see the invisible: End-to-end trainable amodal instance segmentation. In WACV, 2019. 6, 7
[11] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. Ssap: Single-shot instance segmentation with affinity pyramid. In ICCV, 2019. 2
[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013. 5
[13] Ross Girshick. Fast r-cnn. In ICCV, 2015. 5
[14] Steve R Gunn and Mark S Nixon. A robust snake implementation; a dual active contour. PAMI, 19(1):63–68, 1997. 1, 2
[15] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011. 2, 5
[16] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014. 5
[17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017. 1, 2, 5, 6, 7, 8
[18] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In CVPR, 2019. 1
[19] Saumya Jetley, Michael Sapienza, Stuart Golodetz, and Philip HS Torr. Straight to shapes: Real-time detection of encoded shapes. In CVPR, 2017. 1, 2, 7, 8
[20] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988. 1, 2, 3, 4
[21] Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. Can gcns go as deep as cnns? In ICCV, 2019. 4
[22] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017. 2, 6, 7, 8
[23] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In CVPR, 2019. 1, 2, 3, 4, 5, 6
[24] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In ICCV, 2017. 7
[25] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018. 1, 2, 6, 7
[26] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. Affinity derivation and graph merge for instance segmentation. In ECCV, 2018. 2, 7
[27] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018. 2
[28] Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Renjie Liao, and Raquel Urtasun. Learning deep structured active contours end-to-end. In CVPR, 2018. 2
[29] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, 2019. 2, 6, 7
[30] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. In ICCV, 2017. 2
[31] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In CVPR, 2019. 3
[32] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017. 4
[33] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with kins dataset. In CVPR, 2019. 2, 5, 7
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015. 1
[35] Christian Rupprecht, Elizabeth Huaroc, Maximilian Baust, and Nassir Navab. Deep active contours. arXiv preprint arXiv:1607.05074, 2016. 3
[36] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018. 4
[37] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. TOG, 2018. 4
[38] Zian Wang, David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Object instance annotation with deep extreme level set evolution. In CVPR, 2019. 3
[39] Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. Explicit shape encoding for real-time instance segmentation. In ICCV, 2019. 1, 2, 7, 8
[40] Ze Yang, Yinghao Xu, Han Xue, Zheng Zhang, Raquel Urtasun, Liwei Wang, Stephen Lin, and Han Hu. Dense reppoints: Representing visual objects with dense point sets. arXiv preprint arXiv:1912.11473, 2019. 2
[41] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019. 5, 6
[42] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019. 2, 4
