Deep Snake for Real-Time Instance Segmentation

Abstract

This paper introduces a novel contour-based approach named deep snake for real-time instance segmentation. Unlike some recent methods that directly regress the coordinates of the object boundary points from an image, deep snake uses a neural network to iteratively deform an initial contour to the object boundary, which implements the classic idea of snake algorithms with a learning-based approach. For structured feature learning on the contour, we propose to use circular convolution in deep snake, which better exploits the cycle-graph structure of a contour compared against generic graph convolution. Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in initial object localization. Experiments show that the proposed approach achieves state-of-the-art performances on the Cityscapes, Kins and Sbd datasets while being efficient for real-time instance segmentation, 32.3 fps for 512×512 images on a 1080Ti GPU. The code is available at https://github.com/zju3dv/snake/.

1. Introduction

Instance segmentation is the cornerstone of many computer vision tasks, such as video analysis, autonomous driving, and robotic grasping, which require both accuracy and efficiency. Most of the state-of-the-art instance segmentation methods [17, 25, 4, 18] perform pixel-wise segmentation within a bounding box given by an object detector [34], which may be sensitive to the inaccurate bounding box. Moreover, representing an object shape as dense binary pixels generally results in costly post-processing.
An alternative shape representation is the object contour, which is composed of a sequence of vertices along the object silhouette. In contrast to the pixel-based representation, a contour is not limited within a bounding box and has fewer parameters. Such contour-based representation has long been used in image segmentation since the seminal work by Kass et al. [20], which is well known as the snake algorithm or active contour model. Given an initial contour, the snake algorithm iteratively deforms it to the object boundary by optimizing an energy functional defined with low-level image features, such as image intensity or gradient. While many variants [5, 6, 14] have been developed in the literature, these methods tend to find local optimal solutions, as the objective functions are handcrafted and the optimization is usually nonlinear.

Some recent learning-based segmentation methods [19, 39] also represent objects as contours and try to directly regress the coordinates of object boundary points from an RGB image. Although such methods are much faster, they do not perform as well as pixel-based methods. Instead, Ling et al. [23] adopt the deformation pipeline of traditional snake algorithms and train a neural network to evolve an initial contour to the object boundary. Given a contour with image features, it regards the input contour as a graph and uses a graph convolutional network (GCN) to predict vertex-wise offsets between contour points and the target boundary points. It achieves competitive accuracy compared with pixel-based methods while being much faster. However, the method proposed in [23] is designed to help annotation and lacks a complete pipeline for automatic instance segmentation. Moreover, treating the contour as a general graph with a generic GCN does not fully exploit the special topology of a contour.

In this paper, we propose a learning-based snake algorithm, named deep snake, for real-time instance segmentation. Inspired by previous methods [20, 23], deep snake takes an initial contour as input and deforms it by regressing vertex-wise offsets. Our innovation is in introducing the circular convolution for efficient feature learning on a contour, as illustrated in Figure 1. We observe that the contour is a cycle graph that consists of a sequence of vertices connected in a closed cycle. Since every vertex has the same degree equal to two, we can apply the standard 1D convolution on the vertex features. Considering that the contour is periodic, deep snake introduces the circular convolution, which indicates that an aperiodic function (1D kernel) is convolved in the standard way with a periodic function (features defined on the contour). The kernel of circular convolution encodes not only the feature of each vertex but also the relationship among neighboring vertices. In contrast, the generic GCN performs pooling to aggregate information from neighboring vertices. The kernel function in our circular convolution amounts to a learnable aggregation function, which is more expressive and results in better performance than using a generic GCN, as demonstrated by our experimental results in Section 5.2.

Figure 1. The basic idea of deep snake. Given an initial contour, image features are extracted at each vertex (a). Since the contour is a cycle graph, circular convolution is applied for feature learning on the contour (b). The blue, yellow and green nodes denote the input features, the kernel of circular convolution, and the output features, respectively. Finally, offsets are regressed at each vertex to deform the contour to the object boundary (c).

Based on deep snake, we develop a pipeline for instance segmentation. Given an initial contour, deep snake can iteratively deform it to the object boundary and obtain the object shape. The remaining question is how to initialize a contour, whose importance has been demonstrated in classic snake algorithms. Inspired by [30, 27, 42], we propose to generate an octagon formed by object extreme points as the initial contour, which generally encloses the object tightly. Specifically, we add deep snake to a detection model. The detected box first gives a diamond contour by connecting four points centered at its borders. Then deep snake takes the diamond as input and outputs offsets that point from the four vertices to the four extreme points, which are used to construct an octagon following [42]. Finally, deep snake deforms the octagon contour to the object boundary.
Our approach exhibits state-of-the-art performances on the Cityscapes [7], Kins [33] and Sbd [15] datasets, while being efficient for real-time instance segmentation, 32.3 fps for 512 × 512 images on a GTX 1080 Ti GPU. There are two reasons why the learning-based snake is fast while being accurate. First, our approach can deal with errors in the object localization and thus allows a light detector. Second, the object contour has fewer parameters than the pixel-based representation and does not require costly post-processing, such as mask upsampling.
In summary, this work has the following contributions:

• We propose a learning-based snake algorithm for real-time instance segmentation, which deforms an initial contour to the object boundary and introduces the circular convolution for feature learning on the contour.
• We propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. Both stages can deal with errors in the initial object localization.
• We demonstrate state-of-the-art performances of our approach on Cityscapes, Kins and Sbd datasets. For 512 × 512 images, our algorithm runs at 32.3 fps, which is efficient for real-time instance segmentation.

2. Related work

Pixel-based methods. Most methods [8, 22, 17, 25] perform instance segmentation on the pixel level within a region proposal, which works particularly well with standard CNNs. A representative instantiation is Mask R-CNN [17]. It first detects objects and then uses a mask predictor to segment instances within the proposed boxes. To better exploit the spatial information inside the box, PANet [25] fuses mask predictions from fully-connected layers and convolutional layers. Such proposal-based approaches achieve state-of-the-art performance. One limitation of these methods is that they cannot resolve errors in localization, such as too small or shifted boxes. In contrast, our approach deforms the detected boxes to the object boundaries, so the spatial extension of object shapes will not be limited.
There exist some pixel-based methods [2, 29, 26, 11, 40] that are free of region proposals. In these methods, every pixel produces auxiliary information, and a clustering algorithm then groups pixels into object instances based on this information. Both the auxiliary information and the grouping algorithms vary: [2] predicts the boundary-aware energy for each pixel and uses the watershed transform algorithm for grouping; [29] differentiates instances by learning instance-level embeddings; [26, 11] consider the input image as a graph and regress pixel affinities, which are then processed by a graph merge algorithm. Since the mask is composed of dense pixels, the post-clustering algorithms tend to be time-consuming.

Contour-based methods. In these methods, the object shape comprises a sequence of vertices along the object boundary. Traditional snake algorithms [20, 5, 6, 14] first introduced the contour-based representation for image segmentation. They deform an initial contour to the object boundary by optimizing a handcrafted energy with respect to the contour coordinates. To improve the robustness of these methods, [28] proposed to learn the energy function in a data-driven manner. Instead of iteratively optimizing the contour, some recent learning-based methods [19, 39] try to regress the coordinates of contour points from an RGB image, which is much faster. However, they are not competitive in accuracy with state-of-the-art pixel-based methods.

In the field of semi-automatic annotation, [3, 1, 23] have tried to perform the contour labeling using other networks instead of standard CNNs. [3, 1] predict the contour points sequentially using a recurrent neural network. To avoid sequential inference, [23] follows the pipeline of snake algorithms and uses a graph convolutional network to predict vertex-wise offsets for contour deformation. This strategy significantly improves the annotation speed while being as accurate as pixel-based methods. However, [23] lacks a pipeline for instance segmentation and does not fully exploit the special topology of a contour. Instead of treating the contour as a general graph, deep snake leverages the cycle graph topology and introduces the circular convolution for efficient feature learning on a contour.

3. Proposed approach

Inspired by [20, 23], we perform object segmentation by deforming an initial contour to the object boundary. Specifically, deep snake takes a contour as input based on image features from a CNN backbone and predicts per-vertex offsets pointing to the object boundary. To fully exploit the contour topology, we introduce the circular convolution for efficient feature learning on the contour, which facilitates deep snake to learn the deformation. Based on deep snake, a pipeline is developed for instance segmentation.

3.1. Learning-based snake algorithm

Given an initial contour, traditional snake algorithms treat the coordinates of the vertices as a set of variables and optimize an energy functional with respect to these variables. By designing proper image forces at the contour coordinates, active contour models can optimize the contour to the object boundary. However, since the energy functional is typically nonconvex and handcrafted based on low-level image features, the deformation process tends to find local optimal solutions.
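
For reference, the classic active contour model of Kass et al. [20] minimizes an energy of the following form over a parametric contour $x(s)$ (this is the standard textbook formulation, stated here for context rather than taken from this paper):

$$E(x) = \int_0^1 \frac{1}{2}\Big(\alpha\,|x'(s)|^2 + \beta\,|x''(s)|^2\Big) + E_{\text{image}}\big(x(s)\big)\,ds,$$

where the first two terms penalize stretching and bending of the contour and $E_{\text{image}}$ is a handcrafted image force, e.g., the negative gradient magnitude $-|\nabla I(x(s))|^2$.
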
In contrast, deep snake directly learns to evolve the contour from data in an end-to-end manner. Given a contour with $N$ vertices $\{x_i \mid i = 1, \dots, N\}$, we first construct feature vectors for each vertex. The input feature $f_i$ for a vertex $x_i$ is a concatenation of learning-based features and the vertex coordinate: $[F(x_i); x'_i]$, where $F$ is the feature map and $x'_i$ is a translation-invariant version of vertex $x_i$. The feature map $F$ is obtained by applying a CNN backbone on the input image, which deep snake shares with the detector in our instance segmentation model. The image feature $F(x_i)$ is computed using bilinear interpolation of the features at the vertex coordinate $x_i$. The appended vertex coordinate is used to model the spatial relationship among contour vertices. Since the deformation should not be affected by the absolute location of the contour, we compute the translation-invariant coordinate $x'_i$ by subtracting the minimum value over all vertices along the x and y axes, respectively.
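
As a concrete illustration, a minimal sketch of this feature construction in PyTorch (illustrative names, not the authors' released code) samples the feature map with bilinear interpolation via `grid_sample` and appends the translation-invariant coordinates:

```python
import torch
import torch.nn.functional as F

def vertex_features(feature_map, contour):
    """feature_map: (1, D, H, W) CNN features; contour: (N, 2) vertex (x, y) pixel coords."""
    _, _, H, W = feature_map.shape
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = contour.clone()
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1
    grid = grid.view(1, 1, -1, 2)                                    # (1, 1, N, 2)
    sampled = F.grid_sample(feature_map, grid, align_corners=True)   # (1, D, 1, N), bilinear
    sampled = sampled[0, :, 0, :]                                    # (D, N)
    # translation-invariant coordinates x'_i: subtract the per-axis minimum over all vertices
    rel = (contour - contour.min(dim=0).values).t()                  # (2, N)
    return torch.cat([sampled, rel], dim=0)                          # (D + 2, N): f_i = [F(x_i); x'_i]
```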

Given the input features defined on a contour, deep snake introduces the circular convolution for feature learning, as illustrated in Figure 2. In general, the features of contour vertices can be treated as a 1-D discrete signal $f: \mathbb{Z} \to \mathbb{R}^D$ and processed by the standard convolution, but this breaks the topology of the contour. Therefore, we treat the features on the contour as a periodic signal defined as:

$$(f_N)_i = \sum_{j=-\infty}^{\infty} f_{i-jN},$$

and propose to encode the periodic features by the circular convolution defined as:

$$(f_N \ast k)_i = \sum_{j=-r}^{r} (f_N)_{i+j} \, k_j,$$

where $k \colon [-r, r] \to \mathbb{R}^D$ is a learnable kernel function and the operator $\ast$ denotes the standard convolution.
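
A minimal way to realize this operation in code (a sketch assuming PyTorch; the released implementation may differ) is to wrap the vertex features around before a standard 1D convolution, so that the kernel sees the features as the periodic signal $f_N$:

```python
import torch
import torch.nn as nn

class CircConv(nn.Module):
    """Circular convolution over features defined on a closed contour."""
    def __init__(self, in_dim, out_dim, kernel_size=9):  # kernel size 9, as in Section 3.1
        super().__init__()
        assert kernel_size % 2 == 1
        self.pad = kernel_size // 2
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size)

    def forward(self, x):
        # x: (B, D, N) features on N contour vertices
        # wrap-around padding realizes the periodic signal f_N defined above
        x = torch.cat([x[..., -self.pad:], x, x[..., :self.pad]], dim=2)
        return self.conv(x)  # (B, out_dim, N)
```

Equivalently, PyTorch's `nn.Conv1d` with `padding_mode='circular'` and `padding=kernel_size // 2` produces the same wrap-around behavior.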

Similar to the standard convolution, we can construct a network layer based on the circular convolution for feature learning, which is easy to integrate into a modern network architecture. After the feature learning, deep snake applies three 1×1 convolution layers to the output features of each vertex and predicts vertex-wise offsets between contour points and the target points, which are used to deform the contour. In all experiments, the kernel size of circular convolution is fixed to be nine.
As discussed in the introduction, the proposed circular convolution better exploits the circular structure of the contour than the generic graph convolution. We will show the experimental comparison in Section 5.2. An alternative method is to use standard CNNs to regress a pixel-wise vector field from the input image to guide the evolution of the initial contour [35, 31, 38]. We argue that an important advantage of deep snake over standard CNNs is the object-level structured prediction, i.e., the offset prediction at a vertex depends on other vertices of the same contour. Therefore, it is more reasonable for deep snake to predict an offset for a vertex located in the background and far from the object, which is very common in an initial contour. Standard CNNs have difficulty in outputting meaningful offsets in this case, since it is ambiguous to decide which object a background pixel belongs to.

Figure 3. Proposed contour-based model for instance segmentation. (a) Deep snake consists of three parts: a backbone, a fusion block, and a prediction head. It takes a contour as input and outputs vertex-wise offsets to deform the contour. (b) Based on deep snake, we propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. The box proposed by the detector gives a diamond contour, whose four vertices are then deformed to object extreme points by deep snake. An octagon is constructed based on the extreme points. Taking the octagon as the initial contour, deep snake iteratively deforms it to the object boundary.

Network architecture. Figure 3(a) shows the detailed schematic. Following ideas from [32, 37, 21], deep snake consists of three parts: a backbone, a fusion block, and a prediction head. The backbone is comprised of 8 “CirConv-Bn-ReLU” layers and uses residual skip connections for all layers, where “CirConv” means circular convolution. The fusion block aims to fuse the information across all contour points at multiple scales. It concatenates features from all layers in the backbone and forwards them through a 1×1 convolution layer followed by max pooling. The fused feature is then concatenated with the feature of each vertex. The prediction head applies three 1×1 convolution layers to the vertex features and outputs vertex-wise offsets.
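
Our reading of this description as a sketch (assuming PyTorch and the `CircConv` layer above; the dimensions and layer names are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

class CircBlock(nn.Module):
    """One CirConv-Bn-ReLU layer with a residual skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.conv = CircConv(dim, dim)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):
        return x + torch.relu(self.bn(self.conv(x)))

class DeepSnake(nn.Module):
    def __init__(self, in_dim, dim=128):
        super().__init__()
        first = nn.Sequential(CircConv(in_dim, dim), nn.BatchNorm1d(dim), nn.ReLU())
        self.backbone = nn.ModuleList([first] + [CircBlock(dim) for _ in range(7)])
        self.fusion = nn.Conv1d(8 * dim, dim, 1)     # 1x1 conv before max pooling
        self.head = nn.Sequential(                   # three 1x1 convs -> 2D offsets
            nn.Conv1d(2 * dim, dim, 1), nn.ReLU(),
            nn.Conv1d(dim, dim, 1), nn.ReLU(),
            nn.Conv1d(dim, 2, 1))

    def forward(self, x):                            # x: (B, in_dim, N) vertex features
        states = []
        for layer in self.backbone:
            x = layer(x)
            states.append(x)
        multi = torch.cat(states, dim=1)             # concat features from all 8 layers
        global_feat = self.fusion(multi).max(dim=2, keepdim=True).values
        x = torch.cat([x, global_feat.expand(-1, -1, x.size(2))], dim=1)
        return self.head(x)                          # (B, 2, N) vertex-wise offsets
```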

3.2. Deep snake for instance segmentation

Figure 3(b) overviews the proposed pipeline for instance segmentation. We add deep snake to an object detection model. The detector first produces object boxes that are used to construct diamond contours. Then deep snake deforms the diamond vertices to object extreme points, which are used to construct octagon contours. Finally, our approach takes the octagons as initial contours and performs iterative contour deformation to obtain the object shape.
Initial contour proposal. Most active contour models require precise initial contours. Since the octagon proposed in [42] generally tightly encloses the object, we choose it as the initial contour, as shown in Figure 3(b). This octagon is formed by the object extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points of the object, denoted by $\{x_i^{ex} \mid i = 1, 2, 3, 4\}$. Given a detected object box, we extract the four points centered at its top, left, bottom, and right borders, denoted by $\{x_i^{bb} \mid i = 1, 2, 3, 4\}$, and connect them to form a diamond contour. Deep snake takes this contour as input and outputs four offsets that point from each vertex $x_i^{bb}$ to the corresponding extreme point $x_i^{ex}$, namely $x_i^{ex} - x_i^{bb}$. In practice, to take in more context information, the diamond contour is uniformly upsampled to 40 points, and deep snake correspondingly outputs 40 offsets. The loss function only considers the offsets at $x_i^{bb}$.
We construct the octagon by generating four lines based on the extreme points and connecting their endpoints. Specifically, the four extreme points form a new object box. For each extreme point, a line extends from it along the corresponding box border in both directions to 1/4 of the border length, and the line is truncated if it meets a box corner. Then the endpoints of the four lines are connected to form the octagon.
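
A sketch of this construction in NumPy (our reading of the rule above; the vertex ordering in the released code may differ):

```python
import numpy as np

def octagon_from_extreme_points(ext):
    """ext: (4, 2) array of (x, y) extreme points in top, left, bottom, right order."""
    (tx, ty), (lx, ly), (bx, by), (rx, ry) = ext
    x1, y1, x2, y2 = lx, ty, rx, by                  # new box formed by the extreme points
    w, h = x2 - x1, y2 - y1
    # from each extreme point, extend a segment along its box border in both
    # directions by 1/4 of the border length, truncating at the box corners
    pts = [
        (max(tx - w / 4, x1), y1), (min(tx + w / 4, x2), y1),  # top border
        (x2, max(ry - h / 4, y1)), (x2, min(ry + h / 4, y2)),  # right border
        (min(bx + w / 4, x2), y2), (max(bx - w / 4, x1), y2),  # bottom border
        (x1, min(ly + h / 4, y2)), (x1, max(ly - h / 4, y1)),  # left border
    ]
    return np.array(pts)  # connecting consecutive endpoints yields the octagon
```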

Contour deformation. We first uniformly sample the octagon contour to $N$ points, and deep snake regresses $N$ vertex-wise offsets toward the object boundary. However, regressing the offsets in one pass is challenging, especially for vertices far away from the object. Inspired by [20, 23, 36], we deal with this problem in an iterative optimization fashion. Specifically, our approach first predicts $N$ offsets based on the current contour and then deforms this contour by adding the offsets to its vertex coordinates. The deformed contour can be used for the next deformation or directly outputted as the object shape. In experiments, the number of inference iterations is set to 3 unless otherwise stated.
Note that the contour is an alternative representation for the spatial extension of an object. By deforming the initial contour to the object boundary, our approach could resolve the localization errors from the detector.
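
The inference loop is then straightforward (a sketch; `snake` stands for the deformation network of Section 3.1 and `vertex_features` for the feature construction sketched earlier):

```python
def deform(snake, feature_map, contour, iterations=3):
    """Iteratively evolve an (N, 2) contour toward the object boundary."""
    for _ in range(iterations):
        feats = vertex_features(feature_map, contour)   # (D + 2, N) per-vertex inputs
        offsets = snake(feats.unsqueeze(0))[0]          # (2, N) vertex-wise offsets
        contour = contour + offsets.t()                 # move every vertex
    return contour
```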

Figure 4. Given an object box, we perform RoIAlign to obtain the feature map and use a detector to detect the component boxes.

Handling multi-component objects. Due to occlusions, many instances comprise more than one connected component. However, a contour can only outline one connected component per bounding box. To overcome this problem, we propose to detect the object components within the object box. Specifically, using the detected box, our approach performs RoIAlign [17] to extract a feature map and adds a detector branch on the feature map to produce the component boxes. Figure 4 shows the basic idea. The following segmentation pipeline remains the same. Our approach obtains the final object shape by merging component contours from the same object box.
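
In code, this amounts to one extra RoI-level detection pass (a sketch using `torchvision`'s `roi_align`; `component_head`, the RoI size, and the stride are illustrative assumptions):

```python
from torchvision.ops import roi_align

def detect_components(feat, boxes, component_head, stride=4):
    """feat: (B, C, H, W) backbone features at the given stride; boxes: (K, 5)
    rows of (batch_index, x1, y1, x2, y2) in input-image coordinates."""
    roi_feat = roi_align(feat, boxes, output_size=(28, 28), spatial_scale=1.0 / stride)
    # the class-agnostic detector branch (Section 4) proposes component boxes
    # inside each RoI; each component contour is deformed separately and the
    # contours from the same object box are merged into the final shape
    return component_head(roi_feat)
```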

4. Implementation details

Detector. We adopt CenterNet [41] as the detector for all experiments. CenterNet reformulates the detection task as a keypoint detection problem and achieves an impressive trade-off between speed and accuracy. For the object box detector, we adopt the same setting as [41], which outputs class-specific boxes. For the component box detector, a class-agnostic CenterNet is adopted. Specifically, given an H × W × C feature map, the class-agnostic CenterNet outputs an H × W × 1 tensor representing the component center and an H × W × 2 tensor representing the box size.
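
A sketch of such a class-agnostic head (illustrative; CenterNet's actual head design may differ in depth and width):

```python
import torch.nn as nn

class ComponentHead(nn.Module):
    """Class-agnostic CenterNet-style head for component boxes."""
    def __init__(self, in_dim, mid_dim=64):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(nn.Conv2d(in_dim, mid_dim, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(mid_dim, out_dim, 1))
        self.center = branch(1)   # H x W x 1 component-center heatmap
        self.size = branch(2)     # H x W x 2 box width and height

    def forward(self, x):         # x: (B, in_dim, H, W) RoI features
        return self.center(x).sigmoid(), self.size(x)
```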

5. Experiments

We compare our approach with the state-of-the-art methods on the Cityscapes [7], Kins [33] and Sbd [15] datasets. Comprehensive ablation studies are conducted to analyze the importance of the proposed components in our approach.

5.1. Datasets and Metrics

Cityscapes [7] is a widely used benchmark for urban scene instance segmentation. It contains 2,975 training, 500 validation and 1,525 testing images with high quality annotations. Besides, it has 20k images with coarse annotations. This dataset is challenging due to the crowded scenes and the wide range of object scales. The performance is evaluated in terms of the average precision (AP) metric averaged over the 8 semantic classes of the dataset. We report our results on the validation and test sets.
Kins [33] was recently created by additionally annotating the Kitti [12] dataset with instance-level semantic annotation. This dataset is used for amodal instance segmentation, which is a variant of instance segmentation that aims to recover complete instance shapes even under occlusion. Kins consists of 7,474 training images and 7,517 testing images. Following its setting, we evaluate our approach on 7 object categories in terms of the AP metric.
Sbd [15] re-annotates 11,355 images from the Pascal Voc [9] dataset with instance-level boundaries and has the same 20 object categories. The reason that we do not directly perform experiments on Pascal Voc is that its annotations contain holes, which is not suitable for contour-based methods. The Sbd dataset is split into 5,623 training images and 5,732 testing images. We report our results in terms of the 2010 Voc APvol [16], AP50 and AP70 metrics. APvol is the average of AP with 9 thresholds from 0.1 to 0.9.
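
Concretely:

$$\text{AP}_{\text{vol}} = \frac{1}{9} \sum_{t \in \{0.1,\,0.2,\,\dots,\,0.9\}} \text{AP}_t .$$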

5.2. Ablation studies

Table 2. Comparison between graph and circular convolution on the Sbd val set. The results are in terms of the APvol metric. Graph and circular convolution refer to the convolution operator in the network. The columns show the results of different inference iterations. Circular convolution outperforms graph convolution across all inference iterations. Furthermore, circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 AP, indicating a stronger deforming ability.

We conduct ablation studies on the Sbd dataset with the consideration that it has 20 semantic categories and can fully evaluate the ability to deform various object contours. The three proposed components are evaluated, including our network architecture, the initial contour proposal, and circular convolution. In these experiments, the detector and deep snake are trained end-to-end for 160 epochs with multi-scale data augmentation. The learning rate starts from 1e-4 and is halved at 80 and 120 epochs. Table 1 summarizes the results of the ablation studies.
The row “Baseline” lists the result of a direct combination of Curve-gcn [23] with CenterNet [41]. Specifically, the detector produces object boxes, which give ellipses around objects. The ellipses are then deformed into object boundaries through Graph-ResNet. Note that the baseline represents the contour as a graph and uses a graph convolutional network for contour deformation.
To validate the advantages of our network, the model in the second row keeps the convolution operator as graph convolution and replaces Graph-ResNet with our proposed architecture, which yields a 1.4 APvol improvement. The main difference between the two networks is that our architecture appends a global fusion block before the prediction head.
When exploring the influence of the contour initialization, we add the initial contour proposal before the contour deformation. Instead of directly using the ellipse, the proposal step generates an octagon initialization by predicting four object extreme points, which not only resolves the detection errors but also encloses the object more tightly. The comparison between the second and the third row shows a 1.3 improvement in terms of APvol.

Finally, the graph convolution is replaced with the circular convolution, which achieves a 0.8 APvol improvement. To fully validate the importance of circular convolution, we further compare models with different convolution operators and different inference iterations, as shown in Table 2. Circular convolution outperforms graph convolution across all inference iterations, and circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 APvol. Figure 5 shows qualitative results of graph and circular convolution on Sbd, where circular convolution gives a sharper boundary. The quantitative and qualitative results indicate that models with the circular convolution have a stronger ability to deform contours.

Figure 5. Comparison between graph convolution (top) and circular convolution (bottom) on Sbd. The result of circular convolution with two iterations is visually better than that of graph convolution with three iterations.

5.3. Comparison with the state-of-the-art methods

Performance on Cityscapes. Since fragmented instances are very common in Cityscapes, we adopt the proposed strategy to handle multi-component objects. Our network is trained with multi-scale data augmentation and tested at a single resolution of 1216 × 2432. No testing tricks are used. The detector is first trained alone for 140 epochs, and the learning rate starts from 1e-4 and drops by half at 80 and 120 epochs. Then the detection and snake branches are trained end-to-end for 200 epochs, and the learning rate starts from 1e-4 and drops by half at 80, 120 and 150 epochs. We choose the model that performs best on the validation set.
Table 3 compares our results with other state-of-the-art methods on the Cityscapes validation and test sets. All methods are tested without tricks. Using only the fine annotations, our approach achieves state-of-the-art performances on both the validation and test sets. We outperform PANet by 0.9 AP on the validation set and 1.3 AP50 on the test set. According to the approximate timing result in [29], PANet runs at less than 1.0 fps. In contrast, our model runs at 4.6 fps on a 1080 Ti GPU for 1216 × 2432 images, which is about 5 times faster. Our approach achieves 28.2 AP on the test set when the strategy of handling multi-component objects is not adopted. Visual results are shown in Figure 6.
Performance on Kins. As a dataset for amodal instance segmentation, objects in the Kins dataset are all connected as a single component, so the strategy of handling multi-component objects is not adopted. We train the detector and snake end-to-end for 150 epochs. The learning rate starts from 1e-4 and decays by 0.5 and 0.1 at 80 and 120 epochs, respectively. We perform multi-scale training and test the model at a single resolution of 768 × 2496.

Figure 6. Qualitative results on Cityscapes test and Kins test sets. The first two rows show the results on Cityscapes, and the last row lists the results on Kins. Note that the results on Kins are for amodal instance segmentation.

Table 3. Results on Cityscapes val (“AP [val]” column) and test (remaining columns) sets. Our approach achieves the state-of-the-art performance, which outperforms PANet [25] by 0.9 AP on the val set and 1.3 AP50 on the test set. According to the timing result in [29], our approach is approximately 5 times faster than PANet.

Table 4. Results on Kins test set in terms of the AP metric. The amodal bounding box is used as the ground truth in the detection task. × means no such output in the corresponding method.

Table 5. Results on the Sbd val set. Our approach outperforms other contour-based methods by a large margin. The improvement increases with the IoU threshold: 21.4 AP50 and 36.2 AP70.

Table 4 shows the comparison with [8, 22, 10, 17, 25] on the Kins dataset in terms of the AP metric. Kins [33] indicates that tackling both amodal and inmodal segmentation simultaneously can improve the performance, as shown in the fourth and the fifth rows of Table 4. Our approach learns only the amodal segmentation task and achieves the best performance across all methods. We find that the snake branch can improve the detection performance. When CenterNet is trained alone, it obtains 30.5 AP on detection. When trained with the snake branch, its detection performance improves by 2.3 AP. For 768 × 2496 images on the Kins dataset, our approach runs at 7.6 fps on a 1080 Ti GPU. Figure 6 shows some qualitative results on Kins.

Figure 7. Qualitative results on Sbd val set. Our approach handles errors in object localization in most cases. For example, in the first image, although the detected boxes do not fully cover the boys, our approach recovers the complete object shapes. Zoom in for details.

Table 6. Running time on the Pascal Voc dataset. “MS” represents Mask R-CNN [17], and “OURS” represents our approach. The last three methods are contour-based methods.

Performance on Sbd. Most objects on the Sbd dataset are connected as a single component, so we do not handle fragmented instances. For multi-component objects, our approach detects their components separately instead of detecting the whole object. We train the detection and snake branches end-to-end for 150 epochs with multi-scale data augmentation. The learning rate starts from 1e-4 and drops by half at 80 and 120 epochs. The network is tested at a single scale of 512 × 512.

In Table 5, we compare with other contour-based methods [19, 39] on the Sbd dataset in terms of the Voc AP metrics. [19, 39] predict object contours by regressing shape vectors. STS [19] defines the object contour as a radial vector from the object center, and ESE [39] approximates the object contour with 20 and 50 coefficients of the Chebyshev polynomial. In contrast, our approach deforms an initial contour to the object boundary. We outperform these methods by a large margin of at least 19.1 APvol. Note that our approach yields 21.4 AP50 and 36.2 AP70 improvements, demonstrating that the improvement increases with the IoU threshold. This indicates that our algorithm better outlines object boundaries. For 512×512 images on the Sbd dataset, our approach runs at 32.3 fps on a 1080 Ti. Some qualitative results are illustrated in Figure 7.

5.4. Running time

Table 6 compares our approach with other methods [8, 22, 17, 19, 39] in terms of running time on the Pascal Voc dataset. Since the Sbd dataset shares images with Pascal Voc and has the same semantic categories, the running time on the Sbd dataset is technically the same as the one on Pascal Voc. We obtain the running time of other methods on Pascal Voc from [39]. For 512 × 512 images on the Sbd dataset, our algorithm runs at 32.3 fps on a desktop with an Intel i7 3.7GHz CPU and a GTX 1080 Ti GPU, which is efficient for real-time instance segmentation. Specifically, CenterNet takes 18.4 ms, the initial contour proposal takes 3.1 ms, and each iteration of contour deformation takes 3.3 ms. Since our approach outputs the object boundary, no post-processing like mask upsampling is required. If the strategy of handling fragmented instances is adopted, the detector additionally takes 3.6 ms.
6. Conclusion

We introduced a new contour-based model for real-time instance segmentation. Inspired by traditional snake algorithms, our approach deforms an initial contour to the object boundary and obtains the object shape. To this end, we proposed a learning-based snake algorithm, named deep snake, which introduces the circular convolution for efficient feature learning on the contour and regresses vertex-wise offsets for the contour deformation. Based on deep snake, we developed a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. We showed that this pipeline gains a superior performance compared with direct regression of the coordinates of object boundary points. We also showed that the circular convolution learns the structural information of the contour more effectively than the graph convolution. To overcome the limitation of the contour that it can only outline one connected component, we proposed to detect the object components within the object box and demonstrated the effectiveness of this strategy on Cityscapes. The proposed model achieved state-of-the-art results on the Cityscapes, Kins and Sbd datasets with real-time performance.
References

[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018.
[2] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[3] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In CVPR, 2017.
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[5] Laurent D Cohen. On active contour models and balloons. CVGIP: Image Understanding, 53(2):211–218, 1991.
[6] Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. Active shape models - their training and application. CVIU, 61(1):38–59, 1995.
[7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[8] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[9] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
[10] Patrick Follmann, Rebecca König, Philipp Härtinger, Michael Klostermann, and Tobias Böttger. Learning to see the invisible: End-to-end trainable amodal instance segmentation. In WACV, 2019.
[11] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. Ssap: Single-shot instance segmentation with affinity pyramid. In ICCV, 2019.
[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013.
[13] Ross Girshick. Fast r-cnn. In ICCV, 2015.
[14] Steve R Gunn and Mark S Nixon. A robust snake implementation; a dual active contour. PAMI, 19(1):63–68, 1997.
[15] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[16] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
[18] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In CVPR, 2019.
[19] Saumya Jetley, Michael Sapienza, Stuart Golodetz, and Philip HS Torr. Straight to shapes: Real-time detection of encoded shapes. In CVPR, 2017.
[20] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988.
[21] Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. Can gcns go as deep as cnns? In ICCV, 2019.
[22] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
[23] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In CVPR, 2019.
[24] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In ICCV, 2017.
[25] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
[26] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. Affinity derivation and graph merge for instance segmentation. In ECCV, 2018.
[27] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018.
[28] Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Renjie Liao, and Raquel Urtasun. Learning deep structured active contours end-to-end. In CVPR, 2018.
[29] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, 2019.
[30] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. In ICCV, 2017.
[31] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In CVPR, 2019.
[32] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
[33] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with kins dataset. In CVPR, 2019.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[35] Christian Rupprecht, Elizabeth Huaroc, Maximilian Baust, and Nassir Navab. Deep active contours. arXiv preprint arXiv:1607.05074, 2016.
[36] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
[37] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. TOG, 2018.
[38] Zian Wang, David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Object instance annotation with deep extreme level set evolution. In CVPR, 2019.
[39] Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. Explicit shape encoding for real-time instance segmentation. In ICCV, 2019.
[40] Ze Yang, Yinghao Xu, Han Xue, Zheng Zhang, Raquel Urtasun, Liwei Wang, Stephen Lin, and Han Hu. Dense reppoints: Representing visual objects with dense point sets. arXiv preprint arXiv:1912.11473, 2019.
[41] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[42] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
