The Road to Advanced Semantic Segmentation: Looking Back at CVPR 2015 (Part 5)

Feedforward Semantic Segmentation With Zoom-Out Features [full paper] [ext. abstract]
Mohammadreza Mostajabi, Payman Yadollahpour, Gregory Shakhnarovich

 

Around 2015, semantic segmentation work fell roughly into two schools: the random-field school, in its many variants, and the superpixel school. As of August 2019, the best-performing image segmentation algorithms have gone the random-field route, but the superpixel school remains an active research direction.

That this paper was accepted to CVPR 2015 shows how representative it is; its core idea is superpixel-based semantic segmentation.

Back in 2015, superpixel-based semantic segmentation actually held the dominant position; many first-rate dataset production and collection companies annotated at the superpixel level, which says a lot about the standing of superpixel methods in image segmentation.

 

Abstract

We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead superpixels are classified by a feedforward multilayer network. Our architecture achieves 69.6% average accuracy on the PASCAL VOC 2012 test set.

 


Honestly, the abstract is not easy to follow. The paper does segmentation with a "multi-scale" approach, but this multi-scale is different from the SPPNet-style multi-scale we usually mean today. As the figure below shows, it starts by convolving over the smallest unit, the superpixel, then over a region extending a little beyond the superpixel, then over a still larger one, and finally over the whole image. That is this paper's trick: zooming out. "Zoom" here has nothing to do with the fast-fold Zoom tables in Texas hold'em poker; it simply means multi-scale.

[Figure from the paper: nested zoom-out regions, from a single superpixel out through progressively larger surroundings to the full image]
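To make the idea concrete, here is a minimal sketch of extracting nested zoom-out crops around one superpixel. The scale factors (1x, 3x, 9x the superpixel's bounding box, plus the full image) and the function name `zoom_out_regions` are illustrative assumptions, not the paper's exact settings.

```python
# A minimal sketch of the "zoom-out" idea, assuming a superpixel is given
# as a boolean mask over the image. The level sizes are illustrative
# choices, not the paper's exact configuration.
import numpy as np

def zoom_out_regions(image, sp_mask, scales=(1, 3, 9)):
    """Return nested crops around one superpixel, smallest to largest."""
    ys, xs = np.nonzero(sp_mask)
    cy, cx = ys.mean(), xs.mean()            # superpixel centroid
    h = max(ys.max() - ys.min() + 1, 1)      # bounding-box height
    w = max(xs.max() - xs.min() + 1, 1)      # bounding-box width
    H, W = image.shape[:2]
    crops = []
    for s in scales:                         # local / proximal / distant
        top = int(max(cy - s * h / 2, 0))
        bot = int(min(cy + s * h / 2, H))
        left = int(max(cx - s * w / 2, 0))
        right = int(min(cx + s * w / 2, W))
        crops.append(image[top:bot, left:right])
    crops.append(image)                      # scene level: the whole image
    return crops
```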

Introduction

We consider one of the central vision tasks, semantic segmentation: assigning to each pixel in an image a category-level label. Despite attention it has received, it remains challenging, largely due to complex interactions between neighboring as well as distant image elements, the importance of global context, and the interplay between semantic labeling and instance-level detection. A widely accepted conventional wisdom, followed in much of modern segmentation literature, is that segmentation should be treated as a structured prediction task, which most often means using a random field or structured support vector machine model of considerable complexity. This in turn brings up severe challenges, among them the intractable nature of inference and learning in many “interesting” models. To alleviate this, many recently proposed methods rely on a pre-processing stage, or a few stages, to produce a manageable number of hypothesized regions, or even complete segmentations, for an image. These are then scored, ranked or combined in a variety of ways. Here we consider a departure from these conventions, and approach semantic segmentation as a single-stage classification task, in which each image element (superpixel) is labeled by a feedforward model, based on evidence computed from the image.

The “secret” behind our method is that the evidence used in the feedforward classification is not computed from a small local region in isolation, but collected from a sequence of levels, obtained by “zooming out” from the closeup view of the superpixel. Starting from the superpixel itself, to a small region surrounding it, to a larger region around it and all the way to the entire image, we compute a rich feature representation at each level and combine all the features before feeding them to a classifier. This allows us to exploit statistical structure in the label space and dependencies between image elements at different resolutions without explicitly encoding these in a complex model.

We do not mean to dismiss structured prediction or inference, and as we discuss in Section 5, these tools may be complementary to our architecture. In this paper we explore how far we can go without resorting to explicitly structured models. We use convolutional neural networks (convnets) to extract features from larger zoom-out regions. Convnets, (re)introduced to vision in 2012, have facilitated a dramatic advance in classification, detection, fine-grained recognition and other vision tasks. Segmentation has remained conspicuously left out from this wave of progress; while image classification and detection accuracies on VOC have improved by nearly 50% (relative), segmentation numbers have improved only modestly. A big reason for this is that neural networks are inherently geared for “non-structured” classification and regression, and it is still not clear how they can be harnessed in a structured prediction framework. In this work we propose a way to leverage the power of representations learned by convnets, by framing segmentation as classification and making the structured aspect of it implicit. Finally, we show that use of multi-layer neural network trained with asymmetric loss to classify superpixels represented by zoom-out features, leads to significant improvement in segmentation accuracy over simpler models and conventional (symmetric) loss. Below we give a high-level description of our method, then discuss related work and position our work in its context. Most of the technical details are deferred to Section 4 in which we describe implementation and report on results, before concluding in Section 5.

 

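The introduction's last technical point, training with an asymmetric loss, can be sketched as a class-weighted cross-entropy. The inverse-frequency weighting below is a hypothetical choice, since the excerpt does not spell out the paper's exact scheme.

```python
# A hedged sketch of an asymmetric (class-weighted) cross-entropy loss.
# The inverse-frequency weights are an illustrative assumption.
import numpy as np

def weighted_cross_entropy(probs, labels, class_freq):
    """probs: (N, C) predicted class probabilities; labels: (N,) int labels;
    class_freq: (C,) empirical class frequencies on the training set."""
    weights = 1.0 / np.maximum(class_freq, 1e-12)  # rarer class => larger weight
    weights = weights / weights.sum()              # normalize for stability
    eps = 1e-12
    nll = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.mean(weights[labels] * nll)
```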

 

I am not fond of this paper's writing, but an acceptance at CVPR 2015 means the technique must be strong. The introduction's argument boils down to: conditional random fields are cumbersome, superpixels are better, and convolutional networks applied to superpixels are better still.

2. Zoom-out feature fusion

We cast category-level segmentation of an image as classifying a set of superpixels. Since we expect to apply the same classification machine to every superpixel, we would like the nature of the superpixels to be similar, in particular their size. In our experiments we use SLIC [1], but other methods that produce nearly-uniform grid of superpixels might work similarly well. Figure 2 provides a few illustrative examples for this discussion.

Local. The narrowest scope is the superpixel itself. We expect the features extracted here to capture local evidence: color, texture, small intensity/gradient patterns, and other properties computed over a relatively small contiguous set of pixels. The local features may be quite different even for neighboring superpixels, especially if these straddle category or object boundaries.

Proximal. As we zoom out and include larger spatial area around the superpixel, we can capture visual cues from surrounding superpixels. Features computed from these levels may capture information not available in the local scope; e.g., for locations at the boundaries of objects they will represent the appearance of both categories. For classes with non-uniform appearance they may better capture characteristic distributions for that class. We can expect somewhat more complex features to be useful at this level, but it is usually still too myopic for confident reasoning about presence of objects. Two neighboring superpixels could still have quite different features at this level, however some degree of smoothness is likely to arise from the significant overlap between neighbors' proximal regions, e.g., A and B in Fig. 2. As another example, consider color features over the body of a leopard; superpixels for individual dark brown spots might appear quite different from their neighbors (yellow fur) but their proximal regions will have pretty similar distributions (mix of yellow and brown). Superpixels that are sufficiently far from each other could still, of course, have drastically different proximal features, e.g., A and C in Fig. 2.

Distant. Zooming out further, we move to the distant levels: regions large enough to include sizeable fractions of objects, and sometimes entire objects. At this level our scope is wide enough to allow reasoning about shape, presence of more complex patterns in color and gradient, and the spatial layout of such patterns. Therefore we can expect more complex features that represent these properties to be useful here. Distant regions are more likely to straddle true boundaries in the image, and so this higher-level feature extraction may include a significant area in both the category of the superpixel at hand and nearby categories. For example, consider a person sitting on a chair; bottle on a dining table; pasture animals on the background of grass, etc. Naturally we expect this to provide useful information on both the appearance of a class and its context. For nearby superpixels and far enough zoom-out level, distant regions will have a very large overlap, which will gradually diminish with distance between superpixels. This is likely to lead to somewhat gradual changes in features, and to impose a system of implicit smoothness "terms", which depend both on the distance in the image and on the similarity in appearance in and around superpixels. Imposing such smoothness in a CRF usually leads to a very complex, intractable model.

 


 

 

Here the author lays out his scheme: the features around each superpixel are extracted at several zoom-out levels, namely local, proximal, distant, and scene, and then fused before classification. A sketch of the whole pipeline follows.
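Below is a minimal end-to-end sketch under stated assumptions: SLIC superpixels via scikit-image (the paper does use SLIC), but with mean-color features standing in for the convnet features of the actual method, and a small scikit-learn MLP in place of the paper's network. It reuses the hypothetical `zoom_out_regions` from the earlier sketch, and the random image and labels exist purely to show the shapes.

```python
# Sketch of the zoom-out pipeline: SLIC superpixels -> per-level features
# concatenated into one vector per superpixel -> feedforward classifier.
import numpy as np
from skimage.segmentation import slic
from sklearn.neural_network import MLPClassifier

def extract_level_feat(region):
    """Stand-in feature: mean color of the region (a convnet would go here)."""
    return region.reshape(-1, region.shape[-1]).mean(axis=0)

def zoom_out_features(image, segments):
    """One fused feature vector per superpixel: concat over zoom-out levels."""
    feats = []
    for sp_id in np.unique(segments):
        mask = segments == sp_id
        levels = zoom_out_regions(image, mask)   # from the earlier sketch
        feats.append(np.concatenate([extract_level_feat(r) for r in levels]))
    return np.stack(feats)

# Usage with random data, purely to show the shapes:
image = np.random.rand(128, 128, 3)
segments = slic(image, n_segments=100, compactness=10, start_label=0)
X = zoom_out_features(image, segments)           # (n_superpixels, 4 levels * 3)
y = np.random.randint(0, 21, len(X))             # 21 PASCAL VOC classes
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200).fit(X, y)
pred = clf.predict(X)                            # one label per superpixel
```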
