The Road to Advanced Semantic Segmentation: Looking Back at CVPR 2015 (Part 3)

Weakly Supervised Semantic Segmentation for Social Images [full paper] [ext. abstract]
Wei Zhang, Sheng Zeng, Dequan Wang, Xiangyang Xue

 

Since few people have studied this paper in depth, there is little direct reference material to lean on, so I plan to learn the algorithm by translating and summarizing it as I go.

abstract:

Image semantic segmentation is the task of partitioning image into several regions based on semantic concepts. In this paper, we learn a weakly supervised semantic segmentation model from social images whose labels are not pixel-level but image-level; furthermore, these labels might be noisy. We present a joint conditional random field model leveraging various contexts to address this issue. More specifically, we extract global and local features in multiple scales by convolutional neural network and topic model. Inter-label correlations are captured by visual contextual cues and label co-occurrence statistics. The label consistency between image-level and pixel-level is finally achieved by iterative refinement. Experimental results on two real-world image datasets PASCAL VOC2007 and SIFT-Flow demonstrate that the proposed approach outperforms state-of-the-art weakly supervised methods and even achieves accuracy comparable with fully supervised methods.


 

First, the abstract alone makes clear that this is an early application of the famous conditional random field to image segmentation. It trains from image-level annotations instead of pixel-level annotations, and its final results on VOC are roughly on par with segmentation models trained under full (strong) supervision.

Let me first explain in what sense the supervision in this paper is "weak":

It segments directly from images that carry only image-level tags, without using annotation tools to trace object outlines pixel by pixel.
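To make "weak" concrete, here is a minimal sketch on made-up data (the category names and arrays are purely illustrative, not from the paper) contrasting the pixel-level masks used by fully supervised methods with the noisy image-level tag vectors this paper trains on:

```python
import numpy as np

# Hypothetical categories, for illustration only.
CATEGORIES = ["person", "dog", "sofa", "car"]

# Fully supervised training signal: one class index per pixel (4x4 toy "image").
pixel_level_mask = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
    [2, 2, 2, 2],
])  # every pixel annotated -> expensive and slow to produce

# Weakly supervised training signal: one binary vector y per image,
# y[i] = 1 means category i is (claimed to be) present somewhere in the image.
image_level_labels = np.array([1, 1, 0, 1])

# Social-image tags can also be noisy: here "car" is an incorrect extra tag,
# and "sofa" is missing even though it clearly appears in the mask above.
print("tags:", [c for c, y in zip(CATEGORIES, image_level_labels) if y == 1])
```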

 

1. Introduction

Semantic segmentation, i.e., parsing image into several semantic regions, assigns each pixel (or superpixel) to one of the predefined semantic categories. Most state-of-the-art methods rely on a sufficiently huge amount of annotated samples in training. However, there are not enough labeled samples for this task because pixel-level (or superpixel-level) annotation is time-consuming and labor-intensive. Recent works have begun to address the semantic segmentation problem in the weakly supervised settings, where each training image is only annotated by image-level labels [24, 25, 26, 27, 30, 33, 34]. The existing weakly supervised semantic segmentation methods are based on one strict assumption that image-level labels are guaranteed to be precise by professional annotators. With the prevalence of photo sharing websites and collaborative image tagging systems, e.g., Flickr, a large number of social images with user-provided labels are available from the Internet. These labels are usually image-level; what's more, they might be noisy: there are either incorrect additional labels assigned to a training image or labels missing from the ground truth. Figure 1 shows several social images and the associated noisy labels. It is challenging but attractive to learn an effective semantic segmentation model from such social images.

In this paper, we propose a weakly supervised semantic segmentation model to overcome the challenge posed by noisy image-level labels for training. We learn a joint conditional random field (CRF) from weakly labeled social images by sufficiently leveraging various contexts, e.g., the associations between high-level semantic concepts and low-level visual appearance, inter-label correlations, spatial neighborhoods, and label consistency between image-level and pixel-level. More specifically, each image is segmented into superpixels with multiple quantization levels. Global features for the whole image and local features for the superpixels in multiple scales are extracted by convolutional neural network (CNN) and latent semantic concept model (LSC). Then we capture the inter-label correlations by visual contextual cues as well as label co-occurrence statistics. The label consistency between image-level and pixel-level is finally achieved by iterative refinement in a flip-flop manner.

We conduct experiments on two challenging datasets, the PASCAL VOC 2007 and SIFT-Flow datasets. The proposed approach achieves comparable results or outperforms previous state-of-the-art methods, even though it is in the weakest supervision, which demonstrates that the image-level labels, especially potential relationships, are more efficiently utilized by our method. The main contributions of this paper are summarized as follows:

• We propose a weakly supervised semantic segmentation model for social images, where only image-level labels are available for training, or even worse, the annotations can be noisy.

• We design a joint learning framework to sufficiently leverage various contexts including feature-label association, inter-label correlation, spatial neighborhood cues, and label consistency.

• We learn inter-label correlation not only by investigating label co-occurrence statistics from training samples but also by looking at the overlap of the most informative regions for different classes.
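To give a concrete picture of the pipeline the introduction describes (multi-scale superpixel segmentation, a global descriptor for the whole image, and local CNN + latent-semantic-concept features per superpixel), here is a hedged sketch. The arguments `segment_superpixels`, `cnn_features` and `lsc_features` are placeholders for whichever segmenter, CNN and topic model one plugs in, and the exact split of CNN vs. LSC features between the global and local levels is just one plausible arrangement, not an API defined by the paper.

```python
from typing import Callable, Dict, List

def build_image_representation(image,
                               scales: List[int],
                               segment_superpixels: Callable,  # segmentation at a given quantization level
                               cnn_features: Callable,         # appearance features (global or per region)
                               lsc_features: Callable) -> Dict:
    """Sketch of the feature-extraction stage: segment the image at several
    quantization levels, then attach global and local descriptors for the
    CRF potentials to consume later."""
    representation = {
        "global": cnn_features(image),  # one descriptor for the whole image
        "superpixels": [],
    }
    for scale in scales:
        for region in segment_superpixels(image, scale):
            representation["superpixels"].append({
                "scale": scale,
                "region": region,
                # low-level CNN appearance plus mid-level latent semantic concepts
                "features": list(cnn_features(region)) + list(lsc_features(region)),
            })
    return representation
```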

 


 

This part mainly lays out the characteristics of the algorithm, namely that it manages to go from image-level labels all the way to segmentation. The challenges it tackles are the important part and break down into three points:

1. Image-level samples already come with annotation problems (noisy tags), which strongly affects the final segmentation quality.

2. Designing a method that exploits contextual correlations to perform the segmentation.

3. In other words, the authors also pay special attention to regions where objects overlap, i.e., the overlap of the most informative regions for different classes, combined with label co-occurrence statistics (sketched below).
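For point 3 / the paper's third contribution, one of the two cues is label co-occurrence statistics computed from the training tags. A minimal sketch of such a statistic on hypothetical tag vectors (the normalization into a conditional frequency is my choice, not necessarily the paper's exact formula) could look like this:

```python
import numpy as np

# Each row is one training image's (possibly noisy) image-level tag vector;
# columns correspond to hypothetical categories ["person", "dog", "sofa", "car"].
Y = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
])

# Co-occurrence counts: C[i, j] = number of training images tagged with both i and j.
C = Y.T @ Y

# One simple way to turn counts into an inter-label correlation cue:
# P[i, j] ~ frequency of label j given label i.
P = C / np.maximum(C.diagonal()[:, None], 1)
print(P.round(2))
```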

 

 

2. Related Work

In the past years, image semantic segmentation has attracted a lot of attention. Most of the existing works model the task as a fully supervised problem [32]. Shotton et al. [19] implemented semantic segmentation by incorporating shape-texture, color, location and edge cues in a CRF model over image pixels. This model is then extended in the follow-up works [10, 12, 13]. Kohli et al. utilized the higher order potentials as a soft decision to ensure that pixels constituting a particular segment have the same semantic concept [10]. Ladicky et al. extended the higher order potentials to hierarchical structure by using multiple segmentations in [12] and further integrated label co-occurrence statistics in [13]. However, these methods heavily rely on pixel-level annotations during the training stage.

In addition to fully supervised semantic segmentation, there have also been several recent works in the weakly supervised settings. The method in [31] attempted to automatically annotate image regions by learning a correlative multi-label multi-instance model from image-level tagged data. Verbeek and Triggs [24] used several appearance descriptors to learn the latent aspect model via probabilistic Latent Semantic Analysis (pLSA) [8], and integrated the spanning tree structure and Markov Random Fields to capture spatial information. Vezhnevets and Buhmann [25] cast the weakly supervised task as a multi-instance multi-task learning problem with the framework of Semantic Texton Forest (STF) [18]. Based on [25], Vezhnevets et al. [26, 27] integrated the latent correlations among the superpixels belonging to different images which share the same labels into CRF. Xu et al. [30] simplified the previous complicated framework by a graphical model that encodes the presence/absence of a class as well as the assignments of semantic labels to superpixels. [33] performed semantic segmentation in weak supervision via classifier evaluation, where the classifier parameters are firstly sampled at random and then the superpixel classifiers are evaluated by measuring the distance between the ground-truth negative samples and the predicted positive samples. It should be pointed out that all the above approaches are based on the assumption that the given image-level labels for training are correct and complete, which is not practical in many real-world applications.

It is a realistic problem where the end goal is pixel-level labels but the input is noisy image-level annotations. To address the problem of having noise in the ground truth, we investigate label correlations based on both label co-occurrence statistics and visual contextual cues simultaneously, which differs from the existing weakly supervised methods [24, 25, 26, 27, 30]. In addition, to make the proposed framework more robust under the noisy condition, we take the latent semantic concept model as a mid-level representation, which also helps to narrow down the gap between semantic space and feature space; in contrast, the previous methods (e.g., [26, 30]) only used the appearance model as a low-level representation. In comparison with the state-of-the-art weakly supervised methods (e.g., [27, 30]), we utilize multiple scale segmentations to overcome the weakness of a single choice of segmentation, which fails to cover different quantization levels of objects.

 

 


 

High-quality papers really are something else: their related work section is genuinely relevant, thoroughly so, citing even the earliest work applying CRFs to image segmentation. If you want to write good papers, you really have to read a lot of excellent ones.

This section first reviews early applications of CRFs to image segmentation, which depend on strong supervision and therefore do not transfer to the weakly supervised setting. It then reviews several weakly supervised methods, also CRF-based; the paper's treatment of CRFs is genuinely deep, and working through all of its references one by one would be well worth the effort. The authors argue that many image-level CRF methods suffer badly from noisy labels and are therefore also inadequate, and already in the related work section they can't wait to point out that their scheme tackles the noisy-label problem and image-level (weakly supervised) segmentation at the same time.

 

3. The Proposed Model

Suppose that each image I is associated with a label vector y = [y_1, ..., y_L], where L is the number of categories, and y_i = 1 indicates that the i-th category is present in this image, otherwise y_i = 0. In the training set, y is given; however, it might be noisy. In the test set, y is unknown. For each image, we firstly employ the existing multi-scale segmentation algorithm to get a set of superpixels {x_p}, p = 1, ..., M, over multiple quantization levels. Here, M is the total number of superpixels in image I. The label of superpixel x_p is denoted as h_p ∈ {1, 2, ..., L}, and the labels of all superpixels for image I are h = [h_1, ..., h_M], which are not available for training. Our goal is to infer the semantic label for each superpixel in an image, and adjacent superpixels sharing the same semantic label are fused into a whole. We jointly build a conditional random field (CRF) over the image-level label variables y and the superpixel-level label variables h. We leverage label-pair correlation and connect each superpixel to its neighbors to encode local smoothness constraints. Thus we formulate an energy function E with five types of potentials as follows: [equation not reproduced in this excerpt]

 

 


The authors state that they will use the CRF to decide the category of every pixel (superpixel), and a crucial part of that decision is context and neighboring pixels.

In fact, ever since FCN, context and neighboring pixels have been a central focus of subsequent semantic segmentation research. Looking back at these two 2015 papers, FCN and this one, both turned out to be very important bellwethers for later work on image segmentation.
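The excerpt above stops right before the paper's energy equation, so the exact definitions of the five potentials are not reproduced here. Purely as an illustration of what a joint energy over y and h combining the contexts from the contributions list (image-level and superpixel-level feature-label association, inter-label correlation, spatial smoothness between neighboring superpixels, and image/superpixel label consistency) might look like, here is a hedged sketch that reuses the `image_repr` structure from the earlier feature-extraction sketch; the potential functions and weights are placeholders of my own, not the paper's definitions:

```python
def crf_energy(y, h, image_repr, neighbor_pairs,
               phi_image, phi_superpixel, psi_label, psi_spatial, psi_consistency,
               weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Illustrative joint CRF energy over image-level labels y and superpixel
    labels h, with five placeholder potential types. The paper's actual
    potentials are defined in its energy equation, not reproduced in this post."""
    w1, w2, w3, w4, w5 = weights
    E = 0.0
    # 1) image-level feature-label association (global appearance vs. each y_i)
    E += w1 * sum(phi_image(i, y_i, image_repr["global"]) for i, y_i in enumerate(y))
    # 2) superpixel-level feature-label association (local appearance vs. h_p)
    E += w2 * sum(phi_superpixel(h_p, sp["features"])
                  for h_p, sp in zip(h, image_repr["superpixels"]))
    # 3) inter-label correlation between pairs of image-level labels
    E += w3 * sum(psi_label(i, j, y[i], y[j])
                  for i in range(len(y)) for j in range(i + 1, len(y)))
    # 4) spatial smoothness between neighboring superpixels
    E += w4 * sum(psi_spatial(h[p], h[q]) for p, q in neighbor_pairs)
    # 5) consistency between the image-level labels and the superpixel labels
    E += w5 * psi_consistency(y, h)
    return E
```

The "flip-flop" iterative refinement mentioned in the abstract presumably alternates between updating y and h under this kind of energy, holding one fixed while refining the other.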

 

I will update the rest of this write-up once I'm back home.

 

 

 
