The Road to Semantic Segmentation Mastery: Looking Back at CVPR 2015 (Part 6)

Semantic Object Segmentation via Detection in Weakly Labeled Video [full paper] [ext. abstract]

 

The title already makes clear that this paper targets two points: video and weak labels. I personally know little about video semantic segmentation; my understanding stops at first converting the video into frames and then segmenting them as still images.

 

Abstract

Semantic object segmentation in video is an important step for large-scale multimedia analysis. In many cases, however, semantic objects are only tagged at video-level, making them difficult to be located and segmented. To address this problem, this paper proposes an approach to segment semantic objects in weakly labeled video via object detection. In our approach, a novel video segmentation-by-detection framework is proposed, which first incorporates object and region detectors pre-trained on still images to generate a set of detection and segmentation proposals. Based on the noisy proposals, several object tracks are then initialized by solving a joint binary optimization problem with min-cost flow. As such tracks actually provide rough configurations of semantic objects, we thus refine the object segmentation while preserving the spatiotemporal consistency by inferring the shape likelihoods of pixels from the statistical information of tracks. Experimental results on Youtube-Objects dataset and SegTrack v2 dataset demonstrate that our method outperforms state-of-the-arts and shows impressive results.

I once worked on an object tracking project, so I roughly know that tracking is usually done detection-first, then tracking. This paper follows a similar line of thought: since the input is video, every pixel has to be followed over time, so the authors state their core contribution explicitly: detection-based semantic segmentation of weakly supervised video.

As I recall, tracking was a very active area around 2015-2016, and almost all of it was tracking-by-detection. Since this paper made it into CVPR, it must have something special, so I expect its most important contributions lie in how it performs video segmentation and how it handles the weak supervision.

It is now 2019. Since 2017, autonomous driving has been one of the biggest applications of semantic segmentation; advanced R&D companies such as NIO and Momenta all work on autonomous driving, and the semantic segmentation needed there cannot be solved by processing single images alone, so video semantic segmentation deserves close attention.

 

Introduction
 

Semantic video object segmentation, which aims to detect and segment object-like regions in video according to predefined object labels, is an essential step in computer vision and multimedia analysis. In many scenarios, however, objects are only labeled at video-level, making them difficult to be located and segmented. As videos tagged with only semantic labels are growing explosively on the Internet, it is necessary to explore how to segment the desired semantic objects in such weakly labeled videos. Recently, many approaches (e.g., [12, 25, 21]) have been proposed to address this problem through weakly supervised learning. In the learning process, object-relevant instances were usually selected among videos sharing the same semantic tags, while background instances were sampled from videos with irrelevant tags. These instances, which may be inadequately selected or inaccurately labeled, were then fed into the weakly supervised learning framework (e.g., multi-instance learning [12], negative mining [25] and label transfer [21]) to train a segment classifier. Although these approaches can achieve promising results on certain scenarios, the ambiguity of training instances may lead to unexpected segmentation results. Moreover, multiple videos are required during training, preventing the usage of these approaches in single video segmentation.

To address these problems, this paper proposes to segment semantic objects in video via detection. The proposed approach does not need a segment-level training stage and thus avoids the selection of ambiguous instances. Instead, the image-based object detectors, which have demonstrated great successes in segmenting semantic objects in images [32, 17, 14, 26, 31], are employed into our segmentation-by-detection framework. In this framework, object and region detectors pre-trained on still images are first used to generate a set of detection and segmentation proposals on various video frames. As shown in Fig. 1, such image-based proposals often lack spatiotemporal consistency and may be inaccurate due to blurred boundaries generated from video compression, object occlusion and camera motion. Therefore, we propose to initialize several object tracks from these noisy proposals by solving a joint assignment problem formulated as min-cost flow. As these tracks actually provide rough configurations of semantic objects, we thus propose to infer the shape likelihoods of pixels from the statistical information of tracks. In this manner, background noises can be suppressed and the segmentation results of desired objects can be refined while preserving the spatiotemporal consistency. To validate the effectiveness of the proposed approach, we have conducted extensive experiments on two public video benchmarks, including Youtube-Objects dataset and SegTrack v2 dataset. Experimental results show that our approach outperforms several state-of-the-art weakly-supervised and unsupervised approaches on challenging object categories. The main contributions of the proposed approach are summarized as follows:

 

1) A novel segmentation-by-detection framework is proposed for semantic object segmentation in weakly labeled video, and demonstrates impressive performance on two public video benchmarks.

2) We present an algorithm to initialize object tracks from image-based detection and segmentation proposals. By solving a joint assignment problem with min-cost flow, this algorithm is robust to noisy proposals.

3) We refine the object tracks by inferring shape likelihoods from the statistical information of tracks, which can effectively suppress background noise while preserving the spatiotemporal consistency of foreground objects.
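The paper links noisy per-frame proposals into tracks by solving a joint binary optimization with min-cost flow. As a greatly simplified stand-in (my own sketch, not the authors' code: greedy frame-to-frame IoU linking instead of a global min-cost-flow solve, and boxes only, no segment proposals), track initialization can be caricatured as:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); intersection-over-union overlap score.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_tracks(frames, iou_thresh=0.3):
    """frames: one list of detection boxes per video frame.
    Greedily extends each track with the highest-overlap box in the
    next frame; leftover boxes start new tracks. Returns tracks as
    lists of (frame_index, box)."""
    tracks = [[(0, b)] for b in frames[0]]
    for t, boxes in enumerate(frames[1:], start=1):
        unused = list(boxes)
        for tr in tracks:
            last_t, last_box = tr[-1]
            if last_t != t - 1 or not unused:
                continue  # track already ended, or nothing left to assign
            best = max(unused, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_thresh:
                tr.append((t, best))
                unused.remove(best)
        tracks.extend([(t, b)] for b in unused)
    return tracks
```

A real min-cost-flow formulation would trade off all assignments jointly (so one bad frame cannot hijack a track), which is exactly the robustness to noisy proposals that contribution 2 claims; the greedy version above only shows the data flow.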

 


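The shape-likelihood idea in contribution 3 can be caricatured as per-pixel foreground statistics accumulated over a track's segmentation masks: pixels that are foreground in most frames of the track get a high likelihood, transient background noise gets a low one. A minimal sketch (my own illustration, assuming the masks are already warped to a common reference; the paper's statistical model is richer than a plain average):

```python
def shape_likelihood(masks):
    """masks: list of binary masks (nested lists of 0/1) from one track,
    all of the same size. Returns the per-pixel foreground frequency in
    [0, 1], a crude stand-in for the paper's shape likelihood."""
    h, w = len(masks[0]), len(masks[0][0])
    n = len(masks)
    return [[sum(m[y][x] for m in masks) / n for x in range(w)]
            for y in range(h)]
```

Thresholding or reweighting each frame's segmentation with this map is what lets the track's statistics suppress background noise while keeping the result temporally consistent.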

Having read the introduction, one can roughly grasp the idea: the method targets videos that share the same weak video-level labels, and running detection on the video makes it possible to determine object boundaries reliably.

 

Related Work

Video object segmentation approaches in the current literature can be grouped into the (semi-)supervised, unsupervised and weakly supervised ones.

Supervised and semi-supervised approaches typically act through training label classifiers [20, 27] or propagating user-annotated labels over time [13, 29, 28, 2]. Although being well studied in a long period, such methods are limited to a small range of applications for its extreme dependence on labor-intensive pixel annotations to train suitable models.

Unsupervised approaches generally focus on segmenting the most primal object [8, 18, 19, 22] in a single video and co-segmenting the common object among a video collection [9, 11, 30]. As several recent successes, Lee et al. [18] attempted to segment the foreground object through identifying the key segments of highly salient appearance and motion in the video. Dong et al. [8] proposed to densely extract object segments with high objectness and smooth evolvement based on directed acyclic graph. Papazoglou et al. [22] developed a fast object segmentation approach that quickly estimates rough object configurations through the use of inside-outside maps.

Weakly supervised approaches have received growing attention for its convenience in gathering video-level labels and the prospect in analyzing web-scale data. Existing algorithms employed variants on the learning techniques to predict the confidence of each pixel belonging to a given concept. Hartmann et al. [12] first addressed it by training large-scale segment classifiers. Tang et al. [25] compared the segments in positive videos with a large collection of negative examples and identified those of distinct appearance as foreground. The study was further pushed forward by Xiao et al. [21] for handling this problem in multi-class criterion as opposed to traditional binary classification. A common issue affecting the performance of weakly supervised approaches is the learning procedure with ambiguous training labels (i.e. locations of target objects). Different from these methods, our approach addresses video segmentation with weak labels through leveraging image-based object detectors and avoids such a procedure. Detection-based approaches have been widely studied on image segmentation [16, 31, 14, 7, 17, 26, 32]. For example, Wei et al. [31] utilized detectors to guide semantic object segmentation in images without any pixel-level training stage. Inspired by these successes, this paper proposes to segment semantic objects in weakly labeled videos via object detection, which still receives less attention in the literature.


Anyone working on video segmentation should study the related work carefully, since it cites many earlier papers on video semantic segmentation. It is hard to imagine that in 2015, when the best semantic segmentation algorithms barely exceeded 80% accuracy on simple images, people had already begun researching semantic segmentation on video.

 

Next comes the main content of the algorithm:

First, detection proposals need no introduction; anyone working with images knows them well: localization via bounding boxes.

Next is the paper's innovation: segmentation proposals. The term looks odd at first. Detection proposals are generally rectangles, so how do you produce a segmentation "box"?

In the usual FCN-family algorithms, the segmentation contour is recovered by upsampling the convolutional features, then refined with deconvolution features. So how can segmentation proposals be produced directly here?
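One way to see the relationship between the two kinds of proposals (my own toy illustration, not from the paper): a segmentation proposal is a pixel mask, and a detection box can always be recovered from it as the mask's tight bounding rectangle, while the reverse direction loses the contour.

```python
def mask_to_box(mask):
    """Tight bounding box (x1, y1, x2, y2) around a binary mask.
    Shows that a segmentation proposal (pixel mask) carries strictly
    more information than a detection proposal (box): the box is
    derivable from the mask, but not vice versa."""
    coords = [(x, y) for y, row in enumerate(mask)
                     for x, v in enumerate(row) if v]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))
```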

With this question in mind, let us look at the authors' answer:

First, the results are neither particularly bad nor particularly good.

For video handling, the method simply decomposes the video into frames and processes each frame.

It adopts a motion-aware version of the region detector.

The motion-aware part comes from an ICCV 2013 paper on video "object segmentation"; in 2013 semantic segmentation was not yet heavily studied, and the focus was still plain "object segmentation".
