The Road to Advanced Semantic Segmentation: Revisiting CVPR 2015 (Part 6)

Semantic Object Segmentation via Detection in Weakly Labeled Video [full paper] [ext. abstract]

 

The title already makes it clear that this paper targets two things: video and weak labels. I personally have little experience with video semantic segmentation; my understanding stops at converting the video into frames and then running image segmentation on them.

 

Abstract

Semantic object segmentation in video is an important step for large-scale multimedia analysis. In many cases, however, semantic objects are only tagged at video-level, making them difficult to be located and segmented. To address this problem, this paper proposes an approach to segment semantic objects in weakly labeled video via object detection. In our approach, a novel video segmentation-by-detection framework is proposed, which first incorporates object and region detectors pre-trained on still images to generate a set of detection and segmentation proposals. Based on the noisy proposals, several object tracks are then initialized by solving a joint binary optimization problem with min-cost flow. As such tracks actually provide rough configurations of semantic objects, we thus refine the object segmentation while preserving the spatiotemporal consistency by inferring the shape likelihoods of pixels from the statistical information of tracks. Experimental results on Youtube-Objects dataset and SegTrack v2 dataset demonstrate that our method outperforms state-of-the-arts and shows impressive results.


I previously worked on an object-tracking project, so I roughly know that tracking is usually done by detecting first and then tracking. This paper follows a similar idea: since the input is video, every pixel has to be followed over time, so the authors state their core contribution explicitly: detection-based semantic segmentation of weakly supervised video.
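The paper itself initializes tracks by solving a joint binary optimization with min-cost flow; as a much simpler illustration of the tracking-by-detection idea, the sketch below greedily links per-frame detection boxes by IoU overlap. The function names and the threshold are my own choices for illustration, not the paper's formulation.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def link_tracks(frames, min_iou=0.3):
    """Greedily link per-frame detection boxes into tracks.

    `frames` is a list of lists of boxes, one inner list per video frame.
    Each returned track is a list of (frame_index, box) pairs.
    """
    tracks = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_t, last_box = track[-1]
            # only extend tracks that reached the previous frame
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= min_iou:
                track.append((t, best))
                unmatched.remove(best)
        # every box not absorbed by an existing track starts a new one
        tracks.extend([[(t, b)] for b in unmatched])
    return tracks
```

Unlike the paper's min-cost flow solution, this greedy version cannot revise earlier assignments, but it shows why track initialization tolerates some noisy proposals: an isolated false detection simply becomes a short track that can be discarded later.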

I remember that around 2015-16 tracking was a very active topic, and almost all of it was tracking-by-detection. For this paper to get into CVPR it must have something special, so I suspect the most important points are how the video segmentation is done and how the weak supervision is handled.

It is 2019 now, and since 2017 autonomous driving has been one of the biggest applications of semantic segmentation; advanced R&D companies such as NIO and Momenta all work on it. Semantic segmentation for autonomous driving cannot be solved by playing with still images alone, so video semantic segmentation deserves attention too.

 

Introduction
 

Semantic video object segmentation, which aims to detect and segment object-like regions in video according to predefined object labels, is an essential step in computer vision and multimedia analysis. In many scenarios, however, objects are only labeled at video-level, making them difficult to be located and segmented. As videos tagged with only semantic labels are growing explosively on the Internet, it is necessary to explore how to segment the desired semantic objects in such weakly labeled videos.

Recently, many approaches (e.g., [12, 25, 21]) have been proposed to address this problem through weakly supervised learning. In the learning process, object-relevant instances were usually selected among videos sharing the same semantic tags, while background instances were sampled from videos with irrelevant tags. These instances, which may be inadequately selected or inaccurately labeled, were then fed into the weakly supervised learning framework (e.g., multi-instance learning [12], negative mining [25] and label transfer [21]) to train a segment classifier. Although these approaches can achieve promising results on certain scenarios, the ambiguity of training instances may lead to unexpected segmentation results. Moreover, multiple videos are required during training, preventing the usage of these approaches in single video segmentation.

To address these problems, this paper proposes to segment semantic objects in video via detection. The proposed approach does not need a segment-level training stage and thus avoids the selection of ambiguous instances. Instead, the image-based object detectors, which have demonstrated great successes in segmenting semantic objects in images [32, 17, 14, 26, 31], are employed in our segmentation-by-detection framework. In this framework, object and region detectors pre-trained on still images are first used to generate a set of detection and segmentation proposals on various video frames. As shown in Fig. 1, such image-based proposals often lack spatiotemporal consistency and may be inaccurate due to blurred boundaries generated from video compression, object occlusion and camera motion. Therefore, we propose to initialize several object tracks from these noisy proposals by solving a joint assignment problem formulated as min-cost flow. As these tracks actually provide rough configurations of semantic objects, we thus propose to infer the shape likelihoods of pixels from the statistical information of tracks. In this manner, background noises can be suppressed and the segmentation results of desired objects can be refined while preserving the spatiotemporal consistency. To validate the effectiveness of the proposed approach, we have conducted extensive experiments on two public video benchmarks, including the Youtube-Objects dataset and the SegTrack v2 dataset. Experimental results show that our approach outperforms several state-of-the-art weakly-supervised and unsupervised approaches on challenging object categories. The main contributions of the proposed approach are summarized as follows:

 

1) A novel segmentation-by-detection framework is proposed for semantic object segmentation in weakly labeled video, and demonstrates impressive performance on two public video benchmarks.

2) We present an algorithm to initialize object tracks from image-based detection and segmentation proposals. By solving a joint assignment problem with min-cost flow, this algorithm is robust to noisy proposals.

3) We refine the object tracks by inferring shape likelihoods from the statistical information of tracks, which can effectively suppress background noise while preserving the spatiotemporal consistency of foreground objects.
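The third contribution infers shape likelihoods from the statistical information of tracks. A minimal sketch of that idea, under my own simplifying assumptions (the paper's actual inference is more involved), is to average the binary masks collected along a track and keep the pixels that are foreground in most frames; the function names and the 0.5 threshold are mine.

```python
def shape_likelihood(masks):
    """Per-pixel foreground frequency over a track's binary masks.

    `masks` is a list of same-sized 2-D lists of 0/1 values,
    one mask per frame in the track.
    """
    n = float(len(masks))
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(m[r][c] for m in masks) / n for c in range(cols)]
            for r in range(rows)]

def refine(masks, threshold=0.5):
    """Keep pixels that are foreground in more than `threshold` of the frames."""
    lik = shape_likelihood(masks)
    return [[1 if p > threshold else 0 for p in row] for row in lik]
```

The point of the statistic is noise suppression: a pixel that a single frame's noisy proposal marks as foreground, but that most frames in the track do not, falls below the threshold and is removed, which is how temporal consistency helps clean up per-frame errors.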

 



After reading the introduction you can roughly grasp the idea: the method targets videos sharing the same weak (video-level) label, and running detection on the video makes it possible to determine object boundaries effectively.

 

Related Work

Video object segmentation approaches in the current literature can be grouped into the (semi-)supervised, unsupervised and weakly supervised ones.

Supervised and semi-supervised approaches typically act through training label classifiers [20, 27] or propagating user-annotated labels over time [13, 29, 28, 2]. Although being well studied in a long period, such methods are limited to a small range of applications for its extreme dependence on labor-intensive pixel annotations to train suitable models.

Unsupervised approaches generally focus on segmenting the most primal object [8, 18, 19, 22] in a single video and co-segmenting the common object among a video collection [9, 11, 30]. As several recent successes, Lee et al. [18] attempted to segment the foreground object through identifying the key segments of highly salient appearance and motion in the video. Dong et al. [8] proposed to densely extract object segments with high objectness and smooth evolvement based on directed acyclic graph. Papazoglou et al. [22] developed a fast object segmentation approach that quickly estimates rough object configurations through the use of inside-outside maps.

Weakly supervised approaches have received growing attention for its convenience in gathering video-level labels and the prospect in analyzing web-scale data. Existing algorithms employed variants on the learning techniques to predict the confidence of each pixel belonging to a given concept. Hartmann et al. [12] first addressed it by training large-scale segment classifiers. Tang et al. [25] compared the segments in positive videos with a large collection of negative examples and identified those of distinct appearance as foreground. The study was further pushed forward by Xiao et al. [21] for handling this problem in multi-class criterion as opposed to traditional binary classification. A common issue affecting the performance of weakly supervised approaches is the learning procedure with ambiguous training labels (i.e. locations of target objects). Different from these methods, our approach addresses video segmentation with weak labels through leveraging image-based object detectors and avoids such a procedure. Detection-based approaches have been widely studied on image segmentation [16, 31, 14, 7, 17, 26, 32]. For example, Wei et al. [31] utilized detectors to guide semantic object segmentation in images without any pixel-level training stage. Inspired by these successes, this paper proposes to segment semantic objects in weakly labeled videos via object detection, which still receives less attention in the literature.


Anyone working on video segmentation should read the related work carefully, because it cites many papers that were already doing video semantic segmentation before this one. It is hard to imagine: in 2015, when the best semantic segmentation algorithms barely exceeded 80% accuracy on simple images, research on video semantic segmentation had already long been underway.

 

Next comes the main content of the algorithm:

First, detection proposals need no introduction: anyone who works on images knows them well as bounding-box localization.

Next comes one of the paper's novelties: segmentation proposals. The term looks odd at first. Detection boxes are just rectangles, so how do you produce a "segmentation box"?

In the usual FCN family of algorithms, the segmentation contour is recovered by upsampling the convolutional features and then refined with deconvolution features. So how can segmentation proposals be produced directly here?

With this question in mind, let's look at the authors' answer:

First of all, the results are neither very bad nor very good.

For the video side, the method simply decomposes the video into frames and processes each frame.

It adopts a motion-aware version of a region detector.

The motion-aware part comes from an ICCV 2013 paper on video "object segmentation"; in 2013 semantic segmentation was not yet heavily studied, and at that time the term in use was still "object segmentation".
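To make the earlier question about segmentation proposals concrete: a segmentation proposal is a pixel mask rather than a rectangle, and a detection-style box can always be derived from it as the mask's bounding rectangle. A toy sketch of that relationship (the function name is mine, not the paper's):

```python
def mask_to_box(mask):
    """Bounding rectangle (x1, y1, x2, y2) of a binary mask (2-D 0/1 list).

    Returns None for an empty mask.
    """
    coords = [(x, y) for y, row in enumerate(mask)
                     for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))
```

The reverse direction is what makes segmentation proposals harder than detection: a box does not determine a mask, which is why the paper needs region detectors and track statistics to recover pixel-accurate shapes.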
