The Road to Advanced Semantic Segmentation: Looking Back at CVPR 2015 (Part 3)

Weakly Supervised Semantic Segmentation for Social Images
Wei Zhang, Sheng Zeng, Dequan Wang, Xiangyang Xue

 

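Few people have studied these papers in depth, so there is little to consult directly. I therefore plan to learn this algorithm by translating and summarizing as I go.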

abstract:

Image semantic segmentation is the task of partitioning image into several regions based on semantic concepts. In this paper, we learn a weakly supervised semantic segmentation model from social images whose labels are not pixel-level but image-level; furthermore, these labels might be noisy. We present a joint conditional random field model leveraging various contexts to address this issue. More specifically, we extract global and local features in multiple scales by convolutional neural network and topic model. Inter-label correlations are captured by visual contextual cues and label co-occurrence statistics. The label consistency between image-level and pixel-level is finally achieved by iterative refinement. Experimental results on two real-world image datasets PASCAL VOC2007 and SIFT-Flow demonstrate that the proposed approach outperforms state-of-the-art weakly supervised methods and even achieves accuracy comparable with fully supervised methods.

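Paraphrasing the abstract: semantic segmentation means splitting an image into regions according to semantic concepts. The model here is trained on social images whose labels are image-level rather than pixel-level, and those labels may also be noisy. To cope with that, the authors build a joint conditional random field that draws on several kinds of context. Global and local features are extracted at multiple scales with a convolutional neural network and a topic model. Correlations between labels come from visual contextual cues and label co-occurrence statistics. Image-level and pixel-level labels are brought into agreement by iterative refinement. On PASCAL VOC2007 and SIFT-Flow the method beats earlier weakly supervised methods and comes close to fully supervised ones.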

 

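First, the abstract alone makes it clear that this is an early application of the well-known conditional random field (CRF) to image segmentation. It replaces pixel-level annotation with image-level annotation, and its reported results on VOC are still close to those of segmentation models trained with full supervision.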

Let me first explain in what sense the supervision in this paper is weak:

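It segments directly from images that carry only image-level labels. Nobody uses an annotation tool to trace object outlines by hand.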

 

Introduction

Semantic segmentation, i.e., parsing image into several semantic regions, assigns each pixel (or superpixel) to one of the predefined semantic categories. Most state-of-the-art methods rely on a sufficiently huge amount of annotated samples in training. However, there are not enough labeled samples for this task because pixel-level (or superpixel-level) annotation is time-consuming and labor-intensive. Recent works have begun to address the semantic segmentation problem in the weakly supervised settings, where each training image is only annotated by image-level labels [24, 25, 26, 27, 30, 33, 34]. The existing weakly supervised semantic segmentation methods are based on one strict assumption that image-level labels are guaranteed to be precise by professional annotators. With the prevalence of photo sharing websites and collaborative image tagging system, e.g., Flickr, a large number of social images with user provided labels are available from the Internet. These labels are usually image-level; what's more, they might be noisy: There are either incorrect additional labels assigned to a training image or labels missing from the ground truth. Figure 1 shows several social images and the associated noisy labels. It is challenging but attractive to learn an effective semantic segmentation model from such social images. In this paper, we propose a weakly supervised semantic segmentation model to overcome the challenge posed by noisy image-level labels for training. We learn a joint conditional random field (CRF) from weakly labeled social images by sufficiently leveraging various contexts, e.g., the associations between high-level semantic concepts and low-level visual appearance, inter-label correlations, spatial neighborhoods, and label consistency between image-level and pixel-level. More specifically, each image is segmented into superpixels with multiple quantization levels.
Global features for the whole image and local features for the superpixels in multiple scales are extracted by convolutional neural network (CNN) and latent semantic concept model (LSC). Then we capture the inter-label correlations by visual contextual cues as well as label co-occurrence statistics. The label consistency between image-level and pixel-level is finally achieved by iterative refinement in a flip-flop manner. We conduct experiments on two challenging datasets, PASCAL VOC 2007 and SIFT-Flow datasets. The proposed approach achieves comparable results or outperforms previous state-of-the-art methods, even though it is in the weakest supervision, which demonstrates that the image-level labels, especially potential relationships, are more efficiently utilized by our method. The main contributions of this paper are summarized as follows:

• We propose a weakly supervised semantic segmentation model for social images, where only image-level labels are available for training, or even worse, the annotations can be noisy.

• We design a joint learning framework to sufficiently leverage various contexts including feature-label association, inter-label correlation, spatial neighborhood cues, and label consistency.

• We learn inter-label correlation not only by investigating label co-occurrence statistics from training samples but also by looking at the overlap of the most informative regions for different classes.

 

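Paraphrasing the introduction: semantic segmentation assigns every pixel (or superpixel) to one of a set of predefined semantic categories. The strongest methods need large amounts of annotated training data, but pixel-level (or superpixel-level) annotation is slow and labor-intensive, so labeled samples are scarce. Recent weakly supervised methods [24, 25, 26, 27, 30, 33, 34] train from image-level labels only, yet they all assume those labels were produced accurately by professional annotators. Photo-sharing sites and collaborative tagging systems such as Flickr offer huge numbers of social images with user-provided labels, but those labels are image-level and may be noisy: an image can carry incorrect extra labels, or lack labels that belong to it.

The paper learns a joint conditional random field (CRF) from such weakly labeled social images, using several kinds of context: the association between high-level semantic concepts and low-level visual appearance, inter-label correlation, spatial neighborhood, and label consistency between the image level and the pixel level. Each image is segmented into superpixels at multiple quantization levels. Global features for the whole image and local features for the superpixels are extracted at multiple scales with a CNN and a latent semantic concept model (LSC). Inter-label correlation is captured from visual contextual cues together with label co-occurrence statistics. Image-level and pixel-level labels are made consistent by iterative refinement in what the paper calls a "flip-flop" manner. Experiments on PASCAL VOC 2007 and SIFT-Flow show results comparable to or better than earlier state-of-the-art methods, even though this method uses the weakest supervision.

The three contributions, restated:

• A weakly supervised semantic segmentation model for social images that trains on image-level labels only, even when those labels are noisy.

• A joint learning framework that exploits feature-label association, inter-label correlation, spatial neighborhood cues, and label consistency.

• Inter-label correlation learned from two sources: label co-occurrence statistics in the training samples, and the overlap between the most informative regions of different classes.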

 

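This part mainly introduces what characterizes the algorithm: it gets from image-level labels all the way to segmentation. The challenges it faces matter more, and they come down to three points:

1. Image-level samples come with annotation problems (noisy labels), and these strongly affect the final segmentation quality.

2. The method has to be designed so that the segmentation exploits contextual associations.

3. The authors also study how the most informative regions of different classes overlap, and use that overlap to learn how labels relate to each other.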
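To make "label co-occurrence statistics" concrete for myself, here is a minimal sketch of how such statistics could be counted from image-level label vectors. This is only my own illustration, not the paper's formulation: the function names, the conditional-frequency estimate, and the toy labels (sky, sea, car) are all assumptions.

```python
from itertools import combinations


def cooccurrence_counts(label_vectors):
    """Count how often each pair of labels is tagged on the same image.

    label_vectors: list of binary lists, one per image (1 = label present).
    The diagonal entry [i][i] holds the number of images tagged with label i.
    """
    num_labels = len(label_vectors[0])
    counts = [[0] * num_labels for _ in range(num_labels)]
    for y in label_vectors:
        present = [i for i, flag in enumerate(y) if flag == 1]
        for i in present:
            counts[i][i] += 1
        for i, j in combinations(present, 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return counts


def conditional_cooccurrence(counts):
    """Estimate P(label j present | label i present) as a simple frequency."""
    n = len(counts)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if counts[i][i] == 0:
            continue  # label i never appears, leave its row at zero
        for j in range(n):
            probs[i][j] = counts[i][j] / counts[i][i]
    return probs


# Hypothetical image-level labels over (sky, sea, car), one row per image.
toy_labels = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
counts = cooccurrence_counts(toy_labels)
probs = conditional_cooccurrence(counts)
print(probs[1][0])  # how often "sky" is tagged when "sea" is tagged
```

With noisy tags, statistics like these are only a rough prior, which is presumably why the paper pairs them with visual cues instead of relying on them alone.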

 

 

2. Related Work

In the past years, image semantic segmentation has attracted a lot of attentions. Most of the existing works model the task as a fully supervised problem [32]. Shotton et al. [19] implemented semantic segmentation by incorporating shape-texture color, location and edge clues in a CRF model over image pixels. This model is then extended in the follow-up works [10, 12, 13]. Kohli et al. utilized the higher order potentials as a soft decision to ensure that pixels constituting a particular segment have the same semantic concept [10]. Ladicky et al. extended the higher order potentials to hierarchical structure by using multiple segmentations in [12] and further integrated label co-occurrence statistics in [13]. However, these methods heavily rely on pixel-level annotations during the training stage. In addition to fully supervised semantic segmentation, there have been several works in the weakly supervised settings as well recently. The method in [31] attempted to automatically annotate image regions by learning a correlative multi-label multi-instance model from image-level tagged data. Verbeek and Triggs [24] used several appearance descriptors to learn the latent aspect model via probabilistic Latent Semantic Analysis (pLSA) [8], and integrated the spanning tree structure and Markov Random Fields to capture spatial information. Vezhnevets and Buhmann [25] cast the weakly supervised task as a multi-instance multi-task learning problem with the framework of Semantic Texton Forest (STF) [18]. Based on [25], Vezhnevets et al. [26, 27] integrated the latent correlations among the superpixels belonging to different images which share the same labels into CRF. Xu et al. [30] simplified the previous complicated framework by a graphical model that encodes the presence/absence of a class as well as the assignments of semantic labels to superpixels. 
[33] performed semantic segmentation in weak supervision via classifier evaluation where the classifier parameters are firstly sampled at random and then the superpixel classifiers are evaluated by measuring the distance between the ground-truth negative samples and the predicted positive samples. It should be pointed out that all above approaches are based on the assumption that the given image-level labels for training are correct and complete, which is not practical in many real-world applications. It is a realistic problem where the end goal is pixel-level labels but the input is noisy image-level annotations. To address the problem of having noise in the ground truth, we investigate label correlations based on both label co-occurrence statistics and visual contextual cues simultaneously, which differs from the existing weakly supervised methods [24, 25, 26, 27, 30]. In addition, to make the proposed framework more robust under the noisy condition, we take latent semantic concept model as a mid-level representation, which also helps to narrow down the gap between semantic space and feature space; in contrast, the previous methods (e.g., [26, 30]) only used the appearance model as a low-level representation. In comparison with the state-of-the-art weakly supervised methods (e.g., [27, 30]), we utilize multiple scale segmentations to overcome the weakness of single choice of segmentation which fails to cover different quantization levels of objects.

 

 

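Paraphrasing the related work: image semantic segmentation has drawn a lot of attention in recent years, and most existing work treats it as a fully supervised problem [32]. Shotton et al. [19] combined shape-texture, color, location and edge cues in a CRF over image pixels, and follow-up work extended that model [10, 12, 13]. Kohli et al. [10] used higher-order potentials as a soft constraint so that the pixels making up a segment share one semantic concept. Ladicky et al. extended the higher-order potentials to a hierarchy built from multiple segmentations [12] and then added label co-occurrence statistics [13]. All of these depend heavily on pixel-level annotation at training time.

On the weakly supervised side, [31] annotates image regions automatically by learning a correlative multi-label multi-instance model from image-level tags. Verbeek and Triggs [24] learn a latent aspect model with probabilistic Latent Semantic Analysis (pLSA) [8] over several appearance descriptors, and add a spanning-tree structure and Markov Random Fields to capture spatial information. Vezhnevets and Buhmann [25] cast the task as multi-instance multi-task learning within the Semantic Texton Forest (STF) framework [18]. Building on [25], Vezhnevets et al. [26, 27] bring into a CRF the latent correlations among superpixels that come from different images sharing the same labels. Xu et al. [30] simplify the earlier frameworks with a graphical model that encodes both the presence or absence of each class and the assignment of semantic labels to superpixels. [33] evaluates classifiers instead: classifier parameters are sampled at random, and the superpixel classifiers are scored by the distance between ground-truth negative samples and predicted positive samples.

All of these approaches assume that the image-level training labels are correct and complete, which often does not hold in practice. The realistic problem is one where the goal is pixel-level labels but the input is noisy image-level annotation. The authors set themselves apart in three ways. They study label correlation through label co-occurrence statistics and visual contextual cues at the same time, unlike [24, 25, 26, 27, 30]. They use a latent semantic concept model as a mid-level representation, which makes the framework more robust to noise and narrows the gap between semantic space and feature space, whereas earlier methods (e.g., [26, 30]) used only a low-level appearance model. And they use segmentations at multiple scales, because a single segmentation, as in [27, 30], cannot cover objects at different quantization levels.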

 

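High-quality papers really are different. Their related work is genuinely related, thoroughly so: this one even cites the earliest uses of CRFs for image segmentation. If you want to write good papers, read a lot of excellent ones.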

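This section first covers the early CRF-based segmentation work. Those methods depend on full supervision, so they do not carry over to the weakly supervised setting. It then covers weakly supervised methods, several of which also build on CRFs. The paper's treatment of CRFs goes deep, and reading through all of its references would be well worthwhile. The authors argue that existing methods trained on image-level labels are badly affected by label quality, so they fall short as well. Then, still within the related work, they once again cannot wait to state that their own approach handles both the noisy-label problem and segmentation from image-level labels.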

 

3. The Proposed Model

Suppose that each image $I$ is associated with a label vector $y = [y_1, \ldots, y_L]$, where $L$ is the number of categories, and $y_i = 1$ indicates that the $i$-th category is present in this image, otherwise $y_i = 0$. In the training set, $y$ is given; however, it might be noisy. In the test set, $y$ is unknown. For each image, we firstly employ the existing multi-scale segmentation algorithm to get a set of superpixels $\{x_p\}_{p=1}^{M}$ over multiple quantization levels. Here, $M$ is the total number of superpixels in image $I$. The label of superpixel $x_p$ is denoted as $h_p \in \{1, 2, \ldots, L\}$, and the labels of all superpixels for image $I$ are $h = [h_1, \ldots, h_M]$, which are not available for training. Our goal is to infer semantic label for each superpixel in an image and the adjacent superpixels sharing the same semantic label are fused as a whole one. We jointly build a conditional random field (CRF) over the image-level label variables $y$ and the superpixel-level label variables $h$. We leverage label-pair correlation and connect each superpixel to its neighbors to encode local smoothness constraints. Thus we formulate an energy function $E$ with five types of potentials as follows:

 

 

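Paraphrasing the setup: each image $I$ comes with a label vector $y = [y_1, \ldots, y_L]$, where $L$ is the number of categories and $y_i = 1$ means category $i$ is present in the image ($y_i = 0$ otherwise). In the training set $y$ is given but may be noisy; in the test set it is unknown. An existing multi-scale segmentation algorithm splits each image into a set of superpixels $\{x_p\}_{p=1}^{M}$ over multiple quantization levels, with $M$ the total number of superpixels in $I$. Superpixel $x_p$ has label $h_p \in \{1, 2, \ldots, L\}$, and the labels of all superpixels, $h = [h_1, \ldots, h_M]$, are not available for training. The goal is to infer a semantic label for every superpixel and then merge adjacent superpixels that share a label into one region. A CRF is built jointly over the image-level variables $y$ and the superpixel-level variables $h$; it uses label-pair correlation, and it connects each superpixel to its neighbors to encode local smoothness. The resulting energy function $E$ is made up of five types of potentials.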
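To keep the notation straight, here is a small sketch of the two label variables and of what "label consistency between image-level and pixel-level" could mean in the simplest case. It is my own illustration under stated assumptions, not the paper's potential: the helper names and the toy values are hypothetical, and labels are 0-based here while the paper numbers them from 1 to L.

```python
def image_labels_from_superpixels(h, num_labels):
    """Build the binary image-level vector implied by superpixel labels h."""
    implied = [0] * num_labels
    for label in h:
        implied[label] = 1
    return implied


def consistency_gaps(y_given, h):
    """Compare the given (possibly noisy) image-level tags with what h implies.

    Returns two lists of label indices:
      unsupported: tagged on the image, but no superpixel carries the label
                   (a candidate for an incorrect extra tag);
      untagged:    carried by some superpixel, but absent from the tags
                   (a candidate for a missing tag).
    """
    implied = image_labels_from_superpixels(h, len(y_given))
    pairs = list(zip(y_given, implied))
    unsupported = [i for i, (given, imp) in enumerate(pairs) if given == 1 and imp == 0]
    untagged = [i for i, (given, imp) in enumerate(pairs) if given == 0 and imp == 1]
    return unsupported, untagged


# Hypothetical example with L = 4 categories and M = 5 superpixels.
y_given = [1, 0, 1, 1]   # image-level tags, possibly noisy
h = [0, 0, 1, 2, 2]      # one label index per superpixel
unsupported, untagged = consistency_gaps(y_given, h)
print(unsupported, untagged)
```

A CRF would turn disagreements like these into an energy penalty instead of a hard check, so that the tags and the superpixel labels can correct each other during refinement.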

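The authors say they use the CRF to decide the category of every pixel (through its superpixel), and one of the most important cues in that decision is context, that is, the neighboring superpixels.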
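The neighbor cue is usually written as a pairwise smoothness term. As a sketch, here is the generic Potts penalty, a common textbook choice that charges a fixed cost whenever two adjacent superpixels take different labels. I am not claiming this is the paper's exact pairwise potential; the function name, the edge list, and the weight are assumptions for illustration.

```python
def potts_smoothness(h, edges, weight=1.0):
    """Sum a fixed penalty over every adjacent pair with different labels.

    h:      list of label indices, one per superpixel
    edges:  list of (p, q) index pairs for neighboring superpixels
    weight: cost charged for each disagreeing pair
    """
    return sum(weight for p, q in edges if h[p] != h[q])


# Four superpixels in a row: 0 - 1 - 2 - 3 (hypothetical adjacency).
edges = [(0, 1), (1, 2), (2, 3)]
smooth_labels = [0, 0, 1, 1]   # one label change along the row
noisy_labels = [0, 1, 0, 1]    # label changes at every edge

print(potts_smoothness(smooth_labels, edges))
print(potts_smoothness(noisy_labels, edges))
```

Lower energy means a smoother labeling, so minimizing a term like this pushes neighboring superpixels toward the same label unless the other potentials argue strongly against it.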

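In fact, ever since FCN, context and neighboring pixels have remained a central research direction in semantic segmentation. The two 2015 papers, FCN and this one, both turned out to be important signposts for later work on image segmentation.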

 

I will update the rest after I get home.

 

 

 
