The Road to Advanced Semantic Segmentation: Looking Back at CVPR 2015 (Part 5)

Feedforward Semantic Segmentation With Zoom-Out Features [full paper] [ext. abstract]
Mohammadreza Mostajabi, Payman Yadollahpour, Gregory Shakhnarovich

 

Semantic segmentation work around 2015 fell roughly into two schools: the random-field school and the superpixel school. As of now (August 2019), the best-performing image segmentation algorithms have gone the random-field route, but the superpixel school remains an active research direction.

That this paper appeared in the CVPR 2015 proceedings shows it is highly representative; its core idea is semantic segmentation in the superpixel tradition.

Back in 2015, superpixel-based semantic segmentation actually held the dominant position: many first-rate dataset production and collection companies annotated at the superpixel level, which says a lot about the standing of superpixel methods in image segmentation.

 

Abstract

We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by “zooming out” from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead superpixels are classified by a feedforward multilayer network. Our architecture achieves 69.6% average accuracy on the PASCAL VOC 2012 test set.

 

Honestly, the abstract is not easy to parse. The paper takes a “multi-scale” approach to segmentation, but not the SPPNet-style multi-scale we usually mean today. As the figure below shows, it first extracts convolutional features from the smallest unit, the superpixel itself, then from a slightly larger region around the superpixel, then from a larger region still, and finally from the whole image. That is this paper's trick: zooming out. The “zoom” here has nothing to do with the fast-fold Zoom tables in Texas hold'em poker; it simply means multi-scale.

[Figure: nested zoom-out regions, from the superpixel itself out to the whole image]

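To make the zoom-out trick concrete, here is a minimal sketch of extracting and concatenating features from nested regions around one superpixel. Everything in it is an illustrative assumption rather than the paper's implementation: the margin values, the zoom_out_feature helper, and the stand-in mean-color extractor (the paper extracts convnet features at the larger levels).

```python
import numpy as np

def zoom_out_feature(image, mask, extract, margins=(0, 25, 75)):
    """Concatenate features from nested regions around one superpixel.

    image:   (H, W, 3) array; mask: boolean (H, W) mask of the superpixel;
    extract: any function mapping an image crop to a 1-d feature vector;
    margins: paddings in pixels for the local/proximal/distant levels.
    """
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    feats = []
    for m in margins:  # local crop, then increasingly large zoom-out crops
        top, bottom = max(ys.min() - m, 0), min(ys.max() + m + 1, h)
        left, right = max(xs.min() - m, 0), min(xs.max() + m + 1, w)
        feats.append(extract(image[top:bottom, left:right]))
    feats.append(extract(image))  # final level: the whole image
    return np.concatenate(feats)

# Toy usage with mean color standing in for convnet features.
rng = np.random.default_rng(0)
img = rng.random((120, 160, 3))
sp = np.zeros((120, 160), dtype=bool)
sp[50:60, 70:85] = True
print(zoom_out_feature(img, sp, lambda c: c.mean(axis=(0, 1))).shape)  # (12,)
```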
Introduction

We consider one of the central vision tasks, semantic segmentation: assigning to each pixel in an image a category-level label. Despite attention it has received, it remains challenging, largely due to complex interactions between neighboring as well as distant image elements, the importance of global context, and the interplay between semantic labeling and instance-level detection. A widely accepted conventional wisdom, followed in much of modern segmentation literature, is that segmentation should be treated as a structured prediction task, which most often means using a random field or structured support vector machine model of considerable complexity. This in turn brings up severe challenges, among them the intractable nature of inference and learning in many “interesting” models. To alleviate this, many recently proposed methods rely on a pre-processing stage, or a few stages, to produce a manageable number of hypothesized regions, or even complete segmentations, for an image. These are then scored, ranked or combined in a variety of ways. Here we consider a departure from these conventions, and approach semantic segmentation as a single-stage classification task, in which each image element (superpixel) is labeled by a feedforward model, based on evidence computed from the image.

The “secret” behind our method is that the evidence used in the feedforward classification is not computed from a small local region in isolation, but collected from a sequence of levels, obtained by “zooming out” from the closeup view of the superpixel. Starting from the superpixel itself, to a small region surrounding it, to a larger region around it and all the way to the entire image, we compute a rich feature representation at each level and combine all the features before feeding them to a classifier. This allows us to exploit statistical structure in the label space and dependencies between image elements at different resolutions without explicitly encoding these in a complex model.

We do not mean to dismiss structured prediction or inference, and as we discuss in Section 5, these tools may be complementary to our architecture. In this paper we explore how far we can go without resorting to explicitly structured models. We use convolutional neural networks (convnets) to extract features from larger zoom-out regions. Convnets, (re)introduced to vision in 2012, have facilitated a dramatic advance in classification, detection, fine-grained recognition and other vision tasks. Segmentation has remained conspicuously left out from this wave of progress; while image classification and detection accuracies on VOC have improved by nearly 50% (relative), segmentation numbers have improved only modestly. A big reason for this is that neural networks are inherently geared for “non-structured” classification and regression, and it is still not clear how they can be harnessed in a structured prediction framework. In this work we propose a way to leverage the power of representations learned by convnets, by framing segmentation as classification and making the structured aspect of it implicit. Finally, we show that use of multi-layer neural network trained with asymmetric loss to classify superpixels represented by zoom-out features, leads to significant improvement in segmentation accuracy over simpler models and conventional (symmetric) loss. Below we give a high-level description of our method, then discuss related work and position our work in its context. Most of the technical details are deferred to Section 4 in which we describe implementation and report on results, before concluding in Section 5.

 

I can't say I admire the paper's prose, but an acceptance at CVPR 2015 means the technique is strong. The introduction mainly argues that conditional random fields make things cumbersome, superpixels work better, and convolutional neural networks applied to superpixels work better still.
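The introduction also mentions training the superpixel classifier with an asymmetric loss; the details live in the paper's Section 4. One plausible reading is class-weighted cross-entropy, sketched below in PyTorch. The feature dimension, layer sizes, and inverse-frequency weighting are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 12416-d zoom-out feature per superpixel, 21 VOC classes.
FEAT_DIM, NUM_CLASSES = 12416, 21

# A small feedforward classifier over per-superpixel zoom-out features.
classifier = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, NUM_CLASSES),
)

# "Asymmetric" loss read here as inverse-frequency class weighting, so rare
# classes are not drowned out by background superpixels (an assumption, not
# necessarily the paper's exact formulation).
class_freq = torch.rand(NUM_CLASSES).clamp(min=1e-3)  # placeholder frequencies
loss_fn = nn.CrossEntropyLoss(weight=1.0 / class_freq)

features = torch.randn(32, FEAT_DIM)           # a batch of superpixel features
labels = torch.randint(0, NUM_CLASSES, (32,))  # their ground-truth classes
loss = loss_fn(classifier(features), labels)
loss.backward()
```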

2. Zoom-out feature fusion

We cast category-level segmentation of an image as classifying a set of superpixels. Since we expect to apply the same classification machine to every superpixel, we would like the nature of the superpixels to be similar, in particular their size. In our experiments we use SLIC [1], but other methods that produce a nearly-uniform grid of superpixels might work similarly well. Figure 2 provides a few illustrative examples for this discussion.
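As a concrete reference point, here is a minimal sketch of generating a near-uniform superpixel grid with scikit-image's SLIC implementation; the n_segments and compactness values are illustrative, not the paper's settings.

```python
import numpy as np
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()  # any RGB image as an (H, W, 3) array

# Near-uniform superpixel grid; n_segments and compactness are illustrative.
segments = slic(image, n_segments=500, compactness=10, start_label=0)

print(segments.shape)            # (H, W) map of superpixel ids
print(np.unique(segments).size)  # roughly 500 superpixels
```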

Local. The narrowest scope is the superpixel itself. We expect the features extracted here to capture local evidence: color, texture, small intensity/gradient patterns, and other properties computed over a relatively small contiguous set of pixels. The local features may be quite different even for neighboring superpixels, especially if these straddle category or object boundaries.

Proximal. As we zoom out and include a larger spatial area around the superpixel, we can capture visual cues from surrounding superpixels. Features computed from these levels may capture information not available in the local scope; e.g., for locations at the boundaries of objects they will represent the appearance of both categories. For classes with non-uniform appearance they may better capture characteristic distributions for that class. We can expect somewhat more complex features to be useful at this level, but it is usually still too myopic for confident reasoning about presence of objects. Two neighboring superpixels could still have quite different features at this level, however some degree of smoothness is likely to arise from the significant overlap between neighbors’ proximal regions, e.g., A and B in Fig. 2. As another example, consider color features over the body of a leopard; superpixels for individual dark brown spots might appear quite different from their neighbors (yellow fur) but their proximal regions will have pretty similar distributions (mix of yellow and brown). Superpixels that are sufficiently far from each other could still, of course, have drastically different proximal features, e.g., A and C in Fig. 2.

Distant. Zooming out further, we move to the distant levels: regions large enough to include sizeable fractions of objects, and sometimes entire objects. At this level our scope is wide enough to allow reasoning about shape, presence of more complex patterns in color and gradient, and the spatial layout of such patterns. Therefore we can expect more complex features that represent these properties to be useful here. Distant regions are more likely to straddle true boundaries in the image, and so this higher-level feature extraction may include a significant area in both the category of the superpixel at hand and nearby categories. For example, consider a person sitting on a chair; a bottle on a dining table; pasture animals on the background of grass, etc. Naturally we expect this to provide useful information on both the appearance of a class and its context. For nearby superpixels and a far enough zoom-out level, distant regions will have a very large overlap, which will gradually diminish with distance between superpixels. This is likely to lead to somewhat gradual changes in features, and to impose a system of implicit smoothness “terms”, which depend both on the distance in the image and on the similarity in appearance in and around superpixels. Imposing such smoothness in a CRF usually leads to a very complex, intractable model.

 

He then introduces his scheme, which gathers evidence for segmentation at several levels: local, proximal, distant, and scene.
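The implicit-smoothness point is easy to verify numerically: the further apart two superpixels are, the less their distant-level regions overlap. A toy sketch follows (the box sizes and distances are arbitrary assumptions, not values from the paper):

```python
import numpy as np

def box(cy, cx, half, shape):
    """Clipped square region of side ~2*half centered at (cy, cx)."""
    h, w = shape
    return (max(cy - half, 0), min(cy + half, h),
            max(cx - half, 0), min(cx + half, w))

def overlap(a, b):
    """Intersection-over-union of two (top, bottom, left, right) boxes."""
    top, bottom = max(a[0], b[0]), min(a[1], b[1])
    left, right = max(a[2], b[2]), min(a[3], b[3])
    inter = max(bottom - top, 0) * max(right - left, 0)
    area = lambda r: (r[1] - r[0]) * (r[3] - r[2])
    return inter / (area(a) + area(b) - inter)

shape = (500, 500)
anchor = box(250, 250, 100, shape)  # distant-level region of one superpixel
for dx in (10, 50, 150, 300):       # distance to a second superpixel center
    print(dx, round(overlap(anchor, box(250, 250 + dx, 100, shape)), 2))
# Nearby superpixels share most of their distant region; far-apart ones share
# none. This falloff is the implicit smoothness the paper describes.
```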
