The Road to Advanced Semantic Segmentation: Looking Back at CVPR 2016 (Part 2)

Hierarchically Gated Deep Networks for Semantic Segmentation

Guo-Jun Qi

 

This paper combines LSTMs with CNNs: a hierarchical gating structure is added inside the CNN to decide whether features belong to the same scale.

Looking at how semantic segmentation evolved from 2015 to 2016, the multi-scale problem has been a recurring theme across many papers, and it is also the problem this paper sets out to solve.

 

Abstract

Semantic segmentation aims to parse the scene structure of images by annotating the labels to each pixel so that images can be segmented into different regions. While image structures usually have various scales, it is difficult to use a single scale to model the spatial contexts for all individual pixels. Multi-scale Convolutional Neural Networks (CNNs) and their variants have made striking success for modeling the global scene structure for an image. However, they are limited in labeling fine-grained local structures like pixels and patches, since spatial contexts might be blindly mixed up without appropriately customizing their scales. To address this challenge, we develop a novel paradigm of multiscale deep network to model spatial contexts surrounding different pixels at various scales. It builds multiple layers of memory cells, learning feature representations for individual pixels at their customized scales by hierarchically absorbing relevant spatial contexts via memory gates between layers. Such Hierarchically Gated Deep Networks (HGDNs) can customize a suitable scale for each pixel, thereby delivering better performance on labeling scene structures of various scales. We conduct the experiments on two datasets, and show competitive results compared with the other multi-scale deep networks on the semantic segmentation task.    


The abstract lays out the paper's main innovation: multi-scale modeling. The proposed structure can analyze multiple scales and then segment each scale separately. In essence, it borrows LSTM-style memory cells and adds a decision gate at each layer, so that pixels belonging to the same scale are processed together.

 

Introduction

The goal of semantic segmentation [2, 17, 1] is to segment images into different regions, usually by assigning one of the semantic labels to each pixel. It is a crucial step towards understanding image scene structures. The label of an image pixel cannot be determined by its local features extracted from a small sliding window or the neighborhood surrounding it. Rather, pixel labels are usually defined in spatial contexts, whose scales often vary greatly in size. For example, sky and sea have a large scale of spatial context, but vessels and pedestrians are more localized to a relatively small scale of context. Moreover, even regions of the same category may have various sizes of spatial contexts, making it impossible to fix a scale to model each pixel. This inspires us to develop a model that is capable of learning to customize spatial contexts and their scales for individual pixels in an image.

To model the spatial context of a pixel, a typical approach is to model the dependencies between adjacent local image structures based on 2D Markov Random Fields [4][12][13] and Conditional Random Fields [7][21][9]. These models usually capture the local similarity between adjacent image structures of various scales, ranging from pixels and patches to regions. Then, scene labeling is performed by maximizing the consistency between similar neighbors, which are considered as being in the same spatial context.

On the other hand, the success of the deep learning framework on the ImageNet challenge [11] has inspired us to apply hierarchical neural networks to build the spatial context on various scales. Convolutional Neural Networks (CNNs) [14], among all deep learning models, have shown striking performance in modeling image structures on different levels with multiple layered convolutional kernels. The CNN models have been generalized to scene labeling. For example, multi-scale CNNs are proposed in [3], which produce and concatenate the feature maps of all scales. Usually, the learned features have to be post-processed by up-sampling coarser-scale maps to match finer-grained image pixels, and global contextual coherence and spatial consistency are imposed by CRFs and segmentation trees. Long et al. [16] propose an alternative paradigm of fully convolutional networks to annotate images pixel-wise. They present an approach to de-convolve coarser-scale output maps to label the image pixels. Unlike these two CNN-based models, we are interested in a hierarchical network which not only explores multi-scale structures of input images, but also avoids producing coarse results that have to be up-sampled for labeling pixels. Such a model is highlighted with a customized scale of the spatial context for each individual pixel, so that the local structures can be modeled on a suitable scale.

Recently, Long Short-Term Memory (LSTM) recurrent neural networks [8] have been applied to scene labeling [1]. LSTM networks were originally used to model sequential data, such as sentences [19] and videos [22]. The networks are composed of a series of recurrently connected memory cells, which are able to capture long-range dependencies between different time frames. All the information entering and leaving the memory cells is controlled by several types of gates, which ensures that only the information relevant to the task is maintained in their memory spaces.
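To make the multi-scale CNN baseline of [3] mentioned above concrete, here is a minimal sketch: run a shared convolutional trunk over an image pyramid, up-sample the coarser maps back to pixel resolution, and concatenate them. This is my own illustration, not the cited architecture; the trunk layout, channel counts, and scale set are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatures(nn.Module):
    """Sketch of a multi-scale CNN feature extractor (illustrative only):
    a shared trunk is applied at several pyramid scales, and the coarser
    maps are up-sampled and concatenated at full resolution."""
    def __init__(self, in_ch=3, feat_ch=16, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, img):
        h, w = img.shape[-2:]
        maps = []
        for s in self.scales:
            x = img if s == 1.0 else F.interpolate(
                img, scale_factor=s, mode='bilinear', align_corners=False)
            f = self.trunk(x)
            # Coarser scales must be up-sampled back to pixel resolution
            # before concatenation -- the post-processing step the paper
            # criticizes for blindly mixing spatial contexts.
            maps.append(F.interpolate(f, size=(h, w), mode='bilinear',
                                      align_corners=False))
        return torch.cat(maps, dim=1)

feats = MultiScaleFeatures()(torch.randn(1, 3, 64, 64))
print(feats.shape)  # torch.Size([1, 48, 64, 64])
```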


 

The introduction spends some space on the existing multi-scale semantic segmentation approaches, namely MRFs and CRFs. The role of these two kinds of random fields is essentially to smooth the result after the final up-sampling step. The authors then contrast them with their own hierarchical design and spell out the main advantages of their method.

The first part explains how the hierarchical structure effectively tackles the accuracy problem on multi-scale images; the second part explains how, borrowing from LSTMs, a gate structure added at each convolutional layer (the HGDN) can effectively separate pixel blocks of different scales, so that each scale can then be handled by the subsequent operations (see the sketch below).
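To make the gating idea concrete, here is a minimal PyTorch sketch. It is not the authors' exact formulation: the class name `GatedLayer`, the convex-combination update, and the layer sizes are all illustrative assumptions. The intuition is that each pixel's gate decides whether to keep absorbing the memory passed up from the coarser layer below, or to stop because its scale is already matched.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """One layer of a hierarchically gated stack (illustrative only).

    A per-pixel gate decides how much context from the layer below is
    absorbed into this layer's memory; the exact update rule here is an
    assumption, not the paper's formulation."""
    def __init__(self, channels):
        super().__init__()
        self.feat = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x, memory):
        h = torch.relu(self.feat(x))
        g = torch.sigmoid(self.gate(torch.cat([h, memory], dim=1)))
        # g near 1: keep absorbing the coarser-scale memory;
        # g near 0: stop, this pixel's scale is already customized.
        return g * memory + (1.0 - g) * h

# Tiny smoke test on a 1x16x32x32 feature map.
layer = GatedLayer(16)
x = torch.randn(1, 16, 32, 32)
m = torch.zeros(1, 16, 32, 32)
print(layer(x, m).shape)  # torch.Size([1, 16, 32, 32])
```

Stacking several such layers lets different pixels settle at different depths, which is the sense in which the scale is "customized" per pixel.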

A quick note on LSTMs first: friends working on NLP mention LSTMs all the time, but they are rarely used in recognition and detection; in semantic segmentation, however, LSTMs became popular again. The LSTM belongs to the RNN family, and in semantic segmentation it can be understood as modeling the relationship between adjacent pixels, i.e., the LSTM lets adjacent pixels inform each other about whether they belong to the same class.
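As a rough illustration of that idea, the sketch below runs an ordinary LSTM along each image row, so each pixel's hidden state depends on its left neighbors. This is my own minimal 1-D reduction (real 2-D LSTM labelers sweep multiple directions), and all the sizes are toy values.

```python
import torch
import torch.nn as nn

# Treat each image row as a sequence so each pixel's hidden state can
# depend on its left neighbors (1-D sweep only, for illustration).
B, C, H, W, hidden, n_classes = 1, 16, 8, 8, 32, 5
feat = torch.randn(B, C, H, W)            # per-pixel CNN features

rows = feat.permute(0, 2, 3, 1).reshape(B * H, W, C)   # (B*H, W, C)
lstm = nn.LSTM(input_size=C, hidden_size=hidden, batch_first=True)
out, _ = lstm(rows)                                     # (B*H, W, hidden)
logits = nn.Linear(hidden, n_classes)(out)              # per-pixel scores
logits = logits.reshape(B, H, W, n_classes).permute(0, 3, 1, 2)
print(logits.shape)  # torch.Size([1, 5, 8, 8])
```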

Another term worth explaining is "post-processing" in deep learning, a fairly common word in 2016 semantic segmentation papers. After deconvolution and up-sampling, the output is still fragmented; to smooth it out, an extra operation is applied on top, and that operation is called post-processing. A CRF is the usual choice for this post-processing step.
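For reference, here is a minimal sketch of dense-CRF post-processing using the third-party pydensecrf package, the common choice for this kind of refinement. The inputs are random toy data, and the kernel parameters (sxy, srgb, compat) are typical values I am assuming, not ones taken from this paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

# Toy stand-ins for a network's softmax output and the RGB image.
H, W, n_classes = 64, 64, 5
probs = np.random.dirichlet(np.ones(n_classes), size=H * W)
probs = probs.reshape(H, W, n_classes).transpose(2, 0, 1)   # (C, H, W)
image = np.random.randint(0, 255, (H, W, 3), dtype=np.uint8)

d = dcrf.DenseCRF2D(W, H, n_classes)
d.setUnaryEnergy(unary_from_softmax(probs))
# Smoothness kernel: nearby pixels prefer the same label.
d.addPairwiseGaussian(sxy=3, compat=3)
# Appearance kernel: nearby pixels with similar color prefer the same label.
d.addPairwiseBilateral(sxy=80, srgb=13, rgbim=image, compat=10)

labels = np.argmax(d.inference(5), axis=0).reshape(H, W)
print(labels.shape)  # (64, 64)
```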

The RELATED WORK section covers some research applying LSTMs to multi-scale problems. It is not especially significant, since the mainstream use of LSTMs is not in images, and the authors are quite sensible in giving related work only a small amount of space. Put another way, most of the ideas this paper adopts are novel, so there genuinely was little prior work; we will skip related work as well.
