The Road to Advanced Semantic Segmentation: Looking Back at CVPR 2016 (Part 2)

Hierarchically Gated Deep Networks for Semantic Segmentation

Guo-Jun Qi

 

This paper combines LSTM with CNNs, adding a hierarchical gating structure to the CNN that judges whether spatial contexts belong to the same scale.

Looking at how semantic segmentation developed from 2015 to 2016, the multi-scale problem has been the focus of many papers, and it is also the problem this paper sets out to solve.

 

Abstract

Semantic segmentation aims to parse the scene structure of images by annotating the labels to each pixel so that images can be segmented into different regions. While image structures usually have various scales, it is difficult to use a single scale to model the spatial contexts for all individual pixels. Multi-scale Convolutional Neural Networks (CNNs) and their variants have made striking success for modeling the global scene structure for an image. However, they are limited in labeling fine-grained local structures like pixels and patches, since spatial contexts might be blindly mixed up without appropriately customizing their scales. To address this challenge, we develop a novel paradigm of multiscale deep network to model spatial contexts surrounding different pixels at various scales. It builds multiple layers of memory cells, learning feature representations for individual pixels at their customized scales by hierarchically absorbing relevant spatial contexts via memory gates between layers. Such Hierarchically Gated Deep Networks (HGDNs) can customize a suitable scale for each pixel, thereby delivering better performance on labeling scene structures of various scales. We conduct the experiments on two datasets, and show competitive results compared with the other multi-scale deep networks on the semantic segmentation task.    


The abstract lays out the paper's main idea: multi-scale modeling. The proposed architecture can analyze multiple scales and then segment each scale separately. In essence it borrows the memory cell from LSTM and adds a judgment gate at each layer, so that contexts of the same scale are processed together.

 

Introduction

The goal of semantic segmentation [2, 17, 1] is to segment images into different regions, usually by assigning one of the semantic labels to each pixel. It is a crucial step towards understanding image scene structures. The label of an image pixel cannot be determined by its local features extracted from a small sliding window or the neighborhood surrounding it. Rather, pixel labels are usually defined in spatial contexts, whose scales often have large variation in sizes. For example, sky and sea have a large scale of spatial context, but vessel and pedestrian are more localized to a relatively small scale of context. Moreover, even the regions of the same category may have various sizes of spatial contexts, making it impossible to fix a scale to model each pixel. This inspires us to develop a model that is capable of learning to customize spatial contexts and their scales for individual pixels in an image.

To model the spatial context of a pixel, a typical approach is to model the dependencies between the adjacent local image structures based on 2D Markov Random Fields [4][12][13] and Conditional Random Fields [7][21][9]. These models usually capture the local similarity between adjacent image structures of various scales, ranging from pixels and patches to regions. Then, the scene labeling is performed by maximizing the consistency between the similar neighbors which are considered as being in the same spatial context.

On the other hand, the success of the deep learning framework on the ImageNet challenge [11] has inspired us to apply hierarchical neural networks to build the spatial context on various scales. Convolutional Neural Networks (CNNs) [14], among all deep learning models, have shown their striking performance on modeling image structures on different levels with multiple layered convolutional kernels. The CNN models have been generalized to scene labeling. For example, multi-scale CNNs are proposed in [3], which produce and concatenate the feature maps of all scales. Usually, the learned features have to be post-processed by up-sampling coarser-scale maps to match finer-grained image pixels, and the global contextual coherence and spatial consistency are imposed by CRFs and segmentation trees. Long et al. [16] propose an alternative paradigm of fully convolutional networks to annotate images pixel-wise. They present an approach to de-convolve coarser scales of output maps to label the image pixels.

Unlike these two CNN-based models, we are interested in a hierarchical network which not only explores multi-scale structures of input images, but also avoids producing coarse results that have to be up-sampled for labeling pixels. Such a model is highlighted with a customized scale of the spatial context for each individual pixel so that the local structures can be modeled on a suitable scale.

Recently, Long Short-Term Memory (LSTM) recurrent neural networks [8] have been applied to scene labeling [1]. LSTM networks were originally used to model sequential data, such as sentences [19] and videos [22]. The networks are composed of a series of recurrently connected memory cells, which are able to capture long-range dependencies between different time frames. All the information entering and leaving the memory cells is controlled by several types of gates, which ensures that only the information relevant to the task is maintained in their memory spaces.


 

The introduction spends some space reviewing the existing multi-scale semantic segmentation approaches, namely the two random-field models, MRFs and CRFs; both enforce consistency between neighboring image structures, which in practice makes the final upsampled result smoother. The authors then contrast these with their own hierarchical scheme and spell out the main advantages of their algorithm.

The first part explains how the "hierarchical structure" can effectively solve the accuracy problem on multi-scale images; the second part explains how, borrowing from LSTM, a gate structure added at each convolutional layer (the HGDN) can effectively separate pixel blocks of different scales, so that each scale can then be handled by the subsequent operations. A minimal sketch of such a per-layer gate is given below.
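The paper ships no reference code, so the following is only a rough PyTorch sketch of the gating idea, under the simplifying assumption that the gate is a learned convolution over the concatenation of the current cell state and the incoming coarser-scale context; the class and layer names (`GatedScaleLayer`, `gate`, `candidate`) are our own, not the paper's.

```python
import torch
import torch.nn as nn

class GatedScaleLayer(nn.Module):
    """One HGDN-style layer (our simplified reading): a per-pixel memory
    gate decides how much coarser-scale context to absorb into the cell."""

    def __init__(self, channels):
        super().__init__()
        # The gate sees both the cell state from the layer below and the new context.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.candidate = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, cell_below, context):
        # g -> 1: keep absorbing the wider context (the pixel's scale grows);
        # g -> 0: stop here, the pixel's customized scale has been reached.
        g = torch.sigmoid(self.gate(torch.cat([cell_below, context], dim=1)))
        return g * torch.tanh(self.candidate(context)) + (1 - g) * cell_below

# Toy usage: stacking such layers lets each pixel settle at its own scale.
layer = GatedScaleLayer(channels=16)
cell = torch.randn(1, 16, 64, 64)   # cell state from the layer below
ctx = torch.randn(1, 16, 64, 64)    # context pooled at a coarser scale
cell_next = layer(cell, ctx)        # (1, 16, 64, 64)
```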

A quick word on LSTM first: friends who do NLP bring up LSTMs all the time, but they are rarely used in recognition and detection; once we get to semantic segmentation, though, LSTMs become popular again. LSTM belongs to the RNN family; in semantic segmentation it can be understood as modeling the relationship between adjacent pixels, i.e., the recurrence lets the network judge whether neighboring pixels belong to the same category. A toy illustration follows.
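To make "LSTM over adjacent pixels" concrete, here is a toy sketch that scans each image row with a standard LSTM, so every pixel's feature depends on its left neighbors. This is just one simple way to wire recurrence into an image, not the paper's actual topology; `RowLSTM` is a hypothetical name.

```python
import torch
import torch.nn as nn

class RowLSTM(nn.Module):
    """Toy row-wise recurrence: each image row is treated as a sequence,
    so a pixel's hidden state carries context from the pixels to its left."""

    def __init__(self, in_channels, hidden):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, batch_first=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one sequence per row
        out, _ = self.lstm(rows)                           # (B*H, W, hidden)
        return out.reshape(b, h, w, -1).permute(0, 3, 1, 2)

feats = torch.randn(2, 8, 32, 32)
print(RowLSTM(in_channels=8, hidden=16)(feats).shape)  # torch.Size([2, 16, 32, 32])
```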

Another term worth explaining is deep-learning "post-processing", a fairly common phrase in 2016 semantic segmentation papers. Translated literally, it means processing applied afterwards: after deconvolution and upsampling, the prediction map is still fragmented, so one more operation is added to smooth it out. That operation is called post-processing, and a CRF is the usual choice. A hedged example is sketched below.
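As a concrete example of this kind of CRF post-processing (not something this paper uses; HGDN is designed precisely to avoid it), here is a minimal sketch assuming the widely used pydensecrf package; the parameter values are illustrative defaults, not tuned settings.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_postprocess(image, probs, n_iters=5):
    """Smooth a fragmented softmax map with a fully connected CRF.
    image: (H, W, 3) uint8 RGB;  probs: (C, H, W) softmax probabilities."""
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs))   # -log(prob) unary potentials
    d.addPairwiseGaussian(sxy=3, compat=3)        # location-only smoothing term
    d.addPairwiseBilateral(sxy=80, srgb=13,       # color + location term
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)                      # mean-field iterations
    return np.argmax(np.array(q), axis=0).reshape(h, w)
```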

The RELATED WORK section reviews some research on applying LSTMs to multi-scale problems. It carries limited weight, since the mainstream use of LSTMs is not in images. The authors are refreshingly honest about this and give related work only a small amount of space; put another way, much of what this paper does is genuinely novel, so there simply was little closely related prior work. We will skip the related work section here as well.

 

 

 
