(ICCV 2015) Human Parsing with Contextualized Convolutional Neural Network
(T-PAMI 2016) Human Parsing with Contextualized Convolutional Neural Network
Paper: http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Liang_Human_Parsing_With_ICCV_2015_paper.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7423822&queryText=human%20parsing%20with%20contextualized&newsearch=true
Project: http://hcp.sysu.edu.cn/deep-human-parsing/
The paper proposes the Contextualized Convolutional Neural Network (Co-CNN), which incorporates cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a CNN.
cross-layer context: a local-to-global-to-local structure that feeds feature maps from earlier layers into later layers (earlier feature maps are added into later ones, and a convolved version of the input image is added into the last feature map), combining the global semantic information and the local fine details of different convolutional layers.
global image-level label prediction: all labels appearing in the segmentation are collected into a binary vector and predicted as a multi-label classification task.
within-super-pixel smoothing and cross-super-pixel neighborhood voting: smooth the local label predictions.
In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which integrates the cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network.
the cross-layer context is captured by our basic local-to-global-to-local structure, which hierarchically combines the global semantic information and the local fine details across different convolutional layers.
the global image-level label prediction is used as an auxiliary objective in the intermediate layer of the Co-CNN.
the within-super-pixel smoothing and cross-super-pixel neighborhood voting are formulated as natural subcomponents of the Co-CNN to achieve local label consistency in both the training and testing processes.
Introduction
none of the previous methods has achieved excellent dense prediction over raw image pixels in a fully end-to-end way.
diverse contextual information and mutual relationships among the key components of human parsing (i.e. semantic labels, spatial layouts and shape priors) should be well addressed when predicting the pixel-wise labels.
the predicted label maps are desired to be detail-preserving and of high resolution, in order to recognize or highlight very small labels (e.g. sunglasses or belts).
In this paper, we present a novel Contextualized Convolutional Neural Network (Co-CNN) that successfully addresses the above mentioned issues.
Figure 1. Our Co-CNN integrates the cross-layer context, global image-level context and local super-pixel contexts into a unified network. It consists of cross-layer combination, global image-level label prediction, within-super-pixel smoothing and cross-super-pixel neighborhood voting.
given an input 150 × 100 image, we extract the feature maps for four resolutions (i.e., 150 × 100, 75 × 50, 37 × 25 and 18 × 12). Then we gradually up-sample the feature maps and combine the corresponding early, fine layers (blue dash line) and deep, coarse layers (blue circle with plus) under the same resolutions to capture the cross-layer context.
an auxiliary objective (shown as “Squared loss on image-level labels”) is appended after the down-sampling stream to predict global image-level labels. These predicted probabilities are then aggregated into the subsequent layers after the up-sampling (green line) and used to re-weight pixel-wise prediction (green circle with plus).
the within-super-pixel smoothing and cross-super-pixel neighborhood voting are performed based on the predicted confidence maps (orange planes) and the generated super-pixel over-segmentation map to produce the final parsing result.
Only down-sampling, up-sampling, and prediction layers are shown; intermediate convolution layers are omitted. For better viewing of all figures in this paper, please see the original zoomed-in color PDF file.
Related Work
Human Parsing
Semantic Segmentation with CNN
The Proposed Co-CNN Architecture
Local-to-global-to-local Hierarchy
Our basic local-to-global-to-local structure captures the cross-layer context. It simultaneously considers the local fine details and global structure information.
the early convolutional layers with high spatial resolutions (e.g.,
The feature maps up-sampled from the low resolutions and those from the high resolutions are then aggregated with the element-wise summation, shown as the blue circle with plus in Figure 1.
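The cross-layer combination can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the nearest-neighbor up-sampling, channel count, and resolutions here are assumptions chosen to mirror the 75 × 50 → 150 × 100 step in Figure 1.

```python
import numpy as np

def upsample_nn(fmap, factor):
    """Nearest-neighbor up-sampling of a (C, H, W) feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def cross_layer_combine(coarse, fine, factor):
    """Up-sample the deep, coarse feature map and element-wise sum it with
    the early, fine feature map at the same resolution (the blue
    circle-with-plus in Figure 1)."""
    return upsample_nn(coarse, factor) + fine

# toy example: combine a 75x50 coarse map with a 150x100 fine map
coarse = np.random.rand(16, 75, 50).astype(np.float32)
fine = np.random.rand(16, 150, 100).astype(np.float32)
out = cross_layer_combine(coarse, fine, factor=2)
```

In the actual network the up-sampling is a learned layer, but the element-wise summation of same-resolution maps is exactly this operation.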
To capture more detailed local boundaries, the input image is further filtered with the
Global Image-level Context
An auxiliary objective for multi-label prediction is used after the intermediate layers with spatial resolution of
Following the fully-connected layer, the C-way softmax, which produces a probability distribution over the C class labels, is appended.
Squared loss is used during the global image label prediction.
Suppose for each image
the predicted label probabilities are used in two ways: concatenated with the intermediate convolutional layers (image label concatenation in Figure 1) and element-wise summed with the label confidence maps (element-wise summation in Figure 1).
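A minimal sketch of the global image-level branch, under stated assumptions: the toy label count, logits, and ground-truth vector below are invented for illustration, and the re-weighting is shown only as the element-wise summation path (the concatenation path is omitted).

```python
import numpy as np

def softmax(x):
    """C-way softmax producing a probability distribution over C labels."""
    e = np.exp(x - x.max())
    return e / e.sum()

C, H, W = 5, 6, 4  # toy label count and spatial size

# auxiliary branch: predict image-level label probabilities
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.0])
p_img = softmax(logits)

# squared loss against the binary image-level ground-truth label vector
y = np.array([1.0, 0.0, 0.0, 0.0, 1.0])
loss = np.sum((p_img - y) ** 2)

# re-weight the pixel-wise confidence maps (C, H, W) by element-wise
# summation with the predicted probabilities, broadcast per channel
conf = np.random.rand(C, H, W)
reweighted = conf + p_img[:, None, None]
```

The effect is that labels the auxiliary branch considers absent from the image contribute less to the pixel-wise prediction.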
Local Super-pixel Context
Using super-pixel guidance only at a later stage is advantageous: it avoids making premature decisions and thus learning unsatisfactory convolution filters.
Within-super-pixel Smoothing
For each input image
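Within-super-pixel smoothing replaces each pixel's label confidences with the average over its super-pixel. A minimal NumPy sketch, assuming the confidences and an integer over-segmentation map are given:

```python
import numpy as np

def within_superpixel_smooth(conf, sp):
    """Replace each pixel's label confidences with the mean over its
    super-pixel. conf: (C, H, W) confidence maps, sp: (H, W) segment ids."""
    out = np.empty_like(conf)
    for s in np.unique(sp):
        mask = sp == s
        out[:, mask] = conf[:, mask].mean(axis=1, keepdims=True)
    return out

# toy example: 2 labels, a 2x2 image whose rows form two super-pixels
conf = np.array([[[0.9, 0.7], [0.2, 0.4]],
                 [[0.1, 0.3], [0.8, 0.6]]])
sp = np.array([[0, 0], [1, 1]])
smoothed = within_superpixel_smooth(conf, sp)
```

This is the "pooling over irregular regions" view mentioned later: an average pool whose receptive field is the super-pixel rather than a square window.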
Cross-super-pixel Neighborhood Voting
we can take the neighboring larger regions into account for better inference, and exploit more statistical structures and correlations between different super-pixels.
For each super-pixel s, we first compute a concatenation of bag-of-words features from the RGB, Lab and HOG descriptors, and the feature of each super-pixel can be denoted as
Our within-super-pixel smoothing and cross-super-pixel neighborhood voting can be seen as two types of pooling methods, which are performed on the local responses within the irregular regions depicted by super-pixels.
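The voting step can be sketched as a similarity-weighted average over adjacent super-pixels. This is an assumption-laden illustration, not the paper's exact formulation: the Gaussian-of-distance weights, the 50/50 mixing ratio, and the adjacency dictionary below are all invented for the example.

```python
import numpy as np

def neighborhood_vote(conf_sp, feats, neighbors):
    """conf_sp:   (S, C) per-super-pixel label confidences
    feats:     (S, D) appearance features (e.g. bag-of-words of RGB/Lab/HOG)
    neighbors: dict mapping super-pixel id -> list of adjacent ids"""
    voted = conf_sp.copy()
    for s, nbrs in neighbors.items():
        if not nbrs:
            continue
        # similarity weights from feature distance (a common choice;
        # the paper's exact weighting may differ)
        w = np.exp(-np.linalg.norm(feats[nbrs] - feats[s], axis=1))
        w = w / w.sum()
        # mix each super-pixel's own confidences with its neighbors' vote
        voted[s] = 0.5 * conf_sp[s] + 0.5 * (w @ conf_sp[nbrs])
    return voted

# toy example: 3 super-pixels, 2 labels, 1-D appearance features
conf_sp = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
feats = np.array([[0.0], [1.0], [0.5]])
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
voted = neighborhood_vote(conf_sp, feats, neighbors)
```

Visually similar neighbors get larger weights, so large coherent regions (e.g. a dress) pull their fragments toward a consistent label.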
Parameter details of Co-CNN
Experiments
Experimental Settings
Dataset
the large ATR dataset [15] and the small Fashionista dataset [31]
Implementation Details
We augment the training images with their horizontal reflections, which improves the F-1 score by about 4%.
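The reflection augmentation is a one-line array flip; a minimal sketch, assuming channels-last image arrays and integer label maps:

```python
import numpy as np

def augment_with_flips(images, label_maps):
    """images: (N, H, W, 3) arrays, label_maps: (N, H, W) label ids.
    Returns the originals plus their horizontal reflections.
    Caveat: labels with left/right semantics (e.g. left-arm vs right-arm,
    as in ATR) would also need their ids swapped after flipping."""
    flipped_imgs = images[:, :, ::-1, :]
    flipped_lbls = label_maps[:, :, ::-1]
    return (np.concatenate([images, flipped_imgs]),
            np.concatenate([label_maps, flipped_lbls]))
```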
Given a test image, we use the human detection algorithm [10] to detect the human body.
To evaluate the performance, we re-scale the output pixel-wise prediction back to the size of the original ground-truth labeling.
We train the networks for roughly 90 epochs, which takes 4 to 5 days.
Our Co-CNN can rapidly process one 150 × 100 image in about 0.0015 seconds. After incorporating the super-pixel extraction [17], we can process one image in about 0.15 seconds.
Results and Comparisons
Discussion on Our Network
Local-to-Global-to-Local Hierarchy
Global Image-level Context
Local Super-pixel Contexts
Conclusions and Future Work
In this work, we proposed a novel Co-CNN architecture for the human parsing task, which integrates the cross-layer context, global image-level label context and local super-pixel contexts into a unified network.