(ICCV 2015) Human Parsing with Contextualized Convolutional Neural Network
(T-PAMI 2016) Human Parsing with Contextualized Convolutional Neural Network
Paper: http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Liang_Human_Parsing_With_ICCV_2015_paper.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7423822&queryText=human%20parsing%20with%20contextualized&newsearch=true
Project: http://hcp.sysu.edu.cn/deep-human-parsing/
The paper proposes the Contextualized Convolutional Neural Network (Co-CNN), which incorporates cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a CNN.
cross-layer context: a local-to-global-to-local structure that feeds feature maps from earlier layers into later layers (earlier feature maps are added into later ones, and a convolved version of the input image is added into the last feature map), combining the global semantic information and the local fine details of different convolutional layers.
global image-level label prediction: all labels appearing in the segmentation are collected into a binary vector and predicted as a multi-label classification task.
within-super-pixel smoothing and cross-super-pixel neighborhood voting: smooth the local label predictions.
In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which integrates the cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network.
the cross-layer context is captured by our basic local-to-global-to-local structure, which hierarchically combines the global semantic information and the local fine details across different convolutional layers.
the global image-level label prediction is used as an auxiliary objective in the intermediate layer of the Co-CNN.
the within-super-pixel smoothing and cross-super-pixel neighborhood voting are formulated as natural subcomponents of the Co-CNN to achieve local label consistency in both the training and testing processes.
Introduction
none of the previous methods has achieved excellent dense prediction over raw image pixels in a fully end-to-end way.
diverse contextual information and mutual relationships among the key components of human parsing (i.e. semantic labels, spatial layouts and shape priors) should be well addressed when predicting the pixel-wise labels.
the predicted label maps are desired to be detail-preserving and of high resolution, in order to recognize or highlight very small labels (e.g. sunglasses or belts).
In this paper, we present a novel Contextualized Convolutional Neural Network (Co-CNN) that successfully addresses the above mentioned issues.
Figure 1. Our Co-CNN integrates the cross-layer context, global image-level context and local super-pixel contexts into a unified network. It consists of cross-layer combination, global image-level label prediction, within-super-pixel smoothing and cross-super-pixel neighborhood voting.
given an input 150 × 100 image, we extract the feature maps for four resolutions (i.e., 150 × 100, 75 × 50, 37 × 25 and 18 × 12). Then we gradually up-sample the feature maps and combine the corresponding early, fine layers (blue dash line) and deep, coarse layers (blue circle with plus) under the same resolutions to capture the cross-layer context.
an auxiliary objective (shown as “Squared loss on image-level labels”) is appended after the down-sampling stream to predict global image-level labels. These predicted probabilities are then aggregated into the subsequent layers after the up-sampling (green line) and used to re-weight pixel-wise prediction (green circle with plus).
the within-super-pixel smoothing and cross-super-pixel neighborhood voting are performed based on the predicted confidence maps (orange planes) and the generated super-pixel over-segmentation map to produce the final parsing result.
Only down-sampling, up-sampling, and prediction layers are shown; intermediate convolution layers are omitted. For better viewing of all figures in this paper, please see the original zoomed-in color PDF file.
Related Work
Human Parsing
Semantic Segmentation with CNN
The Proposed Co-CNN Architecture
Local-to-global-to-local Hierarchy
Our basic local-to-global-to-local structure captures the cross-layer context. It simultaneously considers the local fine details and global structure information.
the early convolutional layers with high spatial resolutions (e.g.,
The feature maps up-sampled from the low resolutions and those from the high resolutions are then aggregated with the element-wise summation, shown as the blue circle with plus in Figure 1.
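The cross-layer combination can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the nearest-neighbor up-sampling, channel count, and resolutions here are assumptions chosen to mirror the 75 × 50 → 150 × 100 step in Figure 1.

```python
import numpy as np

def upsample_nn(fmap, factor):
    """Nearest-neighbor up-sampling of a (C, H, W) feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def cross_layer_combine(coarse, fine, factor):
    """Up-sample the deep, coarse feature map and element-wise sum it with
    the early, fine feature map at the same resolution (the blue
    circle-with-plus in Figure 1)."""
    return upsample_nn(coarse, factor) + fine

# toy example: combine a 75x50 coarse map with a 150x100 fine map
coarse = np.random.rand(16, 75, 50).astype(np.float32)
fine = np.random.rand(16, 150, 100).astype(np.float32)
out = cross_layer_combine(coarse, fine, factor=2)
```

In the actual network the up-sampling is a learned layer, but the element-wise summation of same-resolution maps is exactly this operation.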
To capture more detailed local boundaries, the input image is further filtered with the
Global Image-level Context
An auxiliary objective for multi-label prediction is used after the intermediate layers with spatial resolution of
Following the fully-connected layer, the C-way softmax, which produces a probability distribution over the C class labels, is appended.
Squared loss is used during the global image label prediction.
Suppose for each image
the predicted label probabilities are used in two ways: concatenated with the intermediate convolutional layers (image label concatenation in Figure 1) and element-wise summed with the label confidence maps (element-wise summation in Figure 1).
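A minimal sketch of the global image-level branch, under stated assumptions: the toy label count, logits, and ground-truth vector below are invented for illustration, and the re-weighting is shown only as the element-wise summation path (the concatenation path is omitted).

```python
import numpy as np

def softmax(x):
    """C-way softmax producing a probability distribution over C labels."""
    e = np.exp(x - x.max())
    return e / e.sum()

C, H, W = 5, 6, 4  # toy label count and spatial size

# auxiliary branch: predict image-level label probabilities
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.0])
p_img = softmax(logits)

# squared loss against the binary image-level ground-truth label vector
y = np.array([1.0, 0.0, 0.0, 0.0, 1.0])
loss = np.sum((p_img - y) ** 2)

# re-weight the pixel-wise confidence maps (C, H, W) by element-wise
# summation with the predicted probabilities, broadcast per channel
conf = np.random.rand(C, H, W)
reweighted = conf + p_img[:, None, None]
```

The effect is that labels the auxiliary branch considers absent from the image contribute less to the pixel-wise prediction.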
Local Super-pixel Context
Using super-pixel guidance only at a later stage is advantageous: it avoids making premature decisions and thus learning unsatisfactory convolution filters.
Within-super-pixel Smoothing
For each input image
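Within-super-pixel smoothing replaces each pixel's label confidences with the average over its super-pixel. A minimal NumPy sketch, assuming the confidences and an integer over-segmentation map are given:

```python
import numpy as np

def within_superpixel_smooth(conf, sp):
    """Replace each pixel's label confidences with the mean over its
    super-pixel. conf: (C, H, W) confidence maps, sp: (H, W) segment ids."""
    out = np.empty_like(conf)
    for s in np.unique(sp):
        mask = sp == s
        out[:, mask] = conf[:, mask].mean(axis=1, keepdims=True)
    return out

# toy example: 2 labels, a 2x2 image whose rows form two super-pixels
conf = np.array([[[0.9, 0.7], [0.2, 0.4]],
                 [[0.1, 0.3], [0.8, 0.6]]])
sp = np.array([[0, 0], [1, 1]])
smoothed = within_superpixel_smooth(conf, sp)
```

This is the "pooling over irregular regions" view mentioned later: an average pool whose receptive field is the super-pixel rather than a square window.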
Cross-super-pixel Neighborhood Voting
we can take the neighboring larger regions into account for better inference, and exploit more statistical structures and correlations between different super-pixels.
For each super-pixel s, we first compute a concatenation of bag-of-words features from the RGB, Lab and HOG descriptors, and the feature of each super-pixel can be denoted as
Our within-super-pixel smoothing and cross-super-pixel neighborhood voting can be seen as two types of pooling methods, which are performed on the local responses within the irregular regions depicted by super-pixels.
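The voting step can be sketched as a similarity-weighted average over adjacent super-pixels. This is an assumption-laden illustration, not the paper's exact formulation: the Gaussian-of-distance weights, the 50/50 mixing ratio, and the adjacency dictionary below are all invented for the example.

```python
import numpy as np

def neighborhood_vote(conf_sp, feats, neighbors):
    """conf_sp:   (S, C) per-super-pixel label confidences
    feats:     (S, D) appearance features (e.g. bag-of-words of RGB/Lab/HOG)
    neighbors: dict mapping super-pixel id -> list of adjacent ids"""
    voted = conf_sp.copy()
    for s, nbrs in neighbors.items():
        if not nbrs:
            continue
        # similarity weights from feature distance (a common choice;
        # the paper's exact weighting may differ)
        w = np.exp(-np.linalg.norm(feats[nbrs] - feats[s], axis=1))
        w = w / w.sum()
        # mix each super-pixel's own confidences with its neighbors' vote
        voted[s] = 0.5 * conf_sp[s] + 0.5 * (w @ conf_sp[nbrs])
    return voted

# toy example: 3 super-pixels, 2 labels, 1-D appearance features
conf_sp = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
feats = np.array([[0.0], [1.0], [0.5]])
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
voted = neighborhood_vote(conf_sp, feats, neighbors)
```

Visually similar neighbors get larger weights, so large coherent regions (e.g. a dress) pull their fragments toward a consistent label.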
Parameter details of Co-CNN
Experiments
Experimental Settings
Dataset
the large ATR dataset [15] and the small Fashionista dataset [31]
Implementation Details
We augment the training images with their horizontal reflections, which improves the F-1 score by about 4%.
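The reflection augmentation is a one-line array flip; a minimal sketch, assuming channels-last image arrays and integer label maps:

```python
import numpy as np

def augment_with_flips(images, label_maps):
    """images: (N, H, W, 3) arrays, label_maps: (N, H, W) label ids.
    Returns the originals plus their horizontal reflections.
    Caveat: labels with left/right semantics (e.g. left-arm vs right-arm,
    as in ATR) would also need their ids swapped after flipping."""
    flipped_imgs = images[:, :, ::-1, :]
    flipped_lbls = label_maps[:, :, ::-1]
    return (np.concatenate([images, flipped_imgs]),
            np.concatenate([label_maps, flipped_lbls]))
```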
Given a test image, we use the human detection algorithm [10] to detect the human body.
To evaluate the performance, we re-scale the output pixel-wise prediction back to the size of the original ground-truth labeling.
We train the networks for roughly 90 epochs, which takes 4 to 5 days.
Our Co-CNN can rapidly process one 150 × 100 image in about 0.0015 seconds. After incorporating the super-pixel extraction [17], we can process one image in about 0.15 seconds.
Results and Comparisons
Discussion on Our Network
Local-to-Global-to-Local Hierarchy
Global Image-level Context
Local Super-pixel Contexts
Conclusions and Future Work
In this work, we proposed a novel Co-CNN architecture for the human parsing task, which integrates the cross-layer context, global image-level label context and local super-pixel contexts into a unified network.