2018 CVPR-Human Semantic Parsing for Person Re-identification

Paper link

Motivation

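Existing methods all rely on detection boxes to extract local features, but such boxes are coarse. Is there a more precise way to extract fine-grained local features?
Existing methods involve many modules and are relatively complex. Are these complex modules really necessary? Is there a simpler way to reach the same performance?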

Contribution

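Extensive experiments show that a simple yet effective training procedure can significantly outperform the SOTA
Proposes SPReID, which uses human semantic parsing to handle local visual cues; the authors train their own segmentation model, which also surpasses SOTA methods on the human segmentation task
The method substantially improves performance on Market-1501, CUHK03, and DukeMTMC-reID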

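Thoughts

This paper uses segmentation (pixel-level accuracy) to improve how the convolutional network extracts fine-grained local features, and enlarges the training set by mixing existing Re-ID datasets. The approach is simple and blunt, yet it works well.
Could this simple mixed training be improved further?
Could the fusion of the human semantic parsing model's outputs be optimized further?
Shouldn't the paper include an experiment that compares directly against a bbox-based method to demonstrate the advantage of human semantic parsing?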

1. Introduction

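Definition of the ReID problem
Challenges of ReID:
- Across cameras, a pedestrian's illumination, background, occlusion, visible body parts, and pose can change drastically
- Even within a single camera, these conditions can change over time as the pedestrian moves
- Images of the same person from different cameras introduce large intra-class variation, which hurts generalization
- Images in re-ID datasets have relatively low resolution, which makes it harder to extract discriminative features
Given these challenges, the features learned by an effective re-ID system should be identity-specific, invariant to the environment, and invariant to viewpoint.
Recent research has shifted from image-level global features to part-level local features, but low resolution often makes part detection inaccurate, and existing part-level methods are overly complex
To address these two problems, the paper shows that a plain Inception-V3 model with a simple training strategy, operating only on the full image, can outperform the SOTA; it then uses semantic segmentation, which has pixel-level accuracy, to extract local features from body parts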

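2. Related Work

Explorations of local features (the architectures are fairly complex):
- human body parts
- extract multiple patches from the image (related to human body parts to some extent; struggles with part misalignment)
- attention-based models (use an LSTM to focus on highly discriminative regions)
- pose estimation models to generate body parts
Use of high-level semantic information (person attributes)
Losses:
- contrastive loss
- verification loss
- triplet loss
- multi-class classification loss
- combination of multi-class and verification loss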

3. Methodology

3.1. Inception-V3 Architecture

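A brief introduction to Inception-V3:
- 48 layers; accuracy comparable to ResNet152 at a much lower computational cost.
- Output stride of 32; the output of the last inception block passes through GAP to yield a 2048-dimensional vector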

3.2. Human Semantic Parsing Model

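Makes better use of local cues: pixel-level semantics are more precise and, compared with bboxes, more robust to pose variation
Inception-V3 serves as the segmentation backbone (with two modifications):
- Segmentation needs a higher-resolution final layer, so the stride of the last grid reduction module is reduced from 2 to 1
- The convolutions in the last layer are turned into dilated convolutions, GAP is removed, and atrous spatial pyramid pooling (rates=3,6,9,12) is added, followed by a 1x1 convolution as the classifier
- With these changes the network can perform pixel-level multi-class classification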
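
To make the atrous rates above more concrete, the following minimal sketch computes how many pixels a dilated kernel spans along one axis. It assumes 3x3 kernels in the atrous spatial pyramid pooling branches, which these notes do not state; the function name is made up for illustration.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Pixels spanned along one axis by a kernel dilated with the given rate."""
    # A dilated kernel inserts (rate - 1) gaps between neighbouring taps.
    return kernel_size + (kernel_size - 1) * (rate - 1)


ASPP_RATES = (3, 6, 9, 12)  # rates quoted in the notes
KERNEL_SIZE = 3             # assumption: 3x3 kernels in each ASPP branch

spans = {rate: effective_kernel_size(KERNEL_SIZE, rate) for rate in ASPP_RATES}
print(spans)  # {3: 7, 6: 13, 9: 19, 12: 25}
```

Under that assumption the four branches look at 7, 13, 19, and 25 pixel windows without adding parameters, which is why the branches can capture context at several scales at once.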

3.3. Person Re-identification Model *

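In SPReID, GAP is removed from Inception-V3, so the output is a 2048-channel feature map at output stride 32
The paper's baseline person re-identification model aggregates the convolutional output features with GAP into a 2048-dimensional global representation and is trained with softmax cross-entropy loss
Exploiting local visual cues: uses probability maps for five different body regions produced by the human semantic parsing model
In SPReID, each probability map is used to aggregate the output of the convolutional backbone, giving a 5x2048 feature matrix in which each row is the feature vector aggregated with one probability map. Compared with GAP, this aggregation is aware of spatial location to some extent; the probability maps can also be viewed as weights for the body parts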
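Implementation details:
- The output feature maps of the two branches are flattened and multiplied as matrices. E.g., for one body-part probability map: 30x30x2048 -> 900x2048; 30x30x1 -> 900x1; (transposed) 1x900 x 900x2048 -> 1x2048
- The results for head, upper-body, lower-body, and shoes are combined with an element-wise max and concatenated with the foreground and the global representation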
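
The aggregation described in the implementation details can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the function name, the ordering of the five maps (foreground first, matching the w/fg variant in Section 4.3), and the concatenation order are all chosen here, and the inputs are random placeholders.

```python
import numpy as np


def spreid_pool(features: np.ndarray, part_probs: np.ndarray) -> np.ndarray:
    """Aggregate backbone features with body-part probability maps.

    features:   (H, W, C) output of the re-ID backbone.
    part_probs: (5, H, W) maps, assumed ordered as
                [foreground, head, upper-body, lower-body, shoes].
    """
    h, w, c = features.shape
    flat_feats = features.reshape(h * w, c)                    # (H*W, C)
    flat_probs = part_probs.reshape(part_probs.shape[0], h * w)  # (5, H*W)
    part_feats = flat_probs @ flat_feats                       # (5, C), one row per map
    global_feat = flat_feats.mean(axis=0)                      # GAP baseline feature
    foreground = part_feats[0]
    body_max = part_feats[1:].max(axis=0)                      # element-wise max over 4 parts
    # Concatenation order is an assumption made for this sketch.
    return np.concatenate([global_feat, foreground, body_max])


rng = np.random.default_rng(0)
features = rng.random((30, 30, 2048))   # placeholder activations
part_probs = rng.random((5, 30, 30))    # placeholder probability maps
descriptor = spreid_pool(features, part_probs)
print(descriptor.shape)  # (6144,)
```

With 2048 channels the final descriptor has 3 x 2048 = 6144 dimensions: one block each for the global feature, the foreground feature, and the max over the four body parts.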
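Semantic segmentation models usually need higher-resolution images. Images fed to the re-identification backbone are therefore first downscaled with bilinear interpolation, and the final activations are then upscaled with bilinear interpolation to match the resolution of the human semantic parsing branch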
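
As a rough illustration of the bilinear upscaling step, here is a small NumPy sketch. It uses a corner-aligned sampling grid, which is an assumption (the notes do not say which convention the paper uses), and the array shapes are placeholders rather than the sizes used in the paper.

```python
import numpy as np


def bilinear_resize(x: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize an (H, W, C) array with corner-aligned bilinear interpolation."""
    in_h, in_w, _ = x.shape
    # Fractional source coordinates for every output row and column.
    rows = np.linspace(0, in_h - 1, out_h)
    cols = np.linspace(0, in_w - 1, out_w)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, in_h - 1)
    c1 = np.minimum(c0 + 1, in_w - 1)
    wr = (rows - r0)[:, None, None]   # vertical interpolation weights
    wc = (cols - c0)[None, :, None]   # horizontal interpolation weights
    top = x[r0][:, c0] * (1 - wc) + x[r0][:, c1] * wc
    bottom = x[r1][:, c0] * (1 - wc) + x[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr


activations = np.random.default_rng(0).random((15, 5, 4))  # placeholder shape
upscaled = bilinear_resize(activations, 30, 10)
print(upscaled.shape)  # (30, 10, 4)
```

In practice a deep learning framework's built-in resize operation would be used instead; the sketch only shows what the interpolation computes.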

4. Experiments

4.1. Datasets and Evaluation Measures

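The method is evaluated mainly on Market1501, CUHK03, and DukeMTMC-reID
Training data is augmented with the 3DPeS, CUHK01, CUHK02, PRID, PSDB, Shinpuhkan, and VIPeR datasets, bringing the training set to roughly 111K images of roughly 17K identities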

4.2. Training the Networks

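Training the baseline model:
- First trained on the combined training set for 200K iterations with input size 492x164
- Then fine-tuned separately on Market1501, CUHK03, and DukeMTMC-reID for 50K iterations with input size 748x246
Training the SPReID model:
- Trained on all 10 datasets with the same configuration, with input size 512x170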
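The semantic parsing model is trained on the 30,000 images of the LIP dataset. The predicted probabilities of the 20 semantic labels are combined into 5 coarse labels to parse the human body for person re-identification. As a side effect, the parsing model also surpasses SOTA methods; the paper reports the results in a table.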
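
The merging of 20 fine labels into 5 coarse labels can be sketched as below. Everything specific here is an assumption: the notes do not give the grouping, so the index lists are hypothetical placeholders, label 0 is assumed to be background, and the probabilities are assumed to be combined by summation.

```python
import numpy as np

# Hypothetical grouping of the 19 non-background label indices;
# the actual assignment used in the paper is not given in these notes.
COARSE_GROUPS = {
    "head": [1, 2, 4, 13],
    "upper_body": [3, 5, 6, 7, 10, 11, 14, 15],
    "lower_body": [8, 9, 12, 16, 17],
    "shoes": [18, 19],
}


def to_coarse(probs: np.ndarray) -> np.ndarray:
    """Map (20, H, W) per-pixel label probabilities to (5, H, W) coarse maps.

    Output order: [foreground, head, upper_body, lower_body, shoes].
    """
    foreground = probs[1:].sum(axis=0)  # everything except the assumed background
    parts = [probs[idx].sum(axis=0) for idx in COARSE_GROUPS.values()]
    return np.stack([foreground, *parts])


rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 4, 4))                       # placeholder parser output
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
coarse = to_coarse(probs)
print(coarse.shape)  # (5, 4, 4)
```

Because the four part groups partition the non-background labels in this sketch, the four part maps sum to the foreground map at every pixel.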

4.3. Person Re-identification Performance

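Effect of input image resolution: how the input image size affects model performance; the paper reports the results in a table.

Training at higher resolution yields better performance

Choice of re-identification backbone architecture: different models as the backbone; the paper reports the results in a table.

- Inception-V3 performs on par with ResNet152 while costing about 3x less computation; the comparison also shows that every model benefits from fine-tuning on high-resolution images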

SPReID Performance:

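Comparison with and without the foreground: $\mathrm{SPReID}^{w/fg}$ and $\mathrm{SPReID}^{wo/fg}$
Compared with the baseline, SPReID differs only in how the features of the last convolutional layer are aggregated: human semantic parsing is used to better select the useful features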

Effect of weight sharing:

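The model aggregates features in two ways: directly with GAP to produce the global feature, and by weighted aggregation with the body-part probability maps. The paper tests the effect of sharing backbone parameters between these two parts and reports the results in a table.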

4.4. Comparison with the state-of-the-art

The comparison is split into three groups:
- SOTA
- The baseline model obtained with the two-stage training strategy
- SPReID

5. Implementation Details

Person Re-identification:

batch size: 15
momentum: 0.9
weight decay: 0.0005
gradient clipping: 2.0
learning rate: 0.01 in the first stage, 0.001 in the second stage
The learning rate is decayed 10 times using exponential shift with a rate of 0.9
Nesterov Accelerated Gradient
ImageNet pre-trained models
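
One way to read the learning-rate settings above is as a staircase schedule. The sketch below assumes the 10 decays are spaced evenly over a training stage, which the notes do not specify; the function name and its parameters are illustrative only.

```python
def learning_rate(step: int, base_lr: float, total_steps: int,
                  num_decays: int = 10, rate: float = 0.9) -> float:
    """Staircase exponential decay: multiply by `rate` at evenly spaced steps."""
    decay_every = total_steps // num_decays          # assumption: even spacing
    num_applied = min(step // decay_every, num_decays)
    return base_lr * rate ** num_applied


# Stage 1 from the notes: base learning rate 0.01 over 200K iterations.
for step in (0, 20_000, 100_000, 200_000):
    print(step, learning_rate(step, base_lr=0.01, total_steps=200_000))
```

Under this reading the first stage ends at 0.01 x 0.9^10, about 35% of its starting learning rate, and the second stage would apply the same schedule starting from 0.001 over 50K iterations.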

Human Semantic Parsing:

30 iterations
Learning rates for Inception-V3, atrous spatial pyramid pooling, and the 1x1 convolutional layer are 0.01, 0.1, and 0.1 respectively
input size: 512x512

6. Conclusion

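The two questions the paper raises:
- Does a model need to be this complex to reach SOTA?
- Are bboxes the best way to handle local visual cues?
Extensive experiments answer these questions:
- Simply training a simple model on a large number of high-resolution images is enough to surpass the SOTA
- Using human semantic parsing to handle local visual cues further improves performance
- Shouldn't there be a bbox experiment here as well?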
