2018-CVPR-Harmonious Attention Network for Person Re-Identification

Original

_Xiaobo

2018-09-25 23:24

Paper link
Code implementation [PyTorch]

Motivation

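Large pose variations and detection errors leave person bounding boxes misaligned. Existing methods handle this with constrained attention selection mechanisms, which is not optimal. How can the problem be addressed better?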

Contribution

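A new method that jointly learns multi-granularity attention selection and feature representation
Harmonious Attention Module
- hard region-level
- soft pixel-level
  ==> a lightweight Harmonious Attention module
A cross-attention interaction learning scheme: further improves the compatibility between attention selection and feature representation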

1.Introduction

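Problems this paper focuses on:
- Misalignment, background clutter, occlusion, and missing body parts introduced by the detection algorithm
- Misalignment when matching images whose poses vary across camera views
How existing methods approach them:
- Local patch calibration and saliency weighting in pairwise image matching ==> drawback: relies on hand-crafted features and lacks the discriminative power of deep features
- Attention deep learning models: built on existing classification models, so they are overly complex, offer only coarse region-level attention that ignores fine-grained detail, and are not very effective when trained on small datasets
This paper learns attention selection and feature representation jointly and proposes a lightweight network, HA-CNN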

2.Related Work

attention selection techniques:
- hand-crafted features
attention deep learning methods (e.g. PDC):
- regional attention selection sub-network (hard attention)
- soft attention
Advantages of HA-CNN:
- soft + hard
- multi-level correlated attention
- cross-attention interaction learning

3. Harmonious Attention Network

Goal: learn an optimal deep feature representation model under drastic viewpoint changes

HA-CNN Overview

- a harmonious attention learning scheme: performs attention selection to handle the unknown misalignment in bounding boxes
  - hard attention ==> local branch
  - soft attention ==> global branch
- a cross-attention interaction learning scheme between the local and global branches: improves their harmony and compatibility so that every branch is optimized at the same time

3.1.Harmonious Attention Learning

hard regional attention(STN) + soft spatial(RAN) + channel attention(SE)

(Ⅰ)Soft Spatial-Channel Attention

(1) Spatial attention:

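A 4-layer sub-network (10 parameters):
- a global cross-channel average pooling layer (pooling over the channel dimension)
- a 3 × 3 conv layer, stride 2
- a bilinear resizing layer
- a scaling conv layer: adaptively learns the fusion scale so that spatial attention combines optimally with channel attention
Cross-channel pooling is defined as $h \times w \times c$ ==> $h \times w \times 1$, which cuts the parameters of the second-layer convolution by a factor of $c$:
$S^l_{input}= \frac{1}{c}\sum_{i=1}^{c}X^l_{1:h,1:w,i}$
Why cross-channel pooling is reasonable: all channels share the same spatial attention map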
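
The pooling formula above can be sketched in NumPy. This is only an illustration: the feature-map size is made up, and the parameter comparison assumes a bias-free 3 × 3 convolution with a single output channel. Under that assumption, one reading of the "10 parameters" figure is 9 weights for the 3 × 3 conv plus 1 for the scaling conv.

```python
import numpy as np

def cross_channel_avg_pool(x):
    """Average an (h, w, c) feature map over its channel axis -> (h, w, 1)."""
    return x.mean(axis=2, keepdims=True)

# Hypothetical feature map with h=4, w=3, c=8 (sizes chosen for illustration).
x = np.arange(4 * 3 * 8, dtype=np.float64).reshape(4, 3, 8)
s_input = cross_channel_avg_pool(x)
print(s_input.shape)

# Weight count of a bias-free 3x3 conv with one output channel (assumption):
# c input channels need 9 * c weights, a single pooled channel needs only 9.
c = x.shape[2]
params_without_pool = 3 * 3 * c
params_with_pool = 3 * 3 * 1
print(params_without_pool // params_with_pool)
```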

(2) Channel attention:

A 4-layer squeeze-and-excitation sub-network
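
A minimal sketch of the generic squeeze-and-excitation idea, for orientation only. The reduction ratio `r`, the random weights, and the final sigmoid follow the common SE design and are assumptions here; the exact layers and activations used in HA-CNN may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (h, w, c) features; w1: (c, c // r); w2: (c // r, c).

    Returns one weight per channel, shape (c,).
    """
    squeezed = x.mean(axis=(0, 1))            # squeeze: spatial average, (c,)
    hidden = np.maximum(squeezed @ w1, 0.0)   # bottleneck + ReLU, (c // r,)
    return sigmoid(hidden @ w2)               # excitation, (c,)

rng = np.random.default_rng(0)
h, w, c, r = 4, 3, 8, 4                       # hypothetical sizes
x = rng.standard_normal((h, w, c))
w1 = rng.standard_normal((c, c // r))
w2 = rng.standard_normal((c // r, c))

weights = channel_attention(x, w1, w2)
reweighted = x * weights                      # broadcast over h and w
print(weights.shape, reweighted.shape)
```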

(Ⅱ)Hard Regional Attention

Role: following the STN idea, a transformation matrix locates $T$ latent discriminative regions at each level
$\mathbf{A}^l = \left[ \begin{matrix} s_h & 0 & t_x \\ 0 & s_w & t_y \\ \end{matrix} \right]$
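
To see what the matrix does, the sketch below applies it to a normalized target grid in the usual STN manner, giving the source coordinates to sample from. The $[-1, 1]$ coordinate range, the (x, y) ordering of the rows, and the helper name `region_grid` are assumptions for illustration; bilinear sampling at these coordinates is omitted.

```python
import numpy as np

def region_grid(s_h, s_w, t_x, t_y, out_h, out_w):
    """Map a normalized out_h x out_w target grid through A to source coords.

    Returns an (out_h, out_w, 2) array of (x, y) positions in [-1, 1] space.
    """
    A = np.array([[s_h, 0.0, t_x],
                  [0.0, s_w, t_y]])
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                         np.linspace(-1.0, 1.0, out_w),
                         indexing="ij")
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])  # (3, N)
    source = A @ target                                                  # (2, N)
    return source.T.reshape(out_h, out_w, 2)

# Half-size region shifted towards the top of the image (hypothetical values).
grid = region_grid(0.5, 0.5, 0.0, -0.5, out_h=3, out_w=3)
print(grid[1, 1])  # centre of the region in source coordinates
```

With the scale entries held fixed, only the two translation entries would remain to be predicted for each region.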
Differences from STN:

(Ⅲ)Cross-Attention Interaction Learning

Interaction between global and local features improves the joint learning of soft and hard attention:
- The regions produced by hard attention are used to align global features with local features
  $\widetilde{\mathbf{X}}_L^{(l,k)} = \mathbf{X}_L^{(l,k)}+\widetilde{\mathbf{X}}_G^{(l,k)}$
During back-propagation, the parameters of the global branch are optimized jointly by the global and local losses
$\Delta\mathbf{W}_G^{(l)} = \frac{\partial\mathcal{L}_G}{\partial\mathbf{X}_G^{(l)}}\frac{\partial\mathbf{X}_G^{(l)}}{\partial\mathbf{W}_G^{(l)}} + \sum_{k=1}^T\frac{\partial\mathcal{L}_L}{\partial\widetilde{\mathbf{X}}_G^{(l,k)}}\frac{\partial\widetilde{\mathbf{X}}_G^{(l,k)}}{\partial\mathbf{W}_G^{(l)}}$
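
The element-wise fusion of a local feature with its matching global region can be sketched as follows. Integer slicing stands in for the bilinear sampling that hard attention would perform, and the shapes, offsets, and function name are hypothetical.

```python
import numpy as np

def fuse_local_with_global(x_local, x_global, top, left):
    """Add the global-feature region at (top, left) to a local feature map.

    x_local: (h, w, c); x_global: (H, W, c) with the region inside its bounds.
    """
    h, w, c = x_local.shape
    region = x_global[top:top + h, left:left + w, :]
    if region.shape != x_local.shape:
        raise ValueError("region does not fit inside the global feature map")
    return x_local + region

x_global = np.arange(5 * 5 * 2, dtype=np.float64).reshape(5, 5, 2)
x_local = np.ones((2, 2, 2))
fused = fuse_local_with_global(x_local, x_global, top=1, left=2)
print(fused.shape)
```

Because the global region enters the local branch through a plain sum, a loss on the local branch also sends gradients back to the global features, which is where the second term of the update comes from.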

3.2. Person Re-ID by HA-CNN

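Each person image is passed through HA-CNN to obtain a 1024-dimensional feature representation, and gallery images are ranked by $l_2$ distance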
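
Ranking by $l_2$ distance can be sketched in a few lines. The 2-D toy features below stand in for the 1024-D HA-CNN features, and `rank_gallery` is a hypothetical helper name.

```python
import numpy as np

def rank_gallery(query, gallery):
    """Return gallery indices sorted by ascending L2 distance to the query."""
    dists = np.linalg.norm(gallery - query, axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists

query = np.array([0.0, 0.0])
gallery = np.array([[3.0, 4.0],
                    [1.0, 0.0],
                    [0.0, 2.0]])
order, dists = rank_gallery(query, gallery)
print(order)  # closest gallery item first
```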

4. Experiments

Datasets and Evaluation Protocol

CUHK03, Market-1501, DukeMTMC-ReID
CMC and mAP
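
For readers unfamiliar with the two metrics, here is a rough single-query sketch. It takes a ranked list of booleans marking whether each gallery item shares the query identity; real re-id protocols also filter same-camera matches, which is left out. Function names are illustrative.

```python
def cmc_hit(matches, k):
    """1 if a correct match appears within the top-k results, else 0."""
    return int(any(matches[:k]))

def average_precision(matches):
    """Mean of the precision values at each rank where a correct match occurs."""
    hits = 0
    precisions = []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Ranked result for one query: correct matches at ranks 2 and 4.
matches = [False, True, False, True]
rank1 = cmc_hit(matches, 1)
rank2 = cmc_hit(matches, 2)
ap = average_precision(matches)
print(rank1, rank2, ap)
```

The CMC rank-k score is the average of `cmc_hit` over all queries, and mAP is the average of `average_precision` over all queries.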

Implementation Details

TensorFlow
Inception units: $d_1=128, d_2=256,d_3=384$
$T=4$
Adam, lr: $5 \times 10^{-4}$, $\beta_1=0.9, \beta_2=0.999$
batch size: 32, epochs: 150, momentum: 0.9
no data augmentation

4.1. Comparisons to State-of-the-Art Methods

Evaluation on Market-1501

Evaluation on DukeMTMC-ReID

Evaluation on CUHK03

4.2. Further Analysis and Discussions

Effect of Different Types of Attention

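Evaluates the individual attention components
- Every component improves performance
- SSA and SCA are complementary when combined
- Combining hard and soft attention improves performance further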

Effect of Cross-Attention Interaction Learning

CAIL brings a significant performance gain

Effect of Joint Local and Global Features

Global and local features are complementary

Visualisation of Harmonious Attention

Visualization of HA and SA at different levels

Model Complexity

5. Conclusion

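Proposes the lightweight network HA-CNN, which achieves state-of-the-art results on three benchmark datasets
Unlike other work, this paper combines soft and hard attention into the Harmonious Attention Module, which handles misalignment better and makes the attention methods more complementary
Proposes CAIL to further optimize model learning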

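Thoughts

This work makes full use of existing attention methods and reaches SOTA performance without relying on ImageNet pre-trained models. Could future work also try building models tailored more specifically to re-id datasets?
The regions produced by hard attention contain a lot of noise. Is there a better way to localize them more precisely?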
