Re-ID: AlignedReID: Surpassing Human-Level Performance in Person Re-Identification (Paper Walkthrough)

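I just finished reading this paper and it is seriously cool: it aligns features by using dynamic programming to find a shortest path, which is a fresh idea, and the accuracy is very high.

Below are my notes on the paper~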

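The authors propose a method called AlignedReID. Its headline result: on the Market1501 and CUHK03 datasets, its rank-1 accuracy is the first to surpass human-level performance.
The authors argue: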
- Traditional approaches have focused on low-level features such as colors, shapes, and local descriptors. With the renaissance of deep learning, the convolutional neural network (CNN) has dominated this field.
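- Traditional methods mostly relied on low-level features (colors, shapes, local descriptors); with the rise of deep learning, CNNs have come to dominate the field.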
- Many CNN-based approaches learn a global feature, without considering the spatial structure of the person. This has a few major drawbacks:
  - inaccurate person detection boxes might impact feature learning.
  - the pose change or non-rigid body deformation makes the metric learning difficult.
  - occluded parts of the human body might introduce irrelevant context into the learned feature.
  - it is non-trivial to emphasize local differences in a global feature, especially when we have to distinguish two people with very similar appearances.
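- Many CNN-based methods learn only a global feature and ignore the spatial structure of the body, which leads to these problems:
  - inaccurate person detection boxes can hurt feature learning;
  - pose changes and body deformation make metric learning difficult;
  - occluded body parts can introduce irrelevant context;
  - it is hard to emphasize local differences in a global feature, especially when distinguishing two people who look very similar.
- To address these problems, earlier work focused on part-based, local feature learning. Some studies split the body into a few fixed parts without considering the correspondence between those parts, which does not solve the problems above. Others use pose estimation to help align the body parts, but that requires extra supervision and a pose estimation step.
So the authors propose AlignedReID: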
- In this paper, we propose a new approach, called AlignedReID, which still learns a global feature, but performs an automatic part alignment during the learning, without requiring extra supervision or explicit pose estimation.
- The proposed method still learns a global feature, but it aligns the parts automatically during learning, with no extra supervision or explicit pose estimation needed.
- In the local branch, we align local parts by introducing a shortest path loss.
- In the local branch, the alignment is done by computing a shortest path.
- In the inference stage, we discard the local branch and only extract the global feature.
- At inference time, only the global feature is used; the local features are dropped.
- In other words, the global feature itself, with the aid of local features learning, can greatly address the drawbacks we mentioned above, in our new joint learning framework.
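- In other words, a global feature learned with the help of local feature learning can address the four problems of CNN-based methods listed above.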
- In addition, the form of global feature keeps our approach attractive for the deployment of a large ReID system, without costly local features matching.
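- The authors add that, because the output is a single global feature, the method stays practical for large-scale person re-identification systems, with no need for costly local feature matching.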
- We also adopt a mutual learning approach in the metric learning setting, to allow two models to learn better representations from each other.
- For metric learning, the authors adopt mutual learning and get very good results.
A few concepts need some background first:
- Metric Learning: Deep metric learning methods transform raw images into embedding features, then compute the feature distances as their similarities. Usually, two images of the same person are defined as a positive pair, whereas two images of different persons are a negative pair. Triplet loss is motivated by the margin enforced between positive and negative pairs. Selecting suitable samples for the training model through hard mining has been shown to be effective. Combining softmax loss with metric learning loss to speed up the convergence is also a popular method.
- Feature Alignments: Consider the spatial local information when learning features.
- Mutual Learning: a deep mutual learning strategy in which an ensemble of students learn collaboratively and teach each other throughout the training process.
- Re-Ranking: After obtaining the image features, most current works choose the L2 Euclidean distance to compute a similarity score for a ranking or retrieval task.
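
To make the Metric Learning bullet a bit more concrete, here is a minimal sketch of a triplet loss with batch-hard mining in plain Python. The function names, the margin of 0.3, and the toy 2-d features are my own assumptions for illustration; they are not taken from the paper or from any library.

```python
import math


def l2(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Mean hinge loss over anchors, using each anchor's hardest pair.

    For every anchor, the hardest positive is the farthest sample with the
    same identity and the hardest negative is the closest sample with a
    different identity. The margin value is an illustrative assumption.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(features, labels)):
        positives = [l2(anchor, f)
                     for j, (f, other) in enumerate(zip(features, labels))
                     if other == label and j != i]
        negatives = [l2(anchor, f)
                     for f, other in zip(features, labels)
                     if other != label]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two identities whose features are interleaved on a line,
# so every anchor has a negative closer than its positive.
feats = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
ids = [0, 0, 1, 1]
print(batch_hard_triplet_loss(feats, ids))
```

The loss is zero once every positive pair is closer than every negative pair by at least the margin, which is the "margin enforced between positive and negative pairs" mentioned above.
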
Now a closer look at how AlignedReID works:
- In AlignedReID, we generate a single global feature as the final output of the input image, and use the L2 distance as the similarity metric. However, the global feature is learned jointly with local features in the learning stage.
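- Re-ID generally has two steps: feature extraction and metric learning. In AlignedReID, the final output for each input image is a single global feature, and that global feature is trained jointly with the local features.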
  - A global feature (a C-d vector) is extracted by directly applying global pooling on the feature map.
  - The global feature is obtained by pooling over the entire feature map at once.
  - For the local features, a horizontal pooling, which is a global pooling in the horizontal direction, is first applied to extract a local feature for each row, and a 1x1 convolution is then applied to reduce the channel number from C to c. In this way, each local feature (a c-d vector) represents a horizontal part of the image of a person.
  - The local features are obtained by applying horizontal pooling to the feature map row by row, followed by a 1x1 convolution. Each resulting feature represents one horizontal stripe of the person.
  - As a result, a person image is represented by a global feature and H local features.
  - In the end, an image is represented by one global feature and several (H) local features.
  - The distance of two person images is the summation of their global and local distances.
  - The distance between two images is the sum of the global feature distance and the local feature distance.
  - The global distance is simply the L2 distance of the global features.
  - The global distance is the L2 distance between the two global features.
  - For the local distance, we dynamically match the local parts from top to bottom to find the alignment of local feature with the minimum total distance.
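  - The local distance is the total cost of the shortest path found by dynamic programming, and that shortest path also gives the alignment of the local features.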
  - This is based on a simple assumption that, for two images of the same person, the local feature from one body part of the first image is more similar to the semantically corresponding body part of the other image.
  - This rests on an assumption: the same body part of the same person is highly similar across different images.
  - Given the local features of two images, $F=\{f_1,\dots,f_H\}$ and $G=\{g_1,\dots,g_H\}$, we first normalize the distance to [0, 1) by an element-wise transformation:
    $$d_{i,j} = \frac{e^{\lVert f_i - g_j \rVert_2} - 1}{e^{\lVert f_i - g_j \rVert_2} + 1}, \quad i, j \in \{1, 2, \dots, H\}$$
    - where $d_{i,j}$ is the distance between the i-th vertical part of the first image and the j-th vertical part of the second image. A distance matrix D is formed based on these distances, where its (i, j)-element is $d_{i,j}$.
  - We define the local distance between the two images as the total distance of the shortest path from (1, 1) to (H, H) in the matrix D.
  - The element-wise transformation is how each entry of the matrix D is computed.
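  - The distance can be calculated through dynamic programming as follows:
    $$S_{i,j} = \begin{cases} d_{i,j} & i = 1,\ j = 1 \\ S_{i-1,j} + d_{i,j} & i \neq 1,\ j = 1 \\ S_{i,j-1} + d_{i,j} & i = 1,\ j \neq 1 \\ \min(S_{i-1,j},\, S_{i,j-1}) + d_{i,j} & i \neq 1,\ j \neq 1 \end{cases}$$
    - where $S_{i,j}$ is the total distance of the shortest path when walking from (1, 1) to (i, j) in the distance matrix D, and $S_{H,H}$ is the total distance of the final shortest path between the two images.
  - This recurrence is the state-transition equation the dynamic program uses to find the shortest path.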
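  - Non-corresponding alignments are necessary to maintain the order of vertical alignment, as well as make the corresponding alignments possible.
  - The shortest path may include non-corresponding alignments; far from hurting the result, they are essential for preserving the top-to-bottom order of the alignment.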
  - We use the global distance to mine hard samples for two reasons:
    - First, the calculation of the global distance is much faster than that of the local distance.
    - Second, we observe that there is no significant difference in mining hard samples using both distances.
  - Note that in the inference stage, we only use the global features to compute the similarity of two person images. We make this choice mainly because we unexpectedly observed that the global feature itself is also almost as good as the combined features.
  - This somehow counter-intuitive phenomenon might be caused by two factors:
    - the feature map jointly learned is better than learning the global feature only, because we have exploited the structure prior of the person image in the learning stage;
    - with the aid of local feature matching, the global feature can pay more attention to the body of the person, rather than over fitting the background.
  - The above explains why only the global feature distance is used at inference, rather than the local distance or both combined.
  - We apply mutual learning to train models for AlignedReID, which can further improve performance.
  - The authors train the model with mutual learning because it improves performance.
  - A distillation-based model usually transfers knowledge from a pre-trained large teacher network to a small student network.
  - Distillation-based methods usually transfer knowledge from a large pre-trained teacher network to a small student network.
  - In this paper, we train a set of student models simultaneously, transferring knowledge between each other.
  - This paper instead trains several models at the same time and lets them learn from each other.
  - We propose a new mutual learning loss for metric learning.
    - The overall loss function includes the metric loss, the metric mutual loss, the classification loss and the classification mutual loss.
    - The metric loss is decided by both the global distances and the local distances, while the metric mutual loss is decided only by the global distances.
    - The classification mutual loss is the KL divergence for classification.
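  - The mutual learning loss is defined as:
    $$L_M = \frac{1}{N^2}\sum_{i}^{N}\sum_{j}^{N}\left(\left[ZG(M^{\theta_1}_{ij}) - M^{\theta_2}_{ij}\right]^2 + \left[M^{\theta_1}_{ij} - ZG(M^{\theta_2}_{ij})\right]^2\right)$$
    - where $M^{\theta_1}$ and $M^{\theta_2}$ are the N x N batch distance matrices produced by the two networks, and $ZG(\cdot)$ is the zero gradient function, which treats its argument as a constant when gradients are computed.
  - By applying the zero gradient function, the second-order gradients are:
    $$\frac{\partial^2 L_M}{\partial M^{\theta_1}_{ij}\,\partial M^{\theta_2}_{ij}} = 0$$
  - We found that it speeds up the convergence and improves the accuracy compared to a mutual loss without the zero gradient function.
  - The paper defines a new mutual learning loss, and the zero gradient function in that loss speeds up convergence and improves accuracy.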
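
The feature extraction and distance computation described above can be sketched in a few lines of plain Python. This is only a toy under my own assumptions: the feature map is a nested list indexed as `[C][H][W]`, both poolings are max pooling, the 1x1 convolution is left out, and all function names are made up for illustration. It is not the authors' implementation.

```python
import math


def l2(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def global_feature(fmap):
    """Global max pooling over a [C][H][W] feature map -> one C-d vector."""
    return [max(max(row) for row in channel) for channel in fmap]


def local_features(fmap):
    """Horizontal max pooling -> H vectors, one per row of the feature map.

    The 1x1 convolution that reduces C to c is omitted in this sketch.
    """
    channels, height = len(fmap), len(fmap[0])
    return [[max(fmap[c][h]) for c in range(channels)] for h in range(height)]


def local_distance(F, G):
    """Total cost of the shortest path from (1, 1) to (H, H) in D."""
    # d = (e^x - 1) / (e^x + 1) with x = ||f_i - g_j||_2, which equals tanh(x / 2)
    d = [[math.tanh(l2(f, g) / 2.0) for g in G] for f in F]
    rows, cols = len(F), len(G)
    S = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                S[i][j] = d[i][j]
            elif j == 0:
                S[i][j] = S[i - 1][j] + d[i][j]      # first column: only from above
            elif i == 0:
                S[i][j] = S[i][j - 1] + d[i][j]      # first row: only from the left
            else:
                S[i][j] = min(S[i - 1][j], S[i][j - 1]) + d[i][j]
    return S[rows - 1][cols - 1]


def image_distance(fmap_a, fmap_b):
    """Global L2 distance plus the local shortest-path distance."""
    global_part = l2(global_feature(fmap_a), global_feature(fmap_b))
    local_part = local_distance(local_features(fmap_a), local_features(fmap_b))
    return global_part + local_part


# Toy feature map with C=2 channels, H=2 rows, W=2 columns.
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, -1.0], [5.0, 2.0]]]
print(global_feature(fmap), local_features(fmap))
print(image_distance(fmap, fmap))
```

One thing the toy makes visible: because the path can only step down or right, it has to cross some off-diagonal cells, so even two identical images get a nonzero local distance whenever their rows differ. That is the "non-corresponding alignments" point from the notes above.
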
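Finally, the authors describe their experiments:
- They introduce four datasets: Market1501, CUHK03, MARS, and CUHK-SYSU
- They give the implementation details, i.e. the experimental parameters
- They compare against a baseline to highlight the advantages of AlignedReID
- They analyze the contribution of mutual learning
- They compare the model with state-of-the-art methods
- Last, they explain how human performance in person ReID was measured and compare it with AlignedReID's performance
The authors also offer some guesses as to why human accuracy falls below AlignedReID's: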
- First, the annotator usually summarizes some attributes, such as gender, age, etc., to decide whether the images contain the same person. However, the summarized attributes might be incorrect.
- Second, color bias exists between cameras, and it could make the same person look different in the query and ground truth images such as in (c).
- Last, different camera angles and human poses might mislead the judgement of body shapes.
One last remark from the authors to close: the end-to-end learning with structure prior is more powerful than a “blind” end-to-end learning.