【論文翻譯】-- GaitSet: Regarding Gait as a Set for Cross-View Gait Recognition

本文是復旦大學發表於 AAAI 2019 的工作,是截至目前在CASIA-B數據集上正確率最高的網絡。

英文粘貼原文,google參與翻譯但人工爲主。有不對的地方歡迎評論。

粉色部分爲本人理解添加,非原文內容。

摘要

As a unique biometric feature that can be recognized at a distance, gait has broad applications in crime prevention, forensic identification and social security.

作爲一種可以遠距離識別的獨特生物特徵,步態在犯罪預防、法醫鑑定和社會安全方面具有廣泛的應用。

To portray a gait, existing gait recognition methods utilize either a gait template, where temporal information is hard to preserve, or a gait sequence, which must keep unnecessary sequential constraints and thus loses the flexibility of gait recognition. 

爲了描繪步態,現有的步態識別方法利用步態模板(其中時間信息難以保存)或步態序列,其必須保持不必要的順序約束並因此失去步態識別的靈活性。

 In this paper we present a novel perspective, where a gait is regarded as a set consisting of independent frames. We propose a new network named GaitSet to learn identity information from the set. 

在本文中,我們提出了一種新穎的視角,將步態視爲由獨立幀組成的集合。我們提出了一個名爲GaitSet的新網絡,從該集合中學習身份信息。

Based on the set perspective, our method is immune to permutation of frames, and can naturally integrate frames from different videos which have been filmed under different scenarios, such as diverse viewing angles, different clothes/carrying conditions.

基於集合視角,我們的方法不受幀排列順序的影響,並且可以自然地整合在不同場景(例如不同的視角、不同的衣着/攜帶條件)下拍攝的不同視頻中的幀。

Experiments show that under normal walking conditions, our single-model method achieves an average rank-1 accuracy of 95.0% on the CASIA-B gait dataset and an 87.1% accuracy on the OU-MVLP gait dataset. 

實驗表明,在正常步行條件下,我們的單模型方法在CASIA-B步態數據集上實現了95.0%的平均一次命中(rank-1)準確率,在OU-MVLP步態數據集上達到了87.1%的準確率。

These results represent new state-of-the-art recognition accuracy. 

這些結果代表了新的最先進的識別準確度。

On various complex scenarios, our model exhibits a significant level of robustness. It achieves accuracies of 87.2% and 70.4% on CASIA-B under bag-carrying and coat-wearing walking conditions, respectively. 

在各種複雜場景中,我們的模型表現出顯著的魯棒性。在CASIA-B數據集的揹包和穿外套行走條件下,它分別達到了87.2%和70.4%的準確率。

These outperform the existing best methods by a large margin.

這些都大大優於現有的最佳方法。

The method presented can also achieve a satisfactory accuracy with a small number of frames in a test sample, e.g., 82.5% on CASIA-B with only 7 frames. 

所提出的方法在測試樣本幀數很少時也能獲得令人滿意的正確率,例如在CASIA-B上僅用7幀就得到82.5%的正確率。

 The source code has been released at https://github.com/AbnerHqC/GaitSet.
代碼開源到網址:https://github.com/AbnerHqC/GaitSet。

1.介紹

Unlike other biometrics such as face, fingerprint and iris, gait is a unique biometric feature that can be recognized at a distance without the cooperation of subjects and intrusion to them. Therefore, it has broad applications in crime prevention, forensic identification and social security. 

與臉部,指紋和虹膜等其他生物識別技術不同,步態是一種獨特的生物特徵,可以遠距離識別,非侵入且無需受試者的合作。因此,它被廣泛應用於犯罪防範、法醫鑑定和社會保障。

However, gait recognition suffers from exterior factors such as the subject’s walking speed, dressing and carrying condition, and the camera’s viewpoint and frame rate. 

然而,步態識別受到外部因素的影響,例如受試者的步行速度,穿着和攜帶狀況,以及相機的視點和幀速率。

There are two main ways to identify gait in literature, i.e., regarding gait as an image and regarding gait as a video sequence. The first category compresses all gait silhouettes into one image, or gait template, for gait recognition.

在文獻中識別步態有兩種主要方式,即將步態視爲圖像和將步態視爲視頻序列。第一類將所有步態輪廓壓縮成一張圖像,即步態模板,再進行步態識別。"第一類的典型代表是GEI(Gait Energy Image,步態能量圖),如下圖最後一列就是前幾列圖像的GEI"

(圖:"gait energy image" 的圖片搜索結果)

Simple and easy to implement, gait template easily loses temporal and fine-grained spatial information. Differently, the second category extracts features directly from the original gait silhouette sequences in recent years.

步態模板簡單易行,但很容易丟失時間和細粒度的空間信息。不同的是,近幾年第二類直接從原始步態輪廓序列中提取特徵的算法更多。

However, these methods are vulnerable to exterior factors. Further, deep neural networks like 3D-CNN for extracting sequential information are harder to train than those using a single template like Gait Energy Image.

但是,這些方法容易受到外部因素的影響。 此外,用於提取序列信息的深度神經網絡如 3D-CNN 比使用像 GEI 這樣的單個模板的深度神經網絡更難訓練。

To solve these problems, we present a novel perspective which regards gait as a set of gait silhouettes. As a periodic motion, gait can be represented by a single period.

爲了解決這些問題,我們提出了一種新思路即將步態特徵視爲一組步態輪廓圖。 作爲週期性運動,步態可以由一個週期表示。

In a silhouette sequence containing one gait period, it was observed that the silhouette in each position has unique appearance, as shown in Fig. 1.

在包含一個步態週期的輪廓序列中,觀察到每個位置的輪廓具有獨特的外觀,如圖1所示。

圖1:從左上角到右下角是CASIA-B步態數據集中的一個目標的完整週期輪廓。

Even if these silhouettes are shuffled, it is not difficult to rearrange them into correct order only by observing the appearance of them. Thus, we assume the appearance of a silhouette has contained its position information. With this assumption, order information of gait sequence is not necessary and we can directly regard gait as a set to extract temporal information.

即使這些輪廓被打亂,只需觀察它們的外觀就不難將它們重新排列成正確的順序。因此,我們假設輪廓的外觀已經包含了它的位置信息。在這種假設下,步態序列的順序信息不是必需的(輸入特徵),我們可以直接將步態視爲一個集合來提取時間信息。

We propose an end-to-end deep learning model called GaitSet whose scheme is shown in Fig. 2. 

我們提出了一種端到端的深度學習模型,稱作GaitSet,其框架見圖2。

圖2:GaitSet的框架。 'SP'代表Set Pooling。梯形表示卷積和池化塊,同一列中的梯形具有相同的參數,這些參數由帶有大寫字母的矩形表示。請注意,儘管MGP中的塊與主流水線中的塊具有相同的參數,但其參數僅在主流水線中的塊之間共享,而不與MGP中的塊共享。HPP代表水平金字塔池化。

 The input of our model is a set of gait silhouettes. 

我們這個模型的輸入是一組步態輪廓圖像。(就像圖1那種)

First, a CNN is used to extract frame-level features from each silhouette independently. Second, an operation called Set Pooling is used to aggregate frame-level features into a single set-level feature. 

首先,CNN用於獨立地從每個輪廓中提取幀級特徵。其次,名爲Set Pooling的操作用於將幀級特徵聚合成單個集合級特徵。

Since this operation is applied on high-level feature maps instead of the original silhouettes, it can preserve spatial and temporal information better than gait template. This will be justified by the experiment in Sec. 4.3. 

由於此操作應用於高級特徵(原始輪廓卷積之後就變成高級特徵了)而不是原始輪廓,因此它可以比步態模板更好地保留空間和時間信息。(其實我感覺這句話說的有點不太好理解,也可能是我理解能力有限,作者應該想表達的是:整個過程提取了每一幀圖像的空間特徵同時還提取了整個序列的時間特徵,比步態模板的方式提取的特徵更全面,側重點應該在保留時間特徵的同時提取了各幀特徵)這部分的實驗驗證在Sec.4.3中詳細介紹。

Third, a structure called Horizontal Pyramid Mapping is used to map the set-level feature into a more discriminative space to obtain the final representation.

第三,使用稱爲水平金字塔映射(Horizontal Pyramid Mapping,HPM)的結構將序列級特徵映射到更具辨別力的空間以獲得最終表示。(這句話的後半句說的很玄乎啊,主要discriminative這個詞用的太好了,讓人不明覺厲。我的理解就是把這個序列級特徵,就是包含了時間和空間的特徵壓縮成一維特徵便於最後全連接做分類。)

The superiorities of the proposed method are summarized as follows: 

該方法的優越性總結如下:

Flexible

Our model is pretty flexible since there are no any constraints on the input of our model except the size of the silhouette. It means that the input set can contain any number of non-consecutive silhouettes filmed under different viewpoints with different walking conditions. Related experiments are shown in Sec. 4.4 

靈活性

我們的模型非常靈活,因爲除了輪廓的大小之外,我們模型的輸入沒有任何限制。這意味着輸入的集合可以包含在不同視角、不同行走條件下拍攝的任意數量的非連續輪廓。相關實驗見Sec.4.4。(此處原文忘記寫句號了我幫他們填上了哈哈哈)

Fast

Our model directly learns the representation of gait instead of measuring the similarity between a pair of gait templates or sequences. Thus, the representation of each sample needs to be calculated only once, then the recognition can be completed by calculating the Euclidean distance between representations of different samples. 

快速性

我們的模型直接學習步態的表示,而不是測量一對步態模板或序列之間的相似性。 因此,每個樣本的表示僅需要計算一次,然後可以通過計算不同樣本的表示之間的歐式距離來完成識別。

Effective 

Our model greatly improves the performance on the CASIA-B and the OU-MVLP datasets, showing its strong robustness to view and walking condition variations and high generalization ability to large datasets.

有效性

我們的模型極大地提高了在CASIA-B和OU-MVLP數據集上的性能,顯示了其對視角和行走條件變化的強大魯棒性,以及對大型數據集的高泛化能力。

2. 相關工作

In this section, we will give a brief survey on gait recognition and set-based deep learning methods.

這部分我們會簡要回顧步態識別和基於集合的深度學習方法。

2.1 步態識別

Gait recognition can be grouped into template-based and sequence-based categories.

步態識別可以分爲基於模板和基於序列兩種。

Approaches in the former category first obtain human silhouettes of each frame by background subtraction. Second, they generate a gait template by rendering pixel level operators on the aligned silhouettes. Third, they extract the representation of the gait by machine learning approaches such as Canonical Correlation Analysis (CCA), Linear Discriminant Analysis (LDA) and deep learning. Fourth, they measure the similarity between pairs of representations by Euclidean distance or some metric learning approaches. Finally, they assign a label to the template by some classifier, e.g., nearest neighbor classifier. 

前一類方法首先通過背景減除獲得每一幀的人體輪廓;第二步,在對齊後的輪廓上執行像素級操作以生成步態模板;第三步,通過機器學習方法提取步態的表示,例如典型相關分析(CCA)、線性判別分析(LDA)和深度學習;第四步,通過歐氏距離或一些度量學習方法來測量表示對(表示對就是輸入的圖像序列和訓練過程中已經存儲的一組圖像序列)之間的相似性;最後,通過某種分類器(例如最近鄰分類器)爲(輸入的待檢測)模板分配標籤。

 

Previous works generally divides this pipeline into two parts, template generation and matching.

以前的工作通常將此流程分爲兩部分,模板生成和匹配。

 The goal of generation is to compress gait information into a single image, e.g., Gait Energy Image (GEI) and Chrono-Gait Image (CGI). 

(模板)生成的目標是將步態信息壓縮成單個圖像,例如步態能量圖像(GEI)和計時步態圖像(CGI)。

In template matching approaches, View Transformation Model (VTM) learns a projection between different views. (Hu et al. 2013) proposed View-invariant Discriminative Projection (ViDP) to project the templates into a latent space to learn a view-invariance representation. 

在模板匹配方法中,視角轉換模型(VTM)學習不同視角之間的投影。(Hu et al. 2013)提出了視角不變判別投影(ViDP),將模板投影到潛在空間以學習視角不變的表示。(關於潛在空間latent space可參考https://www.quora.com/What-is-the-meaning-of-latent-space,其實就是一個不確定多少維的空間,這個空間中同一類的物體離得更近,以便於分類。)

Recently, as deep learning performs well on various generation tasks, it has been employed on gait recognition task (Yu et al. 2017a; He et al. 2019; Takemura et al. 2018a; Shiraga et al. 2016; Yu et al. 2017b; Wu et al. 2017).
最近,由於深度學習在各種生成任務上表現良好,因此它已被(廣泛)用於步態識別任務(列舉了一堆相關文獻)。

As the second category, video-based approaches directly take a sequence of silhouettes as input. Based on the way of extracting temporal information, they can be classified into LSTM-based approaches (Liao et al. 2017) and 3D CNN-based approaches (Wolf, Babaee, and Rigoll 2016; Wu et al. 2017). 

作爲第二類,基於視頻的方法直接採用一系列輪廓作爲輸入。 基於提取時間信息的方式,可以將它們分類爲基於LSTM的方法和基於3D CNN的方法。

The advantages of these approaches are that 1) focusing on each silhouette, they can obtain more comprehensive spatial information. 2) They can gather more temporal information because specialized structures are utilized to extract sequential information. However, the price to pay for these advantages is high computational cost. 

這些方法的優點在於:1)關注每個輪廓以獲得更全面的空間信息.2)可以收集更多的時間信息,因爲利用了專門的結構來提取順序信息。 然而,爲這些優勢付出的代價是高計算成本。

 

2.2 無序集合的深度學習

Most works in deep learning focus on regular input representations like sequence and images. The concept of unordered set is first introduced into computer vision by (Charles et al. 2017) (PointNet) to tackle point cloud tasks. Using unordered set, PointNet can avoid the noise and the extension of data caused by quantization, and obtain a high performance. Since then, set-based methods have been widely used in point cloud field (Wang et al. 2018c; Zhou and Tuzel 2018; Qi et al. 2017). 

大多數深度學習工作都專注於序列和圖像等常規輸入表示。無序集合的概念最早由(Charles et al. 2017)(PointNet)引入計算機視覺,用於解決點雲任務。使用無序集合,PointNet可以避免量化引起的噪聲和數據膨脹,並獲得很高的性能。此後,基於集合的方法被廣泛用於點雲領域(列舉相關文獻)。

Recently, such methods are introduced into computer vision domains like content recommendation (Hamilton, Ying, and Leskovec 2017) and image captioning (Krause et al. 2017) to aggregate features in a form of a set. (Zaheer et al. 2017) further formalized the deep learning tasks defined on sets and characterizes the permutation invariant functions. To the best of our knowledge, it has not been employed in gait recognition domain up to now. 

最近,這些方法被引入內容推薦和圖像描述等計算機視覺領域,以集合的形式聚合特徵。(Zaheer et al. 2017)進一步形式化了定義在集合上的深度學習任務,並刻畫了排列不變函數。據我們所知,此類方法至今尚未被用於步態識別領域。

3. GaitSet

In this section, we describe our method for learning discriminative information from a set of gait silhouettes. The overall pipeline is illustrated in Fig. 2. 

在本節中,我們將介紹從一組步態輪廓中學習判別信息的方法。 整個流程如圖2所示。

3.1 問題表述

We begin with formulating our concept of regarding gait as a set.

首先,我們形式化"將步態視爲一個集合"這一概念。

Given a dataset of N people with identities yi,i ∈ 1,2,...,N,  we assume the gait silhouettes of a certain person subject to a distribution Pi which is only related to its identity.

給定一個數據集,數據集中一共N個人,每個人用yi表示(共有y1,y2,...yN這麼多個表示)。假設某個人的步態輪廓分佈Pi只與這個人的ID有關(就是說一個人的輪廓和這個人是一一對應的,不會搞錯,其實就是步態識別的可行基礎,即每個人的步態獨具特色)

Therefore, all silhouettes in one or more sequences of a person can be regarded as a set of n silhouettes Xi = {x(ij) | j = 1,2,...,n}, where x(ij) ∼ Pi. (爲了方便打字,本文用x(ij)代表原文中的 x_i^{(j)},下標i是人的編號,上標j是輪廓在集合中的編號。)

因此,一個人的一個或多個序列中的所有輪廓可以被看作由n個輪廓組成的集合Xi = {x(ij) | j = 1,2,...,n},其中x(ij) ∼ Pi。

 插入一段解釋或者說是總結(以CASIC-B數據集爲例):

數據集中有N=124個人,每個人用yi表示,比如我沒記錯的話ID=109的那個人的視頻好多連人都沒出現視頻就結束了,那麼在這個論文中就說y109視頻不全。

在全部數據集中閉着眼睛任選一個輪廓怎麼表示呢?假如選到的輪廓屬於編號爲20的人,且是這個人輪廓集合中的第3幀,那麼表示方法就是x(20 3),其所在集合表示爲X20。

Under this assumption, we tackle the gait recognition task through 3 steps, formulated as:

在這個假設下,我們通過3個步驟解決步態識別任務,表述爲:

f_i = H(G(F(X_i)))    (Equ. 1)

where F is a convolutional network that aims to extract frame-level features from each gait silhouette.

其中F是卷積網絡,旨在從每個步態輪廓中提取幀級特徵。

The function G is a permutation invariant function used to map a set of frame-level features to a set-level feature (Zaheer et al. 2017). It is implemented by an operation called Set Pooling (SP) which will be introduced in Sec. 3.2.

函數G是一個排列不變函數,用於將一組幀級特徵映射爲一個集合級特徵。該函數通過名爲Set Pooling(SP)的操作實現,詳細信息在Sec.3.2中介紹。

The function H is used to learn the discriminative representation of Pi from the set-level feature. This function is implemented by a structure called Horizontal Pyramid Mapping (HPM) which will be discussed in Sec. 3.3.

函數H用於從集合級特徵中學習Pi的判別表示。(就是對集合級特徵進行分類,對應到每個人身上)這個函數是通過一個叫做Horizontal Pyramid Mapping(HPM,原文此處誤寫作HMP)的結構實現的,將在Sec.3.3中介紹。

The input Xi is a tensor with four dimensions, i.e. set dimension, image channel dimension, image height dimension, and image width dimension.

輸入Xi是具有四個維度的tensor,分別是集合(幀)維度、圖像通道維度、圖像高度維度和圖像寬度維度,即tensor.shape=(n幀, 通道, 高, 寬);以本文使用的64×44單通道輪廓爲例即(n, 1, 64, 44)。
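(以下代碼塊非原文內容,是本人添加的極簡示意,用來說明上述三步公式 f_i = H(G(F(X_i))) 中各部分的輸入輸出形狀。其中的網絡結構、通道數等均爲本人假設的佔位寫法,並非論文或官方實現。)

```python
import torch
import torch.nn as nn

# 假設輸入集合Xi: (n幀, 1通道, 高64, 寬44)
Xi = torch.rand(30, 1, 64, 44)

F = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.LeakyReLU())  # F: 逐幀提取幀級特徵(佔位)
G = lambda v: torch.max(v, dim=0)[0]                               # G: 在集合維度上取max,排列不變
H = nn.Flatten(start_dim=0)                                        # H: 此處僅用展平佔位,論文中是HPM

v = F(Xi)          # 幀級特徵: (30, 32, 64, 44)
z = G(v)           # 集合級特徵: (32, 64, 44),與幀的順序無關
f_i = H(z)         # 最終表示(示意)
print(v.shape, z.shape, f_i.shape)
```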

3.2 Set Pooling

The goal of Set Pooling (SP) is to aggregate gait information of elements in a set, formulated as z = G(V), where z denotes the set-level feature and V = {v^j | j = 1,2,...,n} denotes the frame-level features. (爲了方便打字,本文用vj代表原文中的 v^j,即第j幀的幀級特徵。)

Set Pooling(SP)的目的在於聚合集合中各元素的步態信息,公式化表示爲z = G(V),其中z表示集合級特徵,V = {vj | j = 1,2,...,n}表示幀級特徵。

 There are two constraints in this operation.

此處有兩個約束條件。

First, to take set as an input, it should be a permutation invariant function which is formulated as: 

第一,要以集合作爲輸入,它應該是一個排列不變函數,其表達式爲:

G({v^j | j = 1,2,...,n}) = G({v^{π(j)} | j = 1,2,...,n})    (Equ. 2)

其中π爲任意排列。

Second, since in real-life scenario the number of a person’s gait silhouettes can be arbitrary, the function G should be able to take a set with arbitrary cardinality.

第二,因爲在現實生活場景中,一個人的步態輪廓數可以是任意的,所以函數G應該能夠接受任意基數的集合作爲輸入。(就是這個集合可長可短,多少幀都行,這是GaitSet宣傳的一大優勢)

Next, we describe several instantiations of G. It will be shown in the experiments that although different instantiations of SP do have sort of influence on the performances, they do not differ greatly and all of them exceed GEI-based methods by a large margin. 

下面,我們介紹了函數G的幾個實例。在實驗中將顯示儘管SP的不同實例確實對性能有影響,但它們沒有很大差異並且它們都大大超過基於GEI的方法。

Statistical Functions 統計函數

To meet the requirement of invariant constraint in Equ. 2, a natural choice of SP is to apply statistical functions on the set dimension. Considering the representativeness and the computational cost, we studied three statistical functions: max(·), mean(·) and median(·). The comparison will be shown in Sec. 4.3. 

在滿足Equ. 2中排列不變約束的要求下,SP一個很自然的選擇是在集合維度上應用統計函數。考慮到代表性和計算成本,我們研究了三個統計函數:max(·)、mean(·)和median(·)。比較將在Sec.4.3中展示。
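(以下爲本人添加的示意代碼,非原文內容。假設幀級特徵v的形狀爲(n, c, h, w),三個統計函數都作用在第0維即集合維度上,輸出形狀均爲(c, h, w)。)

```python
import torch

def sp_max(v):       # v: (n, c, h, w)
    return torch.max(v, dim=0)[0]      # 逐元素取集合內最大值

def sp_mean(v):
    return torch.mean(v, dim=0)

def sp_median(v):
    return torch.median(v, dim=0)[0]

v = torch.rand(30, 128, 16, 11)        # 假設的幀級特徵
print(sp_max(v).shape, sp_mean(v).shape, sp_median(v).shape)   # 都是 (128, 16, 11)
```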

Joint Function 聯合函數

We also studied two ways to join 3 statistical functions mentioned above: 

我們也研究了兩種將上述3個統計函數聯合使用的方式:

G(·) = max(·) + mean(·) + median(·)    (Equ. 3)
G(·) = 1_1C(cat(max(·), mean(·), median(·)))    (Equ. 4)

其中,cat表示在通道維度上連接,1_1C表示1×1卷積層,max、mean、median都應用在集合維度上。Equ.4是Equ.3的增強版,多出來的1×1卷積層可以學習合適的權重,以組合不同統計函數提取的信息。
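(以下爲本人添加的示意代碼,非原文實現。Equ.3直接把三個統計量相加;Equ.4先在通道維度cat再經1×1卷積加權組合。代碼中的通道數128爲假設值。)

```python
import torch
import torch.nn as nn

def sp_joint_add(v):                 # Equ.3: max + mean + median
    return torch.max(v, 0)[0] + torch.mean(v, 0) + torch.median(v, 0)[0]

class SPJointConv(nn.Module):        # Equ.4: 1_1C(cat(max, mean, median))
    def __init__(self, c=128):
        super().__init__()
        self.conv1x1 = nn.Conv2d(3 * c, c, kernel_size=1)   # 1×1卷積學習組合權重
    def forward(self, v):            # v: (n, c, h, w)
        stats = torch.cat([torch.max(v, 0)[0], torch.mean(v, 0), torch.median(v, 0)[0]], dim=0)
        return self.conv1x1(stats.unsqueeze(0)).squeeze(0)  # 輸出 (c, h, w)

v = torch.rand(30, 128, 16, 11)
print(sp_joint_add(v).shape, SPJointConv()(v).shape)
```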

Attention 注意力機制

這部分原文大量使用了refine這個詞,我大概有個理解,但是沒想好這個詞怎麼翻譯才合理。

Since visual attention was successfully applied in lots of tasks, we use it to improve the performance of SP.

由於視覺注意力已成功應用於大量任務中,因此我們使用它來提高SP的性能。

Its structure is shown in Fig. 3. The main idea is to utilize the global information to learn an element-wise attention map for each frame-level feature map to refine it.

其結構如圖3所示。主要思想是利用全局信息來學習每個幀級特徵圖的元素注意力圖,以便提煉更有價值信息。

圖3 Set Pooling(SP)應用注意力機制的結構。1_1C和cat分別代表1×1卷積層和連接。乘法和加法都是逐點的。

Global information is first collected by the statistical functions in the left. Then it is fed into a 1×1 convolutional layer along with the original feature map to calculate an attention for the refinement.  The final set-level feature z will be extracted by employing MAX on the set of the refined frame-level feature maps.  The residual structure can accelerate and stabilize the convergence. 

首先由左側(上面)的統計函數收集全局信息。然後,將其與原始特徵圖一起送入1×1卷積層,計算用於精煉的注意力圖。最終的集合級特徵z通過在精煉後的幀級特徵圖集合上使用MAX得到。殘差結構可以加速並穩定收斂。
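(以下是本人按圖3的理解寫的示意代碼,非原文/官方實現;注意力分支的層數、是否使用Sigmoid等細節均爲假設。大意:統計函數給出全局信息,與每幀特徵拼接後經1×1卷積得到逐元素注意力圖,精煉並加上殘差後,再在集合維度取MAX得到z。)

```python
import torch
import torch.nn as nn

class AttentionSP(nn.Module):
    def __init__(self, c=128):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(4 * c, c, 1), nn.Sigmoid())   # 1_1C,Sigmoid爲假設

    def forward(self, v):                                  # v: (n, c, h, w)
        n = v.size(0)
        g = torch.cat([torch.max(v, 0)[0], torch.mean(v, 0), torch.median(v, 0)[0]], dim=0)
        g = g.unsqueeze(0).expand(n, -1, -1, -1)           # 全局信息複製到每一幀
        a = self.att(torch.cat([v, g], dim=1))             # 逐元素注意力圖 (n, c, h, w)
        refined = v * a + v                                # 精煉 + 殘差
        return torch.max(refined, dim=0)[0]                # 集合維度MAX -> 集合級特徵z

print(AttentionSP()(torch.rand(30, 128, 16, 11)).shape)   # (128, 16, 11)
```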

3.3 Horizontal Pyramid Mapping

In literature, splitting feature map into strips is commonly used in person re-identification task. The images are cropped and resized into uniform size according to pedestrian size whereas the discriminative parts vary from image to image. 

在文獻中,將特徵圖分割成條的方式經常用於行人重識別任務。圖像根據行人大小被裁剪並縮放到統一尺寸,但具有判別性的部位仍然因圖像而異。

(Fu et al. 2018) proposed Horizontal Pyramid Pooling (HPP) to deal with it. HPP has 4 scales and thus can help the deep network focus on features with different sizes to gather both local and global information. We improve HPP to make it adapt better for gait recognition task.

(Fu et al. 2018)提出了Horizontal Pyramid Pooling(HPP)來處理上述問題。HPP有4個尺度,可以幫助深度網絡關注不同尺寸的特徵,從而同時收集局部和全局信息。我們改進了HPP,使其更適合步態識別任務。

Instead of applying a 1×1 convolutional layer after the pooling, we use independent fully connect layers (FC) for each pooled feature to map it into the discriminative space, as shown in Fig. 4.  We call it Horizontal Pyramid Mapping (HPM).

如圖4所示,我們沒有在池化之後應用1×1卷積層,而是對每個池化後的特徵使用互相獨立的全連接層(FC)將其映射到判別空間。我們稱這樣的操作爲Horizontal Pyramid Mapping (HPM)。

圖4 HPM結構圖

Specifically, HPM has S scales. On scale s ∈ 1,2,...,S, the feature map extracted by SP is split into 2^{s-1} strips on height dimension, i.e. \sum_{s=1}^{S} 2^{s-1} strips in total.

具體而言,HPM具有S個尺度。在尺度s ∈ 1,2,...,S上,由SP提取的特徵圖在高度維度上被分成2^{s-1}條,即總共\sum_{s=1}^{S} 2^{s-1}條。

(舉個例子,假如S=3,則特徵圖在豎直方向上被分成3種尺度:s=1時1條、s=2時2條、s=3時2^{s-1}=4條,所有尺度的條加在一起一共是1+2+4=7=\sum_{s=1}^{S} 2^{s-1}條)

Then a Global Pooling is applied to the 3-D strips to get 1-D features. For a strip z_{s,t}, where t ∈ 1,2,...,2^{s-1} stands for the index of the strip in the scale, the Global Pooling is formulated as f'_{s,t} = maxpool(z_{s,t}) + avgpool(z_{s,t}), where maxpool and avgpool denote Global Max Pooling and Global Average Pooling respectively. Note that the functions maxpool and avgpool are used at the same time because it outperforms applying anyone of them alone.

然後,用一個全局池化將3維的條變成1維特徵。對於一個條z_{s,t}來說,t ∈ 1,2,...,2^{s-1}代表該條在尺度s中的索引,全局池化的公式是 f'_{s,t} = maxpool(z_{s,t}) + avgpool(z_{s,t}),其中maxpool和avgpool分別代表全局最大池化和全局平均池化。注:同時使用maxpool和avgpool是因爲這樣比單獨使用其中任何一種效果更好。
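(以下爲本人添加的示意代碼,非原文。假設SP輸出的集合級特徵z形狀爲(c, h, w),S=5;對每個尺度沿高度切條,每條做全局max池化與全局平均池化並相加,得到一維特徵。)

```python
import torch

def hpm_pool(z, num_scales=5):                 # z: (c, h, w)
    feats = []
    for s in range(1, num_scales + 1):
        for strip in z.chunk(2 ** (s - 1), dim=1):                       # 沿高度切成2^(s-1)條
            feats.append(strip.amax(dim=(1, 2)) + strip.mean(dim=(1, 2)))  # f'_{s,t}
    return torch.stack(feats)                  # (sum_s 2^(s-1), c),S=5時爲(31, c)

print(hpm_pool(torch.rand(128, 16, 11)).shape)   # torch.Size([31, 128])
```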

The final step is to employ FCs to map the features f‘ into a discriminative space. Since strips in different scales depict features of different receptive fields, and different strips in each scales depict features of different spatial positions, it comes naturally to use independent FCs, as shown in Fig. 4. 

最後一步是使用FC(全連接層)將特徵f'映射到判別空間。因爲不同尺度中的條描述了不同感受野的特徵,而同一尺度中不同的條描述了不同空間位置的特徵,所以如圖4所示,很自然會想到使用互相獨立的FC。
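(續上,以下示意對每一條特徵使用互相獨立的全連接層;條數31對應S=5,輸出維度out_dim=256即文中的d,但具體取值爲本人假設,非論文給定。)

```python
import torch
import torch.nn as nn

class HPMFC(nn.Module):
    def __init__(self, num_strips=31, in_dim=128, out_dim=256):   # out_dim即文中的d,取值爲假設
        super().__init__()
        self.fcs = nn.ModuleList(nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_strips))

    def forward(self, feats):                  # feats: (num_strips, in_dim),即上面hpm_pool的輸出
        return torch.stack([fc(f) for fc, f in zip(self.fcs, feats)])   # (num_strips, out_dim)

print(HPMFC()(torch.rand(31, 128)).shape)      # torch.Size([31, 256])
```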

3.4 Multilayer Global Pipeline

Different layers of a convolutional network have different receptive fields. The deeper the layer is, the larger the receptive field will be. Thus, pixels in feature maps of a shallow layer focus on local and fine-grained information while those in a deeper layer focus on more global and coarse-grained information.

不同層的卷積網絡具有不同的感受野。越深層具有越大的感受野。因此,淺層特徵更注重細粒度,而深層特徵蘊含更多全局粗粒度信息。

 The set-level features extracted by applying SP on different layers have analogical property.  As shown in the main pipeline of Fig. 2, there is only one SP on the last layer of the convolutional network. To collect various-level set information, Multilayer Global Pipeline (MGP) is proposed. It has a similar structure with the convolutional network in the main pipeline and the set-level features extracted in different layers are added to MGP.

在不同層上應用SP提取的集合級特徵具有類似的性質。如圖2的主流程所示,卷積網絡的最後一層只有一個SP。爲了收集不同層級的集合信息,我們提出了Multilayer Global Pipeline (MGP)。MGP的結構與主流程中的卷積網絡類似,不同層提取的集合級特徵會被加入MGP。

 The final feature map generated by MGP will also be mapped into \sum_{s=1}^{S} 2^{s-1} features by HPM. Note that the HPM after MGP does not share parameters with the HPM after the main pipeline. 

最終由MGP生成的特徵也被HPM分成\sum_{s=1}^{S} 2^{s-1}條特徵。注意:在MGP後面的HPM不會和主流程後面的HPM共享參數。

3.5 訓練和測試

訓練損失函數

As aforementioned, the output of the network is 2\times \sum_{s=1}^{S} 2^{s-1}  features with dimension d. The corresponding features among different samples will be used to compute the loss. 

如上所述,網絡的輸出是2\times \sum_{s=1}^{S} 2^{s-1}個維度爲d的特徵。不同樣本之間相對應的特徵將被用於計算損失。

 In this paper, Batch All (BA+) triplet loss is employed to train the network (Hermans, Beyer, and Leibe 2017). 

本文中,訓練網絡使用Batch All(BA+)三元損失。(BA+三元損失在文章《In Defense of the Triplet Loss for Person Re-Identification》中的Sec.2的第6段介紹。)
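(以下爲本人對Batch All三元損失的示意實現,非原文/官方代碼。大意:在batch內枚舉所有(錨點,正樣本,負樣本)三元組,計算[m + d(a,p) - d(a,n)]_+,並只對非零項取平均;margin=0.2取自後文的訓練細節,特徵維度等均爲假設。)

```python
import torch

def batch_all_triplet_loss(feats, labels, margin=0.2):
    # feats: (m, d) 某一條特徵在batch內的集合; labels: (m,) 每個樣本的人員ID
    dist = torch.cdist(feats, feats)                              # 歐式距離矩陣 (m, m)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = same & ~torch.eye(len(labels), dtype=torch.bool)        # 正樣本對(去掉自身)
    neg = ~same
    # [a, p, n]位置的值 = margin + d(a,p) - d(a,n)
    loss = (margin + dist.unsqueeze(2) - dist.unsqueeze(1)).clamp(min=0)
    loss = loss[pos.unsqueeze(2) & neg.unsqueeze(1)]              # 只保留合法三元組
    nonzero = loss[loss > 0]
    return nonzero.mean() if len(nonzero) > 0 else loss.sum()

print(batch_all_triplet_loss(torch.rand(16, 256), torch.randint(0, 4, (16,))))
```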

A batch with size of p×k is sampled from the training set where p denotes the number of persons and k denotes the number of training samples each person has in the batch.

從訓練集中採樣一個大小爲p×k的batch,其中p表示人數,k表示batch中每個人的訓練樣本數。
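(以下爲本人添加的p×k採樣示意,非原文。subject_to_samples的數據格式、是否允許重複採樣等均爲假設;p=8、k=16取自後文CASIA-B的設置。)

```python
import random

def sample_pk_batch(subject_to_samples, p=8, k=16):
    # subject_to_samples: {人員ID: [該人的樣本(輪廓集合)列表]}
    batch = []
    for sid in random.sample(list(subject_to_samples), p):            # 隨機選p個人
        for sample in random.choices(subject_to_samples[sid], k=k):   # 每人取k個樣本
            batch.append((sid, sample))
    return batch                                                      # 共p*k個(標籤, 樣本)

data = {i: [f"seq_{i}_{j}" for j in range(10)] for i in range(74)}
print(len(sample_pk_batch(data)))    # 128
```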

 Note that although the experiment shows that our model performs well when it is fed with the set composed by silhouettes gathered from arbitrary sequences, a sample used for training is actually composed by silhouettes sampled in one sequence. 

注:雖然我們的模型在輸入任意序列中的輪廓測試時表現良好,但是訓練的時候其實是用一個序列中的輪廓訓練的。(我理解的這句話意思是:測試階段,可以混合輸入一個人任意序列中的某些輪廓,但是訓練時,是每個人每次只輸入一個序列中的某些輪廓)

測試

Given a query Q, the goal is to retrieve all the sets with the same identity in gallery set G. Denote the sample in G as g. The Q is first put into GaitSet net to generate multiscale features, followed by concatenating all these features into a final representations Fq as shown in Fig. 2. The same process is applied on each g to get Fg. Finally,Fq is compared with every Fg using Euclidean distance to calculate Rank 1 recognition accuracy. 

給定一個查詢樣本Q,目標是在圖庫集G中檢索出所有具有相同身份的集合。記G中的樣本爲g。首先將Q輸入GaitSet網絡生成多尺度特徵,再將這些特徵連接成最終表示Fq,如圖2所示。對每個g應用同樣的流程得到Fg。最後,用歐式距離比較Fq與每一個Fg,計算一次命中(Rank-1)識別正確率。
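(以下爲本人添加的測試流程示意,非原文。假設每個樣本的最終表示已拼接成一維向量;對每個probe取歐式距離最近的gallery樣本,統計一次命中率。)

```python
import torch

def rank1_accuracy(probe_feats, probe_labels, gallery_feats, gallery_labels):
    # *_feats: (N, d) 最終表示Fq/Fg; *_labels: (N,) 身份ID
    dist = torch.cdist(probe_feats, gallery_feats)    # 兩兩歐式距離
    nearest = dist.argmin(dim=1)                      # 每個probe最近的gallery下標
    return (gallery_labels[nearest] == probe_labels).float().mean().item()

pf, pl = torch.rand(100, 256), torch.randint(0, 50, (100,))
gf, gl = torch.rand(200, 256), torch.randint(0, 50, (200,))
print(rank1_accuracy(pf, pl, gf, gl))
```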

4 實驗

Our empirical experiments mainly contain three parts. The first part compares GaitSet with other state-of-the-art methods on two public gait datasets: CASIA-B (Yu, Tan, and Tan 2006) and OU-MVLP (Takemura et al. 2018b). The Second part is ablation experiments conducted on CASIA-B. In the third part, we investigated the practicality of GaitSet in three aspects: the performance on limited silhouettes, multiple views and multiple walking conditions.

我們的實驗主要包含3個部分。第一部分是比較GaitSet和其他頂級算法在2個公開數據集CASIA-B和OU-MVLP上的效果。第二部分是在CASIA-B上進行的消融實驗(類似控制變量)。第三部分從三個方面研究了GaitSet的實用性:有限輪廓數下的性能、多視角和多行走條件下的性能。

4.1 數據集和訓練細節

CASIA-B

CASIA-B dataset (Yu, Tan, and Tan 2006) is a popular gait dataset. It contains 124 subjects (labeled in 001-124), 3 walking conditions and 11 views (0◦,18◦,...,180◦). The walking condition contains normal (NM) (6 sequences per subject), walking with bag (BG) (2 sequences per subject) and wearing coat or jacket (CL) (2 sequences per subject). Namely, each subject has 11×(6+2+2) = 110 sequences.

CASIA-B 數據集是一個流行的步態數據集。其中包含124個對象(標記爲001-124號)、3種行走條件和11個視角(0°,18°,...,180°)。行走條件包括正常(NM)(每人6個序列)、揹包(BG)(每人2個序列)、穿外套或夾克(CL)(每人2個序列)。也就是說,每個人有 11×(6+2+2) = 110 個序列。

As there is no official partition of training and test sets of this dataset, we conduct experiments on three settings which are popular in current literatures. We name these three settings as small-sample training (ST), medium-sample training (MT) and large-sample training (LT). In ST, the first 24 subjects (labeled in 001-024) are used for training and the rest 100 subjects are leaved for test. In MT, the first 62 subjects are used for training and the rest 62 subjects are leaved for test. In LT, the first 74 subjects are used for training and the rest 50 subjects are leaved for test.

因爲該數據集沒有官方的訓練集和測試集劃分,我們採用當前文獻中流行的3種設置進行實驗,並將其分別稱爲小樣本訓練(ST)、中樣本訓練(MT)和大樣本訓練(LT)。ST中,前24人(001-024)用作訓練,其餘100人用於測試;MT中,前62人用作訓練,其餘62人用於測試;LT中,前74人用作訓練,其餘50人用於測試。

 In the test sets of all three settings, the first 4 sequences of the NM condition(NM #1-4) are kept in gallery, and the rest 6 sequences are divided into 3 probe subsets, i.e. NM subsets containing NM #5-6, BG subsets containing BG #1-2 and CL subsets containing CL #1-2. 

在所有三種設置的測試集中,NM條件的前4個序列(NM #1-4)保留在圖庫(gallery)中,其餘6個序列被分成3個探針(probe)子集,即包含NM #5-6的NM子集、包含BG #1-2的BG子集和包含CL #1-2的CL子集。

OU-MVLP

OU-MVLP dataset (Takemura et al. 2018b) is so far the world’s largest public gait dataset. It contains 10,307 subjects, 14 views (0◦,15◦,...,90◦; 180◦,195◦,...,270◦) per subject and 2 sequences (#00-01) per view. The sequences are divided into training and test set by subjects (5153 subjects for training and 5154 subjects for test). In the test set, sequences with index #01 are kept in gallery and those with index #00 are used as probes.

OU-MVLP 數據集是迄今爲止世界上最大的公開步態數據集,包含10307個人,每人14個視角(0°,15°,...,90°;180°,195°,...,270°),每個視角2個序列(#00-01)。全部序列按人劃分爲訓練集和測試集(5153人用於訓練,5154人用於測試)。測試集中,編號#01的序列作爲圖庫,編號#00的序列用作探針。

訓練細節

In all the experiments, the input is a set of aligned silhouettes in size of 64 × 44. The silhouettes are directly provided by the datasets and are aligned based on methods in (Takemura et al. 2018b). The set cardinality in the training is set to be 30. Adam is chosen as an optimizer (Kingma and Ba 2015). The number of scales S in HPM is set as 5. The margin in BA+ triplet loss is set as 0.2. The models are trained with 8 NVIDIA 1080TI GPUs.

在所有的實驗中,輸入都是一組64×44的對齊輪廓。輪廓由數據集直接提供,並按照(Takemura et al. 2018b)中的方法對齊。訓練時輸入集合的基數(幀數)設爲30。優化器選用Adam。HPM中的尺度數S設爲5。三元損失BA+的margin設置爲0.2。模型使用8塊NVIDIA 1080TI GPU訓練。

1) In CASIA-B, the mini-batch is composed by the manner introduced in Sec. 3.5 with p = 8 and k = 16. We set the number of channels in C1 and C2 as 32, in C3 and C4 as 64 and in C5 and C6 as 128. Under this setting, the average computational complexity of our model is 8.6GFLOPs. The learning rate is set to be 1e − 4. For ST, we train our model for 50K iterations. For MT, we train it for 60K iterations. For LT, we train it for 80K iterations. 

1)CASIA-B中,mini-batch按照前面Sec.3.5中介紹的方式構成,其中p=8,k=16。C1和C2的通道數設置爲32,C3和C4設置爲64,C5和C6設置爲128。在這種設置下,我們模型的平均計算複雜度是8.6GFLOPs。學習率設爲1e-4。ST設置下模型訓練50K次迭代,MT訓練60K次迭代,LT訓練80K次迭代。

2)In OU-MVLP, since it contains 20 times more sequences than CASIA-B, we use convolutional layers with more channels (C1 = C2 = 64,C3 = C4 = 128,C5 = C6 = 256) and train it with larger batch size (p = 32,k = 16). The learning rate is 1e−4 in the first 150K iterations, and then is changed into 1e−5 for the rest of 100K iterations. 

2)OU-MVLP包含的序列數是CASIA-B的20多倍,因此我們使用通道數更多的卷積層(C1 = C2 = 64,C3 = C4 = 128,C5 = C6 = 256),並使用更大的batch size(p = 32,k = 16)訓練。前150K次迭代學習率爲1e-4,其後100K次迭代學習率降爲1e-5。
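(非原文內容。爲便於對照,本人把上文兩個數據集的訓練配置整理成下面的字典,數值均摘自上文,字典的組織方式爲本人假設。)

```python
train_configs = {
    "CASIA-B": {
        "channels": (32, 32, 64, 64, 128, 128),        # C1~C6
        "batch": {"p": 8, "k": 16},
        "lr": 1e-4,
        "iterations": {"ST": 50_000, "MT": 60_000, "LT": 80_000},
    },
    "OU-MVLP": {
        "channels": (64, 64, 128, 128, 256, 256),
        "batch": {"p": 32, "k": 16},
        "lr": {"前150K次迭代": 1e-4, "後100K次迭代": 1e-5},
        "iterations": 250_000,
    },
}
print(train_configs["CASIA-B"]["batch"])
```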

4.2 主要結果

CASIA-B

Tab. 1 shows the comparison between the state-of-the-art methods and our GaitSet. Except for ours, other results are directly taken from their original papers. All the results are averaged on the 11 gallery views and the identical views are excluded. For example, the accuracy of probe view 36◦ is averaged on 10 gallery views, excluding gallery view 36◦.

Tab.1展示了GaitSet與頂級算法之間的比較。除了GaitSet之外,其他結果都直接引自各自的原文。所有結果均在11個圖庫視角上取平均,且不包括相同視角。例如,探針視角36°的正確率是在除36°以外的10個圖庫視角上取的平均。

An interesting pattern between views and accuracies can be observed in Tab. 1. Besides 0◦ and 180◦ , the accuracy of 90◦ is a local minimum value. It is always worse than that of 72◦ or 108◦. 

從表1中可以看出視角和正確率有一種有趣的關係。除0°和180°外,90°的精度是局部最小值。 90°時總是比72°或108°更差。

The possible reason is that gait information contains not only those parallel to the walking direction like stride which can be observed most clearly at 90◦, but also those vertical to the walking direction like a left-right swinging of body or arms which can be observed most clearly at 0◦ or 180◦. So, both parallel and vertical perspectives lose some part of gait information while views like 36◦ or 144◦ can obtain most of it. 

可能的原因是,步態信息不僅包含與行走方向平行的信息(例如步幅,這在90°時觀察得最清楚),還包含與行走方向垂直的信息(例如身體或手臂的左右擺動,這在0°或180°時觀察得最清楚)。因此,平行視角(90°)和垂直視角(0°和180°)都會丟失一部分步態信息,而像36°或144°這樣的視角可以獲得其中的大部分。

Small-Sample Training (ST) 

Our method achieves a high performance even with only 24 subjects in the training set and exceeds the best performance reported so far (Wu et al. 2017) by over 10 percent on the views they reported. There are mainly two reasons. 

我們的方法在訓練集僅有24個目標的情況下仍取得很高的性能,在其報告的視角上超過此前最佳結果(Wu et al. 2017)10%以上,主要有如下兩個原因:

1) As our model regards the input as a set, images used to train the convolution network in the main pipeline are dozens of times more than those models based on gait templates. Taking a mini-batch for an example, our model is fed with 30×128 = 3840 silhouettes while under the same batch size models using gait templates can only get 128 templates.

1)由於我們的模型將輸入視爲一組圖像,用於訓練主流水線中的卷積網絡的圖像比基於步態模板的模型多幾十倍。拿一個mini-batch舉例子,我們的模型輸入30×128=3840個輪廓而同樣batch size的步態模板類模型只能獲得128個模板。

2)Since the sample sets used in training phase are composed by frames selected randomly from the sequence, each sequence in the training set can generate multiple different sets. Thus any units related to set feature learning like MGP and HPM can also be trained well.

2)由於訓練階段使用的樣本集合由從序列中隨機選擇的幀組成,因此訓練集中的每個序列可以生成多個不同的集合。因此,與集合特徵學習相關的單元(如MGP和HPM)也可以得到很好的訓練。

Medium-Sample Training (MT) & Large-Sample Training (LT) 

Tab. 1 shows that our model obtains very nice results on the NM subset, especially on LT where results of all views except 180◦ are over 90%. On the BG and CL subsets, although the accuracies of some views like 0◦ and 180◦ are still not high, the mean accuracies of our model exceed those of other models for at least 18.8%. 

Tab.1顯示我們的模型在NM子集上獲得了非常好的結果,特別是在LT設置下,除180°以外所有視角的結果都超過90%。在BG和CL子集上,雖然0°和180°等一些視角的準確度仍然不高,但我們模型的平均精度至少超過其他模型18.8%。

OU-MVLP

Tab. 3 shows our results. As some of the previous works did not conduct experiments on all 14 views, we list our results on two kinds of gallery sets, i.e. all 14 views and 4 typical views (0◦,30◦60◦90◦). All the results are averaged on the gallery views and the identical views are excluded. The results show that our methods can generalize well on the dataset with such a large scale and wide view variation. Further, since representation for each sample only needs to be calculated once, our model can complete the test (containing 133780 sequences) in only 7 minutes with 8 NVIDIA 1080TI GPUs. It is note worthy that since some subjects miss several gait sequences and we did not remove them from the probe, the maximum of rank-1 accuracy cannot reach 100%. If we ignore the cases which have no corresponding samples in the gallery, the average rank-1 accuracy of all probe views is 93.3% rather than 87.1%. 

Tab.3顯示了我們的結果。由於之前的一些工作沒有在全部14個視角上做實驗,我們列出了兩種圖庫集上的結果,即全部14個視角和4個典型視角(0°, 30°, 60°, 90°)。所有結果均在圖庫視角上取平均,且不包括相同視角。結果顯示,我們的方法在如此大規模、視角變化如此大的數據集上仍然具有很好的泛化能力。此外,由於每個樣本的表示僅需要計算一次,使用8塊NVIDIA 1080TI GPU完成整個測試(包含133780個序列)只需7分鐘。值得注意的是,由於一些目標缺少部分步態序列,而我們沒有把它們從探針中移除,因此一次命中率的最大值達不到100%。如果忽略那些在圖庫中沒有對應樣本的情形,所有探針視角的平均一次命中率是93.3%而不是87.1%。

4.3 Ablation Experiments 消融實驗

Tab. 2 shows the thorough results of ablation experiments. The effectiveness of every innovation in Sec. 3 is studied. 

Tab.2展示了消融實驗的全部結果。研究了Sec.3中每項創新的有效性。

Set VS. GEI 

The first two lines of Tab. 2 show the effectiveness of regarding gait as a set. With fully identical networks, the result of using set exceeds that of using GEI by more than 10% on NM subset and more than 25% on CL subset. The only difference is that in GEI experiment, gait silhouettes are averaged into a single GEI before being fed into the network. 

Tab.2的前兩行顯示了將步態視爲集合的有效性。在網絡完全相同的情況下,使用集合而不是GEI,在NM子集上的結果高出10%以上,在CL子集上高出25%以上。兩者唯一的區別是,在GEI實驗中,步態輪廓在送入網絡之前被平均成一張GEI。
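(以下一小段示意代碼爲本人添加,非原文。GEI即把對齊後的輪廓逐像素取平均,壓縮成一張模板圖。)

```python
import torch

def gait_energy_image(silhouettes):          # silhouettes: (n, h, w),對齊後的0/1輪廓
    return silhouettes.float().mean(dim=0)   # 逐像素平均 -> (h, w)

print(gait_energy_image(torch.randint(0, 2, (30, 64, 44))).shape)
```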

There are mainly two reasons for this phenomenal improvement. 1) Our SP extracts the set-level feature based on high-level feature map where temporal information can be well preserved and spatial information has been sufficiently processed. 2) As mentioned in Sec. 4.2, regarding gait as a set enlarges the volume of training data.

這種顯著改善主要有兩個原因。1)我們的SP基於高層特徵圖提取集合級特徵,在那裏時間信息可以被很好地保留,空間信息也已得到充分處理。2)如Sec.4.2中提到的,將步態視作集合擴大了訓練數據量。

Impact of SP 

 In Tab. 2, the results from the third line to the eighth line show the impact of different SP strategies. SP with attention, 1×1 convolution (1 1C) joint function and max(·) obtain the highest accuracy on the NM, BG, and CL subsets respectively. Considering SP with max(·) also achieved the second best performance on the NM and BG subset and has the most concise structure, we choose it as SP in the final version of GaitSet. 

Tab.2中,第3行到第8行的結果顯示了不同SP策略的影響:帶注意力機制的SP、1×1卷積(1_1C)聯合函數和max(·)分別在NM、BG和CL子集上獲得最高正確率。但考慮到max(·)還在NM和BG子集上取得第二好的性能,且結構最簡潔,我們在GaitSet的最終版本中選擇它作爲SP。

Impact of HPM and MGP 

The second and the third lines of Tab. 2 compare the impact of independent weight in HPM. It can be seen that using independent weight improves the accuracy by about 2% on each subset. In the experiments, we also find out that the introduction of independent weight helps the network converge faster. The last two lines of Tab. 2 show that MGP can bring improvement on all three test subsets. This result is consistent with the theory mentioned in Sec. 3.4 that set-level features extracted from different layers of the main pipeline contain different valuable information. 

Tab.2的第2、3行比較了HPM中獨立權重的影響。可以看出,使用獨立權重使每個子集的正確率提高約2%。實驗中我們還發現,引入獨立權重有助於網絡更快收斂。Tab.2的最後兩行顯示MGP在所有三個測試子集上都能帶來提升。這個結果與Sec.3.4中的論述一致,即從主流程不同層中提取的集合級特徵包含不同的有價值信息。

4.4 Practicality 實用性

Due to the flexibility of set, GaitSet has great potential in more complicated practical conditions. In this section, we investigate the practicality of GaitSet through three novel scenarios. 1) How will it perform when the input set only contains a few silhouettes? 2) Can silhouettes with different views enhance the identification accuracy? 3) Whether can the model effectively extract discriminative representation from a set containing silhouettes shot under different walking conditions. It is worth noting that we did not retrain our model in these experiments. It is fully identical to that in Sec. 4.2 with setting LT. Note that, all the experiments containing random selection in this section are ran for 10 times and the average accuracies are reported. 

由於集合的靈活性,GaitSet在更複雜的實際條件下有很大潛力。在這部分,我們通過3個新穎的場景來研究GaitSet的實用性。1)當輸入集合僅包含少量輪廓時,它的表現如何?2)具有不同視角的輪廓是否可以提高識別準確度?3)模型能否有效地從包含不同行走條件下拍攝的輪廓的集合中提取判別表示?值得注意的是,我們沒有在這些實驗中重新訓練模型,它與Sec.4.2中LT設置下的模型完全相同。注:本節所有包含隨機選取的實驗都運行了10次,報告的是平均精度。

Limited Silhouettes 有限輪廓數量

 In real forensic identification scenarios, there are cases that we do not have a continuous sequence of a subject’s gait but only some fitful and sporadic silhouettes.  We simulate such a circumstance by randomly selecting a certain number of frames from sequences to compose each sample in both gallery and probe.  Fig. 5 shows the relationship between the number of silhouettes in each input set and the rank-1 accuracy averaged on all 11 probe views. 

在真實法醫鑑定場景中,很多情況下我們無法獲取目標的連續步態序列,只有一些斷斷續續零星的輪廓。我們通過隨機選取連續序列中的一些幀來模擬上述場景。Fig.5中顯示了每組輸入序列輪廓的數量和11個視角的一次命中率之間的關係。

圖5.CASIA-B數據集使用LT訓練,平均一次命中率受輪廓數量的約束。正確率是11個視角中除去相同視角的平均值。並且最終報告的結果是10次實驗的平均值。

Our method attains an 82% accuracy with only 7 silhouettes. The result also indicates that our model makes full use of the temporal information of gait. Since 1) the accuracy rises monotonically with the increase of the number of silhouettes. 2) The accuracy is close to the best performance when the samples contain more than 25 silhouettes. This number is consistent with the number of frames that one gait period contains. 

我們的方法在僅輸入7個輪廓就可以得到82%的正確率。結果還表明我們的模型充分利用了步態的時間信息。因爲:

1)隨着輪廓數量的增加,精度單調上升。

2)當樣本含量超過25個輪廓後,正確率接近最佳狀態。這個數字與一個步態週期所包含的幀數一致。

MultipleViews 多視角

There are conditions that different views of one person’s gait can be gathered. We simulate these scenarios by constructing each sample with silhouettes selected from two sequences with the same walking condition but different views. To eliminate the effects of silhouette number, we also conduct an experiment in which the silhouette number is limited to 10. Specifically, in the contrast experiments of single view, an input set is composed by 10 silhouettes from one sequence. In the two-view experiment, an input set is composed by 5 silhouettes from each of two sequences. Note that in this experiment, only probe samples are composed by the way discussed above, whereas sample in the gallery is composed by all silhouettes from one sequence.

有些情況下,我們能收集到同一個人不同視角的步態。我們通過用同一人行走條件相同但視角不同的兩個序列中選出的輪廓構成每個樣本來模擬這些場景。爲了消除輪廓數量的影響,我們還進行了把輪廓數限制爲10的實驗。具體而言,在單視角對比實驗中,一個輸入集合由一個序列中的10個輪廓構成;在雙視角實驗中,一個輸入集合由兩個序列各抽取5個輪廓組成。注意,本實驗中只有探針樣本按上述方式構成,圖庫中的樣本仍由一個序列中的全部輪廓構成。

Tab. 4 shows the results. As there are too many view pairs to be shown, we summarize the results by averaging accuracies of each possible view difference. For example, the result of 90◦ difference is averaged by accuracies of 6 view pairs (0◦&90◦,18◦&108◦,...,90◦&180◦). Further, the 9 view differences are folded at 90◦ and those larger than 90◦ are averaged with the corresponding view differences less than 90◦. For example, the results of 18◦ view difference are averaged with those of 162◦ view difference. 

Tab.4顯示了結果。由於需要展示的視角對太多,我們按每一種可能的視角差對結果取平均。例如,視角差90°的結果是6個視角對(0°&90°, 18°&108°, ..., 90°&180°)正確率的平均值。另外,9種視角差以90°爲界對摺,大於90°的視角差與對應的小於90°的視角差合併取平均。例如,18°視角差的結果與162°視角差的結果合併取平均。
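(以下爲本人按上述統計方式的理解寫的示意代碼,非原文;accs的輸入格式爲本人假設。先按視角差對各視角對的準確率取平均,再把大於90°的視角差與對應小於90°的視角差合併取平均。)

```python
def fold_by_view_difference(accs):
    # accs: {(探針視角, 圖庫視角): 準確率},單位爲度,不含相同視角
    by_diff = {}
    for (v1, v2), a in accs.items():
        by_diff.setdefault(abs(v1 - v2), []).append(a)
    avg = {d: sum(v) / len(v) for d, v in by_diff.items()}        # 按視角差取平均
    folded = {}
    for d in range(18, 91, 18):                                    # 18°,36°,...,90°
        vals = [avg[x] for x in {d, 180 - d} if x in avg]          # 與180°-d的結果合併
        if vals:
            folded[d] = sum(vals) / len(vals)
    return folded

demo = {(0, 18): 0.90, (18, 36): 0.95, (0, 162): 0.80, (90, 180): 0.85}
print(fold_by_view_difference(demo))
```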

 It can be seen that our model can aggregate information from different views and boost the performance. This can be explained by the pattern between views and accuracies that we have discussed in Sec. 4.2. Containing multiple views in the input set can let the model gather both parallel and vertical information, resulting in performance improvement. 

可以看出,我們的模型可以聚合來自不同視圖的信息並提高性能。這可以通過我們在Sec.4.2中討論的視圖和準確度之間的模式來解釋。包含多視角輸入序列可以讓模型聚集平行視角(90°)和垂直視角(0°&180°)信息,以獲得更好的表現。

Multiple Walking Conditions

In real life, it is highly possible that gait sequences of the same person are under different walking conditions. We simulate such a condition by forming input set with silhouettes from two sequences with same view but different walking conditions. We conduct experiments with different silhouette number constraints. Note that in this experiment, only probe samples are composed by the way discussed above. Any sample in the gallery is constituted by all silhouettes from one sequence. What’s more, the probe-gallery division of this experiment is different. For each subject, sequences NM #02, BG #02 and CL #02 are kept in the gallery and sequences NM #01, BG #01 and CL #01 are used as probe.
現實生活中,很可能同一個人的步態序列處於不同的行走條件下。我們通過從同一人視角相同但行走條件不同的兩個序列中抽取輪廓組成輸入集合來模擬上述情況,並在不同的輪廓數量約束下進行實驗。注意:本實驗中只有探針樣本按上述方法構造,圖庫中的每個樣本仍由一個序列中的全部輪廓構成。另外,該實驗的探針-圖庫劃分有所不同:對於每個目標,序列NM #02、BG #02和CL #02保留在圖庫中,而NM #01、BG #01和CL #01用作探針。

Tab. 5 shows the results. First, the accuracies will still be boosted with the increase of silhouette number. Second,when the number of silhouettes are fixed, the results reveal relationships between different walking conditions. Silhouettes of BG and CL contain massive but different noises, which makes them complementary with each other. Thus, their combination can improve the accuracy. However, silhouettes of NM contain few noises, so substituting some of them with silhouettes of other two conditions cannot bring extra information but only noises and can decrease the accuracies.
Tab.5顯示了結果。首先,正確率仍會隨着輪廓數量的增加而提升。其次,當輪廓數量固定時,結果揭示了不同行走條件之間的關係:BG和CL的輪廓包含大量但各不相同的噪聲,這使得它們互補,因此兩者結合可以提升準確率;但NM的輪廓幾乎不含噪聲,用另外兩種條件的輪廓替換其中一部分不會帶來額外信息,只會引入噪聲,從而降低準確率。

5 結論

In this paper, we presented a novel perspective that regards gait as a set and thus proposed a GaitSet approach. The GaitSet can extract both spatial and temporal information more effectively and efficiently than those existing methods regarding gait as a template or sequence. It also provide a novel way to aggregate valuable information from different sequences to enhance the recognition accuracy. Experiments on two benchmark gait datasets has indicated that compared with other state-of-the-art algorithms, GaitSet achieves the highest recognition accuracy, and reveals a wide range of flexibility on various complex environments, showing a great potential in practical applications. In the future, we will investigate a more effective instantiation for Set Pooling (SP) and further improve the performance in complex scenarios.

在本文中,我們提出了一種新的視角,將步態視爲一個集合,並據此提出了GaitSet方法。與那些將步態視爲模板或序列的現有方法相比,GaitSet可以更有效、更高效地提取空間和時間信息。它還提供了一種聚合不同序列中有價值信息以提高識別準確率的新方法。在兩個基準步態數據集(公開標準數據集)上的實驗表明,與其他最先進的算法相比,GaitSet實現了最高的識別精度,並在各種複雜環境中表現出很強的靈活性,在實際應用中顯示出巨大的潛力。後續,我們將研究更有效的Set Pooling(SP)實例化方式,並進一步提高複雜場景下的性能。
