[Paper Translation] Chinese-English Annotated Translation — (Learning Generalized Deep Feature Representation for Face Anti-Spoofing) (Part 1)

[Started] 2018.10.23

[Completed] 2018.10.25


[Translated Title] Generalized Deep Feature Representation for Face Anti-Spoofing

[Paper Link] paper link

[Note] As this paper is long, I have split it into two parts; this is the first half.

 

[Additional Information]

1) The paper was published on May 14, 2018, in IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY.

2) Manuscript received December 20, 2017; revised March 13, 2018; accepted March 20, 2018; published April 11, 2018; date of current version May 14, 2018. This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at Nanyang Technological University, Singapore.

 

[Disclaimer] This is my own translation of the original paper. In some places I have added my own understanding, and the most common renderings are used for technical terms. Given limited time, please forgive any omissions or errors.

                                                  Title: Learning Generalized Deep Feature Representation for Face Anti-Spoofing

Abstract

     In this paper, we propose a novel framework leveraging the advantages of the representational ability of deep learning and domain generalization for face spoofing detection. In particular, the generalized deep feature representation is achieved by taking both spatial and temporal information into consideration, and a 3D convolutional neural network architecture tailored for the spatial-temporal input is proposed. The network is first initialized by training with augmented facial samples based on cross-entropy loss and further enhanced with a specifically designed generalization loss, which coherently serves as the regularization term. The training samples from different domains can seamlessly work together for learning the generalized feature representation by manipulating their feature distribution distances. We evaluate the proposed framework with different experimental setups using various databases. Experimental results indicate that our method can learn more discriminative and generalized information compared with the state-of-the-art methods.


 

Index Terms—Face spoofing, deep learning, 3D CNN, domain generalization.


 

I. INTRODUCTION

    BIOMETRICS offers a powerful and practical solution to authentication-required applications. Due to the breakthrough of biometrics authentication via deep learning and its better security capability compared with traditional authentication methods (e.g., password, secret question, token code), more and more attention has been attracted from both academia and industry nowadays. Typical biometric modalities include fingerprint, iris, face and voice print, among which “face” is the most popular one as it does not require any additional hardware infrastructure and almost all mobile phones are equipped with a front-facing camera. Despite the success of face recognition, it is still vulnerable to the presentation attacks due to the popularity of social media from which facial images are easy to acquire [1]. For instance, a presentation attack can record the face information of a person by printing (printing attack), replaying on screen (replay attack) or even counterfeiting the face via 3D masking [2] and VR [3], which brings extremely challenging security issues.


     Security concerns of face recognition systems have motivated a number of studies for face spoofing detection. From the perspective of evaluating the disturbance information injected into the spoofing media, a series of approaches aim at extracting the distortion information, which may appear on spoofed face samples. Typical spoofing artifacts include texture artifacts [4], motion artifacts [5] and image quality relevant artifacts [6]. Other approaches focus on the system level in which specific sensors (e.g., gravity sensor) can be utilized for auxiliary assistance [7] or additional hardware can be incorporated into the verification system (e.g., infrared sensor [8]). Moreover, human-computer interaction may also be required for spoofing detection (head moving, eye blinking, etc.) [9], [10].


 

      With numerous approaches proposed to deal with the artifacts within a single image, there are still two important issues in face anti-spoofing. On one hand, how to generalize well to the "unseen data" becomes pivotal, as obtaining enough data with sufficient variability in the training process is not always practical. On the other hand, much less work has been dedicated to extracting information along the temporal direction, which can also provide valuable cues (liveness information, unexpected motion [9], [10], temporal aliasing, etc.). More importantly, learning spatial plus temporal features would become more difficult, as more training data would be necessary and the lack of generalization could be even more pronounced. All these issues cast challenges on the generalization capability of robust feature representation. In view of this, we focus on deep feature representation in a generalized way by exploiting the information from both spatial and temporal dimensions. In particular, 3D convolutional neural networks (3D CNN), which have been proved to be efficient for the action recognition task [11], are employed to learn spoofing-specific information based on typical printed and replay video attacks. The solution incorporates 2D and 3D features related to the presentation attack problem, and learns not only spatial variations associated with attacks but also artifacts that take place over time. More specifically, we employ the 3D CNN architecture with a data augmentation strategy for the spoofing detection task. To obtain a more robust and generalized 3D CNN model, the lack of generalization is dealt with by introducing a regularization mechanism, which focuses on improving classification accuracy during training as well as generalizing to unknown conditions by minimizing the feature distribution dissimilarity across domains. These capabilities allow us to make a further step regarding the detection of attacks under unknown or different conditions.

    The main contributions of our work are as follows.

  • We apply a 3D CNN that takes both spatial and temporal information into consideration, with a specifically designed data augmentation method for face spoofing detection.

  • To further improve the generalization performance, we employ a generalization regularization by minimizing the Maximum Mean Discrepancy distance among different domains.

  •  We conduct extensive experimental analysis on four different datasets as well as our proposed cross-camera based protocol. The results show that our proposed framework can achieve significantly better performance compared with other state-of-the-art methods.


II. RELATED WORK

A. Face Anti-Spoofing

    In terms of various application scenarios, we roughly categorize existing face spoofing detection methods into three categories: motion analysis based [5] (which may require user cooperation), texture analysis based [4], [12], and sensor-assisted detection [7]. The first two categories can generally be applied to face verification/registration tasks with personal computers and mobile phones, while the last one requires extra hardware. To further enhance the robustness of biometric spoofing detection, other biometric information can be incorporated into the face anti-spoofing system (e.g., [13]–[16]).

 

     Motion analysis relies on extracting liveness information (e.g., eye blinking, lip movement, head rotation) for distinguishing between genuine and spoofed faces. For instance, such liveness information can be obtained via optical flow. In [5], Kollreider et al. reported that even subtle movement can be regarded as a motion cue. For these kinds of methods, user assistance is usually required. Though motion analysis based methods are effective against printed photo attacks, they may suffer performance drops when the spoofing attack is conducted by video replay.


  

       The idea of facial texture and distortion analysis originates from the assumption that the spoofed medium is likely to lack high-frequency information, due to the face media reproduction process. By analyzing the texture artifacts left behind during an attack, we can extract useful information such that genuine and spoofed faces can be properly distinguished. In [17], a texture analysis method based on the two-dimensional Fourier spectrum was conducted. In [18], Tan et al. proposed a total-variation based decomposition method and extracted the difference-of-Gaussian (DoG) information on the high-frequency part. The final model is learned in a bilinear sparse low-rank regression manner. Texture features designed for object detection/recognition tasks have also been proved to be effective for face spoofing detection. In [4], multi-scale Local Binary Patterns (LBP) with a Support Vector Machine (SVM) classifier were proposed, achieving superior performance on the NUAA [18] and Idiap REPLAY-ATTACK databases [19]. The multi-scale LBP feature was further extended to a facial component based method followed by Fisher vector encoding [20], such that more discriminative information can be extracted. Other texture features, such as the Scale Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) [21], can also be applied to the face anti-spoofing task. As the high-frequency information can also be discarded in the temporal domain, texture features based on the 2-D plane can be extended to the 3-D plane [22]. By jointly exploring color and texture information, the face anti-spoofing performance can be largely improved [12], [23]. Recently, a dynamic texture face spoofing detection method was proposed [24] by considering volume local binary count patterns. Moreover, by incorporating flash light, the texture pattern can be detected more readily [25]. Another stream of feature design is based on image quality methods. In [6], 25 quality assessment based metrics were employed as the discriminative features for face spoofing detection. In [26], the authors extended the method in a regression manner to tackle the problem whereby samples were taken from multiple camera models. In [27], a feature concatenation based method was proposed by considering specular reflection, blurriness and color distortion. However, both texture-based and distortion-based features are likely to be overfitted to one particular setup, which may limit their application in practical scenarios when confronting diverse image/video capturing conditions.


 

     In addition to motion analysis and texture analysis methods, additional sensors can also be leveraged for face spoofing detection. Compared with face images directly captured by popular camera models, 3D depth information [28], [29], multi-spectrum and infrared images [8], and even vein flow information [30] can be obtained if additional sensors are deployed. Such methods can be enhanced by audio information [31], which can further improve the robustness of face spoofing detection. However, as additional equipment is required in such methods, they are usually more expensive.


 

    Deep learning based methods have also been proved to be effective for biometric spoofing detection tasks. Yang et al. [32] first proposed to use a Convolutional Neural Network (CNN) for face spoofing detection. Some other works [33]–[36] have proposed to modify the network architecture directly, which can further improve the detection accuracy. In [37], a CNN was proved to be effective for face, fingerprint, and iris spoofing detection. Nogueira et al. [38] further showed that a pre-trained CNN model based on ImageNet [39] can be transferred to fingerprint spoofing detection without any fine-tuning process. In [2], a deep dictionary learning based method was proposed for mask attack detection. Additional information (e.g., eye blinking) can also be considered as auxiliary information by associating it with deep learning [40], which further improves the face spoofing detection performance. More recently, Atoum et al. [41] proposed a depth-based CNN for face spoofing detection to extract depth information based on RGB face images. Gan et al. [42] proposed a 3D CNN based framework to jointly capture the spatial and temporal information. As [42] also deals with 3D CNN for the PAD problem, it is important to highlight the differences between their method and the one we propose herein. In summary, our technique prioritizes 3×3×3 convolutions for better efficiency, and a streamlined strategy for temporal feature learning is adopted with different pre-processing and augmentation mechanisms. In general, deep learning methods can achieve desirable performance when the training and testing samples are acquired in very similar conditions (e.g., captured with the same type of phone). However, such an environment cannot always be ensured due to the diverse capturing devices, illumination conditions and shooting angles [43].

 

B. Multimedia Recapturing Detection

    Multimedia recapturing aims at reproducing content illegally from the perspective of security. During the multimedia content reproduction process, the camera, display screen as well as the lighting condition are carefully tuned to obtain the reproduced content with the best quality. To the best of our knowledge, the first work addressing the problem of image recapturing detection on LCD screens was proposed in [44], whereby three distortion types were analyzed, including the texture pattern caused by aliasing, the loss-of-detail pattern caused by the low resolution of LCD screens and the color distortion caused by the device gamut. To address this problem, LBP, multi-scale wavelet statistics as well as color channel statistics were combined into a single feature vector for classification. As claimed in [45], although the texture pattern can be eliminated by setting the recapturing condition properly, the loss-of-detail artifact cannot be avoided during recapturing, which can be further employed as a discriminative feature for image reproduction detection. Recently, Li et al. [46] proposed a CNN+RNN framework to exploit the deep representation of recapturing artifacts, which was proved to be effective when using 32×32 image blocks as the input of the network. For video reproduction, Wang and Farid [47] proposed to explore geometry principles based on the observation that the recaptured scene is constrained to a planar surface, while the original video was taken by projecting objects from the real world to the camera. In [47], both mathematical analysis and experimental results showed that the reproduction process causes "non-zero" skew in the projection matrix, by assuming that the skew value of the camera for the original capturing was zero. Along this vein, the algorithm proposed in [48] detected the radial lens distortion based on the geometry principle. A mathematical model was built for the lens distortion and the distorted lines based on the edges of video frames, which was regarded as a discriminative cue for reproduction identification. In [48], the characteristic ghosting artifact, which is generated by the lack of synchronization between the camera and the projected screen, can be detected by a filter composed of two Dirac impulses and serves as discriminative information.

  

III. METHODOLOGY

    Generally speaking, both spatial and temporal artifacts (e.g., unexpected texture patterns, color distortions and blurring [44], [49]) may occur during the face spoofing process. Regarding the texture pattern, the pattern appearing in the spatial dimension is caused by the mismatch between the replay device resolution and the capturing device resolution [17], and texture distortion appears on the replay medium due to blurring artifacts [27] and surface/glasses reflection [50], while in the temporal domain it derives from the divergence between the flash frequency of the display device (e.g., 120 Hz) and the sampling frequency of the video signal (e.g., 25 Hz). The color distortion is due to the mismatch of color gamut between the display medium and the recapturing model [51], [52]. Besides the texture pattern and color distortion, unexpected motion such as display device shaking along the temporal dimension can also be beneficial for spoofing detection. Instead of using hand-crafted features to infer the distinctive information, applying a Convolutional Neural Network (CNN) to spoofing detection has shown promising results for different spoofing setups. However, as most of the currently adopted CNN models for spoofing detection are based on 2D images trained in a label-guided manner [37], [38], [41], there are two outstanding limitations:

  • Due to the limitation of the 2D CNN structure, the temporal statistics encoded in contiguous frames are ignored.

  • Directly applying the classification loss with label information can lead to an overfitting problem on a certain database collection. In this scenario, the trained model cannot generalize well to unseen data.

 


 

In view of these limitations, we develop a 3D CNN architecture such that discriminative information can be learned from both spatial and temporal dimensions. In particular, when training and testing samples are captured under similar environments, our model can achieve a lower error rate compared with 2D CNN models as well as other handcrafted features used in prior art. More importantly, when training a CNN with face samples collected from different cameras under diverse illumination conditions, the extracted features across domains are expected to lie in a similar manifold, such that a classifier trained with such features will have better generalization ability. In view of this, we also take advantage of domain generalization in network training by introducing a regularization term, which forces the learned features to share similar distributions. The pipeline of our proposed scheme is shown in Fig. 1.


 

A. 3D Convolutional Neural Network

     In the 2D convolutional neural network, the convolution process is only applied on 2D feature maps to compute responses in the spatial dimension, which largely ignores the temporal information. In contrast with 2D CNN, a 3D CNN convolves an input cube, stacked from multiple contiguous frames, with a 3D kernel. We denote the 3D convolution kernel size in the $l$-th layer by $W_l \times H_l \times T_l$, where $T_l$ denotes the temporal depth and $W_l \times H_l$ represents the spatial size of the kernel. As such, the temporal information can also be preserved in the feature map. By jointly considering the temporal information, we can achieve better feature learning capability for face spoofing detection. In particular, each convolution operation is followed by a non-linear activation function such as ReLU. Mathematically, such a process can be formulated as:

$$v_{l,d_2}^{(i,j,k)} = \sigma\Big(\sum_{d_1}\sum_{m=0}^{W_l-1}\sum_{n=0}^{H_l-1}\sum_{p=0}^{T_l-1} w_{l,d_1,d_2}^{(m,n,p)}\, v_{l-1,d_1}^{(i+m,\,j+n,\,k+p)} + b_{l,d_2}\Big),$$

where $v_{l-1,d_1}^{(i,j,k)}$ is the value of the unit at position $(i,j,k)$ in the $d_1$-th feature map from the $(l-1)$-th layer, $w_{l,d_1,d_2}^{(m,n,p)}$ is the value of the element at position $(m,n,p)$ of the 3D convolution kernel connected to the $d_2$-th feature map in the $l$-th layer, $b_{l,d_2}$ is the bias term, and $\sigma(\cdot)$ denotes a non-linear activation layer. Subsequently, a 3D pooling layer is applied to reduce the resolution of the feature maps and enhance the invariance of the input signals to distortions. According to the research in [53], smaller receptive fields of 3D convolution kernels with deeper architectures can yield better performance for video classification. Although our problem is different from [53], we found that adopting a smaller receptive field leads to better results for face spoofing detection as well. Therefore, in the 3D CNN architecture, we only consider a spatial-temporal receptive field of 3 × 3 × 3. The proposed 3D CNN model is detailed in Table I. This architecture has five convolutional layers followed by the fully connected layer. The study regarding the appropriate number of convolutional layers is presented in Section IV-D.

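As a concrete illustration of the 3D convolution formula above, the following pure-Python sketch computes a single output unit of one feature map. The function name, the nested-list data layout, and the LeakyReLU slope of 0.01 are illustrative assumptions, not details from the paper:

```python
def conv3d_unit(prev_maps, kernels, bias, i, j, k):
    """Compute one output unit of feature map d2 at position (i, j, k).

    prev_maps: list over d1 of 3D feature maps (nested lists indexed [x][y][t])
               from layer l-1.
    kernels:   list over d1 of 3D kernels (Wl x Hl x Tl nested lists) that
               connect each input map to output map d2.
    bias:      scalar bias term b[l][d2].
    """
    s = bias
    for v, w in zip(prev_maps, kernels):
        for m in range(len(w)):
            for n in range(len(w[0])):
                for p in range(len(w[0][0])):
                    # multiply-accumulate over the 3D receptive field
                    s += w[m][n][p] * v[i + m][j + n][k + p]
    # LeakyReLU activation (slope 0.01 is an assumed value)
    return s if s > 0 else 0.01 * s
```

A framework implementation (e.g., a `Conv3d` layer) vectorizes this computation over all positions and output maps; the loop form only mirrors the equation term by term.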

                                        Table I. Architecture of the proposed 3D convolutional neural network.

    Fig. 1. Pipeline of the proposed face spoofing detection scheme. The final objective function is jointly determined by the classification loss and the generalization loss, and the output of the fc2 layer serves as the latent discriminative feature for classification. Each 3D convolutional layer consists of 3D convolution, 3D batch normalization, a LeakyReLU layer and 3D max pooling; the second fully connected layer (fc2) is used for latent discriminative feature extraction.

 

B. Data Augmentation

   As can be observed from Table I, our proposed 3D CNN model has more than 4M parameters to be optimized. However, the existing samples in public databases are not enough to train such a model. Therefore, an overfitting problem cannot be avoided due to the large number of parameters in the model and the sparsity of training samples. To address this issue, we propose a data augmentation method based on video cubes to increase the amount of training data. It should be noted that traditional augmentation methods, such as injecting additional noise, may not be feasible for the spoofing detection problem, given that the distortion information plays a key role in face spoofing detection. Therefore, a strategy of augmenting the video cubes is developed for this task.


   

    1) Spatial Augmentation: To mitigate the variation of background for face spoofing detection, face detection is usually conducted as a pre-processing step [19]. However, variations of the background near face regions can even be beneficial to face spoofing detection when considering deep learning approaches, as spoofing artifacts can come from the background region or the bezel of the spoofing medium. Therefore, we propose to shift the bounding box in four different directions (up, down, right and left) by α · l, where l is equal to the width/height of the bounding box. The parameter α is a predefined scaling factor, which is empirically set to 0.2 in our work. We stop the spatial augmentation if the bounding box moves out of the image boundary. We show an example of spatial augmentation in Fig. 2.


            Fig. 2. Illustration of spatial augmentation.
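The shifting rule above can be sketched in a few lines of pure Python. The function name `spatial_shifts` and the `(x, y, w, h)` box format are illustrative assumptions; only the shift step α · l and the boundary check come from the paper:

```python
def spatial_shifts(box, img_w, img_h, alpha=0.2):
    """Generate shifted copies of a face bounding box for spatial augmentation.

    box: (x, y, w, h) with (x, y) the top-left corner. The shift step is
    alpha * w horizontally and alpha * h vertically (alpha = 0.2 in the
    paper). Shifts that move the box outside the image boundary are dropped.
    """
    x, y, w, h = box
    dx, dy = alpha * w, alpha * h
    candidates = [
        (x, y - dy, w, h),  # up
        (x, y + dy, w, h),  # down
        (x + dx, y, w, h),  # right
        (x - dx, y, w, h),  # left
    ]
    return [
        (bx, by, w, h)
        for (bx, by, w, h) in candidates
        if bx >= 0 and by >= 0 and bx + w <= img_w and by + h <= img_h
    ]
```

For a box well inside the image this yields four augmented crops; near the border, out-of-bounds shifts are discarded, matching the stopping rule described above.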

 

      2) Gamma Correction Based Augmentation: To take into consideration the diversity of display media due to different types of capturing devices, we conduct a gamma correction based augmentation on each individual frame of a given video cube. Considering a face captured by a certain camera model with gamma value $\gamma_1$, the gamma correction process to $\gamma_2$ can be represented as:

$$I_{aug} = \left| 255 \cdot \left( \frac{I}{255} \right)^{\gamma_2/\gamma_1} \right|,$$

where $I$ and $I_{aug}$ are the original pixel and augmented pixel, respectively, in RGB space, and $|\cdot|$ denotes the rounding and truncation operations, where the output value is truncated into the range [0, 255]. Since cameras perform linear correction ($\gamma = 1.0$) and exponential gamma correction (e.g., $\gamma = 2.2$) before display, we choose the ratio $\gamma_2/\gamma_1$ to be 1.0/2.2 and 2.2/1.0 for augmentation in our work. We show an example of gamma correction based augmentation in Fig. 3.

Fig. 3. Illustration of gamma correction based data augmentation. (a) Original face; (b) face with gamma correction ratio 1.0/2.2; (c) face with gamma correction ratio 2.2/1.0.
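The per-pixel gamma correction above is straightforward to sketch; the function name is an illustrative assumption, and only the formula and the two ratios 1.0/2.2 and 2.2/1.0 come from the paper:

```python
def gamma_augment(pixel, gamma_ratio):
    """Apply gamma correction based augmentation to one RGB channel value.

    pixel: original intensity I in [0, 255]; gamma_ratio: gamma2 / gamma1,
    chosen as 1.0/2.2 or 2.2/1.0 in the paper. Implements
    I_aug = round(255 * (I / 255) ** gamma_ratio), truncated into [0, 255].
    """
    out = round(255 * (pixel / 255) ** gamma_ratio)
    return max(0, min(255, out))
```

A ratio above 1 darkens mid-tones while a ratio below 1 brightens them, simulating the two directions of the camera/display gamma mismatch.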

 

C. Model Generalization

     Although deep learning is powerful in learning representative information when the training data are diverse, it may still suffer from performance degradation when the test data are "unseen", e.g., test samples obtained from a different environment than the training data. Generally speaking, it is impossible to involve face samples captured by all types of cameras in every potential scenario. In view of this, we leverage the advantage of domain generalization [54] to solve this problem. More specifically, given face samples from a few different capturing conditions, by partitioning the face samples into different domains based on the capturing conditions, we aim at learning a robust representation across different domains for face spoofing detection by introducing the generalization loss as a regularization term. As such, the generalization capability of the network can be better enhanced.


   

   Suppose we have samples from $L$ training domains, denoted by $X = [X_1, X_2, \ldots, X_L]$, where $X_i$ represents the samples from domain $i$. The total number of samples in $X$ is $N_1 + N_2 + \cdots + N_L$, where $N_1, N_2, \ldots, N_L$ are the numbers of samples from each domain. Furthermore, we denote the feature input of the $f$-th fully connected layer of the network as $Y_f = [Y_f^1, Y_f^2, \ldots, Y_f^L]$, where $Y_f^i$ denotes the features of the $f$-th fully connected layer obtained from domain $i$, and $y_f^{i,k}$ the input feature of the $k$-th sample in $Y_f^i$. To align the feature distributions from different domains, we adopt the Maximum Mean Discrepancy (MMD) [55], a popular metric measuring the similarity between two distributions, to minimize the feature distribution divergence across domains. Accordingly, given two distributions, they are identical if the MMD distance between them equals zero. To learn the generalized feature representation, we aim to optimize the network that embeds the input samples $X$ into $Y_f$ such that the MMD distance among different domains is minimized [55].

 

 

    The MMD distance among multiple domains is defined as:

$$\mathrm{MMD}_f = \sum_{1 \le i < j \le L} \Big\| \frac{1}{N_i}\sum_{k=1}^{N_i} y_f^{i,k} - \frac{1}{N_j}\sum_{k=1}^{N_j} y_f^{j,k} \Big\|^2,$$

which can be further rewritten as:

$$\mathrm{MMD}_f = \mathrm{tr}(K_f Q),$$

where $K_f$ is the Gram matrix based on $Y_f$, $K_f = Y_f^{\top} Y_f$, and $Q$ is a coefficient matrix defined over pairs of samples according to their domains. In particular, the block $Q_{i,j}$ for domains $i$ and $j$ is defined as:

$$Q_{i,j} = \begin{cases} \dfrac{L-1}{N_i^2}\,\mathbf{1}_{N_i \times N_i}, & i = j, \\[2mm] -\dfrac{1}{N_i N_j}\,\mathbf{1}_{N_i \times N_j}, & i \ne j, \end{cases}$$

where $\mathbf{1}_{M \times N}$ denotes the all-ones matrix of dimension $M \times N$. The gradient of the generalization loss with respect to the network parameters $\Theta$ can be computed as:

$$\frac{\partial\, \mathrm{MMD}_f}{\partial \Theta} = \frac{\partial\, \mathrm{tr}(K_f Q)}{\partial Y_f} \cdot \frac{\partial Y_f}{\partial \Theta} = Y_f (Q + Q^{\top}) \cdot \frac{\partial Y_f}{\partial \Theta},$$

where $\partial Y_f / \partial \Theta$ can be obtained via back-propagation [56].
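The pairwise MMD sum and its $\mathrm{tr}(K_f Q)$ rewriting can be checked numerically with a small pure-Python sketch (linear kernel; the function names and list-of-lists layout are illustrative assumptions):

```python
def mmd_pairwise(domains):
    """Sum of squared linear-kernel MMD distances over all domain pairs.

    domains: list of L feature sets, each a list of d-dimensional feature
    vectors (lists of floats), i.e. the per-domain fc-layer outputs Y_f^i.
    """
    def mean_vec(feats):
        d = len(feats[0])
        return [sum(v[c] for v in feats) / len(feats) for c in range(d)]

    means = [mean_vec(dom) for dom in domains]
    total = 0.0
    for i in range(len(domains)):
        for j in range(i + 1, len(domains)):
            diff = [a - b for a, b in zip(means[i], means[j])]
            total += sum(x * x for x in diff)  # squared Euclidean norm
    return total


def mmd_trace_form(domains):
    """Equivalent tr(K Q) form: K is the linear-kernel Gram matrix over all
    samples; Q has (L-1)/N_i^2 on diagonal blocks and -1/(N_i N_j) on
    off-diagonal blocks."""
    feats = [v for dom in domains for v in dom]
    sizes = [len(dom) for dom in domains]
    L = len(domains)
    # domain index of each stacked sample
    dom_of = [i for i, n in enumerate(sizes) for _ in range(n)]
    total = 0.0
    for u in range(len(feats)):
        for v in range(len(feats)):
            k_uv = sum(a * b for a, b in zip(feats[u], feats[v]))
            i, j = dom_of[u], dom_of[v]
            q = (L - 1) / sizes[i] ** 2 if i == j else -1.0 / (sizes[i] * sizes[j])
            total += k_uv * q
    return total
```

Both functions return the same value, which is the point of the $\mathrm{tr}(K_f Q)$ rewriting: the loss becomes a single matrix expression that is convenient to differentiate and back-propagate.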

 

    To learn the generalized feature representation of our proposed 3D CNN, we train the network from scratch on face samples collected from multiple domains with the cross-entropy loss $L_{cls}$ [57], while simultaneously minimizing the MMD distance among the domains. The network parameters can thus be learned by:

$$\Theta^{*} = \arg\min_{\Theta}\; L_{cls} + \lambda R,$$

where $\Theta$ denotes the network parameters, $\lambda$ is the weight of the regularization term, and $R$ is defined as:

$$R = \sum_{f=1}^{F} \mathrm{MMD}_f,$$

where $F$ is the number of fully connected layers in the network, set to 2 in our work since there are two fully connected layers in our proposed network.
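Putting the pieces together, the overall objective can be sketched as a single scalar loss. The function name and the trade-off weight `lam` are illustrative assumptions (the exact weighting is not given in this excerpt):

```python
import math

def total_loss(true_class_probs, mmd_per_fc, lam=0.1):
    """Overall objective: cross-entropy classification loss plus the
    lambda-weighted generalization regularizer R = sum of MMD_f over the
    fully connected layers (F = 2 in this paper).

    true_class_probs: predicted probability of the correct class per sample.
    mmd_per_fc:       MMD_f values, one per fully connected layer.
    lam:              trade-off weight (an assumed value, not from the paper).
    """
    ce = -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)
    return ce + lam * sum(mmd_per_fc)
```

During training, both terms are differentiated with respect to the network parameters, so the classification loss and the generalization regularizer shape the fc-layer features jointly.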

--------------------------------------------------------- To be continued ---------------------------------------------------------

 

 
