MesoNet: a Compact Facial Video Forgery Detection Network (paper notes)

This paper is actually quite easy to follow; there is nothing terribly complicated in it. The core idea is to detect forgeries from mid-level semantics: the "Meso" in MesoNet stands for mesoscopic, i.e. an intermediate scale of analysis between the microscopic and the macroscopic, which the paper explains directly:

We propose to detect forged videos of faces by placing our method at a mesoscopic level of analysis. Indeed, microscopic analyses based on image noise cannot be applied in a compressed video context where the image noise is strongly degraded. Similarly, at a higher semantic level, human eye struggles to distinguish forged images [21], especially when the image depicts a human face [1, 7]. That is why we propose to adopt an intermediate approach using a deep neural network with a small number of layers

In other words: microscopic analysis based on image noise does not work on compressed video, where the noise is heavily degraded; and at the high semantic level, the human eye struggles to tell forged images apart, especially when the image shows a human face. That is why they adopt an intermediate approach: a deep neural network with only a small number of layers.

Abstract

This paper presents a method to automatically and efficiently detect face tampering in videos, and particularly focuses on two recent techniques used to generate hyperrealistic forged videos: Deepfake and Face2Face.

In short, the paper proposes a method to automatically and efficiently detect two kinds of face tampering in videos: Deepfake and Face2Face.

Thus, this paper follows a deep learning approach and presents two networks, both with a low number of layers to focus on the mesoscopic properties of images. We evaluate those fast networks on both an existing dataset and a dataset we have constituted from online videos. The tests demonstrate a very successful detection rate with more than 98% for Deepfake and 95% for Face2Face.

So it contributes two networks with a small number of layers, and also a dataset built from online videos; I would say those are its main contributions.

1. Introduction

This section mainly covers the harm caused by fake videos and why deepfake detection is necessary, reviews some traditional detection methods, and then moves on to detection with deep learning.

It also introduces the two popular face-forgery techniques, Deepfake and Face2Face, which are exactly the two targets of this paper.

Ideally, of course, this kind of detection should run in real time.

1.1 Deepfake

To be honest, it has been a while and I do not remember the fine details, but the overall idea is clear enough: it is an auto-encoder setup, loosely in the same generative spirit as a GAN.

(figure: the paper's illustration of the Deepfake pipeline)

The figure above is the paper's illustration of Deepfake; the core is training an encoder and decoders.

The process to generate Deepfake images is to gather aligned faces of two different people A and B, then to train an auto-encoder EA to reconstruct the faces of A from the dataset of facial images of A, and an auto-encoder EB to reconstruct the faces of B from the dataset of facial images of B. The trick consists in sharing the weights of the encoding part of the two auto-encoders EA and EB, but keeping their respective decoder separated. Once the optimization is done, any image containing a face of A can be encoded through this shared encoder but decoded with decoder of EB

In short: gather aligned faces of two people A and B, train an auto-encoder EA to reconstruct A's faces from A's face dataset and an auto-encoder EB to reconstruct B's faces from B's. The trick is that EA and EB share the weights of their encoder but keep their respective decoders separate. Once training is done, any image containing A's face can be encoded with the shared encoder and decoded with EB's decoder.

To restate it: Deepfake trains one shared, general-purpose encoder plus a separate decoder per person, and that is enough to turn A's face into B's. Take the current image, encode the face with the shared encoder, decode it with B's decoder, and out comes the forged face.
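The swap described above can be sketched in a toy way. This is pure Python with no real neural network: `shared_encoder`, `make_decoder` and the dict "faces" are made-up stand-ins just to show the wiring of one shared encoder and per-identity decoders, not the paper's implementation.

```python
# Toy sketch of Deepfake's inference-time swap: one shared encoder,
# one decoder per identity. Real Deepfake trains conv auto-encoders;
# here the "networks" are trivial functions to show the structure only.

def shared_encoder(face):
    """Compress a face to a latent code (stand-in for the shared encoder)."""
    return {"pose": face["pose"]}

def make_decoder(identity):
    """Build a per-person decoder that renders its identity in any pose."""
    def decoder(code):
        return {"identity": identity, "pose": code["pose"]}
    return decoder

decoder_a = make_decoder("A")
decoder_b = make_decoder("B")

# Normal reconstruction: A's face in, A's face out.
face_of_a = {"pose": "smiling"}
assert decoder_a(shared_encoder(face_of_a))["identity"] == "A"

# The swap: encode A's face, but decode with B's decoder.
fake = decoder_b(shared_encoder(face_of_a))
print(fake)  # {'identity': 'B', 'pose': 'smiling'}
```

The design point is that the shared encoder is forced to capture identity-independent information (pose, expression, lighting), while each decoder supplies the identity.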

Fakes made this way handle fine details rather poorly, and one can try to exploit this weakness to detect them.

Basically, the extraction of faces and their reintegration can fail, especially in the case of face occlusions: some frames can end up with no facial reenactment or with a large blurred area or a doubled facial contour. However, those technical errors can easily be avoided with more advanced networks. More deeply, and this is true for other applications, autoencoders tend to poorly reconstruct fine details because of the compression of the input data on a limited encoding space, the result thus often appears a bit blurry.

1.2 Face2Face

Reenactment methods, like [9], are designed to transfer image facial expression from a source to a target person.

The paper does not go into detail, but this method manages face reenactment in real time, which is what makes it so striking.

2. Proposed method

2.1 Meso-4

The method itself is very straightforward.

(figure: the Meso-4 architecture from the paper)
The figure above basically says it all: four convolution-plus-pooling blocks followed by a small fully-connected classifier.

2.2 MesoInception-4

This variant is also simple; it basically follows the figure below:

(figure: the MesoInception-4 architecture and its inception module, from the paper)
The branch widths a, b, c, d are hyperparameters set by hand; the paper experiments with several settings here.

(table: the (a, b, c, d) settings evaluated in the paper)
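A small bit of bookkeeping helps read those settings: as I understand the module, each MesoInception block runs four parallel branches (a 1x1 conv of width a, and 3x3 convs of widths b, c, d, the latter dilated) and concatenates their outputs along the channel axis, so the module outputs a+b+c+d channels. The concrete values below are placeholders to show the arithmetic, not the paper's chosen setting.

```python
def inception_output_channels(a, b, c, d):
    """Channels after concatenating the four branches of one module.

    a, b, c, d are the per-branch filter counts (the hyperparameters
    the paper tunes); concatenation simply adds the channel counts.
    """
    return a + b + c + d

# Hypothetical example values, just to illustrate:
print(inception_output_channels(1, 4, 4, 2))  # 11 channels into the next layer
```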

3. Experiments

The results are as follows:

(tables: detection results from the paper's experiments)

3.4 Image aggregation

The paper also proposes aggregating predictions over frames to improve the results.

Theoretically speaking, there is no justification for a gain in scores or a confidence interval indicator as frames of a same video are strongly correlated to one another

As the authors themselves admit, in theory this should not help, since frames of the same video are strongly correlated.

In practice, for the viewer comfort, most filmed face contain a majority of stable clear frames. The effect of punctual movement blur, face occlusion and random misprediction can thus be outweighted by a majority of good predictions on a sample of frames taken from the video
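The aggregation itself is just averaging the per-frame network scores into one video-level decision. A minimal sketch follows; the helper name `aggregate_video` and the 0.5 threshold are my own choices, not from the paper.

```python
def aggregate_video(frame_scores, threshold=0.5):
    """Average per-frame forgery scores into one video-level decision.

    frame_scores: network outputs in [0, 1] for frames sampled from one
    video (hypothetical helper illustrating the paper's aggregation).
    """
    mean = sum(frame_scores) / len(frame_scores)
    return mean, "forged" if mean > threshold else "real"

# A couple of occluded/blurred frames mispredicted low (0.1, 0.2) get
# outweighed by the majority of clear, well-predicted frames.
scores = [0.9, 0.85, 0.1, 0.92, 0.88, 0.2, 0.95, 0.9]
print(aggregate_video(scores))
```

This mirrors the argument in the quote: isolated bad frames are outvoted by the stable majority, even if the theoretical justification is weak because frames are correlated.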

(figure: results of the video-level aggregation)

3.5. Aggregation on intra-frames

They also perform the aggregation on intra-frames (I-frames) only.

(table: aggregation results on intra-frames)

3.6. Intuition behind the network

Finally, the paper offers some intuition for why the network works: it picks up mid-level semantics, and regions such as the eyes, nose and mouth turn out to play a clear role in the decision, as the figure below shows:

(figure: filter/activation visualizations from the paper)
From those visualizations, the difference between real and forged faces around the eyes and similar regions is fairly apparent.

4. Conclusion

Our experiments show that our method has an average detection rate of 98% for Deepfake videos and 95% for Face2Face videos under real conditions of diffusion on the internet.
