MyDLNote-Network: 2020 CVPR, A U-Net Based Discriminator for Generative Adversarial Networks

A U-Net Based Discriminator for Generative Adversarial Networks

[paper] [github]

Abstract

Among the major remaining challenges for generative adversarial networks (GANs) is the capacity to synthesize globally and locally coherent images with object shapes and textures indistinguishable from real images.

To target this issue we propose an alternative U-Net based discriminator architecture, borrowing the insights from the segmentation literature.

The proposed U-Net based architecture allows to provide detailed per-pixel feedback to the generator while maintaining the global coherence of synthesized images, by providing the global image feedback as well. Empowered by the per-pixel response of the discriminator, we further propose a per-pixel consistency regularization technique based on the CutMix data augmentation, encouraging the U-Net discriminator to focus more on semantic and structural changes between real and fake images. This improves the U-Net discriminator training, further enhancing the quality of generated samples. The novel discriminator improves over the state of the art in terms of the standard distribution and image quality metrics, enabling the generator to synthesize images with varying structure, appearance and levels of detail, maintaining global and local realism.

Compared to the BigGAN baseline, we achieve an average improvement of 2.7 FID points across FFHQ, CelebA, and the proposed COCO-Animals dataset.

Part 1, background and problem: one of the major remaining challenges for GANs is the capacity to synthesize globally and locally coherent images whose object shapes and textures are indistinguishable from real images.

Part 2, the paper's core network (goal and key idea): to address this problem, the authors propose an alternative U-Net based discriminator architecture, borrowing insights from the segmentation literature.

Part 3, network description and advantages: the proposed U-Net based architecture provides detailed per-pixel feedback to the generator while maintaining the global coherence of the synthesized images. Building on the per-pixel response of the discriminator, the authors further propose a per-pixel consistency regularization technique based on the CutMix data augmentation, encouraging the U-Net discriminator to focus more on semantic and structural differences between real and fake images. This improves the training of the U-Net discriminator and further enhances the quality of the generated samples. The proposed discriminator improves over the state of the art on the standard distribution and image quality metrics, enabling the generator to synthesize images with varying structure, appearance, and levels of detail while maintaining global and local realism (i.e., looking like the real thing).

Final part, experimental conclusions.

The abstract does not describe in much detail what the network looks like; it mostly emphasizes its advantages. The likely reason is that the network itself is not complicated, while the advantages are indeed numerous.

 

Introduction

The quality of synthetic images produced by generative adversarial networks (GANs) has seen tremendous improvement recently [5, 20]. The progress is attributed to large-scale training [32, 5], architectural modifications [50, 19, 20, 27], and improved training stability via the use of different regularization techniques [34, 51]. However, despite the recent advances, learning to synthesize images with global semantic coherence, long-range structure and the exactness of detail remains challenging.

The first paragraph covers several points (the opening paragraph best shows the authors' command of the field: in a few sentences it summarizes the state of the area and states the core problem this paper addresses):

Broad research direction: GANs.

State of the field: GAN research currently advances along three axes, namely large-scale training, architecture design, and regularization techniques.

The open problem in the field: learning to synthesize images with global semantic coherence, long-range structure, and exactness of detail remains challenging.

In my view, a good piece of work first poses a good question (a novel problem the field has not yet noticed but that genuinely exists; a major problem the field cares most about that still lacks a good solution; or a significant, perhaps previously misunderstood, problem in a classic algorithm of the field, not a minor one), and then proposes a very reasonable, even ingenious, method to address it.

 

One source of the problem lies potentially in the discriminator network. The discriminator aims to model the data distribution, acting as a loss function to provide the generator a learning signal to synthesize realistic image samples. The stronger the discriminator is, the better the generator has to become. In the current state-of-the-art GAN models, the discriminator being a classification network learns only a representation that allows to efficiently penalize the generator based on the most discriminative difference between real and synthetic images. Thus, it often focuses either on the global structure or local details. The problem amplifies as the discriminator has to learn in a non-stationary environment: the distribution of synthetic samples shifts as the generator constantly changes through training, and is prone to forgetting previous tasks [7] (in the context of the discriminator training, learning semantics, structures, and textures can be considered different tasks). This discriminator is not incentivized to maintain a more powerful data representation, learning both global and local image differences. This often results in the generated images with discontinued and mottled local structures [27] or images with incoherent geometric and structural patterns (e.g. asymmetric faces or animals with missing legs) [50].

This paragraph describes the problems with current GANs in detail. Overall, the angle is fairly novel, and of course it sets the stage for the proposed U-Net GAN. This passage earns the paper a lot of credit: only someone who can spot a new problem demonstrates a truly deep understanding of the field, and that is the soul of a good paper. Now, let us look at what the problems are:

The first half of the paragraph describes the discriminator's role: it aims to model the data distribution and, as part of the adversarial loss, pushes the generator to produce more realistic outputs. Since it is usually built as a classification network, it learns only a single representation, and therefore tends to focus either on the global structure or on the local details.

The second half analyzes the weaknesses of the conventional discriminator: in the context of discriminator training, learning semantics, structures, and textures can be considered different tasks; the problem amplifies because the discriminator must learn in a non-stationary environment, i.e., the distribution of synthetic samples shifts as the generator keeps changing during training, and the discriminator is prone to forgetting previous tasks. A discriminator forced to learn several tasks at once, such as global semantics and local details, is not incentivized to maintain a powerful representation for all of them (trying to learn everything, it learns nothing well). Concretely, this leads to generated images with discontinued and mottled local structures, or with incoherent geometric and structural patterns (e.g., asymmetric faces or animals with missing legs).
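To make the contrast concrete, here is a minimal sketch (in PyTorch; not the paper's code, just the standard BigGAN-style hinge-loss setup being criticized): the discriminator outputs a single scalar per image, so the generator only ever receives one global real/fake decision and no spatially localized feedback.

```python
# Minimal sketch of the standard (global-only) discriminator objective.
# d_real / d_fake are the per-image scalar outputs of the discriminator.
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """BigGAN-style hinge loss for the discriminator (one scalar per image)."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # The generator only sees this single scalar per image, so it gets no
    # signal about *where* in the image the discriminator found it fake.
    return -d_fake.mean()
```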

 

To mitigate this problem, we propose an alternative discriminator architecture, which outputs simultaneously both global (over the whole image) and local (per-pixel) decision of the image belonging to either the real or fake class, see Figure 1. Motivated by the ideas from the segmentation literature, we re-design the discriminator to take a role of both a classifier and segmenter. We change the architecture of the discriminator network to a U-Net [39], where the encoder module performs per-image classification, as in the standard GAN setting, and the decoder module outputs per-pixel class decision, providing spatially coherent feedback to the generator, see Figure 2. This architectural change leads to a stronger discriminator, which is encouraged to maintain a more powerful data representation, making the generator task of fooling the discriminator more challenging and thus improving the quality of generated samples (as also reflected in the generator and discriminator loss behavior in Figure S1). Note that we do not modify the generator in any way, and our work is orthogonal to the ongoing research on architectural changes of the generator [20, 27], divergence measures [25, 1, 37], and regularizations [40, 15, 34].

 

This paragraph describes the U-Net GAN architecture. Its main features are the following (a minimal code sketch follows the list):

1. It simultaneously outputs a global (whole-image) and a local (per-pixel) decision on whether the image belongs to the real or the fake class; the discriminator thus acts as both a classifier and a segmenter.

2. The encoder module performs per-image classification, as in the standard GAN setting.

3. The decoder module outputs a per-pixel decision, providing spatially coherent feedback to the generator.

4. The generator is not modified in any way; U-Net GAN is orthogonal to ongoing research on generator architectures [20, 27], divergence measures [25, 1, 37], and regularizations [40, 15, 34].
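The sketch below illustrates the two-headed idea in PyTorch. It is a simplified stand-in, not the paper's exact BigGAN-based blocks: the encoder ends in a global real/fake scalar, the decoder upsamples back to full resolution for a per-pixel decision, with skip connections between matching encoder and decoder stages.

```python
# Minimal sketch of a U-Net discriminator: one global head (classifier)
# plus one per-pixel head (segmenter). Simplified, not the paper's code.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
    )

def up_block(c_in, c_out):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
    )

class UNetDiscriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = conv_block(3, ch)                 # H/2
        self.enc2 = conv_block(ch, ch * 2)            # H/4
        self.enc3 = conv_block(ch * 2, ch * 4)        # H/8 (bottleneck)
        self.cls_head = nn.Linear(ch * 4, 1)          # global per-image decision
        self.dec3 = up_block(ch * 4, ch * 2)          # H/4
        self.dec2 = up_block(ch * 2 + ch * 2, ch)     # H/2, skip from enc2
        self.dec1 = up_block(ch + ch, ch)             # H,   skip from enc1
        self.seg_head = nn.Conv2d(ch, 1, 1)           # local per-pixel decision

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        global_logit = self.cls_head(e3.mean(dim=(2, 3)))   # [B, 1]
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        pixel_logits = self.seg_head(d1)                    # [B, 1, H, W]
        return global_logit, pixel_logits
```

Both heads can then be trained with the usual GAN loss, the global head on per-image labels and the per-pixel head on per-pixel labels, so the generator is penalized both for globally implausible images and for locally fake regions.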

 

The proposed U-Net based discriminator allows to employ the recently introduced CutMix [47] augmentation, which is shown to be effective for classification networks, for consistency regularization in the two-dimensional output space of the decoder. Inspired by [47], we cut and mix the patches from real and synthetic images together, where the ground truth label maps are spatially combined with respect to the real and fake patch class for the segmenter (U-Net decoder) and the class labels are set to fake for the classifier (U-Net encoder), as globally the CutMix image should be recognized as fake, see Figure 3. Empowered by per-pixel feedback of the U-Net discriminator, we further employ these CutMix images for consistency regularization, penalizing per-pixel inconsistent predictions of the discriminator under the CutMix transformations. This fosters the discriminator to focus more on semantic and structural changes between real and fake images and to attend less to domain-preserving perturbations. Moreover, it also helps to improve the localization ability of the decoder. Employing the proposed consistency regularization leads to a stronger generator, which pays more attention to local and global image realism. We call our model U-Net GAN.

This paragraph goes through some details of U-Net GAN:

The proposed U-Net based discriminator makes it possible to use the recently introduced CutMix [47] augmentation, shown to be effective for classification networks, for consistency regularization in the two-dimensional output space of the decoder. Inspired by [47], patches from real and synthetic images are cut and mixed together; for the segmenter (the U-Net decoder), the ground-truth label map is the spatial combination of the real and fake patch classes, while for the classifier (the U-Net encoder) the class label is set to fake, since globally the CutMix image should be recognized as fake. Leveraging the per-pixel feedback of the U-Net discriminator, these CutMix images are further used for consistency regularization, penalizing per-pixel discriminator predictions that are inconsistent under the CutMix transformations. This pushes the discriminator to focus more on semantic and structural changes between real and fake images and to attend less to domain-preserving perturbations. Moreover, it also helps improve the localization ability of the decoder. The proposed consistency regularization leads to a stronger generator, which pays more attention to local and global image realism. A minimal sketch of this regularization is given below.
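Here is a minimal sketch of the CutMix consistency idea. The helper names are hypothetical, `disc` is assumed to follow the two-output signature from the earlier architecture sketch, and details such as the mask sampling differ from the paper's implementation.

```python
# Minimal sketch of CutMix-based consistency regularization on the
# per-pixel decoder output. Hypothetical helper names, not the paper's code.
import torch
import torch.nn.functional as F

def cutmix_mask(size, device):
    """Random rectangular binary mask M (1 = keep real pixel, 0 = fake patch)."""
    B, _, H, W = size
    mask = torch.ones(B, 1, H, W, device=device)
    for i in range(B):
        lam = torch.rand(1).item()                      # mixing ratio
        cut_h, cut_w = int(H * lam ** 0.5), int(W * lam ** 0.5)
        cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
        y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
        x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
        mask[i, :, y0:y1, x0:x1] = 0.0                  # paste a fake patch here
    return mask

def cutmix_consistency_loss(disc, real, fake):
    M = cutmix_mask(real.shape, real.device)
    mixed = M * real + (1.0 - M) * fake                 # the CutMix image
    # Decoder target: the mask itself is the per-pixel real/fake ground truth;
    # the encoder target (not shown) labels the whole CutMix image as fake.
    _, pix_mixed = disc(mixed)
    _, pix_real = disc(real)
    _, pix_fake = disc(fake)
    # Penalize per-pixel predictions on the CutMix image that deviate from
    # the CutMix of the predictions on the real and fake images.
    target = (M * pix_real + (1.0 - M) * pix_fake).detach()
    return F.mse_loss(pix_mixed, target)
```

This loss is added to the discriminator objective; detaching the target keeps the regularizer from collapsing the predictions on the unmixed images themselves.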

 

We evaluate the proposed U-Net GAN model across several datasets using the state-of-the-art BigGAN model [5] as a baseline and observe an improved quality of the generated samples in terms of the FID and IS metrics. For unconditional image synthesis on FFHQ [20] at resolution 256 × 256, our U-Net GAN model improves 4 FID points over the BigGAN model, synthesizing high quality human faces (see Figure 4). On CelebA [29] at resolution 128×128 we achieve 1.6 point FID gain, yielding to the best of our knowledge the lowest known FID score of 2.95. For class-conditional image synthesis on the introduced COCO-Animals dataset [28, 24] at resolution 128×128 we observe an improvement in FID from 16.37 to 13.73, synthesizing diverse images of different animal classes (see Figure 5).

This paragraph gives the experimental conclusions. In one sentence: compared with BigGAN, the proposed method performs better.

[It is 3 AM now and I am sleepy; I will keep updating the rest of this post later...]
