MyDLNote-Inpainting: [2019 ICCV] Free-Form Image Inpainting with Gated Convolution

Free-Form Image Inpainting with Gated Convolution

At CVPR 2018, the authors published Generative Image Inpainting with Contextual Attention; reading the two papers together helps to see how the authors' thinking on this problem evolved over the year. Notably, the network architecture of this paper is essentially identical to that of the CVPR 2018 paper; the new ideas lie only in the basic convolution operator and the GAN training.

 

[paper] : Free-Form Image Inpainting with Gated Convolution (2019 ICCV)

[paper] : Generative Image Inpainting with Contextual Attention (2018 CVPR)

[github] : An open source framework for generative image inpainting task, with the support of Contextual Attention (CVPR 2018) and Gated Convolution (ICCV 2019 Oral).

 

Abstract

We present a generative image inpainting system to complete images with free-form mask and guidance.

One sentence stating what this paper does.

The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed for a single rectangular mask are not applicable. Thus, we also present a patch-based GAN loss, named SN-PatchGAN, by applying spectral-normalized discriminator on dense image patches. SN-PatchGAN is simple in formulation, fast and stable in training.

The abstract is structured as follows:

It first states the paper's first 1) contribution, 2) method, and 3) problem solved: 1) gated convolution; 2) a learnable dynamic feature selection mechanism for each channel at each spatial location; 3) it fixes the flaw of vanilla convolution, which treats all input pixels as equally valid.

It then gives the second 1) contribution, 2) method, and 3) problem solved in the same pattern: 1) spectrally normalized PatchGAN (SN-PatchGAN); 2) a spectral-normalized discriminator applied to dense image patches; 3) since free-form masks may appear anywhere in an image with any shape, global and local GANs designed for a single rectangular mask are not applicable.

 

Results on automatic image inpainting and user-guided extension demonstrate that our system generates higher-quality and more flexible results than previous methods. Our system helps user quickly remove distracting objects, modify image layouts, clear watermarks and edit faces.

Finally, the experimental conclusions and applications.

 

Introduction

Image inpainting (a.k.a. image completion or image hole-filling) is a task of synthesizing alternative contents in missing regions such that the modification is visually realistic and semantically correct. It allows to remove distracting objects or retouch undesired regions in photos. It can also be extended to tasks including image/video un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization and many others.

First paragraph: what image inpainting is for.

In computer vision, two broad approaches to image inpainting exist: patch matching using low-level image features and feed-forward generative models with deep convolutional networks. The former approach [3, 8, 9] can synthesize plausible stationary textures, but usually makes critical failures in non-stationary cases like complicated scenes, faces and objects. The latter approach [15, 49, 45, 46, 38, 37, 48, 26, 52, 33, 35, 19] can exploit semantics learned from large scale datasets to synthesize contents in nonstationary images in an end-to-end fashion.

However, deep generative models based on vanilla convolutions are naturally ill-fitted for image hole-filling because the spatially shared convolutional filters treat all input pixels or features as same valid ones. For hole-filling, the input to each layer are composed of valid pixels/features outside holes and invalid ones in masked regions. Vanilla convolutions apply same filters on all valid, invalid and mixed (for example, the ones on hole boundary) pixels/features, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses surrounding holes when tested on free-form masks [15, 49].

Paragraphs two and three: the two traditional families of approaches, with their pros and cons:

patch matching: pros: can synthesize plausible stationary textures; cons: usually fails badly in non-stationary cases such as complicated scenes, faces and objects.

feed-forward generative models: pros: can exploit semantics learned from large-scale datasets to synthesize contents in non-stationary images end to end; cons: models built on vanilla convolution are naturally ill-suited to hole-filling, because the spatially shared filters treat all input pixels/features as equally valid, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses around holes.

 

To address this limitation, recently partial convolution [23] is proposed where the convolution is masked and normalized to be conditioned only on valid pixels. It is then followed by a rule-based mask-update step to update valid locations for next layer. Partial convolution categorizes all input locations to be either invalid or valid, and multiplies a zero-or-one mask to inputs throughout all layers. The mask can also be viewed as a single un-learnable feature gating channel . However this assumption has several limitations. First, considering the input spatial locations across different layers of a network, they may include (1) valid pixels in input image, (2) masked pixels in input image, (3) neurons with receptive field covering no valid pixel of input image, (4) neurons with receptive field covering different number of valid pixels of input image (these valid image pixels may also have different relative locations), and (5) synthesized pixels in deep layers. Heuristically categorizing all locations to be either invalid or valid ignores these important information. Second, if we extend to user-guided image inpainting where users provide sparse sketch inside the mask, should these pixel locations be considered as valid or invalid? How to properly update the mask for next layer? Third, for partial convolution the “invalid” pixels will progressively disappear layer by layer and the rule-based mask will be all ones in deep layers. However, to synthesize pixels in hole these deep layers may also need the information of whether current locations are inside or outside the hole. The partial convolution with all-ones mask cannot provide such information. We will show that if we allow the network to learn the mask automatically, the mask may have different values based on whether current locations are masked or not in input image, even in deep layers.

Paragraph four: the strengths and, more importantly, the weaknesses of the most recent and most closely related method (the weaknesses are exactly what this paper sets out to fix).

partial convolution

1) What partial convolution is: see my post MyDLNote: Partial Conv. or the original paper, partial convolution [2018 ECCV]; a minimal code sketch is also given after this list.

2) Its drawbacks:

First: considering the input spatial locations passed across the different layers of a network, they may include (1) valid pixels of the input image; (2) masked pixels of the input image; (3) neurons whose receptive field covers no valid pixel of the input image; (4) neurons whose receptive field covers different numbers of valid pixels (and those valid pixels may sit at different relative locations); and (5) synthesized pixels in deep layers. Heuristically classifying every location as either valid or invalid throws all of this information away.

Second: if we extend to user-guided inpainting, where the user provides a sparse sketch inside the mask, should those pixel locations be treated as valid or invalid? And how should the mask be updated for the next layer?

Third: with partial convolution the "invalid" pixels progressively disappear layer by layer, so the rule-based mask becomes all ones in the deep layers. Yet to synthesize pixels inside the hole, those deep layers may still need to know whether the current location lies inside or outside the hole, and a partial convolution with an all-ones mask cannot provide that information. The paper shows that if the network is allowed to learn the mask automatically, the mask can take different values depending on whether the current location was masked in the input image, even in deep layers.
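For reference, here is a minimal sketch of a partial convolution layer with its rule-based mask update, assuming a binary mask with 1 marking valid pixels (my own simplified rendering of the idea, not the official code):

```python
import torch
import torch.nn.functional as F

def partial_conv(x, mask, weight, bias=None):
    """Sketch of partial convolution (Liu et al., ECCV 2018).
    x: (N, C, H, W); mask: (N, 1, H, W) with 1 = valid pixel."""
    kh, kw = weight.shape[-2:]
    pad = (kh // 2, kw // 2)
    # Convolve only the valid pixels (invalid ones are zeroed out).
    out = F.conv2d(x * mask, weight, bias=None, padding=pad)
    # Count valid pixels under each sliding window for re-normalization.
    window = torch.ones(1, 1, kh, kw, device=x.device)
    valid = F.conv2d(mask, window, padding=pad)
    out = out * (kh * kw) / valid.clamp(min=1e-8)
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    # Rule-based hard mask update: a location becomes valid as soon as any
    # valid pixel falls inside its window, so deep layers become all ones.
    new_mask = (valid > 0).float()
    return out * new_mask, new_mask
```

Note how the updated mask is a hard zero-or-one gate fixed by a rule; the paper's point is that this gate should instead be learned.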

 

We propose gated convolution for free-form image inpainting. It learns a dynamic feature gating mechanism for each channel and each spatial location (for example, inside or outside masks, RGB channels or user-guidance channels). Specifically we consider the formulation where the input feature is firstly used to compute gating values g = \sigma(w_g x) (\sigma is sigmoid function, w_g is learnable parameter). The final output is a multiplication of learned feature and gating values y = \phi(wx) \odot g where \phi can be any activation function. Gated convolution is easy to implement and performs significantly better when (1) the masks have arbitrary shapes and (2) the inputs are no longer simply RGB channels with a mask but also have conditional inputs like sparse sketch. For network architectures, we stack gated convolution to form an encoder-decoder network following [49]. Our inpainting network also integrates contextual attention module within same refinement network [49] to better capture long-range dependencies.

Paragraph five: the paper's central device, gated convolution, and the network architecture.

1) Gated convolution

A gated convolution has two branches: a gating branch and a feature (main) branch.

Gating branch: one convolution layer followed by a sigmoid: g = \sigma(w_g x)

Feature branch: one convolution layer followed by an activation function: \phi(wx)

The two branches are combined by element-wise multiplication over all positions and channels, somewhat like mixed spatial-channel attention without the residual connection: y = \phi(wx) \odot g. A minimal code sketch is given below.

2) Network architecture

Gated convolutions are stacked into an encoder-decoder.

The overall architecture is the same as the authors' CVPR 2018 network, except that every (dilated) convolution is replaced with a gated (dilated) convolution.
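A minimal PyTorch sketch of such a gated convolution layer, following the two-branch formulation above (illustrative, not the authors' official implementation; the hyperparameter defaults are my own choices):

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: y = phi(conv_f(x)) * sigmoid(conv_g(x))."""

    def __init__(self, in_ch, out_ch, ksize=3, stride=1, dilation=1):
        super().__init__()
        pad = dilation * (ksize - 1) // 2
        # Feature branch and gating branch: two parallel convolutions.
        self.feature = nn.Conv2d(in_ch, out_ch, ksize, stride, pad, dilation)
        self.gate = nn.Conv2d(in_ch, out_ch, ksize, stride, pad, dilation)
        self.activation = nn.ELU()  # phi can be any activation

    def forward(self, x):
        # Learnable soft gating in (0, 1), per channel and per location.
        return self.activation(self.feature(x)) * torch.sigmoid(self.gate(x))
```

Some public implementations instead use a single convolution with twice the output channels and split the result into a feature half and a gating half; the math is the same.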

 

Without compromise of performance, we also significantly simplify training objectives as two terms: a pixelwise reconstruction loss and an adversarial loss. The modification is mainly designed for free-form image inpainting. As the holes may appear anywhere in images with any shape, global and local GANs [15] designed for a single rectangular mask are not applicable. Instead, we propose a variant of generative adversarial networks, named SN-PatchGAN, motivated by global and local GANs [15], MarkovianGANs [21], perceptual loss [17] and recent work on spectral-normalized GANs [24]. The discriminator of SN-PatchGAN directly computes hinge loss on each point of the output map with format \mathbb{R}^{h\times w\times c} , formulating h \times w \times c number of GANs focusing on different locations and different semantics (represented in different channels). SN-PatchGAN is simple in formulation, fast and stable in training and produces high-quality inpainting results.

Paragraph six: the second major contribution, SN-PatchGAN.

SN-PatchGAN has two ingredients:

1) The discriminator computes the hinge loss directly at every point of its output map of shape \mathbb{R}^{h\times w\times c}, i.e. the output is a 3D tensor, effectively forming h \times w \times c GANs that focus on different locations and different semantics (represented by different channels).

Figure 3 makes this obvious at a glance: the earliest discriminators produced a single scalar, PatchGAN later produced a 2D map, and this paper produces a 3D output. It is an interesting idea, because it combines the GAN principle with a flavor of feature matching.

2) Spectral normalization is applied to the discriminator.
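A sketch of such a discriminator in PyTorch: every convolution is wrapped in spectral normalization, and there is no pooling or fully connected head, so the output is a 3D map judged pointwise by the hinge loss (the channel widths and depth below are my assumptions, not necessarily the paper's exact configuration):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNPatchDiscriminator(nn.Module):
    """Sketch of an SN-PatchGAN discriminator with a 3D output map."""

    def __init__(self, in_ch=5, cnum=64):  # assumed input: RGB + mask + sketch
        super().__init__()
        def block(ci, co):
            return nn.Sequential(
                spectral_norm(nn.Conv2d(ci, co, kernel_size=5,
                                        stride=2, padding=2)),
                nn.LeakyReLU(0.2, inplace=True))
        self.net = nn.Sequential(
            block(in_ch, cnum), block(cnum, 2 * cnum),
            block(2 * cnum, 4 * cnum), block(4 * cnum, 4 * cnum),
            block(4 * cnum, 4 * cnum), block(4 * cnum, 4 * cnum))

    def forward(self, x):
        return self.net(x)  # (N, 4*cnum, h, w): one "GAN" per point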

 

For practical image inpainting tools, enabling user interactivity is crucial because there could exist many plausible solutions for filling a hole in an image. To this end, we present an extension to allow user sketch as guided input. Comparison to other methods is summarized in Table 1. Our main contributions are as follows: (1) We introduce gated convolution to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving the color consistency and inpainting quality of free-form masks and inputs. (2) We present a more practical patch-based GAN discriminator, SN-PatchGAN, for free-form image inpainting. It is simple, fast and produces high-quality inpainting results. (3) We extend our inpainting model to an interactive one, enabling user sketch as guidance to obtain more user-desired inpainting results. (4) Our proposed inpainting system achieves higher-quality free-form inpainting than previous state of the arts on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. We show that the proposed system helps user quickly remove distracting objects, modify image layouts, clear watermarks and edit faces in images.

Table 1: Comparison of different approaches including PatchMatch [3], Global&Local [15], ContextAttention [49], PartialConv [23] and our approach. The comparison of image inpainting is based on four dimensions: Semantic Understanding, Non-Local Algorithm, Free-Form Masks and User-Guided Option.

Paragraph seven: summary of the contributions.

First, the method supports interactive use, since there can be many plausible ways to fill a hole in an image. A feature-level comparison with other methods is given in Table 1.

The main contributions are:

(1) Gated convolution is introduced to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving color consistency and inpainting quality for free-form masks and inputs.

(2) A more practical patch-based GAN discriminator, SN-PatchGAN, is presented for free-form image inpainting. It is simple, fast, and produces high-quality inpainting results.

(3) The inpainting model is extended to an interactive one, taking a user sketch as guidance to produce results closer to what the user wants.

(4) The proposed system achieves higher-quality free-form inpainting than the previous state of the art on benchmark datasets including Places2 natural scenes and CelebA-HQ faces, and it helps users quickly remove distracting objects, modify image layouts, clear watermarks, and edit faces.

 

 

Approach

The method itself is best explained with a few figures, so instead of translating the text sentence by sentence, the figures below tell the whole story.

Gated Convolution

Structure of the gated convolution:

Figure 2: Illustration of partial convolution (left) and gated convolution (right).

This figure gives a direct, side-by-side comparison of partial convolution and gated convolution.

In formulas, using the notation from the introduction:

g = \sigma(w_g x), \qquad y = \phi(wx) \odot g

where \sigma is the sigmoid (so the gating values lie in (0, 1)), \phi is an arbitrary activation function, and w_g, w are the learnable filters of the gating and feature branches.

Spectral-Normalized Markovian Discriminator (SN-PatchGAN)

Two ingredients:

1) A 3D output. First, the paper targets free-form inpainting, where multiple holes of any shape may appear at any position (so spatially, a PatchGAN is needed). Second, in the discriminator, different channels express different semantics (so the PatchGAN is extended over channels as well).

2) Spectral normalization, which makes GAN training more stable and faster to converge.

Finally, the authors state that the full training objective is simply the L1 reconstruction loss plus the SN-PatchGAN loss, weighted 1 : 1; a sketch of both terms follows below.
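A hedged sketch of the hinge objectives and the 1:1 combined generator loss (variable names are illustrative; d_real/d_fake are 3D discriminator outputs as in the sketch above):

```python
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # Hinge loss, averaged over every point of the 3D output map.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def generator_loss(d_fake, pred, target):
    # SN-PatchGAN term plus pixel-wise L1 reconstruction, weighted 1:1.
    return -d_fake.mean() + F.l1_loss(pred, target)
```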

 

Inpainting Network Architecture

Not much needs to be said about the network structure: it is a two-stage pipeline producing a coarse result and then a refined result; a rough sketch of the data flow is given after the figure caption below.

Figure 3: Overview of our framework with gated convolution and SN-PatchGAN for free-form image inpainting.
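A rough sketch of that two-stage data flow, assuming the mask is 1 inside holes and coarse_net / refine_net are gated-convolution encoder-decoders like the layer sketched earlier (all names are illustrative):

```python
import torch

def inpaint(image, mask, sketch, coarse_net, refine_net):
    # Stage 1: coarse prediction from the masked image plus guidance.
    x_in = torch.cat([image * (1 - mask), mask, sketch], dim=1)
    coarse = coarse_net(x_in)
    # Paste the coarse prediction into the hole; keep the known pixels.
    x_coarse = image * (1 - mask) + coarse * mask
    # Stage 2: refinement network (with the contextual attention branch).
    refined = refine_net(torch.cat([x_coarse, mask, sketch], dim=1))
    return image * (1 - mask) + refined * mask
```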

 

One module worth a separate note is the Contextual Attention module; see Generative Image Inpainting with Contextual Attention for the details.

Illustration of the contextual attention layer. Firstly we use convolution to compute matching score of foreground patches with background patches (as convolutional filters). Then we apply softmax to compare and get attention score for each pixel. Finally we reconstruct foreground patches with background patches by performing deconvolution on attention score. The contextual attention layer is differentiable and fully-convolutional.
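A simplified, single-scale sketch of those three steps (batch size 1; the F.unfold patch extraction and the hole-exclusion handling are my own simplifications, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def contextual_attention(fg, bg, mask, ksize=3, scale=10.0):
    """fg, bg: (1, C, H, W) foreground/background features;
    mask: (1, 1, H, W), 1 inside the hole."""
    _, C, H, W = bg.shape
    pad = ksize // 2
    # Step 1: extract background patches to act as convolutional filters.
    patches = F.unfold(bg, kernel_size=ksize, padding=pad)       # (1, C*k*k, H*W)
    patches = patches.transpose(1, 2).reshape(H * W, C, ksize, ksize)
    # Step 2: matching scores = convolve fg with L2-normalized patches.
    norm = patches.reshape(H * W, -1).norm(dim=1).clamp(min=1e-4)
    scores = F.conv2d(fg, patches / norm.view(-1, 1, 1, 1), padding=pad)
    # Exclude patches whose window touches the hole, then softmax per pixel.
    hole = F.max_pool2d(mask, ksize, stride=1, padding=pad)      # (1, 1, H, W)
    scores = scores.masked_fill(hole.reshape(1, -1, 1, 1) > 0, -1e4)
    attn = F.softmax(scores * scale, dim=1)                      # (1, H*W, H, W)
    # Step 3: reconstruct by "deconvolution" with the same patches;
    # overlapping patches accumulate, hence the division by k*k.
    return F.conv_transpose2d(attn, patches, padding=pad) / (ksize ** 2)
```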

 

Supplementary

One analysis in the supplementary material is quite interesting: the visualization of the learned gating values at different layers of the network, shown in the figure below:

Figure 4: Comparisons of gated convolution and partial convolution with visualization and interpretation of learned gating values. We first show our inpainting network architecture based on [4] by replacing all convolutions with gated convolutions in the 1st row. Note that for simplicity, the following refinement network in [4] is ignored in the figure. With same settings, we train two models based on gated convolution and partial convolution separately. We then directly visualize intermediate un-normalized gating values in the 2nd row. The values differ mainly based on three parts: background, mask and sketch. In the 3rd row, we provide an interpretation based on which part(s) have higher gating values. Interestingly we also find that for some channels (e.g. channel-31 of the layer after dilated convolution), the learned gating values are based on foreground/background semantic segmentation. For comparison, we also visualize the un-learnable fixed binary mask M of partial convolution in the 4th row.

 

 
