DL notes 07: GAN/DCGAN

GAN

Throughout most of this book, we have talked about how to make predictions. In some form or another, we used deep neural networks to learn mappings from data points to labels. This kind of learning is called discriminative learning, as in, we’d like to be able to discriminate between photos of cats and photos of dogs. Classifiers and regressors are both examples of discriminative learning. And neural networks trained by backpropagation have upended everything we thought we knew about discriminative learning on large complicated datasets. Classification accuracies on high-res images have gone from useless to human-level (with some caveats) in just 5-6 years. We will spare you another spiel about all the other discriminative tasks where deep neural networks do astoundingly well.

But there is more to machine learning than just solving discriminative tasks. For example, given a large dataset, without any labels, we might want to learn a model that concisely captures the characteristics of this data. Given such a model, we could sample synthetic data points that resemble the distribution of the training data. For example, given a large corpus of photographs of faces, we might want to be able to generate a new photorealistic image that looks like it might plausibly have come from the same dataset. This kind of learning is called generative modeling.

Until recently, we had no method that could synthesize novel photorealistic images. But the success of deep neural networks for discriminative learning opened up new possibilities. One big trend over the last three years has been the application of discriminative deep nets to overcome challenges in problems that we do not generally think of as supervised learning problems. Recurrent neural network language models are one example of using a discriminative network (trained to predict the next character) that, once trained, can act as a generative model.

In 2014, a breakthrough paper introduced Generative adversarial networks (GANs) :cite:Goodfellow.Pouget-Abadie.Mirza.ea.2014, a clever new way to leverage the power of discriminative models to get good generative models. At their heart, GANs rely on the idea that a data generator is good if we cannot tell fake data apart from real data. In statistics, this is called a two-sample test - a test to answer the question whether datasets $X=\{x_1,\ldots, x_n\}$ and $X'=\{x'_1,\ldots, x'_n\}$ were drawn from the same distribution. The main difference between most statistics papers and GANs is that the latter use this idea in a constructive way. In other words, rather than just training a model to say “hey, these two datasets do not look like they came from the same distribution”, they use the two-sample test to provide training signals to a generative model. This allows us to improve the data generator until it generates something that resembles the real data. At the very least, it needs to fool the classifier, even if our classifier is a state-of-the-art deep neural network.
(Figure: the GAN architecture.)
The GAN architecture is illustrated in the figure above. As you can see, there are two pieces in the GAN architecture - first off, we need a device (say, a deep network, but it really could be anything, such as a game rendering engine) that might potentially be able to generate data that looks just like the real thing. If we are dealing with images, this needs to generate images. If we are dealing with speech, it needs to generate audio sequences, and so on. We call this the generator network. The second component is the discriminator network. It attempts to distinguish fake and real data from each other. Both networks are in competition with each other. The generator network attempts to fool the discriminator network. At that point, the discriminator network adapts to the new fake data. This information, in turn, is used to improve the generator network, and so on.

The discriminator is a binary classifier that distinguishes whether the input $\mathbf x$ is real (from real data) or fake (from the generator). Typically, the discriminator outputs a scalar prediction $o\in\mathbb R$ for input $\mathbf x$, e.g., using a dense layer with hidden size 1, and then applies the sigmoid function to obtain the predicted probability $D(\mathbf x) = 1/(1+e^{-o})$. Assume the label $y$ for the true data is $1$ and $0$ for the fake data. We train the discriminator to minimize the cross-entropy loss, i.e.,
$$\min_D \{ - y \log D(\mathbf x) - (1-y)\log(1-D(\mathbf x)) \}.$$
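As a quick illustration (not part of the original text): once the raw score is squashed through a sigmoid, this loss is exactly what PyTorch's `nn.BCELoss` computes. The numbers below are arbitrary.

import torch
from torch import nn

o = torch.tensor([2.0])          # an arbitrary raw score from the discriminator
D_x = torch.sigmoid(o)           # D(x) = 1 / (1 + e^{-o})
y = torch.tensor([1.0])          # label 1 for real data
print(nn.BCELoss()(D_x, y))      # -y*log D(x) - (1-y)*log(1 - D(x))
print(-(y * torch.log(D_x) + (1 - y) * torch.log(1 - D_x)))  # same value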
For the generator, it first draws some parameter $\mathbf z\in\mathbb R^d$ from a source of randomness, e.g., a normal distribution $\mathbf z \sim \mathcal{N}(0, 1)$. We often call $\mathbf z$ the latent variable. It then applies a function to generate $\mathbf x'=G(\mathbf z)$. The goal of the generator is to fool the discriminator into classifying $\mathbf x'=G(\mathbf z)$ as true data, i.e., we want $D(G(\mathbf z)) \approx 1$. In other words, for a given discriminator $D$, we update the parameters of the generator $G$ to maximize the cross-entropy loss when $y=0$, i.e.,
$$\max_G \{ - (1-y) \log(1-D(G(\mathbf z))) \} = \max_G \{ - \log(1-D(G(\mathbf z))) \}.$$
If the discriminator does a perfect job, then $D(\mathbf x')\approx 0$, so the above loss is near 0 and the resulting gradients are too small to make good progress for the generator. So commonly we minimize the following loss instead:
$$\min_G \{ - y \log(D(G(\mathbf z))) \} = \min_G \{ - \log(D(G(\mathbf z))) \},$$
which is just feeding $\mathbf x'=G(\mathbf z)$ into the discriminator but giving it the label $y=1$.
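A small numeric check of this argument (the value below is illustrative, not from the text): when the discriminator is confident a sample is fake, the original objective barely moves, while the relabeled loss produces a much larger gradient.

import torch

d = torch.tensor(0.01, requires_grad=True)   # pretend D(G(z)) is about 0.01

saturating = -torch.log(1 - d)               # original generator objective (to maximize)
saturating.backward()
print(d.grad)                                # about 1.01: weak training signal

d.grad = None
non_saturating = -torch.log(d)               # the commonly used loss (to minimize)
non_saturating.backward()
print(d.grad)                                # about -100: much stronger signal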
To sum up, $D$ and $G$ are playing a “minimax” game with the comprehensive objective function:
$$\min_D \max_G \{ -E_{x \sim \text{Data}} \log D(\mathbf x) - E_{z \sim \text{Noise}} \log(1 - D(G(\mathbf z))) \}.$$
Many of the GAN applications are in the context of images. For demonstration purposes, we are going to content ourselves with fitting a much simpler distribution first. We will illustrate what happens if we use GANs to build the world’s most inefficient estimator of parameters for a Gaussian. Let’s get started.
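One possible setup for this toy experiment (the layer sizes and the final sigmoid are our own choices, not prescribed above): a single linear layer as the generator and a tiny MLP ending in a sigmoid as the discriminator, so that it outputs $D(\mathbf x)$ directly as a probability.

import torch
from torch import nn

# Hypothetical networks for the toy 2D-Gaussian experiment.
net_G = nn.Sequential(nn.Linear(2, 2))            # an affine map of 2D noise
net_D = nn.Sequential(
    nn.Linear(2, 5), nn.Tanh(),
    nn.Linear(5, 3), nn.Tanh(),
    nn.Linear(3, 1), nn.Sigmoid())                # outputs D(x) in (0, 1)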

Training

import torch

def update_D(X, Z, net_D, net_G, loss, trainer_D):
    """Update the discriminator on one batch of real data X and noise Z."""
    batch_size = X.shape[0]
    ones = torch.ones((batch_size, 1))
    zeros = torch.zeros((batch_size, 1))
    trainer_D.zero_grad()
    real_Y = net_D(X.float())
    # Detach so the discriminator update does not backpropagate into the generator.
    fake_X = net_G(Z).detach()
    fake_Y = net_D(fake_X)
    loss_D = (loss(real_Y, ones) + loss(fake_Y, zeros)) / 2
    loss_D.backward()
    trainer_D.step()
    return float(loss_D.sum())

The generator is updated similarly. Here we reuse the cross-entropy loss but change the label of the fake data from $0$ to $1$.

def update_G(Z, net_D, net_G, loss, trainer_G):
    """Update the generator on one batch of noise Z."""
    batch_size = Z.shape[0]
    ones = torch.ones((batch_size, 1))
    trainer_G.zero_grad()
    fake_X = net_G(Z)
    # Recompute fake_Y so the gradient flows through the (updated) discriminator.
    fake_Y = net_D(fake_X)
    loss_G = loss(fake_Y, ones)
    loss_G.backward()
    trainer_G.step()
    return float(loss_G.sum())

Both the discriminator and the generator perform binary logistic regression with the cross-entropy loss. We use Adam to smooth the training process. In each iteration, we first update the discriminator and then the generator. We visualize both the losses and the generated examples.
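A minimal sketch of that loop, assuming the toy networks above, a `data_iter` that yields batches of real samples, and illustrative values for `latent_dim`, `lr`, and `num_epochs`:

import torch
from torch import nn

latent_dim, lr, num_epochs = 2, 0.05, 20          # illustrative hyper-parameters

loss = nn.BCELoss()
trainer_D = torch.optim.Adam(net_D.parameters(), lr=lr)
trainer_G = torch.optim.Adam(net_G.parameters(), lr=lr)

for epoch in range(num_epochs):
    for (X,) in data_iter:                        # data_iter is assumed to exist
        batch_size = X.shape[0]
        Z = torch.normal(0, 1, size=(batch_size, latent_dim))
        loss_D = update_D(X, Z, net_D, net_G, loss, trainer_D)
        loss_G = update_G(Z, net_D, net_G, loss, trainer_G)
    print(f'epoch {epoch + 1}: loss_D {loss_D:.3f}, loss_G {loss_G:.3f}')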
(Figures: discriminator and generator losses during training, and generated samples plotted against the real data.)
We can observe some mode collapse here: the generated data only covers part of the training data and fails to capture its full diversity.

  • Generative adversarial networks (GANs) consist of two deep networks, the generator and the discriminator.
  • The generator generates images as close to the true images as possible to fool the discriminator, by maximizing the cross-entropy loss, i.e., $\max \log(D(\mathbf{x'}))$. When setting up the generator’s loss function, we used a small trick to avoid vanishing gradients.
  • The discriminator tries to distinguish the generated images from the true images, by minimizing the cross-entropy loss, i.e., $\min_D \{- y \log D(\mathbf{x}) - (1-y)\log(1-D(\mathbf{x}))\}$.

DCGAN

In the previous section, we introduced the basic ideas behind how GANs work. We showed that they can draw samples from some simple, easy-to-sample distribution, like a uniform or normal distribution, and transform them into samples that appear to match the distribution of some dataset. And while our example of matching a 2D Gaussian distribution got the point across, it is not especially exciting.

In this section, we will demonstrate how you can use GANs to generate photorealistic images. We will be basing our models on the deep convolutional GANs (DCGAN) introduced in :cite:Radford.Metz.Chintala.2015. We will borrow the convolutional architectures that have proven so successful for discriminative computer vision problems and show how, via GANs, they can be leveraged to generate photorealistic images.

Because GAN training is unstable and the generation process is hard to control, DCGAN (Deep Convolutional GAN) was proposed as an improvement. Both the generator and the discriminator of DCGAN use convolutional neural networks (CNNs) in place of the multilayer perceptrons of the original GAN. To keep the whole network differentiable, the pooling layers of the CNN are removed, and fully connected layers are replaced with global pooling to reduce the amount of computation.

DCGAN incorporates transposed convolutional networks into the GAN to obtain better image generation performance.

Generator

The generator needs to map the noise variable $\mathbf z\in\mathbb R^d$, a length-$d$ vector, to an RGB image with width and height of $64 \times 64$. Earlier we introduced the fully convolutional network that uses transposed convolution layers to enlarge input size. The basic block of the generator contains a transposed convolution layer followed by batch normalization and a ReLU activation.

By default, the transposed convolution layer uses a $k_h = k_w = 4$ kernel, $s_h = s_w = 2$ strides, and $p_h = p_w = 1$ padding. With an input shape of $n_h \times n_w = 16 \times 16$, the generator block will double the input’s width and height.
$$\begin{aligned} n_h^{'} \times n_w^{'} &= [n_h k_h - (n_h-1)(k_h-s_h) - 2p_h] \times [n_w k_w - (n_w-1)(k_w-s_w) - 2p_w]\\ &= [k_h + s_h (n_h-1) - 2p_h] \times [k_w + s_w (n_w-1) - 2p_w]\\ &= [4 + 2 \times (16-1) - 2 \times 1] \times [4 + 2 \times (16-1) - 2 \times 1]\\ &= 32 \times 32. \end{aligned}$$
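A minimal sketch of such a block in PyTorch (the class name `G_block` and the choice of `bias=False` are our own):

import torch
from torch import nn

class G_block(nn.Module):
    """Transposed convolution -> batch normalization -> ReLU."""
    def __init__(self, in_channels, out_channels, kernel_size=4,
                 strides=2, padding=1):
        super().__init__()
        self.conv2d_trans = nn.ConvTranspose2d(
            in_channels, out_channels, kernel_size, strides, padding,
            bias=False)
        self.batch_norm = nn.BatchNorm2d(out_channels)
        self.activation = nn.ReLU()

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d_trans(X)))

# With the default hyper-parameters, a 16x16 input is doubled to 32x32.
x = torch.zeros((2, 3, 16, 16))
print(G_block(3, 20)(x).shape)    # torch.Size([2, 20, 32, 32])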
If we change the transposed convolution layer to a $4\times 4$ kernel, $1\times 1$ strides, and zero padding, then with an input size of $1\times 1$, the output will have its width and height each increased by 3.

The generator consists of four basic blocks that increase the input’s width and height from 1 to 32. At the same time, it first projects the latent variable into $64\times 8$ channels, and then halves the channels each time. At last, a transposed convolution layer is used to generate the output. It further doubles the width and height to match the desired $64\times 64$ shape, and reduces the channel size to $3$. The tanh activation function is applied to project output values into the $(-1, 1)$ range.
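Continuing the `G_block` sketch above, and assuming a latent dimension of 100 (the channel widths follow the text, the exact numbers are otherwise our own):

n_G = 64
net_G = nn.Sequential(
    G_block(in_channels=100, out_channels=n_G * 8,
            strides=1, padding=0),                 # output: (64*8, 4, 4)
    G_block(n_G * 8, n_G * 4),                     # output: (64*4, 8, 8)
    G_block(n_G * 4, n_G * 2),                     # output: (64*2, 16, 16)
    G_block(n_G * 2, n_G),                         # output: (64, 32, 32)
    nn.ConvTranspose2d(n_G, 3, kernel_size=4, stride=2, padding=1,
                       bias=False),
    nn.Tanh())                                     # output: (3, 64, 64)

z = torch.zeros((1, 100, 1, 1))                    # latent variable as a 1x1 "image"
print(net_G(z).shape)                              # torch.Size([1, 3, 64, 64])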

The structure of the generator is shown in the figure below:
(Figure: generator architecture.)

Discriminator

The discriminator is a normal convolutional network except that it uses a leaky ReLU as its activation function. Given $\alpha \in[0, 1]$, its definition is
$$\textrm{leaky ReLU}(x) = \begin{cases}x & \text{if}\ x > 0\\ \alpha x &\text{otherwise}\end{cases}.$$
As can be seen, it is the normal ReLU if $\alpha=0$, and an identity function if $\alpha=1$. For $\alpha \in (0, 1)$, leaky ReLU is a nonlinear function that gives a non-zero output for a negative input. It aims to fix the “dying ReLU” problem, in which a neuron might always output a negative value and therefore cannot make any progress since the gradient of ReLU is 0.
(Figure: leaky ReLU for several values of alpha.)
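For reference, PyTorch provides this as `nn.LeakyReLU`; a quick check of its behaviour (the input values and slopes are arbitrary):

import torch
from torch import nn

x = torch.tensor([-2.0, -1.0, 0.0, 1.0])
for alpha in [0.0, 0.2, 1.0]:
    print(alpha, nn.LeakyReLU(alpha)(x))
# alpha = 0 recovers ReLU, alpha = 1 is the identity, and values in
# between let a fraction of each negative input leak through.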
The basic block of the discriminator is a convolution layer followed by a batch normalization layer and a leaky ReLU activation. The hyper-parameters of the convolution layer are similar to those of the transposed convolution layer in the generator block.

A basic block with default settings will halve the width and height of the inputs. For example, given an input shape $n_h = n_w = 16$, with a kernel shape $k_h = k_w = 4$, a stride shape $s_h = s_w = 2$, and a padding shape $p_h = p_w = 1$, the output shape will be:
$$\begin{aligned} n_h^{'} \times n_w^{'} &= \lfloor(n_h-k_h+2p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+2p_w+s_w)/s_w\rfloor\\ &= \lfloor(16-4+2\times 1+2)/2\rfloor \times \lfloor(16-4+2\times 1+2)/2\rfloor\\ &= 8 \times 8. \end{aligned}$$
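A matching sketch of the discriminator block (the class name `D_block` and the default slope `alpha=0.2` are our assumptions; it reuses the imports from the sketches above):

class D_block(nn.Module):
    """Convolution -> batch normalization -> leaky ReLU."""
    def __init__(self, in_channels, out_channels, kernel_size=4,
                 strides=2, padding=1, alpha=0.2):
        super().__init__()
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size,
                                strides, padding, bias=False)
        self.batch_norm = nn.BatchNorm2d(out_channels)
        self.activation = nn.LeakyReLU(alpha)

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d(X)))

# With the default hyper-parameters, a 16x16 input is halved to 8x8.
x = torch.zeros((2, 3, 16, 16))
print(D_block(3, 20)(x).shape)    # torch.Size([2, 20, 8, 8])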
It uses a convolution layer with output channel $1$ as the last layer to obtain a single prediction value.
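Stacking four such blocks and the final 1-channel convolution gives a discriminator that maps a 64x64 RGB image to a single score; the channel widths below mirror the generator and are illustrative. A sigmoid (or a loss with a built-in sigmoid such as `nn.BCEWithLogitsLoss`) then turns this score into $D(\mathbf x)$.

n_D = 64
net_D = nn.Sequential(
    D_block(3, n_D),                               # output: (64, 32, 32)
    D_block(n_D, n_D * 2),                         # output: (64*2, 16, 16)
    D_block(n_D * 2, n_D * 4),                     # output: (64*4, 8, 8)
    D_block(n_D * 4, n_D * 8),                     # output: (64*8, 4, 4)
    nn.Conv2d(n_D * 8, 1, kernel_size=4, bias=False))  # output: (1, 1, 1)

x = torch.zeros((1, 3, 64, 64))
print(net_D(x).shape)                              # torch.Size([1, 1, 1, 1])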

Summary

  • DCGAN architecture has four convolutional layers for the Discriminator and four “fractionally-strided” convolutional layers for the Generator.
  • The Discriminator is a 4-layer strided convolutional network with batch normalization (except for its input layer) and leaky ReLU activations.
  • Leaky ReLU is a nonlinear function that gives a non-zero output for a negative input. It aims to fix the “dying ReLU” problem and helps gradients flow more easily through the architecture.

The DCGAN network design adopted several improvements to CNNs that were popular at the time:

(1) Spatial pooling layers are replaced by convolutional layers, which only requires setting the convolution stride to a value greater than 1. The point of this change is that downsampling is no longer a fixed operation that discards pixel values at certain positions; instead, the network can learn its own way of downsampling.

(2) Fully connected layers are removed (my earlier article on VGG analyzed how fully connected layers drastically increase the number of network parameters). The authors found experimentally that global average pooling helps model stability but slows down convergence. They note that instead of a fully connected layer, the noise input to the generator is reshaped into a 4D tensor so that convolutions can be used throughout.

(3) Batch normalization (BN) layers are adopted. BN is a normalization method commonly placed after convolutional layers that helps the network converge. The authors found in their experiments that applying BN to every layer caused sampling oscillation (I do not fully understand what sampling oscillation means; my guess is that the generated images collapse to the same pattern, or that image quality fluctuates) and made the network unstable.

The DCGAN improvements come with no rigorous mathematical proof; they rest mainly on theoretical analysis and experimental validation. The core modifications to the generator and the discriminator are:

(1) Replace pooling layers with convolutions of a specified stride.

(2) Use BN in both the generator and the discriminator.

(3) Remove fully connected layers.

(4) In the generator, use ReLU as the activation function for all layers except the output layer, which uses Tanh.

(5) In the discriminator, use LeakyReLU as the activation function for all layers.
