Generative Deep Learning - Variational Autoencoders and Generative Adversarial Networks
Chapters 3 and 4 of Generative Deep Learning review variational autoencoders and generative adversarial networks and how these deep learning architectures can be used for image generation.
Chapter 3: Variational Autoencoders
The variational autoencoder (VAE) is now one of the most fundamental and well-known deep learning architectures for generative modeling.
Autoencoders
An autoencoder is a neural network made up of two parts:
- An encoder network that compresses high-dimensional input data into a lower-dimensional representation vector
- A decoder network that decompresses a given representation vector back to the original domain.
This process is shown in the image below.
The network is trained to find weights for the encoder and decoder that minimize the loss between the original input and the reconstruction after it has passed through the encoder and decoder.
The representation vector is a compression of the original image into a lower-dimensional, latent space. The idea is that by choosing any point in the latent space, we should be able to generate novel images by passing this point through the decoder, since the decoder has learned how to convert points in the latent space into viable images.
Autoencoders can also be used to clean noisy images, since the encoder learns that it is not useful to capture the position of random noise in the latent space. Generally speaking, it is a good idea to create a class for your model in a separate file; that way, you can instantiate an Autoencoder object in the notebook with the parameters that define a particular model architecture.
Convolutional Transpose Layers
Standard convolutional layers allow us to halve the size of an input tensor in both height and width by setting strides=2. The convolutional transpose layer uses the same principle as a standard convolutional layer (passing a filter across the image), but differs in that setting strides=2 doubles the size of the input tensor in both height and width. In a convolutional transpose layer, the strides parameter determines the internal zero padding between pixels in the image.
In Keras, the Conv2DTranspose layer allows us to perform convolutional transpose operations. By stacking these layers, we can gradually expand the size of the tensor back up to the original image dimensions.
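As a quick shape check (my own sketch, not from the book), the following model halves the spatial dimensions with a strided Conv2D and then restores them with a strided Conv2DTranspose:
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose
from tensorflow.keras.models import Model

inp = Input(shape=(28, 28, 1))
# strides=2 halves height and width: (28, 28, 1) -> (14, 14, 16)
down = Conv2D(filters=16, kernel_size=3, strides=2, padding='same')(inp)
# strides=2 doubles height and width: (14, 14, 16) -> (28, 28, 1)
up = Conv2DTranspose(filters=1, kernel_size=3, strides=2, padding='same')(down)
Model(inp, up).summary()  # the summary shows the 28 -> 14 -> 28 progression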
# Setting up the environment for this textbook
!git clone https://github.com/davidADSP/GDL_code.git
# Make sure that you have the most up to date version of the codebase
!git pull
!pip install virtualenv virtualenvwrapper
# The location where the virtual environments will be stored
!export WORKON_HOME=$HOME/.virtualenvs
# The default version of Python to use when a virtual environment
# is created
!export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/python3
# Reloads the virtualenvwrapper script
!source /usr/local/bin/virtualenvwrapper.sh
# Install the packages that we'll be using in this book
!pip install -r requirements.txt
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, Activation, BatchNormalization, LeakyReLU, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import plot_model
import numpy as np
import json
import os
import pickle
from tensorflow.keras.callbacks import Callback, LearningRateScheduler
import matplotlib.pyplot as plt
#### CALLBACKS
class CustomCallback(Callback):
    def __init__(self, run_folder, print_every_n_batches, initial_epoch, vae):
        self.epoch = initial_epoch
        self.run_folder = run_folder
        self.print_every_n_batches = print_every_n_batches
        self.vae = vae
    def on_batch_end(self, batch, logs={}):
        if batch % self.print_every_n_batches == 0:
            z_new = np.random.normal(size = (1,self.vae.z_dim))
            reconst = self.vae.decoder.predict(np.array(z_new))[0].squeeze()
            filepath = os.path.join(self.run_folder, 'images', 'img_' + str(self.epoch).zfill(3) + '_' + str(batch) + '.jpg')
            if len(reconst.shape) == 2:
                plt.imsave(filepath, reconst, cmap='gray_r')
            else:
                plt.imsave(filepath, reconst)
    def on_epoch_begin(self, epoch, logs={}):
        self.epoch += 1
def step_decay_schedule(initial_lr, decay_factor=0.5, step_size=1):
    '''
    Wrapper function to create a LearningRateScheduler with step decay schedule.
    '''
    def schedule(epoch):
        new_lr = initial_lr * (decay_factor ** np.floor(epoch/step_size))
        return new_lr
    return LearningRateScheduler(schedule)
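# Example usage (illustrative values, my own): halve the learning rate every
# 10 epochs and pass the resulting callback to model.fit alongside any others.
lr_schedule = step_decay_schedule(initial_lr=0.0005, decay_factor=0.5, step_size=10)
# model.fit(x_train, x_train, epochs=200, callbacks=[lr_schedule])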
class Autoencoder():
    """
    I only include things / methods in this class that were commented on
    in the textbook. To see the entire class, see the Jupyter Notebook for
    this textbook.
    """
    def __init__(self, input_dim, encoder_conv_filters, encoder_conv_kernel_size, encoder_conv_strides, decoder_conv_t_filters, decoder_conv_t_kernel_size, decoder_conv_t_strides, z_dim, use_batch_norm = False, use_dropout = False
        ):
        self.name = 'autoencoder'
        self.input_dim = input_dim
        self.encoder_conv_filters = encoder_conv_filters
        self.encoder_conv_kernel_size = encoder_conv_kernel_size
        self.encoder_conv_strides = encoder_conv_strides
        self.decoder_conv_t_filters = decoder_conv_t_filters
        self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
        self.decoder_conv_t_strides = decoder_conv_t_strides
        self.z_dim = z_dim
        self.use_batch_norm = use_batch_norm
        self.use_dropout = use_dropout
        self.n_layers_encoder = len(encoder_conv_filters)
        self.n_layers_decoder = len(decoder_conv_t_filters)
        self._build()

    def _build(self):
        ### THE ENCODER
        """
        In an autoencoder, the encoder's job is to take the input image
        and map it to a point in the latent space. To achieve this,
        we first create an input layer for the image and pass it
        through four Conv2D layers, each capturing increasingly high-level
        features. We use a stride of 2 in some of the layers to reduce the
        size of the output. The last convolutional layer is flattened and
        connected to a Dense layer of size 2, which represents our
        two-dimensional latent space.
        """
        # Define the input to the encoder
        encoder_input = Input(shape=self.input_dim, name='encoder_input')
        x = encoder_input
        for i in range(self.n_layers_encoder):
            conv_layer = Conv2D(
                filters = self.encoder_conv_filters[i]
                , kernel_size = self.encoder_conv_kernel_size[i]
                , strides = self.encoder_conv_strides[i]
                , padding = 'same'
                , name = 'encoder_conv_' + str(i)
                )
            # Stack convolutional layers sequentially on top of each other
            x = conv_layer(x)
            x = LeakyReLU()(x)
            if self.use_batch_norm:
                x = BatchNormalization()(x)
            if self.use_dropout:
                x = Dropout(rate = 0.25)(x)
        # Flatten the last convolutional layer to a vector
        shape_before_flattening = K.int_shape(x)[1:]
        x = Flatten()(x)
        # Dense layer that connects this vector to the 2D latent space
        encoder_output = Dense(self.z_dim, name='encoder_output')(x)
        # The Keras model that defines the encoder - a model that takes an
        # input image and encodes it into the 2D latent space.
        self.encoder = Model(encoder_input, encoder_output)

        ### THE DECODER
        """
        The decoder is a mirror image of the encoder, except instead of
        convolutional layers, we use convolutional transpose layers.
        Note that the decoder doesn't have to be a mirror image of the encoder.
        It can be anything you want, as long as the output from the last
        layer of the decoder is the same size as the input to the encoder
        (since our loss function will be comparing these pixel-wise).
        """
        # Define the input to the decoder
        decoder_input = Input(shape=(self.z_dim,), name='decoder_input')
        # Connect the input to a Dense layer
        x = Dense(np.prod(shape_before_flattening))(decoder_input)
        # Reshape vector into tensor so that it can be fed as input
        # into the first convolutional transpose layer
        x = Reshape(shape_before_flattening)(x)
        for i in range(self.n_layers_decoder):
            conv_t_layer = Conv2DTranspose(
                filters = self.decoder_conv_t_filters[i]
                , kernel_size = self.decoder_conv_t_kernel_size[i]
                , strides = self.decoder_conv_t_strides[i]
                , padding = 'same'
                , name = 'decoder_conv_t_' + str(i)
                )
            # Stack convolutional transpose layers on top of each other
            x = conv_t_layer(x)
            if i < self.n_layers_decoder - 1:
                x = LeakyReLU()(x)
                if self.use_batch_norm:
                    x = BatchNormalization()(x)
                if self.use_dropout:
                    x = Dropout(rate = 0.25)(x)
            else:
                x = Activation('sigmoid')(x)
        decoder_output = x
        # The model that defines the decoder - a model that takes a point
        # in the latent space and decodes it into the original image domain
        self.decoder = Model(decoder_input, decoder_output)

        ### THE FULL AUTOENCODER
        # Input to autoencoder = input to encoder
        model_input = encoder_input
        # Output from the autoencoder is the output from the encoder passed
        # to the decoder
        model_output = self.decoder(encoder_output)
        # The Keras model that defines the full autoencoder - a model that takes
        # an image and passes it through the encoder and back out through the
        # decoder to generate a reconstruction of the original image.
        self.model = Model(model_input, model_output)

    def compile(self, learning_rate):
        self.learning_rate = learning_rate
        optimizer = Adam(learning_rate=learning_rate)
        def r_loss(y_true, y_pred):
            """
            The loss function is usually chosen to be either the mean squared
            error or binary cross-entropy between the individual pixels of the
            original image and the reconstruction.
            """
            return K.mean(K.square(y_true - y_pred), axis = [1,2,3])
        self.model.compile(optimizer=optimizer, loss = r_loss)
AE = Autoencoder(
input_dim = (28,28,1)
, encoder_conv_filters = [32,64,64, 64]
, encoder_conv_kernel_size = [3,3,3,3]
, encoder_conv_strides = [1,2,2,1]
, decoder_conv_t_filters = [64,64,32,1]
, decoder_conv_t_kernel_size = [3,3,3,3]
, decoder_conv_t_strides = [1,2,2,1]
, z_dim = 2)
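A minimal sketch (mine, not the book's exact notebook code) of compiling this model and fitting it on MNIST; the input images also serve as the targets, since the autoencoder learns to reproduce its own input:
from tensorflow.keras.datasets import mnist

(x_train, _), (x_test, _) = mnist.load_data()
# Scale pixels to [0, 1] and add a channel axis to match input_dim = (28, 28, 1)
x_train = x_train.astype('float32').reshape(-1, 28, 28, 1) / 255.
x_test = x_test.astype('float32').reshape(-1, 28, 28, 1) / 255.

AE.compile(learning_rate=0.0005)
AE.model.fit(x_train, x_train, batch_size=32, epochs=10, validation_data=(x_test, x_test))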
Building a Variational Autoencoder
In an autoencoder, each image is mapped directly to one point in the latent space. In a variational autoencoder, each image is instead mapped to a multivariate normal distribution around a point in the latent space.
The Normal Distribution
A normal distribution is a probability distribution characterized by a distinctive bell curve shape. In one dimension, it is defined by two variables: the mean ( μ ) and the variance ( σ² ). The standard deviation ( σ ) is the square root of the variance. The probability density function of the normal distribution in one dimension is:
f(x) = 1 / sqrt(2πσ²) * exp( -(x - μ)² / (2σ²) )
The concept of a normal distribution extends to more than one dimension; the probability density function for a general multivariate normal distribution in k dimensions is:
f(x) = 1 / sqrt( (2π)^k * det(Σ) ) * exp( -1/2 * (x - μ)^T Σ⁻¹ (x - μ) )
In 2D, the mean vector μ and the symmetric covariance matrix Σ are defined as:
μ = (μ1, μ2)
Σ = [ σ1²       ρ σ1 σ2 ]
    [ ρ σ1 σ2   σ2²     ]
where ρ is the correlation between the two dimensions x1 and x2.
Variational autoencoders assume that there is no correlation between any of the dimensions in the latent space and therefore that the covariance matrix is diagonal. This means the encoder only needs to map each input to a mean vector and a variance vector and does not need to worry about covariance between dimensions.
The encoder will take each input image and encode it to two vectors mu and log_var which together define a multivariate normal distribution in the latent space:
- mu: The mean of the distribution
- log_var: The logarithm of the variance of each dimension
To encode an image into a specific point z in the latent space, we can sample from this distribution, using:
`z = mu + sigma * epsilon`
where
sigma = exp(log_var / 2)
epsilon is a point sampled from the standard normal distribution.
How does this help the autoencoder? Since we are sampling from an area around mu, the decoder must ensure that all points in the same neighborhood produce very similar images when decoded, so that the reconstruction loss remains small.
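A quick numpy illustration of this sampling step, with made-up encoder outputs (the Keras version appears in the sampling function of the VAE class below):
import numpy as np

mu = np.array([0.3, -1.2])       # mean output by the encoder (illustrative)
log_var = np.array([-0.5, 0.1])  # log variance output by the encoder

sigma = np.exp(log_var / 2)
epsilon = np.random.standard_normal(mu.shape)  # draw from the standard normal
z = mu + sigma * epsilon  # a point sampled from the encoded distribution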
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, Activation, BatchNormalization, LeakyReLU, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import plot_model
import numpy as np
import json
import os
import pickle
class VariationalAutoencoder():
    """
    I only include things / methods in this class that were commented on
    in the textbook. To see the entire class, see the Jupyter Notebook for
    this textbook.
    """
    def __init__(self
        , input_dim
        , encoder_conv_filters
        , encoder_conv_kernel_size
        , encoder_conv_strides
        , decoder_conv_t_filters
        , decoder_conv_t_kernel_size
        , decoder_conv_t_strides
        , z_dim
        , use_batch_norm = False
        , use_dropout = False
        ):
        self.name = 'variational_autoencoder'
        self.input_dim = input_dim
        self.encoder_conv_filters = encoder_conv_filters
        self.encoder_conv_kernel_size = encoder_conv_kernel_size
        self.encoder_conv_strides = encoder_conv_strides
        self.decoder_conv_t_filters = decoder_conv_t_filters
        self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
        self.decoder_conv_t_strides = decoder_conv_t_strides
        self.z_dim = z_dim
        self.use_batch_norm = use_batch_norm
        self.use_dropout = use_dropout
        self.n_layers_encoder = len(encoder_conv_filters)
        self.n_layers_decoder = len(decoder_conv_t_filters)
        self._build()

    def _build(self):
        ### THE ENCODER
        encoder_input = Input(shape=self.input_dim, name='encoder_input')
        x = encoder_input
        for i in range(self.n_layers_encoder):
            conv_layer = Conv2D(
                filters = self.encoder_conv_filters[i]
                , kernel_size = self.encoder_conv_kernel_size[i]
                , strides = self.encoder_conv_strides[i]
                , padding = 'same'
                , name = 'encoder_conv_' + str(i)
                )
            x = conv_layer(x)
            if self.use_batch_norm:
                x = BatchNormalization()(x)
            x = LeakyReLU()(x)
            if self.use_dropout:
                x = Dropout(rate = 0.25)(x)
        shape_before_flattening = K.int_shape(x)[1:]
        x = Flatten()(x)
        # Instead of connecting the flattened layer directly to the 2D latent
        # space, we connect it to layers mu and log_var
        self.mu = Dense(self.z_dim, name='mu')(x)
        self.log_var = Dense(self.z_dim, name='log_var')(x)
        # The Keras model that outputs the values of mu and log_var
        # for a given input image
        self.encoder_mu_log_var = Model(encoder_input, (self.mu, self.log_var))

        def sampling(args):
            mu, log_var = args
            epsilon = K.random_normal(shape=K.shape(mu), mean=0., stddev=1.)
            return mu + K.exp(log_var / 2) * epsilon

        # Samples a point z in the latent space from the normal distribution
        # defined by the parameters mu and log_var
        encoder_output = Lambda(sampling, name='encoder_output')([self.mu, self.log_var])
        # Defines the encoder - a model that takes an input image and encodes
        # it into the 2D latent space, by sampling a point from the normal
        # distribution defined by mu and log_var
        self.encoder = Model(encoder_input, encoder_output)

        ### THE DECODER
        decoder_input = Input(shape=(self.z_dim,), name='decoder_input')
        x = Dense(np.prod(shape_before_flattening))(decoder_input)
        x = Reshape(shape_before_flattening)(x)
        for i in range(self.n_layers_decoder):
            conv_t_layer = Conv2DTranspose(
                filters = self.decoder_conv_t_filters[i]
                , kernel_size = self.decoder_conv_t_kernel_size[i]
                , strides = self.decoder_conv_t_strides[i]
                , padding = 'same'
                , name = 'decoder_conv_t_' + str(i)
                )
            x = conv_t_layer(x)
            if i < self.n_layers_decoder - 1:
                if self.use_batch_norm:
                    x = BatchNormalization()(x)
                x = LeakyReLU()(x)
                if self.use_dropout:
                    x = Dropout(rate = 0.25)(x)
            else:
                x = Activation('sigmoid')(x)
        decoder_output = x
        self.decoder = Model(decoder_input, decoder_output)

        ### THE FULL VAE
        model_input = encoder_input
        model_output = self.decoder(encoder_output)
        self.model = Model(model_input, model_output)

    def compile(self, learning_rate, r_loss_factor):
        self.learning_rate = learning_rate

        ### COMPILATION
        def vae_r_loss(y_true, y_pred):
            r_loss = K.mean(K.square(y_true - y_pred), axis = [1,2,3])
            return r_loss_factor * r_loss
        def vae_kl_loss(y_true, y_pred):
            kl_loss = -0.5 * K.sum(1 + self.log_var - K.square(self.mu) - K.exp(self.log_var), axis = 1)
            return kl_loss
        def vae_loss(y_true, y_pred):
            r_loss = vae_r_loss(y_true, y_pred)
            kl_loss = vae_kl_loss(y_true, y_pred)
            return r_loss + kl_loss
        optimizer = Adam(learning_rate=learning_rate)
        self.model.compile(optimizer=optimizer, loss = vae_loss, metrics = [vae_r_loss, vae_kl_loss])
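Instantiating and compiling the VAE mirrors the plain autoencoder. A sketch using the same MNIST architecture as before; treat r_loss_factor as a tunable hyperparameter that weights the reconstruction loss against the KL divergence (1000 is a reasonable starting point for MNIST-scale images):
vae = VariationalAutoencoder(
    input_dim = (28,28,1)
    , encoder_conv_filters = [32,64,64,64]
    , encoder_conv_kernel_size = [3,3,3,3]
    , encoder_conv_strides = [1,2,2,1]
    , decoder_conv_t_filters = [64,64,32,1]
    , decoder_conv_t_kernel_size = [3,3,3,3]
    , decoder_conv_t_strides = [1,2,2,1]
    , z_dim = 2)
vae.compile(learning_rate=0.0005, r_loss_factor=1000)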
The Loss Function
We still use the mean squared error for the reconstruction term of the loss, but we must add something else: the Kullback-Leibler (KL) divergence, a way of measuring how much one probability distribution differs from another. In this case, the KL divergence between the encoded distribution and the standard normal distribution has the closed form:
kl_loss = -0.5 * sum(1 + log_var - mu ^ 2 - exp(log_var))
or in mathematical notation:
D_KL( N(μ, σ) || N(0, 1) ) = -1/2 * sum(1 + log(σ²) - μ² - σ²)
The KL divergence term penalizes the network for encoding observations to mu and log_var variables that differ significantly from the parameters of a standard normal distribution, namely mu = 0 and log_var = 0. The addition of the KL divergence means that when we sample from the standard normal distribution, we will get a point that likely lies within the bounds of what the VAE is used to seeing, and that points are distributed symmetrically and efficiently around the origin.
Using VAE to Generate Faces
The size of the latent space must increase with the complexity of the images. Unlike Naive Bayes, the variational autoencoder does not suffer from the problem of being unable to capture dependencies between adjacent pixels: the convolutional layers of the encoder are designed to translate low-level pixels into high-level features, and the decoder is trained to perform the opposite task of translating the high-level features in the latent space back to raw pixels.
Latent Space Arithmetic
One benefit of mapping images into a lower-dimensional space is that we can perform arithmetic in this latent space that has a visual analogue when decoded back into the original image domain. Conceptually, we perform the following vector arithmetic, where alpha is a factor that determines how much of the feature vector is added or subtracted (and feature_vector is the average position in the latent space of encoded images with the attribute you want to add, minus the average position of encoded images that don't have that attribute):
z_new = z + alpha * feature_vector
You can use latent space arithmetic to add a smile where there was none, to morph between two faces, and more.
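A numpy sketch of both operations, assuming a trained vae and image arrays whose names here (imgs_with_attribute, img_a, and so on) are hypothetical:
import numpy as np

# The feature vector: average latent position of images with the attribute
# (e.g. smiling) minus the average position of images without it
z_with = vae.encoder.predict(imgs_with_attribute).mean(axis=0)
z_without = vae.encoder.predict(imgs_without_attribute).mean(axis=0)
feature_vector = z_with - z_without

# Add the feature to one image: encode, shift in the latent space, decode
alpha = 1.0  # strength of the added feature
z = vae.encoder.predict(img[np.newaxis, ...])
img_new = vae.decoder.predict(z + alpha * feature_vector)[0]

# Morph between two faces by interpolating their latent points (alpha in [0, 1])
z_a = vae.encoder.predict(img_a[np.newaxis, ...])
z_b = vae.encoder.predict(img_b[np.newaxis, ...])
img_mid = vae.decoder.predict((1 - alpha) * z_a + alpha * z_b)[0]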
In summary, variational autoencoders solve some problems of autoencoders by introducing randomness into the model and constraining how points in the latent space are distributed.
Generative Adversarial Networks
Introduction to GANs
A GAN is a battle between two adversaries, the generator and the discriminator. The generator tries to convert random noise into observations that look as if they have been sampled from the original dataset and the discriminator tries to predict whether an observation comes from the original dataset or is one of the generator's forgeries.
At the start of this process, the generator outputs noisy images and the discriminator predicts randomly. The key to GANs lies in how we alternate the training of the two networks so that as the generator becomes more adept at fooling the discriminator, the discriminator must adapt in order to maintain its ability to correctly identify which observations are fake. This drives the generator to find new ways to fool the discriminator, and so the cycle continues.
Your First GAN
The Discriminator
The goal of the discriminator is to predict whether an image is real or fake. This is a supervised image classification problem. It is commonplace to use convolutional layers in GANs, even though the original paper used dense layers, and it is common to see batch normalization layers in the discriminator of a vanilla GAN.
The Generator
The input to the generator is a vector, usually drawn from a multivariate standard normal distribution. The output is an image of the same size as an image in the original training data.
This description may remind you of the decoder in a variational autoencoder; the generator of a GAN fulfills exactly the same purpose as the decoder of a VAE: converting a vector in the latent space to an image. The concept of mapping from a latent space back to the original domain is very common in generative modeling as it gives us the ability to manipulate vectors in the latent space to change high-level features of images in the original domain.
Upsampling
In this GAN, we use the Keras UpSampling2D layer to double the width and height of the input tensor. This simply repeats each row and column of its input in order to double the size. We then follow this with a standard convolutional layer with stride 1 to perform the convolution operation. It is a similar idea to the convolutional transpose, but instead of filling the gaps between pixels with zeros, upsampling just repeats the existing pixel values.
Both methods - UpSampling2D + Conv2D and Conv2DTranspose - are acceptable ways to transform tensors back to the original image domain. It really is a case of testing both methods in your own problem setting and seeing which produces better results. It has been shown that the Conv2DTranspose method can lead to artifacts, or small checkerboard patterns in the output image, that spoil the quality of the output.
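A shape-level comparison of the two options (my own sketch): both take a 7 x 7 tensor to 14 x 14; they differ only in how the new pixels are filled in.
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose, UpSampling2D
from tensorflow.keras.models import Model

inp = Input(shape=(7, 7, 64))
# Option 1: repeat rows and columns, then convolve with stride 1
x1 = UpSampling2D()(inp)                            # -> (14, 14, 64)
x1 = Conv2D(64, kernel_size=3, padding='same')(x1)  # -> (14, 14, 64)
# Option 2: a single convolutional transpose layer with stride 2
x2 = Conv2DTranspose(64, kernel_size=3, strides=2, padding='same')(inp)  # -> (14, 14, 64)
Model(inp, [x1, x2]).summary()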
Training the GAN
We can train the discriminator by creating a training set where some of the images are randomly selected real observations from the training set and some are outputs from the generator. The response would be 1 for the true images and 0 for the generated images. If we treat this as a supervised learning problem, we can train the discriminator to learn how to tell the difference between the original and generated images, outputting values near 1 for the true images and values near 0 for the fake images.
To train the generator, we must first connect it to the discriminator to create a Keras model that we can train. Specifically, we feed the output from the generator (a 28 x 28 x 1 image) into the discriminator so that the output from this combined model is the probability that the generated image is real, according to the discriminator. We can train this combined model by creating training batches consisting of randomly generated 100-dimensional latent vectors as input and a response which is set to 1, since we want to train the generator to produce images that the discriminator thinks are real.
The loss is then just a binary cross-entropy loss between the output of the discriminator and the response vector of 1.
We must freeze the weights of the discriminator while we are training the combined model, so that only the generator's weights are updated. If we do not freeze the discriminator's weights, the discriminator will adjust so that it is more likely to predict generated images are real, which is not the desired outcome. We want generated images to be predicted close to 1 (real) because the generator is strong, not because the discriminator is weak.
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, Activation, BatchNormalization, LeakyReLU, Dropout, ZeroPadding2D, UpSampling2D
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.utils import plot_model
from tensorflow.keras.initializers import RandomNormal
import numpy as np
import json
import os
import pickle as pkl
import matplotlib.pyplot as plt
class GAN():
    """
    I only include things / methods in this class that were commented on
    in the textbook. To see the entire class, see the Jupyter Notebook for
    this textbook.
    """
    def __init__(self
        , input_dim
        , discriminator_conv_filters
        , discriminator_conv_kernel_size
        , discriminator_conv_strides
        , discriminator_batch_norm_momentum
        , discriminator_activation
        , discriminator_dropout_rate
        , discriminator_learning_rate
        , generator_initial_dense_layer_size
        , generator_upsample
        , generator_conv_filters
        , generator_conv_kernel_size
        , generator_conv_strides
        , generator_batch_norm_momentum
        , generator_activation
        , generator_dropout_rate
        , generator_learning_rate
        , optimiser
        , z_dim
        ):
        self.name = 'gan'
        self.input_dim = input_dim
        self.discriminator_conv_filters = discriminator_conv_filters
        self.discriminator_conv_kernel_size = discriminator_conv_kernel_size
        self.discriminator_conv_strides = discriminator_conv_strides
        self.discriminator_batch_norm_momentum = discriminator_batch_norm_momentum
        self.discriminator_activation = discriminator_activation
        self.discriminator_dropout_rate = discriminator_dropout_rate
        self.discriminator_learning_rate = discriminator_learning_rate
        self.generator_initial_dense_layer_size = generator_initial_dense_layer_size
        self.generator_upsample = generator_upsample
        self.generator_conv_filters = generator_conv_filters
        self.generator_conv_kernel_size = generator_conv_kernel_size
        self.generator_conv_strides = generator_conv_strides
        self.generator_batch_norm_momentum = generator_batch_norm_momentum
        self.generator_activation = generator_activation
        self.generator_dropout_rate = generator_dropout_rate
        self.generator_learning_rate = generator_learning_rate
        self.optimiser = optimiser
        self.z_dim = z_dim
        self.n_layers_discriminator = len(discriminator_conv_filters)
        self.n_layers_generator = len(generator_conv_filters)
        self.weight_init = RandomNormal(mean=0., stddev=0.02)
        self.d_losses = []
        self.g_losses = []
        self.epoch = 0
        self._build_discriminator()
        self._build_generator()
        self._build_adversarial()

    def get_activation(self, activation):
        if activation == 'leaky_relu':
            layer = LeakyReLU(alpha = 0.2)
        else:
            layer = Activation(activation)
        return layer

    def _build_discriminator(self):
        ### THE discriminator
        # Define the input to the discriminator (the image)
        discriminator_input = Input(shape=self.input_dim, name='discriminator_input')
        x = discriminator_input
        # Stack convolutional layers on top of each other
        for i in range(self.n_layers_discriminator):
            x = Conv2D(
                filters = self.discriminator_conv_filters[i]
                , kernel_size = self.discriminator_conv_kernel_size[i]
                , strides = self.discriminator_conv_strides[i]
                , padding = 'same'
                , name = 'discriminator_conv_' + str(i)
                , kernel_initializer = self.weight_init
                )(x)
            if self.discriminator_batch_norm_momentum and i > 0:
                x = BatchNormalization(momentum = self.discriminator_batch_norm_momentum)(x)
            x = self.get_activation(self.discriminator_activation)(x)
            if self.discriminator_dropout_rate:
                x = Dropout(rate = self.discriminator_dropout_rate)(x)
        # Flatten the last convolutional layer to a vector
        x = Flatten()(x)
        # Dense layer of one unit, with a sigmoid activation function
        # that transforms the output from the dense layer to the range [0,1]
        discriminator_output = Dense(1, activation='sigmoid', kernel_initializer = self.weight_init)(x)
        # The Keras model that defines the discriminator - a model that takes
        # an input image and outputs a single number between 0 and 1
        self.discriminator = Model(discriminator_input, discriminator_output)

    def _build_generator(self):
        ### THE generator
        # Define the input to the generator - a vector of length 100
        generator_input = Input(shape=(self.z_dim,), name='generator_input')
        x = generator_input
        # We follow this with a Dense layer consisting of 3,136 units
        x = Dense(np.prod(self.generator_initial_dense_layer_size), kernel_initializer = self.weight_init)(x)
        if self.generator_batch_norm_momentum:
            x = BatchNormalization(momentum = self.generator_batch_norm_momentum)(x)
        x = self.get_activation(self.generator_activation)(x)
        # which, after applying batch normalization and a ReLU activation
        # function, is reshaped to a 7 x 7 x 64 tensor
        x = Reshape(self.generator_initial_dense_layer_size)(x)
        if self.generator_dropout_rate:
            x = Dropout(rate = self.generator_dropout_rate)(x)
        # Pass through 4 Conv2D layers, the first two preceded by
        # UpSampling2D layers, to reshape the tensor to 14 x 14, then 28 x 28.
        # In all but the last layer, we use batch normalization and ReLU
        # activation
        for i in range(self.n_layers_generator):
            if self.generator_upsample[i] == 2:
                x = UpSampling2D()(x)
                x = Conv2D(
                    filters = self.generator_conv_filters[i]
                    , kernel_size = self.generator_conv_kernel_size[i]
                    , padding = 'same'
                    , name = 'generator_conv_' + str(i)
                    , kernel_initializer = self.weight_init
                    )(x)
            else:
                x = Conv2DTranspose(
                    filters = self.generator_conv_filters[i]
                    , kernel_size = self.generator_conv_kernel_size[i]
                    , padding = 'same'
                    , strides = self.generator_conv_strides[i]
                    , name = 'generator_conv_' + str(i)
                    , kernel_initializer = self.weight_init
                    )(x)
            # After the final Conv2D layer, we use a tanh activation to
            # transform the output to the range [-1,1] to match the original
            # image domain
            if i < self.n_layers_generator - 1:
                if self.generator_batch_norm_momentum:
                    x = BatchNormalization(momentum = self.generator_batch_norm_momentum)(x)
                x = self.get_activation(self.generator_activation)(x)
            else:
                x = Activation('tanh')(x)
        generator_output = x
        # The Keras model that defines the generator - a model that accepts
        # a vector of length 100 and outputs a tensor of shape [28,28,1]
        self.generator = Model(generator_input, generator_output)

    def _build_adversarial(self):
        ### COMPILE DISCRIMINATOR
        # The discriminator is compiled with binary cross-entropy loss, as
        # the response is binary and we have one output unit with sigmoid
        # activation
        self.discriminator.compile(
            optimizer=self.get_opti(self.discriminator_learning_rate)
            , loss = 'binary_crossentropy'
            , metrics = ['accuracy']
            )

        ### COMPILE THE FULL GAN
        # Freeze the discriminator weights - this doesn't affect the existing
        # discriminator model that we have already compiled
        self.set_trainable(self.discriminator, False)
        model_input = Input(shape=(self.z_dim,), name='model_input')
        model_output = self.discriminator(self.generator(model_input))
        # Define a new model whose input is a 100-dimensional latent vector;
        # this is passed through the generator and frozen discriminator to
        # produce the output probability
        self.model = Model(model_input, model_output)
        # Again, we use a binary cross-entropy loss for the combined model.
        # The learning rate is slower than the discriminator's, as generally
        # we would like the discriminator to be stronger than the generator.
        # The learning rate is a parameter that should be tuned carefully for
        # each GAN problem setting.
        self.model.compile(optimizer=self.get_opti(self.generator_learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
        self.set_trainable(self.discriminator, True)

    def train_discriminator(self, x_train, batch_size, using_generator):
        valid = np.ones((batch_size,1))
        fake = np.zeros((batch_size,1))
        if using_generator:
            true_imgs = next(x_train)[0]
            if true_imgs.shape[0] != batch_size:
                true_imgs = next(x_train)[0]
        else:
            idx = np.random.randint(0, x_train.shape[0], batch_size)
            true_imgs = x_train[idx]
        noise = np.random.normal(0, 1, (batch_size, self.z_dim))
        gen_imgs = self.generator.predict(noise)
        # One batch update of the discriminator involves first training
        # on a batch of true images with the response 1
        d_loss_real, d_acc_real = self.discriminator.train_on_batch(true_imgs, valid)
        # then on a batch of generated images with the response 0
        d_loss_fake, d_acc_fake = self.discriminator.train_on_batch(gen_imgs, fake)
        d_loss = 0.5 * (d_loss_real + d_loss_fake)
        d_acc = 0.5 * (d_acc_real + d_acc_fake)
        return [d_loss, d_loss_real, d_loss_fake, d_acc, d_acc_real, d_acc_fake]

    def train_generator(self, batch_size):
        valid = np.ones((batch_size,1))
        noise = np.random.normal(0, 1, (batch_size, self.z_dim))
        # One batch update of the generator involves training on a batch of
        # generated images with the response 1. As the discriminator is frozen,
        # its weights will not be affected; instead, the generator weights will
        # move in the direction that allows it to better generate images that
        # are more likely to fool the discriminator
        return self.model.train_on_batch(noise, valid)
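A sketch of the outer loop that alternates these two updates, assuming gan is an instance of the class above and x_train holds real images scaled to [-1, 1] to match the generator's tanh output (values are illustrative; the book's train method also handles checkpointing and sample images):
epochs = 2000
batch_size = 64
for epoch in range(epochs):
    d = gan.train_discriminator(x_train, batch_size, using_generator=False)
    g = gan.train_generator(batch_size)
    if epoch % 100 == 0:
        print('%d [D loss: %.3f, acc: %.3f] [G loss: %.3f]' % (epoch, d[0], d[3], g[0]))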
After a suitable number of epochs, the discriminator and generator will have found an equilibrium that allows the generator to learn meaningful information from the discriminator and the quality of the images will start to improve.
A requirement for a successful generative model is that it doesn't only reproduce images from the training set. To test this, we can find the image from the training set that is closest to a particular generated example. A good measure for distance is the ℓ1 distance.
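A numpy sketch of this check, where gen_img is one generated image and x_train holds the training images at the same scale (both names are assumptions):
import numpy as np

# l1 distance between the generated image and every training image
l1_distances = np.mean(np.abs(x_train - gen_img), axis=(1, 2, 3))
closest_img = x_train[np.argmin(l1_distances)]
# If closest_img differs visibly from gen_img, the generator is not simply
# memorizing the training set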
GAN Challenges
GANs are notoriously difficult to train. This section looks at common problems when training GANs:
Oscillating Loss
The loss of the discriminator and generator can start to oscillate wildly, rather than exhibiting long-term stability. Typically, there is some small oscillation of the loss between batches, but in the long term you should be looking for loss that stabilizes or gradually increases or decreases, rather than erratically fluctuating, to ensure your GAN converges and improves over time.
Mode Collapse
Mode collapse occurs when the generator finds a small number of samples that fool the discriminator and therefore isn't able to produce any examples other than this limited set.
Uninformative Loss
The lack of correlation between the generator loss and image quality sometimes makes GAN training difficult to monitor.
Hyperparameters
Even with simple GANs, there are a large number of hyperparameters to tune. GANs are highly sensitive to slight changes in all these parameters, and finding a set of parameters that works is often a case of educated trial and error, rather than following an established set of guidelines. This is why it is important to understand the inner workings of GANs.
Tackling the GAN Challenges
In recent years, several key adjustments have drastically improved the overall stability of GAN models and diminished the likelihood of some of the problems above, such as the Wasserstein GAN (WGAN) and the Wasserstein GAN with Gradient Penalty (WGAN-GP). Both are minor adjustments to the vanilla GAN framework, and the latter is now considered a best practice.
Wasserstein GAN
The Wasserstein GAN was one of the first big steps toward stabilizing GAN training. The authors were able to show how to train GANs that have the following two properties:
- A meaningful loss metric that correlates with the generator's convergence and sample quality
- Improved stability of the optimization process
The paper introduces a new loss function for both the discriminator and the generator. Using this loss function instead of binary cross entropy results in more stable convergence.
Wasserstein Loss
Binary Cross-Entropy Loss (the loss function for the original GAN):
-(1/n) * sum( y_i * log(p_i) + (1 - y_i) * log(1 - p_i) )
The Wasserstein loss requires that we use y_i = 1 and y_i = -1 as labels, rather than 1 and 0. We also remove the sigmoid activation from the final layer of the discriminator, so that predictions p_i are no longer constrained to fall in the range [0, 1] but can now be any number in the range (−∞, ∞). For this reason, the discriminator in a WGAN is usually referred to as a critic. The Wasserstein loss function:
-(1/n) * sum( y_i * p_i )
The WGAN critic tries to maximize the difference between its predictions for real images and generated images, with real images scoring higher.
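In Keras this becomes a one-line custom loss. A sketch consistent with the labels above (y_true is 1 for real images and -1 for fakes):
from tensorflow.keras import backend as K

def wasserstein(y_true, y_pred):
    # Minimizing this pushes the critic's scores up for real images (y = 1)
    # and down for generated images (y = -1)
    return -K.mean(y_true * y_pred)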
The Lipschitz Constraint
It may surprise you that we are now allowing the critic to output any number in the range (−∞, ∞), rather than applying a sigmoid function. We need to place an additional constraint on the critic for the loss function to work: the critic must be a 1-Lipschitz continuous function.
Weight Clipping
In the WGAN paper, the authors show how it is possible to enforce the Lipschitz constraint by clipping the weights of the critic to lie within a small range, [-0.01, 0.01], after each training batch.
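A sketch of that clipping step, applied to a hypothetical critic model after each batch update:
import numpy as np

def clip_critic_weights(critic, clip_threshold=0.01):
    # Force every weight of the critic into [-clip_threshold, clip_threshold]
    for layer in critic.layers:
        weights = layer.get_weights()
        layer.set_weights([np.clip(w, -clip_threshold, clip_threshold) for w in weights])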
One of the main criticisms of the WGAN is that since we are clipping the weights of the critic, its capacity to learn is greatly diminished.
WGAN-GP
One of the most recent extensions of the WGAN literature is the Wasserstein GAN with Gradient Penalty (WGAN-GP) framework. The WGAN-GP generator is defined and compiled in exactly the same way as the WGAN generator. It is only the definition and compilation of the critic that we need to change. There are three changes to the WGAN critic to make it a WGAN-GP critic:
- Include a gradient penalty term in the critic loss function
- Don't clip the weights of the critic
- Don't use batch normalization layers in the critic
The image below shows the training process for the critic. If we compare this to the training process of the original WGAN critic, we see that the key addition is the gradient penalty loss, included as part of the overall loss function alongside the Wasserstein loss from the real and fake images.
The gradient penalty loss measures the squared difference between the norm of the gradient of the predictions with respect to the input images and 1. The model will naturally be inclined to find weights that ensure the gradient penalty term is minimized, thereby encouraging the model to conform to the Lipschitz constraint.
It is intractable to calculate this gradient everywhere during the training process, so instead the WGAN-GP evaluates the gradient at only a handful of points, interpolated at random between pairs of real and generated images.
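A sketch of the penalty term in the graph-mode Keras backend style of the book's code (names are mine; interpolated_samples would be built as random per-image mixes of a real batch and a generated batch):
from tensorflow.keras import backend as K

def gradient_penalty_loss(y_true, y_pred, interpolated_samples):
    # Gradient of the critic's predictions with respect to the interpolated images
    gradients = K.gradients(y_pred, interpolated_samples)[0]
    # l2 norm of that gradient for each image in the batch
    gradient_l2_norm = K.sqrt(K.sum(K.square(gradients), axis=[1, 2, 3]))
    # Penalize the squared distance of the norm from 1
    return K.mean(K.square(1 - gradient_l2_norm))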
Batch normalization should not be used in the critic of a WGAN-GP, since batch normalization creates correlation between images in the same batch, which conflicts with the per-image gradient penalty and makes it less effective.
Summary
If we compare WGAN-GP outputs to VAE outputs, we can see that the GAN images are generally sharper. This is true in general: VAEs tend to produce softer images that blur color boundaries, whereas GANs are known to produce sharper, more well-defined images.
It is also true that GANs are generally more difficult to train than VAEs and take longer to reach a satisfactory quality. However, most of the state-of-the-art generative models today are GAN-based, as the rewards for training large-scale GANs on GPUs over a longer period of time are significant.