Generative Deep Learning - Variational Autoencoders and Generative Adversarial Networks

Chapters 3 and 4 of Generative Deep Learning review Variational Autoencoders and Generative Adversarial Networks and how these deep learning architectures can be used for image generation.

Chapter 3: Variational Autoencoders

The variational autoencoder (VAE) is now one of the most fundamental and well-known deep learning architectures for generative modeling.

Autoencoders

An autoencoder is a neural network made up of two parts:

  1. An encoder network that compresses high-dimensional input data into a lower-dimensional representation vector
  2. A decoder network that decompresses a given representation vector back to the original domain.

This process is shown in the image below.

Diagram of an Autoencoder

The network is trained to find weights for the encoder and decoder that minimize the loss between the original input and the reconstruction after it has passed through the encoder and decoder.

The representation vector is a compression of the original image into a lower-dimensional latent space. The idea is that by choosing any point in the latent space, we should be able to generate novel images by passing this point through the decoder, since the decoder has learned how to convert points in the latent space into viable images.

Autoencoders can also be used to clean noisy images, since the encoder learns that capturing the position of random noise in the latent space is not useful for reconstruction. Generally speaking, it is a good idea to define a class for your model in a separate file; this way, you can instantiate an Autoencoder object in the notebook with the parameters that define a particular model architecture.

Convolutional Transpose Layers

Standard convolutional layers allow us to halve the size of an input tensor in both height and width by setting strides=2. The convolutional transpose layer uses the same principle as a standard convolutional layer (passing a filter across the image), but differs in that setting strides=2 doubles the size of the input tensor in both height and width. In a convolutional transpose layer, the strides parameter determines the internal zero padding between pixels in the image.

Convolutional Transpose Layer

In Keras, the Conv2DTranspose layer allows us to perform convolutional transpose operations on tensors. By stacking these layers, we can gradually expand the size of the tensor at each layer.
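
To see this behavior concretely, here is a minimal sketch (the shapes and filter counts are arbitrary example values) that halves the spatial dimensions with a strided Conv2D and then doubles them again with a strided Conv2DTranspose:

from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose
from tensorflow.keras.models import Model

# A toy model: strides=2 in Conv2D halves the spatial dimensions,
# and strides=2 in Conv2DTranspose doubles them again
inp = Input(shape=(28, 28, 1))
x = Conv2D(filters=32, kernel_size=3, strides=2, padding='same')(inp)        # -> (14, 14, 32)
x = Conv2DTranspose(filters=1, kernel_size=3, strides=2, padding='same')(x)  # -> (28, 28, 1)

Model(inp, x).summary()  # the summary confirms the halving and doubling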

# Setting up the environment for this textbook
!git clone https://github.com/davidADSP/GDL_code.git
# Make sure that you have the most up-to-date version of the codebase
%cd GDL_code
!git pull
!pip install virtualenv virtualenvwrapper
# The location where the virtual environments will be stored
!export WORKON_HOME=$HOME/.virtualenvs
# The default version of Python to use when a virtual environment
# is created
!export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/python3
# Reloads the virtualenvwrapper script
!source /usr/local/bin/virtualenvwrapper.sh
# Install the packages that we'll be using in this book
!pip install -r requirements.txt
out[2]

Cloning into 'GDL_code'...
remote: Enumerating objects: 394, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 394 (delta 0), reused 1 (delta 0), pack-reused 391 (from 1)
Receiving objects: 100% (394/394), 22.13 MiB | 27.18 MiB/s, done.
Resolving deltas: 100% (237/237), done.
fatal: not a git repository (or any of the parent directories): .git
Collecting virtualenv
Downloading virtualenv-20.26.4-py3-none-any.whl.metadata (4.5 kB)
Collecting virtualenvwrapper
Downloading virtualenvwrapper-6.1.0-py3-none-any.whl.metadata (5.1 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
Downloading distlib-0.3.8-py2.py3-none-any.whl.metadata (5.1 kB)
Requirement already satisfied: filelock<4,>=3.12.2 in /usr/local/lib/python3.10/dist-packages (from virtualenv) (3.16.0)
Requirement already satisfied: platformdirs<5,>=3.9.1 in /usr/local/lib/python3.10/dist-packages (from virtualenv) (4.3.2)
Collecting virtualenv-clone (from virtualenvwrapper)
Downloading virtualenv_clone-0.5.7-py3-none-any.whl.metadata (2.7 kB)
Collecting stevedore (from virtualenvwrapper)
Downloading stevedore-5.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting pbr>=2.0.0 (from stevedore->virtualenvwrapper)
Downloading pbr-6.1.0-py2.py3-none-any.whl.metadata (3.4 kB)
Downloading virtualenv-20.26.4-py3-none-any.whl (6.0 MB)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 37.9 MB/s eta 0:00:00
Downloading virtualenvwrapper-6.1.0-py3-none-any.whl (22 kB)
Downloading distlib-0.3.8-py2.py3-none-any.whl (468 kB)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 468.9/468.9 kB 22.7 MB/s eta 0:00:00
Downloading stevedore-5.3.0-py3-none-any.whl (49 kB)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.7/49.7 kB 3.4 MB/s eta 0:00:00
Downloading virtualenv_clone-0.5.7-py3-none-any.whl (6.6 kB)
Downloading pbr-6.1.0-py2.py3-none-any.whl (108 kB)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 108.5/108.5 kB 7.4 MB/s eta 0:00:00
Installing collected packages: distlib, virtualenv-clone, virtualenv, pbr, stevedore, virtualenvwrapper
Successfully installed distlib-0.3.8 pbr-6.1.0 stevedore-5.3.0 virtualenv-20.26.4 virtualenv-clone-0.5.7 virtualenvwrapper-6.1.0
virtualenvwrapper.user_scripts creating /root/.virtualenvs/premkproject
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postmkproject
virtualenvwrapper.user_scripts creating /root/.virtualenvs/initialize
virtualenvwrapper.user_scripts creating /root/.virtualenvs/premkvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postmkvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/prermvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postrmvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/predeactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postdeactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/preactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/get_env_details
ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, Activation, BatchNormalization, LeakyReLU, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import plot_model
import numpy as np
import json
import os
import pickle
from tensorflow.keras.callbacks import Callback, LearningRateScheduler
import matplotlib.pyplot as plt

#### CALLBACKS
class CustomCallback(Callback):

    def __init__(self, run_folder, print_every_n_batches, initial_epoch, vae):
        super().__init__()  # initialize the base Keras Callback
        self.epoch = initial_epoch
        self.run_folder = run_folder
        self.print_every_n_batches = print_every_n_batches
        self.vae = vae

    def on_batch_end(self, batch, logs={}):
        if batch % self.print_every_n_batches == 0:
            z_new = np.random.normal(size = (1,self.vae.z_dim))
            reconst = self.vae.decoder.predict(np.array(z_new))[0].squeeze()

            filepath = os.path.join(self.run_folder, 'images', 'img_' + str(self.epoch).zfill(3) + '_' + str(batch) + '.jpg')
            if len(reconst.shape) == 2:
                plt.imsave(filepath, reconst, cmap='gray_r')
            else:
                plt.imsave(filepath, reconst)

    def on_epoch_begin(self, epoch, logs={}):
        self.epoch += 1



def step_decay_schedule(initial_lr, decay_factor=0.5, step_size=1):
    '''
    Wrapper function to create a LearningRateScheduler with step decay schedule.
    '''
    def schedule(epoch):
        new_lr = initial_lr * (decay_factor ** np.floor(epoch/step_size))

        return new_lr

    return LearningRateScheduler(schedule)
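
# Example usage (hypothetical values): the LearningRateScheduler returned by
# step_decay_schedule can be passed to model.fit via the callbacks argument:
#   lr_sched = step_decay_schedule(initial_lr=0.0005, decay_factor=0.5, step_size=1)
#   model.fit(x_train, x_train, epochs=10, callbacks=[lr_sched])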


class Autoencoder():
    """
    I only include things / methods in this class that were commented on
    in the textbook. To see the entire class, see the Jupyter Notebook for
    this textbook.
    """
    def __init__(self
        , input_dim
        , encoder_conv_filters
        , encoder_conv_kernel_size
        , encoder_conv_strides
        , decoder_conv_t_filters
        , decoder_conv_t_kernel_size
        , decoder_conv_t_strides
        , z_dim
        , use_batch_norm = False
        , use_dropout = False
        ):

        self.name = 'autoencoder'

        self.input_dim = input_dim
        self.encoder_conv_filters = encoder_conv_filters
        self.encoder_conv_kernel_size = encoder_conv_kernel_size
        self.encoder_conv_strides = encoder_conv_strides
        self.decoder_conv_t_filters = decoder_conv_t_filters
        self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
        self.decoder_conv_t_strides = decoder_conv_t_strides
        self.z_dim = z_dim

        self.use_batch_norm = use_batch_norm
        self.use_dropout = use_dropout

        self.n_layers_encoder = len(encoder_conv_filters)
        self.n_layers_decoder = len(decoder_conv_t_filters)

        self._build()

    def _build(self):

        ### THE ENCODER
        """
        In an autoencoder, the encoder's job is to take the input image
        and map it to a point in the latent space. To achieve this,
        we first create an input layer for the image and pass this
        through four Conv2D layers, each capturing increasingly high-level
        features. We use a stride of 2 in some of the layers to reduce the
        size of the output. The last convolutional layer is flattened and
        connected to a Dense layer of size 2, which represents our
        two-dimensional latent space.
        """
        # Define the input to the encoder
        encoder_input = Input(shape=self.input_dim, name='encoder_input')

        x = encoder_input


        for i in range(self.n_layers_encoder):
            conv_layer = Conv2D(
                filters = self.encoder_conv_filters[i]
                , kernel_size = self.encoder_conv_kernel_size[i]
                , strides = self.encoder_conv_strides[i]
                , padding = 'same'
                , name = 'encoder_conv_' + str(i)
                )
            # Stack convolutional layers sequentially on top of each other
            x = conv_layer(x)

            x = LeakyReLU()(x)

            if self.use_batch_norm:
                x = BatchNormalization()(x)

            if self.use_dropout:
                x = Dropout(rate = 0.25)(x)

        # Flatten the last convolutional layer to a vector
        shape_before_flattening = K.int_shape(x)[1:]

        x = Flatten()(x)

        # Dense layer that connects this vector to the 2D latent space
        encoder_output = Dense(self.z_dim, name='encoder_output')(x)

        # The Keras model that defines the encoder - a model that takes an
        # input image and encodes it into the 2D latent space.
        self.encoder = Model(encoder_input, encoder_output)


        ### THE DECODER
        """
        The decoder is a mirror image of the encoder, except instead of
        convolutional layers, we use convolutional transpose layers.
        Note that the decoder doesn't have to be a mirror image of the
        encoder. It can be anything you want, as long as the output from
        the last layer of the decoder is the same size as the input to the
        encoder (since our loss function will compare these pixel-wise).
        """
        # Define the input to the decoder
        decoder_input = Input(shape=(self.z_dim,), name='decoder_input')

        # Connect the input to a Dense layer
        x = Dense(np.prod(shape_before_flattening))(decoder_input)
        # Reshape vector into tensor so that it can be fed as input
        # into the first convolutional transpose layer
        x = Reshape(shape_before_flattening)(x)

        for i in range(self.n_layers_decoder):
            conv_t_layer = Conv2DTranspose(
                filters = self.decoder_conv_t_filters[i]
                , kernel_size = self.decoder_conv_t_kernel_size[i]
                , strides = self.decoder_conv_t_strides[i]
                , padding = 'same'
                , name = 'decoder_conv_t_' + str(i)
                )
            # Stack convolutional Transpose layers on top of each other
            x = conv_t_layer(x)

            if i < self.n_layers_decoder - 1:
                x = LeakyReLU()(x)

                if self.use_batch_norm:
                    x = BatchNormalization()(x)

                if self.use_dropout:
                    x = Dropout(rate = 0.25)(x)
            else:
                x = Activation('sigmoid')(x)

        decoder_output = x

        # The model that defines the Decoder - a model that takes a point
        # in the latent space and decodes it into the original space domain
        self.decoder = Model(decoder_input, decoder_output)

        ### THE FULL AUTOENCODER
        # Input to autoencoder = input to encoder
        model_input = encoder_input
        # output from the autoencoder is the output from the encoder passed
        # to the decoder
        model_output = self.decoder(encoder_output)

        # The Keras model that defines the full autoencoder - a model that takes
        # an image and passes it through the encoder and back out through the
        # decoder to generate a reconstruction of the original image.
        self.model = Model(model_input, model_output)


    def compile(self, learning_rate):
        self.learning_rate = learning_rate

        optimizer = Adam(learning_rate=learning_rate)

        def r_loss(y_true, y_pred):
          """
          The loss function is usually chosen to be either the root mean squared
          error or binary cross-entropy between the individual pixels of the
          original image and the reconstruction.
          """
          return K.mean(K.square(y_true - y_pred), axis = [1,2,3])

        self.model.compile(optimizer=optimizer, loss = r_loss)
out[3]
AE = Autoencoder(
 input_dim = (28,28,1)
 , encoder_conv_filters = [32,64,64, 64]
 , encoder_conv_kernel_size = [3,3,3,3]
 , encoder_conv_strides = [1,2,2,1]
 , decoder_conv_t_filters = [64,64,32,1]
 , decoder_conv_t_kernel_size = [3,3,3,3]
 , decoder_conv_t_strides = [1,2,2,1]
 , z_dim = 2)
out[4]
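
To train this model, a minimal sketch (assuming x_train is an array of images with shape (n, 28, 28, 1), scaled to the range [0, 1]):

# Compile with the learning rate parameter defined in the class's compile method
AE.compile(learning_rate=0.0005)

# An autoencoder is trained to reconstruct its own input,
# so the input data doubles as the target data
AE.model.fit(
    x_train,
    x_train,
    batch_size=32,
    epochs=10,
    shuffle=True
)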

Building a Variational Autoencoder

In an autoencoder, each image is mapped directly to one point in the latent space. In a variational autoencoder, each image is instead mapped to a multivariate normal distribution around a point in the latent space.

Encoder vs Autoencoder

The Normal Distribution

A normal distribution is a probability distribution characterized by a distinctive bell curve shape. In one dimension, it is defined by two variables: the mean ($\mu$) and the variance ($\sigma^2$). The standard deviation ($\sigma$) is the square root of the variance. The probability density function of the normal distribution in one dimension is:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The concept of a normal distribution extends to more than one dimension. The probability density function for a general multivariate normal distribution in $k$ dimensions is:

$$f(x_1, \ldots, x_k) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right)}{\sqrt{(2\pi)^k \lvert \Sigma \rvert}}$$

In 2D, the mean vector $\mu$ and the symmetric covariance matrix $\Sigma$ are defined as:

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix}$$

where $\rho$ is the correlation between the two dimensions $x_1$ and $x_2$.

Variational autoencoders assume that there is no correlation between any of the dimensions in the latent space and therefore that the covariance matrix is diagonal. This means the encoder only needs to map each input to a mean vector and a variance vector and does not need to worry about covariance between dimensions.

The encoder will take each input image and encode it to two vectors, mu and log_var, which together define a multivariate normal distribution in the latent space:

  • mu: The mean of the distribution
  • log_var: The logarithm of the variance of each dimension

To encode an image into a specific point z in the latent space, we can sample from this distribution using:

z = mu + sigma * epsilon

where

sigma = exp(log_var / 2)

and epsilon is a point sampled from the standard normal distribution.

How does this help the autoencoder? Since we are sampling from an area around mu, the decoder must ensure that all points in the same neighborhood produce very similar images when decoded, so that the reconstruction loss remains small.
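
A minimal NumPy sketch of this sampling step (the "reparameterization trick"), using made-up values for mu and log_var:

import numpy as np

# Made-up encoder outputs for a 2D latent space
mu = np.array([0.3, -1.2])
log_var = np.array([-0.5, 0.1])

# Isolate the randomness in epsilon, then shift and scale it; this keeps
# the path through mu and log_var differentiable during training
sigma = np.exp(log_var / 2)
epsilon = np.random.normal(size=mu.shape)
z = mu + sigma * epsilon
print(z)  # a sampled point in the latent space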


from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, Activation, BatchNormalization, LeakyReLU, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import plot_model
import numpy as np
import json
import os
import pickle


class VariationalAutoencoder():
    """
    I only include things / methods in this class that were commented on
    in the textbook. To see the entire class, see the Jupyter Notebook for
    this textbook.
    """
    def __init__(self
        , input_dim
        , encoder_conv_filters
        , encoder_conv_kernel_size
        , encoder_conv_strides
        , decoder_conv_t_filters
        , decoder_conv_t_kernel_size
        , decoder_conv_t_strides
        , z_dim
        , use_batch_norm = False
        , use_dropout= False
        ):

        self.name = 'variational_autoencoder'

        self.input_dim = input_dim
        self.encoder_conv_filters = encoder_conv_filters
        self.encoder_conv_kernel_size = encoder_conv_kernel_size
        self.encoder_conv_strides = encoder_conv_strides
        self.decoder_conv_t_filters = decoder_conv_t_filters
        self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
        self.decoder_conv_t_strides = decoder_conv_t_strides
        self.z_dim = z_dim

        self.use_batch_norm = use_batch_norm
        self.use_dropout = use_dropout

        self.n_layers_encoder = len(encoder_conv_filters)
        self.n_layers_decoder = len(decoder_conv_t_filters)

        self._build()

    def _build(self):

        ### THE ENCODER
        encoder_input = Input(shape=self.input_dim, name='encoder_input')

        x = encoder_input

        for i in range(self.n_layers_encoder):
            conv_layer = Conv2D(
                filters = self.encoder_conv_filters[i]
                , kernel_size = self.encoder_conv_kernel_size[i]
                , strides = self.encoder_conv_strides[i]
                , padding = 'same'
                , name = 'encoder_conv_' + str(i)
                )

            x = conv_layer(x)

            if self.use_batch_norm:
                x = BatchNormalization()(x)

            x = LeakyReLU()(x)

            if self.use_dropout:
                x = Dropout(rate = 0.25)(x)

        shape_before_flattening = K.int_shape(x)[1:]

        x = Flatten()(x)
        # Instead of connecting the flattened layer directly to the 2D latent space,
        # we connect it to layers mu and log_var
        self.mu = Dense(self.z_dim, name='mu')(x)
        self.log_var = Dense(self.z_dim, name='log_var')(x)
        # The Keras model that outputs the values of mu and log_var
        # for a given input image
        self.encoder_mu_log_var = Model(encoder_input, (self.mu, self.log_var))

        def sampling(args):
            mu, log_var = args
            epsilon = K.random_normal(shape=K.shape(mu), mean=0., stddev=1.)
            return mu + K.exp(log_var / 2) * epsilon
        # Samples a point z in the latent space from the normal distribution
        # defined by the parameters mu and log_var
        encoder_output = Lambda(sampling, name='encoder_output')([self.mu, self.log_var])

        # Defines the encoder - a model that takes an input image and encodes
        # it into the 2D latent space, by sampling a point from the normal
        # distribution defined by mu and log_var
        self.encoder = Model(encoder_input, encoder_output)



        ### THE DECODER

        decoder_input = Input(shape=(self.z_dim,), name='decoder_input')

        x = Dense(np.prod(shape_before_flattening))(decoder_input)
        x = Reshape(shape_before_flattening)(x)

        for i in range(self.n_layers_decoder):
            conv_t_layer = Conv2DTranspose(
                filters = self.decoder_conv_t_filters[i]
                , kernel_size = self.decoder_conv_t_kernel_size[i]
                , strides = self.decoder_conv_t_strides[i]
                , padding = 'same'
                , name = 'decoder_conv_t_' + str(i)
                )

            x = conv_t_layer(x)

            if i < self.n_layers_decoder - 1:
                if self.use_batch_norm:
                    x = BatchNormalization()(x)
                x = LeakyReLU()(x)
                if self.use_dropout:
                    x = Dropout(rate = 0.25)(x)
            else:
                x = Activation('sigmoid')(x)



        decoder_output = x

        self.decoder = Model(decoder_input, decoder_output)

        ### THE FULL VAE
        model_input = encoder_input
        model_output = self.decoder(encoder_output)

        self.model = Model(model_input, model_output)


    def compile(self, learning_rate, r_loss_factor):
        self.learning_rate = learning_rate

        ### COMPILATION

        def vae_r_loss(y_true, y_pred):

            r_loss = K.mean(K.square(y_true - y_pred), axis = [1,2,3])
            return r_loss_factor * r_loss

        def vae_kl_loss(y_true, y_pred):
            kl_loss =  -0.5 * K.sum(1 + self.log_var - K.square(self.mu) - K.exp(self.log_var), axis = 1)
            return kl_loss

        def vae_loss(y_true, y_pred):
            r_loss = vae_r_loss(y_true, y_pred)
            kl_loss = vae_kl_loss(y_true, y_pred)
            return  r_loss + kl_loss

        optimizer = Adam(learning_rate=learning_rate)
        self.model.compile(optimizer=optimizer, loss = vae_loss,  metrics = [vae_r_loss, vae_kl_loss])

out[6]

The Loss Function

We still use RMSE for the reconstruction loss, but we must add something else: the Kullback-Leibler (KL) divergence, a way of measuring how much one probability distribution differs from another. In this case, the KL divergence has the closed form:

kl_loss = -0.5 * sum(1 + log_var - mu ^ 2 - exp(log_var))

In mathematical notation:

$$D_{KL}\left[N(\mu, \sigma) \,\|\, N(0, 1)\right] = -\frac{1}{2}\sum \left(1 + \log(\sigma^2) - \mu^2 - \sigma^2\right)$$

The KL divergence term penalizes the network for encoding observations to mu and log_var values that differ significantly from the parameters of a standard normal distribution, namely mu = 0 and log_var = 0. The addition of the KL divergence means that when we sample from the normal distribution, we will get a point that likely lies within the bounds of what the VAE is used to seeing, and that points will be sampled from the normal distribution symmetrically and efficiently.
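
As a quick numerical sanity check (with made-up values), the KL term is zero exactly when mu = 0 and log_var = 0, and grows as the encoded distribution drifts away from the standard normal:

import numpy as np

def kl_divergence(mu, log_var):
    # Closed-form KL divergence between N(mu, sigma) and N(0, 1)
    return -0.5 * np.sum(1 + log_var - np.square(mu) - np.exp(log_var))

print(kl_divergence(np.zeros(2), np.zeros(2)))                     # 0.0
print(kl_divergence(np.array([1.0, -0.5]), np.array([0.3, 0.2])))  # > 0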

Using VAE to Generate Faces

The size of the latent space must increase with the complexity of the images. Unlike Naive Bayes, the variational autoencoder does not suffer from an inability to capture dependencies between adjacent pixels - the convolutional layers of the encoder are designed to translate low-level pixels into high-level features, and the decoder is trained to perform the opposite task of translating high-level features in the latent space back to raw pixels.

Latent Space Arithmetic

One benefit of mapping images into a lower-dimensional space is that we can perform arithmetic in this latent space that has a visual analogue when decoded back into the original image domain. Conceptually, we perform the following vector arithmetic, where alpha is a factor that determines how much of the feature vector is added or subtracted (and feature_vector is the average position of encoded images with the attribute you want to add, minus the average position of encoded images without that attribute):

z_new = z + alpha * feature_vector

You can use latent space arithmetic to add a smile where there was none, to morph between two faces, and more.
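
A sketch of how this might look in code, assuming a trained VAE and two hypothetical arrays of images, imgs_with_attribute and imgs_without_attribute:

import numpy as np

# Hypothetical inputs: encode faces with and without the attribute
z_with = VAE.encoder.predict(imgs_with_attribute)
z_without = VAE.encoder.predict(imgs_without_attribute)

# The feature vector points from "without" to "with" in the latent space
feature_vector = z_with.mean(axis=0) - z_without.mean(axis=0)

# Shift an encoded image along the feature vector and decode the result
z = VAE.encoder.predict(face_img[np.newaxis, ...])  # face_img is hypothetical
alpha = 1.5
z_new = z + alpha * feature_vector
morphed_img = VAE.decoder.predict(z_new)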

In summary, variational autoencoders solve some problems of autoencoders by introducing randomness into the model and constraining how points in the latent space are distributed.

Generative Adversarial Networks

Introduction to GANs

A GAN is a battle between two adversaries, the generator and the discriminator. The generator tries to convert random noise into observations that look as if they have been sampled from the original dataset and the discriminator tries to predict whether an observation comes from the original dataset or is one of the generator's forgeries.

GAN Picture

At the start of this process, the generator outputs noisy images and the discriminator predicts randomly. The key to GANs lies in how we alternate the training of the two networks, so that as the generator becomes more adept at fooling the discriminator, the discriminator must adapt in order to maintain its ability to correctly identify which observations are fake. This drives the generator to find new ways to fool the discriminator, and so the cycle continues.

Your First GAN

The Discriminator

The goal of the discriminator is to predict whether an image is real or fake. This is a supervised image classification problem, so it is commonplace to use convolutional layers in GANs, even though the original paper used dense layers. It is also common to see batch normalization layers in the discriminator of vanilla GANs.

The Generator

The input to the generator is a vector, usually drawn from a multivariate standard normal distribution. The output is an image of the same size as an image in the original training data.

This description may remind you of the decoder in a variational autoencoder; the generator of a GAN fulfills exactly the same purpose as the decoder of a VAE: converting a vector in the latent space to an image. The concept of mapping from a latent space back to the original domain is very common in generative modeling as it gives us the ability to manipulate vectors in the latent space to change high-level features of images in the original domain.

Upsampling

In this GAN, we use the Keras UpSampling2D layer to double the width and height of the input tensor. This simply repeats each row and column of its input in order to double the size. We then follow this up with a normal convolutional layer with stride 1 to perform the convolution operation. It is a similar idea to the convolutional transpose, but instead of filling the gaps between pixels with zeros, upsampling just repeats the existing pixel values.

Both methods - UpSampling2D + Conv2D and Conv2DTranspose - are acceptable ways to transform back to the original image domain. It really is a case of testing both methods in your own problem setting and seeing which produces better results. It has been shown that the Conv2DTranspose method can lead to artifacts, small checkerboard patterns in the output image that spoil its quality.
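
The two approaches can be compared side by side; in the sketch below (filter counts are arbitrary example values), both map a 14 x 14 tensor to 28 x 28:

from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose, UpSampling2D
from tensorflow.keras.models import Model

inp = Input(shape=(14, 14, 64))

# Option 1: repeat rows and columns, then convolve with stride 1
x1 = UpSampling2D(size=2)(inp)
x1 = Conv2D(filters=64, kernel_size=3, strides=1, padding='same')(x1)

# Option 2: a single convolutional transpose layer with stride 2
x2 = Conv2DTranspose(filters=64, kernel_size=3, strides=2, padding='same')(inp)

# Both outputs have shape (28, 28, 64)
Model(inp, [x1, x2]).summary()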

Artifacts When Using Convolutional Transpose Layers

Training the GAN

We can train the discriminator by creating a training set where some of the images are randomly selected real observations from the training set and some are outputs from the generator. The response would be 1 for the true images and 0 for the generated images. If we treat this as a supervised learning problem, we can train the discriminator to learn how to tell the difference between the original and generated images, outputting values near 1 for the true images and values near 0 for the fake images.

To train the generator, we must first connect it to the discriminator to create a Keras model that we can train. Specifically, we feed the output from the generator (a 28 x 28 x 1 image) into the discriminator so that the output from this combined model is the probability that the generated image is real, according to the discriminator. We can train this combined model by creating training batches consisting of randomly generated 100-dimensional latent vectors as input and a response which is set to 1, since we want to train the generator to produce images that the discriminator thinks are real.

The loss is then just a binary cross-entropy loss between the output of the discriminator and the response vector of 1.

We must freeze the weights of the discriminator while we are training the combined model, so that only the generator's weights are updated. If we do not freeze the discriminator's weights, the discriminator will adjust so that it is more likely to predict generated images as real, which is not the desired outcome. We want generated images to be predicted close to 1 (real) because the generator is strong, not because the discriminator is weak.

Training the GAN
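
Putting the two halves together, a typical training loop simply alternates these two updates. Below is a sketch using the train_discriminator and train_generator methods of the GAN class shown next (gan is a hypothetical instance and the step and batch counts are arbitrary):

# Alternate one discriminator update and one generator update per step
for step in range(2000):
    d_metrics = gan.train_discriminator(x_train, batch_size=64, using_generator=False)
    g_metrics = gan.train_generator(batch_size=64)

    if step % 100 == 0:
        print('step %d: d_loss=%.3f, g_loss=%.3f' % (step, d_metrics[0], g_metrics[0]))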


from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, Activation, BatchNormalization, LeakyReLU, Dropout, ZeroPadding2D, UpSampling2D
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.utils import plot_model
from tensorflow.keras.initializers import RandomNormal

import numpy as np
import json
import os
import pickle as pkl
import matplotlib.pyplot as plt


class GAN():
    """
    I only include things / methods in this class that were commented on
    in the textbook. To see the entire class, see the Jupyter Notebook for
    this textbook.
    """
    def __init__(self
        , input_dim
        , discriminator_conv_filters
        , discriminator_conv_kernel_size
        , discriminator_conv_strides
        , discriminator_batch_norm_momentum
        , discriminator_activation
        , discriminator_dropout_rate
        , discriminator_learning_rate
        , generator_initial_dense_layer_size
        , generator_upsample
        , generator_conv_filters
        , generator_conv_kernel_size
        , generator_conv_strides
        , generator_batch_norm_momentum
        , generator_activation
        , generator_dropout_rate
        , generator_learning_rate
        , optimiser
        , z_dim
        ):

        self.name = 'gan'

        self.input_dim = input_dim
        self.discriminator_conv_filters = discriminator_conv_filters
        self.discriminator_conv_kernel_size = discriminator_conv_kernel_size
        self.discriminator_conv_strides = discriminator_conv_strides
        self.discriminator_batch_norm_momentum = discriminator_batch_norm_momentum
        self.discriminator_activation = discriminator_activation
        self.discriminator_dropout_rate = discriminator_dropout_rate
        self.discriminator_learning_rate = discriminator_learning_rate

        self.generator_initial_dense_layer_size = generator_initial_dense_layer_size
        self.generator_upsample = generator_upsample
        self.generator_conv_filters = generator_conv_filters
        self.generator_conv_kernel_size = generator_conv_kernel_size
        self.generator_conv_strides = generator_conv_strides
        self.generator_batch_norm_momentum = generator_batch_norm_momentum
        self.generator_activation = generator_activation
        self.generator_dropout_rate = generator_dropout_rate
        self.generator_learning_rate = generator_learning_rate

        self.optimiser = optimiser
        self.z_dim = z_dim

        self.n_layers_discriminator = len(discriminator_conv_filters)
        self.n_layers_generator = len(generator_conv_filters)

        self.weight_init = RandomNormal(mean=0., stddev=0.02)

        self.d_losses = []
        self.g_losses = []

        self.epoch = 0

        self._build_discriminator()
        self._build_generator()

        self._build_adversarial()

    def get_activation(self, activation):
        if activation == 'leaky_relu':
            layer = LeakyReLU(alpha = 0.2)
        else:
            layer = Activation(activation)
        return layer

    def _build_discriminator(self):

        ### THE discriminator

        # Define the input to the discriminator (the image)
        discriminator_input = Input(shape=self.input_dim, name='discriminator_input')

        x = discriminator_input

        # Stack Convolutional layers on top of each other
        for i in range(self.n_layers_discriminator):

            x = Conv2D(
                filters = self.discriminator_conv_filters[i]
                , kernel_size = self.discriminator_conv_kernel_size[i]
                , strides = self.discriminator_conv_strides[i]
                , padding = 'same'
                , name = 'discriminator_conv_' + str(i)
                , kernel_initializer = self.weight_init
                )(x)

            if self.discriminator_batch_norm_momentum and i > 0:
                x = BatchNormalization(momentum = self.discriminator_batch_norm_momentum)(x)

            x = self.get_activation(self.discriminator_activation)(x)

            if self.discriminator_dropout_rate:
                x = Dropout(rate = self.discriminator_dropout_rate)(x)
        # Flatten the last convolutional layer to a vector
        x = Flatten()(x)
        # Dense layer of one unit, with a sigmoid activation function
        # that transforms the output from the dense layer to the range [0,1]
        discriminator_output = Dense(1, activation='sigmoid', kernel_initializer = self.weight_init)(x)
        # The Keras model that defines the discriminator - a model that takes
        # an input image and outputs a single number between 0 and 1
        self.discriminator = Model(discriminator_input, discriminator_output)


    def _build_generator(self):

        ### THE generator
        # Define the input to the generator - a vector of length 100
        generator_input = Input(shape=(self.z_dim,), name='generator_input')

        x = generator_input

        # We follow this with a Dense layer consisting of 3,136 units
        x = Dense(np.prod(self.generator_initial_dense_layer_size), kernel_initializer = self.weight_init)(x)

        if self.generator_batch_norm_momentum:
            x = BatchNormalization(momentum = self.generator_batch_norm_momentum)(x)

        x = self.get_activation(self.generator_activation)(x)
        # which, after applying batch normalization and a ReLU activation
        # function, is reshaped to a 7 x 7 x 64 tensor
        x = Reshape(self.generator_initial_dense_layer_size)(x)

        if self.generator_dropout_rate:
            x = Dropout(rate = self.generator_dropout_rate)(x)
        # Pass through 4 Conv2D layers, the first two preceded by
        # UpSampling2D layers, to reshape the tensor to 14 x 14, then 28 x 28.
        # In all but the last layer, we use batch normalization and ReLU
        # activation
        for i in range(self.n_layers_generator):

            if self.generator_upsample[i] == 2:
                x = UpSampling2D()(x)
                x = Conv2D(
                    filters = self.generator_conv_filters[i]
                    , kernel_size = self.generator_conv_kernel_size[i]
                    , padding = 'same'
                    , name = 'generator_conv_' + str(i)
                    , kernel_initializer = self.weight_init
                )(x)
            else:

                x = Conv2DTranspose(
                    filters = self.generator_conv_filters[i]
                    , kernel_size = self.generator_conv_kernel_size[i]
                    , padding = 'same'
                    , strides = self.generator_conv_strides[i]
                    , name = 'generator_conv_' + str(i)
                    , kernel_initializer = self.weight_init
                    )(x)
            # After the final Conv2D layer, we use a tanh activation to
            # transform the output to the range [-1,1] to match the original
            # image domain
            if i < self.n_layers_generator - 1:

                if self.generator_batch_norm_momentum:
                    x = BatchNormalization(momentum = self.generator_batch_norm_momentum)(x)

                x = self.get_activation(self.generator_activation)(x)


            else:

                x = Activation('tanh')(x)


        generator_output = x
        # The Keras model that defines the generator - a model that accepts
        # a vector of length 100 and outputs a tensor of shape [28,28,1]
        self.generator = Model(generator_input, generator_output)

    def _build_adversarial(self):

        ### COMPILE DISCRIMINATOR

        # The discriminator is compiled with binary cross entropy loss, as
        # the response is binary and we have one output unit with sigmoid
        # activation
        self.discriminator.compile(
        optimizer=self.get_opti(self.discriminator_learning_rate)
        , loss = 'binary_crossentropy'
        ,  metrics = ['accuracy']
        )

        ### COMPILE THE FULL GAN

        # Freeze the discriminator weights - this doesn't affect the existing
        # discriminator model that we have already compiled
        self.set_trainable(self.discriminator, False)

        model_input = Input(shape=(self.z_dim,), name='model_input')
        model_output = self.discriminator(self.generator(model_input))
        # Define a new model whose input is a 100-dimensional latent vector
        # this is passed through the generator and frozen discriminator to
        # produce the output probability
        self.model = Model(model_input, model_output)

        # Again, we use a binary cross-entropy loss for the combined model -
        # the learning rate is slower than the discriminator's, as generally we
        # would like the discriminator to be stronger than the generator. The
        # learning rate is a parameter that should be tuned carefully for
        # each GAN problem setting.
        self.model.compile(optimizer=self.get_opti(self.generator_learning_rate) , loss='binary_crossentropy', metrics=['accuracy'])

        self.set_trainable(self.discriminator, True)




    def train_discriminator(self, x_train, batch_size, using_generator):

        valid = np.ones((batch_size,1))
        fake = np.zeros((batch_size,1))

        if using_generator:
            true_imgs = next(x_train)[0]
            if true_imgs.shape[0] != batch_size:
                true_imgs = next(x_train)[0]
        else:
            idx = np.random.randint(0, x_train.shape[0], batch_size)
            true_imgs = x_train[idx]

        noise = np.random.normal(0, 1, (batch_size, self.z_dim))
        gen_imgs = self.generator.predict(noise)

        # One batch update of the discriminator involves first training
        # on a batch of true images with the response of 1
        d_loss_real, d_acc_real =   self.discriminator.train_on_batch(true_imgs, valid)
        # Then on a batch of generated images with the response 0
        d_loss_fake, d_acc_fake =   self.discriminator.train_on_batch(gen_imgs, fake)
        d_loss =  0.5 * (d_loss_real + d_loss_fake)
        d_acc = 0.5 * (d_acc_real + d_acc_fake)

        return [d_loss, d_loss_real, d_loss_fake, d_acc, d_acc_real, d_acc_fake]

    def train_generator(self, batch_size):
        valid = np.ones((batch_size,1))
        noise = np.random.normal(0, 1, (batch_size, self.z_dim))
        # One batch update of the generator involves training on a batch of
        # generated images with the response 1. As the discriminator is frozen,
        # its weights will not be affected; instead, the generator weights will
        # move in the direction that allows it to better generate images that
        # are more likely to fool the discriminator
        return self.model.train_on_batch(noise, valid)

out[9]

After a suitable number of epochs, the discriminator and generator will have found an equilibrium that allows the generator to learn meaningful information from the discriminator and the quality of the images will start to improve.

Loss and Accuracy of the Generator During Training

A requirement for a successful generative model is that it doesn't simply reproduce images from the training set. To test this, we can find the image from the training set that is closest to a particular generated example. A good measure of distance is the $\ell_1$ distance.
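
A minimal sketch of this check (x_train and gen_img are assumed to be arrays of images with the same shape):

import numpy as np

def l1_distance(a, b):
    # Mean absolute pixel difference between two images
    return np.mean(np.abs(a - b))

# Find the training image closest to a generated image
distances = [l1_distance(img, gen_img) for img in x_train]
closest_img = x_train[int(np.argmin(distances))]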

GAN Challenges

GANs are notoriously difficult to train. This section looks at common problems when training GANs:

Oscillating Loss

The loss of the discriminator and generator can start to oscillate wildly, rather than exhibiting long-term stability. Typically, there is some small oscillation of the loss between batches, but in the long term you should be looking for loss that stabilizes or gradually increases or decreases, rather than erratically fluctuating, to ensure your GAN converges and improves over time.

Oscillating Loss

Mode Collapse

Mode collapse occurs when the generator finds a small number of samples that fool the discriminator and therefore isn't able to produce any examples other than this limited set.

Uninformative Loss

The lack of correlation between the generator loss and image quality sometimes makes GAN training difficult to monitor.

Hyperparameters

Even with simple GANs, there are a large number of hyperparameters to tune. GANs are highly sensitive to slight changes in all of these parameters, and finding a set that works is often a case of educated trial and error rather than following an established set of guidelines. This is why it is important to understand the inner workings of GANs.

Tackling the GAN Challenges

In recent years, several key adjustments have drastically improved the overall stability of GAN models and diminished the likelihood of some of the problems above - such as the Wasserstein GAN (WGAN) and the Wasserstein GAN with Gradient Penalty (WGAN-GP). Both are minor adjustments to the vanilla GAN framework, and the latter is now considered best practice.

Wasserstein GAN

The Wasserstein GAN was one of the first big steps toward stabilizing GAN training. The authors were able to show how to train GANs that have the following two properties:

  • A meaningful loss metric that correlates with the generator's convergence and sample quality
  • Improved stability of the optimization process

The paper introduces a new loss function for both the discriminator and the generator. Using this loss function instead of binary cross-entropy results in more stable convergence.

Wasserstein Loss

Binary Cross-Entropy Loss (Loss Function for the Original GAN)

$$-\frac{1}{n}\sum_{i=1}^n \left(y_i \log(p_i) + (1 - y_i) \log(1 - p_i)\right)$$

The Wasserstein loss requires that we use $y_i = 1$ and $y_i = -1$ as labels, rather than 1 and 0. We also remove the sigmoid activation from the final layer of the discriminator, so that predictions $p_i$ are no longer constrained to fall in the range [0, 1] but can instead be any number in $(-\infty, \infty)$. For this reason, the discriminator in a WGAN is usually referred to as a critic. The Wasserstein loss function is:

$$-\frac{1}{n} \sum_{i=1}^n y_i p_i$$

The WGAN critic tries to maximize the difference between its predictions for real images and generated images, with real images scoring higher.
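
In Keras this can be written as a short custom loss; a sketch, assuming labels of 1 for real images and -1 for fake images:

from tensorflow.keras import backend as K

def wasserstein(y_true, y_pred):
    # With y_true = 1 for real and -1 for fake, minimizing this loss pushes
    # the critic's scores for real images up and for fake images down
    return -K.mean(y_true * y_pred)

# Hypothetical usage:
# critic.compile(optimizer=RMSprop(learning_rate=0.00005), loss=wasserstein)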

The Lipschitz Constraint

It may surprise you that we are now allowing the critic to output any number in the range $(-\infty, \infty)$, rather than applying a sigmoid function. We therefore need to place an additional constraint on the critic for the loss function to work: the critic must be a 1-Lipschitz continuous function.

Weight Clipping

In the WGAN paper, the authors show how it is possible to enforce the Lipschitz constraint by clipping the weights of the critic to lie within a small range, [-0.01, 0.01], after each training batch.
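
A sketch of how this clipping might be applied after each training batch (critic is a hypothetical compiled Keras model; the threshold of 0.01 follows the paper):

import numpy as np

# Clip every weight of every layer into the range [-0.01, 0.01]
clip_threshold = 0.01
for layer in critic.layers:
    weights = [np.clip(w, -clip_threshold, clip_threshold) for w in layer.get_weights()]
    layer.set_weights(weights)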

One of the main criticisms of the WGAN is that since we are clipping the weights of the critic, its capacity to learn is greatly diminished.

WGAN-GP

One of the most recent extensions of the WGAN literature is the Wasserstein GAN with Gradient Penalty (WGAN-GP) framework. The WGAN-GP generator is defined and compiled in exactly the same way as the WGAN generator. It is only the definition and compilation of the critic that we need to change. There are three changes to the WGAN critic to make it a WGAN-GP critic:

  • Include a gradient penalty term in the critic loss function
  • Don't clip the weights of the critic
  • Don't use batch normalization layers in the critic

The image below shows the training process for the critic. If we compare this to the original discriminator training process, we see that the key addition is the gradient penalty loss included as part of the overall loss function, alongside the Wasserstein loss from the real and fake images.

WGAN-GP Critic Training Process

The gradient penalty loss measures the squared difference between the norm of the gradient of the predictions with respect to the input images and 1. The model will naturally be inclined to find weights that ensure the gradient penalty term is minimized, thereby encouraging the model to conform to the Lipschitz constraint.

It is intractable to calculate this gradient everywhere during the training process, so instead the WGAN-GP evaluates the gradient at only a handful of points.
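
A sketch of how the interpolation and penalty could be computed in TensorFlow 2 style with GradientTape (this differs from the book's implementation, which uses a custom Keras merge layer; critic is a hypothetical model):

import tensorflow as tf

def gradient_penalty(critic, real_imgs, fake_imgs):
    # Evaluate the gradient only at random points interpolated between
    # pairs of real and generated images
    batch_size = tf.shape(real_imgs)[0]
    alpha = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    interpolated = alpha * real_imgs + (1 - alpha) * fake_imgs

    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        pred = critic(interpolated)

    grads = tape.gradient(pred, interpolated)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]))
    # Penalize the squared distance between the gradient norm and 1
    return tf.reduce_mean(tf.square(norm - 1.0))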

Batch normalization should not be used in the critic of a WGAN-GP, since batch normalization creates correlation between images in the same batch, which makes the per-sample gradient penalty less effective.

Summary

If we compare WGAN-GP outputs to VAE outputs, we can see that the GAN images are generally sharper. This is true in general - VAEs tend to produce softer images that blur color boundaries, whereas GANs are known to produce more well-defined images.

It is also true that GANs are generally more difficult to train than VAEs and take longer to reach a satisfactory quality. However, most of the state-of-the-art generative models today are GAN-based, as the rewards for training large-scale GANs on GPUs over a longer period of time are significant.