Generative Deep Learning - Paint and Write
Chapters 5 and 6 of Generative Deep Learning go into style transfer for images - transforming a base image to have the style of a style image - and text generation using RNNs and LSTM networks.
Chapter 5: Paint
In the field of style transfer, our aim is to build a model that can transform an input base image in order to give the impression that it comes from the same collection as a given set of style images. With style transfer, our aim isn't to model the underlying distribution of the style images, but instead to extract only the stylistic components from these images and embed them into the base image.
This chapter teaches you to build two different kinds of style transfer models: CycleGAN and Neural Style Transfer.
CycleGAN
The CycleGAN paper represented a significant step forward in the field of style transfer, as it showed how it was possible to train a model to copy the style of a set of reference images without a training set of paired examples. It was released only a few months after the pix2pix paper and shows how it is possible to train a model to tackle problems where we do not have pairs of images in the source and target domains.
While pix2pix only works in one direction, CycleGAN trains the model in both directions simultaneously, so that the model learns to translate images from target to source as well as source to target.
Your First CycleGAN
A CycleGAN is composed of four models: two generators and two discriminators. The first generator, g_AB, converts images from domain A into domain B. The second generator, g_BA, converts images from domain B into domain A. Since we do not have paired images on which to train our generators, we also need to train two discriminators that determine whether the images produced by the generators are convincing. The first discriminator, d_A, is trained to identify the difference between real images from domain A and fake images that have been produced by generator g_BA. Conversely, discriminator d_B is trained to identify the difference between real images from domain B and fake images that have been produced by generator g_AB. Together, these four models form the complete CycleGAN.
CycleGAN generators typically take one of two forms: U-Net or ResNet (residual network).
The Generators (U-Net)
U-Net consists of two halves: a downsampling half, where input images are compressed spatially but expanded channel-wise, and an upsampling half, where representations are expanded spatially while the number of channels is reduced.
Unlike in a VAE, a U-Net also has skip connections between equivalently shaped layers in the downsampling and upsampling halves of the network. A VAE is linear: data flows through the network from input to output, one layer after another. A U-Net is different, because it contains skip connections that allow information to shortcut parts of the network and flow through to later layers.
The intuition here is that with each subsequent layer in the downsampling part of the network, the model increasingly captures the what of the image and loses information about the where. At the apex of the U, the feature maps will have a contextual understanding of what is in the image, with little understanding of where it is located. The skip connections allow the network to blend high-level abstract information captured during the downsampling process (i.e., the image style) with the specific spatial information that is being fed back from previous layers in the network (the image content).
Concatenate Layer
The Concatenate layer simply joins a set of layers together along a particular axis (by default, the last axis). In Keras, join two layers x and y with Concatenate()([x, y]). In the case of U-Net, we use Concatenate layers to connect each upsampling layer to the equivalently sized layer in the downsampling part of the network.
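A minimal Keras sketch of how a U-Net upsampling block might use Concatenate to pull in the matching tensor from the downsampling path (the filter counts and kernel sizes here are illustrative assumptions, not the book's exact code):

```python
from tensorflow.keras import layers

def upsample_block(x, skip, filters):
    # Double the spatial resolution, then merge with the equivalently
    # sized tensor (`skip`) from the downsampling half of the network.
    x = layers.UpSampling2D(size=2)(x)
    x = layers.Conv2D(filters, kernel_size=4, padding='same', activation='relu')(x)
    x = layers.Concatenate()([x, skip])   # skip connection: channels are stacked
    return x
```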
Instance Normalization Layer
The generator of this CycleGAN uses InstanceNormalization layers rather than BatchNormalization layers; in style transfer problems this tends to lead to more satisfying results.
An InstanceNormalization layer normalizes every single observation individually rather than as a batch. Unlike a BatchNormalization layer, it doesn't require mu and sigma parameters to be calculated as a running average during training, since at test time the layer can normalize per instance in the same way as it does at train time. The mean and standard deviation used to normalize each layer are calculated per channel and per observation.
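To make the statistics concrete, here is a small NumPy sketch of the axes instance normalization averages over (in Keras itself, an InstanceNormalization layer is typically imported from an add-on package such as tensorflow_addons):

```python
import numpy as np

x = np.random.rand(8, 64, 64, 3)   # (batch, height, width, channels)

# Instance normalization: statistics per observation AND per channel,
# computed over the spatial dimensions only.
mu = x.mean(axis=(1, 2), keepdims=True)      # shape (8, 1, 1, 3)
sigma = x.std(axis=(1, 2), keepdims=True)
x_instance_norm = (x - mu) / (sigma + 1e-5)

# Batch normalization (at train time) would instead pool statistics across
# the whole batch: x.mean(axis=(0, 1, 2), keepdims=True).
```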
The Discriminators
The discriminators in the CycleGAN that we will be building output an 8 x 8 single-channel tensor rather than a single number. The reason for this is that the CycleGAN discriminator (inherited from PatchGAN) divides the image into square overlapping "patches" and guesses whether each patch is real or fake, rather than predicting for the image as a whole. Therefore the output of the discriminator is a tensor that contains the predicted probability for each patch, rather than just a single number.
The benefit of using a PatchGAN discriminator is that the loss function can then measure how good the discriminator is at distinguishing images based on their style rather than their content.
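A hedged sketch of what such a patch-based discriminator could look like in Keras; the 128 x 128 input size and filter counts are assumptions chosen so that four stride-2 convolutions produce an 8 x 8 x 1 grid of patch predictions:

```python
from tensorflow.keras import layers, models

def build_discriminator(img_shape=(128, 128, 3)):
    # Each stride-2 convolution halves the spatial size, so four of them
    # take a 128x128 input down to an 8x8 grid of patch scores.
    inp = layers.Input(shape=img_shape)
    x = inp
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, kernel_size=4, strides=2, padding='same')(x)
        x = layers.LeakyReLU(0.2)(x)
    out = layers.Conv2D(1, kernel_size=4, strides=1, padding='same')(x)   # 8 x 8 x 1
    return models.Model(inp, out)
```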
Compiling the CycleGAN
We judge the generators simultaneously on three criteria (a sketch of the combined, compiled model follows this list):
- Validity: Do the images produced by each generator fool the relevant discriminator? (does output from g_BA fool d_A and does output from g_AB fool d_B)
- Reconstruction: If we apply the two generators one after the other (in opposite directions), do we return to the original image? The CycleGAN gets its name from this cyclic reconstruction criterion.
- Identity: If we apply each generator to images from its own target domain, does the image remain unchanged?
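The sketch below shows one way the combined model could be wired and compiled against these three criteria. The tiny placeholder generators and discriminators, the loss choices (mse for validity, mae for reconstruction and identity), and the loss weights are illustrative assumptions rather than the book's exact implementation:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

IMG_SHAPE = (128, 128, 3)   # assumed image size

def tiny_generator():
    inp = layers.Input(shape=IMG_SHAPE)
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(inp)
    out = layers.Conv2D(3, 3, padding='same', activation='tanh')(x)
    return models.Model(inp, out)

def tiny_discriminator():
    inp = layers.Input(shape=IMG_SHAPE)
    x = layers.Conv2D(32, 4, strides=2, padding='same', activation='relu')(inp)
    out = layers.Conv2D(1, 4, padding='same')(x)
    return models.Model(inp, out)

g_AB, g_BA = tiny_generator(), tiny_generator()
d_A, d_B = tiny_discriminator(), tiny_discriminator()
d_A.trainable = False   # the discriminators are trained separately,
d_B.trainable = False   # so freeze them inside the combined model

img_A = layers.Input(shape=IMG_SHAPE)
img_B = layers.Input(shape=IMG_SHAPE)
fake_B, fake_A = g_AB(img_A), g_BA(img_B)

combined = models.Model(
    inputs=[img_A, img_B],
    outputs=[d_A(fake_A), d_B(fake_B),       # validity
             g_BA(fake_B), g_AB(fake_A),     # cyclic reconstruction
             g_BA(img_A), g_AB(img_B)])      # identity
combined.compile(
    loss=['mse', 'mse', 'mae', 'mae', 'mae', 'mae'],
    loss_weights=[1, 1, 10, 10, 5, 5],       # weights are illustrative
    optimizer=Adam(0.0002, 0.5))
```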
The Generators (ResNet)
The ResNet architecture is similar to a U-Net in that it allows information from previous layers in the network to skip ahead one or more layers. However, rather than creating a U shape by connecting layers from the downsampling part of the network to corresponding upsampling layers, a ResNet is built of residual blocks stacked on top of each other, where each block contains a skip connection that sums the input and output of the block before passing this on to the next layer.
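A minimal sketch of a single residual block in Keras; the filter count is an assumption, and the key point is the Add layer that sums the block's input and output:

```python
from tensorflow.keras import layers

def residual_block(x, filters=256):
    # Two convolutions, then add the block's input back onto its output.
    # Assumes the incoming tensor x already has `filters` channels.
    shortcut = x
    y = layers.Conv2D(filters, kernel_size=3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, kernel_size=3, padding='same')(y)
    return layers.Add()([shortcut, y])   # skip connection: sum, not concatenation
```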
On either side of the residual blocks, the ResNet generator also contains downsampling and upsampling layers.
It has been shown that ResNet architectures can be trained hundreds or even thousands of layers deep without suffering from the vanishing gradient problem, where the gradients at early layers are tiny and therefore train very slowly. This is because the error gradients can backpropagate freely through the network via the skip connections that are part of the residual blocks.
In the original CycleGAN paper, the model was trained for 200 epochs to achieve state-of-the-art results for artist-to-photograph style transfer.
Neural Style Transfer
Neural style transfer: we don't have a training set at all, but instead wish to transfer the style of one single image onto another. The idea works on the premise that we want to minimize a loss function that is a weighted sum of three distinct parts:
- content loss: we would like the combined image to contain the same content as the base image
- style loss: we would like the combined image to have the same general style as the style image
- total variation loss: we would like the combined image to appear smooth rather than pixelated
We minimize the loss via gradient descent - that is, we update each pixel value by an amount proportional to the negative gradient of the loss function, over many iterations.
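A rough sketch of that optimization loop in TensorFlow, using a random stand-in for the base image and a hypothetical placeholder total_loss (in practice it would be the weighted sum of the three terms described in the following subsections):

```python
import tensorflow as tf

base_image = tf.random.uniform((1, 256, 256, 3))   # stand-in for the real base image
combined = tf.Variable(base_image)                 # the pixels we will optimize
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

def total_loss(img):
    # Placeholder: the real loss is a weighted sum of the content, style,
    # and total variation terms defined below.
    return tf.reduce_sum(tf.square(img))

for step in range(100):
    with tf.GradientTape() as tape:
        loss = total_loss(combined)
    grads = tape.gradient(loss, combined)
    optimizer.apply_gradients([(grads, combined)])   # nudge pixels down the loss gradient
```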
Content Loss
The content loss measures how different two images are in terms of the subject matter and overall placement of their content. Two images that contain similar-looking scenes (a photo of a row of buildings and another photo of the same buildings taken in slightly different light from a different angle) should have a smaller loss than two images that contain completely different scenes.
What we need is a deep neural network that has already been trained to identify the content of an image, so that we can tap into a deep layer of the network to extract the high-level features of a given input image. If we measure the mean squared error between this output for the base image and the current combined image, we have our content loss function.
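A sketch of a content loss built on a pretrained VGG19 network; the choice of the block5_conv2 layer follows the common Keras neural style transfer example and is an assumption here:

```python
import tensorflow as tf
from tensorflow.keras.applications import vgg19

# Tap into a deep layer of an ImageNet-trained VGG19 to extract high-level features.
vgg = vgg19.VGG19(weights='imagenet', include_top=False)
content_model = tf.keras.Model(vgg.input, vgg.get_layer('block5_conv2').output)

def content_loss(base, combined):
    # Mean squared error between the deep feature maps of the two images.
    base_features = content_model(vgg19.preprocess_input(base))
    combined_features = content_model(vgg19.preprocess_input(combined))
    return tf.reduce_mean(tf.square(base_features - combined_features))
```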
Style Loss
The solution to measuring style loss given in the paper is based on the idea that images that are similar in style typically have the same pattern of correlation between feature maps in a given layer.
To numerically measure how much two feature maps are jointly activated together, we can flatten them and calculate the dot product. If the resulting value is high, the feature maps are highly correlated; if the value is low, the feature maps are not correlated. We can define a matrix that contains the dot product between all pairs of feature maps in a layer. This is called a Gram matrix.
To calculate the style loss, all we need to do is calculate the Gram matrix for a set of layers throughout the network for both the style image and the combined image and compare their similarity using the sum of squared errors.
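A sketch of the Gram matrix and a corresponding style loss for a single layer's feature maps; the normalization constant is one common convention and varies between implementations:

```python
import tensorflow as tf

def gram_matrix(feature_maps):
    # feature_maps: one layer's output of shape (height, width, channels).
    # Flatten spatially, then take dot products between every pair of channels.
    flat = tf.reshape(feature_maps, (-1, feature_maps.shape[-1]))   # (H*W, C)
    return tf.matmul(flat, flat, transpose_a=True)                  # (C, C)

def style_loss(style_features, combined_features):
    # Sum of squared differences between the two Gram matrices.
    S = gram_matrix(style_features)
    C = gram_matrix(combined_features)
    channels = int(style_features.shape[-1])
    size = int(style_features.shape[0]) * int(style_features.shape[1])
    return tf.reduce_sum(tf.square(S - C)) / (4.0 * (channels ** 2) * (size ** 2))
```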
Total Variation Loss
The total variation loss is a measure of noise in the combined image. To judge how noisy an image is, we can shift it one pixel to the right and calculate the sum of the squared differences between the translated and original images. For balance, we also do the same procedure but shift the image one pixel down. The sum of these two terms is the total variation loss.
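A sketch of that calculation for a batched image tensor of shape (1, height, width, channels):

```python
import tensorflow as tf

def total_variation_loss(img):
    # Compare the image against itself shifted one pixel down and one pixel
    # to the right, and sum the squared differences.
    shift_down = tf.square(img[:, 1:, :-1, :] - img[:, :-1, :-1, :])
    shift_right = tf.square(img[:, :-1, 1:, :] - img[:, :-1, :-1, :])
    return tf.reduce_sum(shift_down + shift_right)
```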
Chapter 6: Write
Exploring methods for building generative models on text data. Differences between text and image data that make using the same processes as before difficult:
- Text data is composed of discrete chunks (words or characters), whereas pixels in an image are points in a continuous color spectrum. We can more easily apply backpropagation to pixel data.
- Text data has a time dimension but no spatial dimension, whereas image data has two spatial dimensions but no time dimension. The order of words is highly important in text data and words wouldn't make sense in reverse, whereas images can usually be flipped without affecting the content. There are often long-term sequential dependencies between words that need to be captured by the model.
- Text data is highly sensitive to small changes in the individual units (words or characters). This makes it very difficult to train a model to generate coherent text, as every word is vital to the overall meaning of the passage.
- Text data has a rules-based grammatical structure, whereas image data doesn't follow set rules about how pixel values should be assigned.
Long Short Term Memory Networks
The long short-term memory (LSTM) network is one of the most utilized and successful deep learning techniques for sequential data such as text.
An LSTM is a type of recurrent neural network (RNN). RNNs contain a recurrent layer (or cell) that is able to handle sequential data by making its own output at a particular timestep form part of the input to the next timestep, so that information from the past can affect the prediction at the current timestep. We say LSTM network to mean a neural network with an LSTM recurrent layer.
LSTMs do not experience the vanishing gradients problem experienced by vanilla RNNs and can be trained on sequences that are hundreds of timesteps long.
LSTM Network
Tokenization
Tokenization is the process of splitting the text up into individual units, such as words or characters. How you tokenize text will depend on what you are trying to achieve with text generation. There are pros and cons to both word and character tokens (a short tokenization sketch in Keras follows this list):
- If you use word tokens:
- All text can be converted to lowercase, to ensure capitalized words at the start of sentences are tokenized the same way as the same words appearing in the middle of a sentence. This may not be wanted in the case of proper nouns.
- The text vocabulary (the set of distinct words in the training set) may be very large, with some words appearing sparsely or only once. It may be wise to replace sparse words with a token for unknown words, rather than including them as separate tokens, to reduce the number of weights the neural network needs to learn.
- Words can be stemmed, meaning they are reduced to their simplest form, so that different tenses of a verb remain tokenized together. For example, browse, browsing, browses, and browsed would all be stemmed to brows.
- Using word tokenization means that the model will never be able to predict words outside the training vocabulary.
- If you use character tokens:
- The model may generate sequences of characters that form new words outside of the training vocabulary - this may be desirable in some contexts, but not in others
- Capital letters can either be converted to their lowercase counterparts, or remain as separate tokens
- The vocabulary is usually much smaller when using character tokenization. This is beneficial for model training speed, as there are fewer weights to learn in the final output layer.
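The sketch promised above, using the Keras Tokenizer utility to compare word-level and character-level tokenization on a toy corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

text = ["The cat sat on the mat.", "The dog sat on the log."]

# Word-level tokenization: lowercasing on, sparse words mapped to an unknown token.
word_tokenizer = Tokenizer(lower=True, oov_token='<unk>')
word_tokenizer.fit_on_texts(text)
print(word_tokenizer.texts_to_sequences(text))   # integer sequences, indexed by frequency

# Character-level tokenization: a much smaller vocabulary.
char_tokenizer = Tokenizer(char_level=True, lower=True)
char_tokenizer.fit_on_texts(text)
print(len(word_tokenizer.word_index), len(char_tokenizer.word_index))
```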
The Embedding Layer
An embedding layer is essentially a lookup table that converts each token into a vector of length embedding_size. The number of weights learned by this layer is therefore equal to the size of the vocabulary multiplied by embedding_size.
The Input layer passes a tensor of integer sequences of shape [batch_size, seq_length] to the Embedding layer, which outputs a tensor of shape [batch_size, seq_length, embedding_size]. This is then passed on to the LSTM layer.
We embed each integer token into a continuous vector because it enables the model to learn a representation for each word that can be updated through backpropagation. Embedding is preferred to one-hot encoding because the embedding itself is trainable, giving the model more flexibility in deciding how to embed each token in order to improve performance.
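A minimal next-word model wiring these pieces together; the vocabulary size, embedding size, and number of LSTM units are assumptions:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000       # assumed vocabulary size
EMBEDDING_SIZE = 100     # assumed embedding dimension
N_UNITS = 256            # assumed number of LSTM units

inputs = layers.Input(shape=(None,))                          # [batch_size, seq_length]
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_SIZE)(inputs)      # [batch, seq, embedding_size]
x = layers.LSTM(N_UNITS)(x)                                   # final hidden state
outputs = layers.Dense(VOCAB_SIZE, activation='softmax')(x)   # next-word distribution
model = models.Model(inputs, outputs)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```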
The LSTM Layer
A recurrent layer has the special property of being able to process sequential input data [x1,…,xn]. It consists of a cell that updates its hidden state, ht, as each element of the sequence xt is passed through it, one timestep at a time. The hidden state is a vector with length equal to the number of units in the cell - it can be thought of as the cell's current understanding of the sequence. At timestep t, the cell uses the previous value of the hidden state ht−1, together with the data from the current timestep xt, to produce an updated hidden state vector ht. This recurrent process continues until the end of the sequence. Once the sequence is finished, the layer outputs the final hidden state of the cell, hn, which is then passed to the next layer of the network.
The fact that the output from the cell is called a hidden state is an unfortunate naming convention - it's not really hidden, and you shouldn't think of it as such. The last hidden state is the overall output of the layer, and you can also access the hidden state at each individual timestep.
The LSTM Cell
The job of the LSTM cell is to output a new hidden state, ht, given its previous hidden state, ht−1, and the current word embedding, xt. The length of ht is equal to the number of units in the LSTM; it has nothing to do with the length of the sequence. There is one cell in an LSTM layer, and it is defined by the number of units it contains.
An LSTM cell maintains a cell state, Ct, which can be thought of as the cell's internal beliefs about the current status of the sequence. This is distinct from the hidden state, ht, which is ultimately output by the cell after the final timestep. The cell state is the same length as the hidden state - the number of units in the cell.
The hidden state is updated in six steps (a NumPy sketch of one timestep follows this list):
- The hidden state of the previous timestep, ht−1, and the current word embedding, xt, are concatenated and passed through the forget gate. This gate is simply a dense layer with weights matrix Wf, bias bf, and a sigmoid activation function. The resulting vector, ft, has a length equal to the number of units in the cell and contains values between 0 and 1 that determine how much of the previous cell state, Ct−1, should be retained.
- The concatenated vector is also passed through an input gate, which, like the forget gate, is a dense layer with weights matrix Wi, bias bi, and a sigmoid activation function. The output from this gate, it, has length equal to the number of units in the cell and contains values between 0 and 1 that determine how much new information will be added to the previous cell state, Ct−1.
- The concatenated vector is passed through a dense layer with weights matrix WC, bias bC , and a tanh activation function to generate a vector C^t that contains the new information that the cell wants to consider keeping. It also has length equal to the number of units in the cell and contains values between -1 and 1.
- ft and Ct−1 are multiplied element-wise and added to the element-wise multiplication of it and C^t . This represents forgetting parts of the previous cell state and then adding new relevant information to produce the updated cell state, Ct .
- The original concatenated vector is also passed through an output gate: a dense layer with weights matrix Wo , bias bo , and a sigmoid activation. The resulting vector, ot , has a length equal to the number of units in the cell and stores values between 0 and 1 that determine how much of the updated cell state, Ct , to output from the cell.
- ot is multiplied element-wise with the updated cell state Ct after a tanh activation has been applied to produce the new hidden state, ht .
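The NumPy sketch referenced above, expressing the six steps as a single timestep update (the weight matrices and biases are assumed to be given, i.e. learned during training):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    v = np.concatenate([h_prev, x_t])      # step 1: concatenate h_{t-1} and x_t
    f_t = sigmoid(W_f @ v + b_f)           # forget gate
    i_t = sigmoid(W_i @ v + b_i)           # step 2: input gate
    C_hat = np.tanh(W_C @ v + b_C)         # step 3: candidate new information
    C_t = f_t * C_prev + i_t * C_hat       # step 4: updated cell state
    o_t = sigmoid(W_o @ v + b_o)           # step 5: output gate
    h_t = o_t * np.tanh(C_t)               # step 6: new hidden state
    return h_t, C_t
```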
Generating New Text
Generate long strings of text using the following process:
- Feed the network with an existing sequence of words and ask it to predict the following word.
- Append this word to the existing sequence and repeat.
The network will output a probability for each word in the vocabulary that we can sample from. Therefore, we can make the text generation stochastic, rather than deterministic. Moreover, we can introduce a temperature parameter into the sampling process to indicate how deterministic we would like the process to be: lower temperature values result in more deterministic sampling. Note that the LSTM network cannot grasp the semantic meaning of the words that it is generating.
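A small sketch of temperature-scaled sampling from the network's output distribution:

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    # Reweight the model's output probabilities, then sample a token index.
    # As temperature -> 0 this approaches argmax (fully deterministic);
    # temperature = 1 leaves the distribution unchanged.
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-9) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(probs), p=probs)
```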
RNN Extensions
Stacked Recurrent Networks
We can stack LSTM layers so that deeper features can be learned from text. To achieve this, set the return_sequences parameter within the first LSTM layer to True. This makes the layer output the hidden state from every timestep, rather than just the final timestep. The second LSTM layer can then use the hidden states from the first layer as its input data.
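A sketch of a two-layer stacked LSTM; the vocabulary size, embedding size, and unit counts are illustrative:

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(None,))
x = layers.Embedding(10000, 100)(inputs)
x = layers.LSTM(256, return_sequences=True)(x)   # hidden state at every timestep
x = layers.LSTM(256)(x)                          # second layer consumes that sequence
outputs = layers.Dense(10000, activation='softmax')(x)
model = models.Model(inputs, outputs)
```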
Gated Recurrent Units
Another commonly used RNN layer is the gated recurrent unit (GRU). The key differences from the LSTM unit are as follows:
- The forget and input gates are replaced by reset and update gates.
- There is no cell state or output gate, only a hidden state that is output from the cell.
The hidden state is updated in four steps (a NumPy sketch of one timestep follows this list):
- The hidden state of the previous timestep, ht−1, and the current word embedding, xt, are concatenated and used to create the reset gate. This gate is a dense layer with weights matrix Wr and a sigmoid activation function. The resulting vector, rt, has a length equal to the number of units in the cell and stores values between 0 and 1 that determine how much of the previous hidden state, ht−1, should be carried forward into the calculation for the new beliefs of the cell.
- The reset gate is applied to the hidden state, ht−1, and the result is concatenated with the current word embedding, xt. This vector is then fed into a dense layer with weights matrix W and a tanh activation function to generate a vector, h^t, that stores the new beliefs of the cell. It has length equal to the number of units in the cell and stores values between -1 and 1.
- The concatenation of the hidden state of the previous timestep, ht−1, and the current word embedding, xt, is also used to create the update gate. This gate is a dense layer with weights matrix Wz and a sigmoid activation. The resulting vector, zt, has length equal to the number of units in the cell and stores values between 0 and 1, which are used to determine how much of the new beliefs, h^t, to blend into the current hidden state, ht−1.
- The new beliefs of the cell, h^t, and the current hidden state, ht−1, are blended in a proportion determined by the update gate, zt, to produce the updated hidden state, ht, that is output from the cell.
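The NumPy sketch referenced above, using one common GRU convention and omitting bias terms for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W):
    v = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ v)                                     # step 1: reset gate
    h_hat = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # step 2: new beliefs
    z_t = sigmoid(W_z @ v)                                     # step 3: update gate
    h_t = (1 - z_t) * h_prev + z_t * h_hat                     # step 4: blend old and new
    return h_t
```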
Bidirectional Cells
A bidirectional layer takes advantage of the fact that a sequence can be processed in the reverse direction by storing two sets of hidden states: one produced as a result of the sequence being processed in the usual forward direction, and another produced when the sequence is processed backwards. This way, the layer can learn from information both preceding and succeeding the given timestep.
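In Keras this is a wrapper placed around any recurrent layer, for example:

```python
from tensorflow.keras import layers

# The wrapper runs the GRU forward and backward over the sequence and, by default,
# concatenates the two sets of hidden states (here 2 x 100 = 200 units per timestep).
bidirectional_layer = layers.Bidirectional(layers.GRU(100, return_sequences=True))
```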
Encoder-Decoder Models
For some tasks, the goal isn't to predict the single next word in the existing sequence; instead we wish to predict a completely different sequence of words that is in some way related to the input sequence. Examples include:
- Language translation
- Question generation
- Text summarization
For these problems, we can use an encoder-decoder model. For sequence data, the encoder-decoder process works as follows:
- The original input sequence is summarized into a single vector by the encoder RNN.
- This vector is used to initialize the decoder RNN.
- The hidden state of the decoder RNN at each timestep is connected to a dense layer that outputs a probability distribution over the vocabulary of words. This way, the decoder can generate a novel sequence of text, having been initialized with a representation of the input data produced by the encoder.
The final hidden state of the encoder can be thought of as a representation of the entire input document. The decoder then transforms this representation into sequential output, such as a translation of the text into another language, or a question relating to the document.
During training, the decoder is fed the true previous word at each timestep (rather than its own prediction), and the output distribution it produces at each timestep is compared against the true next word to calculate the loss. This way of training encoder-decoder networks is known as teacher forcing.
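A sketch of a simple encoder-decoder wired for teacher forcing: at train time the decoder receives the true previous word at each timestep and predicts the next one. The shared vocabulary size, embedding size, and unit counts are assumptions:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # assumed shared vocabulary
EMBED_DIM = 100
UNITS = 256

# Encoder: summarize the input sequence into its final hidden and cell states.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(enc_inputs)
_, state_h, state_c = layers.LSTM(UNITS, return_state=True)(enc_emb)

# Decoder: initialized with the encoder states; fed the true previous word
# at each timestep (teacher forcing) and trained to predict the next word.
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(dec_inputs)
dec_seq = layers.LSTM(UNITS, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
dec_outputs = layers.Dense(VOCAB_SIZE, activation='softmax')(dec_seq)

model = models.Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```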
There are several extensions to encoder-decoder networks that improve the accuracy and generative power of the model. Two of the most widely used are pointer networks and attention mechanisms. Pointer networks give the model the ability to "point" at specific words in the input text to include in the generated question, rather than only relying on the words in the known vocabulary.