Hands On Machine Learning Chapter 16/17 - NLP with RNN and Attention / Autoencoder, GAN, Diffusion

These chapters go over NLP (text generation, classification), introduce the attention mechanism and the transformer model, and review autoencoders, generative adversarial networks, and diffusion models for image generation.

Chapter 16: Natural Language Processing with RNNs and Attention

I have already been over a majority of the concepts presented here when reading Natural Language Processing with Transformers, so I am just going to take notes on things that I want to review.

Attention Mechanisms

The image below shows an encoder-decoder model with an added attention mechanism. Instead of just sending the encoder's final hidden state to the decoder (along with the previous target word at each step, which is still done), we now send all of the encoder's outputs to the decoder as well. Since the decoder cannot deal with all these encoder outputs at once, they need to be aggregated: at each time step, the decoder's memory cell computes a weighted sum of all the encoder outputs. This determines which words it will focus on at this step. The weight $\alpha_{(t,i)}$ is the weight of the $i^{th}$ encoder output at the $t^{th}$ decoder time step. The rest of the decoder works just like earlier: at each time step the memory cell receives the inputs we just discussed, plus the hidden state from the previous time step, and finally it receives the target word from the previous time step.

Encoder-Decoder Network with an Attention Model

The $\alpha_{(t,i)}$ weights are generated by a small neural network called an alignment model (or an attention layer), which is trained jointly with the rest of the encoder-decoder model.

Attention Mechanisms

In short, the attention layer provides a way to focus the attention of the model on part of the inputs. There's another way to think of it: it acts as a differentiable memory retrieval mechanism.

Attention is All You Need: The Original Transformer Architecture

In 2017, Google researchers created an architecture called the transformer, which significantly improved the state of the art in neural machine translation (NMT) without using any recurrent or convolutional layers, just attention mechanisms (plus embedding layers, dense layers, normalization layers, and a few others).

Because the model is not recurrent, it doesn't suffer as much from the vanishing or exploding gradients problems as RNNs, it can be trained in fewer steps, it's easier to parallelize across multiple GPUs, and it can better capture long-range patterns than RNNs.

Transformer Architecture

Going through the image above:

  • Notice that the encoder and decoder contain modules that are stacked $N$ times.
  • The encoder's multi-head attention layer updates each word representation by attending to (paying attention to) all other words in the same sentence. That's where the vague representation of the word "like" becomes a richer and more accurate one, capturing its precise meaning in the given sentence.
  • The decoder's masked multi-head attention layer does the same thing, but when it processes a word, it doesn't attend to words located after it; it's a causal layer.
  • The decoder's upper multi-head attention layer is where the decoder pays attention to the words in the English sentence. This is called cross-attention.
  • The positional encodings are dense vectors that represent the position of each word in the sentence. The $n^{th}$ positional encoding is added to the word embedding of the $n^{th}$ word in each sentence. This is needed because all layers in the transformer architecture ignore word positions: without positional encodings, you could shuffle the input sequences, and it would just shuffle the output sequences in the same way. Obviously, the order of words matters, which is why we need to give positional information to the transformer somehow: adding positional encodings to the word representations is a good way to achieve this.

Positional Encodings

A positional encoding is a dense vector that encodes the position of a word within a sentence: the $i^{th}$ positional encoding is added to the word embedding of the $i^{th}$ word in the sentence. A custom PositionalEncoding layer in TensorFlow:

import numpy as np
import tensorflow as tf

class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_length, embed_size, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        assert embed_size % 2 == 0, "embed_size must be even"
        # p holds the positions, i holds the (even) embedding dimension indices
        p, i = np.meshgrid(np.arange(max_length),
                           2 * np.arange(embed_size // 2))
        pos_emb = np.empty((1, max_length, embed_size))
        # sinusoidal encodings: sine on even dimensions, cosine on odd ones
        pos_emb[0, :, ::2] = np.sin(p / 10_000 ** (i / embed_size)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10_000 ** (i / embed_size)).T
        self.pos_encodings = tf.constant(pos_emb.astype(self.dtype))
        self.supports_masking = True  # let Keras propagate the padding mask

    def call(self, inputs):
        # add the positional encodings to the batch of input embeddings
        batch_max_length = tf.shape(inputs)[1]
        return inputs + self.pos_encodings[:, :batch_max_length]
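
Continuing from the class above, a minimal usage sketch (the vocabulary size, maximum length, embedding size, and dummy token IDs below are placeholder values, not taken from the book):

vocab_size, max_length, embed_size = 10_000, 50, 128    # placeholder sizes
embed_layer = tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True)
pos_layer = PositionalEncoding(max_length, embed_size)

token_ids = tf.constant([[3, 14, 15, 9, 0]])            # one dummy padded sequence
embeddings = embed_layer(token_ids)                     # shape [1, 5, 128]
encoder_inputs = pos_layer(embeddings)                  # same shape, positions added

Because supports_masking is True, the padding mask created by mask_zero=True keeps propagating past this layer.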

Multi-head Attention

Scaled dot-product attention:

$$\text{Attention}(\textbf{Q},\textbf{K},\textbf{V})=\text{softmax}\left( \cfrac{\textbf{Q}\textbf{K}^T}{\sqrt{d_{keys}}} \right)\textbf{V}$$
  • $\textbf{Q}$ is a matrix containing one row per query. Its shape is $[n_{queries}, d_{keys}]$, where $n_{queries}$ is the number of queries and $d_{keys}$ is the number of dimensions of each query and each key.
  • $\textbf{K}$ is a matrix containing one row per key. Its shape is $[n_{keys}, d_{keys}]$, where $n_{keys}$ is the number of keys and values.
  • $\textbf{V}$ is a matrix containing one row per value. Its shape is $[n_{keys}, d_{values}]$, where $d_{values}$ is the number of dimensions of each value.
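
As a sanity check, here is a small TensorFlow sketch of the formula above (the shapes are arbitrary toy values):

import tensorflow as tf

def scaled_dot_product_attention(Q, K, V):
    # Q: [n_queries, d_keys], K: [n_keys, d_keys], V: [n_keys, d_values]
    d_keys = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d_keys)  # [n_queries, n_keys]
    weights = tf.nn.softmax(scores, axis=-1)   # each query's weights sum to 1
    return tf.matmul(weights, V)               # [n_queries, d_values]

Q = tf.random.normal([4, 8])    # 4 queries, d_keys = 8
K = tf.random.normal([6, 8])    # 6 keys
V = tf.random.normal([6, 16])   # 6 values, d_values = 16
outputs = scaled_dot_product_attention(Q, K, V)   # shape [4, 16]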

Multi-Head Attention Layer

As seen in the image above, the multi-head attention layer is a bunch of scaled dot-product attention layers, each preceded by a linear transformation of the values, keys, and queries (a time-distributed dense layer with no activation function). All the outputs are simply concatenated, and they go through a final linear transformation. The multi-head attention layer applies multiple different linear transformations of the values, keys, and queries: this allows the model to apply many different projections of the word representations into different subspaces, each focusing on a subset of the word's characteristics.
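
Keras provides a tf.keras.layers.MultiHeadAttention layer implementing this; a minimal self-attention call might look like the sketch below (the number of heads, key dimension, and tensor shapes are arbitrary toy values):

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
X = tf.random.normal([2, 10, 512])     # [batch, sequence length, embed size]
Z = mha(query=X, value=X, key=X)       # self-attention; output shape [2, 10, 512]
# For the decoder's masked self-attention, recent TF versions accept
# use_causal_mask=True in the call; cross-attention simply uses different
# tensors for the query vs. the keys/values.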

Vision Transformers

Visual attention: a convolutional neural network first processes the image and outputs some feature maps, then a decoder RNN equipped with an attention mechanism generates the caption, one word at a time.

At each decoder time step (i.e. each word), the decoder uses the attention model to focus on just the right part of the image.

Visual Transformer

The idea of the vision transformer (ViT) is simple: just chop the image up into little squares (e.g., 16×16 pixels), and treat the sequence of squares as if it were a sequence of word representations. The squares are first flattened, then these vectors go through a linear layer that transforms them but retains their dimensionality. The resulting sequence of vectors can then be treated just like a sequence of word embeddings: this means adding positional embeddings and passing the result to the transformer. Transformers don't have as many inductive biases as convolutional neural nets, so they need extra data just to learn things that CNNs implicitly assume.
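
A rough sketch of the patching step (the 224×224 image size, 16-pixel patches, and 768-dimensional projection are typical ViT-Base values, used here only for illustration):

import tensorflow as tf

patch_size, embed_size = 16, 768              # note 16*16*3 = 768: dimensionality is retained
images = tf.random.normal([2, 224, 224, 3])   # dummy batch of images

patches = tf.image.extract_patches(
    images=images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID")                                                  # [2, 14, 14, 768]
patches = tf.reshape(patches, [2, -1, patch_size * patch_size * 3])   # [2, 196, 768]
patch_embeddings = tf.keras.layers.Dense(embed_size)(patches)         # linear projection
# From here on: add positional embeddings and feed the sequence to a transformer encoder.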

Chapter 17: Autoencoders, GANs, and Diffusion Models

Autoencoders are artificial neural networks capable of learning dense representations of the input data, called latent representations or codings. Autoencoders can be useful for visualization, dimensionality reduction, unsupervised pretraining of deep neural networks, and as generative models. They work by learning to copy their inputs to their outputs.

Generative adversarial networks (GANs) are neural networks capable of generating data. GANs are now widely used for super resolution (increasing resolution), colorization, powerful image editing, turning simple sketches into photorealistic images, predicting the next frames in a video, augmenting a dataset, and generating other types of data. They are composed of a generator that tries to generate data close to the training data and a discriminator that tries to tell real data from fake data. The generator and discriminator compete against each other during training.

Diffusion models are a recent addition to the generative learning party. A denoising diffusion probabilistic model (DDPM) is trained to remove a tiny bit of noise from an image.

Efficient Data Representations

An autoencoder looks at inputs, converts them to an efficient latent representation, and then spits out something that looks very close to the inputs. An autoencoder is composed of two parts: an encoder (or recognition network) that converts the inputs to latent representations, followed by a decoder (or generative network) that converts the internal representation to the outputs.

Autoencoder

An autoencoder typically looks like an MLP, except that the number of neurons in the output layer must be equal to the number of inputs. The outputs are often called reconstructions because the autoencoder tries to reconstruct the inputs. The cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs. Because the internal representation has a lower dimensionality than the input data, the autoencoder is said to be undercomplete.

Stacked Autoencoders

Autoencoders can have multiple hidden layers; in this case, they are called stacked autoencoders (or deep autoencoders). The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer).

Stacked Autoencoder
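
A minimal Keras sketch of such a symmetric stacked autoencoder for 28×28 images (the layer sizes are illustrative choices, not prescribed anywhere):

import tensorflow as tf

stacked_encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),      # coding layer (the bottleneck)
])
stacked_decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),     # mirrors the encoder
    tf.keras.layers.Dense(28 * 28),
    tf.keras.layers.Reshape([28, 28]),
])
stacked_ae = tf.keras.Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(loss="mse", optimizer="nadam")
# stacked_ae.fit(X_train, X_train, ...)   # note: the inputs are also the targets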

One big advantage of autoencoders is that they can handle large datasets with many instances and many features. If you have a large dataset but most of it is unlabeled, you can first train a stacked autoencoder using all the data, then reuse the lower layers to create a neural network for your actual task and train it using the labeled data.

Convolutional Autoencoders

If you want to build an autoencoder for images, you will build a convolutional autoencoder. The encoder is a regular CNN composed of convolutional layers and pooling layers. It typically reduces the spatial dimensionality (height and width) of the inputs while increasing the depth (number of feature maps). The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions).
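
A sketch of that idea for 28×28 grayscale images (the filter counts are illustrative assumptions):

import tensorflow as tf

conv_encoder = tf.keras.Sequential([
    tf.keras.layers.Reshape([28, 28, 1], input_shape=[28, 28]),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(),   # 28x28 -> 14x14, depth 16
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(),   # 14x14 -> 7x7, depth 32
])
conv_decoder = tf.keras.Sequential([
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same",
                                    activation="relu"),               # 7x7 -> 14x14
    tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same"), # 14x14 -> 28x28
    tf.keras.layers.Reshape([28, 28]),
])
conv_ae = tf.keras.Sequential([conv_encoder, conv_decoder])
conv_ae.compile(loss="mse", optimizer="nadam")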

Denoising Autoencoders

Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs. The noise can also be randomly switched-off inputs, just like in dropout.
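
For example, the stacked encoder above can simply corrupt its inputs with a GaussianNoise layer (or a Dropout layer), while the reconstruction loss is still computed against the clean inputs; a sketch (the noise level and layer sizes are illustrative):

import tensorflow as tf

denoising_encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.GaussianNoise(0.2),              # only active during training
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
])
denoising_decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(28 * 28),
    tf.keras.layers.Reshape([28, 28]),
])
denoising_ae = tf.keras.Sequential([denoising_encoder, denoising_decoder])
denoising_ae.compile(loss="mse", optimizer="nadam")
# denoising_ae.fit(X_train, X_train, ...)   # targets are the clean, noise-free images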

Sparse Autoencoders

Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. Another approach, which often yields better results, is to measure the actual sparsity of the coding layer at each training iteration, and penalize the model when the measured sparsity differs from a target sparsity. Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function. One approach could be simply adding the squared error to the cost function, but in practice a better approach is to use the Kullback-Leibler (KL) divergence.
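
One way to sketch this in Keras is a custom activity regularizer on a sigmoid coding layer, computing the KL divergence between the target sparsity and each neuron's mean activation over the batch (the weight and target values below are illustrative assumptions):

import tensorflow as tf

class KLDivergenceRegularizer(tf.keras.regularizers.Regularizer):
    def __init__(self, weight, target=0.1):
        self.weight = weight        # strength of the sparsity loss
        self.target = target        # desired mean activation per coding neuron

    def __call__(self, inputs):
        # actual sparsity: mean activation of each coding neuron over the batch
        mean_activities = tf.reduce_mean(inputs, axis=0)
        mean_activities = tf.clip_by_value(mean_activities, 1e-7, 1. - 1e-7)
        t = self.target
        # KL divergence between two Bernoulli distributions (target vs. actual)
        kl = (t * tf.math.log(t / mean_activities) +
              (1. - t) * tf.math.log((1. - t) / (1. - mean_activities)))
        return self.weight * tf.reduce_sum(kl)

sparse_coding_layer = tf.keras.layers.Dense(
    300, activation="sigmoid",                      # activations in (0, 1)
    activity_regularizer=KLDivergenceRegularizer(weight=0.05, target=0.1))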

Variational Autoencoders

VAEs are different from all other autoencoders discussed so far in these ways:

  1. They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training (as opposed to denoising autoencoders, which use randomness only during training).
  2. Most importantly, they are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set.

VAEs are easier to train and the sampling process is much faster than with RBMs. Variational autoencoders perform variational Bayesian inference, which is an efficient way of carrying out approximate Bayesian inference. Recall: Bayesian inference means updating a probability distribution based on new data, using equations derived from Bayes' theorem. The original distribution is called the prior, while the updated distribution is called the posterior.

Instead of directly producing a coding for a given input, the encoder produces a mean coding $\mu$ and a standard deviation $\sigma$. The actual coding is then sampled randomly from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. After that, the decoder decodes the sampled coding normally.

VAE

A variational autoencoder tends to produce codings that look as though they were sampled from a simple Gaussian distribution: during training, the cost function pushes the codings to gradually migrate within the coding space (latent space) to end up looking like a cloud of Gaussian points.
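
A sketch of the sampling step (the reparameterization trick: sample standard Gaussian noise, then scale and shift it by the predicted σ and μ so gradients can flow through the encoder). The layer below assumes the encoder outputs the log of the variance rather than σ itself, which is a common choice:

import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs                           # predicted by the encoder
        epsilon = tf.random.normal(tf.shape(log_var))    # standard Gaussian noise
        return mean + tf.exp(log_var / 2) * epsilon      # = mean + sigma * noise

# The cost function also needs a latent loss that pushes the codings toward a
# standard Gaussian, e.g. the KL divergence:
# latent_loss = -0.5 * tf.reduce_sum(
#     1 + log_var - tf.exp(log_var) - tf.square(mean), axis=-1)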

Generative Adversarial Networks

A GAN is composed of two neural networks:

  • Generator
    • Takes a random distribution as input (typically Gaussian) and outputs some data - typically, an image. You can think of the random inputs as the latent representations of the image to be generated. The generator offers the same functionality as a decoder in a variational autoencoder, but it must be trained differently.
  • Discriminator
    • Takes either a fake image from the generator or a real image from the training set as input, and must guess whether the input image is fake or real.

GAN

During training, the generator and the discriminator have opposite goals: the discriminator tries to tell fake images from real images, while the generator tries to produce images that look real enough to trick the discriminator. Each training iteration is divided into two phases:

  • First, train the discriminator. A batch of real images is sampled from the training set and is completed with an equal number of fake images produced by the generator.
  • Second, train the generator. Use it to produce another batch of fake images, and once again the discriminator is used to tell whether the images are fake or real. We want the generator to produce images that the discriminator will (wrongly) believe to be real. The weights of the discriminator are frozen during this step, so backpropagation only affects the weights of the generator (see the training-loop sketch just after this list).
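
A minimal training-loop sketch of these two phases. The generator, discriminator, the gan model that chains them, the dataset, batch_size, and codings_size are all assumed to be defined and compiled elsewhere, with discriminator.trainable set to False before compiling the gan model (so phase 2 only updates the generator):

import tensorflow as tf

def train_gan(gan, dataset, batch_size, codings_size, n_epochs):
    generator, discriminator = gan.layers
    for epoch in range(n_epochs):
        for X_batch in dataset:
            # Phase 1: train the discriminator on half fake, half real images
            noise = tf.random.normal([batch_size, codings_size])
            fake_images = generator(noise)
            X_fake_and_real = tf.concat([fake_images, X_batch], axis=0)
            y1 = tf.constant([[0.]] * batch_size + [[1.]] * batch_size)
            discriminator.train_on_batch(X_fake_and_real, y1)
            # Phase 2: train the generator through the (frozen) discriminator,
            # labeling its fake images as "real" so it learns to fool it
            noise = tf.random.normal([batch_size, codings_size])
            y2 = tf.constant([[1.]] * batch_size)
            gan.train_on_batch(noise, y2)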

The Difficulties of Training GANs

During training, the generator and the discriminator constantly try to outsmart each other, in a zero-sum game. As training advances, the game may end up in a state that theorists call a Nash equilibrium, named after the mathematician John Nash: this is when no player would be better off changing their strategy, assuming the other players do not change theirs. The biggest difficulty of training GANs is called mode collapse: this is when the generator's outputs gradually become less diverse. Moreover, because the generator and the discriminator are constantly pushing against each other, their parameters may end up oscillating and becoming unstable. Training may begin properly, then suddenly diverge for no apparent reason, due to these instabilities. The training of GANs is an active area of research, and the dynamics of GANs are still not perfectly understood.

Deep Convolutional GANs

Main guidelines for producing stable convolutional GANs:

  • Replace any pooling layer with strided convolutions (in the discriminator) and transposed convolutions (in the generator)
  • Use batch normalization in both the generator and the discriminator, except in the generator's output layer and the discriminator's input layer
  • Remove fully connected hidden layers for deeper architectures
  • Use ReLU activation in the generator for all layers except the output layer, which should use tanh
  • Use leaky ReLU activation in the discriminator for all layers
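
A sketch of a generator and discriminator following these guidelines, for 28×28 grayscale images scaled to the range -1 to 1 (the filter counts and kernel sizes are illustrative assumptions):

import tensorflow as tf

codings_size = 100
dcgan_generator = tf.keras.Sequential([
    tf.keras.layers.Dense(7 * 7 * 128, input_shape=[codings_size]),
    tf.keras.layers.Reshape([7, 7, 128]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2DTranspose(64, 5, strides=2, padding="same",
                                    activation="relu"),         # 7x7 -> 14x14
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2DTranspose(1, 5, strides=2, padding="same",
                                    activation="tanh"),         # 14x14 -> 28x28, no BN here
])
dcgan_discriminator = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 5, strides=2, padding="same",
                           activation=tf.keras.layers.LeakyReLU(0.2),
                           input_shape=[28, 28, 1]),             # no BN in the input layer
    tf.keras.layers.Conv2D(128, 5, strides=2, padding="same",
                           activation=tf.keras.layers.LeakyReLU(0.2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])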

Diffusion Models

Denoising diffusion probabilistic models (DDPMs) have been able to beat GANs recently, although they take longer to generate images than VAEs and GANs. How do they work? Start with a picture of a cat, denoted $\textbf{x}_0$, and at each time step $t$ add a little bit of Gaussian noise to the image, with mean 0 and variance $\beta_t$. This noise is independent for each pixel: we call it *isotropic*. You first obtain the image $\textbf{x}_1$, then $\textbf{x}_2$, and so on until the cat is completely hidden by the noise. In short, we're gradually drowning the cat in noise: this is called the *forward process*. We can then train a model that can perform the *reverse process*, going from $\textbf{x}_t$ to $\textbf{x}_{t-1}$. We can then use it to remove a tiny bit of noise from an image, and repeat the operation many times until all the noise is gone. If we train the model on a dataset containing many cat images, then we can give it a picture entirely full of Gaussian noise, and the model will gradually make a brand new cat appear.

Diffusion Model
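
A sketch of one step of the forward process, assuming a variance schedule that gives $\beta_t$ at each step. In the full DDPM formulation the previous image is also rescaled by $\sqrt{1-\beta_t}$ so that the overall pixel variance stays bounded:

import tensorflow as tf

def forward_diffusion_step(x_prev, beta_t):
    # q(x_t | x_{t-1}): slightly rescale the previous image, then add isotropic
    # Gaussian noise with variance beta_t (independent for each pixel)
    noise = tf.random.normal(tf.shape(x_prev))
    return tf.sqrt(1. - beta_t) * x_prev + tf.sqrt(beta_t) * noise

# Iterating this from a real image x_0 gradually drowns it in noise; the model
# is then trained to undo one such step at a time (the reverse process).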

Diffusion models have made tremendous progress recently. A recent paper introduced latent diffusion models, where the diffusion process takes place in latent space, rather than in pixel space. To achieve this, a powerful autoencoder is used to compress each training image into a much smaller latent space, where the diffusion process takes place, then the autoencoder is used to decompress the final latent representation, generating the output image. This considerably speeds up image generation, and reduces training time and cost dramatically.