Generative Deep Learning - Compose, Play, and The Future of Generative Modeling

The last three chapters of Generative Deep Learning cover attention-based encoder-decoder networks for music composition, reinforcement learning with world models, and the future of generative modeling (mainly the Transformer), respectively.

Chapter 7: Compose

This chapter works on getting machines to compose music. Our model must be able to learn from and re-create the sequential structure of music, and must also be able to choose from a discrete set of possibilities for subsequent notes.

Music generation is more difficult than text generation due to its polyphonic nature and the additional dimensions of pitch and rhythm. Generating music note by note is complex because we often do not want all the instruments to change notes simultaneously.

Attention

The kind of encoder-decoder network studied in the Write chapter sometimes struggles to retain all the information required for the decoder to accurately translate the source. We want a model that cares not only about the current hidden state of the network, but also about hidden states from earlier timesteps. The attention mechanism was proposed to solve this problem. Rather than only using the final hidden state of the encoder RNN as the context vector, the attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestep. The attention mechanism is just a set of layers that converts the previous encoder hidden states and the current decoder hidden state into the summation weights that generate the context vector.

Recurrent Layer for Predicting the Next Note

  1. Each hidden state h_j (a vector of length equal to the number of units in the recurrent layer) is passed through an alignment function a to generate a scalar e_j. In this example, the function is a densely connected layer with one output unit and a tanh activation function.
  2. The softmax function is applied to the vector e_1, …, e_n to produce the vector of weights α_1, …, α_n.
  3. Each hidden state vector h_j is multiplied by its respective weight α_j and the results are summed to give the context vector c (thus c has the same length as a hidden state vector). A code sketch of these three steps follows this list.
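
The three steps above can be written out directly. The following is a minimal NumPy sketch, with illustrative names (hidden_states, W_a, b_a) standing in for the recurrent layer outputs and the learned weights of the dense alignment layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(hidden_states, W_a, b_a):
    """hidden_states: (n, units), one hidden state per timestep.
    W_a: (units, 1) and b_a: (1,) are the weights of the dense alignment layer."""
    # Step 1: alignment scores e_j = tanh(h_j . W_a + b_a), one scalar per timestep
    e = np.tanh(hidden_states @ W_a + b_a).squeeze(-1)   # shape (n,)
    # Step 2: softmax over timesteps gives the weights alpha_j
    alpha = softmax(e)                                   # shape (n,)
    # Step 3: weighted sum of the hidden states gives the context vector c
    c = (alpha[:, None] * hidden_states).sum(axis=0)     # shape (units,)
    return c, alpha

# Toy usage: 10 timesteps of a recurrent layer with 256 units
h = np.random.randn(10, 256)
c, alpha = attention_context(h, np.random.randn(256, 1) * 0.01, np.zeros(1))
```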

Attention in Encoder-Decoder Networks

The attention mechanism is a powerful tool that helps the network decide which previous states of the recurrent layer are important for predicting the continuation of a sequence. We may wish to predict a sequence of future notes by using an RNN decoder, rather than building up sequences one note at a time. The attention mechanism works in exactly the same way as we have seen previously, with one alteration: the hidden state of the decoder is also rolled into the mechanism, so that the model is able to decide where to focus its attention not only through the previous encoder hidden states, but also through the current decoder hidden state.

Attention Mechanism in Context of Encoder-Decoder Network

While there are many copies of the attention mechanism within the encoder-decoder network, they all share the same weights, so there is no extra overhead in the number of parameters to be learned. The only change is that now the decoder hidden state is rolled into the attention mechanism (shown by the red lines in the diagram above). This slightly changes the equations to incorporate an extra index (i) that specifies the step of the decoder.
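
As a rough sketch of that change (names here are illustrative, not the book's code), the alignment function now scores each encoder hidden state h_j against the current decoder hidden state s_i:

```python
import numpy as np

def alignment_scores(decoder_state, encoder_states, W_a, v_a):
    """decoder_state s_i: (units,); encoder_states h_1..h_n: (n, units).
    W_a: (2 * units, attn_dim) and v_a: (attn_dim,) are learned weights."""
    n = encoder_states.shape[0]
    s = np.repeat(decoder_state[None, :], n, axis=0)        # tile s_i for each encoder timestep
    concat = np.concatenate([s, encoder_states], axis=-1)   # [s_i ; h_j] for every j
    return np.tanh(concat @ W_a) @ v_a                      # scores e_i1 .. e_in
```

A softmax over these scores and a weighted sum of the encoder hidden states then produce the context vector for decoder step i, exactly as before.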

Chapter 8: Play

In March 2018, David Ha and Jürgen Schmidhuber published their “World Models”
paper. The paper showed how it is possible to train a model that can learn how to perform a particular task through experimentation within its own generative hallucinated dreams, rather than inside the environment itself. It is an excellent example of how generative modeling can be used to solve practical problems, when applied alongside other machine learning techniques such as reinforcement learning.

A key component of the architecture is a generative model that can construct a probability distribution for the next possible state, given the current state and action. Having built up an understanding of the underlying physics of the environment through random movements, the model is then able to train itself from scratch on a new task, entirely within its own internal representation of the environment. This approach led to world-best scores for both of the tasks on which it was tested.

Reinforcement Learning

Reinforcement Learning (RL) is a field of machine learning that aims to train an agent to perform optimally within a given environment, with respect to a particular goal.

Reinforcement learning aims to maximize the long-term reward of an agent in a given environment. It is often described as one of the three major branches of machine learning, alongside supervised learning (predicting using labeled data) and unsupervised learning (learning structure from unlabeled data).

Terminology in reinforcement learning:

  • Environment: The world in which the agent operates. It defines the set of rules that govern the game state update process and reward allocation, given the agent's previous action and current game state.
  • Agent: The entity that takes actions in the environment.
  • Game State: The data that represents a particular situation that the agent may encounter (also just called a state).
  • Action: A feasible move that an agent can make.
  • Reward: The value given back to the agent by the environment after the action has been taken. The agent aims to maximize the long-term sum of its rewards.
  • Episode: One run of an agent in the environment; this is also called a rollout.
  • Timestep: For a discrete event environment, all states, actions, and rewards are subscripted to show their value at timestep t.

Reinforcement Learning Algorithm

The environment is first initialized with a current game state, s_0. At timestep t, the agent receives the current game state s_t and uses this to decide the next best action a_t, which it then performs. Given this action, the environment then calculates the next state s_{t+1} and reward r_{t+1} and passes these back to the agent, for the cycle to begin again. The cycle continues until the end criterion of the episode is met (e.g., a given number of timesteps elapse or the agent wins/loses).
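
Written as code, the loop looks something like the sketch below, assuming a generic env with reset/step methods and an agent with a choose_action method (names are illustrative):

```python
def run_episode(env, agent, max_timesteps=1000):
    """Play one episode and return the total reward collected."""
    state = env.reset()                              # initial game state s_0
    total_reward = 0.0
    for t in range(max_timesteps):
        action = agent.choose_action(state)          # choose a_t given s_t
        state, reward, done = env.step(action)       # environment returns s_{t+1} and r_{t+1}
        total_reward += reward
        if done:                                     # end criterion of the episode met
            break
    return total_reward
```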

Reinforcement learning involves creating an agent that can learn optimal strategies by itself in complex environments through repeated play - this is what we will be using in this chapter to build the agent.

OpenAI Gym

OpenAI Gym is a toolkit for developing reinforcement learning algorithms that is available as a Python library. Contained within the library are several classic learning environments - such as CartPole and Pong - as well as environments that present more complex challenges. All of the environments provide a step method through which you can submit a given action; the environment will return the next state and reward. By repeatedly calling the step method with the actions chosen by the agent, you can play out an episode in the environment. OpenAI Gym also provides graphics that allow you to watch your agent perform in a given environment.
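
For example, a random agent playing CartPole might look like the sketch below (this assumes the classic Gym API, where step returns four values; newer Gym/Gymnasium releases split done into terminated and truncated):

```python
import gym

env = gym.make("CartPole-v1")
state = env.reset()
done, total_reward = False, 0.0

while not done:
    env.render()                                   # optional: watch the episode
    action = env.action_space.sample()             # a random agent; replace with a trained policy
    state, reward, done, info = env.step(action)   # next state and reward come back from the env
    total_reward += reward

print("Episode reward:", total_reward)
env.close()
```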

World Model Architecture

The solution consists of three distinct parts:

  • V: A variational autoencoder
  • M: A recurrent neural network with a mixture density network (MDN-RNN)
  • C: A controller

World Model Architecture Diagram

Variational Autoencoder

The VAE takes a high-dimensional input image and condenses it to a latent random variable that approximately follows a standard multivariate normal distribution, through minimization of the reconstruction error and KL divergence. This part of the architecture produces a latent vector z that represents the current state. This is passed on to the next part of the network, the MDN-RNN.
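
As a rough Keras sketch of the encoder half (the 64×64×3 input, convolution sizes, and 32-dimensional latent space loosely follow the World Models setup, but the exact stack here is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 32

inputs = layers.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 4, strides=2, activation="relu")(inputs)
x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)
x = layers.Conv2D(128, 4, strides=2, activation="relu")(x)
x = layers.Flatten()(x)

mu = layers.Dense(latent_dim)(x)           # mean of the approximate posterior
log_var = layers.Dense(latent_dim)(x)      # log-variance of the approximate posterior

def sample(args):
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps    # reparameterization trick

z = layers.Lambda(sample)([mu, log_var])       # latent vector z passed on to the MDN-RNN
encoder = Model(inputs, [mu, log_var, z])
```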

The MDN-RNN

The forward thinking - the prediction of the next state - is the job of the MDN-RNN, a network that tries to predict the distribution of the next latent state based on the previous latent state and the previous action.

The MDN-RNN is an LSTM layer with 256 hidden units followed by a mixture density network (MDN) output layer that allows for the fact that the next latent state could actually be drawn from any one of several normal distributions.
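
A condensed Keras sketch of that structure (the mixture count, the inclusion of the action in the input, and the flat parameter head are assumptions about the setup rather than the book's exact code):

```python
from tensorflow.keras import layers, Model

latent_dim, action_dim, n_mixtures = 32, 3, 5

# Input at each timestep: current latent vector z_t concatenated with the action a_t
inputs = layers.Input(shape=(None, latent_dim + action_dim))
lstm_out = layers.LSTM(256, return_sequences=True)(inputs)

# MDN head: for every latent dimension, the parameters of a mixture of normals
# (mixture weights, means, log standard deviations) describing z_{t+1}
mdn_params = layers.Dense(3 * n_mixtures * latent_dim)(lstm_out)

mdn_rnn = Model(inputs, mdn_params)
```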

The Controller

The responsibility of choosing an action lies with the controller. The controller is a densely connected neural network, where the input is a concatenation of z (the current latent state, sampled from the distribution encoded by the VAE) and the hidden state of the RNN.
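
In the World Models setup the controller can be extremely small. The sketch below assumes a 32-dimensional z, a 256-dimensional RNN hidden state, and three continuous actions squashed with tanh (dimensions are illustrative):

```python
import numpy as np

z_dim, h_dim, action_dim = 32, 256, 3
W = np.random.randn(action_dim, z_dim + h_dim) * 0.1   # controller weights (to be evolved)
b = np.zeros(action_dim)

def controller_action(z, h):
    """Map the concatenated [z, h] input to an action vector in [-1, 1]."""
    x = np.concatenate([z, h])
    return np.tanh(W @ x + b)
```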

The Controller Architecture

There is no training set of correct actions, as we do not know what the optimal action is for a given state of the environment. This is what distinguishes this as a reinforcement learning problem. We need the agent to discover the optimal values for the weights itself by experimenting within the environment and updating its weights based on the received feedback. Evolutionary strategies are becoming a popular choice for solving reinforcement learning problems, due to their simplicity, efficiency, and scalability. The strategy used below is known as CMA-ES:

CMA-ES

Evolutionary strategies generally adhere to the following process:

  1. Create a population of agents and randomly initialize the parameters to be optimized for each agent.
  2. Loop over the following:
     a. Evaluate each agent in the environment, returning the average reward over multiple episodes.
     b. Breed the agents with the best scores to create new members of the population.
     c. Add randomness to the parameters of the new members.
     d. Update the population pool by adding the newly created agents and removing poorly performing agents.

This is similar to the process through which animals evolve in nature - hence the name evolutionary strategies. "Breeding" in this context simply means combining the existing best-scoring agents such that the next generation is more likely to produce high-quality results, similar to their parents. The randomness added to the population ensures that we do not search too narrow a field.
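
A toy NumPy sketch of this generic evolutionary loop; the fitness function and all names are illustrative stand-ins for "average reward in the environment":

```python
import numpy as np

def evolve(fitness, n_params, pop_size=64, n_generations=100, noise=0.1):
    """fitness(params) -> average reward; higher is better."""
    population = [np.random.randn(n_params) for _ in range(pop_size)]
    for gen in range(n_generations):
        scores = np.array([fitness(p) for p in population])                      # evaluate each agent
        elite = [population[i] for i in np.argsort(scores)[-pop_size // 4:]]     # keep the best quarter
        children = []
        while len(children) < pop_size - len(elite):
            pa, pb = [elite[np.random.randint(len(elite))] for _ in range(2)]
            children.append((pa + pb) / 2 + noise * np.random.randn(n_params))   # breed + add randomness
        population = elite + children                                            # update the population pool
    return max(population, key=fitness)
```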

CMA-ES is just one form of evolutionary strategy. In short, it works by maintaining a normal distribution from which it can sample the parameters of new agents. At each generation, it updates the mean of the distribution to maximize the likelihood of sampling the high-scoring agents from the previous timestep. At the same time, it updates the covariance matrix of the distribution to maximize the likelihood of sampling the high-scoring agents, given the population mean. It can be thought of as a form of naturally arising gradient descent, but with the added benefit that it is derivative-free, meaning that we do not need to calculate or estimate costly gradients.

One of the great benefits of CMA-ES is that it can be easily parallelized. The most time-consuming part of the algorithm is calculating the score for a given set of parameters, since it needs to simulate an agent with those parameters in the environment.
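
In practice one would typically lean on an existing implementation such as the pycma package; the sketch below assumes its ask/tell interface and a user-supplied evaluate function (here a dummy stand-in for "average reward over several episodes"), with candidates scored in parallel across processes:

```python
import multiprocessing as mp

import cma          # the pycma package: pip install cma
import numpy as np

N_PARAMS = 867      # e.g. the controller's parameter count; illustrative

def evaluate(params):
    """Stand-in for the average reward obtained with these controller weights."""
    return -float(np.sum(np.square(params)))   # dummy fitness so the sketch runs end to end

if __name__ == "__main__":
    es = cma.CMAEvolutionStrategy(N_PARAMS * [0.0], 0.5)    # initial mean and step size
    with mp.Pool() as pool:
        for generation in range(10):
            solutions = es.ask()                        # sample candidate parameter sets
            rewards = pool.map(evaluate, solutions)     # the costly step, parallelized across processes
            es.tell(solutions, [-r for r in rewards])   # CMA-ES minimizes, so negate the rewards
```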

Chapter 9: The Future of Generative Modeling

Brief History of Generative Modeling

Sequence modeling has primarily been driven by the invention of the Transformer, an attention-based module that removes the need for recurrent or convolutional neural networks entirely and now powers most state-of-the-art sequential models. Image generation has reached new heights through the development of new GAN-based techniques.

The Transformer

The Transformer was first introduced in the 2017 paper "Attention Is All You Need", where the authors show how it is possible to create powerful neural networks for sequential modeling that do not require complex recurrent or convolutional architectures but instead rely only on attention mechanisms.

Transformer Model Architecture

The Transformer has an encoder-decoder architecture, but it uses stacked attention layers instead of LSTMs. The stacked attention layers encode an input sentence into a sequence of representations. The decoder on the right-hand side of the diagram then uses this encoding to generate output words one at a time, using the previous words as additional input into the model.

Positional Encoding

The words are first passed through an embedding layer to convert each into a vector of length d_model = 512. We also need to encode the position of each word in the sentence. To achieve this, we use the following positional encoding function, which converts the position pos of the word in the sentence into a vector of length d_model.

PE_{pos, 2i} = sin(pos / 10000^{2i / d_model})
PE_{pos, 2i+1} = cos(pos / 10000^{(2i+1) / d_model})

For small i, the wavelength of this function is short and therefore the function value changes rapidly along the position axis. Larger values of i create a longer wavelength, and therefore nearby words are given approximately the same value. Each position thus has its own unique encoding, and since the function can be applied to any value of pos, it can be used to encode any position, no matter the length of the input sequence.

To construct the input into the first encoder layer, the matrix of positional encodings is added to the word embedding matrix. This way, both the meaning and position for each word in the sequence are captured in a single vector of length d_model.
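
A minimal NumPy implementation of the encoding as written above (names are illustrative), added directly to the embedding matrix:

```python
import numpy as np

def positional_encoding(max_pos, d_model=512):
    """Return a (max_pos, d_model) matrix of positional encodings."""
    pe = np.zeros((max_pos, d_model))
    pos = np.arange(max_pos)[:, None]           # positions 0 .. max_pos - 1
    i = np.arange(d_model // 2)                 # index over pairs of dimensions
    pe[:, 0::2] = np.sin(pos / 10000 ** (2 * i / d_model))          # even dimensions
    pe[:, 1::2] = np.cos(pos / 10000 ** ((2 * i + 1) / d_model))    # odd dimensions, exponent as in the formula above
    return pe

# Added to the word embeddings before the first encoder layer
embeddings = np.random.randn(20, 512)           # e.g. a 20-word sentence
encoder_input = embeddings + positional_encoding(20)
```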

Input Embedding and Positional Encoding

Multihead Attention

The tensor then flows through the first of six encoder layers. Each encoder layer consists of several sublayers, starting with the multihead attention layer. The same multihead attention layer is used in both the encoder and the decoder:

Multihead Attention Module Followed by Add and Norm Module

The multihead attention layer requires two inputs: the query input, x^Q, and the key-value input, x^KV. The job of the layer is to learn which positions in the key-value input it should attend to, for every position of the query input. None of the layer's weight matrices depend on the sequence length of the query input (n^Q) or the key-value input (n^KV), so the layer can handle sequences of arbitrary length.

The encoder uses self-attention - that is, the query input and key-value input are the same (the output from the previous layer in the encoder). For example, in the first encoder layer, both inputs are the positionally encoded embedding of the input sequence. In the decoder, the query input comes from the previous layer in the decoder and the key-value input comes from the final output of the encoder.

The first step of the layer is to create three matrices - the query Q, key K, and value V - through multiplication of the inputs with three weight matrices, W^Q, W^K, and W^V, as follows:

Q = x^Q W^Q
K = x^{KV} W^K
V = x^{KV} W^V

Q and K are representations of the query input and key-value input, respectively. We want to measure the similarity of these representations across each position in the query input and key-value input.

We can achieve this by performing a matrix multiplication of Q with K^T and dividing by a scaling factor of √d_k. This is known as scaled dot-product attention. Scaling is important to ensure that the dot product between vectors in Q and K does not grow too large.

We then apply a softmax function to ensure all rows sum to 1. This matrix is of shape n^Q × n^KV and is the equivalent of the attention matrix.

The final step to complete the single attention head is to matrix multiply the attention matrix with the value matrix V. In other words, the head outputs a weighted sum of the value representations V for each position in the query, where the weights are determined by the attention matrix.
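
Putting these steps together, a single attention head in NumPy might look like this (the weight matrices would be learned in practice; here they are random placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x_q, x_kv, W_q, W_k, W_v):
    """x_q: (n_q, d_model), x_kv: (n_kv, d_model); W_*: (d_model, d_k) or (d_model, d_v)."""
    Q, K, V = x_q @ W_q, x_kv @ W_k, x_kv @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # scaled dot-product: shape (n_q, n_kv)
    A = softmax(scores, axis=-1)           # attention matrix, rows sum to 1
    return A @ V                           # weighted sum of value representations: (n_q, d_v)

# Self-attention in the encoder: query and key-value inputs are the same
d_model, d_k = 512, 64
x = np.random.randn(10, d_model)
out = attention_head(x, x, *(np.random.randn(d_model, d_k) * 0.05 for _ in range(3)))
```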

Incorporating multiple attention heads allows each to learn a distinct attention and value mechanism, thereby enriching the output from the multihead attention layer. The output matrices from the multiple heads are concatenated and passed through one final matrix multiplication with a weights matrix W^O. This is then added pointwise to the original query input through a skip connection, and layer normalization is applied to the result.

The final part of the encoder consists of a feed-forward (densely connected) layer applied to each position separately. The weights are shared across positions, but not between layers of the encoder-decoder. The encoder concludes with one final skip connection and normalization layer. The output from the layer is the same shape as the query input - n^Q × d_model - this allows us to stack several encoder layers on top of each other, enabling the model to learn deeper features.

The Decoder

The decoder layers are very similar to the encoder layers, with two key differences:

  1. The initial self-attention layer is masked, so that information from subsequent timesteps cannot be attended to during training. This is achieved by setting the appropriate elements of the input to the softmax to −∞ (see the masking sketch after this list).
  2. The output from the encoder layer is also incorporated into each layer of the decoder, after the initial self-attention mechanism. Here, the query input comes from the previous layer of the decoder and the key-value input comes from the encoder.
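
A minimal sketch of the mask referred to in point 1 above, applied to the matrix of attention scores before the softmax:

```python
import numpy as np

def causal_mask(scores):
    """scores: (n, n) self-attention scores; block attention to future timesteps."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal (future positions)
    return np.where(mask, -np.inf, scores)             # set those scores to -inf before the softmax
```

After the softmax, the masked positions receive zero weight, so each output position only attends to itself and earlier positions.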

Each position in the output from the final decoder layer is fed through one final dense layer with a softmax activation function to give next word probabilities.