Natural Language Processing with PyTorch Chapters 5-8
Read and took notes on Chapters 5 through 8 of Natural Language Processing with PyTorch. These chapters mainly cover word embeddings, recurrent neural networks, LSTMs, and general considerations for sequence-to-sequence models.
Embedding Words and Types
When any feature comes from a finite (or countably infinite) set, it is a discrete type. Representing discrete types (words) as dense vectors is at the core of deep learning's successes in NLP. The terms "representation learning" and "embedding" refer to learning this mapping from a discrete type to a point in the vector space. When the discrete types are words, the dense vector representation is called a word embedding. In learning-based or prediction-based embedding methods, the representations are learned by maximizing an objective for a specific learning task, e.g., predicting a word based on its context.
Why Learn Embeddings?
Low-dimensional learned dense representations have several benefits over the one-hot and count-based vectors you have seen:
- Reducing the dimensionality is computationally efficient.
- Count-based representations result in high dimensional vectors that redundantly encode similar information among many dimensions, and do not share statistical strength.
- Very high dimensions in the input can result in real problems in machine learning and optimization - curse of dimensionality
- Representations learned from task-specific data are optimal for the task at hand.
Embeddings are often used to represent words in a lower-dimensional space than would be needed if a one-hot vector or a count-based representation were used.
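A minimal sketch of a learned embedding lookup in PyTorch; the vocabulary size, embedding dimension, and token indices below are made-up values for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen for illustration only.
vocab_size = 10_000      # number of discrete types (words) in the vocabulary
embedding_dim = 100      # dimension of the dense representation

# nn.Embedding is a learnable lookup table: one dense vector per discrete type.
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# A batch of token indices (e.g., produced by a vocabulary/tokenizer).
token_indices = torch.tensor([[2, 541, 7], [9, 0, 1203]])   # shape: (batch, seq_len)
dense_vectors = embedding(token_indices)                    # shape: (batch, seq_len, 100)
print(dense_vectors.shape)  # torch.Size([2, 3, 100])
```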
Approaches to Learning Word Embeddings
All word embedding methods train with just words (i.e., unlabeled data), but in a supervised fashion. This is possible by constructing auxiliary supervised tasks in which the data is implicitly labeled, with the intuition that a representation optimized to solve the auxiliary task will capture many statistical and linguistic properties of the text corpus and therefore be generally useful. Examples of auxiliary tasks:
- Given a sequence of words, predict the next word. This is also called the language modeling task.
- Given a sequence of words before and after, predict the missing word.
- Given a word, predict words that occur within a window, independent of the position.
Examples include GloVe, Continuous Bag-of-Words (CBOW), Skip-gram, and so on. For most purposes, using pretrained word embeddings and fine-tuning them for the task at hand appears sufficient.
The Practical Use of Pretrained Word Embeddings
Pretrained word embeddings, trained on a large corpus - such as all of Google News, Wikipedia, or the Common Crawl - using one of the methods described earlier, are freely available to download and use.
Loading Embeddings
Typically, the embeddings come in the following format: each line starts with the word/type that is being embedded, followed by a sequence of numbers (the vector representation). The length of this sequence is the dimension of the representation (the embedding dimension). Example:
dog 1.242 0.360 0.573 0.367 0.600 0.189 1.273 ...
cat 0.964 0.610 0.674 0.351 0.413 0.212 1.380 ...
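A minimal sketch of loading a file in this format into a dictionary of NumPy vectors; the filename here is an assumed example (e.g., a GloVe download):

```python
import numpy as np

def load_embeddings(path):
    """Read a whitespace-separated embedding file: a word followed by its vector."""
    word_to_vector = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            vector = np.array(parts[1:], dtype=np.float32)
            word_to_vector[word] = vector
    return word_to_vector

# Hypothetical file name, e.g. a GloVe file such as glove.6B.100d.txt.
embeddings = load_embeddings("glove.6B.100d.txt")
print(embeddings["dog"].shape)  # (100,)
```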
Relationships between Word Embeddings
The core feature of word embeddings is that they encode syntactic and semantic relationships that manifest as regularities in word use. Because word vectors are based purely on co-occurrences, the relationships they encode can be wrong.
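A sketch of probing such a relationship with vector arithmetic and cosine similarity, assuming the `embeddings` dictionary from the loading sketch above and that these words exist in the vocabulary:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Classic analogy probe: king - man + woman should land near queen
# (only approximately, and sometimes incorrectly, since it reflects co-occurrence statistics).
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(cosine(target, embeddings["queen"]))
```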
Sequence Modeling for Natural Language Processing
A sequence is an ordered collection of items. Traditional machine learning assumes data points to be independently and identically distributed (IID), but in many situations, like with language, speech, and time-series data, one data item depends on the items that precede or follow it. Such data is also called sequence data. Understanding sequences is essential to understanding human language. MLPs and CNNs do not adequately model sequences.
In deep learning, modeling sequences involves maintaining hidden "state information", or a hidden state. As each item in the sequence is encountered - for example, as each word in a sentence is seen by the model - the hidden state is updated. The hidden state (usually a vector) encapsulates everything the model has seen of the sequence so far. This hidden state vector, also called a sequence representation, can then be used in many sequence modeling tasks in myriad ways depending on the task we are solving, ranging from classifying sequences to predicting sequences.
The most basic neural network sequence model is the recurrent neural network (RNN).
Introduction to Recurrent Neural Networks
The purpose of recurrent neural networks is to model sequences of tensors. There are several different members of the RNN family, but in this chapter we work with the most basic form, called the Elman RNN. The goal of recurrent networks - both the basic Elman form and the more complicated forms - is to learn a representation of a sequence. The hidden state vector is computed from both the current input vector and the previous hidden state vector.
At each time step, the input vector from the current time step and the hidden state vector from the previous time step are mapped to the hidden state vector of the current time step. The new hidden vector is computed using a hidden-to-hidden weight matrix to map the previous hidden state vector and an input-to-hidden weight matrix to map the input vector. Training this recurrence uses backpropagation through time (BPTT): the network is unrolled across the time steps and gradients are propagated back through the unrolled computation.
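A sketch of this update for a single time step; `nn.RNNCell` implements the Elman form (a tanh over the sum of the input-to-hidden and hidden-to-hidden projections). The sizes are illustrative:

```python
import torch
import torch.nn as nn

input_size, hidden_size, batch_size = 50, 64, 8   # illustrative sizes

# Elman RNN step: h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
rnn_cell = nn.RNNCell(input_size, hidden_size)

x_t = torch.randn(batch_size, input_size)         # input vector at the current time step
h_prev = torch.zeros(batch_size, hidden_size)     # hidden state from the previous time step

h_t = rnn_cell(x_t, h_prev)                       # new hidden state for the current time step
print(h_t.shape)  # torch.Size([8, 64])
```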
The hidden-to-hidden and input-to-hidden weights are shared across the different time steps. The intuition you should take away from this fact is that, during training, these weights will be adjusted so that the RNN is learning how to incorporate incoming information and maintain a state representation summarizing the input seen so far. The RNN does not have any way of knowing which time step it is on. Instead, it is simply learning how to transition from one time step to another and maintain a state representation that will minimize its loss function.
Using the same weights to transform inputs into outputs at every time step is another example of parameter sharing. RNNs use the same parameters to compute outputs at every time step by relying on a hidden state vector to capture the state of the sequence. In this way, the goal of RNNs is to learn sequence invariance by being able to compute any output given the hidden state vector and the input vector. You can think of an RNN sharing parameters across time and a CNN sharing parameters across space.
Because words and sentences can be of different lengths, the RNN or any sequence model should be equipped to handle variable-length sequences. One possible technique is to artificially restrict sequences to a fixed length. Another technique, called masking, can handle variable-length sequences by taking advantage of knowledge of the lengths of the sequences. Masking allows the data to signal when certain inputs should not count toward the gradient or the eventual output.
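A sketch of one way to handle variable-length sequences in PyTorch, using packing so the RNN ignores padded positions; the batch contents and sizes here are illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

embedding_dim, hidden_size = 100, 64
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)

# A padded batch of 3 sequences whose true lengths are 5, 3, and 2.
padded = torch.randn(3, 5, embedding_dim)
lengths = torch.tensor([5, 3, 2])

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, h_n = rnn(packed)
output, _ = pad_packed_sequence(packed_out, batch_first=True)  # back to padded form
print(output.shape)  # torch.Size([3, 5, 64]); padded positions did not affect the hidden states
```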
Intermediate Sequence Modeling for Natural Language Processing
The goal of this chapter is sequence prediction. Sequence prediction tasks require us to label each item of a sequence. Such tasks are common in natural language processing. Some examples include language modeling, in which we predict the next word given a sequence of words at each step; part of speech tagging, in which we predict the grammatical part of speech for each word; named entity recognition, in which we predict whether each word is part of a named entity, such as Person, Location, Product, or Organization. Sometimes, sequence prediction tasks are also referred to as sequence labeling.
Elman recurrent neural networks fail to capture long-range dependencies well and perform poorly in practice.
The Problem with Vanilla RNNs (Elman RNNs)
The vanilla/Elman RNN is well suited to modeling sequences, but it has two issues that make it unsuitable for many tasks: the inability to retain information for long-range predictions, and gradient instability. The first issue is that the hidden state vector is updated at every time step regardless of whether the update makes sense. The second issue is that gradients tend to spiral out of control, toward zero or toward infinity.
There are solutions to deal with these gradient problems in vanilla RNNs, such as the use of rectified linear units (ReLUs), gradient clipping, and careful initialization. But none of the proposed solutions work as reliably as the technique called gating.
Gating as a Solution to Vanilla RNN's Challenges
Gating: if h is the hidden state, x is the input, F is the recurrent computation of the RNN, and λ is the "switch" or "gate" controlling how much the recurrent computation affects the hidden state, then the gated update is:
h_t = h_{t−1} + λ(h_{t−1}, x_t) F(h_{t−1}, x_t)
The function λ is usually a sigmoid function. In the case of the long short-term memory network (LSTM), this basic intuition is extended carefully to incorporate not only conditional updates, but also intentional forgetting of the values in the previous hidden state h_{t−1}. This "forgetting" happens by multiplying the previous hidden state h_{t−1} with another function, μ, that also produces values between 0 and 1 depending on the current input:
h_t = μ(h_{t−1}, x_t) h_{t−1} + λ(h_{t−1}, x_t) F(h_{t−1}, x_t)
Here μ is another gating function. In an actual LSTM, the gating functions are parameterized, leading to a somewhat complex sequence of operations. The LSTM is only one of many gated variants of the RNN; another variant that is becoming increasingly popular is the gated recurrent unit (GRU). The gating mechanism is an effective solution to the problems of vanilla/Elman RNNs: it not only makes the updates controlled, but also keeps the gradient issues in check and makes training relatively easier.
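A simplified sketch of the gated update above, not the full parameterized LSTM; the single tanh candidate and the weight shapes are illustrative assumptions:

```python
import torch

def gated_step(h_prev, x_t, W_f, W_lam, W_mu):
    """One simplified gated update: h_t = mu * h_prev + lam * F(h_prev, x_t).

    W_f, W_lam, W_mu are weight matrices applied to [h_prev; x_t];
    a real LSTM/GRU adds biases and several separately parameterized gates.
    """
    hx = torch.cat([h_prev, x_t], dim=-1)
    F_t = torch.tanh(hx @ W_f)          # candidate recurrent computation F
    lam = torch.sigmoid(hx @ W_lam)     # gate: how much of the new computation to let in
    mu = torch.sigmoid(hx @ W_mu)       # gate: how much of the previous state to keep ("forgetting")
    return mu * h_prev + lam * F_t

hidden, inp = 64, 50
h_prev, x_t = torch.zeros(1, hidden), torch.randn(1, inp)
W_f = torch.randn(hidden + inp, hidden) * 0.1
W_lam = torch.randn(hidden + inp, hidden) * 0.1
W_mu = torch.randn(hidden + inp, hidden) * 0.1
print(gated_step(h_prev, x_t, W_f, W_lam, W_mu).shape)  # torch.Size([1, 64])
```

In practice you would reach for `nn.LSTM` or `nn.GRU` rather than hand-rolling the gates.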
Tips and Tricks for Training Sequence Models
- When possible, use the gated variants: gated architectures simplify training by addressing many of the numerical stability issues of nongated variants
- When possible, prefer GRUs over LSTMs: GRUs provide almost comparable performance to LSTMs and use far fewer parameters and compute resources.
- Use Adam as your optimizer: It is reliable and typically converges faster than the alternatives. This is especially true for sequence models. If for some reason your models are not converging with Adam, switching to SGD might help.
- Gradient clipping: If you notice numerical errors while applying these concepts, instrument your code to plot the values of the gradients during training and clip any outliers (see the sketch after this list).
- Early Stopping: with sequence models, it is easy to overfit. We recommend that you stop the training procedure early, when the evaluation error, as measured on a development set, starts going up.
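A sketch of the gradient-clipping step referenced above, placed between the backward pass and the optimizer step; the toy model, inputs, and max-norm value are placeholders:

```python
import torch
import torch.nn as nn

# Toy setup; in practice `model` is your sequence model and the batch comes from a DataLoader.
model = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

inputs = torch.randn(4, 7, 10)     # (batch, seq_len, features)
targets = torch.randn(4, 7, 20)    # dummy targets for illustration

outputs, _ = model(inputs)
loss = nn.functional.mse_loss(outputs, targets)

optimizer.zero_grad()
loss.backward()
# Gradient clipping: rescale gradients whose overall norm exceeds max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```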
Advanced Sequence Modeling for Natural Language Processing
Sequence-to-Sequence Models, Encoder-Decoder Models, and Conditioned Generation
Sequence-to-sequence models are a special case of a general family of models called encoder-decoder models. An encoder-decoder model is a composition of two models, an "encoder" and a "decoder", that are typically trained jointly. The encoder takes an input and produces an encoding or representation (ϕ) of the input, which is usually a vector. The goal of the encoder is to capture important properties of the input with respect to the task at hand. The goal of the decoder is to take the encoded input and produce a desired output. Sequence-to-sequence models are encoder-decoder models in which the encoder and decoder are sequence models and the inputs and outputs are both sequences, possibly of different lengths.
One way to view encoder-decoder models is as a special case of models called conditioned generation models. In conditioned generation, instead of the input representation ϕ, a general conditioning context c influences the decoder to produce an output. When the conditioning context c comes from an encoder model, conditioned generation is the same as an encoder-decoder model.
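A minimal sketch of the encoder-decoder composition with GRUs; the vocabulary sizes, dimensions, and the choice of conditioning the decoder on the final encoder hidden state are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder produces a representation phi; decoder generates conditioned on it."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        _, phi = self.encoder(self.src_emb(src_tokens))            # phi: final encoder hidden state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), phi)   # decoder conditioned on phi
        return self.out(dec_out)                                   # scores over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 9))    # source sequences of length 9
tgt = torch.randint(0, 1200, (2, 6))    # target sequences of length 6 (a different length)
print(model(src, tgt).shape)            # torch.Size([2, 6, 1200])
```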
Capturing More from a Sequence: Bidirectional Recurrent Models
When modeling a sequence, it is useful to observe not just the words in the past but also the words that appear in the future. The goal of bidirectional recurrent models is to use information from both the past and the future to make better predictions. Any model in the recurrent family, such as Elman RNNs, LSTMs, or GRUs, can be used in such a bidirectional formulation.
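A sketch of turning on the bidirectional formulation in PyTorch; note that the output dimension doubles because the forward and backward hidden states are concatenated (sizes are illustrative):

```python
import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=100, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(2, 12, 100)        # (batch, seq_len, embedding_dim)
output, (h_n, c_n) = birnn(x)
print(output.shape)                # torch.Size([2, 12, 128]): forward and backward states concatenated
print(h_n.shape)                   # torch.Size([2, 2, 64]): one final state per direction
```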
Capturing More from a Sequence: Attention
The phenomenon of attention: focusing on the relevant parts of the input while producing the output. We would like our sequence generation models to attend to different parts of the input and not just the final summary of the entire input. This is called the attention mechanism. To incorporate attention, we consider not only the final hidden state of the encoder, but also the hidden states for each of the intermediate steps. These encoder hidden states are, somewhat uninformatively, called values (or, in some situations, keys). Attention also depends on the previous hidden state of the decoder, called the query. Attention is represented by a vector with the same dimension as the number of values it is attending to. This is called the attention vector, or attention weights, or sometimes alignment. The attention weights are combined with the encoder states ("values") to generate a context vector that is sometimes also known as a glimpse. This context vector becomes the input for the decoder instead of the full sentence encoding. The attention vector for the next time step is updated using a compatibility function.
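A sketch of soft attention over encoder states using dot-product scoring, one simple compatibility function (other scoring choices are possible): the decoder query is compared against the encoder "values", the scores are normalized into attention weights, and the weighted sum gives the context vector (glimpse):

```python
import torch
import torch.nn.functional as F

batch, seq_len, hidden = 2, 7, 64
encoder_states = torch.randn(batch, seq_len, hidden)   # "values": one hidden state per input step
query = torch.randn(batch, hidden)                     # previous decoder hidden state

# Compatibility scores via dot product, then softmax to get attention weights.
scores = torch.bmm(encoder_states, query.unsqueeze(2)).squeeze(2)   # (batch, seq_len)
attention_weights = F.softmax(scores, dim=1)                        # soft attention: values in (0, 1)

# Context vector ("glimpse"): attention-weighted sum of encoder states.
context = torch.bmm(attention_weights.unsqueeze(1), encoder_states).squeeze(1)   # (batch, hidden)
print(attention_weights.shape, context.shape)   # torch.Size([2, 7]) torch.Size([2, 64])
```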
There are several ways to implement attention; the most commonly used is the content-aware mechanism. Another popular one is location-aware attention, which depends on the query vector and the key. The attention weights are typically floating-point values between 0 and 1; this is called soft attention. In contrast, it is possible to learn a binary 0/1 vector for attention; this is called hard attention. The attention mechanism described here depends on the encoder states for all the time steps in the input; this is also known as global attention. In contrast, for local attention, you could devise an attention mechanism that depends only on a window of the input around the current time step.
Sometimes, the alignment information could be explicitly provided as part of the training data. In such situations, a supervised attention mechanism could be devised to learn the attention function using a separate neural network that is jointly trained. For large inputs, you can use a hierarchical attention mechanism.
The work on transformer models introduced multi-headed attention, in which multiple attention vectors are used to track different regions of the input. Transformers also popularized the concept of self-attention, a mechanism whereby the model learns which regions of the input influence one another. When the input is multimodal, it is possible to design a multimodal attention mechanism.
There are two kinds of metrics for automatic evaluation of generated sequences: n-gram overlap-based metrics and perplexity. N-gram overlap-based metrics measure how close an output is to a reference (an expected output) by computing a score using n-gram overlap statistics; BLEU and ROUGE are examples.
Perplexity is the other automatic evaluation metric, based on information theory, and you can apply it in any situation in which you can measure the probability of the output sequence. For a sequence x of n tokens, if P(x) is the probability of the sequence, perplexity is defined as Perplexity(x) = P(x)^(−1/n), i.e., the exponential of the average negative log-likelihood per token.
This gives us a simple way to compare different sequence generation models: measure the perplexity of each model on a held-out dataset. Although perplexity is easy to compute, it has many problems when used to evaluate sequence generation.
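A sketch of computing perplexity from per-token model scores on a held-out sequence; the random logits and targets below are placeholders standing in for real model output:

```python
import torch
import torch.nn.functional as F

# Placeholder scores for a held-out sequence of 20 tokens over a 1,000-word vocabulary.
logits = torch.randn(20, 1000)
targets = torch.randint(0, 1000, (20,))

# Average negative log-likelihood per token, then exponentiate.
nll = F.cross_entropy(logits, targets)   # mean over the 20 time steps
perplexity = torch.exp(nll)
print(perplexity.item())
```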