Understanding LSTM Networks

I am reading this paper because it is part of the list of roughly 30 papers that Ilya Sutskever recommended to John Carmack as covering what really matters for machine learning / AI today. This blog post, by Christopher Olah, is about better understanding LSTM networks.

Reference: Colah’s blog post


0.1 Notes

Humans don’t start their thinking from scratch every second; thoughts have persistence. Traditional neural networks can’t do this, and it seems to be a major shortcoming. Recurrent neural networks address this issue: they are networks with loops in them, allowing information to persist.

[Figure: a recurrent neural network contains a loop]

In the above diagram, a chunk of neural network, A, looks at some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor:

[Figure: the same recurrent network unrolled into a chain of copies, one per time step]
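As a quick sketch of what this unrolling means in code (my own illustration, not from the post; the plain tanh step is an assumption), the same cell is simply applied at every position in the sequence, carrying its hidden state forward:

    import numpy as np

    def rnn_step(h_prev, x_t, W_h, W_x, b):
        # One step of a plain RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b)
        return np.tanh(W_h @ h_prev + W_x @ x_t + b)

    rng = np.random.default_rng(0)
    hidden, inp = 4, 3
    W_h = rng.normal(size=(hidden, hidden))
    W_x = rng.normal(size=(hidden, inp))
    b = np.zeros(hidden)

    h = np.zeros(hidden)                     # initial hidden state
    for x_t in rng.normal(size=(6, inp)):    # a sequence of 6 inputs
        h = rnn_step(h, x_t, W_h, W_x, b)    # the same cell A reused at every step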

The chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists; they are the natural neural network architecture to use for sequence data. Essential to the success of RNNs is the use of Long Short Term Memory networks, or LSTMs.

One of the appeals of RNNs is that they might be able to connect previous information to the present task. Sometimes, we only need to look at recent information to perform the present task, and simple RNNs can learn to use such recent information from the past. But there are also cases where more context is needed, and the gap between the relevant information and the point where it is needed is large. As that gap grows, simple RNNs become unable to connect the information:

[Figure: when the gap between the relevant information and where it is needed grows large, standard RNNs fail to learn the connection]

Long Short Term Memory networks - usually just called “LSTMs” - are a special kind of RNN, capable of learning long-term dependencies. They were introduced in 1997, work well on a variety of problems, and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn.

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer. LSTMs also have this chain-like structure, but the repeating module is different. Instead of having a single neural network layer, there are four, interacting in a very special way:

[Figure: the repeating module in an LSTM contains four interacting layers, compared with the single tanh layer of a standard RNN]
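For reference, the entire repeating module of a standard RNN computes only

h_t = \tanh(W \cdot [h_{t-1}, x_t] + b)

whereas the LSTM module spreads its work across the four learned layers described next.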

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state runs through the entire chain, with only some minor interactions. It’s very easy for information to just flow along unchanged.

[Figure: the cell state, the horizontal line running along the top of the LSTM module]

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

[Figure: a gate - a sigmoid layer followed by a pointwise multiplication]

The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 0 means “let nothing through”, while 1 means “let everything through”. An LSTM has three of these gates, to protect and control the cell state.
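As a small worked example (my own, not from the post): the sigmoid \sigma(z) = 1 / (1 + e^{-z}) squashes any input into (0, 1). If a gate outputs 0.25 for some component and the corresponding cell-state value is 0.8, only 0.25 * 0.8 = 0.2 passes through; a gate value of 0 would zero the component out, and a value of 1 would pass it unchanged.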

0.2 Step-by-Step LSTM Walk Through

The first step in the LSTM is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer”. It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents “completely keep this”, while a 0 represents “completely get rid of this”.

[Figure: the forget gate layer]
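In equation form, the forget gate is the standard

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)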

The next step is to decide what new information we are going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

[Figure: the input gate layer and the tanh layer that proposes candidate values]
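In equation form, the input gate and candidate values are:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)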

It is now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps already decided what to do; now it just needs to be done. The old state is multiplied by f_t, forgetting the things we decided to forget earlier. Then we add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state value.

[Figure: updating the old cell state into the new cell state]
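Written out, the cell state update is:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t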

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version of it. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

[Figure: the output gate and the final hidden state]
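In equation form, the output step is:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t * \tanh(C_t)

Putting the four steps together, here is a minimal NumPy sketch of a single LSTM step (my own illustration, not code from the post; the weight shapes and parameter names are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, params):
        # One LSTM step following the equations above.
        W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
        z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)          # forget gate
        i_t = sigmoid(W_i @ z + b_i)          # input gate
        c_tilde = np.tanh(W_c @ z + b_c)      # candidate values
        c_t = f_t * c_prev + i_t * c_tilde    # new cell state
        o_t = sigmoid(W_o @ z + b_o)          # output gate
        h_t = o_t * np.tanh(c_t)              # new hidden state
        return h_t, c_t

    # Toy usage with random weights (hidden size 4, input size 3).
    rng = np.random.default_rng(0)
    hidden, inp = 4, 3
    params = []
    for _ in range(4):
        params += [rng.normal(size=(hidden, hidden + inp)), np.zeros(hidden)]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in rng.normal(size=(5, inp)):       # a short input sequence
        h, c = lstm_step(x, h, c, params)
    print(h)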

0.3 Variants on Long Short Term Memory

Not all LSTMs are the same as above; it seems like almost every paper involving LSTMs uses some variant. One popular variant adds “peephole connections”, which means that the gate layers are allowed to look at the cell state. The diagram below adds peepholes to all the gates, but many papers give some gates peepholes and not others.

[Figure: an LSTM with peephole connections on all three gates]
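With peepholes on every gate, the gate equations become (standard formulation; note the output gate looks at the new cell state C_t):

f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)
o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)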

Another variant is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

[Figure: an LSTM with coupled forget and input gates]
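In the coupled variant, the separate input gate disappears and the cell state update becomes:

C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t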

A more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU. It combines the forget and input gates into a single “update gate”. It also merges the cell state and the hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

[Figure: a Gated Recurrent Unit (GRU)]
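The GRU’s update equations (the standard formulation) are:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t

where z_t is the update gate and r_t is the reset gate.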

A paper comparing popular LSTM variants found them all to be about the same. Is there another big step (in what we can accomplish with RNNs) after LSTMs? Most researchers would say “yes”, and that step is attention. The idea is to let every step of an RNN pick information to look at from some larger collection of information.
