Understanding LSTM Networks

I am reading this paper because it is part of the list of roughly 30 papers that Ilya Sutskever recommended to John Carmack as covering what really matters for machine learning / AI today. This blog post, by Christopher Olah, is about better understanding LSTM networks.

Reference: Colah’s blog post


0.1 Notes

Humans don’t start their thinking from scratch every second; thoughts have persistence. Traditional neural networks can’t do this, and it seems to be a major shortcoming. Recurrent neural networks address this issue: they are networks with loops in them, allowing information to persist.

[Figure: a recurrent neural network contains a loop]

In the above diagram, a chunk of neural network, A, looks at some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor:

[Figure: the same recurrent network unrolled into a chain of copies, one per time step]
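As a quick sketch of what this unrolling means in code (my own illustration, not from the post; the plain tanh step is an assumption), the same cell is simply applied at every position in the sequence, carrying its hidden state forward:

    import numpy as np

    def rnn_step(h_prev, x_t, W_h, W_x, b):
        # One step of a plain RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b)
        return np.tanh(W_h @ h_prev + W_x @ x_t + b)

    rng = np.random.default_rng(0)
    hidden, inp = 4, 3
    W_h = rng.normal(size=(hidden, hidden))
    W_x = rng.normal(size=(hidden, inp))
    b = np.zeros(hidden)

    h = np.zeros(hidden)                     # initial hidden state
    for x_t in rng.normal(size=(6, inp)):    # a sequence of 6 inputs
        h = rnn_step(h, x_t, W_h, W_x, b)    # the same cell A reused at every step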

The chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists; they are the natural neural network architecture to use for sequence data. Essential to the success of RNNs is the use of Long Short Term Memory networks, or LSTMs.

One of the appeals of RNNs is that they might be able to connect previous information to the present task. Sometimes, we only need to look at recent information to perform the present task, and simple RNNs can learn to use such recent information from the past. But there are also cases where more context is needed, and the gap between the relevant information and the point where it is needed is large. As that gap grows, simple RNNs become unable to connect the information:

[Figure: when the gap between the relevant information and where it is needed grows large, standard RNNs fail to learn the connection]

Long Short Term Memory networks - usually just called “LSTMs” - are a special kind of RNN, capable of learning long-term dependencies. They were introduced in 1997, work well on a variety of problems, and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn.

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer. LSTMs also have this chain-like structure, but the repeating module is different. Instead of having a single neural network layer, there are four, interacting in a very special way:

[Figure: the repeating module in an LSTM contains four interacting layers, compared with the single tanh layer of a standard RNN]
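For reference, the entire repeating module of a standard RNN computes only

h_t = \tanh(W \cdot [h_{t-1}, x_t] + b)

whereas the LSTM module spreads its work across the four learned layers described next.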

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state runs through the entire chain, with only some minor interactions. It’s very easy for information to just flow along unchanged.

[Figure: the cell state, the horizontal line running along the top of the LSTM module]

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

[Figure: a gate - a sigmoid layer followed by a pointwise multiplication]

The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 0 means “let nothing through”, while 1 means “let everything through”. An LSTM has three of these gates, to protect and control the cell state.
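As a small worked example (my own, not from the post): the sigmoid \sigma(z) = 1 / (1 + e^{-z}) squashes any input into (0, 1). If a gate outputs 0.25 for some component and the corresponding cell-state value is 0.8, only 0.25 * 0.8 = 0.2 passes through; a gate value of 0 would zero the component out, and a value of 1 would pass it unchanged.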

0.2 Step-by-Step LSTM Walk Through

The first step in the LSTM is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer”. It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents “completely keep this”, while a 0 represents “completely get rid of this”.

[Figure: the forget gate layer]
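In equation form, the forget gate is the standard

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)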

The next step is to decide what new information we are going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

[Figure: the input gate layer and the tanh layer that proposes candidate values]
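In equation form, the input gate and candidate values are:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)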

It is now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps already decided what to do; now it just needs to be done. The old state is multiplied by f_t, forgetting the things we decided to forget earlier. Then we add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state value.

[Figure: updating the old cell state into the new cell state]
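Written out, the cell state update is:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t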

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version of it. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

[Figure: the output gate and the final hidden state]
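In equation form, the output step is:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t * \tanh(C_t)

Putting the four steps together, here is a minimal NumPy sketch of a single LSTM step (my own illustration, not code from the post; the weight shapes and parameter names are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, params):
        # One LSTM step following the equations above.
        W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
        z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)          # forget gate
        i_t = sigmoid(W_i @ z + b_i)          # input gate
        c_tilde = np.tanh(W_c @ z + b_c)      # candidate values
        c_t = f_t * c_prev + i_t * c_tilde    # new cell state
        o_t = sigmoid(W_o @ z + b_o)          # output gate
        h_t = o_t * np.tanh(c_t)              # new hidden state
        return h_t, c_t

    # Toy usage with random weights (hidden size 4, input size 3).
    rng = np.random.default_rng(0)
    hidden, inp = 4, 3
    params = []
    for _ in range(4):
        params += [rng.normal(size=(hidden, hidden + inp)), np.zeros(hidden)]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in rng.normal(size=(5, inp)):       # a short input sequence
        h, c = lstm_step(x, h, c, params)
    print(h)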

0.3 Variants on Long Short Term Memory

Not all LSTMs are the same as above; it seems like almost every paper involving LSTMs uses some variant. One popular variant adds “peephole connections”, which means that the gate layers are allowed to look at the cell state. The diagram below adds peepholes to all the gates, but many papers give some gates peepholes and not others.

[Figure: an LSTM with peephole connections on all three gates]
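With peepholes on every gate, the gate equations become (standard formulation; note the output gate looks at the new cell state C_t):

f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)
o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)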

Another variant is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

[Figure: an LSTM with coupled forget and input gates]
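In the coupled variant, the separate input gate disappears and the cell state update becomes:

C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t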

A more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU. It combines the forget and input gates into a single “update gate”. It also merges the cell state and the hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

[Figure: a Gated Recurrent Unit (GRU)]
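The GRU’s update equations (the standard formulation) are:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t

where z_t is the update gate and r_t is the reset gate.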

A paper comparing popular LSTM variants found them all to be about the same. Is there another big step (in what we can accomplish with RNNs) after LSTMs? Most researchers would say “yes”, and that step is attention. The idea is to let every step of an RNN pick information to look at from some larger collection of information.
