Recurrent Neural Network Regularization

I am reading this paper because it is one of the roughly 30 papers that Ilya Sutskever recommended to John Carmack as covering what really matters for machine learning / AI today. This paper shows how to apply the dropout regularization technique to LSTMs, and it shows that this application of dropout substantially reduces overfitting on a variety of tasks.

Reference Link to PDF of Conference Paper


This paper presents a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Regular dropout does not work well with RNNs and LSTMs. This paper shows how to apply dropout so that it effectively regularizes LSTMs.

The Recurrent Neural Network (RNN) is a neural sequence model that achieves state-of-the-art performance on important tasks including language modeling, speech recognition, and machine translation. Successful application of neural networks requires good regularization, but dropout, the most effective regularization technique for feedforward neural networks, does not work well with RNNs. As a result, it is difficult to prevent large RNNs from overfitting. This paper shows that when dropout is applied correctly, it greatly reduces overfitting in RNNs.

Dropout has not worked with RNNs because the recurrence amplifies the noise it introduces; this paper shows that the problem can be fixed by applying dropout only to a certain subset of the RNN's connections. As a result, RNNs can also benefit from dropout. The paper shows how to correctly apply dropout to LSTMs, the most commonly used RNN variant.

In this paper, subscripts denote timesteps and superscripts denote layers. All states are $n$-dimensional. Let $T_{n,m} : \mathbb{R}^n \to \mathbb{R}^m$ be an affine transform ($Wx + b$ for some $W$ and $b$). Let $\odot$ denote element-wise multiplication, and let $h_t^0$ be an input word vector at timestep $t$. The activations $h_t^L$ are used to predict $y_t$, where $L$ is the number of layers in the deep LSTM. The RNN dynamics can be described using deterministic transitions from previous to current hidden states. The deterministic state transition is a function

$$\mathrm{RNN} : h_t^{l-1},\, h_{t-1}^{l} \to h_t^{l}$$

For classical RNNs, this function is given by:

$$h_t^l = f\big(T_{n,n}\, h_t^{l-1} + T_{n,n}\, h_{t-1}^{l}\big), \quad \text{where } f \in \{\mathrm{sigm}, \tanh\}$$
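To make the notation concrete, below is a minimal NumPy sketch (not the authors' code) of one classical RNN transition; the names `W_in`, `W_rec`, and `b` are hypothetical stand-ins for the two affine maps $T_{n,n}$.

```python
# Minimal sketch of the classical RNN transition
# h_t^l = f(T_{n,n} h_t^{l-1} + T_{n,n} h_{t-1}^l).
import numpy as np

def rnn_step(h_below, h_prev, W_in, W_rec, b, f=np.tanh):
    """One deterministic state transition of a classical RNN layer.

    h_below : h_t^{l-1}, activation from the layer below at the current timestep
    h_prev  : h_{t-1}^{l}, this layer's hidden state from the previous timestep
    """
    return f(W_in @ h_below + W_rec @ h_prev + b)

# Example usage with n = 4 hidden units.
n = 4
rng = np.random.default_rng(0)
h_below = rng.standard_normal(n)
h_prev = np.zeros(n)
W_in, W_rec = rng.standard_normal((n, n)), rng.standard_normal((n, n))
b = np.zeros(n)
h_t = rnn_step(h_below, h_prev, W_in, W_rec, b)
```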

The LSTM has complicated dynamics that allow it to easily "memorize" information for an extended number of timesteps. The "long term" memory is stored in a vector of memory cells, $c_t^l \in \mathbb{R}^n$. All LSTM architectures have explicit memory cells for storing information for long periods of time. The LSTM can decide to overwrite the memory cell, retrieve it, or keep it for the next time step. The LSTM architecture used in this paper's experiments is given by the equations below:

$$\mathrm{LSTM} : h_t^{l-1},\, h_{t-1}^{l},\, c_{t-1}^{l} \to h_t^{l},\, c_t^{l}$$

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}$$

$$c_t^l = f \odot c_{t-1}^l + i \odot g$$

$$h_t^l = o \odot \tanh(c_t^l)$$

In these equations, sigm and tanh are applied element-wise. The image below demonstrates the LSTM equations:

[Figure: graphical representation of the LSTM memory cells described by the equations above.]
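To make the gate computation concrete, here is a minimal NumPy sketch of one LSTM transition following the equations above; it is not the authors' implementation, and the names `W` and `b` are hypothetical stand-ins for the affine map $T_{2n,4n}$.

```python
# Minimal sketch of one LSTM cell transition with gate order i, f, o, g.
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W, b):
    """One LSTM transition: (h_t^{l-1}, h_{t-1}^l, c_{t-1}^l) -> (h_t^l, c_t^l).

    W has shape (4n, 2n) and b has shape (4n,), playing the role of T_{2n,4n}.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_below, h_prev]) + b   # T_{2n,4n} applied to the stacked inputs
    i = sigm(z[0 * n:1 * n])         # input gate
    f = sigm(z[1 * n:2 * n])         # forget gate
    o = sigm(z[2 * n:3 * n])         # output gate
    g = np.tanh(z[3 * n:4 * n])      # candidate cell contents
    c_t = f * c_prev + i * g         # c_t^l = f ⊙ c_{t-1}^l + i ⊙ g
    h_t = o * np.tanh(c_t)           # h_t^l = o ⊙ tanh(c_t^l)
    return h_t, c_t
```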

The main contribution of this paper is a recipe for applying dropout to LSTMs in a way that successfully reduces overfitting. The main idea is to apply the dropout operator only to the non-recurrent connections (See image above). The following equation describes it more precisely, where D is the dropout operator that sets a random subset of its argument to zero:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} \mathbf{D}(h_t^{l-1}) \\ h_{t-1}^{l} \end{pmatrix}$$

$$c_t^l = f \odot c_{t-1}^l + i \odot g$$

$$h_t^l = o \odot \tanh(c_t^l)$$
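As a rough illustration of the recipe (not the authors' code), the sketch below applies an inverted-dropout operator $\mathbf{D}$ only to the non-recurrent input $h_t^{l-1}$ before calling the `lstm_step` cell sketched earlier; the recurrent state $h_{t-1}^l$ and cell $c_{t-1}^l$ pass through untouched. The name `keep_prob` is an assumed hyperparameter name.

```python
# Minimal sketch of the paper's recipe: dropout on non-recurrent connections only.
import numpy as np

def dropout(x, keep_prob, rng):
    """Zero a random subset of x and rescale so the expected value is unchanged (inverted dropout)."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

def regularized_lstm_step(h_below, h_prev, c_prev, W, b, keep_prob, rng):
    # D(h_t^{l-1}): only the non-recurrent connection is corrupted.
    h_below = dropout(h_below, keep_prob, rng)
    # Recurrent connections (h_prev, c_prev) are left untouched.
    return lstm_step(h_below, h_prev, c_prev, W, b)
```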

The dropout operator corrupts the information carried by the units, forcing them to perform their intermediate computations more robustly. We do not want to erase all the information from the units; it is especially important that the units remember events that occurred many timesteps in the past. The image below shows how information could flow from an event that occurred at timestep $t-2$ to the prediction at timestep $t+2$ in the paper's implementation of dropout. The information is corrupted by the dropout operator exactly $L+1$ times, and this number is independent of the number of timesteps the information traverses. Standard dropout perturbs the recurrent connections, which makes it difficult for the LSTM to learn to store information for long periods of time. By not applying dropout to the recurrent connections, the LSTM can benefit from dropout regularization without sacrificing its valuable memorization ability.

[Figure: a typical path of information flow from an event at timestep $t-2$ to the prediction at timestep $t+2$; along this path the information is affected by dropout exactly $L+1$ times.]
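For readers who want to try this scheme today, the `dropout` argument of PyTorch's `nn.LSTM` applies dropout to the output of every layer except the last, i.e. only to the non-recurrent connections between stacked layers, which is in the spirit of this recipe; dropout on the word-embedding input and on the final output before the softmax, which the paper also uses, would have to be added separately. A minimal sketch, with sizes loosely based on the paper's medium LSTM (an assumption about tooling, not something used in the paper):

```python
# Stacked LSTM with dropout between layers only (no dropout on recurrent connections).
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=650, hidden_size=650, num_layers=2, dropout=0.5)
x = torch.randn(35, 20, 650)          # (sequence length, batch, features)
output, (h_n, c_n) = lstm(x)          # recurrent connections inside each layer see no dropout
```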

This paper presents a simple way of applying dropout to LSTMs that yields large performance improvements on problems from several different domains. The work makes dropout useful for RNNs, and the results suggest that this implementation of dropout could improve performance on a wide variety of applications.
