Recurrent Neural Network Regularization
I am reading this paper because it is on the list of roughly 30 papers that Ilya Sutskever recommended to John Carmack as covering what really matters for machine learning / AI today. The paper shows how to apply the dropout regularization technique to LSTMs, and it demonstrates that this application of dropout substantially reduces overfitting on a variety of tasks.
Reference Link to PDF of Conference Paper
This paper presents a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Standard dropout does not work well with RNNs and LSTMs. This paper shows how to apply dropout to effectively regularize LSTMs.
The Recurrent Neural Network (RNN) is a neural sequence model that achieves state-of-the-art performance on important tasks including language modeling, speech recognition, and machine translation. Successful application of neural networks requires good regularization, but dropout, the most effective regularization technique for feedforward neural networks, does not work well with RNNs. As a result, it is difficult to prevent large RNNs from overfitting. This paper shows that when dropout is applied correctly, it greatly reduces overfitting in RNNs.
Dropout has not worked well with RNNs because the recurrence amplifies its noise; this paper shows that the problem can be fixed by applying dropout only to a certain subset of the RNN's connections. As a result, RNNs can also benefit from dropout. The paper shows how to correctly apply dropout to LSTMs, the most commonly used RNN variant.
In this paper, subscripts denote timesteps and superscripts denote layers. All states are $n$-dimensional. Let $h_t^l \in \mathbb{R}^n$ be the hidden state in layer $l$ at timestep $t$, and let $T_{n,m} : \mathbb{R}^n \to \mathbb{R}^m$ be an affine transform ($Wx + b$ for some $W$ and $b$). $\odot$ is element-wise multiplication and $h_t^0$ is the input word vector at timestep $t$. The activations $h_t^L$ are used to predict $y_t$, since $L$ is the number of layers in the deep LSTM. The RNN dynamics can be described using deterministic transitions from previous to current hidden states. The deterministic state transition is a function $\mathrm{RNN}: h_t^{l-1}, h_{t-1}^l \to h_t^l$.
For classical RNNs, this function is given by:

$$h_t^l = f\big(T_{n,n}\, h_t^{l-1} + T_{n,n}\, h_{t-1}^l\big), \quad \text{where } f \in \{\mathrm{sigm}, \tanh\}$$
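As a minimal NumPy sketch (not from the paper), the classical RNN transition above can be written as follows; the function and parameter names (`rnn_step`, `W_in`, `W_rec`, `b`) are my own, standing in for the two affine transforms $T_{n,n}$:

```python
import numpy as np

def rnn_step(h_below, h_prev, W_in, W_rec, b, f=np.tanh):
    """Classical RNN transition: h_t^l = f(T h_t^{l-1} + T h_{t-1}^l).

    h_below : h_t^{l-1}, activation from the layer below at the current timestep
    h_prev  : h_{t-1}^l, this layer's hidden state from the previous timestep
    W_in, W_rec, b : parameters of the two affine transforms T_{n,n}
    """
    return f(W_in @ h_below + W_rec @ h_prev + b)
```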
The LSTM has complicated dynamics that allow it to easily "memorize" information for an extended number of timesteps. The "long term" memory is stored in a vector of memory cells, $c_t^l \in \mathbb{R}^n$. All LSTM architectures have explicit memory cells for storing information for long periods of time. The LSTM can decide to overwrite the memory cell, retrieve it, or keep it for the next time step. The LSTM architecture used in this paper's experiments is given by the equations below:

$$\mathrm{LSTM}: h_t^{l-1}, h_{t-1}^l, c_{t-1}^l \to h_t^l, c_t^l$$

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}$$

$$c_t^l = f \odot c_{t-1}^l + i \odot g$$
$$h_t^l = o \odot \tanh(c_t^l)$$

In these equations, $\mathrm{sigm}$ and $\tanh$ are applied element-wise. The figure in the paper illustrates these LSTM equations graphically.
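A minimal NumPy sketch of these LSTM equations, for concreteness; the names (`lstm_step`, the stacked weight matrix `W`) are my own, and only the math comes from the paper:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W, b):
    """One LSTM step following the paper's equations.

    h_below : h_t^{l-1} (n,)  input from the layer below
    h_prev  : h_{t-1}^l (n,)  previous hidden state of this layer
    c_prev  : c_{t-1}^l (n,)  previous memory cell
    W       : (4n, 2n)        the affine transform T_{2n,4n}
    b       : (4n,)
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_below, h_prev]) + b   # T_{2n,4n}(h_t^{l-1}, h_{t-1}^l)
    i = sigm(z[0 * n:1 * n])      # input gate
    f = sigm(z[1 * n:2 * n])      # forget gate
    o = sigm(z[2 * n:3 * n])      # output gate
    g = np.tanh(z[3 * n:4 * n])   # candidate cell update
    c = f * c_prev + i * g        # c_t^l = f ⊙ c_{t-1}^l + i ⊙ g
    h = o * np.tanh(c)            # h_t^l = o ⊙ tanh(c_t^l)
    return h, c
```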
The main contribution of this paper is a recipe for applying dropout to LSTMs in a way that successfully reduces overfitting. The main idea is to apply the dropout operator only to the non-recurrent connections (see the figure above). The following equations describe this more precisely, where $\mathbf{D}$ is the dropout operator that sets a random subset of its argument to zero:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} \mathbf{D}(h_t^{l-1}) \\ h_{t-1}^l \end{pmatrix}$$

$$c_t^l = f \odot c_{t-1}^l + i \odot g$$
$$h_t^l = o \odot \tanh(c_t^l)$$
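To make the recipe concrete, here is a hedged sketch reusing the hypothetical `lstm_step` above. The dropout operator $\mathbf{D}$ is applied only to the input coming from the layer below, never to $h_{t-1}^l$ or $c_{t-1}^l$; using inverted-dropout scaling is my own convenience choice, while the paper only describes $\mathbf{D}$ as zeroing a random subset of its argument:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Dropout operator D: zeros each element of x with probability p
    (inverted dropout, so no rescaling is needed at test time)."""
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

def regularized_lstm_step(h_below, h_prev, c_prev, W, b, p_drop, training=True):
    """Dropout is applied only to the non-recurrent input h_t^{l-1};
    the recurrent state (h_{t-1}^l, c_{t-1}^l) passes through untouched."""
    return lstm_step(dropout(h_below, p_drop, training), h_prev, c_prev, W, b)
```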
The dropout operator corrupts the information carried by the units, forcing them to perform their intermediate computations more robustly. At the same time, we do not want to erase all the information from the units; it is especially important that the units remember events that occurred many timesteps in the past. The figure in the paper shows how information can flow from an event that occurred at timestep $t-2$ to the prediction at timestep $t+2$ in this implementation of dropout: the information is corrupted by the dropout operator exactly $L+1$ times, and this number is independent of the number of timesteps the information traverses. Standard dropout perturbs the recurrent connections, which makes it difficult for the LSTM to learn to store information for long periods of time. By not using dropout on the recurrent connections, the LSTM can benefit from dropout regularization without sacrificing its valuable memorization ability.
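As a small illustration (again using the hypothetical helpers sketched above), unrolling one timestep of an $L$-layer regularized LSTM applies $\mathbf{D}$ once per layer input, starting with the word vector $h_t^0$, and once more to $h_t^L$ before the output layer: $L + 1$ applications on the path to the prediction, no matter how many timesteps the information has already traversed through the untouched recurrent connections.

```python
def forward_one_timestep(x_t, h_prevs, c_prevs, params, p_drop, training=True):
    """One timestep of an L-layer regularized LSTM.

    D is applied once per layer inside the loop (to each non-recurrent input,
    beginning with the word vector h_t^0) and once more to h_t^L before the
    output layer: L + 1 applications in total."""
    h_below = x_t                                   # h_t^0, the input word vector
    new_h, new_c = [], []
    for l, (W, b) in enumerate(params):             # params holds (W, b) for layers 1..L
        h, c = regularized_lstm_step(h_below, h_prevs[l], c_prevs[l],
                                     W, b, p_drop, training)
        new_h.append(h)
        new_c.append(c)
        h_below = h
    softmax_input = dropout(h_below, p_drop, training)   # D(h_t^L), used to predict y_t
    return softmax_input, new_h, new_c
```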
This paper presented a simple way of applying dropout to LSTMs that results in large performance improvements on several problems from different domains. This work makes dropout useful for RNNs, and the results suggest that this implementation of dropout could improve performance on a wide variety of applications.