Hands-On Machine Learning Chapter 15 - Processing Sequences Using RNNs and CNNs

This chapter covers processing sequence data with recurrent neural networks and convolutional neural networks; it introduces and explains simple RNNs, LSTMs, and GRUs.

Chapter 15: Processing Sequences Using RNNs and CNNs

Recurrent Neurons and Layers

A recurrent neural network looks very much like a feedforward neural network, except that it also has connections pointing backward. At each time step $t$, also called a frame, a recurrent neuron receives the inputs $\textbf{x}_{(t)}$ as well as its own output from the previous time step, $\hat{y}_{(t-1)}$. Since there is no previous output at the first time step, it is generally set to 0. In a layer of recurrent neurons, at each time step $t$ every neuron receives both the input vector $\textbf{x}_{(t)}$ and the output vector from the previous time step, $\hat{\textbf{y}}_{(t-1)}$.

A Recurrent Neuron Unrolled Through Time

Each recurrent neuron has two sets of weights: one for the inputs $\textbf{x}_{(t)}$ and the other for the outputs of the previous time step, $\hat{\textbf{y}}_{(t-1)}$. These weight vectors can be denoted $\textbf{w}_x$ and $\textbf{w}_{\hat{y}}$. Their corresponding matrices for the whole layer can be denoted $\textbf{W}_x$ and $\textbf{W}_{\hat{y}}$. The output vector of a recurrent layer ($\phi$ is the activation function and $\textbf{b}$ is the bias vector):

$\hat{\textbf{y}}_{(t)} = \phi(\textbf{W}_x^T \textbf{x}_{(t)} + \textbf{W}_{\hat{y}}^T \hat{\textbf{y}}_{(t-1)} + \textbf{b})$

Computing a recurrent layer's output in one shot for an entire mini-batch, by placing all the inputs at time step $t$ into an input matrix $\textbf{X}_{(t)}$:

$\hat{\textbf{Y}}_{(t)} = \phi(\textbf{X}_{(t)} \textbf{W}_x + \hat{\textbf{Y}}_{(t-1)} \textbf{W}_{\hat{y}} + \textbf{b}) = \phi([\textbf{X}_{(t)} \ \hat{\textbf{Y}}_{(t-1)}] \textbf{W} + \textbf{b}) \text{ with } \textbf{W} = \begin{bmatrix} \textbf{W}_x \\ \textbf{W}_{\hat{y}} \end{bmatrix}$
  • $\hat{\textbf{Y}}_{(t)}$ is an $m \times n_{neurons}$ matrix containing the layer's outputs at time step $t$ for each instance in the mini-batch
  • $\textbf{X}_{(t)}$ is an $m \times n_{inputs}$ matrix containing the inputs for all instances
  • $\textbf{W}_x$ is an $n_{inputs} \times n_{neurons}$ matrix containing the connection weights for the inputs of the current time step
  • $\textbf{W}_{\hat{y}}$ is an $n_{neurons} \times n_{neurons}$ matrix containing the connection weights for the outputs of the previous time step
  • $\textbf{b}$ is a vector of size $n_{neurons}$ containing each neuron's bias term
  • The weight matrices $\textbf{W}_x$ and $\textbf{W}_{\hat{y}}$ are often concatenated vertically into a single matrix $\textbf{W}$ of shape $(n_{inputs} + n_{neurons}) \times n_{neurons}$
  • The notation $[\textbf{X}_{(t)} \ \hat{\textbf{Y}}_{(t-1)}]$ represents the horizontal concatenation of the matrices $\textbf{X}_{(t)}$ and $\hat{\textbf{Y}}_{(t-1)}$

Notice that $\hat{\textbf{Y}}_{(t)}$ is a function of all the inputs since time $t = 0$.
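
To make the shapes concrete, here is a minimal NumPy sketch of this computation (the helper name, array values, and sizes are made up for illustration):

import numpy as np

def recurrent_layer_output(X_t, Y_prev, W_x, W_y, b):
    """Compute phi([X(t) Y_hat(t-1)] W + b) for one mini-batch, with phi = tanh."""
    concat = np.concatenate([X_t, Y_prev], axis=1)  # shape [m, n_inputs + n_neurons]
    W = np.concatenate([W_x, W_y], axis=0)          # shape [n_inputs + n_neurons, n_neurons]
    return np.tanh(concat @ W + b)                  # shape [m, n_neurons]

m, n_inputs, n_neurons = 4, 3, 5                    # arbitrary sizes
rng = np.random.default_rng(42)
X_t = rng.normal(size=(m, n_inputs))
Y_prev = np.zeros((m, n_neurons))                   # no previous output at t = 0
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)
print(recurrent_layer_output(X_t, Y_prev, W_x, W_y, b).shape)  # (4, 5)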

Memory Cells

Since the output of a recurrent neuron at time step $t$ is a function of all the inputs from the previous time steps, you could say it has a form of memory. A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell). A cell's state at time step $t$, $\textbf{h}_{(t)}$ (where "h" stands for "hidden"), is a function of some inputs at that time step and its state at the previous time step: $\textbf{h}_{(t)} = f(\textbf{x}_{(t)}, \textbf{h}_{(t-1)})$. The hidden state at a given time step is not always equal to the output at that time step.

Hidden State and Output Not Always Equal

Input and Output Sequences

An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs. This type of sequence-to-sequence network is useful for forecasting time series. Alternatively, you could feed the network a sequence of inputs and ignore all outputs except for the last one (a sequence-to-vector network). Conversely, you could feed the network the same input vector over and over again at each time step and let it output a sequence (a vector-to-sequence network). Lastly, you could have a sequence-to-vector network, called an encoder, followed by a vector-to-sequence network, called a decoder (an encoder-decoder model).
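
In Keras, which of these behaviors you get is mostly controlled by the return_sequences argument of the recurrent layers. A minimal sketch (the layer sizes here are arbitrary):

import tensorflow as tf

# Sequence-to-vector: by default, only the output at the last time step is returned.
seq_to_vec = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=[None, 1]),
    tf.keras.layers.Dense(1)
])

# Sequence-to-sequence: return_sequences=True returns an output at every time step.
seq_to_seq = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
    tf.keras.layers.Dense(1)
])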

Input And Output Sequences

Training RNNs

To train an RNN, the trick is to unroll it through time and use regular backpropagation. This strategy is called backpropagation through time (BPTT). The output sequence is evaluated using a loss function $\mathscr{L}(\textbf{Y}_{(0)}, \ldots, \textbf{Y}_{(T)}; \hat{\textbf{Y}}_{(0)}, \ldots, \hat{\textbf{Y}}_{(T)})$ (where $\textbf{Y}_{(i)}$ is the $i^{th}$ target, $\hat{\textbf{Y}}_{(i)}$ is the $i^{th}$ prediction, and $T$ is the max time step). Note that this loss function may ignore some outputs; the figure below only uses the last three outputs. Since the same parameters $\textbf{W}$ and $\textbf{b}$ are used at each time step, their gradients are accumulated across all the time steps during backpropagation. Keras takes care of this backpropagation and gradient descent for you.

BPTT

Forecasting a Time Series

This is time series data: data with values at different time steps, usually at regular intervals. More specifically, since there are multiple values per time step, this is a multivariate time series; a time series with one value per time step is called a univariate time series. Predicting future values (forecasting) is the most typical task when dealing with time series. Other tasks include imputation, classification, anomaly detection, and more. Our baseline for this dataset, which shows seasonality (see the plot below), will be a naive forecast: copying a past value to make the forecast. When a time series is correlated with a lagged version of itself, we say that the time series is autocorrelated. The MAE, MAPE, and MSE are among the most common metrics you can use to evaluate your forecasts. Differencing, subtracting a past value from a more recent value in the series, is a common technique used to remove trend and seasonality from a time series: it is easier to work with a stationary time series, meaning one whose statistical properties remain constant over time, without any seasonality or trends. Once you can make accurate forecasts on the differenced time series, it is easy to turn them into forecasts for the actual time series by adding back the past values that were previously subtracted.
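
As a toy illustration of that last point (the numbers are made up): if a model forecasts the 7-day difference, adding back the value observed 7 days earlier recovers a forecast for the original series.

y_past_week = 421_000                      # hypothetical value observed 7 days earlier
diff_forecast = -12_500                    # hypothetical forecast of the 7-day difference y[t] - y[t-7]
y_forecast = y_past_week + diff_forecast   # forecast for the original (undifferenced) series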

"""
Loading and Cleaning up the Data
"""
import pandas as pd
import os
# Load File
df = pd.read_csv(os.path.join(os.getcwd(),'CTA_-_Ridership_-_Daily_Boarding_Totals.csv'), parse_dates=["service_date"])
df.columns = ["date", "day_type", "bus", "rail", "total"]  # shorter names
# Sort the rows by date
df = df.sort_values("date").set_index("date")
df = df.drop("total", axis=1)  # no need for total, it's just bus + rail
df = df.drop_duplicates()  # remove duplicated months (2011-10 and 2014-07)
df.head()
out[3]

            day_type     bus    rail
date
2001-01-01         U  297192  126455
2001-01-02         W  780827  501952
2001-01-03         W  824923  536432
2001-01-04         W  870021  550011
2001-01-05         W  890426  557917

"""
Plot the Bus and Rail Ridership For a
Few Months
"""
import matplotlib.pyplot as plt

df["2019-03":"2019-05"].plot(grid=True, marker=".", figsize=(8, 3.5))
plt.show()
out[4]
[Figure: daily bus and rail ridership, March-May 2019]

"""
Visualizing Naive Forecasts
"""
diff_7 = df[["bus", "rail"]].diff(7)["2019-03":"2019-05"]

fig, axs = plt.subplots(2, 1, sharex=True, figsize=(8, 5))
df.plot(ax=axs[0], legend=False, marker=".")  # original time series
df.shift(7).plot(ax=axs[0], grid=True, legend=False, linestyle=":")  # lagged
diff_7.plot(ax=axs[1], grid=True, marker=".")  # 7-day difference time series
plt.show()
out[5]
[Figure: ridership and its 7-day lagged series (top), 7-day difference (bottom), March-May 2019]

"""
MAE of Naive Forecasts
"""
print("MAE")
print(diff_7.abs().mean())
print("MAPE - Mean Absolute Percentage Error")
targets = df[["bus", "rail"]]["2019-03":"2019-05"]
print((diff_7 / targets).abs().mean())
out[6]

MAE
bus 43915.608696
rail 42143.271739
dtype: float64
MAPE - Mean Absolute Percentage Error
bus 0.082938
rail 0.089948
dtype: float64

The ARMA Model Family

The autoregressive moving average (ARMA) model computes forecasts using a simple weighted sum of lagged values and corrects these forecasts by adding a moving average.

$\hat{y}_{(t)} = \sum_{i=1}^{p} \alpha_i y_{(t-i)} + \sum_{i=1}^{q} \theta_i \epsilon_{(t-i)} \text{ with } \epsilon_{(t)} = y_{(t)} - \hat{y}_{(t)}$
  • $y_{(t)}$ is the time series' value at time step $t$
  • $\hat{y}_{(t)}$ is the model's forecast at time step $t$
  • The first sum is the weighted sum of the last $p$ values of the time series, using the learned weights $\alpha_i$. The number $p$ is a hyperparameter, and it determines how far back into the past the model should look. This sum is the autoregressive component: it performs regression based on past values.
  • The second sum is the weighted sum over the past $q$ forecast errors $\epsilon_{(t)}$, using the learned weights $\theta_i$. The number $q$ is a hyperparameter. This sum is the moving average component of the model.

This model assumes the time series is stationary. If it is not, then differencing may help. Generally, running $d$ consecutive rounds of differencing computes an approximation of the $d^{th}$-order derivative of the time series, so it will eliminate polynomial trends up to degree $d$. The hyperparameter $d$ is called the order of integration. Differencing is the central contribution of the autoregressive integrated moving average (ARIMA) model: this model runs $d$ rounds of differencing to make the time series more stationary, then it applies a regular ARMA model. One last member of the ARMA family is the seasonal ARIMA (SARIMA) model: it models the time series in the same way as ARIMA, but it additionally models a seasonal component for a given frequency, using the exact same ARIMA approach.

!pip install statsmodels
from statsmodels.tsa.arima.model import ARIMA

origin, today = "2019-01-01", "2019-05-31"
rail_series = df.loc[origin:today]["rail"].asfreq("D")
model = ARIMA(rail_series,
              order=(1, 0, 0), # p=1,d=0,q=0
              seasonal_order=(0, 1, 1, 7)) # P=0,D=1,Q=1,s=7
model = model.fit()
y_pred = model.forecast()  # returns 427,758.6 - worse than the naive baseline

"""
Running the above code in a loop to get the MAE over a period
"""
origin, start_date, end_date = "2019-01-01", "2019-03-01", "2019-05-31"
time_period = pd.date_range(start_date, end_date)
rail_series = df.loc[origin:end_date]["rail"].asfreq("D")
y_preds = []
for today in time_period.shift(-1):
    model = ARIMA(rail_series[origin:today],  # train on data up to "today"
                  order=(1, 0, 0),
                  seasonal_order=(0, 1, 1, 7))
    model = model.fit()  # note that we retrain the model every day!
    y_pred = model.forecast()[0]
    y_preds.append(y_pred)

y_preds = pd.Series(y_preds, index=time_period)
mae = (y_preds - rail_series[time_period]).abs().mean()  # returns 32,040.7
print("MAE:",mae)
out[8]

Preparing the Data for Machine Learning Models

Keras has a utility function called tf.keras.utils.timeseries_dataset_from_array() to help us prepare the training set. It takes a time series as input and builds a tf.data.Dataset containing all the windows of the desired length, along with their corresponding targets. You can also use the window() method of the tf.data.Dataset class, which is more complex but gives you more control.

"""
Building the dataset
"""
import tensorflow as tf

my_series = [0, 1, 2, 3, 4, 5]
my_dataset = tf.keras.utils.timeseries_dataset_from_array(
    my_series,
    targets=my_series[3:],  # the targets are 3 steps into the future
    sequence_length=3,
    batch_size=2
)
list(my_dataset) # Inspecting the dataset
out[10]

[(<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[0, 1, 2],
         [1, 2, 3]], dtype=int32)>,
  <tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 4], dtype=int32)>),
 (<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[2, 3, 4]], dtype=int32)>,
  <tf.Tensor: shape=(1,), dtype=int32, numpy=array([5], dtype=int32)>)]

"""
Using the window function
"""
# drop_remainder drops windows that are smaller than the specified window size
# (these occur at the end of the series) - you typically want to do this.
# The window() method returns a nested dataset - like a list of lists.
dataset = tf.data.Dataset.range(6).window(4, shift=1, drop_remainder=True)
# We must flatten the nested dataset into a flat dataset for training;
# after flattening, iterating over it yields tensors, not datasets.
dataset = dataset.flat_map(lambda window_dataset: window_dataset.batch(4))
for window_tensor in dataset:
  print(f"{window_tensor}")

def to_windows(dataset, length):
  """
  Helper function to make it easier to extract windows from a dataset
  """
  dataset = dataset.window(length, shift=1, drop_remainder=True)
  return dataset.flat_map(lambda window_ds: window_ds.batch(length))

dataset = to_windows(tf.data.Dataset.range(6), 4)  # 3 inputs + 1 target = 4
# Split each window into inputs and targets
dataset = dataset.map(lambda window: (window[:-1], window[-1]))
# Gives the same output as the timeseries_dataset_from_array() function
list(dataset.batch(2))
out[11]

[0 1 2 3]
[1 2 3 4]
[2 3 4 5]

[(<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
  array([[0, 1, 2],
         [1, 2, 3]])>,
  <tf.Tensor: shape=(2,), dtype=int64, numpy=array([3, 4])>),
 (<tf.Tensor: shape=(1, 3), dtype=int64, numpy=array([[2, 3, 4]])>,
  <tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>)]

"""
Split data into train, test, and validation.
Scale the data down by a factor of a million to ensure the values
are in 0-1 range, which helps with weight initialization and learning rate
"""
rail_train = df["rail"]["2016-01":"2018-12"] / 1e6
rail_valid = df["rail"]["2019-01":"2019-05"] / 1e6
rail_test = df["rail"]["2019-06":] / 1e6
out[12]
"""
Create datasets for training and validation,
"""
seq_length = 56
train_ds = tf.keras.utils.timeseries_dataset_from_array(
    rail_train.to_numpy(),
    targets=rail_train[seq_length:],
    sequence_length=seq_length,
    batch_size=32,
    shuffle=True, # Shuffle training windows, but not contents, for gradient descent
    seed=42
)
valid_ds = tf.keras.utils.timeseries_dataset_from_array(
    rail_valid.to_numpy(),
    targets=rail_valid[seq_length:],
    sequence_length=seq_length,
    batch_size=32
)
out[13]

Forecasting Using a Simple RNN

The most basic RNN contains a single recurrent layer with just one recurrent neuron. All recurrent layers in Keras expect 3D inputs of shape [batch size, time steps, dimensionality], where dimensionality is 1 for a univariate time series and more for a multivariate time series. The input_shape argument ignores the first dimension (the batch size), and since recurrent layers can accept input sequences of any length, we can set the time steps dimension to None, which means "any size".

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(1, input_shape=[None, 1])
])

This model would end up having a large MAE, both because a single recurrent neuron gives it very few parameters and because the default activation function is tanh, which restricts the range of values the model can output.

Creating a model with a larger recurrent layer. The recurrent layer will be able to carry much more information from one time step to the next, and the dense output layer will project the final output from 32 dimensions down to 1, without any constraint on the value range. This model ends up with an MAE of 27,703, the best yet.

univar_model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=[None, 1]),
    tf.keras.layers.Dense(1)  # no activation function by default
])
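
The notes don't include the training code; here is one way such a model might be compiled and fit. The Huber loss, SGD settings, and early-stopping callback below are assumptions for illustration, not necessarily the exact choices used in the book.

# Assumed training setup - loss, optimizer, and callback choices are illustrative
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_mae", patience=50,
                                                  restore_best_weights=True)
univar_model.compile(loss=tf.keras.losses.Huber(),
                     optimizer=tf.keras.optimizers.SGD(learning_rate=0.05, momentum=0.9),
                     metrics=["mae"])
history = univar_model.fit(train_ds, validation_data=valid_ds, epochs=500,
                           callbacks=[early_stopping])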

Forecasting Using a Deep RNN

Stacking multiple layers of cells gives you a deep RNN. Implementing a deep RNN in Keras is simple: just stack recurrent layers.

Deep RNN

# Deep RNN
deep_model = tf.keras.Sequential([
    # Set return_sequences=True for all layers except the last recurrent layer
    # in a deep RNN, since the next layer needs an output at every time step.
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
    tf.keras.layers.SimpleRNN(32, return_sequences=True), # Sequence to Sequence layers
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1) # Sequence to Vector Layer with no activation
])

Forecasting Multivariate Time Series

A great quality of neural networks is their flexibility: they can deal with multivariate time series with almost no change to their architecture.

df_mulvar = df[["bus", "rail"]] / 1e6  # use both bus & rail series as input
df_mulvar["next_day_type"] = df["day_type"].shift(-1)  # we know tomorrow's type
df_mulvar = pd.get_dummies(df_mulvar)  # one-hot encode the day type
print(df_mulvar.head())
"""
Split data into train, val, test
"""
mulvar_train = df_mulvar["2016-01":"2018-12"]
mulvar_valid = df_mulvar["2019-01":"2019-05"]
mulvar_test = df_mulvar["2019-06":]
"""
Create the training sets
"""
train_mulvar_ds = tf.keras.utils.timeseries_dataset_from_array(
    mulvar_train.to_numpy(),  # use all 5 columns as input
    targets=mulvar_train["rail"][seq_length:],  # forecast only the rail series
    sequence_length=seq_length,
    batch_size=32,
    shuffle=True, # Shuffle training windows, but not contents, for gradient descent
    seed=42
)
valid_mulvar_ds = tf.keras.utils.timeseries_dataset_from_array(
    mulvar_valid.to_numpy(),
    targets=mulvar_valid["rail"][seq_length:],
    sequence_length=seq_length,
    batch_size=32
)
"""
Create the RNN

The only difference between this model and the earlier model is the multivariate
nature of it.
"""
mulvar_model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=[None, 5]),
    tf.keras.layers.Dense(1)
])

Forecasting Several Time Steps Ahead

To predict several time steps ahead, you could either:

  1. Predict the next time step, append that prediction to the data as if it had actually occurred, then re-run the model to get the following time step, and so on. Errors tend to accumulate with this approach, since each forecast is fed back in as if it were real data.
  2. Train an RNN to predict the next 14 values in one shot. This approach works well and does not accumulate errors like option 1. The data preparation looks like this:
def split_inputs_and_targets(mulvar_series, ahead=14, target_col=1):
    return mulvar_series[:, :-ahead], mulvar_series[:, -ahead:, target_col]

ahead_train_ds = tf.keras.utils.timeseries_dataset_from_array(
    mulvar_train.to_numpy(),
    targets=None,
    sequence_length=seq_length + 14,
    batch_size=32,  # the other 3 arguments are the same as earlier
    shuffle=True,
    seed=42
).map(split_inputs_and_targets)
ahead_valid_ds = tf.keras.utils.timeseries_dataset_from_array(
    mulvar_valid.to_numpy(),
    targets=None,
    sequence_length=seq_length + 14,
    batch_size=32
).map(split_inputs_and_targets)
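
The notes omit the model for this approach; a sketch of a sequence-to-vector RNN whose Dense layer outputs all 14 forecasts in one shot (the name ahead_model is just illustrative):

ahead_model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)  # one output per day ahead
])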

Forecasting Using a Sequence-to-Sequence Model

We can train the model to forecast the next 14 values at each and every time step, turning the sequence-to-vector RNN into a sequence-to-sequence RNN. The advantage of this technique is that the loss will contain a term for the output of the RNN at each and every time step, not just the output at the last time step, which means there will be many more error gradients flowing through the model.

"""
Preparing the Dataset for Sequence-to-Sequence Model
"""
my_series = tf.data.Dataset.range(7)
dataset = to_windows(to_windows(my_series, 3), 4)
print(list(dataset))

dataset = dataset.map(lambda S: (S[:, 0], S[:, 1:]))
print(list(dataset))

def to_seq2seq_dataset(series, seq_length=56, ahead=14, target_col=1, batch_size=32, shuffle=False, seed=None):
  """
  Utility function to prepare the datasets fro sequence-to-sequence model
  """
  ds = to_windows(tf.data.Dataset.from_tensor_slices(series), ahead + 1)
  ds = to_windows(ds, seq_length).map(lambda S: (S[:, 0], S[:, 1:, 1]))
  if shuffle:
      ds = ds.shuffle(8 * batch_size, seed=seed)
  return ds.batch(batch_size)

"""
Create the datasets
"""
seq2seq_train = to_seq2seq_dataset(mulvar_train, shuffle=True, seed=42)
seq2seq_valid = to_seq2seq_dataset(mulvar_valid)
"""
Build the sequence-to-sequnece model
"""
seq2seq_model = tf.keras.Sequential([
    # return_sequnces=True for sequence to sequence model
    tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])
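
At inference time this model produces 14 forecasts at every time step, so to forecast the next 14 days you keep only the output of the last time step. A minimal sketch, using one batch of validation windows as example input:

X, _ = next(iter(seq2seq_valid))       # one batch of input windows, shape [batch size, time steps, 5]
Y_pred_all = seq2seq_model.predict(X)  # shape [batch size, time steps, 14]
Y_pred = Y_pred_all[:, -1]             # keep only the last time step's 14 forecasts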

Handling Long Sequences

Fighting the Unstable Gradients Problem

Many of the tricks we used in deep nets to alleviate the unstable gradients problem can also be used for RNNs: good parameter initialization, faster optimizers, dropout, and so on. However, nonsaturating activation functions may actually lead the RNN to be even more unstable during training, so you may want to use a smaller learning rate, monitor the size of the gradients, and consider gradient clipping. Batch normalization does not work well with RNNs. A form of normalization that does work well with RNNs is layer normalization: it is similar to batch normalization, but instead of normalizing across the batch dimension, it normalizes across the feature dimension. In an RNN, it is typically used right after the linear combination of the inputs and the hidden states.

class LNSimpleRNNCell(tf.keras.layers.Layer):
  """
  Custom cell that applies layer normalization; inherits from the Layer class.
  """
  def __init__(self, units, activation="tanh", **kwargs):
    super().__init__(**kwargs)
    self.state_size = units
    self.output_size = units
    self.simple_rnn_cell = tf.keras.layers.SimpleRNNCell(units,
                                                          activation=None)
    self.layer_norm = tf.keras.layers.LayerNormalization()
    self.activation = tf.keras.activations.get(activation)

  def call(self, inputs, states):
    """
    inputs = current time step
    states = hidden states from the previous time step(s)
    """
    outputs, new_states = self.simple_rnn_cell(inputs, states)
    norm_outputs = self.activation(self.layer_norm(outputs))
    return norm_outputs, [norm_outputs]
custom_ln_model = tf.keras.Sequential([
    tf.keras.layers.RNN(LNSimpleRNNCell(32), return_sequences=True,
                        input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])

Tackling the Short-Term Memory Problem

Due to the transformations that the data goes through when traversing an RNN, some information is lost at each time step. After a while, the RNN's state contains virtually no trace of the first inputs. Various cells with long-term memory have been introduced to address this, and simple RNN cells are not used much anymore; the most popular long-memory cell is the LSTM cell.

LSTM Cells

The long short-term memory (LSTM) cell performs much better than the simple RNN cell and converges faster.

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 5]),
    tf.keras.layers.Dense(14)
])

LSTM Cell

The LSTM cell looks exactly like a regular cell, except that its state is split into two vectors: $\textbf{h}_{(t)}$ and $\textbf{c}_{(t)}$ ("c" stands for "cell"). You can think of $\textbf{h}_{(t)}$ as the short-term state and $\textbf{c}_{(t)}$ as the long-term state. The key idea of the LSTM is that the network can learn what to store in the long-term state, what to throw away, and what to read from it. As the long-term state $\textbf{c}_{(t-1)}$ traverses the network from left to right, it first goes through a forget gate, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an input gate). The result $\textbf{c}_{(t)}$ is sent straight out, without any further transformation. So, at each time step, some memories are dropped and others are added. After the addition operation, the long-term state is also copied and passed through the tanh function, and the result is filtered by the output gate. This produces the short-term state $\textbf{h}_{(t)}$, which is equal to the cell's output for this time step.

First, the current input vector $\textbf{x}_{(t)}$ and the previous short-term state $\textbf{h}_{(t-1)}$ are fed to four different fully connected layers:

  • The main layer is the one that outputs $\textbf{g}_{(t)}$. It has the usual role of analyzing the current inputs $\textbf{x}_{(t)}$ and the previous short-term state $\textbf{h}_{(t-1)}$. In an LSTM cell, this layer's output does not go straight out; instead, its most important parts are stored in the long-term state (and the rest is dropped).
  • The three other layers are gate controllers. Their outputs are fed to element-wise multiplication operations: if they output 0s they close the gate, and if they output 1s they open it:
    • The forget gate (controlled by $\textbf{f}_{(t)}$) controls which parts of the long-term state should be erased.
    • The input gate (controlled by $\textbf{i}_{(t)}$) controls which parts of $\textbf{g}_{(t)}$ should be added to the long-term state.
    • Finally, the output gate (controlled by $\textbf{o}_{(t)}$) controls which parts of the long-term state should be read and output at this time step, both to $\textbf{h}_{(t)}$ and to $\textbf{y}_{(t)}$.

LSTM Cell Computations
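
These computations can be written out as follows (the standard LSTM equations: $\sigma$ is the sigmoid function, $\otimes$ denotes element-wise multiplication, and $\textbf{W}_{xi}, \textbf{W}_{hi}, \textbf{b}_i$, etc. are the weights and biases of each of the four layers):

$\textbf{i}_{(t)} = \sigma(\textbf{W}_{xi}^T \textbf{x}_{(t)} + \textbf{W}_{hi}^T \textbf{h}_{(t-1)} + \textbf{b}_i)$
$\textbf{f}_{(t)} = \sigma(\textbf{W}_{xf}^T \textbf{x}_{(t)} + \textbf{W}_{hf}^T \textbf{h}_{(t-1)} + \textbf{b}_f)$
$\textbf{o}_{(t)} = \sigma(\textbf{W}_{xo}^T \textbf{x}_{(t)} + \textbf{W}_{ho}^T \textbf{h}_{(t-1)} + \textbf{b}_o)$
$\textbf{g}_{(t)} = \tanh(\textbf{W}_{xg}^T \textbf{x}_{(t)} + \textbf{W}_{hg}^T \textbf{h}_{(t-1)} + \textbf{b}_g)$
$\textbf{c}_{(t)} = \textbf{f}_{(t)} \otimes \textbf{c}_{(t-1)} + \textbf{i}_{(t)} \otimes \textbf{g}_{(t)}$
$\textbf{y}_{(t)} = \textbf{h}_{(t)} = \textbf{o}_{(t)} \otimes \tanh(\textbf{c}_{(t)})$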

In short, an LSTM cell can learn to recognize an important input (the role of the input gate), store it in the long-term state, preserve it for as long as it is needed (the role of the forget gate), and extract it whenever it is needed. These cells are successful at capturing long-term patterns in time series, long texts, audio recordings, and more.

GRU Cells

The gated recurrent unit (GRU) cell is a simplified version of the LSTM cell, and it seems to perform just as well.

GRU cell

Main simplifications:

  • Both state vectors are merged into a single vector $\textbf{h}_{(t)}$.
  • A single gate controller $\textbf{z}_{(t)}$ controls both the forget gate and the input gate. If the gate controller outputs a 1, the forget gate is open (= 1) and the input gate is closed (1 - 1 = 0). If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant of the LSTM cell in and of itself.
  • There is no output gate; the full state vector is output at every time step. However, there is a new gate controller $\textbf{r}_{(t)}$ that controls which part of the previous state will be shown to the main layer ($\textbf{g}_{(t)}$).

GRU Computations
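
The GRU computations can be written as (standard GRU equations, using the same notation as the LSTM equations above, and matching the gate behavior described in the list above):

$\textbf{z}_{(t)} = \sigma(\textbf{W}_{xz}^T \textbf{x}_{(t)} + \textbf{W}_{hz}^T \textbf{h}_{(t-1)} + \textbf{b}_z)$
$\textbf{r}_{(t)} = \sigma(\textbf{W}_{xr}^T \textbf{x}_{(t)} + \textbf{W}_{hr}^T \textbf{h}_{(t-1)} + \textbf{b}_r)$
$\textbf{g}_{(t)} = \tanh(\textbf{W}_{xg}^T \textbf{x}_{(t)} + \textbf{W}_{hg}^T (\textbf{r}_{(t)} \otimes \textbf{h}_{(t-1)}) + \textbf{b}_g)$
$\textbf{h}_{(t)} = \textbf{z}_{(t)} \otimes \textbf{h}_{(t-1)} + (1 - \textbf{z}_{(t)}) \otimes \textbf{g}_{(t)}$

In Keras, tf.keras.layers.GRU can be used as a drop-in replacement for SimpleRNN or LSTM in the models above.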

LSTM and GRU cells are among the main reasons behind the success of RNNs. Yet while they can tackle much longer sequences than simple RNNs, they still have a fairly limited short-term memory, and they have a hard time learning long-term patterns in sequences of 100 time steps or more, such as audio samples, long time series, or long sentences.

Using 1D Convolutional Layers to Process Sequences

A 1D convolutional layer slides several kernels across a sequence, producing a 1D feature map per kernel. Each kernel will learn to detect a single, very short sequential pattern. You can build a neural network composed of a mix of recurrent layers and 1D convolutional layers. If you use a 1D convolutional layer with a stride of 1 and "same" padding, then the output sequence will have the same length as the input sequence.
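
A sketch of such a hybrid model, assuming the same multivariate sequence-to-sequence setup as earlier (with stride 1 and "same" padding the convolutional layer preserves the sequence length, so the seq2seq datasets could be reused as-is; the layer sizes are illustrative):

conv_rnn_model = tf.keras.Sequential([
    # Detect short patterns (kernel_size=4) without changing the sequence length
    tf.keras.layers.Conv1D(filters=32, kernel_size=4, padding="same",
                           activation="relu", input_shape=[None, 5]),
    tf.keras.layers.GRU(32, return_sequences=True),
    tf.keras.layers.Dense(14)
])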