Hands On Machine Learning Chapter 11 - Training Deep Neural Networks

I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped the neural network chapters entirely on that first pass.

Training Deep Neural Networks

Tackling complex problems may require a deep network with 10 layers or more, each containing hundreds of neurons, connected by hundreds of thousands of connections. Training such a network would not be a walk in the park:

  1. You would be faced with the tricky vanishing gradients problem (or the related exploding gradients problem) that affects deep neural networks and makes lower layers very hard to train.
  2. You might not have enough training data for such a large network, or it might be too costly to label.
  3. Training may be extremely slow.
  4. A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances, or they are too noisy.

Vanishing / Exploding Gradients Problem

The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step. Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks. More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds. A paper titled "Understanding the Difficulty of Training Deep Feedforward Neural Networks" by Xavier Glorot and Yoshua Bengio found a few suspects, including the combination of the popular sigmoid (logistic) activation function and the weight initialization technique that was most popular at the time, namely random initialization using a normal distribution with a mean of 0 and a standard deviation of 1.

Looking at the logistic function below, you can see that when the inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.

Logistic Activation Function Saturation

Glorot and He Initialization

We want the signal to flow properly in both directions -> the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the fan-in and fan-out of the layer), but they proposed a good compromise that has proven to work well in practice: the connection weights of each layer must be initialized randomly as described below, where $\text{fan}_{\text{avg}} = (\text{fan}_{\text{in}} + \text{fan}_{\text{out}})/2$. This initialization is called Xavier initialization or Glorot initialization.

Glorot Initialization (when using the logistic activation function)

$$
\text{Normal distribution with mean 0 and variance } \sigma ^2 = \cfrac{1}{\text{fan}_{\text{avg}}} \\[0.25em]
\text{Or a uniform distribution between } -r \text{ and } +r \text{, with } r=\sqrt{\cfrac{3}{\text{fan}_\text{avg}}}
$$

Using Glorot initialization can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning. Papers such as "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" have provided similar strategies for different activation functions. These strategies differ only by the scale of the variance and whether they use $\text{fan}_{\text{in}}$ or $\text{fan}_{\text{avg}}$. The initialization strategy for the ReLU activation function is sometimes called He initialization.

Initialization Parameters for Each Type of Activation Function

You can change the initialization and distribution of a layer with the kernel_initializer argument for a Keras layer.
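
For example, here is a minimal sketch (the layer sizes are arbitrary, and keras is assumed to have been imported with from tensorflow import keras); the VarianceScaling initializer at the end is one way to get He initialization based on fan_avg rather than fan_in:

# Picking an initializer by name: He initialization with a normal distribution
keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal")

# He initialization with a uniform distribution based on fan_avg instead of fan_in
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg", distribution="uniform")
keras.layers.Dense(100, activation="sigmoid", kernel_initializer=he_avg_init)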

Nonsaturating Activation Functions

One of the key insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function. Unfortunately, the ReLU activation function is not perfect - it suffers from a problem known as the dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as $\text{LeakyReLU}_{\alpha}(z) = \max(\alpha z, z)$ (see the image below). The hyperparameter $\alpha$ defines how much the function "leaks": it is the slope of the function for $z < 0$ and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up. A 2015 paper compared several variants of the ReLU activation function, and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. They concluded that an alpha of 0.2 performs better than 0.01, and that for large datasets, alpha can be learned during training (parametric leaky ReLU).

Leaky ReLU
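
One way to use the leaky ReLU in Keras is to add a LeakyReLU layer right after the layer you want to apply it to; a minimal sketch with arbitrary layer sizes:

# Sketch: leaky ReLU applied to a hidden layer's outputs
model = keras.models.Sequential([
 keras.layers.Flatten(input_shape=[28, 28]),
 keras.layers.Dense(300, kernel_initializer="he_normal"),
 keras.layers.LeakyReLU(alpha=0.2),
 keras.layers.Dense(10, activation="softmax")
])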

In a 2015 paper, Djork-Arné Clevert et al. proposed a new activation function called the exponential linear unit (ELU) that outperformed all the ReLU variants in their experiments. See the image below for the activation function. It is like ReLU, with a few differences: it takes on negative values for z < 0, which allows the unit's average output to be closer to 0 and alleviates the vanishing gradients problem, and it has a nonzero gradient for z < 0, which avoids the dead neuron problem. The main drawback is that it is slower to compute than ReLU, but during training this is compensated for by the faster convergence rate.

ELU Activation Function

$$
\text{ELU}_{\alpha}(z) = \begin{cases} \alpha (\exp (z) - 1) & \text{ if } z < 0 \\ z & \text{ if } z \geq 0 \end{cases}
$$

ELU Activation Function

In a 2017 paper called "Self-Normalizing Neural Networks", the authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, this activation function often outperforms other activation functions very significantly for such neural nets (especially deep ones). However, there are a few conditions for self-normalization to happen:

  • Input features must be standardized (mean 0 and standard deviation 1).
  • Every hidden layer's weights must also be initialized using LeCun normal initialization (see the code sketch after this list).
  • The network's architecture must be sequential.
  • The paper only guarantees self-normalization if all layers are dense.
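
As a minimal sketch, a hidden layer satisfying the SELU + LeCun normal requirement could look like this (the layer size is arbitrary):

# SELU activation with LeCun normal initialization
layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal")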

What activation function should you choose for deep neural networks? Generally SELU > ELU > leaky ReLU > ReLU > tanh > logistic.

Batch Normalization

Although the initialization techniques above can significantly reduce the vanishing/exploding gradients problem at the beginning of training, they don't guarantee that it won't come back later. In a 2015 paper, Ioffe and Szegedy proposed a technique called Batch Normalization (BN) to address the vanishing/exploding gradients problems. The technique consists of adding an operation in the model just before or after the activation function of each hidden layer, simply zero-centering and normalizing each input, then scaling and shifting the result using two new parameter vectors per layer: one for scaling, the other for shifting. Four parameter vectors are learned in each batch-normalized layer: the output scale vector and the output offset vector are learned through regular backpropagation, and the final input mean vector and the final input standard deviation vector are estimated using an exponential moving average. The authors demonstrated that the technique considerably improved all the deep neural networks they experimented with, and BN acts like a regularizer, reducing the need for other regularization techniques. It does add some complexity to the model and there is a runtime penalty, however: each training epoch is slower, but convergence is faster, so fewer epochs are usually needed. See the code below for an example of implementing BatchNormalization in Keras. The authors of the BN paper argued in favor of adding the BN layers before the activation functions rather than after; to do this in Keras, you would remove the activation function from the hidden layers and add it as a separate layer after the BN layer (see the second code sketch below). The BatchNormalization class has quite a few hyperparameters you can tweak, but the defaults will usually be fine.

Batch Normalization Process

# Implementing Batch Normalization in Keras 
import tensorflow as tf 
from tensorflow import keras

model = keras.models.Sequential([
 keras.layers.Flatten(input_shape=[28, 28]),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(10, activation="softmax")
])
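
For reference, a sketch of the BN-before-activation variant mentioned above could look like the following (same architecture as above; use_bias=False is set because the BN layer already includes an offset parameter per input):

# Sketch: Batch Normalization layers placed before the activation functions
model = keras.models.Sequential([
 keras.layers.Flatten(input_shape=[28, 28]),
 keras.layers.BatchNormalization(),
 keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
 keras.layers.BatchNormalization(),
 keras.layers.Activation("elu"),
 keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
 keras.layers.BatchNormalization(),
 keras.layers.Activation("elu"),
 keras.layers.Dense(10, activation="softmax")
])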

Gradient Clipping

Another popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This is called Gradient Clipping. This technique is often used in recurrent neural networks, as Batch Normalization is tricky to use in RNNs. In Keras, implementing Gradient Clipping is just a matter of setting the clipvalue or clipnorm argument when creating an optimizer.
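
A minimal sketch (the threshold of 1.0 is just an example value):

# Clip every component of the gradient to the range [-1.0, 1.0]
optimizer = keras.optimizers.SGD(clipvalue=1.0)
# Or clip the whole gradient by its L2 norm, which preserves its direction
optimizer = keras.optimizers.SGD(clipnorm=1.0)
# Then compile the model with this optimizer as usual
model.compile(loss="mse", optimizer=optimizer)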

Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing model that accomplishes a task similar to the one you are trying to tackle, and then just reuse the lower layers of this network in a process called transfer learning. It will not only speed up training considerably, but will also require much less training data.

Reusing Pretrained Layers

If the input pictures of your new task don’t have the same size as the ones used in the original task, you will usually have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will work best when the inputs have similar low-level features.

The output layer of the original model should usually be replaced since it is most likely not useful for the new task, and it may not even have the right number of outputs for the new task. The upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse - the more similar the tasks, the more layers you want to reuse. Try freezing all the reused layers first (make their weights non-trainable, so gradient descent won't modify them), then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze. It is also useful to reduce the learning rate when you unfreeze reused layers: this will avoid wrecking their fine-tuned weights. If you cannot get good performance, and you have little training data, try dropping the top hidden layer and freezing all remaining hidden layers again. You can iterate until you find the right number of layers to reuse.

See the code below for an example of reusing a model. Don't immediately trust papers that report results that seem too good. Note that transfer learning does not work very well with small dense networks; it works best with deep convolutional neural networks.

model_A = keras.models.load_model("my_model_A.h5")
# Reusing all layers except the output layer
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

# When you train model_B_on_A, it will also affect model_A. If you want to avoid that,
# you need to clone model_A before you reuse its layers. To do this, you must clone model A's architecture
# then copy its weights
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())


# Since the new output layer was initialized randomly, it will make large errors during the first few epochs
# and the error gradients may be large enough to wreck the reused weights
# To avoid this -> freeze the layers
# Freezing the layers 
for layer in model_B_on_A.layers[:-1]:
 layer.trainable = False

# You must always compile the model after you freeze or unfreeze layers
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",metrics=["accuracy"])

# You can unfreeze the reused layers after training for a few epochs 
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,validation_data=(X_valid_B, y_valid_B))

# Unfreeze layers after training for a couple of epochs
for layer in model_B_on_A.layers[:-1]:
 layer.trainable = True

optimizer = keras.optimizers.SGD(lr=1e-4) # the default lr is 1e-3
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,validation_data=(X_valid_B, y_valid_B))


Unsupervised Pretraining

Suppose you want to tackle a complex task for which you don't have much labeled training data, but unfortunately you cannot find a model trained on a similar task. In that case, you may be able to perform unsupervised pretraining. If you can gather plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines or autoencoders. Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can add the output layer for your task and fine-tune the final network using supervised learning (with the labeled training examples). At this point, you can unfreeze all the pretrained layers, or just some of the upper ones.

Unsupervised Pretraining

Unsupervised pretraining is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.

Pretraining on an Auxiliary Task

If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network. For natural language processing (NLP) applications, you can easily download millions of text documents and automatically generate labeled training data from them. Self-supervised learning is when you automatically generate labels from the data itself, then train a model on the resulting "labeled" dataset using supervised learning techniques. Since this approach requires no human labeling whatsoever, it is best classified as a form of unsupervised learning.

Faster Optimizers

Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training:

  1. applying a good initialization strategy for connection weights
  2. Using a good activation function
  3. Using Batch normalization
  4. Reusing parts of a pretrained network

Another speed boost comes from using a faster optimizer than a gradient descent optimizer.

Momentum Optimization

Momentum optimization, proposed by Boris Polyak in 1964, cares a great deal about what the previous gradients were, not just the current gradient as in regular Gradient Descent ($\bm{\theta} \leftarrow \bm{\theta} - \eta \nabla _{\bm{\theta}} J(\bm{\theta})$, where $\bm{\theta}$ are the weights, $\nabla _{\bm{\theta}} J(\bm{\theta})$ is the gradient of the cost function with regard to the weights, and $\eta$ is the learning rate): at each iteration, it subtracts the local gradient from the momentum vector m (multiplied by the learning rate), and it updates the weights by simply adding this momentum vector. In other words, the gradient is used for acceleration, not speed. The algorithm introduces the $\beta$ hyperparameter below, which acts like friction. Gradient Descent goes down a steep slope quite fast, but then it takes a long time to go down the valley. Momentum optimization will roll down the valley faster and faster until it reaches the bottom (the optimum).

Momentum Algorithm

$$
\textbf{m} \leftarrow \beta \textbf{m} - \eta \nabla _{\bm{\theta}} J(\bm{\theta}) \\
\bm{\theta} \leftarrow \bm{\theta} + \textbf{m}
$$

# Momentum Optimization in Keras
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9)

Nesterov Accelerated Gradient

One small variant of Momentum optimization, proposed by Yurii Nesterov in 1983, is almost always faster than vanilla Momentum optimization. The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. The only difference from vanilla Momentum optimization is that the gradient is measured at $\bm{\theta} + \beta \textbf{m}$ rather than at $\bm{\theta}$. This small tweak works because in general the momentum vector will be pointing in the right direction (toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position, as seen in the image below. As you can see, the Nesterov update ends up slightly closer to the optimum.

Nesterov Accelerated Algorithm

$$
\textbf{m} \leftarrow \beta \textbf{m} - \eta \nabla _{\bm{\theta}} J(\bm{\theta} + \beta \textbf{m}) \\
\bm{\theta} \leftarrow \bm{\theta} + \textbf{m}
$$

Regular versus Nesterov Momentum Optimization

# NAG in Keras 
optimizer = keras.optimizers.SGD(lr=0.001, momentum=0.9, nesterov=True)

AdaGrad

The AdaGrad algorithm points its updates a bit more directly toward the global optimum than Gradient Descent does, by scaling down the gradient vector along the steepest dimensions. In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum, and it requires less tuning of the learning rate hyperparameter. However, the learning rate gets scaled down so much that AdaGrad often stops too early when training neural networks.

AdaGrad Algorithm

$$
\textbf{s} \leftarrow \textbf{s} + \nabla _{\bm{\theta}} J(\bm{\theta}) \otimes \nabla _{\bm{\theta}} J(\bm{\theta}) \\[0.5em]
\bm{\theta} \leftarrow \bm{\theta} - \eta \nabla _{\bm{\theta}} J(\bm{\theta}) \oslash \sqrt{\textbf{s} + \epsilon} \\[0.5em]
\otimes \text{ means element-wise multiplication} \\
\oslash \text{ means element-wise division}
$$

AdaGrad versus Gradient Descent
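
For completeness, AdaGrad is available in Keras too (though, as noted above, it is usually not a good fit for deep networks); a minimal sketch:

# AdaGrad in Keras
optimizer = keras.optimizers.Adagrad(lr=0.001)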

RMSProp

The RMSProp algorithm fixes AdaGrad's problem of slowing down too fast and never converging to the global optimum, by accumulating only the gradients from the most recent iterations (using exponential decay with the hyperparameter $\beta$, typically set to 0.9). Except for very simple problems, this optimizer almost always performs much better than AdaGrad. It was the preferred optimizer until Adam optimization came around.

RMSProp Algorithm

$$
\textbf{s} \leftarrow \beta \textbf{s} + (1-\beta) \nabla _{\bm{\theta}} J(\bm{\theta}) \otimes \nabla _{\bm{\theta}} J(\bm{\theta}) \\[0.5em]
\bm{\theta} \leftarrow \bm{\theta} - \eta \nabla _{\bm{\theta}} J(\bm{\theta}) \oslash \sqrt{\textbf{s} + \epsilon}
$$

# RMSProp in Keras 
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9)

Adam and Nadam Optimization

Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization, it keeps track of an exponentially decaying average of past gradients, and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients. Nadam optimization is simply Adam optimization plus the Nesterov trick, so it will often converge slightly faster than Adam.

Adam Algorithm

# Adam Algorithm in Keras
optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)

Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate hyperparameter $\eta$. You can often use the default value, making Adam even easier to use than Gradient Descent.

Learning Rate Scheduling

Finding a good learning rate can be tricky. If you set it way too high, training may actually diverge. If you set it too low, training will eventually converge to the optimum, but it will take a very long time.

Learning Curves For Various Learning Rates eta

If you start with a high learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. These strategies are called learning schedules, the most common of which are:

  • Power Scheduling
    • Set the learning rate to a function of the iteration number t: $\eta (t) = \eta _0 / (1 + t/s)^c$ (with the power $c$ typically set to 1).
  • Exponential Scheduling
    • Set the learning rate to $\eta (t) = \eta _0 \, 0.1^{t/s}$. The learning rate will drop by a factor of 10 every $s$ steps.
  • Piecewise Constant Scheduling
    • Use a constant learning rate for a number of epochs, then a smaller learning rate for another number of epochs, and so on.
  • Performance Scheduling
    • Measure the validation error every N steps and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.

Implementing power scheduling in Keras is the easiest option: just set the decay hyperparameter when creating an optimizer. The decay is the inverse of $s$ (the number of steps it takes to divide the learning rate by one more unit), and Keras assumes $c = 1$. Exponential scheduling and piecewise scheduling are simple too - you just need to create a LearningRateScheduler callback.
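
A rough sketch of both options (the learning rates and the choice of s = 20 epochs are arbitrary illustration values):

# Power scheduling: the decay argument is 1/s (and c = 1)
optimizer = keras.optimizers.SGD(lr=0.01, decay=1e-4)

# Exponential scheduling via a LearningRateScheduler callback
def exponential_decay_fn(epoch):
 return 0.01 * 0.1**(epoch / 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
# history = model.fit(X_train, y_train, epochs=25, callbacks=[lr_scheduler])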

Avoiding Overfitting Through Regularization

Deep Neural networks typically have thousands of parameters, sometimes even millions. With so many parameters, the network has an incredible amount of freedom and can fit a huge variety of complex datasets. This flexibility also means that it is prone to overfitting the training set. We already explored two good regularization techniques: Batch Normalization and early stopping. Here are some other techniques:

l1 and l2 Regularization

You can use $\ell _1$ and $\ell _2$ regularization to constrain a neural network's connection weights. Here's how to do it in Keras (see the code below). You typically want to use the same regularizer for all layers in the network, as well as the same activation function and the same initialization strategy in all hidden layers. See the code below for a way to write cleaner code that avoids re-writing all these parameters (using functools.partial).

## L2 
layer = keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",kernel_regularizer=keras.regularizers.l2(0.01))
## L1
layer = keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",kernel_regularizer=keras.regularizers.l1(0.01))
from functools import partial
RegularizedDense = partial(keras.layers.Dense,activation="elu",kernel_initializer="he_normal",kernel_regularizer=keras.regularizers.l2(0.01))
model = keras.models.Sequential([
 keras.layers.Flatten(input_shape=[28, 28]),
 RegularizedDense(300),
 RegularizedDense(100),
 RegularizedDense(10, activation="softmax",
 kernel_initializer="glorot_uniform")
])

Dropout

Dropout is one of the most popular regularization techniques for deep neural networks. It was proposed by Geoffrey Hinton in 2012 and further detailed in a later paper, and it has proved to be highly successful: even state-of-the-art neural networks get a 1-2% accuracy boost simply by adding dropout. It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability $p$ of being temporarily "dropped out", meaning that it will be entirely ignored during this training step, but it may be active during the next step. The hyperparameter $p$ is called the dropout rate and it is typically set to 50%. After training, neurons don't get dropped anymore. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better. We need to multiply each input connection weight by the keep probability $(1 - p)$ after training to compensate for the fact that each neuron will be connected to $1 / (1 - p)$ times as many input neurons as it was (on average) during training. If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.

Dropout Regularization

# Dropout in Keras
model = keras.models.Sequential([
 keras.layers.Flatten(input_shape=[28, 28]),
 keras.layers.Dropout(rate=0.2),
 keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
 keras.layers.Dropout(rate=0.2),
 keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
 keras.layers.Dropout(rate=0.2),
 keras.layers.Dense(10, activation="softmax")
])

Monte-Carlo (MC) Dropout

In a 2016 paper, more good reasons to use dropout were given:

  • It gave dropout a solid mathematical justification by establishing a connection between dropout networks and approximate Bayesian inference.
  • It introduced a powerful technique called MC Dropout, which can boost the performance of any trained dropout model without having to retrain it.
  • MC Dropout is easy to implement without retraining (see the code below):
import numpy as np

with keras.backend.learning_phase_scope(1): # force training mode = dropout on
 y_probas = np.stack([model.predict(X_test_scaled) for sample in range(100)])
y_proba = y_probas.mean(axis=0)

Max-Norm Regularization

Another regularization technique that is popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights $\textbf{w}$ of the incoming connections such that $\lVert \textbf{w} \rVert _2 \leq r$, where $r$ is the max-norm hyperparameter and $\lVert \cdot \rVert _2$ is the $\ell _2$ norm. Max-norm regularization is typically implemented by computing $\lVert \textbf{w} \rVert _2$ after each training step and rescaling (clipping) the weight vector if needed.
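
In Keras, this can be done with a kernel constraint; a minimal sketch (r = 1 is just an example value, and the layer size is arbitrary):

# Sketch: max-norm regularization via a kernel constraint
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
 kernel_constraint=keras.constraints.max_norm(1.))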

Summary and Practical Guidelines

The configurations in the table below will work fine in most cases, without hyperparameter tuning. Don't forget to standardize the input features.

Default DNN Configuration

The table above could be tweaked:

  • If your model self-normalizes:
    • If it overfits the training set, then you should add alpha dropout (and always use early stopping as well). Do not use other regularization methods, or else they would break self-normalization.
  • If the model cannot self-normalize:
    • You can try using ELU instead of SELU, it may perform better.
    • If it is a deep network, you should use Batch Normalization after every hidden layer. If it overfits the training set, you can also try using max-norm or $\ell _2$ regularization.
  • If you need a sparse model, you can use $\ell _1$ regularization. If you need an even sparser model, you can try using FTRL instead of Nadam optimization, along with $\ell _1$ regularization.
  • If you need a low-latency model (in terms of making predictions), you may need to use fewer layers, avoid Batch Normalization, and possibly replace the SELU activation function with the leaky ReLU. Having a sparse model will also help. You could also reduce the float precision.
  • If you are building a risk-sensitive application, or if inference latency is not very important in your application, you can use MC Dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates.