Training Deep Neural Networks Exercises Answers
This chapter covers the main considerations for training deep neural networks with Keras.
Question 1
Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?
No, all weights should be sampled independently; they should not all have the same initial value. One important goal of sampling weights randomly is to break symmetry: if all the weights have the same initial value, even if that value is not zero, then symmetry is not broken (i.e., all neurons in a given layer are equivalent), and backpropagation will be unable to break it. Concretely, this means that all the neurons in any given layer will always have the same weights. It's like having just one neuron per layer, and much slower. It is virtually impossible for such a configuration to converge to a good solution.
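A minimal sketch (assuming TensorFlow/Keras): He initialization is requested per layer via the kernel_initializer argument, and each weight is drawn independently, so neurons in the same layer start out different and symmetry is broken.
from tensorflow import keras
# Every weight in this layer is sampled independently from a He-scaled normal distribution.
hidden = keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal")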
Question 2
Is it okay to initialize the bias terms to 0?
It is perfectly fine to initialize the bias terms to zero. Some people like to initialize them just like weights, and that's OK too; it does not make much difference.
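For reference, zero bias initialization is already the Keras default for Dense layers; a minimal sketch making it explicit:
from tensorflow import keras
# bias_initializer="zeros" is the default, so spelling it out changes nothing.
hidden = keras.layers.Dense(100, activation="relu",
                            kernel_initializer="he_normal",
                            bias_initializer="zeros")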
Question 3
Name three advantages of the SELU activation function over ReLU.
- SELU takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem. The parameter α defines the value that the function approaches when z is a large negative number; for SELU it is fixed (at about 1.67) rather than being a tunable hyperparameter.
- It has a nonzero gradient for z < 0, which avoids the dead neurons problem.
- Provided the conditions for self-normalization are met (standardized inputs, LeCun normal initialization, a plain stack of dense layers), SELU keeps each layer's output mean close to 0 and its standard deviation close to 1 throughout training, which largely avoids the unstable gradients problem and speeds up convergence. A usage sketch follows below.
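A minimal sketch (assuming TensorFlow/Keras) of a dense layer set up for SELU; for the self-normalizing behavior it should be paired with LeCun normal initialization and standardized inputs:
from tensorflow import keras
# SELU with LeCun normal initialization, as required for self-normalization.
selu_hidden = keras.layers.Dense(100, activation="selu",
                                 kernel_initializer="lecun_normal")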
Question 4
In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?
You should use leaky ReLU when you are otherwise using the ReLU activation function and you notice that during training some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting 0s, and Gradient Descent does not affect it anymore, since the gradient of the ReLU function is 0 when its input is negative.
The vanishing gradients problem, where gradients get smaller and smaller as the algorithm progresses down to the lower layers, is most common in deep neural networks. The exploding gradients problem, in which the gradients grow bigger and bigger until many layers get insanely large weights and the algorithm diverges, is mostly encountered in recurrent neural networks. More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.
In general, SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network's architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). If you care about runtime latency, then you may prefer leaky ReLU. If you do not want to tweak yet another hyperparameter, you may just use the default α values used by Keras. If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting or PReLU if you have a huge training set. As for the remaining functions in the question: tanh is mostly useful when the output must lie in the range (-1, 1), the logistic (sigmoid) function is mainly used in the output layer for binary classification or to estimate probabilities, and softmax is used in the output layer for mutually exclusive multiclass classification; none of these are usually the best choice for hidden layers nowadays.
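As a minimal sketch (assuming TensorFlow/Keras), leaky ReLU and PReLU are available as standalone layers placed after a linear Dense layer:
from tensorflow import keras
model_part = keras.Sequential([
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),   # default negative slope of 0.3
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.PReLU(),       # the negative slope itself is learned during training
])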
Question 5
What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?
If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer, then the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum. Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value.
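A minimal sketch (assuming TensorFlow/Keras) of an SGD optimizer with momentum; a value around 0.9 is a common choice, while pushing it toward 1 produces the overshooting behavior described above:
import tensorflow as tf
# momentum=0.9 usually works well; 0.99999 would make the optimizer overshoot and oscillate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)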
Question 6
Name three ways you can produce a sparse model.
- Train the model as usual, then get rid of the tiny weights by setting them to 0. This will not typically lead to a very sparse model, and it may degrade the model's performance.
- Apply a strong ℓ1 regularization during training, as it pushes the optimizer to zero out as many weights as it can.
- Apply Dual Averaging, often called Follow The Regularized Leader (FTRL), a technique proposed by Yurii Nesterov. When used with ℓ1 regularization, this technique often leads to very sparse models. Keras implements a variant of FTRL called FTRL-Proximal in its FTRL optimizer (a brief sketch of the second and third options follows below).
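A minimal sketch (assuming TensorFlow/Keras) of the second and third options, with illustrative regularization strengths:
import tensorflow as tf
from tensorflow import keras
# Strong l1 regularization pushes many weights to exactly 0 during training.
sparse_hidden = keras.layers.Dense(100, activation="relu",
                                   kernel_initializer="he_normal",
                                   kernel_regularizer=keras.regularizers.l1(0.01))
# The Ftrl optimizer (FTRL-Proximal) combined with l1 regularization also yields sparse models.
ftrl_optimizer = tf.keras.optimizers.Ftrl(learning_rate=0.01,
                                          l1_regularization_strength=0.01)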
Question 7
Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?
Yes, dropout does slow down training, in general roughly by a factor of two. However, it has no impact on inference speed since it is only turned on during training. MC Dropout is exactly like dropout during training, but it is still active during inference, so each inference is slowed down slightly. More importantly, when using MC Dropout you generally want to run inference 10 times or more to get better predictions. This means that making predictions is slowed down by a factor of 10 or more.
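A minimal sketch of MC Dropout at inference time (assuming TensorFlow/Keras, a trained model named model that contains Dropout layers, and a test set X_test): dropout is kept active by passing training=True, and several stochastic predictions are averaged.
import numpy as np
# Run 100 stochastic forward passes with dropout active, then average the probabilities.
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)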
Question 8
Deep Learning
Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.
Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.
Tune the hyperparameters using cross-validation and see what precision you can achieve.
Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?
Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
# Load Data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
print("X_train Shape:",X_train.shape)
print("y_train Shape:",y_train.shape)
print("X_test Shape:",X_test.shape)
print("y_test Shape:",y_test.shape)
y_train_numeric, y_test_numeric = y_train.astype(np.int64), y_test.astype(np.int64)
zero_to_four_train_indices, zero_to_four_test_indices = np.nonzero(y_train_numeric <= 4)[0], np.nonzero(y_test_numeric <= 4)[0]
five_to_nine_train_indices, five_to_nine_test_indices = np.nonzero(y_train_numeric >= 5)[0], np.nonzero(y_test_numeric >= 5)[0]
print("\nEnsuring Correct Splitting of Data:\n-------------------------------------------------")
print("Train Indices Correct:",zero_to_four_train_indices.shape[0] + five_to_nine_train_indices.shape[0] == y_train.shape[0])
print("Test Indices Correct:",zero_to_four_test_indices.shape[0] + five_to_nine_test_indices.shape[0] == y_test.shape[0])
# Get Data for Question 8
X_train_0_4, X_test_0_4 = X_train[zero_to_four_train_indices,:,:], X_test[zero_to_four_test_indices,:,:]
y_train_0_4, y_test_0_4 = y_train[zero_to_four_train_indices], y_test[zero_to_four_test_indices]
print("y_train_0_4 Unique Values:",pd.Series(np.concatenate((y_train_0_4,y_test_0_4),axis=0)).unique())
# Get Data for Question 9
X_train_5_9, X_test_5_9 = X_train[five_to_nine_train_indices,:,:], X_test[five_to_nine_test_indices,:,:]
y_train_5_9, y_test_5_9 = y_train[five_to_nine_train_indices], y_test[five_to_nine_test_indices]
print("y_train_5_9 Unique Values:",pd.Series(np.concatenate((y_train_5_9,y_test_5_9),axis=0)).unique())
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,2,layout="constrained")
ax[0].imshow(X_train_0_4[0,:,:],cmap="gray")
ax[0].set_title("0 to 4 Training Set")
ax[1].imshow(X_train_5_9[0,:,:],cmap="gray")
ax[1].set_title("5 to 9 Training Set")
plt.show()
import keras_tuner as kt
"""
Since the target values are integer class labels rather than one-hot encoded vectors,
use sparse categorical cross-entropy as the loss.
"""
loss = tf.keras.losses.SparseCategoricalCrossentropy()
metrics = ['accuracy']
def model_builder(hp):
    """
    Build a model with hyperparameter tuning.
    """
    model = tf.keras.Sequential()
    model.add(keras.layers.Input(shape=X_train_0_4.shape[1:]))
    model.add(keras.layers.Lambda(lambda x: x / 255)) # Scale the input
    model.add(keras.layers.Flatten(name="Flatten")) # Flatten the input
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden1_0_4"))
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden2_0_4"))
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden3_0_4"))
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden4_0_4"))
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden5_0_4"))
    # Classifying Images 0-4
    model.add(keras.layers.Dense(5, activation="softmax",name="Output"))
    # Candidate learning rates for the tuner to choose from
    hp_learning_rate = hp.Choice('learning_rate',values=[1e-2,1e-3,1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),loss=loss,metrics=metrics)
    return model
import os
"""
The `fit()` method accepts a `callbacks` argument that lets you specify a list of objects that Keras will call during training at the start and end of training, at the start and end of each epoch, and even before and after processing each batch. The checkpoint callback below saves checkpoints of the model at regular intervals during training, by default at the end of each epoch.
"""
sve_model_cb = keras.callbacks.ModelCheckpoint(filepath=os.path.join(os.getcwd(),'mnist_model_0_4.keras'),save_best_only=True,verbose=1)
"""
Visualization Using Tensorboard
--------------------------------------------
> TensorBoard is a great interactive visualization tool that you can use to view the learning curves during training, compare learning curves between multiple runs, visualize the computation graph, analyze training statistics, view images generated by your model, visualize complex multidimensional data projected down to 3D and automatically clustered for you, and more! This tool is installed automatically when you install TensorFlow, so you already have it.
To use it, you must modify your program so that it outputs the data you want to visualize to special binary log files called event files. Each binary record is called a summary. The TensorBoard server will monitor the log directory, and it will automatically pick up the changes and update the visualizations: this allows you to visualize live data (with a short delay), such as the learning curves during training.
"""
root_logdir = os.path.join(os.getcwd(), "ch11_mnist_0_4_logs")
def get_run_logdir():
    """
    Returns a string that represents a path to the log directory + the run id.
    """
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)
run_logdir = get_run_logdir()
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
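# To view the logs while training runs, start the TensorBoard server from a terminal
# (TensorBoard is installed along with TensorFlow), for example:
#   tensorboard --logdir ./ch11_mnist_0_4_logs
# then open the URL it prints (http://localhost:6006 by default).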
"""
> If you use a validation set during training, you can set save_best_only=True when creating the ModelCheckpoint. In this case, it will only save your model when its performance on the validation set is the best so far. This way, you do not need to worry about training for too long and overfitting the training set: simply restore the last model saved after training, and it will be the best model on the validation set. This is a simple way to implement early stopping.
"""
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,restore_best_weights=True)
tuner = kt.Hyperband(model_builder,objective="val_accuracy",max_epochs=10)
tuner.search(X_train_0_4,y_train_0_4,epochs=100,validation_split=0.2,callbacks=[early_stopping_cb,tensorboard_cb,sve_model_cb])
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model = tuner.hypermodel.build(best_hyperparameters)
history = model.fit(X_train_0_4,y_train_0_4,epochs=100,validation_split=0.2,callbacks=[early_stopping_cb,tensorboard_cb,sve_model_cb])
import pandas as pd
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(-0.1, 1.1) # set the vertical range to [-0.1, 1.1]
plt.gca().set_title("Model History without Batch Normalization After Hyperparameter Tuning")
plt.show()
print(best_hyperparameters.values) # dictionary of the selected hyperparameter values
def model_builder_2(hp):
    """
    Build a model with hyperparameter tuning - including dropout and batch normalization.
    """
    model = tf.keras.Sequential()
    model.add(keras.layers.Input(shape=X_train_0_4.shape[1:]))
    model.add(keras.layers.Lambda(lambda x: x / 255)) # Scale the input
    model.add(keras.layers.Flatten(name="Flatten")) # Flatten the input
    if hp.Boolean("batch_norm_1"):
        model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden1_0_4"))
    if hp.Boolean("dropout_1"):
        model.add(keras.layers.Dropout(rate=0.2))
    if hp.Boolean("batch_norm_2"):
        model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden2_0_4"))
    if hp.Boolean("dropout_2"):
        model.add(keras.layers.Dropout(rate=0.2))
    if hp.Boolean("batch_norm_3"):
        model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden3_0_4"))
    if hp.Boolean("dropout_3"):
        model.add(keras.layers.Dropout(rate=0.2))
    if hp.Boolean("batch_norm_4"):
        model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden4_0_4"))
    if hp.Boolean("dropout_4"):
        model.add(keras.layers.Dropout(rate=0.2))
    if hp.Boolean("batch_norm_5"):
        model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(100,kernel_initializer="he_normal",activation="elu",name="Hidden5_0_4"))
    if hp.Boolean("dropout_5"):
        model.add(keras.layers.Dropout(rate=0.2))
    # Classifying Images 0-4
    model.add(keras.layers.Dense(5, activation="softmax",name="Output"))
    # Candidate learning rates for the tuner to choose from
    hp_learning_rate = hp.Choice('learning_rate',values=[1e-2,1e-3,1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),loss=loss,metrics=metrics)
    return model
sve_model_cb_2 = keras.callbacks.ModelCheckpoint(filepath=os.path.join(os.getcwd(),'models','mnist_model_0_4v2.keras'),save_best_only=True,verbose=1)
root_logdir = os.path.join(os.getcwd(), "ch11_mnist_0_4_logs_v2")
def get_run_logdir():
    """
    Returns a string that represents a path to the log directory + the run id.
    """
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)
run_logdir = get_run_logdir()
tensorboard_cb_2 = keras.callbacks.TensorBoard(run_logdir)
tuner = kt.Hyperband(model_builder_2,objective="val_accuracy",max_epochs=10)
tuner.search(X_train_0_4,y_train_0_4,epochs=100,validation_split=0.2,callbacks=[early_stopping_cb,tensorboard_cb_2,sve_model_cb_2])
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model = tuner.hypermodel.build(best_hyperparameters)
history = model.fit(X_train_0_4,y_train_0_4,epochs=100,validation_split=0.2,callbacks=[early_stopping_cb,tensorboard_cb_2,sve_model_cb_2])
Question 9
Transfer learning.
- Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one.
- Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?
- Try caching the frozen layers, and train the model again: how much faster is it now?
- Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?
- Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?
model_0_4_load = keras.models.load_model(os.path.join(os.getcwd(),'models','mnist_model_0_4v2.keras'),safe_mode=False)
model_0_4_load_clone = keras.models.clone_model(model_0_4_load)
"""
clone_model() copies the model's architecture but not its weights, so the weights must be copied over explicitly.
"""
model_0_4_load_clone.set_weights(model_0_4_load.get_weights())
layers = [
model_0_4_load_clone.get_layer('Hidden1_0_4'),
model_0_4_load_clone.get_layer('Hidden2_0_4'),
model_0_4_load_clone.get_layer('Hidden3_0_4'),
model_0_4_load_clone.get_layer('Hidden4_0_4')
]
model_5_9_on_0_4 = keras.models.Sequential()
model_5_9_on_0_4.add(keras.layers.Lambda(lambda x: x / 255,name="Scale_Inputs")) # Scale the input
model_5_9_on_0_4.add(keras.layers.Flatten(name="Flatten"))
for layer in layers:
    model_5_9_on_0_4.add(layer)
model_5_9_on_0_4.add(keras.layers.Dense(5, activation="softmax",name="Output_5_9"))
for layer in model_5_9_on_0_4.layers[:-1]:
    layer.trainable = False
# The model must be compiled after freezing layers; it is compiled below with
# sparse categorical cross-entropy and the Adam optimizer once the callbacks are set up.
sve_model_cb_2 = keras.callbacks.ModelCheckpoint(filepath=os.path.join(os.getcwd(),'models','mnist_model_5_9_v1.keras'),save_best_only=True,verbose=1)
root_logdir = os.path.join(os.getcwd(), "ch11_mnist_5_9_logs_v1")
def get_run_logdir():
    """
    Returns a string that represents a path to the log directory + the run id.
    """
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)
run_logdir = get_run_logdir()
tensorboard_cb_2 = keras.callbacks.TensorBoard(run_logdir)
loss = tf.keras.losses.SparseCategoricalCrossentropy()
metrics = ['accuracy']
model_5_9_on_0_4.compile(loss=loss,metrics=metrics,optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
"""
Generate a training set so that it contains 100 samples for each digit
"""
X_training_data_100_each = None
y_train_data_100_each = None
for i in range(5,10):
y_train_indices = np.nonzero(y_train_5_9[y_train_5_9==i])[0]
y_train_indices = y_train_indices[:100]
y_train_add = y_train_5_9[y_train_indices]
X_train_add = X_train_5_9[y_train_indices,:,:]
if X_training_data_100_each is None:
X_training_data_100_each = X_train_add
y_train_data_100_each = y_train_add
else:
X_training_data_100_each = np.vstack((X_training_data_100_each,X_train_add))
y_train_data_100_each = np.concatenate((y_train_data_100_each,y_train_add),axis=0)
history = model_5_9_on_0_4.fit(X_training_data_100_each,y_train_data_100_each,epochs=100,validation_split=0.2,
                               callbacks=[early_stopping_cb,tensorboard_cb_2,sve_model_cb_2])