Introduction to Artificial Neural Networks with Keras Exercises Answers
This chapter introduces artificial neural networks and the Keras library.
Question 1
Visit the TensorFlow Playground at https://playground.tensorflow.org
- Layers and patterns: try training the default neural network by clicking the run button (top left). Notice how it quickly finds a good solution for the classification task. Notice that the neurons in the first hidden layer have learned simple patterns, while the neurons in the second hidden layer have learned to combine the simple patterns of the first hidden layer into more complex patterns. In general, the more layers, the more complex the patterns can be.
- Activation function: try replacing the Tanh activation function with the ReLU activation function, and train the network again. Notice that it finds a solution even faster, but this time the boundaries are linear. This is due to the shape of the ReLU function.
- Local minima: modify the network architecture to have just one hidden layer with three neurons. Train it multiple times (to reset the network weights, click the reset button next to the play button). Notice that the training time varies a lot, and sometimes it even gets stuck in a local minimum.
- Too small: now remove one neuron to keep just 2. Notice that the neural network is now incapable of finding a good solution, even if you try multiple times. The model has too few parameters and it systematically underfits the training set.
- Large enough: next, set the number of neurons to 8 and train the network several times. Notice that it is now consistently fast and never gets stuck. This highlights an important finding in neural network theory: large neural networks almost never get stuck in local minima, and even when they do these local optima are almost as good as the global optimum. However, they can still get stuck on long plateaus for a long time.
- Deep net and vanishing gradients: now change the dataset to be the spiral (bottom right dataset under “DATA”). Change the network architecture to have 4 hidden layers with 8 neurons each. Notice that training takes much longer, and often gets stuck on plateaus for long periods of time. Also notice that the neurons in the highest layers (i.e. on the right) tend to evolve faster than the neurons in the lowest layers (i.e. on the left). This problem, called the “vanishing gradients” problem, can be alleviated using better weight initialization and other techniques, better optimizers (such as AdaGrad or Adam), or using Batch Normalization.
- More: go ahead and play with the other parameters to get a feel for what they do. In fact, you should definitely play with this UI for at least one hour; it will grow your intuition about neural networks significantly.
Some Notes
- For the spiral dataset, 6 hidden layers with 8 neurons each (using only the two input features x1 and x2) produces a good model
- Only one neuron in the last hidden layer means that the neural network can only predict one class
- This was combined with a large learning rate
- Too large a learning rate can prevent the NN from finding a good solution for a while
- Adding more neurons per layer can make it slower for the NN to converge to a solution
- Adding nonlinear features helps with nonlinear data
- Increasing the learning rate may be more important with nonlinear data
- Seems like a learning rate greater than 0.03 is almost never called for
- A randomized learning rate seems like it may be a good way to avoid local minima, and backtracking to previous solutions may also help
Question 2
Draw an ANN using the original artificial neurons (like the ones in Figure 10-3) that computes A⊕B (where ⊕ represents the XOR operation).
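No drawing is included here, but one classic construction uses the identity A⊕B = (A∨B) ∧ ¬(A∧B): a hidden layer with an OR unit and an AND unit, and an output unit that fires when the OR unit is on and the AND unit is off. A minimal sketch with step-function threshold logic units (the particular weights and thresholds are just one valid choice):
def step(z):
    # Heaviside step: the unit fires (outputs 1) when its weighted sum reaches 0
    return 1 if z >= 0 else 0

def xor_network(A, B):
    h_or = step(A + B - 0.5)         # fires when at least one input is 1 (OR)
    h_and = step(A + B - 1.5)        # fires only when both inputs are 1 (AND)
    return step(h_or - h_and - 0.5)  # fires when OR is on but AND is off (XOR)

for A in (0, 1):
    for B in (0, 1):
        print(A, B, "->", xor_network(A, B))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0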
Question 3
Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of threshold logic units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?
It is generally preferable to use a Logistic Regression classifier rather than a classical Perceptron because a Logistic Regression classifier can output a class probability, while a Perceptron just makes predictions based on a hard threshold. Classical Perceptrons also offer no regularization.
If you change the Perceptron's activation function to the logistic activation function (or the softmax activation function if there are multiple neurons), and if you train it using Gradient Descent, then it becomes equivalent to a Logistic Regression classifier.
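As a minimal sketch of that tweak in Keras (the number of input features is an assumption), a single Dense unit with a sigmoid activation trained with SGD on the log loss behaves like a Logistic Regression classifier:
from tensorflow import keras

n_features = 2  # assumed number of input features

# One "neuron" with a logistic activation instead of a hard threshold,
# trained by Gradient Descent on the log loss: equivalent to Logistic Regression.
model = keras.models.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])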
Question 4
Why was the logistic activation function a key ingredient in training the first MLPs?
The logistic function σ(z) = 1 / (1 + exp(−z)) was essential in training the first MLPs because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined non-zero derivative everywhere, σ′(z) = σ(z)(1 − σ(z)), allowing Gradient Descent to make progress at every step. That steady progress is what lets the network converge to a good solution. The nonlinearity of the logistic function also ensures there is some nonlinearity between layers, which gives the MLP the ability to solve complex problems. (You need a nonlinear activation function because chaining linear functions just results in another linear function, and you cannot solve complex problems with that.)
Question 5
Name three popular activation functions. Can you draw them?
- The Sigmoid Activation Function (Logistic Function): Described above
- The Hyperbolic Tangent Function tanh(z) = 2σ(2z) − 1: Like the logistic function, it is S-shaped, continuous, and differentiable, but its output ranges from −1 to 1 (instead of 0 to 1 for the logistic function), which tends to make each layer's output more or less centered around 0 at the beginning of training. This often helps speed up convergence.
- The Rectified Linear Unit Function ReLU(z) = max(0, z): It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can cause Gradient Descent to bounce around), and its derivative is 0 for z < 0. In practice, however, it works very well and has the advantage of being fast to compute. The fact that it does not have a maximum output value also helps reduce some issues during Gradient Descent.
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 2, layout="constrained", figsize=(12, 6))
ax[0].set_title("Activation Functions")
ax[1].set_title("Derivatives")

z = np.linspace(-5, 5, 1000)

# Step (sign) function; its derivative is 0 wherever it is defined
y_step = np.where(z > 0, 1, -1)
dy_step = np.zeros_like(z)

# Sigmoid / logistic function; its derivative is sigma(z) * (1 - sigma(z))
y_sigmoid = 1 / (1 + np.exp(-z))
dy_sigmoid = y_sigmoid * (1 - y_sigmoid)

# Hyperbolic tangent; its derivative is 1 - tanh(z)^2
y_tanh = np.tanh(z)
dy_tanh = 1 - y_tanh ** 2

# ReLU; its derivative is 0 for z < 0 and 1 for z > 0
y_relu = np.where(z > 0, z, 0)
dy_relu = np.where(z > 0, 1, 0)
ax[0].plot(z,y_step,'b',label="Step")
ax[0].plot(z,y_sigmoid,'r-',label="Sigmoid / Logistic")
ax[0].plot(z,y_tanh,'g--',label="Tanh (Hyperbolic Tangent)")
ax[0].plot(z,y_relu,'m--',label="ReLU (Rectified Linear Unit Function)")
ax[0].axis((-4,4,-1.5,1.5))
ax[0].legend()
ax[0].grid(visible=True)
ax[0].set_xlabel("z")
ax[1].plot(z,dy_step,'b',label="Step")
ax[1].plot(z,dy_sigmoid,'r-',label="Sigmoid / Logistic")
ax[1].plot(z,dy_tanh,'g--',label="Tanh (Hyperbolic Tangent)")
ax[1].plot(z,dy_relu,'m--',label="ReLU (Rectified Linear Unit Function)")
ax[1].axis((-5,5,-0.2,1.2))
ax[1].legend()
ax[1].grid(visible=True)
fig.suptitle(r"Activation Functions and Derivatives for Neural Networks")
plt.show()
Question 6
Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.
- What is the shape of the input matrix X ?
- m×10 where m is the number of training instances.
- What about the shape of the hidden layer’s weight vector Wh and the shape of its bias vector bh ?
- Since the hidden layer has 50 neurons and there are 10 input features, its weight matrix will have the shape 10×50, because each of the 10 features is connected by a weight to each of the 50 hidden-layer neurons.
- The bias vector will have a length of 50: one bias term for each of the 50 hidden-layer neurons.
- What about the shape of the output layer's weight vector Wo and the shape of its bias vector bo?
- The output layer has 3 neurons. The input to the output layer is the output of the hidden layer. The output of the hidden layer has 50 neurons, therefore the shape of the output layer's weight vector is 50×3 .
- The bias vector will have a length of 3: one bias term for each of the 3 neurons in the output layer.
- What is the shape of the network’s output matrix Y ?
- The shape of the output matrix is m×3, where m is the batch size and 3 is the number of classes (the network produces one output value per class for each instance).
- Write the equation that computes the network’s output matrix Y as a function of X , Wh , bh , Wo , and bo ?
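- Since every neuron (including the output neurons) uses ReLU here, Y = ReLU(ReLU(X Wh + bh) Wo + bo).
A quick NumPy sketch with random values (the batch size m = 32 is just an assumption) to confirm the shapes:
import numpy as np

def relu(z):
    return np.maximum(z, 0)

m = 32                                  # assumed batch size
X = np.random.rand(m, 10)               # input matrix: m x 10
Wh, bh = np.random.rand(10, 50), np.zeros(50)
Wo, bo = np.random.rand(50, 3), np.zeros(3)

Y = relu(relu(X @ Wh + bh) @ Wo + bo)   # hidden output is m x 50, final output m x 3
print(Y.shape)                          # (32, 3)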
Question 7
How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.
You need one output layer neuron if you want to classify an email as spam or ham - assuming the two are mutually exclusive. You should use the logistic activation function.
MLPs can also be used for classification tasks. For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. Obviously, the estimated probability of the negative class is equal to one minus that number.
For multiclass classification problems like MNIST, you need one output neuron per class (in the case of MNIST, 10), and you should use the softmax activation function for the output layer. The softmax activation function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to one (which is required if the classes are exclusive). For the California housing problem from Chapter 2, you need one output neuron for predicting the price of the home, and you should not use any activation function for that output neuron so that it is free to output values in any range.
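As an illustrative sketch (the variable names are arbitrary), the three output-layer choices would look like this in Keras:
from tensorflow import keras

# Spam vs. ham: a single sigmoid output neuron (use binary_crossentropy loss)
spam_output = keras.layers.Dense(1, activation="sigmoid")

# MNIST: 10 output neurons with softmax, one per class
mnist_output = keras.layers.Dense(10, activation="softmax")

# California housing regression: one output neuron with no activation function
housing_output = keras.layers.Dense(1)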
Question 8
What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?
After the NN has gone through the forward pass, the algorithm measures the network's output error: it uses a loss function that compares the desired output with the actual output of the network and returns some measure of the error. Then it computes how much each output connection contributed to that error. This is done analytically by applying the chain rule (from calculus), which makes the step fast and precise. The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until it reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm). Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
In short: for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
Automatically computing gradients is called automatic differentiation, or autodiff. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called reverse-mode autodiff. It is fast and precise, and is well suited when the function to differentiate has many variables (e.g., connection weights) and few outputs (e.g., one loss). The difference is that backpropagation refers to the whole training procedure (reverse-mode autodiff to compute the gradients, plus a Gradient Descent step to apply them), whereas reverse-mode autodiff is simply the technique used to compute the gradients efficiently.
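A tiny illustration of reverse-mode autodiff using tf.GradientTape (the toy model and target value are made up): one forward pass records the computation, and one backward sweep returns the gradient of the loss with respect to every variable.
import tensorflow as tf

w = tf.Variable(2.0)
b = tf.Variable(1.0)
x, y_true = 3.0, 10.0              # made-up training instance

with tf.GradientTape() as tape:
    y_pred = w * x + b             # forward pass
    loss = (y_pred - y_true) ** 2  # measure the error

# Reverse pass: one sweep yields d(loss)/dw and d(loss)/db
dw, db = tape.gradient(loss, [w, b])
print(dw.numpy(), db.numpy())      # the gradients a Gradient Descent step would use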
Question 9
Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?
In general, the hyperparameters of a neural network you can adjust include the number of hidden layers, the number of neurons in each hidden layer, the activation function used in each hidden layer and in the output layer, the learning rate, the optimizer, the batch size, and the number of training epochs. For binary classification, use the logistic activation function in the output layer. For a multi-class problem, use softmax. For a regression problem, don't use an output activation function. If the MLP overfits the training data, simple things to try are reducing the number of hidden layers or the number of neurons per layer, or using early stopping so training halts before the model overfits.
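A minimal sketch of how those hyperparameters might be exposed (build_mlp is a hypothetical helper, and all the default values are assumptions):
from tensorflow import keras

def build_mlp(n_hidden=2, n_neurons=100, learning_rate=1e-3, input_shape=(10,)):
    # Each argument is one of the hyperparameters discussed above.
    model = keras.models.Sequential([keras.layers.Input(shape=input_shape)])
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1, activation="sigmoid"))  # binary classifier head
    model.compile(loss="binary_crossentropy",
                  optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
                  metrics=["accuracy"])
    return model

# If the model overfits, try a smaller network, e.g. build_mlp(n_hidden=1, n_neurons=30)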
Question 10
Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Try adding all the bells and whistles (i.e., save checkpoints, use early stopping, plot learning curves using TensorBoard, and so on)
import tensorflow as tf
from tensorflow import keras
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train / 255
X_train, X_val = X_train[:50000,:,:], X_train[50000:,:,:]
X_test = X_test / 255
"""
Commented steps only required if the model uses categorical_crossentropy loss instead of sparse_categorical_crossentropy
"""
y_train, y_val = y_train[:50000], y_train[50000:]
# y_train = tf.one_hot(y_train,10)
#
# y_test = tf.one_hot(y_test,10)
print("X_train Shape: {}".format(X_train.shape))
print("X_test Shape: {}".format(X_test.shape))
plt.imshow(X_train[0],cmap="gray")
plt.show()
"""
This line creates a Sequential model. This is the simplest kind of Keras model, for neural networks that are just composed of a single stack of layers, connected sequentially. This is called the Sequential API.
"""
"""
Layer 1: This layer is a Flatten layer whose role is simply to convert each image to a 1D array (each MNIST image is a 28x28 array, so Flatten converts it to a vector of 784 values)
Dense Layers: The code below adds four Dense hidden layers with 300 neurons each and the ReLU activation function. Each Dense layer manages
its own weight matrix, containing all the connection weights between the neurons and their inputs. It also manages a vector of bias terms (one per neuron). When it receives some input
data, it computes the equation described after the model summary below
Output Layer: We add a Dense output layer with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive)
"""
model = keras.models.Sequential([
keras.layers.Input(X_train.shape[1:]),
keras.layers.Flatten(name="Flatten"), # Flatten the input
keras.layers.Dense(300, activation="relu",name="First"),
keras.layers.Dense(300, activation="relu",name="Second"),
keras.layers.Dense(300, activation="relu",name="Third"),
keras.layers.Dense(300, activation="relu",name="Fourth"),
keras.layers.Dense(10, activation="softmax",name="Output")
])
"""
The method below displays a summary of the model's layers, their output shapes, and their parameter counts.
"""
print(model.summary())
Each Dense layer computes output = activation(X W + b), where X is its inputs, W its weight matrix (the kernel), b its bias vector, and activation its activation function.
print(model.layers)
print(model.layers[1].name)
print(model.get_layer('First').name)
weights, biases = model.get_layer('First').get_weights()
print("Weights:",weights)
print("Weight Shape: {}".format(weights.shape))
print("Biases:",biases)
print("Biases Shape: {}".format(biases.shape))
Notice that the Dense layer initialized the connection weights randomly (which is needed to break symmetry), and the biases were just initialized to zeros, which is fine. If you ever want to use a different initialization method, you can set kernel_initializer (kernel is another name for the matrix of connection weights) or bias_initializer when creating the layer.
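For example (a hedged sketch, not part of the model above), a hidden layer with He initialization for the connection weights would look like this:
# Same kind of hidden layer, but with He initialization for the connection
# weights; the biases still start at zero.
custom_layer = keras.layers.Dense(300, activation="relu",
                                  kernel_initializer="he_normal",
                                  bias_initializer="zeros")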
"""
After a model is created, you must call its `compile()` method to specify the loss function and the optimizer to use. You can also specify a list of extra metrics to compute during training and evaluation
loss
--------------------------------------
> We use the "sparse_categorical_crossentropy" loss because we have sparse labels (i.e., for each instance there is just a target class index, from 0 to 9 in this case), and the classes are exclusive. If instead we had one target probability per class for each instance (such as one-hot vectors, e.g. [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.] to represent class 3), then we would need to use the "categorical_crossentropy" loss instead. If we were doing binary classification (with one or more binary labels), then we would use the "sigmoid" (i.e., logistic) activation function in the output layer instead of the "softmax" activation function, and we would use the "binary_crossentropy" loss.
optimizer
--------------------------------------
> Secondly, regarding the optimizer, "sgd" simply means that we will train the model using simple Stochastic Gradient Descent. In other words, Keras will perform the backpropagation algorithm described earlier (i.e., reverse-mode autodiff + Gradient Descent).
metrics
------------------------------------
> Since this is a classifier, it's useful to measure its "accuracy" during training and evaluation
"""
model.compile(loss="sparse_categorical_crossentropy",optimizer="sgd",metrics=["accuracy"])
import os
"""
The `fit()` method accepts a `callbacks` argument that lets you specify a list of objects that Keras will call during training at the start and end of training, at the start and end of each epoch, and even before and after processing each batch. The checkpoint_cb below saves checkpoints of the model at regular intervals during training, by default at the end of each epoch.
"""
checkpoint_cb = keras.callbacks.ModelCheckpoint(os.path.join(os.getcwd(),'mnist_model.keras'),save_best_only=True)
"""
> if you use a validation set during training, you can set save_best_only=True when creating the ModelCheckpoint. In this case, it will only save your model when its performance on the validation set is the best so far. This way, you do not need to worry about training for too long and overfitting the training set: simply restore the last model saved after training, and this will be the best model on the validation set. This is a simple way to implement early stopping
"""
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,restore_best_weights=True)
"""
Visualization Using Tensorboard
--------------------------------------------
> TensorBoard is a great interactive visualization tool that you can use to view the learning curves during training, compare learning curves between multiple runs, visualize the computation graph, analyze training statistics, view images generated by your model, visualize complex multidimensional data projected down to 3D and automatically clustered for you, and more! This tool is installed automatically when you install TensorFlow, so you already have it.
To use it, you must modify your program so that it outputs the data you want to visualize to special binary log files called event files. Each binary record is called a summary. The TensorBoard server will monitor the log directory, and it will automatically pick up the changes and update the visualizations: this allows you to visualize live data (with a short delay), such as the learning curves during training.
"""
root_logdir = os.path.join(os.getcwd(), "ch10_mnist_logs")
def get_run_logdir():
"""
Returns a string that represents a path to the log directory + the run id
"""
import time
run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
return os.path.join(root_logdir, run_id)
run_logdir = get_run_logdir()
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
"""
The number of epochs can be set to a large value since training will stop automatically when there is no more progress. There is no need to restore the best model saved in this case since the EarlyStopping callback will keep track of the best weights and restore them for us at the end of training.
"""
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val),callbacks=[checkpoint_cb, early_stopping_cb, tensorboard_cb])
import pandas as pd
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.show()
print(model.evaluate(X_test, y_test))
X_new = X_test[:3]
y_proba = model.predict(X_new)
print(y_proba.round(2))
Command to launch TensorBoard: tensorboard --logdir=./ch10_mnist_logs --port=6096