Hands On Machine Learning Chapter 10 - Introduction to Neural Networks with Keras
I am re-reading Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped the neural network chapters on that first pass.
Introduction to Neural Networks with Keras
Looking at the brain's architecture to build an intelligent machine is the key idea that sparked Artificial Neural Networks (ANNs). ANNs have gradually become quite different from their biological cousins. ANNs are at the very core of Deep Learning.
From Biological to Artificial Neurons
ANNs were first introduced back in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts in their landmark paper, "A Logical Calculus of Ideas Immanent in Nervous Activity", in which they presented a simplified computational model of how biological neurons might work together in animal brains to perform complex computations using propositional logic. In the early 1980s, there was a revival of interest in connectionism (the study of neural networks), as new architectures were invented and better training techniques were developed. We are now witnessing another wave of interest in ANNs. There is some reason to believe this wave is different:
- The huge quantity of data available.
- The tremendous increase in computing power.
- Training algorithms have improved.
- Some theoretical limitations of ANNs have turned out to be benign in practice.
- More funding.
Biological Neurons
The image below shows a biological neuron. It is an unusual-looking cell mostly found in animal cerebral cortexes (e.g., your brain), composed of a cell body containing the nucleus and most of the cell's complex components, and many branching extensions called dendrites, plus one very long extension called the axon. The axon's length may be just a few times longer than the cell body, or up to tens of thousands of times longer. Near its extremity the axon splits off into many branches called telodendria, and at the tip of these branches are minuscule structures called synaptic terminals (or simply synapses), which are connected to the dendrites (or directly to the cell body) of other neurons. Biological neurons receive short electrical impulses called signals from other neurons via these synapses. When a neuron receives a sufficient number of signals from other neurons within a few milliseconds, it fires its own signals.
Individual biological neurons seem to behave in a rather simple way, but they are organized in a vast network of billions of neurons, each neuron typically connected to thousands of other neurons. Highly complex computations can be performed by a vast network of fairly simple neurons. The architecture of biological neural networks (BNN) is still the subject of active research, but some parts of the brain have been mapped and it seems that neurons are often organized in consecutive layers, as seen in the image below.
Logical Computations with Neurons
McCulloch and Pitts proposed a very simple model of the biological neuron, which later became known as an artificial neuron: it has one or more binary inputs and one binary output. The artificial neuron simply activates its output when more than a certain number of its inputs are active. McCulloch and Pitts showed that even with such a simplified model it is possible to build a network of artificial neurons that computes any logical proposition you want. See the image below for some examples.
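As a rough illustration, here is a minimal Python sketch of a few such logical computations. It assumes a neuron fires when at least two of its input connections are active; the function names and the way the inhibitory connection is modeled are my own choices, not taken from the paper.

```python
def neuron(*inputs, threshold=2):
    """Artificial neuron: outputs 1 when at least `threshold` input connections are active."""
    return int(sum(inputs) >= threshold)

def identity(a):
    # A is wired to the neuron twice, so A alone reaches the threshold
    return neuron(a, a)

def logical_and(a, b):
    # One connection per input: both inputs must be active
    return neuron(a, b)

def logical_or(a, b):
    # Two connections per input: either input alone is enough
    return neuron(a, a, b, b)

def a_and_not_b(a, b):
    # Assumption: B's connection is inhibitory, so the neuron cannot fire while B is active
    return 0 if b else neuron(a, a)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```

Combining small units like these is what lets a network compute more complex logical expressions.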
The Perceptron
The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see image below) called a threshold logic unit (TLU), or sometimes a linear threshold unit (LTU): the inputs and outputs are now numbers (instead of binary values) and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs ( z = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n = x^T w ), then applies a step function to that sum and outputs the result: h_w(x) = step(z) , where z = x^T w .
The most common step function used in Perceptrons is the Heaviside step function. Sometimes the sign function is used instead.
Common Step Functions used in Perceptrons
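For reference, the two step functions can be sketched in NumPy as follows. This assumes the usual conventions: the Heaviside function returns 0 for negative inputs and 1 otherwise, and the sign function returns -1, 0 or +1.

```python
import numpy as np

def heaviside(z):
    # 0 when z < 0, 1 when z >= 0
    return np.where(np.asarray(z) >= 0, 1, 0)

def sgn(z):
    # -1 when z < 0, 0 when z == 0, +1 when z > 0
    return np.sign(z)

z = np.array([-2.0, 0.0, 3.0])
print(heaviside(z))
print(sgn(z))
```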
A single TLU can be used for simple linear binary classification. Training a TLU in this case means finding the right weights. A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer, it is called a fully connected layer or a dense layer. To represent the fact that each input is sent to every TLU, it is common to draw special passthrough neurons called input neurons: they just output whatever input they are fed. All the input neurons form the input layer. Moreover, an extra bias feature is generally added ( x0=1 ): it is typically represented using a special type of neuron called a bias neuron, which outputs 1 all the time. A multi-output classifier that can classify instances simultaneously into three different binary classes can be seen below.
Thanks to linear algebra, it is possible to efficiently compute the outputs of a layer of artificial neurons for several instances at once, using the equation below.
Computing the Outputs of a Fully Connected Layer
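The computation is usually written as h(X) = φ(XW + b), where X holds one instance per row, W holds one column of weights per neuron, b holds one bias per neuron, and φ is the activation function. A minimal NumPy sketch, with made-up weights and a step activation chosen purely for illustration:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

def dense_layer(X, W, b, activation):
    """Outputs of a fully connected layer for a batch of instances: activation(X @ W + b)."""
    return activation(X @ W + b)

X = np.array([[1.0, 2.0],
              [0.0, -1.0]])          # 2 instances, 2 features
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 1.0, -1.0]])     # 2 inputs -> 3 neurons (illustrative values)
b = np.array([-1.0, 0.0, 0.0])       # one bias term per neuron

outputs = dense_layer(X, W, b, step)
print(outputs.shape)  # one row per instance, one column per neuron
print(outputs)
```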
The Perceptron training algorithm proposed by Rosenblatt was inspired by Hebb's rule. In his book The Organization of Behavior, published in 1949, Donald Hebb suggested that when a biological neuron often triggers another neuron, the connection between those two neurons grows stronger. This later became known as Hebb's rule (or Hebbian learning): the connection weight between two neurons is increased whenever they have the same output. Perceptrons are trained using a variant of this rule that takes into account the error made by the network; it reinforces connections that help reduce the error. The Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. This is shown in the equation below:
Perceptron Learning Rule
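The rule is commonly written as w_ij ← w_ij + η (y_j − ŷ_j) x_i, where η is the learning rate, y_j the target and ŷ_j the prediction of output neuron j. The following is a minimal sketch of that update for a single TLU, trained on the logical AND function; the function name, the learning rate and the toy dataset are my own assumptions, and the bias is kept as a separate term rather than as an extra x0 = 1 feature.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=10):
    """Train a single TLU one instance at a time with the perceptron learning rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(x_i @ w + b >= 0)   # Heaviside step on the weighted sum
            error = y_i - y_hat             # 0 when the prediction is correct
            w += eta * error * x_i          # reinforce the inputs that would have helped
            b += eta * error                # same rule applied to the bias term
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                  # logical AND, which is linearly separable

w, b = train_perceptron(X, y)
predictions = (X @ w + b >= 0).astype(int)
print(w, b, predictions)
```

Because AND is linearly separable, the weights stop changing after a few epochs.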
The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns. However, if the training instances are linearly separable, Rosenblatt demonstrated that this algorithm would converge to a solution. This is called the Perceptron convergence theorem. Scikit-Learn provides a Perceptron class that implements a single-TLU network. The Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. Contrary to Logistic Regression classifiers, Perceptrons do not output a class probability; they just make predictions based on a hard threshold. This is one of the good reasons to prefer Logistic Regression over Perceptrons. In their 1969 monograph titled Perceptrons, Marvin Minsky and Seymour Papert highlighted a number of serious weaknesses of Perceptrons, in particular the fact that they are incapable of solving some trivial problems - e.g., the Exclusive OR (XOR) classification problem (see the left half of the image below). It turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a Multi-Layer Perceptron (MLP). In particular, an MLP can solve the XOR problem (see the right half of the image below).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
iris = load_iris()
X = iris.data[:, (2,3)] # Petal Length, petal width
y = (iris.target == 0).astype(np.int8) # Iris Setosa?
per_clf = Perceptron()
per_clf.fit(X,y)
y_pred = per_clf.predict([[2,0.5]])
print(y_pred)
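To make the XOR claim concrete, here is a small sketch of a two-layer network of TLUs with hand-picked weights that computes XOR. The specific weights and the function names are my own illustration (one hidden unit acts as AND, the other as OR); nothing is trained here.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

def xor_mlp(X):
    # Hidden layer: the first TLU fires only when both inputs are 1 (AND),
    # the second fires when at least one input is 1 (OR)
    W_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-1.5, -0.5])
    hidden = step(X @ W_hidden + b_hidden)
    # Output TLU: fires when OR is active and AND is not
    w_out = np.array([-1.0, 1.0])
    b_out = -0.5
    return step(hidden @ w_out + b_out)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_mlp(X_xor))  # [0 1 1 0]
```

A single TLU cannot do this, since no straight line separates the two XOR classes.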
Multi-Layer Perceptron and Backpropagation
An MLP is composed of one (passthrough) input layer, one or more layers of TLUs, called hidden layers, and one final layer of TLUs called the output layer (see image below). The layers close to the input layer are usually called the lower layers, and the ones close to the output are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. An MLP is an example of a feedforward neural network because the signal flows only in one direction.
When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of Deep Learning studies DNNs and, more generally, models containing deep stacks of computations. (However, many people talk about Deep Learning whenever neural networks are involved.) In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams published a groundbreaking paper introducing the backpropagation training algorithm, which is still used today. Backpropagation is simply Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution. Automatically computing gradients is called automatic differentiation, or autodiff.
Backpropagation
- It handles one mini-batch at a time, and it goes through the full training set multiple times. Each pass is called an epoch.
- Each mini-batch is passed to the network's input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer. The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
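- The algorithm measures the network's output error (i.e., it uses a loss function that compares the desired output with the actual output of the network and returns some measure of the error).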
- It computes how much each output connection contributed to the error. This is done analytically by applying the chain rule, which makes this step fast and precise.
- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network.
- The algorithm then performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
This algorithm is so important, it’s worth summarizing it again: for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
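The steps can be sketched by hand in NumPy for a tiny network. This is only an illustration under my own assumptions (one hidden layer of 4 logistic neurons, a logistic output, mean squared error, full-batch updates on the four XOR points, a learning rate of 0.5); real frameworks compute the gradients with autodiff rather than with handwritten formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Random initialization of the connection weights (biases can start at zero)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
eta = 0.5
losses = []

for epoch in range(2000):
    # Forward pass: intermediate results are kept for the backward pass
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)

    # Measure the output error with a loss function (mean squared error here)
    losses.append(np.mean((y_hat - y) ** 2))

    # Backward pass: chain rule, from the output layer down to the first layer
    d_z2 = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)
    d_W2, d_b2 = a1.T @ d_z2, d_z2.sum(axis=0)
    d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)
    d_W1, d_b1 = X.T @ d_z1, d_z1.sum(axis=0)

    # Gradient Descent step using the gradients just computed
    W1 -= eta * d_W1
    b1 -= eta * d_b1
    W2 -= eta * d_W2
    b2 -= eta * d_b2

print(f"loss went from {losses[0]:.4f} to {losses[-1]:.4f}")
```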
It is important to initialize all the hidden layers' connection weights randomly, or else training will fail. When you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse set of neurons. For the algorithm to work properly, the authors made a key change to the MLP's architecture: they replaced the step function with the logistic function σ(z)=1 / (1+exp(−z)) . This was essential because the step function contains only flat segments, so there is no gradient to work with, whereas the logistic function has a nonzero derivative everywhere, which lets Gradient Descent make some progress at every step. The backpropagation algorithm works well with many other activation functions, not just the logistic function. Other popular ones:
- The hyperbolic tangent function tanh(z)=2σ(2z)−1
- Just like the logistic function it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer’s output more or less centered around 0 at the beginning of training. This often helps speed up convergence.
- The Rectified Linear Unit Function ReLU(z)=max(0,z)
- In practice it works well and is fast to compute. It does not have a maximum output value, which helps reduce some issues during Gradient Descent.
The popular activation functions and their derivatives can be seen in the image below. You need some nonlinearity between layers because chaining several linear transformations just gives you another linear transformation: without activation functions, even a deep stack of layers would be equivalent to a single layer.
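As a quick numerical check of these formulas, here is a small NumPy sketch. The helper names are mine, and the derivative is approximated with a finite difference purely for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Hyperbolic tangent expressed through the logistic function
    return 2.0 * logistic(2.0 * z) - 1.0

def relu(z):
    return np.maximum(0.0, z)

def derivative(f, z, eps=1e-6):
    # Central finite-difference approximation of f'(z)
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.linspace(-3, 3, 7)
print(np.allclose(tanh(z), np.tanh(z)))   # the identity matches NumPy's tanh
print(derivative(logistic, 0.0))          # slope of the logistic function at 0, close to 0.25
print(relu(np.array([-1.0, 2.0])))        # negative inputs are clipped to 0
```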
Regression MLPs
MLPs can be used for regression tasks. If you just want to predict a single value, then you just need a single output neuron. For multivariate regression (to predict multiple values at once), you need one output neuron per output dimension. In general, when building an MLP for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values. If you want to guarantee that the output will always be positive, then you can use the ReLU activation function or the softplus activation function in the output layer. If you want the outputs to fall within a given range, you can use the logistic function or the hyperbolic tangent and scale the labels to the appropriate range. The loss function used during training is typically the mean squared error, but if you have a lot of outliers, it may be better to use the mean absolute error.
Classification MLPs
MLPs can be used for classification tasks. For a binary classification problem, you only need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. MLPs can also handle multilabel binary classification tasks: you dedicate one output neuron to each positive class you want to predict. If each instance can belong only to a single class, out of 3 or more possible classes, then you need one output neuron per class, and you should use the softmax activation function for the whole output layer (see image below). The softmax function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1. This is called multiclass classification.
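To see why outliers matter for the choice of loss, here is a small NumPy comparison on made-up numbers (both the targets and the predictions are hypothetical):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # the last target is an outlier
y_pred = np.array([1.5, 2.5, 2.5, 4.5, 5.0])     # hypothetical predictions

errors = y_pred - y_true
mse = np.mean(errors ** 2)        # squaring lets the outlier dominate the loss
mae = np.mean(np.abs(errors))     # the outlier still counts, but only linearly

print(mse, mae)
```

Here the outlier accounts for 9025 of the 9026 units of summed squared error, but only 95 of the 97 units of summed absolute error.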
The cross-entropy loss function is generally a good choice for MLP classification.
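A minimal NumPy sketch of the softmax function and the cross-entropy loss for integer class labels. The logits and labels are made up, and the function names are mine.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_cross_entropy(probas, class_indices):
    # Mean negative log-probability assigned to the true class of each instance
    picked = probas[np.arange(len(class_indices)), class_indices]
    return -np.mean(np.log(picked))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])   # 2 instances, 3 classes (illustrative values)
labels = np.array([0, 1])              # true class index of each instance

probas = softmax(logits)
loss = sparse_cross_entropy(probas, labels)
print(probas.sum(axis=1))  # each row of probabilities sums to 1
print(loss)
```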
Implementing MLPs with Keras
Keras is a high-level Deep Learning API that allows you to easily build, train, evaluate and execute all sorts of neural networks. It was developed by Francois Chollet as part of a research project and released as an open source project in March 2015. It quickly gained popularity owing to its ease of use, flexibility, and beautiful design. To perform the heavy computations required by neural networks, the reference implementation (keras-team) relies on a computation backend. At present, you can choose from three popular open source deep learning libraries: TensorFlow, Microsoft Cognitive Toolkit, or Theano.
Building an Image Classifier Using the Sequential API
The Fashion MNIST dataset is a drop-in replacement for MNIST. It has the exact same format as MNIST (70,000 grayscale images of 28x28 pixels each, with 10 classes), but the images represent fashion items rather than handwritten digits, so each class is more diverse and the problem turns out to be significantly more challenging than MNIST. Keras provides some utility functions to fetch and load common datasets, including MNIST, Fashion MNIST, the original California housing dataset, and more.
See the code below for a classification MLP with two hidden layers. The model's summary() method displays all the model's layers, including each layer's name (which is automatically generated unless you set it when creating the layer), its output shape (None means the batch size can be anything), and its number of parameters. The summary ends with the total number of parameters, including trainable and non-trainable parameters. Note that Dense layers often have a lot of parameters: the first hidden layer has 784 x 300 connection weights, plus 300 bias terms, which add up to 235,500 parameters. This gives the model a lot of flexibility to fit the training data, but it also means that the model runs the risk of overfitting, especially when you do not have a lot of training data. All the parameters of a layer can be accessed using its get_weights() and set_weights() methods. For a Dense layer, this includes both the connection weights and the bias terms. If you want to use a different initialization method, you can set kernel_initializer (kernel is another name for the matrix of connection weights) or bias_initializer when creating the layer. The shape of the weight matrix depends on the number of inputs. This is why it is recommended to specify the input_shape when creating the first layer in a Sequential model.
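The parameter counts can be checked with a few lines of plain Python. The helper name is mine; the layer sizes are the ones used for this model (784 inputs, then 300, 100 and 10 neurons).

```python
def dense_params(n_inputs, n_neurons):
    # One weight per (input, neuron) pair, plus one bias term per neuron
    return n_inputs * n_neurons + n_neurons

layer_sizes = [28 * 28, 300, 100, 10]
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [235500, 30100, 1010]
print(sum(counts))  # 266610
```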
Compiling the Model
After a model is created, you must call its compile() method to specify the loss function and the optimizer to use. You may also specify a list of extra metrics to compute during training and evaluation. See the comments in the code below to see why certain keyword arguments were used in the compile() method.
Training and Evaluating the Model
To train the model, simply call its fit() method. We pass in the training data, the number of epochs to train, and optionally a validation set: Keras will measure the loss and the extra metrics on this set at the end of every epoch, which is very useful to see how well the model really performs. If the performance on the training set is much better than on the validation set, your model is probably overfitting the training set. At each epoch during training, Keras displays the number of instances processed so far, the mean training time per sample, and the loss and accuracy (or any other metrics you asked for), both on the training set and the validation set. Note: Instead of passing in a validation set using the validation_data argument, you could set validation_split to the ratio of the training set that you want Keras to use for validation. If your training set is skewed, with some classes overrepresented and others underrepresented, it would be useful to set the class_weight argument. The fit() method returns a History object containing the training parameters (history.params), the list of epochs it went through (history.epoch), and most importantly, a dictionary (history.history) containing the loss and extra metrics it measured at the end of each epoch on the training set and on the validation set. You can easily plot the learning curves (see below).
# Load the Fashion MNIST dataset
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
# One important difference of the fashion MNIST dataset is that each image is represented as a 28x28 array instead of a 784 vector, and pixel intensities are represented as integers rather than floats
print(X_train_full.shape)
print(X_train_full.dtype)
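# Create validation and training sets
# Since we are training the neural net using Gradient Descent, we must scale the input features
# (here the pixel intensities are scaled down to the 0-1 range by dividing by 255)
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
# Scale the test set the same way, so evaluate() and predict() see inputs in the same range
X_test = X_test / 255.0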
# We need the list of class names that we are working with
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
print(class_names[y_train[0]])
# Create a model using the Sequential API
# This is a classification MLP with two hidden layers
# The Sequential model is the simplest kind of Keras model, for neural networks that are just composed of a single
# stack of layers, connected sequentially
model = keras.models.Sequential()
# This Flatten layer receives the image and turns it into a 1D array. This layer does not have any parameters, it is just there to do some simple preprocessing
# Since it is the first layer in the model, you should specify `input_shape`
# Alternatively, you could add a `keras.layers.InputLayer` as the first layer, setting its shape to [28, 28]
model.add(keras.layers.Flatten(input_shape=[28,28]))
# Adds a Dense layer with 300 neurons and the ReLU activation function
# Each Dense layer manages its own weight matrix, containing all connection weights between the neurons and their inputs - it also manages a vector of bias terms
model.add(keras.layers.Dense(300,activation="relu"))
# Another Dense Layer, with 100 neurons
model.add(keras.layers.Dense(100,activation="relu"))
# Dense output layer with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive)
model.add(keras.layers.Dense(10,activation="softmax"))
# Here is another way to create an equivalent Sequential model:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.summary()
# You can easily get a model's list of layers, fetch a layer by its index, or fetch a layer by its name
print(model.layers)
print(model.layers[1].name)
weights, biases = model.layers[1].get_weights()
print("Weights =",weights)
print("Weights Shape =",weights.shape)
print("Biases =",biases)
print("Biases Shape =",biases.shape)
# Compiling the model
# We use the `sparse_categorical_crossentropy` loss because we have sparse labels (for each instance, there is just a target class index - from 0 to 9 in this case),
# and the classes are exclusive.
# If instead we had one target probability per class for each instance, we would need to use the `categorical_crossentropy` loss instead
# If we were doing binary classification, we would use the "sigmoid" activation function in the output layer instead of "softmax" and we would use
# the `binary_crossentropy` loss
model.compile(loss="sparse_categorical_crossentropy",optimizer="sgd",metrics=["accuracy"])
history = model.fit(X_train,y_train,epochs=30,validation_data=(X_valid, y_valid))
import pandas as pd
import matplotlib.pyplot as plt
pd.DataFrame(history.history).plot(figsize=(8,5))
plt.grid(True)
plt.gca().set_ylim(0,1) # set the vertical range from 0 to 1
plt.show()
You can see that both the training and validation accuracy steadily increase during training, while the training and validation loss decrease - good! Moreover, the validation curves are quite close to the training curves, which means that there is not too much overfitting. The training set performance ends up beating the validation performance, which is generally the case when you train long enough. If you are not satisfied with the performance of your model, you should go back and tune the hyperparameters: the number of layers, the number of neurons per layer, the types of activation functions for the hidden layers, the number of training epochs, and the batch size. You can easily estimate the generalization error using the evaluate() method.
Using the Model to Make Predictions
You can use the model's predict() method to make predictions on new instances.
Building a Regression MLP Using the Sequential API
Building, training, evaluating, and using a regression MLP with the Sequential API is quite similar to what we did for classification. The main differences are that the output layer has a single neuron (since we only want to predict a single value) and uses no activation function, and that the loss function is the mean squared error. The Sequential API is easy to use. Although Sequential models are extremely common, it is sometimes useful to build neural networks with more complex topologies, or with multiple inputs and outputs. For this purpose, Keras offers the Functional API.
model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_proba = model.predict(X_new)
print(y_proba.round(2))
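Since predict() returns one estimated probability per class, the predicted class is the index of the largest value in each row. A small NumPy sketch, using made-up probabilities as a stand-in for the real output of model.predict():

```python
import numpy as np

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Hypothetical probabilities, standing in for the output of model.predict()
example_proba = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.03, 0.0, 0.01, 0.0, 0.96],
    [0.0, 0.0, 0.98, 0.0, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])

# Index of the highest probability in each row, then the matching class name
example_classes = np.argmax(example_proba, axis=1)
example_names = [class_names[i] for i in example_classes]
print(example_classes, example_names)
```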
## Building a Regression MLP Using the Sequential API
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)
scaler = StandardScaler()
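# Fit the scaler on the training set only, then overwrite each set with its scaled
# version so that the fit(), evaluate() and predict() calls that follow use scaled inputs
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)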
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])
model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20,validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3] # pretend these are new instances
y_pred = model.predict(X_new)
Building Complex Models Using the Functional API
One example of a non-sequential neural network is a Wide & Deep neural network. This neural network architecture was introduced in a 2016 paper by Heng-Tze Cheng et al. It connects all or part of the inputs directly to the output layer, as shown in the image below. This architecture makes it possible for the neural network to learn both deep patterns (using the deep path) and simple rules (through the short path). In contrast, a regular MLP forces all the data to flow through the full stack of layers, thus simple patterns in the data may end up being distorted by this sequence of transformations.
When you train a wide and deep model, sending some features through the wide path and others through the deep path (possibly overlapping subsets of features), you have to pass in a pair of matrices, one per input, to the fit() method.
There are many cases when you may want to have multiple outputs:
- The task may demand it: for example, locating and classifying the main object in a picture. This is both a regression task (finding the coordinates of the object's center, as well as its width and height) and a classification task.
- You may have multiple independent tasks to perform based on the same data. In many cases, you will get better results on all tasks by training a single neural network with one output per task. This is because the neural network can learn features in the data that are useful across tasks.
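- Another use case is as a regularization technique: for example, you can add an auxiliary output to a network to make sure that the underlying part of the network learns something useful on its own, without relying on the rest of the network.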
Adding extra outputs is quite easy: connect them to the appropriate layers and add them to your model's list of outputs. Each output needs its own loss function, so when we compile a model we should pass a list of losses.
# Wide and Deep Neural Network
# Create an Input object - this is needed because a model may have multiple inputs
# (named `input_` to avoid shadowing Python's built-in input() function)
input_ = keras.layers.Input(shape=X_train.shape[1:])
# Create a Dense layer with 30 neurons, using the ReLU activation function. We call it like a
# function, passing in the input; this is why it is called the Functional API.
# Note we are just telling Keras how it should connect the layers together, no
# data is being processed yet
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
# Create a second hidden layer, similar to the first
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
# Create a Concatenate layer and use it like a function to concatenate the input and the
# output of the second hidden layer
concat = keras.layers.Concatenate()([input_, hidden2])
# Create the output layer, with a single neuron and no activation function, and call it like a function,
# passing in the result of the concatenation
output = keras.layers.Dense(1)(concat)
# Create a model, specifying the inputs and outputs to use
model = keras.models.Model(inputs=[input_], outputs=[output])
# Sending some features through the deep path and others through the wide path (they can overlap)
input_A = keras.layers.Input(shape=[5])
input_B = keras.layers.Input(shape=[6])
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.models.Model(inputs=[input_A, input_B], outputs=[output])
model.compile(loss="mse", optimizer="sgd")
X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]
history = model.fit((X_train_A, X_train_B), y_train, epochs=20, validation_data=((X_valid_A, X_valid_B), y_valid))
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((X_new_A, X_new_B))
# Multiple Outputs
input_A = keras.layers.Input(shape=[5])
input_B = keras.layers.Input(shape=[6])
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1)(concat)
aux_output = keras.layers.Dense(1)(hidden2)
model = keras.models.Model(inputs=[input_A, input_B],outputs=[output, aux_output])
# Each output needs its own loss function
# We care much more about the main output than the auxiliary output,
# so we give the main output's loss a much greater weight
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
# When we train the model, we need to provide some labels for each output
history = model.fit([X_train_A, X_train_B], [y_train, y_train], epochs=20,validation_data=([X_valid_A, X_valid_B], [y_valid, y_valid]))
# When we evaluate the model, keras will return the total loss, as well as the individual losses
losses = model.evaluate([X_test_A, X_test_B], [y_test, y_test])
# The predict() method will return predictions for each output
y_pred_main, y_pred_aux = model.predict([X_new_A, X_new_B])
Building Dynamic Models Using the Subclassing API
The Subclassing API is used for a more imperative programming style (vs. the static API provided by the Sequential and Functional APIs). The code below shows the pattern: subclass the Model class, create the layers you need in the constructor, and use them to perform the computations you want in the call() method. Unless you really need that extra flexibility, you should probably stick to the Sequential API and the Functional API.
class WideAndDeepModel(keras.models.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)  # handles standard args (e.g., name)
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)
        self.aux_output = keras.layers.Dense(1)

    def call(self, inputs):
        input_A, input_B = inputs
        hidden1 = self.hidden1(input_B)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([input_A, hidden2])
        main_output = self.main_output(concat)
        aux_output = self.aux_output(hidden2)
        return main_output, aux_output

model = WideAndDeepModel()
Saving and Restoring a Model
Saving a model is simple. Keras will save both the model's architecture (including every layer's hyperparameters) and the values of all the model parameters for every layer (connection weights and biases), using the HDF5 format. It also saves the optimizer. You will typically have a script that trains a model and saves it, and one or more scripts that load the model and use it to make predictions. You can also use the save_weights() and load_weights() methods to save and restore just the model parameters.
model.save("my_keras_model.h5")
model = keras.models.load_model("my_keras_model.h5")
If the model training lasts several hours - which is common when training on large datasets - you can save checkpoints at regular intervals during training. You can tell the fit() method to save checkpoints using callbacks.
Using Callbacks
The fit() method accepts a callbacks argument that lets you specify a list of objects that Keras will call during training at the start and end of training, at the start and end of each epoch and even before and after processing each batch. The ModelCheckpoint class saves your model by default at the end of every epoch.
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5")
history = model.fit(X_train, y_train, epochs=10, callbacks=[checkpoint_cb])
Moreover, if you use a validation set during training, you can set save_best_only=True when creating the ModelCheckpoint. In this case, it will only save your model when its performance on the validation set is the best so far. This way, you do not need to worry about training for too long and overfitting the training set: simply restore the last model saved after training, and this will be the best model on the validation set. This is a simple way to implement early stopping.
Another way to implement early stopping is to use the EarlyStopping callback. It will interrupt training when it measures no progress on the validation set for a number of epochs (defined by the patience argument), and it will optionally roll back to the best model. You can combine both callbacks to save checkpoints of your model (in case your computer crashes) and to interrupt training early when there is no more progress (to avoid wasting time and resources):
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid), callbacks=[checkpoint_cb])
model = keras.models.load_model("my_keras_model.h5") # rollback to best model
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=[checkpoint_cb, early_stopping_cb])
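To make the patience rule more concrete, here is a minimal sketch in plain Python of the logic described above. It is a simplified illustration, not Keras's actual EarlyStopping implementation. The function name early_stopping_epoch and the list of validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Simplified patience rule (illustrative only, not the Keras implementation).

    Returns (stop_epoch, best_epoch): the epoch at which training would be
    interrupted (or the last epoch if patience never runs out) and the epoch
    with the lowest validation loss seen up to that point.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_progress = 0
    stop_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Progress on the validation set: remember this epoch, reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_progress = 0
        else:
            epochs_without_progress += 1
            if epochs_without_progress >= patience:
                stop_epoch = epoch
                break
    return stop_epoch, best_epoch

# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.58]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With patience=3, training stops at epoch 5 and the best model is the one from epoch 2, which is what restoring the best weights would roll back to. The later improvement at epoch 6 is never reached, which shows why too small a patience value can stop training prematurely.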
If you need extra control, you can easily write your own callbacks.
Visualization Using TensorBoard
TensorBoard is a tool that you should definitely have in your toolbox. It is a great interactive visualization tool that you can use to view the learning curves during training, compare learning curves between multiple runs, visualize the computation graph, analyze training statistics, view images generated by your model, visualize complex multidimensional data projected down to 3D and automatically clustered for you, and more. It is installed automatically when you install TensorFlow.
To use it, you must modify your program so that it outputs the data you want to visualize to special binary log files called event files. Each binary record is called a summary. The TensorBoard server will monitor the log directory, and it will automatically pick up changes and update the visualizations: this allows you to visualize live data, such as learning curves during training. In general, you want to point the TensorBoard server to a root log directory, and configure your program so that it writes to a different subdirectory each time it runs. This way, the same TensorBoard server instance will allow you to visualize and compare data from multiple runs of your program, without getting everything mixed up:
import os
root_logdir = os.path.join(os.curdir, "my_logs")
def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)
run_logdir = get_run_logdir() # e.g., './my_logs/run_2019_01_16-11_28_43'
[...] # Build and compile your model
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
history = model.fit(X_train, y_train, epochs=30, validation_data=(X_valid, y_valid), callbacks=[tensorboard_cb])
Use the following command to start the TensorBoard server:
$ tensorboard --logdir=./my_logs --port=6006
TensorBoard 2.0.0 at http://mycomputer.local:6006 (Press CTRL+C to quit)
Fine-Tuning Neural Network Hyperparameters
The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network architecture, but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more. How do you know what combination of hyperparameters is best for your task? One option is to try many combinations of hyperparameters and see which one works best, using GridSearchCV or RandomizedSearchCV. See the code on pages 315-317 of the book for examples of this (how to wrap Keras models to use them with Scikit-Learn). The exploration may last hours depending on the hardware, the size of the dataset, the complexity of the model, and the number of hyperparameters you are tuning. Randomized search works well for simple problems.
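As a rough illustration of the randomized search idea, here is a minimal sketch in plain Python (it does not use RandomizedSearchCV). The search space, the names sample_hyperparameters and random_search, and especially validation_score are assumptions made up for this example. In particular, validation_score is a toy formula standing in for "train the model and measure its score on the validation set".

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one random combination from a hypothetical search space."""
    return {
        "n_hidden": rng.randint(1, 4),                # number of hidden layers
        "n_neurons": rng.randint(10, 100),            # neurons per hidden layer
        "learning_rate": 10 ** rng.uniform(-4, -1),   # sampled on a log scale
    }

def validation_score(params):
    """Toy stand-in for training a model and evaluating it (higher is better).

    In a real search this would build, train, and evaluate a model.
    """
    return -(
        (params["n_hidden"] - 2) ** 2
        + ((params["n_neurons"] - 60) / 30) ** 2
        + (math.log10(params["learning_rate"]) + 2) ** 2
    )

def random_search(n_iter, seed=42):
    """Evaluate n_iter random combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_iter):
        params = sample_hyperparameters(rng)
        score = validation_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(n_iter=50)
print(best_params, best_score)
```

Sampling the learning rate on a log scale is a design choice of this sketch: it spreads the trials evenly across orders of magnitude instead of concentrating them near the upper end of the range.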
There are many techniques to explore a search space more efficiently than randomly. The core idea is simple: when a region of the space turns out to be good, it should be explored more. This takes care of the "zooming" process for you and leads to much better solutions in much less time. Here are a few Python libraries that you can use for optimizing hyperparameters:
- Hyperopt: a popular Python library for optimizing over all sorts of complex search spaces
- Hyperas, kopt, or Talos: libraries for optimizing hyperparameters for Keras models
- Scikit-Optimize: a general-purpose optimization library. The BayesSearchCV class performs Bayesian optimization using an interface similar to GridSearchCV
- Spearmint: a Bayesian optimization library
- Sklearn-Deap: a hyperparameter search library based on evolutionary algorithms
Many companies offer services for hyperparameter optimization; for example, Google Cloud ML Engine has a hyperparameter tuning service. Hyperparameter tuning is still an active area of research. Evolutionary algorithms are making a comeback lately, for example in this 2017 paper by DeepMind. Google has also used an evolutionary approach, not only to find appropriate hyperparameters but also to find the best neural network architecture for a problem. They call this AutoML, and it is available as a cloud service. Check out Google's post on discovering architectures using evolutionary algorithms, and this 2017 post by Uber where they introduce their Deep Neuroevolution technique.
Number of Hidden Layers
Deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data. Real-world data is often structured in a hierarchical way, and DNNs automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces). This hierarchical architecture helps DNNs converge faster to good solutions and improves their ability to generalize to new datasets. It also enables transfer learning: instead of initializing a new network randomly, you can initialize its lower hidden layers with the weights of the lower layers of a network already trained on a similar task (which has already learned to detect the simpler features), so the new network only has to learn the higher-level structures and trains faster. Generally, simpler problems require fewer layers. For more complex problems, you can gradually ramp up the number of hidden layers until you start overfitting the training set.
Number of Neurons per Hidden Layer
The number of neurons in input and output layers is constrained by the task. As for the hidden layers, it used to be a common practice to size them to form a pyramid, with fewer and fewer neurons at each layer—the rationale being that many low-level features can coalesce into far fewer high-level features. However, this practice has been abandoned by now. Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. In general, you will get more bang for the buck by increasing the number of layers than the number of neurons per layer. A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting.
Learning Rate, Batch Size, and Other Hyperparameters
- The learning rate is arguably the most important hyperparameter. In general, the optimal learning rate is about half of the maximum learning rate (i.e., the learning rate above which the training algorithm diverges). So a simple approach for tuning the learning rate is to start with a large value that makes the training algorithm diverge, then divide this value by 3 and try again, and repeat until the training algorithm stops diverging.
- Choosing a better optimizer than plain old Mini-batch Gradient Descent is also quite important.
- The batch size also has a significant impact on the model's performance. In general, the optimal batch size will be lower than 32. A small batch size ensures that each training iteration is very fast, and although a large batch size will give a more precise estimate of the gradients, in practice this does not matter much, since the optimization landscape is quite complex and the direction of the true gradients does not point precisely in the direction of the optimum.
- The choice of activation function: the ReLU activation function is a good default for hidden layers. For the output layer, it depends on the task.
- In most cases, the number of training iterations does not actually need to be tweaked: just use early stopping instead.
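The "divide by 3 until it stops diverging" procedure for the learning rate in the list above can be sketched in plain Python. This is only a toy illustration under stated assumptions. The "model" is a single weight trained by gradient descent on the quadratic loss 0.5 * curvature * w**2. The names diverges and find_learning_rate are made up for the example. With a real network you would replace diverges with a short training run and check whether the loss blows up.

```python
import math

def diverges(learning_rate, n_steps=50, curvature=10.0):
    """Run gradient descent on the toy loss 0.5 * curvature * w**2.

    Returns True if the weight moved away from the optimum (w = 0).
    """
    w = 1.0
    for _ in range(n_steps):
        gradient = curvature * w
        w -= learning_rate * gradient
        if not math.isfinite(w):
            return True
    return abs(w) > 1.0

def find_learning_rate(initial_lr, factor=3.0, max_tries=20):
    """Start from a diverging learning rate and divide by `factor` until training is stable."""
    lr = initial_lr
    for _ in range(max_tries):
        if not diverges(lr):
            return lr
        lr /= factor
    raise RuntimeError("no non-diverging learning rate found")

lr = find_learning_rate(initial_lr=10.0)
print(lr)
```

On this toy loss, gradient descent diverges for learning rates above 2 / curvature = 0.2, so starting from 10.0 the search tries 10, 3.33, 1.11, and 0.37 before settling on 10 / 81 (about 0.123).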
For best practices, make sure to read Yoshua Bengio's great 2012 paper, which presents many practical recommendations for deep networks.