Deep Learning with Python Chapters 3 and 4
These chapters give an "Introduction to Keras and TensorFlow" and an introduction to classification and regression with Keras.
Introduction to Keras and TensorFlow
TensorFlow is a Python-based, free, open source machine learning platform, developed primarily by Google. Much like NumPy, the primary purpose of TensorFlow is to enable engineers and researchers to manipulate mathematical expressions over numerical tensors. But TensorFlow differs in scope:
- It can automatically compute the gradient of any differentiable expression, making it highly suitable for machine learning
- It can run not only on CPUs, but also on GPUs and TPUs, highly parallelized hardware accelerators
- Computation defined in TensorFlow can easily be distributed across many machines
- TensorFlow programs can be exported to other runtimes, such as C++, JavaScript, or TensorFlow Lite (for applications running on mobile or embedded devices). This makes TensorFlow applications easy to deploy in practical settings.
TensorFlow is more than a library - it is a platform, home to a vast ecosystem:
- TF-Agents for reinforcement learning research
- TFX for industry-strength machine learning workflow management
- TensorFlow Serving for production deployment
- The TensorFlow Hub repository of pretrained models
Keras is a deep learning API for Python, built on top of TensorFlow, that provides a convenient way to define and train any kind of deep learning model. It was initially developed for research with the aim of enabling fast deep learning experimentation.
It is highly recommended that you run deep learning code on a modern NVIDIA GPU rather than your own computer's CPU. Some applications - in particular, image processing with convolutional neural networks - will be excruciatingly slow on CPU, even a fast multicore CPU. To do deep learning, you probably want to use the free GPU runtime from Colaboratory, a hosted notebook service offered by Google - this is only good for small workloads. If you want to scale up, you will need to use GPU instances on Google Cloud or Amazon EC2. It's best to run Keras on a Unix workstation - if using Windows, set up an Ubuntu dual boot.
my_list = [3,2,5]
sorted(my_list)
# How to Install packages in Colab: !pip install <package_name>
## Make sure to turn on the GPU runtime by going to Runtime > Change Runtime Type
Training a neural network revolves around the following concepts:
- First, low-level tensor manipulation - the infrastructure that underlies all modern machine learning. This translates to TensorFlow APIs:
  - Tensors, including special tensors that store the network's state (variables)
  - Tensor operations such as addition, relu, and matmul
  - Backpropagation, a way to compute the gradient of mathematical expressions (handled in TensorFlow via the GradientTape object)
- Second, high-level deep learning concepts. This translates to Keras APIs:
  - Layers, which are combined into a model
  - A loss function, which defines the feedback signal used for learning
  - An optimizer, which determines how learning proceeds
  - Metrics to evaluate model performance, such as accuracy
  - A training loop that performs mini-batch stochastic gradient descent
import tensorflow as tf
"""
Tensors need to be created with some initial value.
All ones or All Zeros Tensors:
"""
x = tf.ones(shape=(2,1)) # Equivalent to np.ones(shape=(2,1))
print(x)
x = tf.zeros(shape=(2,1)) # Equivalent to np.zeros(shape=(2,1))
print(x)
"""
Random Tensors:
"""
x = tf.random.normal(shape=(3,1),mean=0.,stddev=1.) # Tensor of random values drawn from a normal distribution with mean 0 and a standard deviation of 1. Equivalent to np.random.normal(size=(3,1),loc=0.,scale=1.)
print(x)
x = tf.random.uniform(shape=(3,1),minval=0.,maxval=1.) # Tensor of random values drawn from a uniform distribution between 0 and 1.
print(x)
"""
A significant difference between NumPy arrays and TensorFlow tensors is that TensorFlow tensors aren't assignable: they're constant.
"""
import numpy as np
x = np.ones(shape=(2,2))
x[0,0] = 0. # This is acceptable for NumPy arrays
x = tf.ones(shape=(2,2))
try:
    x[0,0] = 2. # This is unacceptable in TensorFlow
except Exception as e:
    print(e)
"""
Creating a TensorFlow variable
"""
v = tf.Variable(initial_value=tf.random.normal(shape=(3,1)))
print(v)
# The state of the variable can be modified via its `assign` method
v.assign(tf.ones((3,1)))
print(v)
# It also works for a subset of the coefficients
v[0,0].assign(3.)
print(v)
# `assign_add()` and `assign_sub()` are efficient equivalents of += and -=
v.assign_add(tf.ones((3,1)))
print(v)
# Just like NumPy, TensorFlow offers a large collection of tensor operations to express mathematical formulas:
a = tf.ones((2,2))
a = a + tf.ones((2,2))
b = tf.square(a) # Take the square
c = tf.sqrt(a)
d = b + c # Add Two Tensors, element-wise
e = tf.matmul(a,b) # Take the dot product of two tensors
e *= d # Multiply two tensors (element-wise)
"""
Here's something NumPy can't do: retrieve the gradient of any differentiable expression with respect to any of its inputs.
Just open a `GradientTape` scope, apply some computation to one or several input tensors, and retrieve the gradient of the result with respect to the inputs
"""
input_var = tf.Variable(initial_value=3.)
with tf.GradientTape() as tape:
    result = tf.square(input_var)
gradient = tape.gradient(result,input_var)
"""
This is most commonly used to retrieve the gradients of the loss of a model with respect to its weights: gradient = tape.gradient(loss,weights)
Only *trainable variables* are tracked by default. With a constant tensor, you'd have to manually mark it as being watched
"""
input_const = tf.constant(3.)
with tf.GradientTape() as tape:
    tape.watch(input_const) # Manually mark the constant tensor as being watched
    result = tf.square(input_const)
gradient = tape.gradient(result,input_const)
time = tf.Variable(0.)
with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        position = 4.9 * time ** 2
    speed = inner_tape.gradient(position, time)
acceleration = outer_tape.gradient(speed, time) # We use the outer tape to compute the gradient of the gradient from the inner tape
print("Acceleration = {:.1f}".format(float(acceleration))) # position = 4.9 * t**2, so speed = 9.8 * t and acceleration = 9.8
### End-to-End Example: A Linear Classifier in Pure TensorFlow
"""
Coming up with some nicely linearly separable synthetic data to work with: two classes of points in a 2D plane.
"""
num_samples_per_class = 1000
negative_samples = np.random.multivariate_normal(mean=[0,3],cov=[[1,0.5],[0.5,1]],size=num_samples_per_class) # Generate one class of points: 1000 random 2D points. cov corresponds to an oval-like point cloud oriented from bottom left to top right
positive_samples = np.random.multivariate_normal(mean=[3,0],cov=[[1,0.5],[0.5,1]],size=num_samples_per_class) # Generate the other class with a different mean and the same covariance matrix
import matplotlib.pyplot as plt
def plot_data():
    fig, ax = plt.subplots(1, 1, layout="constrained")
    ax.scatter(negative_samples[:, 0], negative_samples[:, 1], c='y', label="Negative Samples")
    ax.scatter(positive_samples[:, 0], positive_samples[:, 1], c='b', label="Positive Samples")
    ax.legend()
    ax.set_title("Visualizing Data")
    return ax
plot_data()
plt.show()
inputs = np.vstack((negative_samples,positive_samples)).astype("float32")
targets = np.vstack((np.zeros((num_samples_per_class,1),dtype=np.float32), np.ones((num_samples_per_class,1),dtype=np.float32)))
"""
A linear classifier is an affine transformation (`prediction = W * input + b`) trained to minimize the square of the difference between the predictions and the targets
"""
input_dim = 2 # The input will be 2D points
output_dim = 1 # The output prediction will be a single score per sample (0 or 1)
W = tf.Variable(initial_value=tf.random.uniform(shape=(input_dim,output_dim)))
b = tf.Variable(initial_value=tf.zeros(shape=(output_dim,)))
def model(inputs):
    """
    This is our forward pass function
    """
    return tf.matmul(inputs, W) + b
def square_loss(targets, predictions):
    # per_sample_losses will be a tensor with the same shape as the targets and predictions, containing per-sample loss scores
    per_sample_losses = tf.square(targets - predictions)
    # We need to average these per-sample loss scores into a single scalar loss value: this is what reduce_mean does
    return tf.reduce_mean(per_sample_losses)
learning_rate = 0.1
def training_step(inputs, targets):
    """
    The training step receives some training data and updates the weights W and b so as to minimize the loss on that data
    """
    with tf.GradientTape() as tape:
        # Forward pass, inside a gradient tape scope
        predictions = model(inputs)
        loss = square_loss(targets, predictions)
    # Retrieve the gradient of the loss with regard to the weights
    grad_loss_wrt_W, grad_loss_wrt_b = tape.gradient(loss, [W, b])
    # Update the weights
    W.assign_sub(grad_loss_wrt_W * learning_rate)
    b.assign_sub(grad_loss_wrt_b * learning_rate)
    return loss
"""
Below, for simplicity, we will do batch training instead of mini-batch training. On the one hand, this means each training step will take longer to run;
on the other hand, it means that each gradient update will be much more effective at reducing the loss on the training data. As a result, we will need fewer steps of training, and we
should use a larger learning rate than we would typically use for mini-batch training
"""
for step in range(40):
    loss = training_step(inputs, targets)
    print(f"Loss at step {step}: {loss:.4f}")
ax = plot_data()
x = np.linspace(-1,4,100)
"""
> Recall that the prediction value for a given point [x, y] is simply prediction == [[w1], [w2]] • [x, y] + b == w1 * x + w2 * y + b. Thus, class 0 is defined as w1 * x + w2 * y + b < 0.5, and class 1 is defined as w1 * x + w2 * y + b > 0.5.
"""
y = -W[0] / W[1] * x + (0.5-b) / W[1] # Line Equation
ax.plot(x,y,"-r")
plt.show()
"""
This is really what a linear classifier is all about: finding the parameters of a line (or, in higher-dimensional spaces, a hyperplane) neatly separating two classes of data
"""
The fundamental data structure in neural networks is the layer. A layer is a data processing module that takes as input one or more tensors and that outputs one or more tensors. Some layers are stateless, but more frequently layers have a state: the layer's weights, one or several tensors learned with stochastic gradient descent, which contain the network's knowledge. Different types of layers are appropriate in different situations:
- Densely connected layers (also called fully connected layers, the Dense class in Keras) are appropriate for vector data, of shape (samples, features)
- Sequence data, of shape (samples, timesteps, features), is typically processed by recurrent layers, such as an LSTM layer, or by 1D convolution layers (Conv1D)
- Image data, stored in rank-4 tensors, is usually processed by 2D convolution layers (Conv2D) - a quick sketch of all three follows below
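To make the expected input shapes concrete, here is a minimal sketch (not from the book; the dummy shapes are assumptions chosen just to show the expected input ranks):
from tensorflow.keras import layers
dense = layers.Dense(32, activation="relu") # Vector data: (samples, features)
lstm = layers.LSTM(32) # Sequence data: (samples, timesteps, features)
conv = layers.Conv2D(32, kernel_size=(3, 3)) # Image data: (samples, height, width, channels)
print(dense(tf.ones((2, 16))).shape) # (2, 32)
print(lstm(tf.ones((2, 10, 16))).shape) # (2, 32)
print(conv(tf.ones((2, 28, 28, 1))).shape) # (2, 26, 26, 32)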
Building deep learning models in Keras is done by clipping together compatible layers to form useful data-transformation pipelines. Everything in Keras is either a Layer or something that closely interacts with a Layer. A Layer is an object that encapsulates some state (weights) and some computation (a forward pass). The weights are typically defined in build(), and the computation is defined in the call() method. Just like with LEGO bricks, you can only "clip" together layers that are compatible. The notion of layer compatibility refers specifically to the fact that every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape. When using Keras, you don't have to worry about size compatibility most of the time, because the layers you add to your models are dynamically built to match the shape of the incoming layer.
A deep learning model is a graph of layers. In Keras, that's the Model class. Until now, you've only seen Sequential models (a subclass of Model), which are simple stacks of layers, mapping a single input to a single output. As you move forward, you'll be exposed to a much broader variety of network topologies:
- Two Branch Networks
- Multihead networks
- Residual Connections
The topology of a model defines a hypothesis space. By choosing a network topology, you constrain your space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data. What you're searching for is a good set of values for the weight tensors involved in these tensor operations. To learn from data, you have to make assumptions about it. The structure of your hypothesis space is extremely important, and it encodes the assumptions you make about the problem, the prior knowledge that the model starts with.
from tensorflow import keras
class SimpleDense(keras.layers.Layer):
    """
    All Keras layers inherit from the base Layer class
    """
    def __init__(self, units, activation=None):
        super().__init__()
        self.units = units
        self.activation = activation

    def build(self, input_shape):
        """
        Weight creation takes place in the build() method
        """
        input_dim = input_shape[-1]
        # add_weight() is a shortcut method for creating weights. It is also possible to create standalone
        # variables and assign them to layer attributes, like self.W = tf.Variable(tf.random.uniform(w_shape))
        self.W = self.add_weight(shape=(input_dim, self.units), initializer="random_normal")
        self.b = self.add_weight(shape=(self.units,), initializer="zeros")

    def call(self, inputs):
        """
        The forward pass computation is defined in the call() method
        """
        y = tf.matmul(inputs, self.W) + self.b
        if self.activation is not None:
            y = self.activation(y)
        return y
my_dense = SimpleDense(units=32, activation=tf.nn.relu) # Instantiate our layer, defined previously
input_tensor = tf.ones(shape=(2,784)) # Create some test inputs
output_tensor = my_dense(input_tensor) # Call the layer on the inputs, just like a function
print(output_tensor.shape)
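Calling the layer like a function works because the base Layer class implements __call__(), which creates the weights the first time the layer sees an input. Schematically, and leaving out a lot, it looks like this (a simplified sketch, not the actual Keras source):
def __call__(self, inputs):
    if not self.built:
        self.build(inputs.shape)
        self.built = True
    return self.call(inputs)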
**compile() step:** Once the model architecture is defined, you choose three more things:
- Loss function (objective function) - the quantity that will be minimized during training. It represents a measure of success for the task at hand
- Optimizer - Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD)
- Metrics - measures of success you want to monitor during training and validation, such as classification accuracy. Unlike the loss, training will not optimize directly for these metrics. As such, metrics don't need to be differentiable.
model = keras.Sequential([keras.layers.Dense(1)]) # Define a linear classifier
model.compile(
    optimizer="rmsprop", # Specify the optimizer by name: RMSprop
    loss="mean_squared_error", # Specify the loss by name
    metrics=["accuracy"] # Specify a list of metrics
)
Choosing the right loss function for the right problem is extremely important: your network will take any shortcut it can to minimize the loss, so if the objective doesn't fully correlate with success for the task at hand, your network will end up doing things you may not have wanted.
The fit() method implements the training loop itself. Key arguments:
- The data (inputs and targets) to train on. It will typically be passed either in the form of NumPy arrays or a TensorFlow Dataset object.
- The number of epochs to train for: how many times the training loop should iterate over the data passed
- The batch size to use within each epoch of mini-batch gradient descent: the number of training examples considered to compute the gradients for one weight update step
history = model.fit(
    inputs, # The input examples, as a NumPy array
    targets, # The corresponding targets, as a NumPy array
    epochs=5, # The training loop will iterate over the data 5 times
    batch_size=128 # The training loop will iterate over the data in batches of 128 examples
)
fit() returns a History object. This object contains a history field, which is a dict mapping keys such as "loss" or specific metric names to the list of their per-epoch values.
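For example, with the fit() call above (a small sketch; the exact metric keys depend on what was passed to compile()):
print(history.history.keys()) # e.g. dict_keys(['loss', 'accuracy'])
print(history.history["loss"]) # A list with one loss value per epoch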
"""
The goal of machine learning is to obtain models that perform well in general. To keep an eye on how the model does on new data, it's standard practice to reserve a subset of the training data as *validation data* - you will use the validation data to compute a loss value and metric values.
"""
model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.1), loss=keras.losses.MeanSquaredError(), metrics=[keras.metrics.BinaryAccuracy()])
"""
To avoid having samples from only one class in the validation data, shuffle the inputs and targets using a random indices permutation
"""
indices_permutations = np.random.permutation(len(inputs))
shuffled_inputs = inputs[indices_permutations]
shuffled_targets = targets[indices_permutations]
"""
Reserve 30% of the training inputs and targets for validation
"""
num_validation_samples = int(0.3 * len(inputs))
val_inputs = shuffled_inputs[:num_validation_samples]
val_targets = shuffled_targets[:num_validation_samples]
training_inputs = shuffled_inputs[num_validation_samples:]
training_targets = shuffled_targets[num_validation_samples:]
model.fit(
    training_inputs, # Training data, used to update the weights
    training_targets,
    epochs=5,
    batch_size=16,
    validation_data=(val_inputs, val_targets) # Validation data, used only to monitor the validation loss and metrics
)
"""
You can use the evaluate() method to compute the validation loss and metrics after the training is complete.
This method will iterate in batches over the data passed and return a list of scalars, where the first entry is the validation loss and the following entries are the validation metrics. If the model has no metrics, only the validation loss is returned.
"""
loss_and_metrics = model.evaluate(val_inputs, val_targets, batch_size=128)
"""
Once you've trained your model, you're going to want to use it to make predictions on new data. This is called *inference*. This is done with the `predict()` method, which will iterate over the data in small batches and return a NumPy array of predictions. It can also process TensorFlow Dataset objects.
"""
predictions = model.predict(new_inputs, batch_size=128) # Takes a NumPy array or a Dataset and returns a NumPy array (new_inputs is a placeholder for data the model hasn't seen before)
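As an aside, you can also call the model directly on new data; the difference is that predict() iterates in batches, while a direct call processes everything in a single pass, which can exhaust memory on large arrays. A sketch, reusing the hypothetical new_inputs:
predictions = model(new_inputs) # Returns a tf.Tensor rather than a NumPy array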
Getting Started With Neural Networks: Classification and Regression
Two-class classification, or binary classification, is one of the most common kinds of machine learning problems. Below is an example of binary classification with the IMDB dataset.
from tensorflow.keras.datasets import imdb
"""
- IMDB reviews are sequences of words that have been turned into sequences of integers, where each integer stands for a specific word in a dictionary
- num_words=10000 means to only keep the top 10,000 most frequently occurring words in the training data
- labels are either 0 (negative) or 1 (positive)
"""
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Train Data[0]:",train_data[0])
print("Train Label[0]:",train_labels[0])
"""
Decoding Reviews back to text
"""
word_index = imdb.get_word_index() # word_index is a dictionary mapping words to an integer index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]) # Reverses it, mapping integer indices to words
decoded_review = " ".join([reverse_word_index.get(i - 3, "?") for i in train_data[0]]) # Decodes the review. Note that the indices are offset by 3, because 0, 1, and 2 are reserved indices for "padding", "start of sequence", and "unknown"
"""
Multi-hot encoding to prepare the data
"""
import numpy as np
def vectorize_sequences(sequences, dimension=10_000):
    results = np.zeros((len(sequences), dimension)) # Create an all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1. # Set specific indices of results[i] to 1s
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
print("x_train[0]:",x_train[0])
y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")
"""
There are two key architecture decisions to be made about stacks of Dense layers:
1. How many layers to use
2. How many units to use for each layer
"""
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    # The first argument is the number of units in the layer.
    # Having 16 units means the weight matrix W will have shape (input_dimension, 16): the dot
    # product with W will project the input data onto a 16-dimensional representation space.
    # You can intuitively understand the dimensionality of your representation space as "how
    # much freedom you're allowing the model to have when learning internal representations."
    # Having more units allows the model to learn more-complex representations, but it makes
    # the model more computationally expensive and may lead to learning unwanted patterns (overfitting)
    layers.Dense(16, activation="relu"), # Each Dense layer computes: output = relu(dot(input, W) + b)
    layers.Dense(16, activation="relu"),
    # The sigmoid activation function outputs a probability (a score between 0 and 1) indicating
    # how likely the sample is to have the target "1". A relu (rectified linear unit) is a function meant to zero out
    # negative values, whereas a sigmoid "squashes" arbitrary values into the [0, 1] interval
    layers.Dense(1, activation="sigmoid")
])
"""
Choosing a Loss Function and Optimizer
----------------------------------------------------
Because you're facing a binary classification problem and the output of your model is a probability
(you end your model with a single-unit layer with a sigmoid activation), it's best to use `binary_crossentropy`
loss. Crossentropy is usually the best choice when you're dealing with models that output probabilities. Crossentropy is a quantity
from the field of information theory that measures the distance between probability distributions or, in this case,
between the ground-truth distribution and your predictions.
The rmsprop optimizer is a good default choice for virtually any problem.
"""
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
x_val = x_train[:10_000]
partial_x_train = x_train[10_000:]
y_val = y_train[:10_000]
partial_y_train = y_train[10_000:]
history = model.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val)
)
history_dict = history.history
"""
The dictionary contains four entries: one per metric that was being monitored during training and validation. Plotting Training and Validation Loss side by side:
"""
import matplotlib.pyplot as plt
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, "bo", label="Training loss")
plt.plot(epochs, val_loss_values, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
plt.clf() # Clears the figure
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
plt.plot(epochs, acc, "bo", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
Above, you can see that the training loss decreases with every epoch, and the training accuracy increases with every epoch - that is what you would expect with gradient descent optimization - but this isn't the case for the validation loss and accuracy. The above graphs are an example of overfitting: after the fourth epoch, you're overoptimizing on the training data, and you end up learning representations that are specific to the training data and don't generalize to data outside of the training set.
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)
print("Results [test loss, test accuracy]=",results)
predictions = model.predict(x_test)
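Each prediction is the model's probability that the review is positive. To turn these into hard class labels, you can threshold at 0.5 (a small sketch, not from the book):
predicted_labels = (predictions > 0.5).astype("int32") # 1 = positive review, 0 = negative
print(predictions[:5]) # Confidence scores between 0 and 1
print(predicted_labels[:5])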
!pip install keras-tuner -q
import keras_tuner
def build_model(hp):
    model = keras.models.Sequential()
    model.add(layers.Dense(hp.Choice('units_layer_1', [16, 32, 64, 128]), activation=hp.Choice('activation_1', ['relu', 'tanh'])))
    if hp.Boolean('two_layers'):
        model.add(layers.Dense(hp.Choice('units_layer_2', [16, 32, 64, 128]), activation=hp.Choice('activation_2', ['relu', 'tanh'])))
    if hp.Boolean('three_layers'):
        model.add(layers.Dense(hp.Choice('units_layer_3', [16, 32, 64, 128]), activation=hp.Choice('activation_3', ['relu', 'tanh'])))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="rmsprop", loss=hp.Choice("loss", ["binary_crossentropy", "mse"]), metrics=["accuracy"])
    return model
tuner = keras_tuner.Hyperband(hypermodel=build_model,objective="val_accuracy",max_epochs=10)
"""
Implementing early stopping
"""
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,restore_best_weights=True)
tuner.search(x_train,y_train,epochs=100,validation_split=0.2,callbacks=[early_stopping_cb])
tuner.results_summary() # results_summary() prints its report directly and returns None
best_hp = tuner.get_best_hyperparameters()[0]
model = tuner.hypermodel.build(best_hp)
model.fit(x_train,y_train)
results = model.evaluate(x_test,y_test)
print(results)
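To inspect which hyperparameter choices won, keras_tuner exposes the chosen values as a dict (a small sketch):
print(best_hp.values) # e.g. {'units_layer_1': 64, 'activation_1': 'relu', 'two_layers': True, ...}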
Multiclass Classification Example
In this section, we build a model to classify Reuters newswires into 46 mutually exclusive topics. Because we have many classes, this problem is an instance of multi-class classification, and because each data point should be classified into only one category, the problem is more specifically an instance of single-label multiclass classification. If each data point could belong to multiple categories, we'd be facing a multilabel multiclass classification problem.
The Reuters dataset is a set of short newswires and their topics. There are 46 different topics; some topics are more represented than others, but each topic has at least 10 examples in the training set.
from tensorflow.keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10_000)
print("Train Data Length:",len(train_data))
print("Test Data Length:",len(test_data))
"""
As with the IMDB reviews, each sample is a list of indices
"""
print("Train Data Instance:",train_data[10])
"""
How to decode the newswires back to text
"""
word_index = reuters.get_word_index()
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()]
)
decoded_newswire = " ".join(
    [reverse_word_index.get(i - 3, "?") for i in train_data[0]]
) # Note that the indices are offset by 3 because 0, 1, and 2 are reserved indices for "padding", "start of sequence", and "unknown"
print("Train Label of Instance",train_labels[10])
"""
The label associated with an example is an integer between 0 and 45 - a topic index.
"""
x_train = vectorize_sequences(train_data) # Vectorize training data
x_test = vectorize_sequences(test_data) # Vectorized test data
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1
    return results
y_train = to_one_hot(train_labels) # Vectorized Train Labels
y_test = to_one_hot(test_labels) # Vectorized Test Labels
"""
There is a built-in way to do this in Keras
"""
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)
"""
Each layer in a stack of Dense layers can only access information present in
the output of the previous layer. If one layer drops some information relevant to the classification problem, this information can never be recovered by later
layers: each layer can potentially become an information bottleneck. In the previous example, we used 16-dimensional intermediate layers, but a
16-dimensional space may be too limited to learn to separate 46 different classes: such small layers may act as information bottlenecks, permanently dropping
relevant information.
"""
model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    # 46 units because there are 46 classes.
    # softmax activation -> the model will output a probability distribution over the 46 output classes:
    # for every input sample, the model will produce a 46-dimensional output vector, where output[i] is the
    # probability that the sample belongs to class i. The 46 scores will sum to 1
    layers.Dense(46, activation="softmax")
])
"""
The best loss function to use in this case is categorical_crossentropy. It measures the distance between two probability distributions: here, between the probability distribution output by the model and the true distribution of the labels. By minimizing the distance between these two distributions, you train the model to output something as close as possible to the true labels.
"""
model.compile(optimizer="rmsprop",loss="categorical_crossentropy",metrics=["accuracy"])
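Worth knowing: if you keep the labels as plain integers instead of one-hot encoding them, the loss to use is sparse_categorical_crossentropy, which is mathematically identical but expects integer targets. A sketch of the alternative (the rest of this example sticks with the one-hot targets):
y_train_int = np.array(train_labels) # Integer labels such as 3, instead of one-hot vectors
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])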
"""
Setting aside validation samples
"""
x_val = x_train[:1_000]
partial_x_train = x_train[1_000:]
y_val = y_train[:1_000]
partial_y_train = y_train[1_000:]
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))
# Plotting the Training and Validation Loss
loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, "bo", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
# Plotting the Validation Accuracy
plt.clf()
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
plt.plot(epochs, acc, "bo", label="Training accuracy")
plt.plot(epochs, val_acc, "b", label="Validation accuracy")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
"""
The model begins to overfit after 9 epochs -> train a fresh model from scratch for 9 epochs
"""
model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(46, activation="softmax")
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train,
          y_train,
          epochs=9,
          batch_size=512)
results = model.evaluate(x_test, y_test)
print("Final Results")
print(results)
You should avoid intermediate layers that have fewer units than the number of output classes - this can introduce information bottlenecks that degrade performance. For example, with 46 output classes, no intermediate layer should have fewer than 46 units.
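For instance, here is the kind of architecture to avoid (a sketch of what not to do; the book runs a similar experiment with a 4-unit middle layer and observes a significant drop in validation accuracy):
bottlenecked_model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="relu"), # Information bottleneck: only 4 units for 46 classes
    layers.Dense(46, activation="softmax")
])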
Regression Example
Below is an example of regression - predicting a continuous value instead of a discrete label. We are attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given data points about the suburb at the time, such as the crime rate, the local property tax rate, and so on. Each feature in the input data has a different scale.
It is inappropriate to feed into a neural network values that all take wildly different ranges. A widespread best practice for dealing with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation.
Because so few samples are available, we'll use a very small model with two intermediate layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using a small model is one way to mitigate overfitting.
The regression model ends with a single unit and no activation (it will be a linear layer). This is a typical setup for scalar regression (a regression where you're trying to predict a single continuous value). The last layer being purely linear allows the model to predict values in any range.
The mean squared error (MSE), the square of the difference between the predictions and the targets, is a widely used loss function for regression problems. The mean absolute error (MAE) is the absolute value of the difference between the predictions and the targets.
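As a tiny numeric illustration of both (hypothetical predictions and targets, not from the dataset):
preds = np.array([2.0, 3.0])
targets_example = np.array([1.0, 5.0])
mse = np.mean((preds - targets_example) ** 2)   # (1**2 + 2**2) / 2 = 2.5
mae = np.mean(np.abs(preds - targets_example))  # (1 + 2) / 2 = 1.5
In the housing problem below, where prices are in thousands of dollars, an MAE of 0.5 would mean predictions are off by $500 on average.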
from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
print(train_data.shape)
print(test_data.shape)
print(train_targets)
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std
def build_model():
    """
    Since we need to instantiate the same model multiple times, we use a function to construct it.
    """
    model = keras.Sequential([
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1)
    ])
    model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
    return model
Validating Our Approach Using K-fold Validation
Since the amount of training data is small, the validation set would be small. As a consequence, the validation scores might change a lot depending on which data points are chosen: the validation scores might have a high variance with regard to the validation split, and this would prevent us from reliably evaluating the model. Best practice in such situations is to use K-fold cross-validation:
It consists of splitting the available data into K partitions (typically K=4 or 5), instantiating K identical models, and training each one on K-1 partitions while evaluating on the remaining partition. The validation score for the model is then the average of the K validation scores obtained. In terms of code, this is straightforward.
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
    print(f"Processing Fold #{i}")
    # Prepare the validation data: data from partition #k
    val_data = train_data[i * num_val_samples: (i+1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i+1) * num_val_samples]
    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i+1) * num_val_samples:]],
        axis=0
    )
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i+1) * num_val_samples:]],
        axis=0
    )
    model = build_model()
    # Train in silent mode (verbose=0)
    model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=16, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
print(all_scores)
print(np.mean(all_scores))
num_epochs = 500
all_mae_histories = []
for i in range(k):
    print(f"Processing Fold #{i}")
    val_data = train_data[i * num_val_samples : (i+1) * num_val_samples] # Prepare the validation data: data from partition #k
    val_targets = train_targets[i * num_val_samples : (i+1) * num_val_samples]
    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i+1) * num_val_samples:]],
        axis=0
    )
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i+1) * num_val_samples:]],
        axis=0
    )
    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets, validation_data=(val_data, val_targets), epochs=num_epochs, batch_size=16, verbose=0)
    mae_history = history.history["val_mae"]
    all_mae_histories.append(mae_history)
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
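# Equivalently, since every fold runs the same number of epochs, the same per-epoch
# average can be computed directly with NumPy (an alternative sketch):
average_mae_history = np.mean(all_mae_histories, axis=0)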
plt.plot(range(1, len(average_mae_history) + 1),average_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()
plt.clf()
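# Omit the first 10 data points, which are on a different scale from the rest of the curve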
truncated_mae_history = average_mae_history[10:]
plt.plot(range(1, len(truncated_mae_history) + 1), truncated_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()