Deep Learning with Python Chapters 3 and 4

These chapters give an "Introduction to Keras and TensorFlow" and an introduction to Classification and Regression with Keras.

Introduction to Keras and TensorFlow


TensorFlow is a Python-based, free, open source machine learning platform, developed primarily by Google. Much like NumPy, the primary purpose of TensorFlow is to enable engineers and researchers to manipulate mathematical expressions over numerical tensors. But TensorFlow differs in scope:

  • It can automatically compute the gradient of any differentiable expression, making it highly suitable for machine learning
  • It can run not only on CPUs, but also on GPUs and TPUs, highly parallelized hardware accelerators
  • Computation defined in TensorFlow can easily be distributed across many machines
  • TensorFlow programs can be exported to other runtimes, such as C++, JavaScript, or TensorFlow Lite (for applications running on mobile or embedded devices). This makes TensorFlow applications easy to deploy in practical settings.

TensorFlow is more than a library - it's a platform, home to an ecosystem of components:

  • TF-agents for reinforcement learning research
  • TFX for industry-strength machine learning workflow management
  • TensorFlow Serving for production deployment
  • TensorFlow Hub, a repository of pretrained models

Keras is a deep learning API for Python, built on top of TensorFlow, that provides a convenient way to define and train any kind of deep learning model. It was initially developed for research, with the aim of enabling fast deep learning experimentation.

Relationship Between Keras, TensorFlow, and the Machine

It is highly recommended that you run deep learning code on a modern NVIDIA GPU rather than on your own computer's CPU. Some applications - in particular, image processing with convolutional neural networks - will be excruciatingly slow on CPU, even a fast multicore CPU. To do deep learning, you can start with the free GPU runtime from Colaboratory, a hosted notebook service offered by Google - this is only good for small workloads. If you want to scale up, you will need to use GPU instances on Google Cloud or Amazon EC2. It's best to run Keras on a Unix workstation - if you are on Windows, consider dual-booting Ubuntu.
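A quick sanity check (a sketch, not from the book; it assumes the standard TensorFlow 2.x tf.config API) is to list the devices TensorFlow can see - on Colab this should report one GPU once the GPU runtime is enabled:

import tensorflow as tf
# List the GPUs visible to TensorFlow; an empty list means you are running on CPU only.
print(tf.config.list_physical_devices("GPU"))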

my_list = [3,2,5]
sorted(my_list)
out[2]

[2, 3, 5]

# How to Install packages in Colab: !pip install <package_name>

## Make sure to turn on the GPU runtime by going to Runtime > Change Runtime Type
out[3]

Training a neural network revolves around the following concepts:

  • First, low-level tensor manipulation - the infrastructure that underlies all modern machine learning. This translates to TensorFlow APIs:
    • Tensors, including special tensors that store the network's state (variables)
    • Tensor operations such as addition, relu, and matmul
    • Backpropagation, a way to compute the gradient of mathematical expressions (handled in TensorFlow via the GradientTape object)
  • Second, high-level deep learning concepts. This translates to Keras APIs:
    • Layers, which are combined into a model
    • A loss function, which defines the feedback signal used for learning
    • An optimizer, which determines how learning proceeds
    • Metrics to evaluate model performance, such as accuracy
    • A training loop that performs mini-batch stochastic gradient descent
import tensorflow as tf
"""
Tensors need to be created with some initial value.
All ones or All Zeros Tensors:
"""
x = tf.ones(shape=(2,1)) # Equivalent to np.ones(shape=(2,1))
print(x)
x = tf.zeros(shape=(2,1)) # Equivalent to np.zeros(shape=(2,1))
print(x)
"""
Random Tensors:
"""
x = tf.random.normal(shape=(3,1),mean=0.,stddev=1.) # Tensor of random values drawn from a normal distribution with mean 0 and a standard deviation of 1. Equivalent to np.random.normal(size=(3,1),loc=0.,scale=1.)
print(x)
x = tf.random.uniform(shape=(3,1),minval=0.,maxval=1.) # Tensor of Random values drawn from a uniform distribution between 0 and 1.
print(x)
"""
A significant difference between NumPy arrays and TensorFlow tensors is that TensorFlow tensors aren't assignable: they're constant.
"""
import numpy as np
x = np.ones(shape=(2,2))
x[0,0] = 0. # This is acceptable for NumPy arrays
x = tf.ones(shape=(2,2))
try:
  x[0,0] = 2. # This is unacceptable in TensorFlow
except Exception as e:
  print(e)
out[5]

tf.Tensor(
[[1.]
[1.]], shape=(2, 1), dtype=float32)
tf.Tensor(
[[0.]
[0.]], shape=(2, 1), dtype=float32)
tf.Tensor(
[[-0.6934365 ]
[ 0.00505054]
[ 0.89546114]], shape=(3, 1), dtype=float32)
tf.Tensor(
[[0.40523243]
[0.5153372 ]
[0.7787037 ]], shape=(3, 1), dtype=float32)
'tensorflow.python.framework.ops.EagerTensor' object does not support item assignment

"""
Creating a TensorFlow variable
"""
v = tf.Variable(initial_value=tf.random.normal(shape=(3,1)))
print(v)
# The state of the variable can be modified via its `assign` method
v.assign(tf.ones((3,1)))
print(v)
# It also works for a subset of the coefficients
v[0,0].assign(3.)
print(v)
# `assign_add()` and `assign_sub()` are efficient equivalents of += and -=
v.assign_add(tf.ones((3,1)))
print(v)
out[6]

<tf.Variable 'Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[-0.5112297 ],
[ 0.5778578 ],
[-0.00489818]], dtype=float32)>
<tf.Variable 'Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[1.],
[1.],
[1.]], dtype=float32)>
<tf.Variable 'Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[3.],
[1.],
[1.]], dtype=float32)>
<tf.Variable 'Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[4.],
[2.],
[2.]], dtype=float32)>

# Just like NumPy, TensorFlow offers a large collection of tensor operations to express mathematical formulas:
a = tf.ones((2,2))
a = a + tf.ones((2,2))
b = tf.square(a) # Take the square
c = tf.sqrt(a)
d = b + c # Add Two Tensors, element-wise
e = tf.matmul(a,b) # Take the dot product of two tensors
e *= d # Multiply two tensors (element-wise)
out[7]
"""
Here's something NumPy can't do: retrieve the gradient of any differentiable expression with respect to any of its inputs.
Just open a `GradientTape` scope, apply some computation to one or several input tensors, and retrieve the gradient of the result with respect to the inputs
"""
input_var = tf.Variable(initial_value=3.)
with tf.GradientTape() as tape:
  result = tf.square(input_var)
gradient = tape.gradient(result,input_var)
"""
This is most commonly used to retrieve the gradients of the loss of a model with respect to its weights: gradient = tape.gradient(loss,weights)
Only *trainable variables* are tracked by default. With a constant tensor, you'd have to manually mark it as being watched
"""
input_const = tf.constant(3.)
with tf.GradientTape() as tape:
  tape.watch(input_const) # Manually watch constant Tensor
  result = tf.square(input_const) 
gradient = tape.gradient(result,input_const)

time = tf.Variable(0.)
with tf.GradientTape() as outer_tape:
  with tf.GradientTape() as inner_tape:
    position = 4.9 * time ** 2
  speed = inner_tape.gradient(position,time)
acceleration = outer_tape.gradient(speed,time) # We use the outer tape to compute the gradient of the gradient computed by the inner tape
print("Acceleration = {:.1f}".format(acceleration))
out[8]

Acceleration = 9.8

End-to-End Example: A Linear Classifier in Pure TensorFlow

"""
Coming up with some nicely linearly separable synthetic data to work with: two classes of points in a 2D plane. 
"""
num_samples_per_class = 1000
negative_samples = np.random.multivariate_normal(mean=[0,3],cov=[[1,0.5],[0.5,1]],size=num_samples_per_class) # Generate the first class of points: 1000 random 2D points. cov corresponds to an oval-like point cloud oriented from bottom left to top right
positive_samples = np.random.multivariate_normal(mean=[3,0],cov=[[1,0.5],[0.5,1]],size=num_samples_per_class) # Generate the other class with a different mean and the same covariance matrix
import matplotlib.pyplot as plt 
def plot_data():
    fig, ax = plt.subplots(1,1,layout="constrained")
    ax.scatter(negative_samples[:,0],negative_samples[:,1],c='y',label="Negative Samples")
    ax.scatter(positive_samples[:,0],positive_samples[:,1],c='b',label="Positive Samples")
    ax.legend()
    ax.set_title("Visualizing Data")
    return ax
plot_data()
plt.show()
inputs = np.vstack((negative_samples,positive_samples)).astype("float32")
targets = np.vstack((np.zeros((num_samples_per_class,1),dtype=np.float32), np.ones((num_samples_per_class,1),dtype=np.float32)))
"""
A linear classifier is an affine transformation `(prediction = W * input + b)` trained to minimize the square of the difference between the predictions and the targets
"""
input_dim = 2 # The input will be 2D points
output_dim = 1 # The output prediction will be a single score per sample (0 or 1)
W = tf.Variable(initial_value=tf.random.uniform(shape=(input_dim,output_dim)))
b = tf.Variable(initial_value=tf.zeros(shape=(output_dim,)))

def model(inputs):
    """
    This is our forward pass function
    """
    return tf.matmul(inputs,W)+b

def square_loss(targets,predictions):
    # per_sample_losses will be a tensor with the same shape as the targets and predictions, containing per-sample loss scores
    per_sample_losses = tf.square(targets-predictions)
    # We need to average these per-sample loss scores into a single scalar loss value: this is what reduce_mean does
    return tf.reduce_mean(per_sample_losses)

learning_rate = 0.1 
def training_step(inputs,targets):
    """
    The training step receives some training data and updates the weights W and b so as to minimize the loss on the data
    """
    with tf.GradientTape() as tape:
        """
        Forward pass, inside a gradient tape scope
        """
        predictions = model(inputs)
        loss = square_loss(targets,predictions)
    grad_loss_wrt_W, grad_loss_wrt_b = tape.gradient(loss,[W,b])
    """
    Update the weights
    """
    W.assign_sub(grad_loss_wrt_W * learning_rate)
    b.assign_sub(grad_loss_wrt_b * learning_rate)
    return loss 

"""
Below, for simplicity, we will do batch training instead of mini-batch training. On the one hand, this means each training step will take longer to run;
on the other, it means that each gradient update will be much more effective at reducing the loss on the training data. As a result, we will need fewer steps of training, and we
should use a larger learning rate than we would typically use for mini-batch training.
"""
for step in range(40):
    loss = training_step(inputs,targets)
    print(f"Loss at step {step}: {loss:.4f}")

ax = plot_data()
x = np.linspace(-1,4,100)
"""
> Recall that the prediction value for a given point [x, y] is simply prediction == [[w1], [w2]] • [x, y] + b == w1 * x + w2 * y + b. Thus, class 0 is defined as w1 * x + w2 * y + b < 0.5, and class 1 is defined as w1 * x + w2 * y + b > 0.5. 
"""
y = -W[0] / W[1] * x + (0.5-b) / W[1] # Line equation for the decision boundary: set w1 * x + w2 * y + b = 0.5 and solve for y
ax.plot(x,y,"-r")
plt.show()
"""
This is really what a linear classifier is all about: finding the parameters of a line (or, in higher-dimensional spaces, a hyperplane) neatly separating two classes of data
"""
out[9]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Loss at step 0: 8.3394
Loss at step 1: 1.2986
Loss at step 2: 0.3323
Loss at step 3: 0.1840
Loss at step 4: 0.1533
Loss at step 5: 0.1402
Loss at step 6: 0.1304
Loss at step 7: 0.1216
Loss at step 8: 0.1137
Loss at step 9: 0.1064
Loss at step 10: 0.0997
Loss at step 11: 0.0936
Loss at step 12: 0.0880
Loss at step 13: 0.0828
Loss at step 14: 0.0781
Loss at step 15: 0.0738
Loss at step 16: 0.0698
Loss at step 17: 0.0661
Loss at step 18: 0.0628
Loss at step 19: 0.0597
Loss at step 20: 0.0569
Loss at step 21: 0.0543
Loss at step 22: 0.0520
Loss at step 23: 0.0498
Loss at step 24: 0.0478
Loss at step 25: 0.0460
Loss at step 26: 0.0443
Loss at step 27: 0.0428
Loss at step 28: 0.0414
Loss at step 29: 0.0401
Loss at step 30: 0.0389
Loss at step 31: 0.0378
Loss at step 32: 0.0368
Loss at step 33: 0.0359
Loss at step 34: 0.0351
Loss at step 35: 0.0343
Loss at step 36: 0.0336
Loss at step 37: 0.0330
Loss at step 38: 0.0324
Loss at step 39: 0.0319

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

The fundamental data structure in neural networks is the layer. A layer is a data processing module that takes as input one or more tensors and that outputs one or more tensors. Some layers are stateless, but more frequently layers have a state: the layer's weights, one or several tensors learned with stochastic gradient descent, which contain the network's knowledge. Different types of layers are appropriate in different situations:

  • Densely connected layers (fully connected layers) are appropriate for vector data of shape (samples, features) (the Dense Keras class)
  • Sequence data of shape (samples, timesteps, features) is typically processed by recurrent layers, such as an LSTM layer, or by 1D convolution layers (Conv1D)
  • Image data, stored in rank-4 tensors, is usually processed by 2D convolution layers (Conv2D)
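A minimal sketch of what instantiating these layer types looks like (illustrative only; the unit counts and kernel sizes below are arbitrary choices, not values from the book):

from tensorflow.keras import layers
dense = layers.Dense(32, activation="relu")  # Vector data: (samples, features)
lstm = layers.LSTM(32)                       # Sequence data: (samples, timesteps, features)
conv1d = layers.Conv1D(32, kernel_size=3)    # Sequence data, processed convolutionally
conv2d = layers.Conv2D(32, kernel_size=3)    # Image data: rank-4 tensors (samples, height, width, channels)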

Building deep learning models in Keras is done by clipping together compatible layers to form useful data-transformation pipelines. Everything in Keras is either a Layer or something that closely interacts with a Layer. A Layer is an object that encapsulates some state (weights) and some computation (a forward pass). The weights are typically defined in build() and the computation is defined in the call() method. Just like with LEGO bricks, you can only "clip" together layers that are compatible. The notion of layer compatibility refers specifically to the fact that every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape. When using Keras, you don't have to worry about size compatibility most of the time, because the layers you add to your models are dynamically built to match the shape of the incoming layer.

A deep learning model is a graph of layers. In Keras, that's the Model class. Until now, you've only seen Sequential models (a subclass of Model), which are simple stacks of layers, mapping a single input to a single output. As you move forward, you'll be exposed to a much broader variety of network topologies:

  • Two Branch Networks
  • Multihead networks
  • Residual Connections

The topology of a model defines a hypothesis space. By choosing a network topology, you constrain your space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data. What you're searching for is a good set of values for the weight tensors involved in these tensor operations. To learn from data, you have to make assumptions about it. The structure of your hypothesis space is extremely important, and it encodes the assumptions you make about the problem, the prior knowledge that the model starts with.
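As a preview of what a non-sequential topology looks like (a sketch using the Keras functional API, which the book covers later - the layer sizes here are arbitrary), a residual connection adds a block's input back to its output:

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(16,))
features = layers.Dense(16, activation="relu")(inputs)
residual = features                          # Keep a reference to the block's input
features = layers.Dense(16, activation="relu")(features)
features = layers.add([features, residual])  # Residual connection: add the input back in
outputs = layers.Dense(1, activation="sigmoid")(features)
toy_model = keras.Model(inputs=inputs, outputs=outputs)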

Transformer Architecture

from tensorflow import keras 

class SimpleDense(keras.layers.Layer):
    """
    All Keras layers inherit from the base Layer class
    """
    def __init__(self, units, activation=None):
        super().__init__()
        self.units = units 
        self.activation = activation 
    
    def build(self,input_shape):
        """
        Weight creation takes place in the build() method
        """
        input_dim = input_shape[-1]
        """
        add_weight() is a shortcut method for creating weights. It is also possible to create standalone variables and assign them to layer attributes, like self.W = tf.Variable(tf.random.uniform(w_shape))
        """
        self.W = self.add_weight(shape=(input_dim, self.units), initializer="random_normal")
        self.b = self.add_weight(shape=(self.units,), initializer="zeros")
    
    def call(self,inputs):
        """
        The forward pass computation is defined in the call() method
        """
        y = tf.matmul(inputs,self.W) + self.b 
        if self.activation is not None:
            y = self.activation(y)
        return y 
my_dense = SimpleDense(units=32, activation=tf.nn.relu) # Instantiate our layer, defined previously 
input_tensor = tf.ones(shape=(2,784)) # Create some test inputs 
output_tensor = my_dense(input_tensor) # Call the layer on the inputs, just like a function
print(output_tensor.shape)
out[11]

(2, 32)

**compile() step:** Once the model architecture is defined, you choose three more things:

  • Loss function (objective function) - the quantity that will be minimized during training. It represents a measure of success for the task at hand
  • Optimizer - Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD)
  • Metrics - measures of success you want to monitor during training and validation, such as classification accuracy. Unlike the loss, training will not optimize directly for these metrics. As such, metrics don't need to be differentiable.
model = keras.Sequential([keras.layers.Dense(1)]) # Define a linear classifier 
model.compile(
  optimizer="rmsprop", # Specify the optiizer by name: RMSprop  
  loss="mean_squared_error", #  Specify the loss by name
  metrics=["accuracy"] # Specify a list of metrics 
)

Choosing the right loss function for the right problem is extremely important: your network will take any shortcut it can to minimize the loss, so if the objective doesn't fully correlate with success for the task at hand, your network will end up doing things you may not have wanted.

The fit() method implements the training loop itself. Key arguments:

  • The data (inputs and targets) to train on. It will typically be passed either in the form of NumPy arrays or a TensorFlow Dataset object.
  • The number of epochs to train on: how many times the training loop should iterate over the data passed
  • The batch size to use within each epoch of mini-batch gradient descent: the number of training examples considered to compute the gradients for one weight update step
history = model.fit(
  inputs, # The input examples, as a NumPy array
  targets, # The corresponding targets, as a NumPy array
  epochs=5, # The training loop will iterate over the data 5 times 
  batch_size=128 # The training loop will iterate over the data in batches of 128 examples
)

The fit() method returns a History object. This object contains a history field, which is a dict mapping keys such as "loss" or specific metric names to the list of their per-epoch values.
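For example (assuming the fit() call above ran and its return value was stored in history), the recorded values can be inspected directly:

print(history.history.keys())   # e.g. dict_keys(['loss', 'accuracy'])
print(history.history["loss"])  # One loss value per epoch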


"""
The goal of machine learning is to obtain models that perform well in general. To keep an eye on how the model does on new data, it's standard practice to reserve a subset of the training data as *validation data* - you will use the validation data to compute a loss value and metric values.
"""
model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.1), loss=keras.losses.MeanSquaredError(), metrics=[keras.metrics.BinaryAccuracy()])

"""
To avoid having samples from only one class in the validation data, shuffle the inputs and targets using a random permutation of indices
"""
indices_permutations = np.random.permutation(len(inputs))
shuffled_inputs = inputs[indices_permutations]
shuffled_targets = targets[indices_permutations]

"""
Reserve 30% of the training inputs and targets for validation 
"""
num_validation_samples = int(0.3 * len(inputs))
val_inputs = shuffled_inputs[:num_validation_samples]
val_targets = shuffled_targets[:num_validation_samples]
training_inputs = shuffled_inputs[num_validation_samples:]
training_targets = shuffled_targets[num_validation_samples:]
model.fit(
    training_inputs, # Training data, used to update weights 
    training_targets,
    epochs=5,
    batch_size=16,
    validation_data=(val_inputs,val_targets) # Validation data, used only to monitor the validation loss and metrics
)

"""
You can use the evaluate() method to compute the validation loss and metrics after the training is complete.
This method will iterate in batches over the data passed and return a list of scalars, where the first entry is the validation loss and the following entries are the validation metrics. If the model has no metrics, only the validation loss is returned.
"""
loss_and_metrics = model.evaluate(val_inputs, val_targets, batch_size=128)

"""
Once you've trained your model, you're going to want to use it to make predictions on new data. This is called *inference*. This is done with the `predict()` method, which will iterate over the data in small batches and return a NumPy array of predictions. It can also process TensorFlow Dataset objects.
"""
predictions = model.predict(new_inputs, batch_size=128) # Take a NumPy array or a dataset and return a NumPy array

Getting Started With Neural Networks: Classification and Regression

Two-class classification, or binary classification, is one of the most common kinds of machine learning problems. Below is an example of binary classification with the IMDB dataset.

out[14]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 10s 1us/step
Train Data[0]: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Train Label[0]: 1
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 1s 1us/step
x_train[0]: [0. 1. 1. ... 0. 0. 0.]

from tensorflow.keras.datasets import imdb
"""
- IMDB reviews are sequences of words that have been turned into sequences of integers, where each integer stands for a specific word in a dictionary
- num_words=10000 means to only keep the top 10,000 most frequently occurring words in the training data
- labels are either 0 (negative) or 1 (positive)
"""
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Train Data[0]:",train_data[0])
print("Train Label[0]:",train_labels[0])

"""
Decoding Reviews back to text
"""
word_index = imdb.get_word_index() # word_index is a dictionary mapping words to an integer index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]) # Reverses it, mapping integer indices to words
decoded_review = " ".join([reverse_word_index.get(i - 3, "?") for i in train_data[0]]) # Decodes the review. Note that the indices are offset by 3, because 0, 1m and 2 are reserved indices for "padding", "start of sequence", and "unknown"

"""
Multi-hot encoding to prepare the data
"""
import numpy as np
def vectorize_sequences(sequences, dimension=10_000):
    results = np.zeros((len(sequences), dimension)) # Create an all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i, j] = 1. # Set specific indices of results[i] to 1s
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

print("x_train[0]:",x_train[0])

y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")

"""
There are two key architecture decisions to be made about stacks of Dense layers:
1. How many layers to use
2. How many units to use for each layer
"""

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # The first argument is the number of units in the layer
    # Having 16 units means the weight matrix W will have shape (input_dimension, 16): the dot
    # product with W will project the input data onto a 16-dimensional representation space.
    # You can intuitively understand the dimensionality of your representation space as "how
    # much freedom you're allowing the model to have when learning internal representations."
    # Having more units allows the model to learn more-complex representations, but it makes
    # the model more computationally expensive and may lead to learning unwanted patterns (overfitting the training data)
    layers.Dense(16, activation="relu"), # Each dense layer computes: output = relu(dot(input, W) + b)
    layers.Dense(16, activation="relu"),
    # The sigmoid activation function outputs a probability (a score between 0 and 1) indicating
    # how likely the sample is to have a target "1". A relu (rectified linear unit) is a function meant to zero out
    # negative values, whereas a sigmoid "squashes" arbitrary values into the [0,1] interval
    layers.Dense(1, activation="sigmoid")
])

"""
Choosing a Loss Function and Optimizer
----------------------------------------------------

Because you're facing a binary classification problem and the output of your model is a probability
(you end your model with a single-unit layer with a sigmoid activation), it's best to use `binary_crossentropy`
loss. Crossentropy is the best choice when you're working with models that output probabilities. Crossentropy is a quantity
from the field of information theory that measures the distance between probability distributions, or, in this case,
between the ground-truth distribution and your predictions.

The rmsprop optimizer is a good default choice for virtually any problem.
"""
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
x_val = x_train[:10_000]
partial_x_train = x_train[10_000:]
y_val = y_train[:10_000]
partial_y_train = y_train[10_000:]
history = model.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val)
)
out[15]

Train Data[0]: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Train Label[0]: 1
x_train[0]: [0. 1. 1. ... 0. 0. 0.]
Epoch 1/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 4s 84ms/step - accuracy: 0.6825 - loss: 0.6134 - val_accuracy: 0.8635 - val_loss: 0.4167
Epoch 2/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - accuracy: 0.8900 - loss: 0.3608 - val_accuracy: 0.8793 - val_loss: 0.3229
Epoch 3/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.9167 - loss: 0.2604 - val_accuracy: 0.8849 - val_loss: 0.2903
Epoch 4/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9344 - loss: 0.2053 - val_accuracy: 0.8888 - val_loss: 0.2758
Epoch 5/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.9466 - loss: 0.1701 - val_accuracy: 0.8879 - val_loss: 0.2758
Epoch 6/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.9532 - loss: 0.1464 - val_accuracy: 0.8866 - val_loss: 0.2892
Epoch 7/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 42ms/step - accuracy: 0.9612 - loss: 0.1250 - val_accuracy: 0.8843 - val_loss: 0.2908
Epoch 8/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 21ms/step - accuracy: 0.9679 - loss: 0.1078 - val_accuracy: 0.8852 - val_loss: 0.3016
Epoch 9/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.9728 - loss: 0.0949 - val_accuracy: 0.8833 - val_loss: 0.3301
Epoch 10/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9768 - loss: 0.0849 - val_accuracy: 0.8820 - val_loss: 0.3315
Epoch 11/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9832 - loss: 0.0700 - val_accuracy: 0.8785 - val_loss: 0.3702
Epoch 12/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.9863 - loss: 0.0610 - val_accuracy: 0.8797 - val_loss: 0.3703
Epoch 13/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9895 - loss: 0.0505 - val_accuracy: 0.8753 - val_loss: 0.3877
Epoch 14/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9912 - loss: 0.0446 - val_accuracy: 0.8780 - val_loss: 0.4099
Epoch 15/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - accuracy: 0.9928 - loss: 0.0367 - val_accuracy: 0.8637 - val_loss: 0.4589
Epoch 16/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9936 - loss: 0.0343 - val_accuracy: 0.8696 - val_loss: 0.4572
Epoch 17/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9962 - loss: 0.0269 - val_accuracy: 0.8721 - val_loss: 0.4763
Epoch 18/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - accuracy: 0.9964 - loss: 0.0228 - val_accuracy: 0.8712 - val_loss: 0.5129
Epoch 19/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 31ms/step - accuracy: 0.9972 - loss: 0.0219 - val_accuracy: 0.8653 - val_loss: 0.5321
Epoch 20/20
30/30 ━━━━━━━━━━━━━━━━━━━━ 1s 28ms/step - accuracy: 0.9985 - loss: 0.0178 - val_accuracy: 0.8701 - val_loss: 0.5502

history_dict = history.history

"""
The dictionary contains four entries: one per metric that was being monitored during training and validation. Plotting Training and Validation Loss side by side:
"""
import matplotlib.pyplot as plt
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, "bo", label="Training loss")
plt.plot(epochs, val_loss_values, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

plt.clf() # Clears the figure
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
plt.plot(epochs, acc, "bo", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
out[16]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Above, you can see that the training loss decreases with every epoch, and the training accuracy increases with every epoch - that is what you would expect with gradient descent optimization, but this isn't the case for the validation loss and accuracy. The above graphs are an example of overfitting: after the fourth epoch, you're overoptimizing on the training data, and you end up learning representations that are specific to the training data and don't generalize to data outside of the training set.
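An alternative to hand-picking the number of epochs (a sketch, not the approach used next in these notes; the same EarlyStopping callback shows up later with KerasTuner) is to stop training automatically once the validation loss stops improving:

# Sketch: build a fresh model, then stop when val_loss stops improving and keep the best weights seen.
sketch_model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
sketch_model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
sketch_model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512,
                 validation_data=(x_val, y_val), callbacks=[early_stop])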

model = keras.Sequential([
    layers.Dense(16,activation="relu"),
    layers.Dense(16,activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)
print("Results [test loss, test accuracy]=",results)
predictions = model.predict(x_test)
out[18]

Epoch 1/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 2s 20ms/step - accuracy: 0.7127 - loss: 0.5789
Epoch 2/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.9011 - loss: 0.2972
Epoch 3/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.9254 - loss: 0.2216
Epoch 4/4
49/49 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 0.9382 - loss: 0.1817
782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.8772 - loss: 0.3039
Results [test loss, test accuracy]= [0.29987478256225586, 0.8802000284194946]
782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step

!pip install keras-tuner -q
out[19]


import keras_tuner

def build_model(hp):
  model = keras.models.Sequential()
  model.add(layers.Dense(hp.Choice('units_layer_1',[16,32,64,128]),activation=hp.Choice('activation_1',['relu','tanh'])))
  if hp.Boolean('two_layers'):
    model.add(layers.Dense(hp.Choice('units_layer_2',[16,32,64,128]),activation=hp.Choice('activation_2',['relu','tanh'])))
  if hp.Boolean('three_layers'):
    model.add(layers.Dense(hp.Choice('units_layer_3',[16,32,64,128]),activation=hp.Choice('activation_3',['relu','tanh'])))
  model.add(layers.Dense(1,activation="sigmoid"))
  model.compile(optimizer="rmsprop",loss=hp.Choice("loss",["binary_crossentropy","mse"]),metrics=["accuracy"])
  return model
tuner = keras_tuner.Hyperband(hypermodel=build_model,objective="val_accuracy",max_epochs=10)

"""
Implementing early stopping
"""
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,restore_best_weights=True)

tuner.search(x_train,y_train,epochs=100,validation_split=0.2,callbacks=[early_stopping_cb])
out[20]

Trial 30 Complete [00h 00m 27s]
val_accuracy: 0.8935999870300293

Best val_accuracy So Far: 0.8935999870300293
Total elapsed time: 00h 07m 13s

print(tuner.results_summary())
best_hp = tuner.get_best_hyperparameters()[0]
model = tuner.hypermodel.build(best_hp)
model.fit(x_train,y_train)
results = model.evaluate(x_test,y_test)
print(results)
out[21]

Results summary
Results in ./untitled_project
Showing 10 best trials
Objective(name="val_accuracy", direction="max")

Trial 0022 summary
Hyperparameters:
units_layer_1: 32
activation_1: tanh
two_layers: True
three_layers: False
loss: binary_crossentropy
units_layer_3: 64
activation_3: tanh
units_layer_2: 64
activation_2: relu
tuner/epochs: 4
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.8935999870300293

Trial 0029 summary
Hyperparameters:
units_layer_1: 128
activation_1: relu
two_layers: True
three_layers: True
loss: binary_crossentropy
units_layer_3: 64
activation_3: tanh
units_layer_2: 32
activation_2: tanh
tuner/epochs: 10
tuner/initial_epoch: 0
tuner/bracket: 0
tuner/round: 0
Score: 0.8935999870300293

Trial 0014 summary
Hyperparameters:
units_layer_1: 16
activation_1: relu
two_layers: False
three_layers: True
loss: binary_crossentropy
tuner/epochs: 4
tuner/initial_epoch: 2
tuner/bracket: 2
tuner/round: 1
units_layer_3: 16
activation_3: relu
tuner/trial_id: 0000
units_layer_2: 64
activation_2: relu
Score: 0.8925999999046326

Trial 0012 summary
Hyperparameters:
units_layer_1: 64
activation_1: relu
two_layers: False
three_layers: True
loss: binary_crossentropy
units_layer_3: 64
activation_3: relu
units_layer_2: 128
activation_2: relu
tuner/epochs: 4
tuner/initial_epoch: 2
tuner/bracket: 2
tuner/round: 1
tuner/trial_id: 0006
Score: 0.8921999931335449

Trial 0019 summary
Hyperparameters:
units_layer_1: 32
activation_1: relu
two_layers: False
three_layers: True
loss: binary_crossentropy
units_layer_3: 64
activation_3: relu
units_layer_2: 32
activation_2: relu
tuner/epochs: 4
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.8920000195503235

Trial 0006 summary
Hyperparameters:
units_layer_1: 64
activation_1: relu
two_layers: False
three_layers: True
loss: binary_crossentropy
units_layer_3: 64
activation_3: relu
units_layer_2: 128
activation_2: relu
tuner/epochs: 2
tuner/initial_epoch: 0
tuner/bracket: 2
tuner/round: 0
Score: 0.8916000127792358

Trial 0013 summary
Hyperparameters:
units_layer_1: 64
activation_1: tanh
two_layers: True
three_layers: False
loss: binary_crossentropy
units_layer_3: 16
activation_3: relu
tuner/epochs: 4
tuner/initial_epoch: 2
tuner/bracket: 2
tuner/round: 1
units_layer_2: 16
activation_2: relu
tuner/trial_id: 0001
Score: 0.8913999795913696

Trial 0020 summary
Hyperparameters:
units_layer_1: 16
activation_1: relu
two_layers: True
three_layers: False
loss: mse
units_layer_3: 128
activation_3: relu
units_layer_2: 128
activation_2: tanh
tuner/epochs: 4
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.8912000060081482

Trial 0001 summary
Hyperparameters:
units_layer_1: 64
activation_1: tanh
two_layers: True
three_layers: False
loss: binary_crossentropy
units_layer_3: 16
activation_3: relu
tuner/epochs: 2
tuner/initial_epoch: 0
tuner/bracket: 2
tuner/round: 0
units_layer_2: 16
activation_2: relu
Score: 0.890999972820282

Trial 0027 summary
Hyperparameters:
units_layer_1: 32
activation_1: relu
two_layers: True
three_layers: True
loss: binary_crossentropy
units_layer_3: 64
activation_3: relu
units_layer_2: 16
activation_2: relu
tuner/epochs: 10
tuner/initial_epoch: 0
tuner/bracket: 0
tuner/round: 0
Score: 0.890999972820282
None
782/782 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8339 - loss: 0.3725
782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8810 - loss: 0.2865
[0.28666749596595764, 0.881600022315979]

Multiclass Classification Example

In this section, we build a model to classify Reuters newswires into 46 mutually exclusive topics. Because we have many classes, this problem is an instance of multiclass classification, and because each data point should be classified into only one category, the problem is more specifically an instance of single-label multiclass classification. If each data point could belong to multiple categories, we'd be facing a multilabel multiclass classification problem.

The Reuters dataset is a set of short newswires and their topics. There are 46 different topics; some topics are more represented than others, but each topic has at least 10 examples in the training set.

from tensorflow.keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10_000)

print("Train Data Length:",len(train_data))
print("Test Data Length:",len(test_data))
"""
As with the IMDB reviews, each sample is a list of indices
"""
print("Train Data Instance:",train_data[10])

"""
How to decode the newswires back to text
"""
word_index = reuters.get_word_index()
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()]
)
decoded_newswire = " ".join(
    [reverse_word_index.get(i - 3, "?") for i in train_data[0]]
) # Note that the indices are offset by 3 because 0, 1, and 2 are reserved indices for "padding", "start of sequence", and "unknown"
print("Train Label of Instance",train_labels[10])
"""
The label associated with an example is an integer between 0 and 45 - a topic index.
"""
x_train = vectorize_sequences(train_data) # Vectorize training data
x_test = vectorize_sequences(test_data) # Vectorized test data

def to_one_hot(labels,dimension=46):
  results = np.zeros((len(labels), dimension))
  for i, label in enumerate(labels):
    results[i, label] = 1
  return results
y_train = to_one_hot(train_labels) # Vectorized Train Labels
y_test = to_one_hot(test_labels) # Vectorized Test Labels
"""
There is a built in way to do this in Keras
"""
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)
model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(46, activation="softmax"),
])
"""
Each layer in a stack of Dense layers can only access information present in
the output of the previous layer. If one layer drops some information relevant to the classification problem, this information can never be recovered by later
layers: each layer can potentially become an information bottleneck. In the previous example, we used 16-dimensional intermediate layers, but a
16-dimensional space may be too limited to learn to separate 46 different classes: such small layers may act as information bottlenecks, permanently dropping
relevant information.
"""
model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    # 46 units because 46 classes
    # softmax activation -> the model will output a probability distribution
    # over the 46 different output classes - for every input sample, the model will produce a 46-dimensional output vector, where `output[i]` is the probability that the sample belongs to class i; the 46 scores will sum to 1
    layers.Dense(46, activation="softmax")
])
"""
The best loss function to use in this case is categorical_crossentropy. It measures the distance between two probability distributions: here, between the probability distribution output by the model and the true distribution of the labels. By minimizing the distance between these two distributions, you train the model to output something as close as possible to the true labels.
"""
model.compile(optimizer="rmsprop",loss="categorical_crossentropy",metrics=["accuracy"])
out[23]

Train Data Length: 8982
Test Data Length: 2246
Train Data Instance: [1, 245, 273, 207, 156, 53, 74, 160, 26, 14, 46, 296, 26, 39, 74, 2979, 3554, 14, 46, 4689, 4329, 86, 61, 3499, 4795, 14, 61, 451, 4329, 17, 12]
Train Label of Instance 3

"""
Setting aside validation samples
"""
x_val = x_train[:1_000]
partial_x_train = x_train[1_000:]
y_val = y_train[:1_000]
partial_y_train = y_train[1_000:]
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))
# Plotting the Training and Validation Loss
loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, "bo", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
# Plotting the Validation Accuracy
plt.clf()
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
plt.plot(epochs, acc, "bo", label="Training accuracy")
plt.plot(epochs, val_acc, "b", label="Validation accuracy")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
out[24]

Epoch 1/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 4s 167ms/step - accuracy: 0.3650 - loss: 3.3407 - val_accuracy: 0.6030 - val_loss: 1.9616
Epoch 2/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 2s 18ms/step - accuracy: 0.6418 - loss: 1.7109 - val_accuracy: 0.6750 - val_loss: 1.4404
Epoch 3/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - accuracy: 0.7138 - loss: 1.2762 - val_accuracy: 0.7170 - val_loss: 1.2448
Epoch 4/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - accuracy: 0.7742 - loss: 1.0195 - val_accuracy: 0.7520 - val_loss: 1.1196
Epoch 5/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.8142 - loss: 0.8511 - val_accuracy: 0.7740 - val_loss: 1.0302
Epoch 6/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.8513 - loss: 0.6893 - val_accuracy: 0.7790 - val_loss: 0.9963
Epoch 7/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.8733 - loss: 0.5928 - val_accuracy: 0.8020 - val_loss: 0.9271
Epoch 8/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.9024 - loss: 0.4641 - val_accuracy: 0.7970 - val_loss: 0.9176
Epoch 9/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9175 - loss: 0.4041 - val_accuracy: 0.8120 - val_loss: 0.8993
Epoch 10/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9278 - loss: 0.3464 - val_accuracy: 0.8070 - val_loss: 0.8861
Epoch 11/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9355 - loss: 0.3077 - val_accuracy: 0.8170 - val_loss: 0.8689
Epoch 12/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.9478 - loss: 0.2446 - val_accuracy: 0.8210 - val_loss: 0.8673
Epoch 13/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.9495 - loss: 0.2194 - val_accuracy: 0.8180 - val_loss: 0.8670
Epoch 14/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9516 - loss: 0.1955 - val_accuracy: 0.8170 - val_loss: 0.8846
Epoch 15/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9522 - loss: 0.1795 - val_accuracy: 0.8150 - val_loss: 0.8793
Epoch 16/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.9576 - loss: 0.1623 - val_accuracy: 0.8050 - val_loss: 0.9458
Epoch 17/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.9550 - loss: 0.1626 - val_accuracy: 0.8180 - val_loss: 0.9073
Epoch 18/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9546 - loss: 0.1459 - val_accuracy: 0.8060 - val_loss: 0.9612
Epoch 19/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.9602 - loss: 0.1438 - val_accuracy: 0.8080 - val_loss: 0.9453
Epoch 20/20
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.9545 - loss: 0.1363 - val_accuracy: 0.8190 - val_loss: 0.9365

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

"""
The model begins to overfit after 9 epochs -> train model from scratch with 9 epochs
"""
model = keras.Sequential([
 layers.Dense(64, activation="relu"),
 layers.Dense(64, activation="relu"),
 layers.Dense(46, activation="softmax")
])
model.compile(optimizer="rmsprop",
 loss="categorical_crossentropy",
 metrics=["accuracy"])
model.fit(x_train,
 y_train,
 epochs=9,
 batch_size=512)
results = model.evaluate(x_test, y_test)
print("Final Results")
print(results)
out[25]

Epoch 1/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 63ms/step - accuracy: 0.4027 - loss: 3.1839
Epoch 2/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.6740 - loss: 1.6156
Epoch 3/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.7337 - loss: 1.2478
Epoch 4/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.7689 - loss: 1.0306
Epoch 5/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.8162 - loss: 0.8453
Epoch 6/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.8531 - loss: 0.6742
Epoch 7/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8774 - loss: 0.5752
Epoch 8/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.8999 - loss: 0.4880
Epoch 9/9
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - accuracy: 0.9152 - loss: 0.3934
71/71 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.7971 - loss: 0.9087
71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7971 - loss: 0.9087
Final Results
[0.925804853439331, 0.7894033789634705]

You should avoid intermediate layers that have fewer units than the number of output classes - this can introduce information bottlenecks that degrade performance - i.e., with 46 output classes, no intermediate layer should have fewer than 46 units.
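To see the effect, here is a hedged sketch of the bottleneck experiment (the 4-unit layer is an arbitrary illustrative choice): squeeze the representation through a layer with far fewer than 46 units and compare its validation accuracy against the model above - it should drop noticeably.

bottleneck_model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="relu"),  # Deliberately far fewer units than the 46 output classes
    layers.Dense(46, activation="softmax")
])
bottleneck_model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
bottleneck_model.fit(partial_x_train, partial_y_train, epochs=9, batch_size=512,
                     validation_data=(x_val, y_val))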

Regression Example

Below is an example of regression - predicting a continuous value instead of a discrete label. We are attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given data points about the suburb at the time, such as the crime rate, the local property tax rate, and so on. Each feature in the input data has a different scale.

It is inappropriate to feed into a neural network values that all take wildly different ranges. A widespread best practice for dealing with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation.

Because so few samples are available, we'll use a very small model with two intermediate layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using a small model is one way to mitigate overfitting.

The regression model ends with a single unit and no activation (it will be a linear layer). This is a typical setup for scalar regression (a regression where you're trying to predict a single continuous value). The last layer being purely linear allows the model to predict values in any range.

The mean squared error (MSE), the square of the difference between the predictions and the targets, is a widely used loss function for regression problems. The mean absolute error (MAE) is the absolute value of the difference between the predictions and the targets.
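A tiny NumPy sketch of the two quantities, on made-up toy values:

import numpy as np
predictions = np.array([2.5, 0.0, 2.0])
targets = np.array([3.0, -0.5, 2.0])
mse = np.mean((predictions - targets) ** 2)   # Mean squared error - used as the loss below
mae = np.mean(np.abs(predictions - targets))  # Mean absolute error - monitored as a metric
print(mse, mae)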

from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
out[27]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz
57026/57026 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step

print(train_data.shape)
print(test_data.shape)
out[28]

(404, 13)
(102, 13)

print(train_targets)
out[29]

[15.2 42.3 50. 21.1 17.7 18.5 11.3 15.6 15.6 14.4 12.1 17.9 23.1 19.9
15.7 8.8 50. 22.5 24.1 27.5 10.9 30.8 32.9 24. 18.5 13.3 22.9 34.7
16.6 17.5 22.3 16.1 14.9 23.1 34.9 25. 13.9 13.1 20.4 20. 15.2 24.7
22.2 16.7 12.7 15.6 18.4 21. 30.1 15.1 18.7 9.6 31.5 24.8 19.1 22.
14.5 11. 32. 29.4 20.3 24.4 14.6 19.5 14.1 14.3 15.6 10.5 6.3 19.3
19.3 13.4 36.4 17.8 13.5 16.5 8.3 14.3 16. 13.4 28.6 43.5 20.2 22.
23. 20.7 12.5 48.5 14.6 13.4 23.7 50. 21.7 39.8 38.7 22.2 34.9 22.5
31.1 28.7 46. 41.7 21. 26.6 15. 24.4 13.3 21.2 11.7 21.7 19.4 50.
22.8 19.7 24.7 36.2 14.2 18.9 18.3 20.6 24.6 18.2 8.7 44. 10.4 13.2
21.2 37. 30.7 22.9 20. 19.3 31.7 32. 23.1 18.8 10.9 50. 19.6 5.
14.4 19.8 13.8 19.6 23.9 24.5 25. 19.9 17.2 24.6 13.5 26.6 21.4 11.9
22.6 19.6 8.5 23.7 23.1 22.4 20.5 23.6 18.4 35.2 23.1 27.9 20.6 23.7
28. 13.6 27.1 23.6 20.6 18.2 21.7 17.1 8.4 25.3 13.8 22.2 18.4 20.7
31.6 30.5 20.3 8.8 19.2 19.4 23.1 23. 14.8 48.8 22.6 33.4 21.1 13.6
32.2 13.1 23.4 18.9 23.9 11.8 23.3 22.8 19.6 16.7 13.4 22.2 20.4 21.8
26.4 14.9 24.1 23.8 12.3 29.1 21. 19.5 23.3 23.8 17.8 11.5 21.7 19.9
25. 33.4 28.5 21.4 24.3 27.5 33.1 16.2 23.3 48.3 22.9 22.8 13.1 12.7
22.6 15. 15.3 10.5 24. 18.5 21.7 19.5 33.2 23.2 5. 19.1 12.7 22.3
10.2 13.9 16.3 17. 20.1 29.9 17.2 37.3 45.4 17.8 23.2 29. 22. 18.
17.4 34.6 20.1 25. 15.6 24.8 28.2 21.2 21.4 23.8 31. 26.2 17.4 37.9
17.5 20. 8.3 23.9 8.4 13.8 7.2 11.7 17.1 21.6 50. 16.1 20.4 20.6
21.4 20.6 36.5 8.5 24.8 10.8 21.9 17.3 18.9 36.2 14.9 18.2 33.3 21.8
19.7 31.6 24.8 19.4 22.8 7.5 44.8 16.8 18.7 50. 50. 19.5 20.1 50.
17.2 20.8 19.3 41.3 20.4 20.5 13.8 16.5 23.9 20.6 31.5 23.3 16.8 14.
33.8 36.1 12.8 18.3 18.7 19.1 29. 30.1 50. 50. 22. 11.9 37.6 50.
22.7 20.8 23.5 27.9 50. 19.3 23.9 22.6 15.2 21.7 19.2 43.8 20.3 33.2
19.9 22.5 32.7 22. 17.1 19. 15. 16.1 25.1 23.7 28.7 37.2 22.6 16.4
25. 29.8 22.1 17.4 18.1 30.3 17.5 24.7 12.6 26.5 28.7 13.3 10.4 24.4
23. 20. 17.8 7. 11.8 24.4 13.8 19.4 25.2 19.4 19.4 29.1]

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std
out[30]
def build_model():
  """
  Since we need to instantiate the same model multiple times, we use a function to construct it.
  """
  model = keras.Sequential([
      layers.Dense(64,activation="relu"),
      layers.Dense(64, activation="relu"),
      layers.Dense(1)
  ])
  model.compile(optimizer="rmsprop",loss="mse",metrics=["mae"])
  return model
out[31]

Validating Approach using K-fold Validation

Since the amount of training data is small, the validation set would be small. As a consequence, the validation scores might change a lot depending on which data points we choose for validation: the validation scores might have a high variance with regard to the validation split - this would prevent us from reliably evaluating the model. Best practice in such situations is to use K-fold cross-validation:

K Fold Cross Validation with K=3

It consists of splitting the available data into K partitions (typically K = 4 or 5), instantiating K identical models, and training each one on K - 1 partitions while evaluating on the remaining partition. The validation score for the model is then the average of the K validation scores obtained. In terms of code, this is straightforward.

k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
  print(f"Processing Fold ${i}")
  # Prepares the validation data: data from partition #k
  val_data = train_data[i*num_val_samples: (i+1)*num_val_samples]
  val_targets = train_targets[i * num_val_samples: (i+1) * num_val_samples]
  # Prepare the training data: data from all other partitions
  partial_train_data = np.concatenate(
      [train_data[:i * num_val_samples],
       train_data[(i+1)*num_val_samples:]],
      axis=0
  )
  partial_train_targets = np.concatenate(
      [train_targets[:i * num_val_samples],
       train_targets[(i+1)*num_val_samples:]],
      axis=0
  )
  model = build_model()
  # Trains in silent mode (verbose = 0)
  model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=16, verbose=0)
  val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
  all_scores.append(val_mae)
out[33]

Processing Fold $0
Processing Fold $1
Processing Fold $2
Processing Fold $3

all_scores
out[34]

[1.958532452583313, 2.4341795444488525, 2.312009334564209, 2.3132541179656982]

np.mean(all_scores)
out[35]

2.254493862390518

num_epochs = 500
all_mae_histories = []
for i in range(k):
  print(f"Processing Fold #{i}")
  val_data = train_data[i*num_val_samples : (i+1) * num_val_samples] # Prepares the validation data: data from partition #k
  val_targets = train_targets[i * num_val_samples : (i+1) * num_val_samples]
  # Prepares the training data: data from all other partitions
  partial_train_data = np.concatenate(
      [train_data[:i * num_val_samples],
       train_data[(i+1) * num_val_samples:]],
      axis=0
  )
  partial_train_targets = np.concatenate(
  [train_targets[:i * num_val_samples],
   train_targets[(i+1) * num_val_samples:]],
   axis=0
  )
  # Builds the Keras model (already compiled)
  model = build_model()
  # Trains the model (in silent mode, verbose=0)
  history = model.fit(partial_train_data,partial_train_targets,validation_data=(val_data, val_targets), epochs=num_epochs, batch_size=16, verbose=0)
  mae_history = history.history["val_mae"]
  all_mae_histories.append(mae_history)
out[36]

Processing Fold #0
Processing Fold #1
Processing Fold #2
Processing Fold #3

average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
plt.plot(range(1, len(average_mae_history) + 1),average_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()
plt.clf()
truncated_mae_history = average_mae_history[10:]
plt.plot(range(1, len(truncated_mae_history) + 1), truncated_mae_history)
plt.xlabel("Epochs")
plt.ylabel("Validation MAE")
plt.show()
out[37]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>