Deep Learning with Python - Chapters 9 and 10
"Advanced Deep Learning for Computer Vision" and "Deep Learning for Timeseries" cover image segmentation and recurrent neural networks (primarily in the context of timeseries data), respectively.
Advanced Deep Learning for Computer Vision
The Three Essential Computer Vision Tasks
- Image classification: The goal is to assign one or more labels to an image. It can be single-label or multi-label classification.
- Image segmentation: The goal is to "segment" or "partition" an image into several different areas, with each area usually representing a category.
- Object detection: The goal is to draw rectangles (called bounding boxes) around objects of interest in an image, and associate each rectangle with a class.
Other niche tasks: image similarity scoring, keypoint detection (pinpointing attributes of interest in an image, such as facial features), pose estimation, 3D mesh estimation, and so on. Object detection is beyond the scope of an introductory book: check out the RetinaNet example on keras.io, which shows how to build an object detection model from scratch in Keras.
Image Segmentation Example
Image segmentation with deep learning is about using a model to assign a class to each pixel in an image, thus segmenting the image into different zones (such as "background" and "foreground", or "road", "car", and "sidewalk"). There are two flavors of image segmentation:
- Semantic segmentation: each pixel is independently classified into a semantic category, like "cat". If there are two cats in the image, the corresponding pixels are all mapped to the generic "cat" category.
- Instance segmentation: seeks not only to classify image pixels by category, but also to parse out individual object instances. In an image with two cats in it, instance segmentation would treat "cat 1" and "cat 2" as two separate classes of pixels.
A segmentation mask is the image-segmentation equivalent of a label: it's an image the same size as the input image, with a single color channel where each integer value corresponds to the class of the corresponding pixel in the input image.
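To make the mask format concrete, here is a minimal NumPy sketch. The 4 x 6 "image" with two cats and all pixel values are made up for illustration; only the idea of one integer class per pixel in a single channel comes from the text above.

```python
import numpy as np

# Hypothetical 4 x 6 scene containing two cats (values are illustrative only).
# Semantic mask: 0 = background, 1 = cat. Both cats share the same class.
semantic_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
], dtype="uint8")

# Instance mask: each cat gets its own integer id ("cat 1" and "cat 2").
instance_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
], dtype="uint8")

# Masks carry a trailing channel axis of size 1, like a grayscale image.
semantic_mask = semantic_mask[..., np.newaxis]
instance_mask = instance_mask[..., np.newaxis]

semantic_labels = np.unique(semantic_mask).tolist()
instance_labels = np.unique(instance_mask).tolist()

# Collapsing every instance id to "cat" recovers the semantic mask.
collapsed = (instance_mask > 0).astype("uint8")

print("semantic labels:", semantic_labels)
print("instance labels:", instance_labels)
```

The semantic mask only says "cat or not cat" for each pixel, while the instance mask also says which cat.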
# Download and uncompress dataset using the wget and tar shell utilities
!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz
!tar -xf images.tar.gz
!tar -xf annotations.tar.gz
import os
input_dir = "/content/images/"
target_dir = "/content/annotations/trimaps/"
"""
The input pictures are stored as JPG files and the corresponding segmentation masks are stored as PNG files.
Below we prepare a list of input file paths, as well as the list of the corresponding mask file paths
"""
input_img_paths = sorted([os.path.join(input_dir, fname) for fname in os.listdir(input_dir) if fname.endswith(".jpg")])
target_paths = sorted([os.path.join(target_dir, fname) for fname in os.listdir(target_dir) if fname.endswith(".png") and not fname.startswith(".")])
import matplotlib.pyplot as plt
from tensorflow.keras.utils import load_img, img_to_array
plt.axis("off")
ax = plt.imshow(load_img(input_img_paths[9])) # Display input image number 9
plt.gca().set_title("An example image")
plt.show()
plt.clf()
def display_target(target_array, title=None):
    # The original labels are 1, 2, and 3. We subtract 1 so that the
    # labels range from 0 to 2, and then we multiply by 127 so that the
    # labels become 0 (black), 127 (gray), 254 (near-white)
    normalized_array = (target_array.astype("uint8") - 1) * 127
    plt.axis("off")
    plt.imshow(normalized_array[:, :, 0])
    if isinstance(title, str):
        plt.gca().set_title(title)
# We use color_mode="grayscale" so that the image we load is treated as having a single color channel
img = img_to_array(load_img(target_paths[9], color_mode="grayscale"))
display_target(img, "The corresponding target mask")
"""
Splitting the data into a training and a validation set
"""
import numpy as np
import random
img_size = (200,200) # We resize everything to 200 x 200
num_imgs = len(input_img_paths) # Total number of samples in the data
"""
Shuffle the file paths (they were originally sorted by breed). We use the same seed in both statements to ensure that the input paths and target paths stay in the same order
"""
random.Random(1_337).shuffle(input_img_paths)
random.Random(1_337).shuffle(target_paths)
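Why the same seed keeps inputs and masks paired can be checked on a toy example. The file names below are hypothetical stand-ins for the real paths; the sketch only relies on the fact that two `random.Random` instances built from the same seed apply the same permutation to lists of equal length.

```python
import random

# Hypothetical file names standing in for the real image and mask paths.
inputs = [f"img_{i}.jpg" for i in range(6)]
masks = [f"img_{i}.png" for i in range(6)]

# Same seed + same list length -> the same permutation is applied to both lists.
random.Random(1337).shuffle(inputs)
random.Random(1337).shuffle(masks)

# Every input should still sit at the same index as its mask.
pairs_aligned = all(
    a.split(".")[0] == b.split(".")[0] for a, b in zip(inputs, masks)
)
print("pairs still aligned:", pairs_aligned)
```

Note that this only works because both lists have the same length and were in matching order before the shuffle.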
def path_to_input_image(path):
    return img_to_array(load_img(path, target_size=img_size))
def path_to_target(path):
    # Masks have a single channel, so we load them as grayscale
    img = img_to_array(load_img(path, target_size=img_size, color_mode="grayscale"))
    img = img.astype("uint8") - 1 # Subtract 1 so that our labels become 0, 1, and 2
    return img
"""
Load all images into the input_imgs float32 array and their masks into the targets uint8 array (same order). The inputs have three channels (RGB values) and the targets have a single channel (which contains integer labels).
"""
input_imgs = np.zeros((num_imgs,) + img_size + (3,), dtype="float32")
targets = np.zeros((num_imgs,) + img_size + (1,), dtype="uint8")
for i in range(num_imgs):
    input_imgs[i] = path_to_input_image(input_img_paths[i])
    targets[i] = path_to_target(target_paths[i])
# Reserve 1,000 samples for validation
num_val_samples = 1_000
# Split the data into a training and a validation set
train_input_imgs = input_imgs[:-num_val_samples]
train_targets = targets[:-num_val_samples]
val_input_imgs = input_imgs[-num_val_samples:]
val_targets = targets[-num_val_samples:]
from tensorflow import keras
from tensorflow.keras import layers
def get_model(img_size, num_classes):
    inputs = keras.Input(shape=img_size + (3,))
    # Rescale input images to the [0, 1] range
    x = layers.Rescaling(1./255)(inputs)
    # We use padding="same" everywhere to avoid the influence of border padding on feature map size
    x = layers.Conv2D(64, 3, strides=2, activation="relu", padding="same")(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(128, 3, strides=2, activation="relu", padding="same")(x)
    x = layers.Conv2D(128, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, activation="relu", padding="same")(x)
    x = layers.Conv2DTranspose(256, 3, activation="relu", padding="same")(x)
    x = layers.Conv2DTranspose(
        256, 3, activation="relu", padding="same", strides=2)(x)
    x = layers.Conv2DTranspose(128, 3, activation="relu", padding="same")(x)
    x = layers.Conv2DTranspose(
        128, 3, activation="relu", padding="same", strides=2)(x)
    x = layers.Conv2DTranspose(64, 3, activation="relu", padding="same")(x)
    x = layers.Conv2DTranspose(
        64, 3, activation="relu", padding="same", strides=2)(x)
    # We end the model with a per-pixel three-way softmax to classify each output pixel into one of our three categories
    outputs = layers.Conv2D(num_classes, 3, activation="softmax",
                            padding="same")(x)
    model = keras.Model(inputs, outputs)
    return model
model = get_model(img_size=img_size, num_classes=3)
print(model.summary())
The first half of the model resembles the kind of convnet you'd use for image classification: a stack of Conv2D layers with a gradually increasing number of filters. We downsample our images three times by a factor of two each, ending up with activations of size (25, 25, 256). The purpose of this first half is to encode the images into smaller feature maps, where each spatial location (or pixel) contains information about a large spatial chunk of the original image. It is a kind of compression.
One important difference between the first half of this model and the classification models you've seen before is the way we downsample: in classification, we downsample with MaxPooling2D layers, whereas here we downsample by adding strides to every other convolution layer. We do this because, in image segmentation, we care a lot about the spatial location of information in the image, since we need to produce per-pixel target masks as the output of the model. Max pooling layers hurt quite a bit in segmentation tasks, because a pooling window discards where within the window its maximum came from. Strided convolutions do a better job of downsampling feature maps while retaining location information. Use strides instead of pooling whenever you care about feature location.
The second half of the model is a stack of Conv2DTranspose layers. The output of the first half of the model is a feature map of shape (25, 25, 256), but we want the final output to match the 200 x 200 size of the target masks, with one score per class: (200, 200, 3). We need to apply a kind of inverse of the transformations we've applied so far: something that will upsample the feature maps instead of downsampling them. That's the purpose of the Conv2DTranspose layer: you can think of it as a kind of convolution layer that learns to upsample.
If you have an input of shape (100, 100, 64) and you run it through the layer Conv2D(128, 3, strides=2, padding="same"), you get an output of shape (50, 50, 128). If you run this output through the layer Conv2DTranspose(64, 3, strides=2, padding="same"), you get back an output of shape (100, 100, 64), the same as the original.
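The shape bookkeeping above can be checked with a few lines of plain Python. This sketch assumes padding="same" throughout, where a strided convolution yields ceil(size / strides) and a strided transposed convolution yields size * strides; the helper names are hypothetical and not part of Keras.

```python
import math

def conv2d_same_output_size(size, strides):
    """Spatial size after a Conv2D with padding="same"."""
    return math.ceil(size / strides)

def conv2d_transpose_same_output_size(size, strides):
    """Spatial size after a Conv2DTranspose with padding="same"."""
    return size * strides

# Encoder: three stride-2 convolutions starting from 200 x 200 inputs.
size = 200
encoder_sizes = []
for _ in range(3):
    size = conv2d_same_output_size(size, strides=2)
    encoder_sizes.append(size)

# Decoder: three stride-2 transposed convolutions undo the downsampling.
decoder_sizes = []
for _ in range(3):
    size = conv2d_transpose_same_output_size(size, strides=2)
    decoder_sizes.append(size)

print("encoder sizes:", encoder_sizes)
print("decoder sizes:", decoder_sizes)
```

The stride-1 layers in between leave the spatial size unchanged, which is why they are omitted here.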
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
callbacks = [
keras.callbacks.ModelCheckpoint("oxford_segmentation.keras",
save_best_only=True)
]
history = model.fit(train_input_imgs, train_targets,epochs=50,
callbacks=callbacks, batch_size=64, validation_data=(val_input_imgs, val_targets))
epochs = range(1, len(history.history["loss"]) + 1)
loss = history.history["loss"]
val_loss = history.history["val_loss"]
plt.figure()
plt.plot(epochs, loss, "bo", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.legend()
plt.show()
from tensorflow.keras.utils import array_to_img
model = keras.models.load_model("oxford_segmentation.keras")
i = 4
test_image = val_input_imgs[i]
plt.axis("off")
plt.imshow(array_to_img(test_image))
mask = model.predict(np.expand_dims(test_image, 0))[0]
def display_mask(pred):
    """
    Utility to display a model's prediction
    """
    mask = np.argmax(pred, axis=-1)
    mask *= 127
    plt.axis("off")
    plt.imshow(mask)
display_mask(mask)
Modern Convnet Architecture Patterns
A model's "architecture" is the sum of the choices that went into creating it: which layers to use, how to configure them, and in what arrangement to connect them. These choices define the hypothesis space of your model: the space of possible functions that gradient descent can search over, parameterized by the model's weights. Like feature engineering, a good hypothesis space encodes prior knowledge that you have about the problem at hand and its solution.
A good model architecture is one that reduces the size of the search space or otherwise makes it easier to converge to a good point of the search space. Model architecture is all about making the problem simpler for gradient descent to solve. (Gradient descent is a stupid search process - so it needs all the help it can get.)
Modularity, hierarchy, and reuse
If you want to make a complex system simpler, there's a universal recipe you can apply: just structure your amorphous soup of complexity into modules, organize the modules into a hierarchy, and start reusing the same modules as appropriate ("reuse" is another word for abstraction in this context).
In general, a deep stack of narrow layers performs better than a shallow stack of large layers. However, there's a limit to how deep you can stack layers, due to the problem of vanishing gradients. Vanishing gradients arise when your function chain (the stack of layers) is too deep: each function in the chain adds some amount of noise, and past a certain depth that noise overwhelms the error signal, so it can no longer propagate back to the earlier layers.
To fix this, force each function in the chain to be nondestructive: to retain a noiseless version of the information contained in the previous input. The easiest way to implement this is to use a residual connection. The residual connection acts as an information shortcut around destructive or noisy blocks (such as blocks that contain activation functions or dropout layers), enabling error gradient information from early layers to propagate noiselessly through a deep network.
# A residual connection in pseudocode
x = ... # Some input tensor
residual = x # Save a pointer to the original input, called the residual
x = block(x) # The computation block can potentially be destructive or noisy
# Add the original input to the layer's output:
# the final output will always preserve full information about the original input
x = add([x, residual])
from tensorflow import keras
from tensorflow.keras import layers
inputs = keras.Input(shape=(32,32,3))
x = layers.Conv2D(32,3,activation="relu")(inputs)
# Set aside the residual
residual = x
# Layer around which we create the residual connection
# padding="same" to avoid downsampling due to padding
x = layers.Conv2D(64,3,activation="relu",padding="same")(x)
# The residual only had 32 filters, so we use a 1x1 Conv2D to project to
# the correct shape
residual = layers.Conv2D(64,1)(residual)
# The block output and the residual now
# have the same shape and can be added
x = layers.add([x, residual])
from tensorflow import keras
from tensorflow.keras import layers
inputs = keras.Input(shape=(32,32,3))
x = layers.Conv2D(32,3,activation="relu")(inputs)
# Set aside the residual
residual = x
# Block of two layers around which we create the residual connection
# padding="same" to avoid downsampling due to padding,
x = layers.Conv2D(64,3,activation="relu",padding="same")(x)
x = layers.MaxPooling2D(2, padding="same")(x)
# We use strides=2 in the residual projection to match the downsampling
# created by the max pooling layer
residual = layers.Conv2D(64, 1, strides=2)(residual)
# The block output and the residual now
# have the same shape and can be added
x = layers.add([x, residual])
"""
Example of a simple convnet structured into a series of blocks, each made of two convolutional layers and one optional max pooling layer, with a residual connection around each block
"""
inputs = keras.Input(shape=(32,32,3))
x = layers.Rescaling(1./255)(inputs)
def residual_block(x, filters, pooling=False):
    """
    Utility function to apply a convolutional block with
    a residual connection, with an option to add max pooling
    """
    residual = x
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
    if pooling:
        x = layers.MaxPooling2D(2, padding="same")(x)
        # Add a strided convolution to project the residual to the expected shape if using pooling
        residual = layers.Conv2D(filters, 1, strides=2)(residual)
    elif filters != residual.shape[-1]:
        # If not using max pooling, only project the residual if the number of channels has changed
        residual = layers.Conv2D(filters, 1)(residual)
    x = layers.add([x, residual])
    return x
# First block
x = residual_block(x, filters=32, pooling=True)
# Second Block: note increasing filter count
x = residual_block(x, filters=64, pooling=True)
# The last block doesn't need a max pooling layer, since we will apply global average pooling right after it
x = residual_block(x, filters=128, pooling=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
print(model.summary())
With residual connections, you can build networks of arbitrary depth, without having to worry about vanishing gradients.
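A scalar toy model gives some intuition for why the shortcut helps gradients survive depth. This is only an illustration under made-up assumptions (each "layer" is the hypothetical function 0.1 * tanh(x)), not a description of what happens inside a real convnet: it compares the chain-rule gradient through 20 stacked blocks with and without a residual connection.

```python
import math

def block(x):
    """A toy 'layer' that squashes its input and shrinks gradients."""
    return 0.1 * math.tanh(x)

def block_grad(x):
    """Derivative of block with respect to its input."""
    return 0.1 * (1.0 - math.tanh(x) ** 2)

def chain_gradient(x, depth, residual):
    """Gradient of the output of `depth` stacked blocks w.r.t. the first input."""
    grad = 1.0
    for _ in range(depth):
        local = block_grad(x)
        if residual:
            # d/dx [x + block(x)] = 1 + block'(x): the shortcut contributes the 1
            grad *= 1.0 + local
            x = x + block(x)
        else:
            grad *= local
            x = block(x)
    return grad

plain = chain_gradient(0.5, depth=20, residual=False)
with_residual = chain_gradient(0.5, depth=20, residual=True)

print(f"gradient without residual connections: {plain:.3e}")
print(f"gradient with residual connections:    {with_residual:.3e}")
```

Without the shortcut, the gradient is a product of 20 factors that are each at most 0.1, so it collapses toward zero. With the shortcut, every factor is at least 1, so the signal reaching the first block never shrinks.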
Batch Normalization
Normalization is a broad category of methods that seek to make the different samples seen by a machine learning model more similar to each other, which helps the model learn and generalize well to new data.
Batch normalization is a type of layer (BatchNormalization in Keras) that can adaptively normalize data even as the mean and variance change over time during training. During training, it uses the mean and variance of the current batch of data to normalize samples, and during inference, it uses an exponential moving average of the batch-wise mean and variance of the data seen during training.
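The two modes described above can be sketched in NumPy. This is a simplified stand-in, not the Keras implementation: the class name `ToyBatchNorm` is hypothetical, the learnable scale and offset are left out, and the momentum, epsilon, and synthetic data are arbitrary choices for the demo.

```python
import numpy as np

class ToyBatchNorm:
    """Toy batch normalization over the batch axis, without learnable scale/offset."""

    def __init__(self, num_features, momentum=0.9, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Training: normalize with the statistics of the current batch...
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # ...and fold them into the exponential moving averages.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference: normalize with the moving averages gathered during training.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)

rng = np.random.default_rng(0)
bn = ToyBatchNorm(num_features=3)

# Simulate training on batches drawn from a distribution with mean 5 and std 2.
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(256, 3))
    train_out = bn(batch, training=True)

# At inference time a new batch is normalized with the stored moving averages.
new_batch = rng.normal(loc=5.0, scale=2.0, size=(256, 3))
inference_out = bn(new_batch, training=False)

print("moving mean:", bn.moving_mean.round(2))
print("moving variance:", bn.moving_var.round(2))
```

After enough training batches the moving averages settle near the true statistics of the data, which is what makes the inference-time normalization consistent with what the model saw during training.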
[D]eep learning is not an exact science, but a set of everchanging, empirically derived engineering best practices, woven together by unreliable narratives.
The main effect of batch normalization appears to be that it helps with gradient propagation - much like residual connections - and thus allows for deeper networks. The BatchNormalization layer can be used after any layer.
# Because the output of the Conv2D layer gets normalized, the layer
# doesn't need its own bias vector - this makes the Conv2D layer
# slightly leaner
x = layers.Conv2D(32, 3, use_bias=False)(x)
x = layers.BatchNormalization()(x)
It is generally recommended to place the previous layer's activation after the batch normalization layer. When fine-tuning a model that includes BatchNormalization layers, it is recommended to leave these layers frozen (trainable=False). Otherwise, they will keep updating their internal mean and variance, which can interfere with the very small updates applied to the surrounding Conv2D layers.
# Note the lack of activation
x = layers.Conv2D(32, 3, use_bias=False)(x)
x = layers.BatchNormalization()(x)
# Place the activation after the BatchNormalization layer
x = layers.Activation("relu")(x)
Depthwise Separable Convolutions
The depthwise separable convolution layer (SeparableConv2D in Keras) is a drop-in replacement for Conv2D that makes the model smaller and leaner and tends to make it perform a few percentage points better on its task. This layer performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution (a 1×1 convolution).
Depthwise separable convolution relies on the assumption that spatial locations in intermediate activations are highly correlated, but different channels are highly independent. This assumption is generally true for image representations learned by deep neural networks.
This layer requires significantly fewer parameters and involves fewer computations compared to regular convolution, while having comparable representational power. It results in smaller models that converge faster and are less prone to overfitting.
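The parameter savings are easy to verify by counting weights. The sketch below ignores bias terms and uses an illustrative 3×3 window with 64 input and 64 output channels; the helper names are hypothetical.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a regular convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable convolution (bias ignored)."""
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution mixing the channels
    return depthwise + pointwise

regular = conv2d_params(3, 64, 64)
separable = separable_conv2d_params(3, 64, 64)

print("regular Conv2D weights:  ", regular)
print("SeparableConv2D weights: ", separable)
print(f"ratio: {regular / separable:.1f}x fewer weights")
```

The gap widens as the number of channels or the window size grows, since the regular layer multiplies all three factors together while the separable layer adds two much smaller products.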
Summary
- Your model should be organized into repeated blocks of layers, usually made of multiple convolution layers and a max pooling layer.
- The number of filters in your layers should increase as the size of the spatial feature maps decreases.
- Deep and narrow is better than broad and shallow.
- Introducing residual connections around blocks of layers helps you train deeper networks.
- It can be beneficial to introduce batch normalization layers after your convolution layers.
- It can be beneficial to replace Conv2D layers with SeparableConv2D layers, which are more parameter-efficient.
Example of Small Xception Model:
data_augmentation = keras.Sequential(
[
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.1),
layers.RandomZoom(0.2),
]
)
inputs = keras.Input(shape=(180, 180, 3))
# We use the same data augmentation config as before
x = data_augmentation(inputs)
# Don't forget input rescaling
x = layers.Rescaling(1./255)(x)
# The assumption that underlies separable convolution -
# that "feature channels are largely independent" - does not
# hold true for RGB images: the red, green, and blue channels are
# actually highly correlated in natural images. So the first layer
# is a regular Conv2D layer.
x = layers.Conv2D(filters=32, kernel_size=5, use_bias=False)(x)
# Apply a series of convolutional blocks with increasing feature depth
# Each block consists of two batch-normalized depthwise separable convolution
# layers and a max pooling layer, with a residual connection around the
# entire block.
for size in [32, 64, 128, 256, 512]:
    residual = x
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    residual = layers.Conv2D(
        size, 1, strides=2, padding="same", use_bias=False)(residual)
    x = layers.add([x, residual])
# In the original model, we used a Flatten layer before the Dense layer
# Here, we go with GlobalAveragePooling2D layer
x = layers.GlobalAveragePooling2D()(x)
# We add dropout for regularization
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss="binary_crossentropy",
optimizer="rmsprop",
metrics=["accuracy"])
callbacks = [
keras.callbacks.ModelCheckpoint(
filepath="convnet_from_scratch_with_augmentation.keras",
save_best_only=True,
monitor="val_loss")
]
history = model.fit(
train_dataset,
epochs=100,
validation_data=validation_dataset,
callbacks=callbacks)
Interpreting what convnets learn
A fundamental problem when building a computer vision application is that of interpretability: why did your classifier think a particular image contained a fridge, when all you can see is a truck?
The representations learned by convnets are highly amenable to visualization, in large part because they're representations of visual concepts. Techniques for visualizing and interpreting these representations:
- Visualizing intermediate convnet outputs (intermediate activations): Useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters.
- Visualizing convnet filters: Useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to.
- Visualizing heatmaps of class activation in an image: Useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in an image.
from tensorflow import keras
model = keras.models.load_model("convnet_from_scratch_with_augmentation.keras")
print(model.summary())
"""
Preprocessing a single image
"""
from tensorflow import keras
import numpy as np
# Download a test image
img_path = keras.utils.get_file(fname="cat.jpg",origin="https://img-datasets.s3.amazonaws.com/cat.jpg")
def get_img_array(img_path, target_size):
    # Open the image file and resize it
    img = keras.utils.load_img(img_path, target_size=target_size)
    # Turn the image into a float32 NumPy array of shape (180, 180, 3)
    array = keras.utils.img_to_array(img)
    # Add a dimension to transform the array into a "batch" of a single sample. Its shape is now (1, 180, 180, 3)
    array = np.expand_dims(array, axis=0)
    return array
# Build the input tensor used in the rest of this section
img_tensor = get_img_array(img_path, target_size=(180, 180))
"""
Displaying the test picture
"""
import matplotlib.pyplot as plt
plt.axis("off")
plt.imshow(img_tensor[0].astype("uint8"))
plt.show()
from tensorflow.keras import layers
layer_outputs = []
layer_names = []
for layer in model.layers:
    if isinstance(layer, (layers.Conv2D, layers.MaxPooling2D)):
        # Extract the outputs of all Conv2D and MaxPooling2D layers and put them in a list
        layer_outputs.append(layer.output)
        # Save layer names for later
        layer_names.append(layer.name)
# Create a model that will return these outputs, given the model input
activation_model = keras.Model(inputs=model.input, outputs=layer_outputs)
# Using the model to compute layer activations
# Returns a list of nine NumPy arrays, one array per layer activation
activations = activation_model.predict(img_tensor)
# Visualizing the channel at index 5 of the first layer's activation
import matplotlib.pyplot as plt
first_layer_activation = activations[0]
plt.matshow(first_layer_activation[0, :, :, 5], cmap="viridis")
# ... some visualization code
Things to note:
- The first layer acts as a collection of various edge detectors. At that stage, the activations retain almost all of the information present in the initial picture
- As you go deeper, the activations become increasingly abstract and less visually interpretable. Deeper representations carry less information about the visual contents of the image, and increasingly more information related to the class of the image.
- The sparsity of the activations increases with the depth of the layer: in the first layer, almost all filters are activated by the input image, but in the following layers, more and more filters are blank. A blank filter means that the pattern it encodes isn't found in the input image.
We have just evidenced an important universal characteristic of the representations learned by deep neural networks: the features extracted by a layer become increasingly abstract with the depth of the layer. The activations of higher layers carry less and less information about the specific input being seen, and more and more information about the target (in this case, the class of the image: cat or dog). A deep neural network effectively acts as an information distillation pipeline, with raw data going in (in this case, RGB pictures) and being repeatedly transformed so that irrelevant information is filtered out (for example, the specific visual appearance of the image), and useful information is magnified and refined (for example, the class of the image).
Visualizing Convnet Filters
Another way to inspect the filters learned by convnets is to display the visual pattern that each filter is meant to respond to. This can be done with gradient ascent in input space: repeatedly adjusting the values of the input image of a convnet in the direction of the gradient so as to maximize the response of a specific filter, starting from a blank input image.
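The procedure can be illustrated on a single hand-written filter. This is a toy sketch, not the notebook's visualization code: the vertical-edge kernel, the 8 x 8 image, the step size, and the helper names are all assumptions, and the gradient is derived by hand because the response is linear. A real convnet would need automatic differentiation.

```python
import numpy as np

def filter_response(image, kernel):
    """Mean activation of one filter slid over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    total = 0.0
    for i in range(out_h):
        for j in range(out_w):
            total += np.sum(image[i:i + kh, j:j + kw] * kernel)
    return total / (out_h * out_w)

def response_gradient(image, kernel):
    """Gradient of filter_response with respect to every input pixel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    grad = np.zeros_like(image)
    for i in range(out_h):
        for j in range(out_w):
            # Each window position contributes the kernel to the pixels it covers
            grad[i:i + kh, j:j + kw] += kernel
    return grad / (out_h * out_w)

# Hypothetical filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

# Start from a blank image and climb the gradient of the filter response.
image = np.zeros((8, 8))
initial = filter_response(image, kernel)
for _ in range(30):
    grad = response_gradient(image, kernel)
    grad /= np.linalg.norm(grad) + 1e-8  # normalize so every step has the same size
    image += 1.0 * grad
final = filter_response(image, kernel)

print(f"response before: {initial:.3f}, after: {final:.3f}")
```

The resulting image ends up dark on the left and bright on the right, which is the pattern this particular filter is looking for.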
# ... some visualization code
These filter visualizations tell you a lot about how convnet filters see the world: each layer in a convnet learns a collection of filters such that its inputs can be expressed as a combination of the filters. This is similar to how the Fourier transform decomposes signals onto a bank of cosine functions. The filters get increasingly complex and refined as you go deeper in the model:
- The filters from the first layers in the model encode simple directional edges and colors.
- The filters from layers a bit further up the stack encode simple textures made from combinations of edges and colors.
- The filters in higher layers begin to resemble features found in natural images: feathers, eyes, leaves, and so on.
Visualizing Heatmaps of Class Activation
Debugging a classification mistake falls within a domain called model interpretability. One general category of techniques is called class activation map (CAM) visualization, and it consists of producing heatmaps of class activation over input images. A class activation heatmap is a 2D grid of scores associated with a specific output class, computed for every location in an input image, indicating how important each location is with respect to the class under consideration.
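As a rough sketch of what such a heatmap is, the snippet below weights the channels of a small feature map by how much each channel matters for a class, sums them, and rescales the result to [0, 1]. The feature map and the channel weights are made up; in a real CAM method the weights would be derived from the trained model (for instance from gradients of the class score), which is not shown here.

```python
import numpy as np

# Hypothetical 4 x 4 feature map with two channels:
# channel 0 fires in the top-left corner, channel 1 in the bottom-right corner.
features = np.zeros((4, 4, 2))
features[0:2, 0:2, 0] = 1.0
features[2:4, 2:4, 1] = 1.0

# Assumed importance of each channel for the class under consideration.
channel_weights = np.array([0.9, 0.1])

# Weighted sum over channels gives one score per spatial location.
heatmap = features @ channel_weights
# Keep only positive evidence and rescale to the [0, 1] range for display.
heatmap = np.maximum(heatmap, 0.0)
heatmap /= heatmap.max()

print(heatmap.round(2))
```

The hottest region is the top-left corner, because that is where the channel most associated with the class is active.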
Deep Learning for Timeseries
Different kinds of Timeseries Tasks
A timeseries can be any data obtained via measurements at regular intervals, like the daily price of a stock, the hourly electricity consumption of a city, or the weekly sales of a store. Working with timeseries involves understanding the dynamics of a system: its periodic cycles, how it trends over time, its regular regime and its sudden spikes.
The most common timeseries-related task is forecasting: predicting what will happen next in a series. Other things you can do with timeseries:
- Classification
- Event Detection
- Anomaly Detection
The Fourier transform can be highly valuable when preprocessing any data that is primarily characterized by its cycles and oscillations.
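As a small illustration of that idea, the sketch below recovers the dominant cycle of a synthetic series with NumPy's FFT. The series is made up (a daily sine wave plus noise, sampled every 10 minutes for 30 days) and is not the weather data used later.

```python
import numpy as np

# Synthetic "temperature": 144 samples per day (one every 10 minutes) for 30 days.
samples_per_day = 144
num_days = 30
t = np.arange(samples_per_day * num_days)
rng = np.random.default_rng(0)
series = 10 + 5 * np.sin(2 * np.pi * t / samples_per_day) + rng.normal(0, 1, t.size)

# Remove the mean so the zero-frequency component doesn't dominate the spectrum.
spectrum = np.abs(np.fft.rfft(series - series.mean()))
freqs = np.fft.rfftfreq(t.size, d=1.0)  # frequencies in cycles per sample

# The strongest frequency corresponds to the main cycle in the data.
dominant_freq = freqs[np.argmax(spectrum)]
dominant_period = 1.0 / dominant_freq  # period in samples

print(f"dominant period: {dominant_period:.1f} samples "
      f"({dominant_period / samples_per_day:.2f} days)")
```

The same recipe applied to several years of data would also surface a yearly peak, alongside the daily one.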
A Temperature-Forecasting Example
The task: predict the temperature 24 hours in the future, given a timeseries of hourly measurements of quantities such as atmospheric pressure and humidity, recorded over the recent past by a set of sensors on the roof of a building. Recurrent neural networks (RNNs) really shine on this type of problem.
Periodicity over multiple timescales is an important and very common property of timeseries data. When exploring data, make sure to look for these patterns. When working with timeseries, it's important to use validation and test data that is more recent than the training data, because you're trying to predict the future given the past, not the reverse.
!wget https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
!unzip jena_climate_2009_2016.csv.zip
"""
Inspecting the data of the Jena weather dataset
"""
import os
fname = os.path.join("/content/jena_climate_2009_2016.csv")
with open(fname) as f:
    data = f.read()
lines = data.split("\n")
header = lines[0].split(",")
lines = lines[1:]
print(header)
print(len(lines))
"""
Parse the data into NumPy arrays
"""
import numpy as np
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(",")[1:]]
    temperature[i] = values[1] # Column 1 is the "temperature" column
    raw_data[i, :] = values[:] # Store all columns (including the temperature) in the "raw_data" array
"""
Plotting the temperature timeseries
"""
from matplotlib import pyplot as plt
plt.plot(range(len(temperature)), temperature)
plt.show()
# Plotting the first 10 days of the temperature timeseries
plt.plot(range(1440), temperature[:1440])
plt.show()
"""
Computing the number of samples we'll use for each data split
"""
num_train_samples = int(0.5 * len(raw_data))
num_val_samples = int(0.25 * len(raw_data))
num_test_samples = len(raw_data) - num_train_samples - num_val_samples
print("num_train_samples:", num_train_samples)
print("num_val_samples:", num_val_samples)
print("num_test_samples:", num_test_samples)
# Normalize the data
mean = raw_data[:num_train_samples].mean(axis=0)
raw_data -= mean
std = raw_data[:num_train_samples].std(axis=0)
raw_data /= std
import tensorflow as tf
from tensorflow import keras
"""
Instantiating datasets for training, validation, and testing.
With timeseries_dataset_from_array, we will use the following parameter values:
- sampling_rate = 6 - observations will be sampled at one data point per hour: the raw data has one record every 10 minutes, so we only keep one data point out of 6
- sequence_length = 120 - observations will go back 5 days (120 hours)
- delay = sampling_rate * (sequence_length + 24 - 1) - the target for a sequence will be the temperature 24 hours after the end of the sequence
"""
sampling_rate = 6
sequence_length = 120
delay = sampling_rate * (sequence_length + 24 - 1)
batch_size = 256
train_dataset = keras.utils.timeseries_dataset_from_array(
raw_data[:-delay],
targets=temperature[delay:],
sampling_rate=sampling_rate,
sequence_length=sequence_length,
shuffle=True,
batch_size=batch_size,
start_index=0,
end_index=num_train_samples)
val_dataset = keras.utils.timeseries_dataset_from_array(
raw_data[:-delay],
targets=temperature[delay:],
sampling_rate=sampling_rate,
sequence_length=sequence_length,
shuffle=True,
batch_size=batch_size,
start_index=num_train_samples,
end_index=num_train_samples + num_val_samples)
test_dataset = keras.utils.timeseries_dataset_from_array(
raw_data[:-delay],
targets=temperature[delay:],
sampling_rate=sampling_rate,
sequence_length=sequence_length,
shuffle=True,
batch_size=batch_size,
start_index=num_train_samples + num_val_samples)
# Inspecting the output of the datasets
for samples, targets in train_dataset:
    print("samples shape:", samples.shape)
    print("targets shape:", targets.shape)
    break
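The delay formula used above can be checked by hand. This sketch assumes, as in the dataset calls above, that a sequence starting at raw row i reads rows i, i + 6, i + 12, and so on, that its target is row i + delay (because the targets array is temperature[delay:]), and that raw rows are 10 minutes apart.

```python
sampling_rate = 6
sequence_length = 120
delay = sampling_rate * (sequence_length + 24 - 1)

start = 0  # raw row index where some sequence begins
# Raw rows read by this sequence: one row out of every `sampling_rate`.
input_rows = [start + k * sampling_rate for k in range(sequence_length)]
last_input_row = input_rows[-1]

# Raw row used as the target for this sequence.
target_row = start + delay

rows_between = target_row - last_input_row
hours_ahead = rows_between * 10 / 60  # raw rows are 10 minutes apart

print("delay in raw rows:", delay)
print("last input row:", last_input_row, "target row:", target_row)
print("target is", hours_ahead, "hours after the last input step")
```

The "- 1" in the formula is there because the last input step sits (sequence_length - 1) steps after the first one, not sequence_length steps.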
"""
Computing the common-sense baseline MAE
"""
def evaluate_naive_method(dataset):
    """
    Common-sense baseline = assume the temperature 24 hours from now will be
    the same as it is now
    """
    total_abs_err = 0.
    samples_seen = 0
    for samples, targets in dataset:
        # The temperature is column 1; undo the normalization to get back degrees Celsius
        preds = samples[:, -1, 1] * std[1] + mean[1]
        total_abs_err += np.sum(np.abs(preds - targets))
        samples_seen += samples.shape[0]
    return total_abs_err / samples_seen
print(f"Validation MAE: {evaluate_naive_method(val_dataset):.2f}")
print(f"Test MAE: {evaluate_naive_method(test_dataset):.2f}")
In the same way that it's useful to establish a common-sense baseline before trying machine learning approaches, it's useful to try simple, cheap machine learning models (such as small, densely connected networks) before looking into complicated and computationally expensive models such as RNNs. This is the best way to know whether further complexity is warranted.
# Training and evaluating a densely connected model
from tensorflow import keras
from tensorflow.keras import layers
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Flatten()(inputs)
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
keras.callbacks.ModelCheckpoint("jena_dense.keras",
save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset, epochs=10, validation_data=val_dataset, callbacks=callbacks)
# Reload the best model and evaluate it on the test data
model = keras.models.load_model("jena_dense.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
# Plotting the results
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.title("Training and validation MAE")
plt.legend()
plt.show()
A pretty significant limitation of machine learning in general: unless the learning algorithm is hardcoded to look for a specific kind of simple model, it can sometimes fail to find a simple solution to a simple problem. That's why leveraging good feature engineering and relevant architecture priors is essential: you need to precisely tell your model what it should be looking for.
You can use 1D convnets (the Conv1D layer) to fit any sequence data that follows the translation invariance assumption (meaning that if you slide a window over the sequence, the content of the window should follow the same properties independently of the location of the window).
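To make the sliding-window idea concrete, here is a minimal NumPy sketch (not the Keras implementation). The helper `conv1d_valid` and the toy filter and pattern values are hypothetical, chosen only for illustration. It applies one filter at every window position, so the same local pattern produces the same response wherever it occurs in the sequence.

```python
import numpy as np

def conv1d_valid(sequence, kernel):
    """Slide `kernel` over `sequence` and return one dot product per window."""
    window = len(kernel)
    return np.array([
        np.dot(sequence[start:start + window], kernel)
        for start in range(len(sequence) - window + 1)
    ])

kernel = np.array([1.0, 2.0, 1.0])   # hypothetical filter
pattern = np.array([1.0, 2.0, 1.0])  # a local pattern the filter responds to

# The same pattern placed early and late in two otherwise empty sequences
early = np.zeros(12)
early[2:5] = pattern
late = np.zeros(12)
late[7:10] = pattern

early_response = conv1d_valid(early, kernel)
late_response = conv1d_valid(late, kernel)

# The peak response moves with the pattern, but its value is unchanged
print("peak positions:", early_response.argmax(), late_response.argmax())
print("peak values:", early_response.max(), late_response.max())
```

This is exactly the property that helps when a pattern means the same thing at any position, and the property that hurts when position carries meaning.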
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Conv1D(8, 24, activation="relu")(inputs)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(8, 12, activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(8, 6, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
keras.callbacks.ModelCheckpoint("jena_conv.keras",
save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
epochs=10,
validation_data=val_dataset,
callbacks=callbacks)
model = keras.models.load_model("jena_conv.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.title("Training and validation MAE")
plt.legend()
plt.show()
The model performs worse than the densely connected one. Why?
- The weather data doesn't quite respect the translation invariance assumption. Weather data is only translation-invariant for a very specific timescale.
- Order in our data matters a lot. The recent past is more informative for predicting the next day's temperature than data from five days ago. A 1D convnet is not able to leverage this fact.
Try looking at the data as what it is: a sequence, where causality and order matter. There's a family of neural network architectures designed specifically for this use case: recurrent neural networks. Among them, the Long Short-Term Memory (LSTM) layer has long been very popular.
"""
A simple LSTM-based model
"""
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.LSTM(16)(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
keras.callbacks.ModelCheckpoint("jena_lstm.keras",
save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
epochs=10,
validation_data=val_dataset,
callbacks=callbacks)
model = keras.models.load_model("jena_lstm.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
# Plot model
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.title("Training and validation MAE")
plt.legend()
plt.show()
The resulting plot shows much better results than the first two approaches: the LSTM can beat the common-sense baseline, demonstrating the value of machine learning on this task.
Understanding Recurrent Neural Networks
A major characteristic of all neural networks you've seen so far, such as densely connected networks and convnets, is that they have no memory. Each input shown to them is processed independently, with no state kept between inputs. With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once: turn it into a single data point. For instance, this is what we did in the densely connected network example: we flattened our five days of data into a single large vector and processed it in one go. Such networks are called feedforward networks.
A recurrent neural network (RNN) processes sequences by iterating through the sequence elements and maintaining a state that contains information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop.
The state of the RNN is reset between processing two different, independent sequences (such as two samples in a batch), so you still consider one sequence to be a single data point: a single input to the network. This data point is no longer processed in a single step, but the network loops over the sequence elements.
"""
Pseudocode RNN
"""
state_t = 0  # The state at t
for input_t in input_sequence:  # Iterates over sequence elements
    output_t = f(input_t, state_t)
    state_t = output_t  # The previous output becomes the state for the next iteration
"""
More detailed pseudocode for the RNN
"""
state_t = 0
for input_t in input_sequence:
    output_t = activation(dot(W, input_t) + dot(U, state_t) + b)
    state_t = output_t
"""
NumPy implementation of a simple RNN
"""
import numpy as np
timesteps = 100  # Number of timesteps in the input sequence
input_features = 32  # Dimensionality of the input feature space
output_features = 64  # Dimensionality of the output feature space
inputs = np.random.random((timesteps, input_features))  # Input data: random noise for the sake of the example
state_t = np.zeros((output_features,))  # Initial state: an all-zero vector
"""
Create random weight matrices
"""
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))
successive_outputs = []
for input_t in inputs:  # input_t is a vector of shape (input_features,)
    # Combines the input with the current state (the previous output) to obtain
    # the current output. We use `tanh` to add non-linearity (could use
    # other activation function)
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    # Store this output in a list
    successive_outputs.append(output_t)
    # Update the state of the network for the next timestep
    state_t = output_t
# The final output is a rank-2 tensor of shape (timesteps, output_features)
final_output_sequence = np.stack(successive_outputs, axis=0)
In summary, an RNN is a for loop that reuses quantities computed during the previous iteration of the loop, nothing more. RNNs are characterized by their step function.
The SimpleRNN layer in Keras processes batches of sequences, like all other Keras layers, not a single sequence like the NumPy example above - it takes an input of shape (batch_size, timesteps, input_features).
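As a rough illustration of what the extra batch axis changes, the NumPy loop above can be rewritten to carry one state vector per sequence. This is a sketch, not the Keras implementation: the sizes and the random weight scale are arbitrary assumptions, and the weight matrices are stored transposed relative to the single-sequence example so that the batch axis comes first in every product.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, timesteps, input_features, output_features = 4, 10, 32, 64

# A batch of sequences: shape (batch_size, timesteps, input_features)
inputs = rng.random((batch_size, timesteps, input_features))
W = rng.normal(scale=0.1, size=(input_features, output_features))
U = rng.normal(scale=0.1, size=(output_features, output_features))
b = np.zeros((output_features,))

# One state vector per sequence in the batch, all starting at zero
state_t = np.zeros((batch_size, output_features))
for t in range(timesteps):
    input_t = inputs[:, t, :]  # shape (batch_size, input_features)
    # Every sequence in the batch is advanced by one timestep at once
    state_t = np.tanh(input_t @ W + state_t @ U + b)

print(state_t.shape)  # (4, 64)
```

Each row of `state_t` depends only on its own sequence, which matches the idea that state is never shared between independent samples in a batch.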
# An RNN layer that can process sequences of any length
num_features = 14
# setting the timesteps entry to None in the shape argument
# enables network to process sequences of arbitrary length
# If model is meant to process sequences of same length,
# it is recommended to specify the length
inputs = keras.Input(shape=(None, num_features))
outputs = layers.SimpleRNN(16)(inputs)
All recurrent layers in Keras (SimpleRNN, LSTM, and GRU) can be run in two different modes: they can either return full sequences of successive outputs for each timestep (a rank-3 tensor of shape (batch_size, timesteps, output_features)) or return only the last output for each input sequence (a rank-2 tensor of shape (batch_size, output_features)). These modes are controlled by the return_sequences argument.
"""
An RNN layer that returns only its last output step
"""
num_features = 14
steps = 120
inputs = keras.Input(shape=(steps, num_features))
# Note that return_sequences=False
outputs = layers.SimpleRNN(16, return_sequences=False)(inputs)
print(outputs.shape)
"""
An RNN layer that returns its full output sequence
"""
num_features = 14
steps = 120
inputs = keras.Input(shape=(steps, num_features))
# Note that return_sequences=True
outputs = layers.SimpleRNN(16, return_sequences=True)(inputs)
print(outputs.shape)
"""
Stacking RNN Layers
It's sometimes useful to stack several recurrent layers one after the
other in order to increase the representational power of a network. In
such a setup, you have to get all of the intermediate layers
to return a full sequence of outputs
"""
inputs = keras.Input(shape=(steps, num_features))
x = layers.SimpleRNN(16, return_sequences=True)(inputs)
x = layers.SimpleRNN(16, return_sequences=True)(x)
outputs = layers.SimpleRNN(16)(x)
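The reason intermediate layers must return full sequences can be seen in a small NumPy sketch. The `simple_rnn` helper below is hypothetical (random weights, no bias, no training) and only mimics the input and output shapes of a recurrent layer under the two `return_sequences` settings.

```python
import numpy as np

def simple_rnn(inputs, units, return_sequences, rng):
    """Toy batched RNN layer with random weights (illustration only)."""
    batch_size, timesteps, features = inputs.shape  # requires a rank-3 input
    W = rng.normal(scale=0.1, size=(features, units))
    U = rng.normal(scale=0.1, size=(units, units))
    state = np.zeros((batch_size, units))
    outputs = []
    for t in range(timesteps):
        state = np.tanh(inputs[:, t, :] @ W + state @ U)
        outputs.append(state)
    if return_sequences:
        return np.stack(outputs, axis=1)  # (batch_size, timesteps, units)
    return state  # (batch_size, units)

rng = np.random.default_rng(0)
x = rng.random((2, 120, 14))          # (batch_size, steps, num_features)
h1 = simple_rnn(x, 16, True, rng)     # full sequence, can feed another RNN
h2 = simple_rnn(h1, 16, True, rng)    # full sequence again
out = simple_rnn(h2, 16, False, rng)  # last output only
print(h1.shape, h2.shape, out.shape)
```

Feeding `out` (rank 2) into another `simple_rnn` call fails when the shape is unpacked, because there is no longer a time axis to iterate over.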
In practice you'll rarely work with the SimpleRNN layer. Although it should theoretically be able to retain at time t information about inputs seen many timesteps before, such long-term dependencies prove impossible to learn in practice. This is due to the vanishing gradients problem, an effect that is similar to what is observed with non-recurrent neural networks that are many layers deep: as you keep adding layers to a network, the network eventually becomes untrainable.
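A rough numerical illustration of the effect, using a scalar tanh RNN. The weights `w` and `u` and the constant input are hypothetical values picked for the example. By the chain rule, the sensitivity of the current state to the initial state is a product of one factor per timestep, and with these values every factor is below 1, so the product shrinks toward zero.

```python
import math

u = 0.9  # hypothetical recurrent weight
w = 0.5  # hypothetical input weight
state = 0.0
gradient = 1.0  # d(state_t) / d(state_0), accumulated via the chain rule
history = []
for t in range(100):
    state = math.tanh(w * 1.0 + u * state)  # constant input of 1.0
    # Derivative of tanh at this step, times the recurrent weight
    gradient *= u * (1.0 - state ** 2)
    history.append(gradient)

print("after 1 step:   ", history[0])
print("after 10 steps: ", history[9])
print("after 100 steps:", history[99])
```

With a signal this small, an error measured at the last timestep carries almost no information about how to adjust what happened at the first one.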
The LSTM (Long Short-Term Memory) algorithm was developed to address this vanishing gradients problem. The LSTM layer is like the SimpleRNN layer, but it adds a way to carry information across many timesteps. The LSTM saves information for later, preventing older signals from gradually vanishing during processing. This should remind you of residual connections.
Starting from the SimpleRNN cell, add an additional data flow that carries information across timesteps. Call its values at different timesteps c_t, where c stands for carry.
This information will have the following impact on the cell: it will be combined with the input connection and the recurrent connection (via a dense transformation: a dot product with a weight matrix followed by a bias add and the application of an activation function), and it will affect the state being sent to the next timestep (via an activation function and a multiplication operation). Conceptually, the carry dataflow is a way to modulate the next output and the next state.
"""
Pseudocode details of the LSTM architecture (1/2)
"""
output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)
i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)
"""
Pseudocode details of the LSTM architecture (2/2)
"""
c_t+1 = i_t * k_t + c_t * f_t
About the LSTM, just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time, thus fighting the vanishing-gradient problem.
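The pseudocode above can be turned into a small runnable NumPy sketch. Several things here are assumptions rather than part of the notes: the pseudocode only says `activation`, so the sketch picks a sigmoid for `i_t` and `f_t` and tanh for `k_t` and the output; the sizes and random weights are arbitrary; and the `lstm_step` helper and `params` dictionary are hypothetical names. It follows the simplified pseudocode, not the exact Keras LSTM implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(input_t, state_t, c_t, p):
    """One timestep following the pseudocode; returns (output_t, next carry)."""
    output_t = np.tanh(state_t @ p["Uo"] + input_t @ p["Wo"] + c_t @ p["Vo"] + p["bo"])
    i_t = sigmoid(state_t @ p["Ui"] + input_t @ p["Wi"] + p["bi"])
    f_t = sigmoid(state_t @ p["Uf"] + input_t @ p["Wf"] + p["bf"])
    k_t = np.tanh(state_t @ p["Uk"] + input_t @ p["Wk"] + p["bk"])
    # New carry: new information (i_t * k_t) plus the old carry scaled by f_t
    c_next = i_t * k_t + c_t * f_t
    return output_t, c_next

rng = np.random.default_rng(0)
timesteps, input_features, units = 20, 14, 16

# Random parameters for the four transformations (o, i, f, k)
params = {"Vo": rng.normal(scale=0.1, size=(units, units))}
for gate in "oifk":
    params["U" + gate] = rng.normal(scale=0.1, size=(units, units))
    params["W" + gate] = rng.normal(scale=0.1, size=(input_features, units))
    params["b" + gate] = np.zeros(units)

inputs = rng.random((timesteps, input_features))
state_t = np.zeros(units)
c_t = np.zeros(units)
for input_t in inputs:
    output_t, c_t = lstm_step(input_t, state_t, c_t, params)
    state_t = output_t  # the output becomes the state for the next timestep

print(state_t.shape, c_t.shape)
```

Note that the carry is updated additively, so information written into `c_t` can persist across many timesteps without being squashed by an activation function at every step.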
Advanced use of recurrent neural networks
- Recurrent dropout: a variant of dropout, used to fight overfitting in recurrent layers
- Stacking Recurrent Layers: increases the representational power of the model (at the cost of higher computational loads)
- Bidirectional Recurrent Layers: present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues
Recurrent Dropout
The same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of using a dropout mask that varies randomly from timestep to timestep. In order to regularize the representations formed by the recurrent gates of layers such as GRU and LSTM, a temporally constant dropout mask should be applied to the inner recurrent activations of the layer (a recurrent dropout mask). Using the same dropout mask at every timestep allows the network to properly propagate its learning error through time.
Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate for input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units.
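The idea of a temporally constant mask can be sketched in NumPy as follows. This is an illustration of the principle under assumed sizes, weights, and rate, not the actual Keras implementation: the mask is sampled once per sequence and then reused unchanged at every timestep.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, input_features, units = 5, 3, 4
rate = 0.25  # hypothetical recurrent dropout rate

inputs = rng.random((timesteps, input_features))
W = rng.normal(scale=0.1, size=(input_features, units))
U = rng.normal(scale=0.1, size=(units, units))

# Sample the mask ONCE per sequence; kept units are rescaled by 1 / (1 - rate)
recurrent_mask = (rng.random(units) >= rate) / (1.0 - rate)

state_t = np.zeros(units)
masks_used = []
for input_t in inputs:
    dropped_state = state_t * recurrent_mask  # same mask at every timestep
    state_t = np.tanh(input_t @ W + dropped_state @ U)
    masks_used.append(recurrent_mask.copy())

print("mask:", recurrent_mask)
print("final state shape:", state_t.shape)
```

Resampling the mask inside the loop would give the per-timestep random dropout that the text advises against for recurrent connections.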
"""
Training and evaluating a dropout-regularized LSTM
"""
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.LSTM(32, recurrent_dropout=0.25)(inputs)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
keras.callbacks.ModelCheckpoint("jena_lstm_dropout.keras",
save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
epochs=50,
validation_data=val_dataset,
callbacks=callbacks)
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.legend()
plt.show()
The resulting plot shows that adding regularization prevents overfitting through the first 20 epochs. Larger RNNs can greatly benefit from a GPU runtime, but recurrent models with very few parameters can be significantly faster on a multicore CPU than on a GPU.
Stacking Recurrent Layers
Recurrent layer stacking is a classic way to build more-powerful recurrent networks. To stack recurrent layers on top of each other in Keras, all intermediate layers should return their full sequence of outputs (a rank-3 tensor) rather than their output at the last timestep.
"""
Training and evaluating a dropout-regularized, stacked GRU model
In this example, we stack two dropout-regularized recurrent layers
We use GRU instead of LSTM - GRU can be thought of as a slightly
simpler, streamlined version of LSTM
"""
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.GRU(32, recurrent_dropout=0.5, return_sequences=True)(inputs)
x = layers.GRU(32, recurrent_dropout=0.5)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
keras.callbacks.ModelCheckpoint("jena_stacked_gru_dropout.keras",
save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
epochs=50,
validation_data=val_dataset,
callbacks=callbacks)
model = keras.models.load_model("jena_stacked_gru_dropout.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.legend()
plt.show()
Bidirectional Recurrent Layers
A bidirectional RNN is a common RNN variant that can offer greater performance than a regular RNN on certain tasks. It's frequently used in natural language processing. You could call it the Swiss army knife of deep learning for natural language processing.
A bidirectional RNN exploits the order sensitivity of RNNs: it uses two regular RNNs, such as the GRU and LSTM layers, each of which processes the input sequence in one direction, and then merges their representations.
Which order of data points you use isn't as important in natural language processing: the importance of a word in understanding a sentence isn't usually dependent on its position in the sentence. On text data, reversed-order processing works just as well as chronological processing. Word order matters in NLP, but which order you use isn't crucial.
A bidirectional RNN looks at sequences both ways in order to improve on the performance of chronological-order RNNs.
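Conceptually, this amounts to running one RNN over the sequence as given, a second RNN with its own weights over the reversed sequence, and merging the two results. The NumPy sketch below uses a hypothetical `rnn_last_state` helper with random weights and merges by concatenation, which is the default merge mode of the Keras Bidirectional layer. The sizes are arbitrary assumptions.

```python
import numpy as np

def rnn_last_state(sequence, W, U):
    """Run a toy tanh RNN over `sequence` and return its final state."""
    state = np.zeros(U.shape[0])
    for input_t in sequence:
        state = np.tanh(input_t @ W + state @ U)
    return state

rng = np.random.default_rng(0)
timesteps, features, units = 120, 14, 16
sequence = rng.random((timesteps, features))

# Each direction gets its own, independent set of weights
W_fwd = rng.normal(scale=0.1, size=(features, units))
U_fwd = rng.normal(scale=0.1, size=(units, units))
W_bwd = rng.normal(scale=0.1, size=(features, units))
U_bwd = rng.normal(scale=0.1, size=(units, units))

forward = rnn_last_state(sequence, W_fwd, U_fwd)         # chronological order
backward = rnn_last_state(sequence[::-1], W_bwd, U_bwd)  # reversed order
merged = np.concatenate([forward, backward])

print(merged.shape)  # (32,)
```

The merged representation is twice as wide as a single direction, which is why a bidirectional layer wrapping 16 units feeds 32 features to the next layer.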
To instantiate a bidirectional RNN in Keras, you use the Bidirectional layer, which takes as its first argument a recurrent layer instance.
# Training and evaluating a bidirectional LSTM
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Bidirectional(layers.LSTM(16))(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
epochs=10,
validation_data=val_dataset)
Always remember that all trading is fundamentally information arbitrage: gaining an advantage by leveraging data or insights that other market participants are missing. Trying to use well-known machine learning techniques and publicly available data to beat the markets is effectively a dead end, since you won't have any information advantage compared to everyone else.