Deep Learning with Python - Chapters 9 and 10

"Advanced Deep Learning for Computer Vision" and "Deep Learning for Timerseries" go over image segmentation and recurrent neural networks (primarily in the context of timeseries data), respectively.

Advanced Deep Learning for Computer Vision

The Three Essential Computer Vision Tasks

  • Image Classification: The goal is to assign one or more labels to an image. It can be single-label or multi-label classification.

  • Image segmentation: The goal is to "segment" or "partition" an image into several different areas, with each area usually representing a category
  • Object Detection: The goal is to draw rectangles (called bounding boxes) around objects of interest in an image, and associate each rectangle with a class.

Three Main Computer Vision Tasks

Other niche tasks: image similarity scoring, keypoint detection (pinpointing attributes of interest in an image, such as facial features), pose estimation, 3D mesh estimation, and so on. Object detection is beyond the scope of an introductory book: check out the RetinaNet example on keras.io, which shows how to build an object detection model from scratch in Keras.

Image Segmentation Example

Image segmentation with deep learning is about using a model to assign a class to each pixel in an image, thus segmenting the image into different zones (such as "background" and "foreground", or "road", "car", and "sidewalk"). There are two flavors of image segmentation:

  • Semantic segmentation: each pixel is independently classified into a semantic category, like "cat". If there are two cats in the image, the corresponding pixels are all mapped to the generic "cat" category.
  • Instance segmentation: seeks not only to classify image pixels by category, but also to parse out individual object instances. In an image with two cats in it, instance segmentation would treat "cat 1" and "cat 2" as two separate classes of pixels.

A segmentation mask is the image-segmentation equivalent of a label: it's an image the same size as the input image, with a single color channel where each integer value corresponds to the class of the corresponding pixel in the input image.

# Download and uncompress dataset using the wget and tar shell utilities
!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz
!tar -xf images.tar.gz
!tar -xf annotations.tar.gz
out[2]

--2024-09-07 06:31:38-- http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz [following]
--2024-09-07 06:31:39-- https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://thor.robots.ox.ac.uk/pets/images.tar.gz [following]
--2024-09-07 06:31:40-- https://thor.robots.ox.ac.uk/pets/images.tar.gz
Resolving thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)... 129.67.95.98
Connecting to thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)|129.67.95.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 791918971 (755M) [application/octet-stream]
Saving to: ‘images.tar.gz’

images.tar.gz 100%[===================>] 755.23M 21.0MB/s in 38s

2024-09-07 06:32:19 (19.8 MB/s) - ‘images.tar.gz’ saved [791918971/791918971]

URL transformed to HTTPS due to an HSTS policy
--2024-09-07 06:32:19-- https://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://thor.robots.ox.ac.uk/pets/annotations.tar.gz [following]
--2024-09-07 06:32:20-- https://thor.robots.ox.ac.uk/pets/annotations.tar.gz
Resolving thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)... 129.67.95.98
Connecting to thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)|129.67.95.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19173078 (18M) [application/octet-stream]
Saving to: ‘annotations.tar.gz’

annotations.tar.gz 100%[===================>] 18.28M 9.49MB/s in 1.9s

2024-09-07 06:32:24 (9.49 MB/s) - ‘annotations.tar.gz’ saved [19173078/19173078]

import os

input_dir = "/content/images/"
target_dir = "/content/annotations/trimaps/"

"""
The input pictures are stored as JPG files, and the corresponding segmentation masks are stored as PNG files.
Below we prepare the list of input image file paths, as well as the list of the corresponding mask file paths.
"""

input_img_paths = sorted([os.path.join(input_dir, fname) for fname in os.listdir(input_dir) if fname.endswith(".jpg")])

target_paths = sorted([os.path.join(target_dir,fname) for fname in os.listdir(target_dir) if fname.endswith(".png" )and not fname.startswith(".")])

import matplotlib.pyplot as plt
from tensorflow.keras.utils import load_img, img_to_array

plt.axis("off")
ax = plt.imshow(load_img(input_img_paths[9])) # Display input image number 9
plt.gca().set_title("An example image")
plt.show()
plt.clf()
def display_target(target_array,title=None):
  # The original labels are 1, 2, and 3. We subtract 1 so that the
  # labels range from 0 to 2, and then we multiply by 127 so that the
  # labels become 0 (black), 127 (gray), 254 (near-white)
  normalized_array = (target_array.astype("uint8") - 1) * 127
  plt.axis("off")
  plt.imshow(normalized_array[:,:,0])
  if isinstance(title,str):
    plt.gca().set_title(title)
# We use color_mode="grayscale" so that the image we load is treated as having a single color channel
img = img_to_array(load_img(target_paths[9], color_mode="grayscale"))
display_target(img,"The corrsponding Target Mask")
out[3]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

"""
Splitting the data into a training and a validation set
"""
import numpy as np
import random

img_size = (200,200) # We resize everything to 200 x 200
num_imgs = len(input_img_paths) # Total number of samples in the data

"""
Shuffle the file paths (they were originally sorted by breed). We use the same seed in both statements to ensure that the input paths and target paths stay in the same order.
"""
random.Random(1_337).shuffle(input_img_paths)
random.Random(1_337).shuffle(target_paths)

def path_to_input_image(path):
  return img_to_array(load_img(path, target_size=img_size))

def path_to_target(path):
  img = img_to_array(load_img(path, target_size=img_size))
  img = img.astype("uint8") - 1 # Subtract 1 so that our labels become 0, 1, and 2
  return img

"""
Load all images in the inpt_imgs float32 array (same order). The inputs have three channels (RGB values) and the targets have a single channel (which contains integer labels).
"""
input_imgs = np.zeros((num_imgs,) + img_size + (3,), dtype="float32")
targets = np.zeros((num_imgs,) + img_size + (1,), dtype="uint8")

# Reserve 1,000 samples for validation
num_val_samples = 1_000
# Split the data into a training and a validation set
train_input_imgs = input_imgs[:-num_val_samples]
train_targets = targets[:-num_val_samples]
val_input_imgs = input_imgs[-num_val_samples:]
val_targets = targets[-num_val_samples:]
out[4]
from tensorflow import keras
from tensorflow.keras import layers

def get_model(img_size, num_classes):
  inputs = keras.Input(shape=img_size+(3,))
  # Rescale input images to the [0, 1] range
  x = layers.Rescaling(1./255)(inputs)
  # We use passing="same" everywhere tpo avoid the influence of border passing on feature map size
  x = layers.Conv2D(64, 3, strides=2, activation="relu", padding="same")(x)
  x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
  x = layers.Conv2D(128, 3, strides=2, activation="relu", padding="same")(x)
  x = layers.Conv2D(128, 3, activation="relu", padding="same")(x)
  x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
  x = layers.Conv2D(256, 3, activation="relu", padding="same")(x)
  x = layers.Conv2DTranspose(256, 3, activation="relu", padding="same")(x)
  x = layers.Conv2DTranspose(256, 3, activation="relu", padding="same", strides=2)(x)
  x = layers.Conv2DTranspose(128, 3, activation="relu", padding="same")(x)
  x = layers.Conv2DTranspose(128, 3, activation="relu", padding="same", strides=2)(x)
  x = layers.Conv2DTranspose(64, 3, activation="relu", padding="same")(x)
  x = layers.Conv2DTranspose(64, 3, activation="relu", padding="same", strides=2)(x)
  # We end the model with a per-pixel three-way softmax to classify each output pixel into one of our three categories
  outputs = layers.Conv2D(num_classes, 3, activation="softmax", padding="same")(x)
  model = keras.Model(inputs, outputs)
  return model

model = get_model(img_size=img_size, num_classes=3)
print(model.summary())
out[5]

Model: "functional"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer (InputLayer) │ (None, 200, 200, 3) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ rescaling (Rescaling) │ (None, 200, 200, 3) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d (Conv2D) │ (None, 100, 100, 64) │ 1,792 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_1 (Conv2D) │ (None, 100, 100, 64) │ 36,928 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_2 (Conv2D) │ (None, 50, 50, 128) │ 73,856 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_3 (Conv2D) │ (None, 50, 50, 128) │ 147,584 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_4 (Conv2D) │ (None, 25, 25, 256) │ 295,168 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_5 (Conv2D) │ (None, 25, 25, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_transpose (Conv2DTranspose) │ (None, 25, 25, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_transpose_1 (Conv2DTranspose) │ (None, 50, 50, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_transpose_2 (Conv2DTranspose) │ (None, 50, 50, 128) │ 295,040 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_transpose_3 (Conv2DTranspose) │ (None, 100, 100, 128) │ 147,584 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_transpose_4 (Conv2DTranspose) │ (None, 100, 100, 64) │ 73,792 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_transpose_5 (Conv2DTranspose) │ (None, 200, 200, 64) │ 36,928 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ conv2d_6 (Conv2D) │ (None, 200, 200, 3) │ 1,731 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 2,880,643 (10.99 MB)

 Trainable params: 2,880,643 (10.99 MB)

 Non-trainable params: 0 (0.00 B)

None

The first half of the model resembles the kind of convnet you'd use for image classification: a stack of Conv2D layers, with gradually increasing filter counts. We downsample our images three times by a factor of two each time, ending up with activations of size (25, 25, 256). The purpose of this first half is to encode the images into smaller feature maps, where each spatial location (or pixel) contains information about a large spatial chunk of the original image. It is a kind of compression.

One important difference between the first half of this model and the classification models you've seen before is the way we do downsampling: in classification, we downsample through MaxPooling2D layers, whereas here we downsample by adding strides to every other convolution layer. We do this because, in the case of image segmentation, we care a lot about the spatial location of information in the image, since we need to produce a per-pixel target mask as the output of the model. Max pooling layers hurt you quite a bit for segmentation tasks, whereas strided convolutions do a better job of downsampling feature maps while retaining location information. Use strides instead of pooling when you care about feature location.

The second half of the model is a stack of Conv2DTranspose layers. The output of the first half of the model is a feature map of shape (25, 25, 256), but we want the final output to have the same shape as the target masks (200, 200, 3). We need to apply a kind of inverse of the transformations we've applied so far - something that will upsample the feature maps instead of downsampling them. That's the purpose of the Conv2DTranspose layer: you can think of it as a kind of convolution layer that learns to upsample.

If you have an input of shape (100, 100, 64), and you run it through the layer Conv2D(128, 3, strides=2, padding="same"), you get an output of shape (50, 50, 128). If you run this output through the layer Conv2DTranspose(64, 3, strides=2, padding="same"), you get back an output of shape (100, 100, 64).
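As a quick sanity check of that shape arithmetic, a minimal sketch like the following (the layer arguments mirror the example above) prints the round-trip shapes:

from tensorflow import keras
from tensorflow.keras import layers

# Minimal shape check: a strided Conv2D halves the spatial dimensions,
# and a strided Conv2DTranspose brings them back up
x = keras.Input(shape=(100, 100, 64))
down = layers.Conv2D(128, 3, strides=2, padding="same")(x)
print(down.shape) # (None, 50, 50, 128)
up = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(down)
print(up.shape) # (None, 100, 100, 64)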

Good Video on Upsampling and Conv2DTranspose

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
callbacks = [
 keras.callbacks.ModelCheckpoint("oxford_segmentation.keras",
 save_best_only=True)
]
history = model.fit(train_input_imgs, train_targets,epochs=50,
callbacks=callbacks, batch_size=64, validation_data=(val_input_imgs, val_targets))
out[7]

Epoch 1/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 148s 1s/step - loss: 1.0565 - val_loss: 0.9592
Epoch 2/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 105s 472ms/step - loss: 0.9291 - val_loss: 0.8415
Epoch 3/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 85s 499ms/step - loss: 0.8138 - val_loss: 0.7335
Epoch 4/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 79s 473ms/step - loss: 0.7083 - val_loss: 0.6354
Epoch 5/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 0.6126 - val_loss: 0.5471
Epoch 6/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 84s 499ms/step - loss: 0.5268 - val_loss: 0.4684
Epoch 7/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 471ms/step - loss: 0.4505 - val_loss: 0.3990
Epoch 8/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 473ms/step - loss: 0.3832 - val_loss: 0.3382
Epoch 9/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 474ms/step - loss: 0.3245 - val_loss: 0.2854
Epoch 10/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 474ms/step - loss: 0.2736 - val_loss: 0.2400
Epoch 11/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 475ms/step - loss: 0.2298 - val_loss: 0.2010
Epoch 12/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 0.1924 - val_loss: 0.1679
Epoch 13/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 85s 499ms/step - loss: 0.1606 - val_loss: 0.1399
Epoch 14/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 79s 473ms/step - loss: 0.1337 - val_loss: 0.1163
Epoch 15/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 84s 498ms/step - loss: 0.1111 - val_loss: 0.0965
Epoch 16/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 500ms/step - loss: 0.0921 - val_loss: 0.0799
Epoch 17/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 79s 472ms/step - loss: 0.0763 - val_loss: 0.0661
Epoch 18/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 85s 499ms/step - loss: 0.0631 - val_loss: 0.0546
Epoch 19/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 499ms/step - loss: 0.0521 - val_loss: 0.0451
Epoch 20/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 499ms/step - loss: 0.0430 - val_loss: 0.0372
Epoch 21/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 79s 473ms/step - loss: 0.0355 - val_loss: 0.0307
Epoch 22/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 474ms/step - loss: 0.0292 - val_loss: 0.0253
Epoch 23/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 0.0241 - val_loss: 0.0208
Epoch 24/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 472ms/step - loss: 0.0198 - val_loss: 0.0171
Epoch 25/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 474ms/step - loss: 0.0163 - val_loss: 0.0141
Epoch 26/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 0.0134 - val_loss: 0.0116
Epoch 27/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 472ms/step - loss: 0.0111 - val_loss: 0.0095
Epoch 28/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 473ms/step - loss: 0.0091 - val_loss: 0.0078
Epoch 29/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 475ms/step - loss: 0.0075 - val_loss: 0.0065
Epoch 30/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 475ms/step - loss: 0.0062 - val_loss: 0.0053
Epoch 31/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 84s 499ms/step - loss: 0.0051 - val_loss: 0.0044
Epoch 32/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 79s 473ms/step - loss: 0.0042 - val_loss: 0.0036
Epoch 33/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 84s 499ms/step - loss: 0.0034 - val_loss: 0.0030
Epoch 34/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 79s 475ms/step - loss: 0.0028 - val_loss: 0.0024
Epoch 35/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 473ms/step - loss: 0.0023 - val_loss: 0.0020
Epoch 36/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 0.0019 - val_loss: 0.0017
Epoch 37/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 473ms/step - loss: 0.0016 - val_loss: 0.0014
Epoch 38/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 474ms/step - loss: 0.0013 - val_loss: 0.0012
Epoch 39/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 0.0011 - val_loss: 9.6275e-04
Epoch 40/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 9.2231e-04 - val_loss: 8.0982e-04
Epoch 41/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 474ms/step - loss: 7.7691e-04 - val_loss: 6.8570e-04
Epoch 42/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 473ms/step - loss: 6.5907e-04 - val_loss: 5.8515e-04
Epoch 43/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 472ms/step - loss: 5.6356e-04 - val_loss: 5.0365e-04
Epoch 44/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 85s 499ms/step - loss: 4.8599e-04 - val_loss: 4.3716e-04
Epoch 45/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 79s 472ms/step - loss: 4.2281e-04 - val_loss: 3.8283e-04
Epoch 46/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 82s 474ms/step - loss: 3.7113e-04 - val_loss: 3.3826e-04
Epoch 47/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 84s 500ms/step - loss: 3.2854e-04 - val_loss: 3.0132e-04
Epoch 48/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 50s 498ms/step - loss: 2.9325e-04 - val_loss: 2.7057e-04
Epoch 49/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 47s 472ms/step - loss: 2.6377e-04 - val_loss: 2.4459e-04
Epoch 50/50
100/100 ━━━━━━━━━━━━━━━━━━━━ 85s 500ms/step - loss: 2.3893e-04 - val_loss: 2.2266e-04

epochs = range(1, len(history.history["loss"]) + 1)
loss = history.history["loss"]
val_loss = history.history["val_loss"]
plt.figure()
plt.plot(epochs, loss, "bo", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.legend()
plt.show()
out[8]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

from tensorflow.keras.utils import array_to_img
model = keras.models.load_model("oxford_segmentation.keras")
i = 4
test_image = val_input_imgs[i]
plt.axis("off")
plt.imshow(array_to_img(test_image))
mask = model.predict(np.expand_dims(test_image, 0))[0]

def display_mask(pred):
  """
  Utility to display a model's prediction
  """
  mask = np.argmax(pred, axis=-1)
  mask *= 127
  plt.axis("off")
  plt.imshow(mask)

display_mask(mask)
out[9]

1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Modern Convnet Architecture Patterns

A model's "architecture" is the sum of the choices that went into crating it: which layers to use, how to configure them, and in what arrangement to connect them.These choices define the hypothesis space of your model: the space of possible functions that gradient descent can search over, parameterized by the model's weights. Like feature engineering, a good hypothesis space encodes prior knowledge that you have about the problem at hand and its solution.

A good model architecture is one that reduces the size of the search space or otherwise makes it easier to converge to a good point of the search space. Model architecture is all about making the problem simpler for gradient descent to solve. (Gradient descent is a stupid search process - so it needs all the help it can get.)

Modularity, hierarchy, and reuse

If you want to make a complex system simpler, there's a universal recipe you can apply: just structure your amorphous soup of complexity into modules, organize the modules into a hierarchy, and start reusing the same modules as appropriate ("reuse" is another word for abstraction in this context).

In general, a deep stack of narrow layers performs better than a shallow stack of large layers. However, there's a limit to how deep you can stack layers, due to the problem of vanishing gradients: when the function chain (the stack of layers) gets too deep, the error signal cannot propagate back to the earlier layers, because each function in the chain adds some amount of noise.

To fix this, force each function in the chain to be nondestructive - to retain a noiseless version of the information contained in the previous input. The easiest way to implement this is to use a residual connection. The residual connection acts as an information shortcut around destructive or noisy blocks (such as blocks that contain activation functions or dropout layers), enabling error gradient information from earlier layers to propagate noiselessly through a deep network.

Residual Connection around a Processing Block

# A residual connection in pseudocode
x = ... # Some input tensor
residual = x # Save a pointer to the original input, called the residual
x = block(x) # The computation block can potentially be destructive or noisy
# Add the original input to the block's output: the final output
# will always preserve full information about the original input
x = add([x, residual])
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(32,32,3))
x = layers.Conv2D(32,3,activation="relu")(inputs)
# Set aside the residual
residual = x
# Layer around which we create the residual connection
# padding="same" to avoid downsampling due to padding
x = layers.Conv2D(64,3,activation="relu",padding="same")(x)
# The residual only had 32 filters, so we use a 1x1 Conv2D to project to
# the correct shape
residual = layers.Conv2D(64,1)(residual)
# The block output and the residual now
# have the same shape and can be added
x = layers.add([x, residual])
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(32,32,3))
x = layers.Conv2D(32,3,activation="relu")(inputs)
# Set aside the residual
residual = x
# Block of two layers around which we create the residual connection
# padding="same" to avoid downsampling due to padding,
x = layers.Conv2D(64,3,activation="relu",padding="same")(x)
x = layers.MaxPooling2D(2, padding="same")(x)
# We use strides=2 in the residual projection to match the downsampling
# created by the max pooling layer
residual = layers.Conv2D(64, 1, strides=2)(residual)
# The block output and the residual now
# have the same shape and can be added
x = layers.add([x, residual])
"""
Example of a simple convnet structured into a series of blocks, each made of two convolutional layers and one optional max pooling layer, with a residual connection around each block
"""
inputs = keras.Input(shape=(32,32,3))
x = layers.Rescaling(1./255)(inputs)
def residual_block(x,filters,pooling=False):
  """
  Utility function to apply a convolutional block with
  a residual connection, with an option to add max pooling
  """
  residual = x
  x = layers.Conv2D(filters,3,activation="relu", padding="same")(x)
  x = layers.Conv2D(filters,3,activation="relu", padding="same")(x)
  if pooling:
    x = layers.MaxPooling2D(2, padding="same")(x)
    # Add a strided convolution to project the residual to the expected shape if using pooling
    residual = layers.Conv2D(filters,1,strides=2)(residual)
  elif filters != residual.shape[-1]:
    # If not using max pooling, only project the residual if the number of channels has changed
    residual = layers.Conv2D(filters,1)(residual)
  x = layers.add([x,residual])
  return x
# First block
x = residual_block(x, filters=32, pooling=True)
# Second Block: note increasing filter count
x = residual_block(x, filters=64, pooling=True)
# Last block doesn't need a max pooling layer, since we will apply global average pooling right after it
x = residual_block(x, filters=128, pooling=False)

x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
print(model.summary())
out[11]

Model: "functional_1"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃ Connected to  ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩

│ input_layer_1 │ (None, 32, 32, 3) │ 0 │ - │

│ (InputLayer) │ │ │ │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ rescaling_1 (Rescaling) │ (None, 32, 32, 3) │ 0 │ input_layer_1[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_7 (Conv2D) │ (None, 32, 32, 32) │ 896 │ rescaling_1[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_8 (Conv2D) │ (None, 32, 32, 32) │ 9,248 │ conv2d_7[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ max_pooling2d │ (None, 16, 16, 32) │ 0 │ conv2d_8[0][0] │

│ (MaxPooling2D) │ │ │ │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_9 (Conv2D) │ (None, 16, 16, 32) │ 128 │ rescaling_1[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ add (Add) │ (None, 16, 16, 32) │ 0 │ max_pooling2d[0][0], │

│ │ │ │ conv2d_9[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_10 (Conv2D) │ (None, 16, 16, 64) │ 18,496 │ add[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_11 (Conv2D) │ (None, 16, 16, 64) │ 36,928 │ conv2d_10[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ max_pooling2d_1 │ (None, 8, 8, 64) │ 0 │ conv2d_11[0][0] │

│ (MaxPooling2D) │ │ │ │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_12 (Conv2D) │ (None, 8, 8, 64) │ 2,112 │ add[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ add_1 (Add) │ (None, 8, 8, 64) │ 0 │ max_pooling2d_1[0][0], │

│ │ │ │ conv2d_12[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_13 (Conv2D) │ (None, 8, 8, 128) │ 73,856 │ add_1[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_14 (Conv2D) │ (None, 8, 8, 128) │ 147,584 │ conv2d_13[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ conv2d_15 (Conv2D) │ (None, 8, 8, 128) │ 8,320 │ add_1[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ add_2 (Add) │ (None, 8, 8, 128) │ 0 │ conv2d_14[0][0], │

│ │ │ │ conv2d_15[0][0] │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ global_average_pooling2d │ (None, 128) │ 0 │ add_2[0][0] │

│ (GlobalAveragePooling2D) │ │ │ │

├───────────────────────────┼────────────────────────┼────────────────┼────────────────────────┤

│ dense (Dense) │ (None, 1) │ 129 │ global_average_poolin… │

└───────────────────────────┴────────────────────────┴────────────────┴────────────────────────┘

 Total params: 297,697 (1.14 MB)

 Trainable params: 297,697 (1.14 MB)

 Non-trainable params: 0 (0.00 B)

None

With residual connections, you can build networks of arbitrary depth, without having to worry about vanishing gradients.

Batch Normalization

Normalization is a broad category of methods that seek to make different samples seen by a machine learning model more similar to each other, which helps the model learn and generalize well to new data.

Batch normalization is a type of layer (BatchNormalization in Keras) that can adaptively normalize data even as the mean and variance change over time during training. During training, it uses the mean and variance of the current batch of data to normalize samples, and during inference, it uses an exponential moving average of the batch-wise mean and variance of the data seen during training.

[D]eep learning is not an exact science, but a set of ever-changing, empirically derived engineering best practices, woven together by unreliable narratives.

The main effect of batch normalization appears to be that it helps with gradient propagation - much like residual connections - and thus allows for deeper networks. The BatchNormalization layer can be used after any layer.
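Conceptually, the training-time behavior boils down to something like this minimal NumPy sketch (it ignores the learned scale and offset parameters and the moving-average bookkeeping that the real layer also maintains):

import numpy as np

def batch_norm_training_step(batch, eps=1e-3):
  # Normalize each feature using the statistics of the current batch
  mean = batch.mean(axis=0)
  variance = batch.var(axis=0)
  return (batch - mean) / np.sqrt(variance + eps)

# At inference time, the layer instead uses an exponential moving average
# of the batch-wise mean and variance accumulated during training.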

# Because the output of the Conv2D layer gets normalized, the layer
# doesn't need its own bias vector - this makes the Conv2D layer
# slightly leaner
x = layers.Conv2D(32, 3, use_bias=False)(x)
x = layers.BatchNormalization()(x)

It is generally recommended to place the previous layer's activation after the batch normalization layer. When fine-tuning a model that includes BatchNormalization layers, it is recommended to leave these layers frozen; otherwise, they will keep updating their internal mean and variance.

# Note the lack of activation
x = layers.Conv2D(32, 3, use_bias=False)(x)
x = layers.BatchNormalization()(x)
# Place the activation after the BatchNormalization layer
x = layers.Activation("relu")(x)

Depthwise Separable Convolutions

The depthwise separable convolution layer (SeparableConv2D in Keras) is a drop-in replacement for Conv2D that will make the model smaller and leaner and cause it to perform a few percentage points better on its task. This layer performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution (a 1×1 convolution).

Depthwise Convolution: Independent spatial convs per channel

Depthwise separable convolution relies on the assumption that spatial locations in intermediate activations are highly correlated, but different channels are highly independent. This assumption is generally true for the image representations learned by deep neural networks.

This layer requires significantly fewer parameters and involves fewer computations than a regular convolution, while having comparable representational power. It results in smaller models that converge faster and are less prone to overfitting.
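To make the parameter savings concrete, here is a small comparison sketch (the input shape and filter counts are arbitrary choices for illustration):

from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(32, 32, 64))
regular = keras.Model(inp, layers.Conv2D(64, 3, padding="same")(inp))
separable = keras.Model(inp, layers.SeparableConv2D(64, 3, padding="same")(inp))
# Regular convolution: 3*3*64*64 weights + 64 biases = 36,928 parameters
print(regular.count_params())
# Depthwise separable: 3*3*64 depthwise + 64*64 pointwise + 64 biases = 4,736 parameters
print(separable.count_params())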

Summary

  • Your model should be organized into repeated blocks of layers, usually made of multiple convolution layers and a max pooling layer.
  • The number of filters in your layers should increase as the size of the spatial feature maps decreases.
  • Deep and narrow is better than broad and shallow.
  • Introducing residual connections around blocks of layers helps you train deeper networks.
  • It can be beneficial to introduce batch normalization layers after your convolution layers.
  • It can be beneficial to replace Conv2D layers with SeparableConv2D layers, which are more parameter-efficient.

Example of Small Xception Model:

data_augmentation = keras.Sequential(
 [
 layers.RandomFlip("horizontal"),
 layers.RandomRotation(0.1),
 layers.RandomZoom(0.2),
 ]
)

inputs = keras.Input(shape=(180, 180, 3))
# We use the same data augmentation config as before
x = data_augmentation(inputs)
# Don't forget input rescaling
x = layers.Rescaling(1./255)(x)
# The assumption that underlies separable convolution -
# that "feature channels are largely independent" - does not
# hold for RGB images: red, green, and blue are highly correlated
# in natural images. That's why the first layer here is a regular
# Conv2D layer.
x = layers.Conv2D(filters=32, kernel_size=5, use_bias=False)(x)

# Apply a series of convolutional blocks with increasing feature depth
# Each block consists of two batch-normalized depthwise separable convolution
# layers and a max pooling layer, with a residual connection around the
# entire block.
for size in [32, 64, 128, 256, 512]:
  residual = x
  x = layers.BatchNormalization()(x)
  x = layers.Activation("relu")(x)
  x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)
  x = layers.BatchNormalization()(x)
  x = layers.Activation("relu")(x)
  x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)
  x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
  residual = layers.Conv2D(
  size, 1, strides=2, padding="same", use_bias=False)(residual)
  x = layers.add([x, residual])

# In the original model, we used a Flatten layer before the Dense layer
# Here, we go with GlobalAveragePooling2D layer
x = layers.GlobalAveragePooling2D()(x)
# We add dropout for regularization
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss="binary_crossentropy",
 optimizer="rmsprop",
 metrics=["accuracy"])
callbacks = [
 keras.callbacks.ModelCheckpoint(
 filepath="convnet_from_scratch_with_augmentation.keras",
 save_best_only=True,
 monitor="val_loss")
]
history = model.fit(
 train_dataset,
 epochs=100,
 validation_data=validation_dataset,
 callbacks=callbacks)

Interpreting what convnets learn

A fundamental problem when building a computer vision application is that of interpretability: why did your classifier think a particular image contained a fridge, when all you can see is a truck?

The representations learned by convnets are highly amenable to visualization, in large part because they're representations of visual concepts. Techniques for visualizing and interpreting these representations:

  • Visualizing intermediate convnet outputs (intermediate activations): useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters
  • Visualizing convnet filters: useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to
  • Visualizing heatmaps of class activation in an image: useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in an image.
from tensorflow import keras
model = keras.models.load_model("convnet_from_scratch_with_augmentation.keras")
print(model.summary())

"""
Preprocessing a single image
"""
from tensorflow import keras
import numpy as np

# Download a test image
img_path = keras.utils.get_file(fname="cat.jpg",origin="https://img-datasets.s3.amazonaws.com/cat.jpg")


def get_img_array(img_path, target_size):
  # Open the image file and resize it
  img = keras.utils.load_img(img_path, target_size=target_size)
  # Turn the image into a float32 NumPy array of shape (180, 180, 3)
  array = keras.utils.img_to_array(img)
  # Add a dimension to transform the array into a "batch" of a single sample. Its shape is now (1, 180, 180, 3)
  array = np.expand_dims(array, axis=0)
  return array

"""
Displaying the test picture
"""
import matplotlib.pyplot as plt
# Build the input batch for the test image (so that img_tensor is defined)
img_tensor = get_img_array(img_path, target_size=(180, 180))
plt.axis("off")
plt.imshow(img_tensor[0].astype("uint8"))
plt.show()

The Test Cat Picture

from tensorflow.keras import layers
layer_outputs = []
layer_names = []
for layer in model.layers:
 if isinstance(layer, (layers.Conv2D, layers.MaxPooling2D)):
    # Extract the outputs of all Conv2D and MaxPooling2D layers and put them in a list
    layer_outputs.append(layer.output)
    # Save layer names for later
    layer_names.append(layer.name)
# Create a model that will return these outputs, given the model input
activation_model = keras.Model(inputs=model.input, outputs=layer_outputs)
# Using the model to compute layer activations
# Returns a list of NumPy arrays: one array per layer activation
activations = activation_model.predict(img_tensor)

# Visualizing the fifth channel of the first layer's activation
import matplotlib.pyplot as plt
first_layer_activation = activations[0]
plt.matshow(first_layer_activation[0, :, :, 5], cmap="viridis")

Fifth Channel of the Activation

# ... some visualization code
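A minimal sketch of what that visualization code might look like (it assumes the activations and layer_names lists computed above, and that every layer has a channel count divisible by 16):

import numpy as np
import matplotlib.pyplot as plt

images_per_row = 16
for layer_name, layer_activation in zip(layer_names, activations):
  n_features = layer_activation.shape[-1] # Number of channels in the layer
  size = layer_activation.shape[1]        # Spatial size of the feature map
  n_cols = n_features // images_per_row
  display_grid = np.zeros((size * n_cols, images_per_row * size))
  # Tile every channel of the layer activation into one big grid image
  for col in range(n_cols):
    for row in range(images_per_row):
      channel_image = layer_activation[0, :, :, col * images_per_row + row]
      display_grid[col * size:(col + 1) * size,
                   row * size:(row + 1) * size] = channel_image
  plt.figure(figsize=(display_grid.shape[1] / 25., display_grid.shape[0] / 25.))
  plt.title(layer_name)
  plt.axis("off")
  plt.imshow(display_grid, aspect="auto", cmap="viridis")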

Every channel of every layer activation on the test cat picture

Things to note:

  • The first layer acts as a collection of various edge detectors. At that stage, the activations retain almost all of the information present in the initial picture.
  • As you go deeper, the activations become increasingly abstract and less visually interpretable. Deeper representations carry less information about the visual contents of the image, and increasingly more information related to the class of the image.
  • The sparsity of the activations increases with the depth of the layer: in the first layer, almost all filters are activated by the input image, but in the following layers, more and more filters are blank, meaning the pattern encoded by the filter isn't found in the input image.

We have just evidenced an important universal characteristic of the representations learned by deep neural networks: the features extracted by a layer become increasingly abstract with the depth of the layer. The activations of higher layers carry less and less information about the specific input being seen, and more and more information about the target (in this case, the class of the image: cat or dog). A deep neural network effectively acts as an information distillation pipeline, with raw data going in (in this case, RGB pictures) and being repeatedly transformed so that irrelevant information is filtered out (for example, the specific visual appearance of the image), and useful information is magnified and refined (for example, the class of the image).

Visualizing convnet filters

Another way to inspect the filters learned by convnets is to display the visual pattern that each filter is meant to respond to. This can be done with gradient ascent in input space: applying gradient ascent to the value of the input image of a convnet so as to maximize the response of a specific filter, starting from a blank input image.

# ... some visualization code
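A minimal sketch of that gradient-ascent loop, assuming a feature_extractor model that maps an input image to the activations of one convolutional layer (the layer name below is a placeholder, not necessarily a layer that exists in the model loaded above):

import tensorflow as tf
from tensorflow import keras

# Build a model that returns the activations of one convolutional layer
layer = model.get_layer(name="conv2d_15") # Placeholder layer name
feature_extractor = keras.Model(inputs=model.input, outputs=layer.output)

def compute_loss(image, filter_index):
  # The quantity to maximize: the mean activation of the chosen filter
  activation = feature_extractor(image)
  return tf.reduce_mean(activation[:, :, :, filter_index])

def gradient_ascent_step(image, filter_index, learning_rate):
  with tf.GradientTape() as tape:
    tape.watch(image)
    loss = compute_loss(image, filter_index)
  grads = tape.gradient(loss, image)
  # Normalize the gradients and move the image in the direction
  # that increases the filter's response (gradient ascent)
  grads = tf.math.l2_normalize(grads)
  return image + learning_rate * grads

# Start from a gray image with a little noise and iterate
image = tf.random.uniform((1, 180, 180, 3), minval=0.4, maxval=0.6)
for _ in range(30):
  image = gradient_ascent_step(image, filter_index=2, learning_rate=10.0)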

The visualization below should tell you a lot about how convnet filters see the world: each layer in a convnet learns a collection of filters such that their inputs can be expressed as a combination of the filters - this is similar to how the Fourier transform decomposes signals onto a bank of cosine functions. The filters get increasingly complex and refined as you go deeper in the model:

  • The filters from the first layers in the model encode simple directional edges and colors.
  • The filters from layers a bit further up the stack encode simple textures made from combinations of edges and colors.
  • The filters in higher layers begin to resemble features found in natural images: feathers, eyes, leaves, and so on.

Some filter Patterns for Convolutional Layers

Visualizing Heatmaps of Class Activation

Debugging a classification mistake falls into a domain called model interpretability. The general category of techniques is called class activation map (CAM) visualization, and it consists of producing heatmaps of class activation over input images. A class activation heatmap is a 2D grid of scores associated with a specific output class, computed for every location in an input image, indicating how important each location is with respect to the class under consideration.
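A hedged sketch of the core Grad-CAM-style computation could look roughly like the following (the layer name, the img_array input batch, and the single-sigmoid-output assumption are all placeholders):

import tensorflow as tf
from tensorflow import keras

last_conv_layer = model.get_layer("conv2d_15") # Placeholder layer name
grad_model = keras.Model(model.inputs, [last_conv_layer.output, model.output])

with tf.GradientTape() as tape:
  conv_output, preds = grad_model(img_array) # img_array: a (1, H, W, 3) batch
  top_class_score = preds[:, 0] # Assumes a single-output (sigmoid) classifier

# Gradient of the class score with respect to the conv layer's output
grads = tape.gradient(top_class_score, conv_output)
# How important each channel is, on average, for the class score
pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
# Weight each channel of the feature map by its importance, then sum
heatmap = tf.reduce_sum(conv_output[0] * pooled_grads, axis=-1)
# Keep only positive contributions and rescale to [0, 1]
heatmap = tf.maximum(heatmap, 0) / tf.reduce_max(heatmap)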

Deep Learning for Timeseries

Different kinds of Timeseries Tasks

A timeseries can be any data obtained via measurements at regular intervals, like the daily price of a stock, the hourly electricity consumption of a city, or the weekly sales of a store. Working with timeseries involves understanding the dynamics of a system - its periodic cycles, how it trends over time, its regular regime, and its sudden spikes.

The most common timeseries-related task is forecasting: predicting what will happen next in a series. Things you can do with timeseries:

  • Classification
  • Event Detection
  • Anomaly Detection

The Fourier transform can be highly valuable when preprocessing any data that is primarily characterized by its cycles and oscillations.
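For instance, a purely illustrative NumPy sketch that recovers the dominant cycle of a synthetic hourly signal with the Fourier transform:

import numpy as np

# Synthetic hourly signal with a 24-hour cycle plus noise (illustration only)
hours = np.arange(24 * 365)
signal = np.sin(2 * np.pi * hours / 24) + 0.3 * np.random.randn(len(hours))

spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(len(signal), d=1.0) # Frequencies in cycles per hour
dominant_period = 1 / freqs[np.argmax(spectrum)]
print(f"Dominant period: {dominant_period:.1f} hours") # Roughly 24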

A Temperature-Forecasting Example

Predicting the temperature 24 hours in the future, given a timeseries of hourly measurements of quantities such as atmospheric pressure and humidity, recorded over the recent past by a set of sensors on the roof of a building. Recurrent neural networks (RNNs) really shine on this type of problem.

Periodicity over multiple timescales is an important and very common property of timeseries data. When exploring data, make sure to look for these patterns. When working with timeseries, it's important to use validation and test data that is more recent than the training data, because you're trying to predict the future given the past, not the reverse.

!wget https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
!unzip jena_climate_2009_2016.csv.zip
out[16]

--2024-09-07 23:17:19-- https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.90.150, 54.231.139.200, 52.217.123.160, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.90.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13565642 (13M) [application/zip]
Saving to: ‘jena_climate_2009_2016.csv.zip’

jena_climate_2009_2 100%[===================>] 12.94M 17.1MB/s in 0.8s

2024-09-07 23:17:21 (17.1 MB/s) - ‘jena_climate_2009_2016.csv.zip’ saved [13565642/13565642]

Archive: jena_climate_2009_2016.csv.zip
inflating: jena_climate_2009_2016.csv
inflating: __MACOSX/._jena_climate_2009_2016.csv

"""
Inspecting the data of the Jena weather dataset
"""
import os
fname = os.path.join("/content/jena_climate_2009_2016.csv")
with open(fname) as f:
 data = f.read()
lines = data.split("\n")
header = lines[0].split(",")
lines = lines[1:]
print(header)
print(len(lines))
out[17]

['"Date Time"', '"p (mbar)"', '"T (degC)"', '"Tpot (K)"', '"Tdew (degC)"', '"rh (%)"', '"VPmax (mbar)"', '"VPact (mbar)"', '"VPdef (mbar)"', '"sh (g/kg)"', '"H2OC (mmol/mol)"', '"rho (g/m**3)"', '"wv (m/s)"', '"max. wv (m/s)"', '"wd (deg)"']
420451

"""
Parse the data into NumPy arrays
"""
import numpy as np
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
  values = [float(x) for x in line.split(",")[1:]]
  temperature[i] = values[1] # Column 1 is the "temperature" array
  raw_data[i, :] = values[:] # Store all columns in the "raw_data" array
out[18]
"""
Plotting the temperature timeseries
"""
from matplotlib import pyplot as plt
plt.plot(range(len(temperature)), temperature)
plt.show()
out[19]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

# Plotting the first 10 days of the temperature timeseries
plt.plot(range(1440), temperature[:1440])
plt.show()
out[20]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

"""
Computing the number of samples  we'll use for each data split
"""
num_train_samples = int(0.5 * len(raw_data))
num_val_samples = int(0.25 * len(raw_data))
num_test_samples = len(raw_data) - num_train_samples - num_val_samples
print("num_train_samples:", num_train_samples)
print("num_val_samples:", num_val_samples)
print("num_test_samples:", num_test_samples)
out[21]

num_train_samples: 210225
num_val_samples: 105112
num_test_samples: 105114

# Normalize the data
mean = raw_data[:num_train_samples].mean(axis=0)
raw_data -= mean
std = raw_data[:num_train_samples].std(axis=0)
raw_data /= std
out[22]
import tensorflow as tf
from tensorflow import keras
"""
Instantiating datasets for training, validation, and testing.
With timeseries_dataset_from_array, we will use the following parameter values:
- sampling_rate = 6 - observations will be sampled at one data point per hour (we only keep one data point out of 6)
- sequence_length = 120 - observations will go back 5 days (120 hours)
- delay = sampling_rate * (sequence_length + 24 - 1) - the target for a sequence will be the temperature 24 hours after the end of the sequence
"""
sampling_rate = 6
sequence_length = 120
delay = sampling_rate * (sequence_length + 24 - 1)
batch_size = 256
train_dataset = keras.utils.timeseries_dataset_from_array(
 raw_data[:-delay],
 targets=temperature[delay:],
 sampling_rate=sampling_rate,
 sequence_length=sequence_length,
 shuffle=True,
 batch_size=batch_size,
 start_index=0,
 end_index=num_train_samples)
val_dataset = keras.utils.timeseries_dataset_from_array(
 raw_data[:-delay],
 targets=temperature[delay:],
 sampling_rate=sampling_rate,
 sequence_length=sequence_length,
 shuffle=True,
 batch_size=batch_size,
 start_index=num_train_samples,
 end_index=num_train_samples + num_val_samples)
test_dataset = keras.utils.timeseries_dataset_from_array(
 raw_data[:-delay],
 targets=temperature[delay:],
 sampling_rate=sampling_rate,
 sequence_length=sequence_length,
 shuffle=True,
 batch_size=batch_size,
 start_index=num_train_samples + num_val_samples)
out[23]
# Inspecting the output of the datasets
for samples, targets in train_dataset:
  print("samples shape:", samples.shape)
  print("targets shape:", targets.shape)
  break
out[24]

samples shape: (256, 120, 14)
targets shape: (256,)

"""
Computing the common-sense baseline MAE
"""
def evaluate_naive_method(dataset):
  """
  Common sense baseline = assume temp 24 hours from now will be
  the same as what it is now
  """
  total_abs_err = 0.
  samples_seen = 0
  for samples, targets in dataset:
    preds = samples[:, -1, 1] * std[1] + mean[1]
    total_abs_err += np.sum(np.abs(preds - targets))
    samples_seen += samples.shape[0]
  return total_abs_err / samples_seen
print(f"Validation MAE: {evaluate_naive_method(val_dataset):.2f}")
print(f"Test MAE: {evaluate_naive_method(test_dataset):.2f}")
out[25]

Validation MAE: 2.44
Test MAE: 2.62

In the same way that it's useful to establish a common-sense baseline before trying machine learning approaches, it's useful to try simple, cheap machine learning models (such as small, densely connected networks) before looking into complicated and computationally expensive models such as RNNs - it's the best way to know whether further complexity is warranted.

# Training and evaluating a densely connected model
from tensorflow import keras
from tensorflow.keras import layers
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Flatten()(inputs)
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
 keras.callbacks.ModelCheckpoint("jena_dense.keras",
 save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset, epochs=10, validation_data=val_dataset, callbacks=callbacks)
# Reload the best model and evaluate it on the test data
model = keras.models.load_model("jena_dense.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
out[27]
# Plotting the results
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.title("Training and validation MAE")
plt.legend()
plt.show()
out[28]

Training and Validation MAE on the Jena Temperature-forecasting task with Densely Connected Network

[A] pretty significant limitation of machine learning in general: unless the learning algorithm is hardcoded to look for a specific kind of simple model, it can sometimes fail to find a simple solution to a simple problem. That's why leveraging good feature engineering and relevant architecture priors is essential: you need to precisely tell your model what it should be looking for.

You can use 1D convnets (Conv1D layers) to fit any sequence data that follows the translation invariance assumption (meaning that if you slide a window over the sequence, the content of the window should follow the same properties independently of the location of the window).

inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Conv1D(8, 24, activation="relu")(inputs)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(8, 12, activation="relu")(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(8, 6, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
 keras.callbacks.ModelCheckpoint("jena_conv.keras",
 save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
 epochs=10,
 validation_data=val_dataset,
 callbacks=callbacks)
model = keras.models.load_model("jena_conv.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
out[30]

Epoch 1/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 43s 49ms/step - loss: 34.8237 - mae: 4.4933 - val_loss: 19.5832 - val_mae: 3.4747
Epoch 2/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 37s 45ms/step - loss: 16.8202 - mae: 3.2654 - val_loss: 16.0923 - val_mae: 3.2077
Epoch 3/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 46s 55ms/step - loss: 15.2705 - mae: 3.1062 - val_loss: 16.5515 - val_mae: 3.2342
Epoch 4/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 81s 55ms/step - loss: 14.2592 - mae: 2.9956 - val_loss: 16.3722 - val_mae: 3.1609
Epoch 5/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 72s 42ms/step - loss: 13.3286 - mae: 2.8938 - val_loss: 15.1987 - val_mae: 3.0805
Epoch 6/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 41s 43ms/step - loss: 12.7599 - mae: 2.8304 - val_loss: 14.9143 - val_mae: 3.0331
Epoch 7/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 35s 43ms/step - loss: 12.3180 - mae: 2.7829 - val_loss: 14.6139 - val_mae: 2.9980
Epoch 8/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 44s 54ms/step - loss: 11.9375 - mae: 2.7372 - val_loss: 15.3663 - val_mae: 3.0559
Epoch 9/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 45s 55ms/step - loss: 11.6083 - mae: 2.6993 - val_loss: 14.9547 - val_mae: 3.0521
Epoch 10/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 72s 43ms/step - loss: 11.2888 - mae: 2.6635 - val_loss: 16.1435 - val_mae: 3.1176
405/405 ━━━━━━━━━━━━━━━━━━━━ 13s 31ms/step - loss: 15.6966 - mae: 3.1408
Test MAE: 3.13

import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.title("Training and validation MAE")
plt.legend()
plt.show()
out[31]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

The model performs worse than the densely connected one. Why?

  1. The weather data doesn't quite respect the translation invariance assumption. Weather data is only translation-invariant for a very specific timescale.
  2. Order in our data matters a lot. The recent past is more informative for predicting the next day's temperature than data from five days ago. A 1D convnet is not able to leverage this fact.

Try looking at the data as what it is: a sequence, where causality and order matter. There's a family of neural network architectures designed specifically for this use case: recurrent neural networks. Among them, the Long Short-Term Memory (LSTM) layer has long been very popular.

"""
A simple LSTM-based model
"""
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.LSTM(16)(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
 keras.callbacks.ModelCheckpoint("jena_lstm.keras",
 save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
 epochs=10,
 validation_data=val_dataset,
 callbacks=callbacks)
model = keras.models.load_model("jena_lstm.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
out[33]

Epoch 1/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 42s 49ms/step - loss: 80.0980 - mae: 6.8467 - val_loss: 13.1614 - val_mae: 2.7504
Epoch 2/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 40s 49ms/step - loss: 12.3676 - mae: 2.7158 - val_loss: 9.7612 - val_mae: 2.4137
Epoch 3/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 40s 48ms/step - loss: 9.8556 - mae: 2.4567 - val_loss: 9.8353 - val_mae: 2.4180
Epoch 4/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 41s 50ms/step - loss: 9.4562 - mae: 2.4026 - val_loss: 10.1129 - val_mae: 2.4531
Epoch 5/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 42s 50ms/step - loss: 9.1964 - mae: 2.3678 - val_loss: 9.8855 - val_mae: 2.4221
Epoch 6/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 40s 48ms/step - loss: 8.9176 - mae: 2.3321 - val_loss: 9.9856 - val_mae: 2.4488
Epoch 7/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 49s 60ms/step - loss: 8.6855 - mae: 2.3033 - val_loss: 9.8861 - val_mae: 2.4366
Epoch 8/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 39s 47ms/step - loss: 8.5118 - mae: 2.2795 - val_loss: 9.9469 - val_mae: 2.4389
Epoch 9/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 42s 48ms/step - loss: 8.3544 - mae: 2.2578 - val_loss: 10.0561 - val_mae: 2.4555
Epoch 10/10
819/819 ━━━━━━━━━━━━━━━━━━━━ 51s 60ms/step - loss: 8.2187 - mae: 2.2396 - val_loss: 10.3117 - val_mae: 2.4827
405/405 ━━━━━━━━━━━━━━━━━━━━ 14s 32ms/step - loss: 11.0684 - mae: 2.6063
Test MAE: 2.60

# Plot model
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.title("Training and validation MAE")
plt.legend()
plt.show()
out[34]

The plot below shows much better results than the first two approaches - and it can beat the baseline, demonstrating the value of ML on this task.

Understanding Recurrent Neural Networks

A major characteristic of all neural networks you've seen so far, such as densely connected networks and convnets, is that they have no memory. Each input shown to them is processed independently, with no state kept between inputs. With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once: turn it into a single data point. For instance, this is what we did in the densely connected network example: we flattened our five days of data into a single large vector and processed it in one go. Such networks are called feedforward networks.

A recurrent neural network (RNN) processes sequences by iterating through the sequence elements and maintaining a state that contains information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop.

Recurrent Neural Network

The state of the RNN is reset between processing two different, independent sequences (such as two samples in a batch), so you still consider one sequence to be a single data point: a single input to the network. This data point is no longer processed in a single step, but the network loops over the sequence elements.

"""
Pseudocode RNN
"""
state_t = 0 # The state at t
for input_t in input_sequence: # Iterate over sequence elements
  output_t = f(input_t, state_t)
  state_t = output_t # The previous output becomes the state for the next iteration
"""
More detailed pseudocode for the RNN
"""
state_t = 0
for input_t in input_sequence:
  output_t = activation(dot(W, input_t) + dot(U, state_t) + b)
  state_t = output_t
out[36]
"""
NumPy implementation of a simple RNN
"""
import numpy as np
timesteps = 100 # Number of timesteps in the input sequence
input_features = 32 # Dimensionality of the input feature space
output_features = 64 # Dimensionality of the output feature space
inputs = np.random.random((timesteps, input_features)) # Input data: random noise for the sake of the example
state_t = np.zeros((output_features,)) # Initial state: an all-zero vector
"""
Create random weight matrices
"""
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))
successive_outputs = []
for input_t in inputs: # input_t is a vector of shape (input_features,)
  # Combines the input with the current state (the previous output) to obtain
  # the current output. We use `tanh` to add non-linearity (could use
  # other activation function)
  output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
  # Store this output in a list
  successive_outputs.append(output_t)
  # Update the state of the network for the next timestep
  state_t = output_t
# The final output is a rank-2 tensor of shape (timesteps, output_features)
final_output_sequence = np.stack(successive_outputs, axis=0)
out[37]

In summary, an RNN is a for loop that reuses quantities computed during the previous iteration of the loop, nothing more. RNNs are characterized by their step function.

Simple RNN, Unrolled over Time

Like all other Keras layers, the SimpleRNN layer processes batches of sequences rather than a single sequence as in the NumPy example above: it takes inputs of shape (batch_size, timesteps, input_features).

# An RNN layer that can process sequences of any length
num_features = 14
# Setting the timesteps entry to None in the shape argument
# enables the network to process sequences of arbitrary length.
# If the model is meant to process sequences of the same length,
# it is recommended to specify the length.
inputs = keras.Input(shape=(None, num_features))
outputs = layers.SimpleRNN(16)(inputs)
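
As a quick sanity check (not part of the original example), you can wrap these tensors in a model, call it on a random batch, and confirm that you get one 16-dimensional output per sequence:

"""
Sketch: verifying the SimpleRNN output shape on a random batch (illustrative only)
"""
import numpy as np
# Hypothetical batch: 8 sequences of 120 timesteps with num_features features each
dummy_batch = np.random.random((8, 120, num_features)).astype("float32")
model = keras.Model(inputs, outputs)
print(model(dummy_batch).shape)  # (8, 16): the last output of each sequence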

All recurrent layers in Keras (SimpleRNN, LSTM, and GRU) can be run in two different modes: they can either return the full sequence of successive outputs for each timestep (a rank-3 tensor of shape (batch_size, timesteps, output_features)) or return only the last output for each input sequence (a rank-2 tensor of shape (batch_size, output_features)). These modes are controlled by the return_sequences argument.

"""
A RNN layer that returns only its last output step
"""
num_features = 14
steps = 120
inputs = keras.Input(shape=(steps, num_features))
# Note that return_sequences=False
outputs = layers.SimpleRNN(16, return_sequences=False)(inputs)
print(outputs.shape)
out[39]

(None, 16)

"""
An RNN layer that returns its full output sequence
"""
num_features = 14
steps = 120
inputs = keras.Input(shape=(steps, num_features))
# Note that return_sequences=True
outputs = layers.SimpleRNN(16, return_sequences=True)(inputs)
print(outputs.shape)
out[40]

(None, 120, 16)

"""
Stacking RNN Layers

It's sometimes useful to stack several recurrent layers one after the
other in order to increase the representational power of a network. In
such a setup, you have to get all of the intermediate layers
to return a full sequence of outputs
"""
inputs = keras.Input(shape=(steps, num_features))
x = layers.SimpleRNN(16, return_sequences=True)(inputs)
x = layers.SimpleRNN(16, return_sequences=True)(x)
outputs = layers.SimpleRNN(16)(x)
out[41]

In practice you'll rarely work with the SimpleRNN layer. Although it should theoretically be able to retain at time t information about inputs seen many timesteps before, such long-term dependencies prove impossible to learn in practice. This is due to the vanishing gradients problem, an effect similar to what is observed with non-recurrent networks that are many layers deep: as you keep adding layers to a network, the network eventually becomes untrainable.

The LSTM (Long Short-Term Memory) algorithm was developed to address this vanishing gradients problem. The LSTM layer is like the SimpleRNN layer, but it adds a way to carry information across many timesteps. The LSTM saves information for later, preventing older signals from gradually vanishing during processing - this should remind you of residual connections.

Starting Point of LSTM Layer: A SimpleRNN

We add to the picture above an additional data flow that carries information across timesteps. Call its value at timestep t c_t, where c stands for carry.

This information will have the following impact on the cell: it will be combined with the input connection and the recurrent connection (via a dense transformation: a dot product with a weight matrix followed by a bias add and the application of an activation function), and it will affect the state being sent to the next timestep (via an activation function and a multiplication operation). Conceptually, the carry dataflow is a way to modulate the next output and the next state.

Going from a Simple RNN to an LSTM: adding a carry track

"""
Pseudocode details of the LSTM architecture (1/2)
"""
output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)
i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)

"""
Pseudocode of the LSTM architecture (2/2)
"""
c_t+1 = i_t * k_t + c_t * f_t

out[43]

Anatomy of an LSTM

Don't worry too much about the internals of the LSTM cell; just keep in mind what it is meant to do: allow past information to be reinjected at a later time, thus fighting the vanishing-gradient problem.
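
To make the pseudocode concrete, here is a minimal NumPy sketch of an LSTM-style step, in the same spirit as the simple RNN implementation earlier. The use of sigmoid for the gates and the specific weight names are illustrative assumptions, not the exact Keras implementation.

"""
Illustrative NumPy sketch of the LSTM step (assumes sigmoid gates and a tanh output)
"""
import numpy as np

input_features, output_features = 32, 64
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_weights():
    # One (W, U, b) triple per transform, mirroring the pseudocode above
    return (rng.random((output_features, input_features)),
            rng.random((output_features, output_features)),
            rng.random((output_features,)))

Wi, Ui, bi = make_weights()  # input gate
Wf, Uf, bf = make_weights()  # forget gate
Wk, Uk, bk = make_weights()  # candidate values
Wo, Uo, bo = make_weights()  # output transform
Vo = rng.random((output_features, output_features))  # carry's contribution to the output

def lstm_step(input_t, state_t, c_t):
    i_t = sigmoid(np.dot(Wi, input_t) + np.dot(Ui, state_t) + bi)
    f_t = sigmoid(np.dot(Wf, input_t) + np.dot(Uf, state_t) + bf)
    k_t = np.tanh(np.dot(Wk, input_t) + np.dot(Uk, state_t) + bk)
    c_next = i_t * k_t + c_t * f_t  # update the carry track
    output_t = np.tanh(np.dot(Wo, input_t) + np.dot(Uo, state_t) + np.dot(Vo, c_t) + bo)
    return output_t, c_next

state_t = np.zeros((output_features,))
c_t = np.zeros((output_features,))
for input_t in rng.random((100, input_features)):  # 100 timesteps of random inputs
    state_t, c_t = lstm_step(input_t, state_t, c_t)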

Advanced use of recurrent neural networks

  • Recurrent dropout: a variant of dropout, used to fight overfitting in recurrent layers
  • Stacking recurrent layers: increases the representational power of the model (at the cost of higher computational loads)
  • Bidirectional recurrent layers: present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues

Recurrent Dropout

The same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of using a dropout mask that varies randomly from timestep to timestep. In order to regularize the representations formed by the recurrent gates of layers such as GRU and LSTM, a temporally constant dropout mask should be applied to the inner recurrent activations of the layer (a recurrent dropout mask). Using the same dropout mask at every timestep allows the network to properly propagate its learning error through time.

Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate for the input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units.

"""
Training and evaluating a dropout-regularized LSTM
"""
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.LSTM(32, recurrent_dropout=0.25)(inputs)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
 keras.callbacks.ModelCheckpoint("jena_lstm_dropout.keras",
 save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
 epochs=50,
 validation_data=val_dataset,
 callbacks=callbacks)

import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.legend()
plt.show()
out[45]

Epoch 1/50
819/819 ━━━━━━━━━━━━━━━━━━━━ 186s 224ms/step - loss: 52.3167 - mae: 5.3374 - val_loss: 9.7327 - val_mae: 2.4323
Epoch 2/50
819/819 ━━━━━━━━━━━━━━━━━━━━ 201s 224ms/step - loss: 15.3491 - mae: 3.0531 - val_loss: 9.2365 - val_mae: 2.3752
Epoch 3/50
819/819 ━━━━━━━━━━━━━━━━━━━━ 183s 223ms/step - loss: 14.6293 - mae: 2.9746 - val_loss: 8.9734 - val_mae: 2.3366
Epoch 4/50
819/819 ━━━━━━━━━━━━━━━━━━━━ 201s 222ms/step - loss: 14.0800 - mae: 2.9125 - val_loss: 8.8099 - val_mae: 2.3071
Epoch 5/50
819/819 ━━━━━━━━━━━━━━━━━━━━ 184s 224ms/step - loss: 13.7247 - mae: 2.8793 - val_loss: 8.8446 - val_mae: 2.3114
Epoch 6/50
270/819 ━━━━━━━━━━━━━━━━━━━━ 1:42 186ms/step - loss: 13.5038 - mae: 2.8586

The training curves show that adding regularization prevents overfitting through roughly the first 20 epochs. Larger RNNs greatly benefit from a GPU runtime, but recurrent models with very few parameters can be significantly faster on a multicore CPU than on a GPU.
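
A practical note based on the Keras RNN API rather than on anything shown above: when recurrent_dropout is used, the fast cuDNN kernels are generally unavailable, and unrolling the inner loop can speed the layer up on a GPU at the cost of more memory. Unrolling only works when the sequence length is known in advance, so a sketch would look like this:

# Sketch (assumption, not from the original run): unrolling an LSTM that uses
# recurrent dropout. Unrolling requires a fixed, known sequence length.
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.LSTM(32, recurrent_dropout=0.25, unroll=True)(inputs)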

Stacking Recurrent Layers

Recurrent layer stacking is a classic way to build more powerful recurrent networks. To stack recurrent layers on top of each other in Keras, all intermediate layers should return their full sequence of outputs (a rank-3 tensor) rather than their output at the last timestep.

"""
Training and evaluating a dropout-regularized, stacked GRU model

In this example, we stack two dropout-regularized recurrent layers
We use GRU instead of LSTM - GRU can be thought of as a slightly
simpler, streamlined version of LSTM
"""
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.GRU(32, recurrent_dropout=0.5, return_sequences=True)(inputs)
x = layers.GRU(32, recurrent_dropout=0.5)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
callbacks = [
 keras.callbacks.ModelCheckpoint("jena_stacked_gru_dropout.keras",
 save_best_only=True)
]
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
 epochs=50,
 validation_data=val_dataset,
 callbacks=callbacks)
model = keras.models.load_model("jena_stacked_gru_dropout.keras")
print(f"Test MAE: {model.evaluate(test_dataset)[1]:.2f}")
import matplotlib.pyplot as plt
loss = history.history["mae"]
val_loss = history.history["val_mae"]
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, "bo", label="Training MAE")
plt.plot(epochs, val_loss, "b", label="Validation MAE")
plt.legend()
plt.show()
out[47]

Bidirectional Recurrent Layers

A bidirectional RNN is a common RNN variant that can offer better performance than a regular RNN on certain tasks. It's frequently used in natural language processing - you could call it the Swiss Army knife of deep learning for natural language processing.

A bidirectional RNN exploits the order sensitivity of RNNs: it uses two regular RNNs, such as GRU or LSTM layers, each of which processes the input sequence in one direction (chronologically and antichronologically), and then merges their representations.

The order of data points isn't as important in natural language processing: how much a word contributes to understanding a sentence doesn't usually depend on its position in the sentence. On text data, reversed-order processing works just as well as chronological processing: word order matters in NLP, but which order you use isn't crucial.
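
As an illustration (the dataset format here is an assumption, not something shown above), you could test how much ordering matters by flipping the time axis of each batch in a tf.data pipeline of (samples, targets) pairs:

# Sketch: feed the same data antichronologically by reversing the time axis.
# Assumes each batch of samples has shape (batch_size, timesteps, features).
reversed_train_dataset = train_dataset.map(
    lambda samples, targets: (samples[:, ::-1, :], targets))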

A bidirectional RNN looks at a sequence both ways in order to improve on the performance of chronological-order RNNs.

How a Bidirectional RNN Layer Works

To instantiate a bidirectional RNN in Keras, you use the Bidirectional layer, which takes as its first argument a recurrent layer instance.

# Training and evaluating a bidirectional LSTM

inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Bidirectional(layers.LSTM(16))(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
 epochs=10,
 validation_data=val_dataset)

Always remember that all trading is fundamentally information arbitrage: gaining an advantage by leveraging data or insights that other market participants are missing. Trying to use well-known machine learning techniques and publicly available data to beat the markets is effectively a dead end, since you won't have any information advantage compared to everyone else.