Deep Learning with Python - Chapters 12 and 13

"Generative Deep Learning" and "Best Practices for the Real World" go over text generation and image generation algorithms and best practices for Deep Learning in the real world.

Generative Deep Learning

Text Generation

This section explores how neural networks can be used to generate sequence data. The universal way to generate sequence data in deep learning is to train a model (usually a Transformer or an RNN) to predict the next token or next few tokens in a sequence, using the previous tokens as input. When working with text data, tokens are typically words or characters, and any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language: its statistical structure.

Once you have a trained language model, you can sample from it (generate new sequences): you feed it an initial string of text (called conditioning data), ask it to generate the next character or next word, add the generated output back to the input data, and repeat the process many times. This loop allows you to generate sequences of arbitrary length that reflect the structure of the data on which the model was trained: sequences that look almost like human-written sentences.
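
A minimal sketch of that loop (language_model, tokenize, and detokenize are hypothetical stand-ins for a trained model and its preprocessing, not code from the book):

import numpy as np

def generate_text(language_model, tokenize, detokenize, prompt, num_words):
  # Start from the conditioning data
  sentence = prompt
  for _ in range(num_words):
    # Feed everything generated so far back into the model
    token_ids = tokenize(sentence)
    next_word_probabilities = language_model(token_ids)
    # Sample the next word from the predicted distribution
    next_id = np.random.choice(len(next_word_probabilities), p=next_word_probabilities)
    # Append it to the running sequence and repeat
    sentence += " " + detokenize(next_id)
  return sentence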

The Process of Text Generation with a Language Model

The importance of sampling strategy

When generating text, the way you choose the next token is critically important. A naive approach is greedy sampling - always choosing the most likely next character (this doesn't work well). Stochastic sampling introduces randomness in the sampling process by sampling from the probability distribution of the next character. There's one issue with this strategy: it doesn't offer a way to control the amount of randomness in the sampling process.

When sampling from generative models, it's always good to explore different amounts of randomness in the generation process. In order to control the amount of stochasticity in the sampling process, we'll introduce a parameter called the softmax temperature, which characterizes the entropy of the probability distribution used for sampling: it characterizes how surprising or predictable the choice of the next word will be.

Different Reweightings of a Probability Distribution

"""
Reweighting a probability distribution to a different temperature
"""
import numpy as np
def reweight_distribution(original_distribution, temperature=0.5):
  """
  original_distribution is a 1D NumPy array of probability values that must sum
  to 1. temperature is a factor quantifying the entropy of the output
  distribution
  """
  distribution = np.log(original_distribution) / temperature
  distribution = np.exp(distribution)
  # Returns a reweighted version of the original distribution
  # The sum of the distribution may no longer be 1
  # so you divide it by its sum to obtain the new distribution
  return distribution / np.sum(distribution)
out[2]
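
As a quick check of the function above (using a made-up four-token distribution), lower temperatures concentrate the distribution while higher temperatures flatten it:

original = np.array([0.5, 0.25, 0.15, 0.10])
print(reweight_distribution(original, temperature=0.1))  # Sharper: almost all mass on the most likely token
print(reweight_distribution(original, temperature=1.0))  # Unchanged: identical to the original distribution
print(reweight_distribution(original, temperature=2.0))  # Flatter: closer to uniform, so samples are more surprising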

Implementing Text Generation with Keras

We'll be generating new movie reviews.

# Load the imdb data
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
out[4]

--2024-09-09 05:41:37-- https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’

aclImdb_v1.tar.gz 100%[===================>] 80.23M 36.0MB/s in 2.2s

2024-09-09 05:41:39 (36.0 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

"""
Creating a dataset from text files (one file = one sample)
"""
import tensorflow as tf
from tensorflow import keras
dataset = keras.utils.text_dataset_from_directory(directory="aclImdb", label_mode=None, batch_size=256)
dataset = dataset.map(lambda x: tf.strings.regex_replace(x, "<br />", " "))
"""
Preparing a TextVectorization layer
"""
from tensorflow.keras.layers import TextVectorization
sequence_length = 100
# Consider the top 15,000 most common words - everything else
# will be treated as out-of-vocabulary
vocab_size = 15000
text_vectorization = TextVectorization(
 max_tokens=vocab_size,
 # We want to return integer word index sequences
 output_mode="int",
 # We'll work with inputs and targets of length 100
 output_sequence_length=sequence_length,
)
text_vectorization.adapt(dataset)
"""
Setting up language modeling dataset
"""
def prepare_lm_dataset(text_batch):
  # Convert a batch of texts (strings) to a batch of
  # integer sequences
  vectorized_sequences = text_vectorization(text_batch)
  # Create inputs by cutting off the last word of the sequences
  x = vectorized_sequences[:, :-1]
  # Create targets by offsetting the sequences by 1
  y = vectorized_sequences[:, 1:]
  return x, y

lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)
out[5]

Found 100006 files.

We'll train a model to predict a probability distribution over the next word in a sentence, given a number of initial words. When the model is trained, we'll feed it with a prompt, sample the next word, add that word back to the prompt, and repeat, until we've generated a short paragraph.

Comparing Next Word vs Sequence-to-Sequence Modeling

"""
Implementing positional embedding as a subclassed layer
"""
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Layer


class PositionalEmbedding(Layer):
  """
  A downside of position embeddings is that the sequence length
  needs to be known in advance
  """
  def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
    super().__init__(**kwargs)
    # Prepare an Embedding layer for the token indices
    self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
    # Add another one for the token positions
    self.position_embeddings = layers.Embedding(
    input_dim=sequence_length, output_dim=output_dim)
    self.sequence_length = sequence_length
    self.input_dim = input_dim
    self.output_dim = output_dim
  def call(self, inputs):
    length = tf.shape(inputs)[-1]
    positions = tf.range(start=0, limit=length, delta=1)
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = self.position_embeddings(positions)
    # Add both embedding vectors together
    return embedded_tokens + embedded_positions
  def compute_mask(self, inputs, mask=None):
    """
    Like the Embedding Layer, this layer should be able to generate a
    mask so we can ignore padding 0s in the inputs. The compute_mask
    method will be called automatically by the framework, and the mask
    will get propagated to the next layer
    """
    return tf.math.not_equal(inputs, 0)
  def get_config(self):
    """
    Implement serialization so that we can save the model
    """
    config = super().get_config()
    config.update({
      "output_dim": self.output_dim,
      "sequence_length": self.sequence_length,
      "input_dim": self.input_dim,
    })
    return config
"""
The Transformer Decoder
"""
class TransformerDecoder(Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention_1 = layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embed_dim)
    self.attention_2 = layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embed_dim)
    self.dense_proj = keras.Sequential(
      [layers.Dense(dense_dim, activation="relu"),
      layers.Dense(embed_dim),]
    )
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()
    self.layernorm_3 = layers.LayerNormalization()
    # This attribute ensures that the layer will propagate its input mask
    # to its outputs; masking in Keras is explicitly opt-in.
    self.supports_masking = True
  def get_config(self):
    config = super().get_config()
    config.update({
      "embed_dim": self.embed_dim,
      "num_heads": self.num_heads,
      "dense_dim": self.dense_dim,
    })
    return config
  def get_causal_attention_mask(self, inputs):
    """
    TransformerDecoder method that generates a causal mask
    """
    input_shape = tf.shape(inputs)
    batch_size, sequence_length = input_shape[0], input_shape[1]
    i = tf.range(sequence_length)[:, tf.newaxis]
    j = tf.range(sequence_length)
    # Generate a matrix of shape (sequence_length, sequence_length)
    # with 1s in one half and 0s in the other
    mask = tf.cast(i >= j, dtype="int32")
    """
    Replicate it along the batch axis to get a matrix of shape (batch_size, sequence_length, sequence_length)
    """
    mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
    mult = tf.concat(
      [tf.expand_dims(batch_size, -1),
      tf.constant([1, 1], dtype=tf.int32)], axis=0)
    return tf.tile(mask, mult)
  def call(self, inputs, encoder_outputs, mask=None):
    """
    The forward pass of the TransformerDecoder
    """
    # Retrieve the causal mask
    causal_mask = self.get_causal_attention_mask(inputs)
    if mask is not None:
      # Prepare the input mask (which describes padding locations
      # in the target sequence)
      padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
      # Merge the two masks together
      padding_mask = tf.minimum(padding_mask, causal_mask)
    else:
      # If no padding mask is provided, fall back to the causal mask alone
      padding_mask = causal_mask
    attention_output_1 = self.attention_1(
      query=inputs,
      value=inputs,
      key=inputs,
      # Pass the causal mask to the first attention layer, which performs
      # self-attention over the target sequence
      attention_mask=causal_mask
    )
    attention_output_1 = self.layernorm_1(inputs + attention_output_1)
    attention_output_2 = self.attention_2(
      query=attention_output_1,
      value=encoder_outputs,
      key=encoder_outputs,
      # Pass the combined mask to the second attention layer, which
      # relates the source sequence to the target sequence
      attention_mask=padding_mask,
    )
    attention_output_2 = self.layernorm_2(
    attention_output_1 + attention_output_2)
    proj_output = self.dense_proj(attention_output_2)
    return self.layernorm_3(attention_output_2 + proj_output)
out[7]
"""
A simple Transformer-based language model
"""

from tensorflow.keras import layers
embed_dim = 256
latent_dim = 2048
num_heads = 2
inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, x)
# Softmax over possible vocabulary words, computed for each output
# sequence timestep
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")
out[8]
"""
The text-generation callback
"""
import numpy as np

# Dict that maps word indices back to strings, to be used for
# text decoding
tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))

def sample_next(predictions, temperature=1.0):
  """
  Implements variable-temperature sampling from a probability
  distribution
  """
  predictions = np.asarray(predictions).astype("float64")
  predictions = np.log(predictions) / temperature
  exp_preds = np.exp(predictions)
  predictions = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, predictions, 1)
  return np.argmax(probas)


class TextGenerator(keras.callbacks.Callback):
  def __init__(self,
      prompt, # Prompt that we use to seed text generation
      generate_length, # How many words to generate
      model_input_length,
      temperatures=(1.,), # Range of temperatures to use for sampling
      print_freq=1
    ):
    self.prompt = prompt
    self.generate_length = generate_length
    self.model_input_length = model_input_length
    self.temperatures = temperatures
    self.print_freq = print_freq
  def on_epoch_end(self, epoch, logs=None):
    if (epoch + 1) % self.print_freq != 0:
      return
    for temperature in self.temperatures:
      print("== Generating with temperature", temperature)
      # When generating text, we start from our prompt
      sentence = self.prompt
      for i in range(self.generate_length):
        # Feed the current sequence into our model
        tokenized_sentence = text_vectorization([sentence])
        predictions = self.model(tokenized_sentence)
        # Retrieve the predictions for the latest timestep and use them to sample
        # a new word at the current temperature
        next_token = sample_next(predictions[0, i, :], temperature)
        sampled_token = tokens_index[next_token]
        # Append the word to the current sequence and repeat
        sentence += " " + sampled_token
      print(sentence)

prompt = "This movie"
text_gen_callback = TextGenerator(
 prompt,
 generate_length=50,
 model_input_length=sequence_length,
 temperatures=(0.2, 0.5, 0.7, 1., 1.5)) # We'll use a diverse range of
 # temperatures to sample text, to demonstrate the effect of temperature on
 # text generation
"""
Fitting the language model
"""
model.fit(lm_dataset, epochs=200, callbacks=[text_gen_callback])
out[9]

Always experiment with multiple sampling strategies (temperatures). A clever balance between learned structure and randomness is what makes generation interesting. Language models are all form and no substance.

DeepDream

DeepDream is an artistic image-modification technique that uses the representations learned by convolutional neural networks. The DeepDream algorithm is almost identical to the convnet filter-visualization technique introduced in chapter 9, consisting of running a convnet in reverse: doing gradient ascent on the input to the convnet in order to maximize the activation of a specific filter in an upper layer of the convnet.

"""
Fetching the test image
"""
from tensorflow import keras
import matplotlib.pyplot as plt
base_image_path = keras.utils.get_file(
 "coast.jpg", origin="https://img-datasets.s3.amazonaws.com/coast.jpg")
plt.axis("off")
plt.imshow(keras.utils.load_img(base_image_path))
out[11]

Downloading data from https://img-datasets.s3.amazonaws.com/coast.jpg
440742/440742 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step

The Coast Test Image

"""
Instantiating a pretrained InceptionV3 model
"""
from tensorflow.keras.applications import inception_v3
model = inception_v3.InceptionV3(weights="imagenet", include_top=False)
out[12]

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
87910968/87910968 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step

"""
Configuring the contribution of each layer to the DeepDream loss
"""
# Layers for which we try to maximize activation, as well as their weight in the total loss. You can tweak these settings to obtain new visual effects
layer_settings = {
 "mixed4": 1.0,
 "mixed5": 1.5,
 "mixed6": 2.0,
 "mixed7": 2.5,
}
# Symbolic outputs of each layer
outputs_dict = dict(
 [
 (layer.name, layer.output)
 for layer in [model.get_layer(name)
 for name in layer_settings.keys()]
 ]
)
# Model that returns the activation values for every target layer (as a dict)
feature_extractor = keras.Model(inputs=model.inputs, outputs=outputs_dict)
out[13]
"""
The DeepDream loss
"""
def compute_loss(input_image):
  # Extract activations
  features = feature_extractor(input_image)
  # Initialize the loss to 0
  loss = tf.zeros(shape=())
  for name in features.keys():
    coeff = layer_settings[name]
    activation = features[name]
    # We avoid border artifacts by only involving non-border pixels in
    # the loss
    loss += coeff * tf.reduce_mean(tf.square(activation[:, 2:-2, 2:-2, :]))
  return loss
out[14]
"""
The DeepDream gradient ascent process
"""
import tensorflow as tf

@tf.function # Make the gradient ascent step fast by compiling it as a tf.function
def gradient_ascent_step(image, learning_rate):
  with tf.GradientTape() as tape:
    """
    Compute gradients of DeepDream loss with respect to the current image
    """
    tape.watch(image)
    loss = compute_loss(image)
  grads = tape.gradient(loss, image)
  # Normalize gradients (the same trick we used in Chapter 9)
  grads = tf.math.l2_normalize(grads)
  image += learning_rate * grads
  return loss, image


def gradient_ascent_loop(image, iterations, learning_rate, max_loss=None):
  """
  This runs gradient ascent for a given image scale (octave)
  """
  for i in range(iterations):
    """
    Repeatedly update the image in a way that increases the DeepDream loss
    """
    loss, image = gradient_ascent_step(image, learning_rate)
    if max_loss is not None and loss > max_loss:
      """
      Break out if the loss crosses a certain threshold
      (over-optimizing would create unwanted image artifacts)
      """
      break
    print(f"... Loss value at step {i}: {loss:.2f}")
  return image
out[15]
"""
Image Processing utilities
"""
step = 20. # Gradient ascent step size
num_octave = 3 # Number of scales at which to run gradient ascent
octave_scale = 1.4 # Size ratio between successive scales
iterations = 30 # Number of gradient ascent steps per scale
max_loss = 15. # We'll stop the gradient ascent process for a scale if the loss gets higher than this

import numpy as np
def preprocess_image(image_path):
  """
  Util function to open, resize, and format pictures into appropriate arrays
  """
  img = keras.utils.load_img(image_path)
  img = keras.utils.img_to_array(img)
  img = np.expand_dims(img, axis=0)
  img = keras.applications.inception_v3.preprocess_input(img)
  return img
def deprocess_image(img):
  """
  Util function to convert a NumPy array
  into a valid image
  """
  img = img.reshape((img.shape[1], img.shape[2], 3))
  img /= 2.0
  img += 0.5
  img *= 255.
  # Clip to the valid range [0, 255] and convert to uint8
  img = np.clip(img, 0, 255).astype("uint8")
  return img
out[16]
"""
Running Gradient Ascent over multiple successive octaves
"""
# Load the test image
original_img = preprocess_image(base_image_path)
original_shape = original_img.shape[1:3]

successive_shapes = [original_shape]
"""
Compute the target shape of the image at different octaves
"""
for i in range(1, num_octave):
    shape = tuple([int(dim / (octave_scale ** i)) for dim in original_shape])
    successive_shapes.append(shape)
successive_shapes = successive_shapes[::-1]

shrunk_original_img = tf.image.resize(original_img, successive_shapes[0])

# Make a copy of the image (we need to keep the original around)
img = tf.identity(original_img)
"""
Iterate over the different octaves
"""
for i, shape in enumerate(successive_shapes):
    print(f"Processing octave {i} with shape {shape}")
    # Scale up the dream image
    img = tf.image.resize(img, shape)
    # Run the gradient ascent, altering the dream
    img = gradient_ascent_loop(
        img, iterations=iterations, learning_rate=step, max_loss=max_loss
    )
    # Scale up the smaller version of the original image: it will be
    # pixellated
    upscaled_shrunk_original_img = tf.image.resize(shrunk_original_img, shape)
    # Compute the high-quality version of the original image at this size
    same_size_original = tf.image.resize(original_img, shape)
    # The difference between the two is the detail that was lost
    # when scaling up
    lost_detail = same_size_original - upscaled_shrunk_original_img
    # Reinject lost detail into the dream
    img += lost_detail
    shrunk_original_img = tf.image.resize(original_img, shape)

# Save the final result
keras.utils.save_img("dream.png", deprocess_image(img.numpy()))
out[17]

Running DeepDream on the Test Image

Neural Style Transfer

In addition to DeepDream, another major development in deep-learning-driven image modification is neural style transfer. It consists of applying the style of a reference image to a target image while conserving the content of the target image. In this context, style essentially means textures, colors, and visual patterns in the image, at various spatial scales, and the content is the higher-level macrostructure of the image.

Neural Style Transfer Example

The key notion behind implementing style transfer is the same idea that's central to all deep learning algorithms: you define a loss function to specify what you want to achieve, and you minimize this loss.

Neural Style Transfer Loss Function pseudocode:

loss = (distance(style(reference_image) - style(combination_image)) + distance(content(original_image) - content(combination_image)))

  • A good candidate for the content loss is the L2 norm between the activations of an upper layer in a pretrained convnet, computed over the target image, and the activations of the same layer computed over the generated image.
  • For the style loss, the neural style transfer paper authors use the Gram matrix of a layer's activations: the inner product of the feature maps of a given layer.

You can use a pretrained convnet to define a loss that will do the following:

  • Preserve content by maintaining similar high-level layer activations between the original image and the generated image. The convnet should "see" both the original image and the generated image as containing the same things.
  • Preserve style by maintaining similar correlations within activations for both low-level and high-level layers. Feature correlations capture textures: the generated image and the style reference image should share the same textures at different spatial scales.
"""
Getting the style and content images
"""
from tensorflow import keras
# Path to the image we want to transform
base_image_path = keras.utils.get_file("sf.jpg", origin="https://img-datasets.s3.amazonaws.com/sf.jpg")
# Path to the style image
style_reference_image_path = keras.utils.get_file("starry_night.jpg", origin="https://img-datasets.s3.amazonaws.com/starry_night.jpg")
# Dimensions of the generated picture
original_width, original_height = keras.utils.load_img(base_image_path).size
img_height = 400
img_width = round(original_width * img_height / original_height)
out[19]

Downloading data from https://img-datasets.s3.amazonaws.com/sf.jpg
575046/575046 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Downloading data from https://img-datasets.s3.amazonaws.com/starry_night.jpg
943128/943128 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step

"""
Auxiliary Functions
"""
import numpy as np
def preprocess_image(image_path):
  """
  Util function to open, resize, and format pictures
  into appropriate arrays
  """
  img = keras.utils.load_img(
  image_path, target_size=(img_height, img_width))
  img = keras.utils.img_to_array(img)
  img = np.expand_dims(img, axis=0)
  img = keras.applications.vgg19.preprocess_input(img)
  return img
def deprocess_image(img):
  """
  Util function to convert a NumPy array into a valid image
  """
  img = img.reshape((img_height, img_width, 3))
  """
  Undo the zero-centering by adding back ImageNet's mean pixel value.
  This reverses a transformation done by vgg19.preprocess_input
  """
  img[:, :, 0] += 103.939
  img[:, :, 1] += 116.779
  img[:, :, 2] += 123.68
  """
  Convert the image from 'BGR' to 'RGB'.
  This is also part of the reversal of vgg19.preprocess_input
  """
  img = img[:, :, ::-1]
  img = np.clip(img, 0, 255).astype("uint8")
  return img
out[20]
"""
Using a pretrained VGG19 model to create a feature extractor
"""
# Build a VGG19 model loaded with pretrained ImageNet weights
model = keras.applications.vgg19.VGG19(weights="imagenet", include_top=False)
outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])
# Model that returns the activation values for every target layer (as a dict)
feature_extractor = keras.Model(inputs=model.inputs, outputs=outputs_dict)
out[21]

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg19/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5
80134624/80134624 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step

"""
Content loss
"""
def content_loss(base_img, combination_img):
  return tf.reduce_sum(tf.square(combination_img - base_img))
out[22]
"""
Style Loss
"""
def gram_matrix(x):
  x = tf.transpose(x, (2, 0, 1))
  features = tf.reshape(x, (tf.shape(x)[0], -1))
  gram = tf.matmul(features, tf.transpose(features))
  return gram
def style_loss(style_img, combination_img):
  S = gram_matrix(style_img)
  C = gram_matrix(combination_img)
  channels = 3
  size = img_height * img_width
  return tf.reduce_sum(tf.square(S - C)) / (4.0 * (channels ** 2) * (size ** 2))
out[23]
"""
Total variation loss
"""
def total_variation_loss(x):
  a = tf.square(x[:, : img_height - 1, : img_width - 1, :] - x[:, 1:, : img_width - 1, :])
  b = tf.square(x[:, : img_height - 1, : img_width - 1, :] - x[:, : img_height - 1, 1:, :])
  return tf.reduce_sum(tf.pow(a + b, 1.25))
out[24]
"""
Defining the final loss that you'll minimize
"""
# List of layers to use for the style loss
style_layer_names = [
 "block1_conv1",
 "block2_conv1",
 "block3_conv1",
 "block4_conv1",
 "block5_conv1",
]
# The layer to use for the content loss
content_layer_name = "block5_conv2"
# Contribution weight of the total variation loss
total_variation_weight = 1e-6
# Contribution weight of the style loss
style_weight = 1e-6
# Contribution weight of the content loss
content_weight = 2.5e-8

def compute_loss(combination_image, base_image, style_reference_image):
  input_tensor = tf.concat([base_image, style_reference_image, combination_image], axis=0)
  features = feature_extractor(input_tensor)
  # Initialize the loss to 0
  loss = tf.zeros(shape=())
  # Add the content loss
  layer_features = features[content_layer_name]
  base_image_features = layer_features[0, :, :, :]
  combination_features = layer_features[2, :, :, :]
  loss = loss + content_weight * content_loss(base_image_features, combination_features)
  for layer_name in style_layer_names:
    # Add the style loss
    layer_features = features[layer_name]
    style_reference_features = layer_features[1, :, :, :]
    combination_features = layer_features[2, :, :, :]
    style_loss_value = style_loss(
    style_reference_features, combination_features)
    loss += (style_weight / len(style_layer_names)) * style_loss_value
  loss += total_variation_weight * total_variation_loss(combination_image)
  return loss
out[25]
"""
Setting up the gradient-descent process
"""

import tensorflow as tf
@tf.function # Make the training step fast by compiling as tf.function
def compute_loss_and_grads(combination_image, base_image, style_reference_image):
  """

  """
  with tf.GradientTape() as tape:
    loss = compute_loss(combination_image, base_image, style_reference_image)
  grads = tape.gradient(loss, combination_image)
  return loss, grads

optimizer = keras.optimizers.SGD(
    # We'll start with a learning rate of 100 and decrease it by 4%
    # every 100 steps
 keras.optimizers.schedules.ExponentialDecay(
     initial_learning_rate=100.0, decay_steps=100, decay_rate=0.96
 )
)
base_image = preprocess_image(base_image_path)
style_reference_image = preprocess_image(style_reference_image_path)
# Use a Variable to store the combination image since we'll be updating
# it during training
combination_image = tf.Variable(preprocess_image(base_image_path))

iterations = 4000
for i in range(1, iterations + 1):
  loss, grads = compute_loss_and_grads(combination_image, base_image, style_reference_image)
  # Update the combination image in a direction that reduces the style
  # transfer loss.
  optimizer.apply_gradients([(grads, combination_image)])
  if i % 100 == 0:
    # print(f"Iteration {i}: loss={loss:.2f}")
    img = deprocess_image(combination_image.numpy())
    fname = f"combination_image_at_iteration_{i}.png"
    # Save the combination image at regular intervals
    keras.utils.save_img(fname, img)
out[26]

The resulting image can be seen below:

Resulting Image

Generating Images with Variational Autoencoders

There are two main techniques in image generation: variational autoencoders (VAEs) and generative adversarial networks (GANs).

Sampling from Latent Space of Images

The key idea of image generation is to develop a low-dimensional latent space of representations (which, like everything else in deep learning, is a vector space), where any point can be mapped to a "valid" image: an image that looks like the real thing. The module capable of realizing this mapping, taking as input a latent point and outputting an image (a grid of pixels), is called a generator (in the case of GANs) or a decoder (in the case of VAEs). Once such a latent space has been learned, you can sample points from it, and, by mapping them back to image space, generate images that have never been seen before.

VAEs are great for learning latent spaces that are well structured, where specific directions encode a meaningful axis of variation in the data. GANs generate images that can potentially be highly realistic, but the latent space they come from may not have as much structure and continuity.

Concept Vectors for Image Editing

Given a latent space of representations, or an embedding space, certain directions in the space may encode interesting axes of variation in the original data.
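
For instance, in a latent space of faces there may be a "smile vector": adding it to a latent point and decoding the result produces a more smiling version of the same face. A minimal sketch of that idea (decoder, z, and concept_vector are hypothetical; none of these names come from the notes):

import numpy as np

def edit_with_concept_vector(decoder, z, concept_vector, strength=1.0):
  # Move the latent point along the concept direction, then decode the result
  edited_z = z + strength * concept_vector
  return decoder(np.expand_dims(edited_z, axis=0))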

Variational Autoencoders

Variational autoencoders are a kind of generative model that's especially appropriate for the task of image editing via concept vectors. They're a modern take on autoencoders (a type of network that aims to encode an input to a low-dimensional latent space and then decode it back) that mixes ideas from deep learning with Bayesian inference.

A classical image autoencoder takes an image, maps it to a latent vector space via an encoder module, and then decodes it back to an output with the same dimensions as the original image, via a decoder module. It's then trained by using as target data the same images as the input images, meaning the autoencoder learns to reconstruct the original inputs. By imposing various constraints on the code (the output of the encoder), you can get the autoencoder to learn more- or less-interesting latent representations of the data.

Autoencoder at work

A VAE, instead of compressing its input image into a fixed code in the latent space, turns the image into the parameters of a statistical distribution: a mean and a variance. The VAE then uses the mean and variance parameters to randomly sample one element of the distribution, and decodes that element back to the original input. The stochasticity of this process improves robustness and forces the latent space to encode meaningful representations everywhere: every point sampled in the latent space is decoded to a valid output.

How a VAE works:

  1. An encoder module turns the input sample, input_img, into two parameters in a latent space of representations, z_mean and z_log_variance.
  2. You randomly sample a point z from the latent normal distribution that's assumed to generate the input image, via z = z_mean + exp(z_log_variance) * epsilon, where epsilon is a random tensor of small values.
  3. A decoder module maps this point in the latent space back to the original input image (see the pseudocode sketch after this list).
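
In pseudocode, the three steps above look like this (a sketch only, with hypothetical encoder and decoder callables; the Keras implementation follows below):

z_mean, z_log_variance = encoder(input_img)   # Step 1: encode to distribution parameters
epsilon = random_normal(shape=z_mean.shape)   # Step 2: sample a latent point
z = z_mean + exp(z_log_variance) * epsilon
reconstructed_img = decoder(z)                # Step 3: decode back to image space

Note that the Sampler layer implemented below uses exp(0.5 * z_log_var): because z_log_var is a log-variance, the factor 0.5 converts it into a standard deviation.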

VAE Implementation

The parameters of a VAE are trained via two loss functions: a reconstruction loss that forces the decoded samples to match the initial inputs, and a regularization loss that helps learn well-rounded latent distributions and reduces overfitting to the training data.

Implementing a VAE with Keras

Implementing a VAE that can generate MNIST digits in three parts:

  • An encoder network that turns a real image into a mean and a variance in the latent space.
  • A sampling layer that takes such a mean and variance, and uses them to sample a random point from the latent space.
  • A decoder network that turns points from the latent space back into images.
"""
VAE encoder network
"""
from tensorflow import keras
from tensorflow.keras import layers
latent_dim = 2 # Dimensionality of the latent space: a 2D plane

encoder_inputs = keras.Input(shape=(28,28,1))
x = layers.Conv2D(
 32, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
"""
The input image ends up being encoded into these two parameters
"""
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var], name="encoder")
out[28]
encoder.summary()
out[29]

Model: "vgg19"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer_8 (InputLayer) │ (None, None, None, 3) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block1_conv1 (Conv2D) │ (None, None, None, 64) │ 1,792 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block1_conv2 (Conv2D) │ (None, None, None, 64) │ 36,928 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block1_pool (MaxPooling2D) │ (None, None, None, 64) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block2_conv1 (Conv2D) │ (None, None, None, 128) │ 73,856 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block2_conv2 (Conv2D) │ (None, None, None, 128) │ 147,584 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block2_pool (MaxPooling2D) │ (None, None, None, 128) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv1 (Conv2D) │ (None, None, None, 256) │ 295,168 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv2 (Conv2D) │ (None, None, None, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv3 (Conv2D) │ (None, None, None, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv4 (Conv2D) │ (None, None, None, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_pool (MaxPooling2D) │ (None, None, None, 256) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv1 (Conv2D) │ (None, None, None, 512) │ 1,180,160 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv2 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv3 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv4 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_pool (MaxPooling2D) │ (None, None, None, 512) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv1 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv2 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv3 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv4 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_pool (MaxPooling2D) │ (None, None, None, 512) │ 0 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 20,024,384 (76.39 MB)

 Trainable params: 20,024,384 (76.39 MB)

 Non-trainable params: 0 (0.00 B)

"""
Latent space sampling layer
"""
import tensorflow as tf

class Sampler(layers.Layer):
  def call(self, z_mean, z_log_var):
    batch_size = tf.shape(z_mean)[0]
    z_size = tf.shape(z_mean)[1]
    # Draw a batch of random normal vectors
    epsilon = tf.random.normal(shape=(batch_size, z_size))
    # Apply the VAE sampling formula
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon
out[30]
"""
VAE decoder network, mapping latent space points to images
"""
# Input where we'll feed z
latent_inputs = keras.Input(shape=(latent_dim,))
# Produce the same number of coefficients that we had at the level
# of the Flatten layer in the encoder
x = layers.Dense(7 * 7 * 64, activation="relu")(latent_inputs)
# Revert the Flatten layer of the encoder
x = layers.Reshape((7, 7, 64))(x)
# Revert the Conv2D layers of the encoder
x = layers.Conv2DTranspose(64, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Conv2DTranspose(32, 3, activation="relu", strides=2, padding="same")(x)
# The ooutput ends up with shape (28,28,1)
decoder_outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)
decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
decoder.summary()
out[31]

Model: "vgg19"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓

┃ Layer (type)  ┃ Output Shape  ┃  Param # ┃

┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩

│ input_layer_8 (InputLayer) │ (None, None, None, 3) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block1_conv1 (Conv2D) │ (None, None, None, 64) │ 1,792 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block1_conv2 (Conv2D) │ (None, None, None, 64) │ 36,928 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block1_pool (MaxPooling2D) │ (None, None, None, 64) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block2_conv1 (Conv2D) │ (None, None, None, 128) │ 73,856 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block2_conv2 (Conv2D) │ (None, None, None, 128) │ 147,584 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block2_pool (MaxPooling2D) │ (None, None, None, 128) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv1 (Conv2D) │ (None, None, None, 256) │ 295,168 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv2 (Conv2D) │ (None, None, None, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv3 (Conv2D) │ (None, None, None, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_conv4 (Conv2D) │ (None, None, None, 256) │ 590,080 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block3_pool (MaxPooling2D) │ (None, None, None, 256) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv1 (Conv2D) │ (None, None, None, 512) │ 1,180,160 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv2 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv3 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_conv4 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block4_pool (MaxPooling2D) │ (None, None, None, 512) │ 0 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv1 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv2 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv3 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_conv4 (Conv2D) │ (None, None, None, 512) │ 2,359,808 │

├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤

│ block5_pool (MaxPooling2D) │ (None, None, None, 512) │ 0 │

└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘

 Total params: 20,024,384 (76.39 MB)

 Trainable params: 20,024,384 (76.39 MB)

 Non-trainable params: 0 (0.00 B)

"""
VAE model with custom train_step()

VAE is an example of self-supervised learning, because it
uses inputs as targets. Whenever you depart from classic supervised
learning, it's common to subclass the Model class and implement
a custom train_step() to specify the new training logic
"""
class VAE(keras.Model):
  def __init__(self, encoder, decoder, **kwargs):
    super().__init__(**kwargs)
    self.encoder = encoder
    self.decoder = decoder
    self.sampler = Sampler()
    """
    Use these metrics to keep track of the loss averages
    over each epoch
    """
    self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
    self.reconstruction_loss_tracker = keras.metrics.Mean(
    name="reconstruction_loss")
    self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")
  @property
  def metrics(self):
    """
    We list these metrics in the metrics property to enable the model
    to reset them after each epoch (or in between fit and evaluate calls)
    """
    return [
        self.total_loss_tracker,
        self.reconstruction_loss_tracker,
        self.kl_loss_tracker
        ]
  def train_step(self, data):
    with tf.GradientTape() as tape:
      z_mean, z_log_var = self.encoder(data)
      z = self.sampler(z_mean, z_log_var)
      reconstruction = self.decoder(z)
      # We sum the reconstruction loss over the spatial dimensions
      # and take its mean over the batch dimension
      reconstruction_loss = tf.reduce_mean(
          tf.reduce_sum(
              keras.losses.binary_crossentropy(data, reconstruction),
              axis=(1, 2)
          )
      )
      # Adding the regularization-term (Kullback-Leibler divergence)
      kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
      total_loss = reconstruction_loss + tf.reduce_mean(kl_loss)
    grads = tape.gradient(total_loss, self.trainable_weights)
    self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
    self.total_loss_tracker.update_state(total_loss)
    self.reconstruction_loss_tracker.update_state(reconstruction_loss)
    self.kl_loss_tracker.update_state(kl_loss)
    return {
        "total_loss": self.total_loss_tracker.result(),
        "reconstruction_loss": self.reconstruction_loss_tracker.result(),
        "kl_loss": self.kl_loss_tracker.result(),
    }
out[32]
"""
Training the VAE
"""
import numpy as np
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
# We train on all MNIST digits, so we concatenate the training
# and test samples
mnist_digits = np.concatenate([x_train, x_test], axis=0)
mnist_digits = np.expand_dims(mnist_digits, -1).astype("float32") / 255
vae = VAE(encoder, decoder)
# Note that we don't pass a loss argument in compile(), since the loss is
# already part of the train_step()
vae.compile(optimizer=keras.optimizers.Adam(), run_eagerly=True)
# Note that we don't pass targets in fit(), since train_step() doesn't
# expect any
vae.fit(mnist_digits, epochs=30, batch_size=128)
out[33]
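
To produce the grid of digits shown below, you can decode a regular grid of points from the 2D latent space (a sketch; the grid range and figure size are arbitrary choices, not settings from the notes):

import matplotlib.pyplot as plt

n = 30  # Display a grid of 30 x 30 digits
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))
# Sample points linearly along both axes of the 2D latent space
grid_x = np.linspace(-1, 1, n)
grid_y = np.linspace(-1, 1, n)[::-1]
for i, yi in enumerate(grid_y):
  for j, xi in enumerate(grid_x):
    z_sample = np.array([[xi, yi]])
    # Decode the latent point back into a 28 x 28 digit
    x_decoded = vae.decoder.predict(z_sample)
    digit = x_decoded[0].reshape(digit_size, digit_size)
    figure[i * digit_size: (i + 1) * digit_size,
           j * digit_size: (j + 1) * digit_size] = digit
plt.figure(figsize=(15, 15))
plt.axis("off")
plt.imshow(figure, cmap="Greys_r")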

Grid of Digits Sampled from Latent Space

Introduction to Generative Adversarial Networks

Generative adversarial networks (GANs) are an alternative to VAEs for learning latent spaces of images. They enable the generation of fairly realistic synthetic images by forcing the generated images to be statistically almost indistinguishable from real ones. A GAN is made of two parts:

  • Generator Network: Takes as input a random vector (a random point in the latent space), and decodes it into a synthetic image
  • Discriminator Network (or adversary): Takes as input an image (real or synthetic) and predicts whether the image came from the training set or was created by the generator network

The generator network is trained to be able to fool the discriminator network, and thus it evolves toward generating increasingly realistic images as training goes on: artificial images that look indistinguishable from real ones, to the extent that it's impossible for the discriminator network to tell the two apart. Meanwhile, the discriminator is constantly adapting to the gradually improving capabilities of the generator, setting a high bar of realism for the generated images. Once training is over, the generator is capable of turning any point in its input space into a believable image.

Generator / Discriminator Example

GANs are notoriously difficult to train - getting a GAN to work requires lots of careful tuning of the model architecture and training parameters.
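
Although these notes don't include the full GAN implementation, the adversarial training loop described above can be sketched roughly as follows (a minimal sketch, not the book's code; generator and discriminator are hypothetical Keras models, with the discriminator ending in a sigmoid unit):

import tensorflow as tf
from tensorflow import keras

def gan_train_step(real_images, generator, discriminator, g_optimizer, d_optimizer, latent_dim=128):
  batch_size = tf.shape(real_images)[0]
  # Train the discriminator: real images get label 1, generated images get label 0
  random_latent_vectors = tf.random.normal(shape=(batch_size, latent_dim))
  fake_images = generator(random_latent_vectors)
  with tf.GradientTape() as tape:
    real_predictions = discriminator(real_images)
    fake_predictions = discriminator(fake_images)
    d_loss = tf.reduce_mean(
      keras.losses.binary_crossentropy(tf.ones_like(real_predictions), real_predictions)
      + keras.losses.binary_crossentropy(tf.zeros_like(fake_predictions), fake_predictions))
  grads = tape.gradient(d_loss, discriminator.trainable_weights)
  d_optimizer.apply_gradients(zip(grads, discriminator.trainable_weights))
  # Train the generator: try to get the discriminator to label fake images as real
  random_latent_vectors = tf.random.normal(shape=(batch_size, latent_dim))
  with tf.GradientTape() as tape:
    fake_predictions = discriminator(generator(random_latent_vectors))
    g_loss = tf.reduce_mean(
      keras.losses.binary_crossentropy(tf.ones_like(fake_predictions), fake_predictions))
  grads = tape.gradient(g_loss, generator.trainable_weights)
  g_optimizer.apply_gradients(zip(grads, generator.trainable_weights))
  return d_loss, g_loss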

Best Practices for the Real World

Getting the most out of your models

Hyperparameter tuning can be automated with KerasTuner. With a typical search space and dataset, you'll often find yourself letting the hyperparameter search run overnight or even over several days. Also look into AutoML and AutoKeras. Another powerful technique for obtaining the best possible results on a task is model ensembling, which consists of pooling together the predictions of a set of different models to produce better predictions.
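
A rough sketch of the KerasTuner workflow (the search space, model, and directory name here are illustrative, not taken from the notes; x_train and y_train are hypothetical arrays):

import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
  # Sample hyperparameters from the search space
  units = hp.Int("units", min_value=16, max_value=64, step=16)
  optimizer = hp.Choice("optimizer", values=["rmsprop", "adam"])
  model = keras.Sequential([
    layers.Dense(units, activation="relu"),
    layers.Dense(10, activation="softmax"),
  ])
  model.compile(optimizer=optimizer,
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
  return model

tuner = kt.BayesianOptimization(
  build_model,
  objective="val_accuracy",
  max_trials=20,
  directory="hyperparam_search",
  overwrite=True,
)
# tuner.search(x_train, y_train, validation_split=0.2, epochs=50,
#              callbacks=[keras.callbacks.EarlyStopping(patience=3)])
# best_models = tuner.get_best_models(1)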

Scaling-up model training

Faster training directly improves the quality of your deep learning solutions. Mixed-precision training can speed up the training of almost any model by up to 3x. Mixed precision is about leveraging 16-bit computations in places where precision isn't an issue, and working with 32-bit values in other places to maintain numerical stability. Modern GPUs and TPUs feature specialized hardware that can run 16-bit operations much faster and use less memory than equivalent 32-bit operations.

When training on GPU, you can turn on mixed precision like this:

from tensorflow import keras
keras.mixed_precision.set_global_policy("mixed_float16")

Note that some operations may be numerically unstable in float16 (in particular, softmax and crossentropy). If you need to opt out of mixed precision for a specific layer, just pass the argument dtype="float32" to the constructor of that layer.
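
For example, under the mixed_float16 policy you might keep the final softmax layer in float32 (a minimal sketch; the layer sizes are arbitrary):

from tensorflow import keras
from tensorflow.keras import layers

keras.mixed_precision.set_global_policy("mixed_float16")
inputs = keras.Input(shape=(28 * 28,))
# This layer runs its computations in float16 under the mixed policy
x = layers.Dense(256, activation="relu")(inputs)
# Opt the output layer out of mixed precision so the softmax stays in float32
outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)
model = keras.Model(inputs, outputs)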

Multi-GPU Training

There are two ways to distribute computation across multiple devices: data parallelism and model parallelism. With data parallelism, a single model is replicated on multiple devices or multiple machines. Each of the model replicas processes different batches of data, and then they merge their results. With model parallelism, different parts of a single model run on different devices, processing a single batch of data together at the same time. This works best with models that have a naturally parallel architecture, such as models that feature multiple branches. In practice, model parallelism is only used for models that are too large to fit on any single device.
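
A minimal sketch of data parallelism with tf.distribute.MirroredStrategy (build_compiled_model() and train_dataset are hypothetical placeholders):

import tensorflow as tf

# Creates one model replica per visible GPU; gradients are averaged across replicas
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)
with strategy.scope():
  # Variable creation (building and compiling the model) must happen inside the scope
  model = build_compiled_model()  # Hypothetical helper returning a compiled Keras model
model.fit(train_dataset, epochs=10)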

Training on TPUs is generally faster than training on GPUs. When training with TPUs, there is an extra step that you need to take before you can start building a model: you need to connect to the TPU cluster:

import tensorflow as tf
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
print("Device:", tpu.master())