Deep Learning with Python - Chapters 12 and 13
"Generative Deep Learning" and "Best Practices for the Real World" go over text generation and image generation algorithms and best practices for Deep Learning in the real world.
Generative Deep Learning
Text Generation
Exploring how recurrent neural networks can be used to generate sequence data. The universal way to generate sequence data in deep learning is to train a model (usually a Transformer or an RNN) to predict the next token or next few tokens in a sequence, using the previous tokens as input. When working with text data, tokens are typically words or characters, and any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language: its statistical structure.
Once you have a trained language model, you can sample from it (generate new sequences): you feed it an initial string of text (called conditioning data), ask it to generate the next character or next word, add the generated output back to the input data, and repeat the process many times. This loop allows you to generate sequences of arbitrary length that reflect the structure of the data on which the model was trained: sequences that look almost like human-written sentences.
The importance of sampling strategy
When generating text, the way you choose the next token is critically important. A naive approach is greedy sampling: always choosing the most likely next character (this doesn't work well, because it yields repetitive, predictable strings). Stochastic sampling introduces randomness by sampling the next character from the model's probability distribution. There's one issue with this strategy: it doesn't offer a way to control the amount of randomness in the sampling process.
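To make the distinction concrete, here's a minimal sketch (toy numbers, not from the book) contrasting greedy and stochastic sampling for a single step:
"""
Greedy vs. stochastic sampling (illustrative sketch)
"""
import numpy as np
# Toy probability distribution over a 4-token vocabulary
next_token_probs = np.array([0.5, 0.3, 0.15, 0.05])
# Greedy sampling: always pick the most likely token (deterministic,
# tends to produce repetitive, predictable text)
greedy_choice = np.argmax(next_token_probs)
# Stochastic sampling: draw from the distribution itself, so less likely
# tokens still get picked occasionally
stochastic_choice = np.random.choice(len(next_token_probs), p=next_token_probs)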
When sampling from generative models, it's always good to explore different amounts of randomness in the generation process. In order to control the amount of stochasticity in the sampling process, we'll introduce a parameter called the softmax temperature, which characterizes the entropy of the probability distribution used for sampling: how surprising or predictable the choice of the next word will be.
"""
Reweighting a probability distribution to a different temperature
"""
import numpy as np
def reweight_distribution(original_distribution, temperature=0.5):
    """
    original_distribution is a 1D NumPy array of probability values that must
    sum to 1. temperature is a factor quantifying the entropy of the output
    distribution
    """
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # Returns a reweighted version of the original distribution.
    # The sum of the new distribution may no longer be 1,
    # so we divide it by its sum to obtain the new distribution
    return distribution / np.sum(distribution)
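As a quick sanity check (toy numbers, not from the book): temperatures below 1 sharpen the distribution, a temperature of 1 leaves it unchanged, and higher temperatures flatten it toward uniform:
original = np.array([0.5, 0.3, 0.15, 0.05])
print(reweight_distribution(original, temperature=0.1))  # much sharper, near one-hot
print(reweight_distribution(original, temperature=1.0))  # unchanged
print(reweight_distribution(original, temperature=5.0))  # flatter, closer to uniform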
Implementing Text Generation with Keras
We'll be generating new movie reviews.
# Load the imdb data
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
"""
Creating a dataset from text files (one file = one sample)
"""
import tensorflow as tf
from tensorflow import keras
dataset = keras.utils.text_dataset_from_directory(directory="aclImdb", label_mode=None, batch_size=256)
dataset = dataset.map(lambda x: tf.strings.regex_replace(x, "<br />", " "))
"""
Preparing a TextVectorization layer
"""
from tensorflow.keras.layers import TextVectorization
sequence_length = 100
# Consider the top 15,000 most common words - everything else
# will be treated as out-of-vocabulary
vocab_size = 15000
text_vectorization = TextVectorization(
    max_tokens=vocab_size,
    # We want to return integer word-index sequences
    output_mode="int",
    # We'll work with inputs and targets of length 100
    output_sequence_length=sequence_length,
)
text_vectorization.adapt(dataset)
"""
Setting up language modeling dataset
"""
def prepare_lm_dataset(text_batch):
    # Convert a batch of texts (strings) to a batch of integer sequences
    vectorized_sequences = text_vectorization(text_batch)
    # Create inputs by cutting off the last word of the sequences
    x = vectorized_sequences[:, :-1]
    # Create targets by offsetting the sequences by 1
    y = vectorized_sequences[:, 1:]
    return x, y
lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)
We'll train a model to predict a probability distribution over the next word in a sentence, given a number of initial words. When the model is trained, we'll feed it with a prompt, sample the next word, add that word back to the prompt, and repeat, until we've generated a short paragraph.
"""
Implementing positional embedding as a subclassed layer
"""
from tensorflow.keras import layers
from tensorflow.keras.layers import Layer
class PositionalEmbedding(Layer):
    """
    A downside of position embeddings is that the sequence length
    needs to be known in advance
    """
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        # Prepare an Embedding layer for the token indices
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        # And another one for the token positions
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim
    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        # Add both embedding vectors together
        return embedded_tokens + embedded_positions
    def compute_mask(self, inputs, mask=None):
        """
        Like the Embedding layer, this layer should be able to generate a
        mask so we can ignore padding 0s in the inputs. The compute_mask
        method will be called automatically by the framework, and the mask
        will get propagated to the next layer
        """
        return tf.math.not_equal(inputs, 0)
    def get_config(self):
        """
        Implement serialization so that we can save the model
        """
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
"""
The Transformer Decoder
"""
class TransformerDecoder(Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        # This attribute ensures that the layer will propagate its input mask
        # to its outputs; masking in Keras is explicitly opt-in.
        self.supports_masking = True
    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
    def get_causal_attention_mask(self, inputs):
        """
        TransformerDecoder method that generates a causal mask
        """
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        # Generate a matrix of shape (sequence_length, sequence_length)
        # with 1s in one half and 0s in the other
        mask = tf.cast(i >= j, dtype="int32")
        # Replicate it along the batch axis to get a matrix of shape
        # (batch_size, sequence_length, sequence_length)
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)
    def call(self, inputs, encoder_outputs, mask=None):
        """
        The forward pass of the TransformerDecoder
        """
        # Retrieve the causal mask
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            # Prepare the input mask (which describes padding locations
            # in the target sequence)
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            # Merge the two masks together
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # If no padding mask is passed in, fall back to the causal
            # mask alone
            padding_mask = causal_mask
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            # Pass the causal mask to the first attention layer, which
            # performs self-attention over the target sequence
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            # Pass the combined mask to the second attention layer, which
            # relates the source sequence to the target sequence
            attention_mask=padding_mask)
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)
"""
A simple Transformer-based language model
"""
from tensorflow.keras import layers
embed_dim = 256
latent_dim = 2048
num_heads = 2
inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, x)
# Softmax over possible vocabulary words, computed for each output
# sequence timestep
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")
"""
The text-generation callback
"""
import numpy as np
# Dict that maps word indices back to strings, to be used for
# text decoding
tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))
def sample_next(predictions, temperature=1.0):
    """
    Implements variable-temperature sampling from a probability
    distribution
    """
    predictions = np.asarray(predictions).astype("float64")
    predictions = np.log(predictions) / temperature
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)
class TextGenerator(keras.callbacks.Callback):
    def __init__(self,
                 prompt,               # Prompt that we use to seed text generation
                 generate_length,      # How many words to generate
                 model_input_length,
                 temperatures=(1.,),   # Range of temperatures to use for sampling
                 print_freq=1):
        self.prompt = prompt
        self.generate_length = generate_length
        self.model_input_length = model_input_length
        self.temperatures = temperatures
        self.print_freq = print_freq
    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.print_freq != 0:
            return
        for temperature in self.temperatures:
            print("== Generating with temperature", temperature)
            # When generating text, we start from our prompt
            sentence = self.prompt
            for i in range(self.generate_length):
                # Feed the current sequence into our model
                tokenized_sentence = text_vectorization([sentence])
                predictions = self.model(tokenized_sentence)
                # Retrieve the predictions for the last timestep and sample
                # a new word at the current temperature
                next_token = sample_next(predictions[0, i, :], temperature)
                sampled_token = tokens_index[next_token]
                # Append the word to the current sequence and repeat
                sentence += " " + sampled_token
            print(sentence)
prompt = "This movie"
text_gen_callback = TextGenerator(
prompt,
generate_length=50,
model_input_length=sequence_length,
temperatures=(0.2, 0.5, 0.7, 1., 1.5)) # We'll use a diverse range of
# temperatures to sample text, to demonstrate the effect of temperature on
# text generation
"""
Fitting the language model
"""
model.fit(lm_dataset, epochs=200, callbacks=[text_gen_callback])
Always experiment with multiple sampling strategies (temperatures). A clever balance between learned structure and randomness is what makes generation interesting. Language models are all form and no substance.
DeepDream
DeepDream is an artistic image-modification technique that uses the representations learned by convolutional neural networks. The DeepDream algorithm is almost identical to the convnet filter-visualization technique introduced in chapter 9, consisting of running a convnet in reverse: doing gradient ascent on the input to the convnet in order to maximize the activation of a specific filter in an upper layer of the convnet.
"""
Fetching the test image
"""
from tensorflow import keras
import matplotlib.pyplot as plt
base_image_path = keras.utils.get_file(
"coast.jpg", origin="https://img-datasets.s3.amazonaws.com/coast.jpg")
plt.axis("off")
plt.imshow(keras.utils.load_img(base_image_path))
"""
Instantiating a pretrained InceptionV3 model
"""
from tensorflow.keras.applications import inception_v3
model = inception_v3.InceptionV3(weights="imagenet", include_top=False)
"""
Configuring the contribution of each layer to the DeepDream loss
"""
# Layers for which we try to maximize activation, as well as their weight
# in the total loss. You can tweak these settings to obtain new visual effects.
layer_settings = {
    "mixed4": 1.0,
    "mixed5": 1.5,
    "mixed6": 2.0,
    "mixed7": 2.5,
}
# Symbolic outputs of each layer
outputs_dict = dict(
    [
        (layer.name, layer.output)
        for layer in [model.get_layer(name)
                      for name in layer_settings.keys()]
    ]
)
# Model that returns the activation values for every target layer (as a dict)
feature_extractor = keras.Model(inputs=model.inputs, outputs=outputs_dict)
"""
The DeepDream loss
"""
def compute_loss(input_image):
    # Extract activations
    features = feature_extractor(input_image)
    # Initialize the loss to 0
    loss = tf.zeros(shape=())
    for name in features.keys():
        coeff = layer_settings[name]
        activation = features[name]
        # We avoid border artifacts by only involving non-border pixels
        # in the loss
        loss += coeff * tf.reduce_mean(tf.square(activation[:, 2:-2, 2:-2, :]))
    return loss
"""
The DeepDream gradient ascent process
"""
import tensorflow as tf

@tf.function  # Make the training step fast by compiling it as a tf.function
def gradient_ascent_step(image, learning_rate):
    with tf.GradientTape() as tape:
        tape.watch(image)
        loss = compute_loss(image)
    # Compute the gradients of the DeepDream loss with respect to
    # the current image
    grads = tape.gradient(loss, image)
    # Normalize the gradients (the same trick we used in chapter 9)
    grads = tf.math.l2_normalize(grads)
    image += learning_rate * grads
    return loss, image
def gradient_ascent_loop(image, iterations, learning_rate, max_loss=None):
    """
    Runs gradient ascent for a given image scale (octave)
    """
    for i in range(iterations):
        # Repeatedly update the image in a way that increases the
        # DeepDream loss
        loss, image = gradient_ascent_step(image, learning_rate)
        if max_loss is not None and loss > max_loss:
            # Break out if the loss crosses a certain threshold
            # (over-optimizing would create unwanted image artifacts)
            break
        print(f"... Loss value at step {i}: {loss:.2f}")
    return image
"""
Image Processing utilities
"""
step = 20.          # Gradient ascent step size
num_octave = 3      # Number of scales at which to run gradient ascent
octave_scale = 1.4  # Size ratio between successive scales
iterations = 30     # Number of gradient ascent steps per scale
# We'll stop the gradient ascent process for a scale if the loss
# gets higher than this
max_loss = 15.
import numpy as np
def preprocess_image(image_path):
    """
    Util function to open, resize, and format pictures into
    appropriate arrays
    """
    img = keras.utils.load_img(image_path)
    img = keras.utils.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = keras.applications.inception_v3.preprocess_input(img)
    return img

def deprocess_image(img):
    """
    Util function to convert a NumPy array into a valid image
    """
    img = img.reshape((img.shape[1], img.shape[2], 3))
    # Undo inception_v3.preprocess_input: rescale from [-1, 1] to [0, 255]
    img /= 2.0
    img += 0.5
    img *= 255.
    # Convert to uint8 and clip to the valid range [0, 255]
    img = np.clip(img, 0, 255).astype("uint8")
    return img
"""
Running Gradient Ascent over multiple successive octaves
"""
# Load the test image
original_img = preprocess_image(base_image_path)
original_shape = original_img.shape[1:3]

# Compute the target shape of the image at the different octaves
successive_shapes = [original_shape]
for i in range(1, num_octave):
    shape = tuple([int(dim / (octave_scale ** i)) for dim in original_shape])
    successive_shapes.append(shape)
successive_shapes = successive_shapes[::-1]
shrunk_original_img = tf.image.resize(original_img, successive_shapes[0])
# Make a copy of the image (we need to keep the original around)
img = tf.identity(original_img)
# Iterate over the different octaves
for i, shape in enumerate(successive_shapes):
    print(f"Processing octave {i} with shape {shape}")
    # Scale up the dream image
    img = tf.image.resize(img, shape)
    # Run gradient ascent, altering the dream
    img = gradient_ascent_loop(
        img, iterations=iterations, learning_rate=step, max_loss=max_loss
    )
    # Scale up the smaller version of the original image: it will be
    # pixellated
    upscaled_shrunk_original_img = tf.image.resize(shrunk_original_img, shape)
    # Compute the high-quality version of the original image at this size
    same_size_original = tf.image.resize(original_img, shape)
    # The difference between the two is the detail that was lost
    # when scaling up
    lost_detail = same_size_original - upscaled_shrunk_original_img
    # Reinject the lost detail into the dream
    img += lost_detail
    shrunk_original_img = tf.image.resize(original_img, shape)
# Save the final result
keras.utils.save_img("dream.png", deprocess_image(img.numpy()))
Neural Style Transfer
In addition to DeepDream, another major development in deep-learning-driven image modification is neural style transfer. It consists of applying the style of a reference image to a target image while conserving the content of the target image. In this context, style essentially means textures, colors, and visual patterns in the image, at various spatial scales, and the content is the higher-level macrostructure of the image.
The key notion behind implementing style transfer is the same idea that's central to all deep learning algorithms: you define a loss function to specify what you want to achieve, and you minimize this loss.
Neural Style Transfer Loss Function pseudocode:
loss = (distance(style(reference_image) - style(combination_image)) + distance(content(original_image) - content(combination_image)))
- A good candidate for the content loss is the L2 norm between the activations of an upper layer in a pretrained convnet, computed over the target image, and the activations of the same layer computed over the generated image.
- For the style loss, the neural style transfer paper authors use the Gram matrix of a layer's activations: the inner product of the feature maps of a given layer.
You can use a pretrained convnet to define a loss that will do the following:
- Preserve content by maintaining similar high-level layer activations between the original image and the generated image. The convnet should "see" both the original image and the generated image as containing the same things.
- Preserve style by maintaining similar correlations within activations for both the low-level layers and the high-level layers. Feature correlations capture textures: the generated image and the style reference image should share the same textures at different spatial scales.
"""
Getting the style and content images
"""
from tensorflow import keras
# Path to the image we want to transform
base_image_path = keras.utils.get_file("sf.jpg",origin="https://img-datasets.s3.amazonaws.com/sf.jpg")
# Path to the style image
style_reference_image_path = keras.utils.get_file("starry_night.jpg", origin="https://img-datasets.s3.amazonaws.com/starry_night.jpg")
# Dimensions of the generated picture
original_width, original_height = keras.utils.load_img(base_image_path).size
img_height = 400
img_width = round(original_width * img_height / original_height)
"""
Auxiliary Functions
"""
import numpy as np
def preprocess_image(image_path):
    """
    Util function to open, resize, and format pictures
    into appropriate arrays
    """
    img = keras.utils.load_img(
        image_path, target_size=(img_height, img_width))
    img = keras.utils.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = keras.applications.vgg19.preprocess_input(img)
    return img

def deprocess_image(img):
    """
    Util function to convert a NumPy array into a valid image
    """
    img = img.reshape((img_height, img_width, 3))
    # Add back the mean ImageNet pixel value. This reverses the
    # zero-centering done by vgg19.preprocess_input
    img[:, :, 0] += 103.939
    img[:, :, 1] += 116.779
    img[:, :, 2] += 123.68
    # Convert images from 'BGR' to 'RGB'.
    # This is also part of the reversal of vgg19.preprocess_input
    img = img[:, :, ::-1]
    img = np.clip(img, 0, 255).astype("uint8")
    return img
"""
Using a pretrained VGG19 model to create a feature extractor
"""
# Build a VGG19 model loaded with pretrained ImageNet weights
model = keras.applications.vgg19.VGG19(weights="imagenet", include_top=False)
outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])
# Model that returns the activation values for every target layer (as a dict)
feature_extractor = keras.Model(inputs=model.inputs, outputs=outputs_dict)
"""
Content loss
"""
def content_loss(base_img, combination_img):
    return tf.reduce_sum(tf.square(combination_img - base_img))
"""
Style Loss
"""
def gram_matrix(x):
    x = tf.transpose(x, (2, 0, 1))
    features = tf.reshape(x, (tf.shape(x)[0], -1))
    gram = tf.matmul(features, tf.transpose(features))
    return gram

def style_loss(style_img, combination_img):
    S = gram_matrix(style_img)
    C = gram_matrix(combination_img)
    channels = 3
    size = img_height * img_width
    return tf.reduce_sum(tf.square(S - C)) / (4.0 * (channels ** 2) * (size ** 2))
"""
Total variation loss
"""
def total_variation_loss(x):
    a = tf.square(
        x[:, : img_height - 1, : img_width - 1, :] - x[:, 1:, : img_width - 1, :]
    )
    b = tf.square(
        x[:, : img_height - 1, : img_width - 1, :] - x[:, : img_height - 1, 1:, :]
    )
    return tf.reduce_sum(tf.pow(a + b, 1.25))
"""
Defining the final loss that you'll minimize
"""
# List of layers to use for the style loss
style_layer_names = [
"block1_conv1",
"block2_conv1",
"block3_conv1",
"block4_conv1",
"block5_conv1",
]
# The layer to use for the content loss
content_layer_name = "block5_conv2"
# Contribution weight of the total variation loss
total_variation_weight = 1e-6
# Contribution weight of the style loss
style_weight = 1e-6
# Contribution weight of the content loss
content_weight = 2.5e-8
def compute_loss(combination_image, base_image, style_reference_image):
    input_tensor = tf.concat(
        [base_image, style_reference_image, combination_image], axis=0)
    features = feature_extractor(input_tensor)
    # Initialize the loss to 0
    loss = tf.zeros(shape=())
    # Add the content loss
    layer_features = features[content_layer_name]
    base_image_features = layer_features[0, :, :, :]
    combination_features = layer_features[2, :, :, :]
    loss = loss + content_weight * content_loss(
        base_image_features, combination_features)
    # Add the style loss
    for layer_name in style_layer_names:
        layer_features = features[layer_name]
        style_reference_features = layer_features[1, :, :, :]
        combination_features = layer_features[2, :, :, :]
        style_loss_value = style_loss(
            style_reference_features, combination_features)
        loss += (style_weight / len(style_layer_names)) * style_loss_value
    # Add the total variation loss
    loss += total_variation_weight * total_variation_loss(combination_image)
    return loss
"""
Setting up the gradient-descent process
"""
import tensorflow as tf
@tf.function  # Make the training step fast by compiling it as a tf.function
def compute_loss_and_grads(combination_image, base_image, style_reference_image):
    """
    Compute the style transfer loss and its gradients
    with respect to the combination image
    """
    with tf.GradientTape() as tape:
        loss = compute_loss(combination_image, base_image, style_reference_image)
    grads = tape.gradient(loss, combination_image)
    return loss, grads

# We'll start with a learning rate of 100 and decrease it by 4%
# every 100 steps
optimizer = keras.optimizers.SGD(
    keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=100.0, decay_steps=100, decay_rate=0.96
    )
)
base_image = preprocess_image(base_image_path)
style_reference_image = preprocess_image(style_reference_image_path)
# Use a Variable to store the combination image, since we'll be
# updating it during training
combination_image = tf.Variable(preprocess_image(base_image_path))

iterations = 4000
for i in range(1, iterations + 1):
    loss, grads = compute_loss_and_grads(
        combination_image, base_image, style_reference_image)
    # Update the combination image in a direction that reduces the
    # style transfer loss
    optimizer.apply_gradients([(grads, combination_image)])
    if i % 100 == 0:
        print(f"Iteration {i}: loss={loss:.2f}")
        # Save the combination image at regular intervals
        img = deprocess_image(combination_image.numpy())
        fname = f"combination_image_at_iteration_{i}.png"
        keras.utils.save_img(fname, img)
Generating Images with Variational Autoencoders
There are two main techniques for image generation: variational autoencoders (VAEs) and generative adversarial networks (GANs).
Sampling from Latent Space of Images
The key idea of image generation is to develop a low-dimensional latent space of representations (which, like everything else in deep learning, is a vector space), where any point can be mapped to a "valid" image: an image that looks like the real thing. The module capable of realizing this mapping, taking as input a latent point and outputting an image (a grid of pixels), is called a generator (in the case of GANs) or a decoder (in the case of VAEs). Once such a latent space has been learned, you can sample points from it, and, by mapping them back to image space, generate images that have never been seen before.
VAEs are great for learning latent spaces that are well-structured, where specific directions encode a meaningful axis of variation in the data. GANs generate images that can potentially be highly realistic, but the latent space they come from may not have as much structure and continuity.
Concept Vectors for Image Editing
Given a latent space of representations, or an embedding space, certain directions in the space may encode interesting axes of variation in the original data.
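As an illustrative sketch (the latent codes and the decoder here are hypothetical stand-ins, not from the book), image editing with a concept vector amounts to shifting a latent code along a direction and decoding the result:
"""
Concept-vector editing (illustrative sketch)
"""
import numpy as np
# Hypothetical: latent codes of images with and without an attribute
# (e.g. smiling vs. not smiling), produced by a trained encoder
smiling_codes = np.random.normal(size=(100, 32))  # stand-ins for real codes
neutral_codes = np.random.normal(size=(100, 32))
# The concept vector is the difference between the two groups' mean codes
smile_vector = smiling_codes.mean(axis=0) - neutral_codes.mean(axis=0)
# Editing: move a latent code along the concept direction, then decode it
z = neutral_codes[0]
z_edited = z + 0.8 * smile_vector  # 0.8 controls the strength of the edit
# edited_image = decoder.predict(z_edited[None, :])  # with a trained decoder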
Variational Autoencoders
Variational autoencoders are a kind of generative model that's especially appropriate for the task of image editing via concept vectors. They're a modern take on autoencoders (a type of network that aims to encode an input to a low-dimensional latent space and then decode it back) that mixes ideas from deep learning with Bayesian inference.
A classical image autoencoder takes an image, maps it to a latent vector space via an encoder module, then decodes it back to an output with the same dimensions as the original image, via a decoder module. It's then trained by using as target data the same images as the input images, meaning the autoencoder learns to reconstruct the original inputs. By imposing various constraints on the code (the output of the encoder), you can get the autoencoder to learn more- or less-interesting latent representations of the data.
A VAE, instead of compressing its input image into a fixed code in the latent space, turns the image into the parameters of a statistical distribution: a mean and a variance. The VAE then uses the mean and variance parameters to randomly sample one element of the distribution, and decodes that element back to the original input. The stochasticity of this process improves robustness and forces the latent space to encode meaningful representations everywhere: every point sampled in the latent space is decoded to a valid output.
How a VAE works:
- An encoder module turns the input sample, input_img, into two parameters in a latent space of representations, z_mean and z_log_variance
- You randomly sample a point z from the latent normal distribution that's assumed to generate the input image, via z = z_mean + exp(z_log_variance) * epsilon, where epsilon is a random tensor of small values.
- A decoder module maps this point in the latent space back to the original input image.
The parameters of a VAE are trained via two loss functions: a reconstruction loss that forces the decoded samples to match the initial inputs, and a regularization loss that helps learn well-rounded latent distributions and reduces overfitting to the training data.
Implementing a VAE with Keras
Implementing a VAE that can generate MNIST digits in three parts:
- An encoder network that turns a real image into a mean and a variance in the latent space.
- A sampling layer that takes such a mean and variance, and uses them to sample a random point from the latent space.
- A decoder network that turns points from the latent space back into images.
"""
VAE encoder network
"""
from tensorflow import keras
from tensorflow.keras import layers
latent_dim = 2 # Dimensionality of the latent space: a 2D plane
encoder_inputs = keras.Input(shape=(28,28,1))
x = layers.Conv2D(
    32, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
"""
The input image ends up being encoded into these two parameters
"""
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var], name="encoder")
encoder.summary()
"""
Latent space sampling layer
"""
import tensorflow as tf
class Sampler(layers.Layer):
    def call(self, z_mean, z_log_var):
        batch_size = tf.shape(z_mean)[0]
        z_size = tf.shape(z_mean)[1]
        # Draw a batch of random normal vectors
        epsilon = tf.random.normal(shape=(batch_size, z_size))
        # Apply the VAE sampling formula
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
"""
VAE decoder network, mapping latent space points to images
"""
# Input where we'll feed z
latent_inputs = keras.Input(shape=(latent_dim,))
# Produce the same number of coefficients that we had at the level
# of the Flatten layer in the encoder
x = layers.Dense(7 * 7 * 64, activation="relu")(latent_inputs)
# Revert the Flatten layer of the encoder
x = layers.Reshape((7, 7, 64))(x)
# Revert the Conv2D layers of the encoder
x = layers.Conv2DTranspose(64, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Conv2DTranspose(32, 3, activation="relu", strides=2, padding="same")(x)
# The output ends up with shape (28, 28, 1)
decoder_outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)
decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
decoder.summary()
"""
VAE model with custom train_step()
VAE is an example of self-supervised learning, because it
uses inputs as targets. Whenever you depart from classic supervised
learning, it's common to subclass the Model class and implement
a custom train_step() to sepcify the new training logic
"""
class VAE(keras.Model):
def __init__(self, encoder, decoder, **kwargs):
super().__init__(**kwargs)
self.encoder = encoder
self.decoder = decoder
self.sampler = Sampler()
"""
Use these metrics to keep track of the loss averages
over each epoch
"""
self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
self.reconstruction_loss_tracker = keras.metrics.Mean(
name="reconstruction_loss")
self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")
@property
def metrics(self):
"""
We list these metrics in the metrics property to enable the model
to reset them after each epoch (or in between fit and evaluate calls)
"""
return [
self.total_loss_tracker,
self.reconstruction_loss_tracker,
self.kl_loss_tracker
]
def train_step(self, data):
with tf.GradientTape() as tape:
z_mean, z_log_var = self.encoder(data)
z = self.sampler(z_mean, z_log_var)
reconstruction = decoder(z)
# We sum the reconstruction loss over the spatial dimensions
# and take its mean over the batch dimension
reconstruction_loss = tf.reduce_mean(
tf.reduce_sum(
keras.losses.binary_crossentropy(data, reconstruction),
axis=(1, 2)
)
)
# Adding the regularization-term (Kullback-Leibler divergence)
kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
total_loss = reconstruction_loss + tf.reduce_mean(kl_loss)
grads = tape.gradient(total_loss, self.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
self.total_loss_tracker.update_state(total_loss)
self.reconstruction_loss_tracker.update_state(reconstruction_loss)
self.kl_loss_tracker.update_state(kl_loss)
return {
"total_loss": self.total_loss_tracker.result(),
"reconstruction_loss": self.reconstruction_loss_tracker.result(),
"kl_loss": self.kl_loss_tracker.result(),
}
"""
Training the VAE
"""
import numpy as np
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
# We train on all MNIST digits, so we concatenate the training
# and test samples
mnist_digits = np.concatenate([x_train, x_test], axis=0)
mnist_digits = np.expand_dims(mnist_digits, -1).astype("float32") / 255
vae = VAE(encoder, decoder)
# Note that we don't pass a loss argument in compile(), since the loss
# is already part of train_step()
vae.compile(optimizer=keras.optimizers.Adam(), run_eagerly=True)
# Note that we don't pass targets in fit(), since train_step() doesn't
# expect any
vae.fit(mnist_digits, epochs=30, batch_size=128)
Introduction to Generative Adversarial Networks
Generative Adversarial Networks (GANs) are an alternative to VAEs for learning latent spaces of images. They enable the generation of fairly realistic synthetic images by forcing the generated images to be statistically almost indistinguishable from real ones. A GAN is made of two parts:
- Generator Network: Takes as input a random vector (a random point in the latent space) and decodes it into a synthetic image
- Discriminator Network (or adversary): Takes as input an image (real or synthetic) and predicts whether the image came from the training set or was created by the generator network
The generator network is trained to fool the discriminator network, and thus it evolves toward generating increasingly realistic images as training goes on: artificial images that look indistinguishable from real ones, to the extent that it's impossible for the discriminator network to tell the two apart. Meanwhile, the discriminator is constantly adapting to the gradually improving capabilities of the generator, setting a high bar of realism for the generated images. Once training is over, the generator is capable of turning any point in its input space into a believable image (see the sketch below).
GANs are notoriously difficult to train - getting a GAN to work requires lots of careful tuning of the model architecture and training parameters.
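To make the two-network setup concrete, here's a heavily simplified sketch of one GAN training step (a minimal illustration with assumed toy architectures for 28x28 grayscale images, not the book's full implementation):
"""
Minimal GAN training step (illustrative sketch)
"""
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 128  # assumed latent dimensionality

# Generator: latent vector -> 28x28 image (toy architecture)
generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(7 * 7 * 64, activation="relu"),
    layers.Reshape((7, 7, 64)),
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(1, 3, padding="same", activation="sigmoid"),
])
# Discriminator: image -> probability that the image is real
discriminator = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
d_optimizer = keras.optimizers.Adam(1e-4)
g_optimizer = keras.optimizers.Adam(1e-4)
bce = keras.losses.BinaryCrossentropy()

def train_step(real_images):
    batch_size = tf.shape(real_images)[0]
    # Discriminator step: learn to separate real from generated images
    z = tf.random.normal((batch_size, latent_dim))
    fake_images = generator(z)
    with tf.GradientTape() as tape:
        real_preds = discriminator(real_images)
        fake_preds = discriminator(fake_images)
        d_loss = (bce(tf.ones_like(real_preds), real_preds) +
                  bce(tf.zeros_like(fake_preds), fake_preds))
    grads = tape.gradient(d_loss, discriminator.trainable_weights)
    d_optimizer.apply_gradients(zip(grads, discriminator.trainable_weights))
    # Generator step: learn to fool the discriminator
    # (label the generated images as "real")
    z = tf.random.normal((batch_size, latent_dim))
    with tf.GradientTape() as tape:
        preds = discriminator(generator(z))
        g_loss = bce(tf.ones_like(preds), preds)
    grads = tape.gradient(g_loss, generator.trainable_weights)
    g_optimizer.apply_gradients(zip(grads, generator.trainable_weights))
    return d_loss, g_loss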
Best Practices for the Real World
Getting the most out of your models
Hyperparameter tuning with KerasTuner: with a typical search space and dataset, you'll often find yourself letting the hyperparameter search run overnight or even over several days (see the sketch below). For fully automated approaches, look into AutoML and AutoKeras. Another powerful technique for obtaining the best possible results on a task is model ensembling: pooling together the predictions of a set of different models to produce better predictions.
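As a minimal KerasTuner sketch (assuming the keras_tuner package is installed; the search space and model here are illustrative, not the book's example):
"""
Hyperparameter search with KerasTuner (illustrative sketch)
"""
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    # Each hp.* call declares one hyperparameter to search over
    units = hp.Int("units", min_value=16, max_value=64, step=16)
    model = keras.Sequential([
        layers.Dense(units, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    optimizer = hp.Choice("optimizer", values=["rmsprop", "adam"])
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.BayesianOptimization(
    build_model,
    objective="val_accuracy",
    max_trials=20,
    overwrite=True,
)
# tuner.search(x_train, y_train, validation_split=0.2, epochs=10)
In its simplest form, ensembling is just a (possibly weighted) average of the predictions of several well-performing, diverse models.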
Scaling up model training
Faster training directly improves the quality of your deep learning solutions. Mixed-precision training can speed up the training of almost any model by up to 3x. Mixed precision is about leveraging 16-bit computations in places where precision isn't an issue, and working with 32-bit values elsewhere to maintain numerical stability. Modern GPUs and TPUs feature specialized hardware that can run 16-bit operations much faster and use less memory than equivalent 32-bit operations.
When training on GPU, you can turn on mixed precision like this:
from tensorflow import keras
keras.mixed_precision.set_global_policy("mixed_float16")
Note that some operations may be numerically unstable in float16 (in particular, softmax and crossentropy). If you need to opt out of mixed precision for a specific layer, just pass the argument dtype="float32" to the constructor of this layer.
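For example (a minimal sketch, assuming layers and x are defined as in the models above):
# The final softmax of a classifier is numerically sensitive, so we keep
# it in float32 even under the mixed_float16 policy
outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)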
Multi-GPU Training
There are two ways to distribute computation across multiple devices: data parallelism and model parallelism. With data parallelism, a single model is replicated on multiple devices or multiple machines; each replica processes a different batch of data, and then the replicas merge their results (see the sketch below). With model parallelism, different parts of a single model run on different devices, processing a single batch of data together at the same time. This works best with models that have a naturally parallel architecture, such as models that feature multiple branches. In practice, model parallelism is only used for models that are too large to fit on any single device.
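Here's a minimal data-parallelism sketch with tf.distribute (get_compiled_model() is a hypothetical factory that builds and compiles your model):
"""
Data parallelism with MirroredStrategy (illustrative sketch)
"""
import tensorflow as tf
# Replicate the model on every available GPU; gradients are merged
# across replicas after each batch
strategy = tf.distribute.MirroredStrategy()
print(f"Number of replicas: {strategy.num_replicas_in_sync}")
# Model creation and compilation must happen inside the strategy scope
with strategy.scope():
    model = get_compiled_model()  # hypothetical model factory
# fit() then automatically splits each batch across the replicas
# model.fit(train_dataset, epochs=10)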
Training on TPUs is generally faster than training on GPUs. When training with TPUs, there's an extra step you need to take before you can start building a model: you need to connect to the TPU cluster:
import tensorflow as tf
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
print("Device:", tpu.master())