Generative Deep Learning - Generative Modeling and Deep Learning
Jupyter notebook for chapters 1 and 2 of Generative Deep Learning by David Foster. These first two chapters go over some generative modeling terminology and give a quick overview of deep learning - Dense Layers, CNNs, etc.
Chapter 1: Generative Modeling
What is Generative Modeling?
A generative model describes how a dataset is generated, in terms of a probabilistic model. By sampling from this model, we are able to generate new data.
Generative Modeling Process: First, we require a dataset consisting of many examples of the entity we are trying to generate. This is known as the training data, and one such data point is called an observation.
Each observation consists of many features. It is our goal to build a model that can generate new sets of features that look as if they have been created using the same rules as the original data. A generative model must also be probabilistic rather than deterministic. The model must include a stochastic (random) element that influences the individual samples generated by the model.
Generative Versus Discriminative Modeling
Most machine learning problems are discriminative modeling problems.
When performing discriminative modeling, each observation in the training data has a label. Discriminative modeling is synonymous with supervised learning, or learning a function that maps an input to an output using a labeled dataset. Generative modeling is usually performed with an unlabeled dataset.
Advances in Machine Learning
Generative modeling is harder to evaluate than discriminative modeling. Discriminative modeling has also historically been more readily applicable to business problems than generative modeling.
The Rise of Generative Modeling
We should not be content with only being able to categorize data but should also seek a more complete understanding of how the data was generated in the first place. It is highly likely that generative modeling will be central to driving future developments in other fields of machine learning, such as reinforcement learning (the study of teaching agents to optimize a goal in an environment through trial and error). Current neuroscientific theory suggests that our perception of reality is a generative model that is trained from birth to produce simulations of our surroundings that accurately match the future.
The Generative Modeling Framework
- We have a dataset of observations $X$.
- We assume that the observations have been generated according to some unknown distribution, $p_{data}$.
- A generative model $p_{model}$ tries to mimic $p_{data}$. If we achieve this goal, we can sample from $p_{model}$ to generate observations that appear to have been drawn from $p_{data}$.
- We are impressed by $p_{model}$ if:
  - Rule 1: It can generate examples that appear to have been drawn from $p_{data}$.
  - Rule 2: It can generate examples that are suitably different from the observations in $X$. In other words, the model shouldn't simply reproduce things it has already seen.
Probabilistic Generative Models
Terms:
- The sample space is the complete set of all values an observation $x$ can take.
- A probability density function (or simply density function), $p(x)$, is a function that maps a point $x$ in the sample space to a number between 0 and 1. The sum of the density function over all points in the sample space must equal 1, so that it is a well-defined probability distribution.
While there is only one true density function $p_{data}$ that is assumed to have generated the observable dataset, there are infinitely many density functions $p_{model}$ that we can use to estimate $p_{data}$.
- A parametric model, $p_\theta(x)$, is a family of density functions that can be described using a finite number of parameters, $\theta$.
- The likelihood $\mathcal{L}(\theta \mid x)$ of a parameter set $\theta$ is a function that measures the plausibility of $\theta$, given some observed point $x$. It is defined as $\mathcal{L}(\theta \mid x) = p_\theta(x)$. That is, the likelihood of $\theta$ given some observed point $x$ is defined to be the value of the density function parameterized by $\theta$, at the point $x$. We are simply defining the likelihood of the parameter set $\theta$ to be equal to the probability of seeing the data under the model parameterized by $\theta$.
The focus of parametric modeling should be to find the optimal value $\hat{\theta}$ of the parameter set that maximizes the likelihood of observing the dataset $X$. This technique is called maximum likelihood estimation.
- Maximum likelihood estimation is the technique that allows us to estimate $\hat{\theta}$, the set of parameters $\theta$ of a density function $p_\theta(x)$ that is most likely to explain some observed data $X$.
In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided die rolled n times. For n independent trials, each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
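As a concrete illustration of maximum likelihood estimation for a multinomial (an illustrative numpy sketch, not from the book), the MLE of each category probability is simply its observed frequency:

import numpy as np

# Made-up counts for each side of a 6-sided die rolled n = 60 times
counts = np.array([8, 12, 9, 11, 10, 10])

# For a multinomial, the maximum likelihood estimate of each category
# probability theta_k is just the observed frequency n_k / n
theta_hat = counts / counts.sum()
print(theta_hat)  # [0.133 0.2 0.15 0.183 0.167 0.167]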
Naive Bayes
The Naive Bayes parametric model makes use of a simple assumption (the Naive Bayes assumption) that drastically reduces the number of parameters we need to estimate. It makes the naive assumption that each feature $x_j$ is independent of every other feature $x_k$. For all features $x_j, x_k$ with $j \neq k$:

$$p(x_j \mid x_k) = p(x_j)$$
To apply the Naive Bayes assumption, we first use the chain rule of probability to write the density function as a product of conditional probabilities:

$$p(x) = p(x_1 \mid x_2, \ldots, x_K)\, p(x_2 \mid x_3, \ldots, x_K) \cdots p(x_{K-1} \mid x_K)\, p(x_K)$$
Apply the Naive Bayes assumption to simplify each conditional and arrive at:

$$p(x) = \prod_{k=1}^{K} p(x_k)$$
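To make the factorization concrete, here is an illustrative numpy sketch (not the book's code) that estimates each feature's distribution independently and then samples a new observation feature by feature:

import numpy as np

# Toy dataset: 5 observations of 3 categorical features (made-up values)
X = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

# Under the Naive Bayes assumption p(x) factorizes, so each feature's
# distribution is estimated on its own and sampled independently
new_obs = []
for j in range(X.shape[1]):
    values, freq = np.unique(X[:, j], return_counts=True)
    new_obs.append(np.random.choice(values, p=freq / freq.sum()))
print(new_obs)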
The Challenges of Generative Modeling
The Naive Bayes assumption does not hold for problems where features are not independent of one another or where there is an incomprehensibly vast number of possible observations in the sample space.
Generative Modeling Challenges
- How does the model cope with the high degree of conditional dependence between features?
- How does the model find the tiny proportion of satisfying generated observations among the vast number of possibilities in a high-dimensional sample space?
Deep learning is the key to solving both of these challenges. The fact that deep learning can form its own features in a lower-dimensional space means that it is a form of representation learning.
Representation Learning
The core idea behind representation learning is that instead of trying to model the high-dimensional sample space directly, we should instead describe each observation in the training set using some low-dimensional latent space and map it to a point in the original domain. In other words, each point in the latent space is the representation of some high-dimensional image.
Deep learning gives us the ability to learn the often highly complex mapping function f in a variety of ways.
The power of representation learning is that it actually learns which features are most important for it to describe the given observations and how to generate those features from the raw data. Mathematically speaking, it tries to find the highly nonlinear manifold on which the data lies and then establish the dimensions required to fully describe this space.
In summary, representation learning establishes the most relevant high-level features that describe how groups of pixels are displayed so that it is likely that any point in the latent space is the representation of a well-formed image. By tweaking the values of features in the latent space, we can produce representations that, when mapped back to the original image domain, have a much better chance of looking real than if we'd tried to work directly with the individual raw pixels.
# Setting up the environment for this textbook
!git clone https://github.com/davidADSP/GDL_code.git
# Move into the repository and make sure that you have the most
# up-to-date version of the codebase
%cd GDL_code
!git pull
!pip install virtualenv virtualenvwrapper
# Note: each `!` line runs in its own subshell, so the export/source
# commands below are really meant to be run together in a terminal session
# The location where the virtual environments will be stored
!export WORKON_HOME=$HOME/.virtualenvs
# The default version of Python to use when a virtual environment
# is created
!export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/python3
# Reload the virtualenvwrapper script
!source /usr/local/bin/virtualenvwrapper.sh
# Install the packages that we'll be using in this book
!pip install -r requirements.txt
Deep Learning
Deep learning is a class of machine learning algorithm that uses multiple stacked layers of processing units to learn high-level representations from unstructured data.
Many types of machine learning algorithms require structured, tabular data as input, arranged into columns of features that describe each observation. Unstructured data refers to any data that is not naturally arranged into columns of features, such as images, audio, and text.
Deep Neural Networks
The majority of deep learning systems are artificial neural networks (ANNs, or just neural networks for short) with multiple stacked hidden layers. For this reason, deep learning has now almost become synonymous with deep neural networks.
A deep neural network consists of a series of stacked layers. Each layer contains units that are connected to the previous layer's units through a set of weights. As we shall see, there are many different types of layers, but one of the most common is the dense layer that connects all units in the layer directly to every unit in the previous layer. By stacking layers, the units in each subsequent layer can represent increasingly sophisticated aspects of the original input.
The magic of deep neural networks lies in finding the set of weights for each layer that results in the most accurate predictions. The process of finding these weights is what we mean by training the network.
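Schematically (a minimal numpy sketch, not the book's code), a dense layer computes a weighted sum of all the previous layer's units, plus a bias, and passes the result through a nonlinear activation:

import numpy as np

def dense_layer(x, W, b):
    z = W @ x + b            # weighted sum of the inputs plus bias, per unit
    return np.maximum(0, z)  # ReLU nonlinearity

x = np.random.rand(4)      # 4 units in the previous layer
W = np.random.randn(3, 4)  # 3 units, each connected to all 4 inputs
b = np.zeros(3)
print(dense_layer(x, W, b))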
Your First Deep Neural Network
"""
Using the CIFAR-10 dataset
A collection of 60,000 32x32 pixel color images that come
bundled by Keras out of the box. Each image is classified
into exactly one of 10 classes.
"""
import numpy as np
from keras.utils import to_categorical
from keras.datasets import cifar10
"""
Load the CIFAR-10 dataset
"""
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print("x_train Shape:",x_train.shape)
print("y_train Shape:",y_train.shape)
NUM_CLASSES=10
# Scale the images
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
# Change the integer labeling of the images to one-hot-encoded vectors
y_train = to_categorical(y_train, NUM_CLASSES)
y_test = to_categorical(y_test, NUM_CLASSES)
print("y_train Shape:",y_train.shape)
"""
Building the Model
"""
from keras.models import Sequential
from keras.layers import Flatten, Dense
"""
Building the model using the Functional API -
the functional API is recommended because it is easier
to build complex architectures with them,
"""
from keras.layers import Input, Flatten, Dense
from keras.models import Model
# Entry point into the network
# We tell the network the shape of each data element to expect as a tuple
# We don't specify the batch size
input_layer = Input(shape=(32,32,3))
# Flatten this input to a vector using a Flatten layer
x = Flatten()(input_layer)
# Dense layer is the most fundamental layer type in any
# neural network. It contains a given number of units that
# are densely connected to every unit in the previous layer.
# The output from a given unit is the weighted sum of the input it receives
# from the previous layer, which is then passed through a nonlinear activation
# function before being sent to the next layer.
x = Dense(units=200, activation = 'relu')(x)
x = Dense(units=150, activation = 'relu')(x)
output_layer = Dense(units=10, activation = 'softmax')(x)
model = Model(input_layer, output_layer)
The ReLU (Rectified Linear Unit) activation function is defined to be zero if the input is negative and otherwise equal to the input. The LeakyReLU activation function is very similar to ReLU, with one key difference: whereas the ReLU activation function returns zero for input values less than zero, the LeakyReLU function returns a small negative number proportional to the input. This helps with the problem of zero gradients: a unit whose input is always negative receives no gradient through ReLU and stops learning, whereas LeakyReLU always propagates a small gradient.
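In numpy terms (an illustrative sketch; the slope 0.3 below matches the default of the Keras LeakyReLU layer):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.3):
    # Negative inputs produce a small negative output instead of zero,
    # so the gradient never becomes exactly zero
    return np.where(x > 0, x, alpha * x)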
The sigmoid activation is useful if you wish the output from the layer to be scaled between 0 and 1 - for example, for binary classification problems with one output unit or multilabel classification problems, where each observation can belong to more than one class.
The softmax activation function is useful if you want the total sum of the output from the layer to equal 1, for example for multiclass classification problems where each observation belongs to exactly one class:

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}$$
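Again as an illustrative numpy sketch (not the book's code):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # squashes each value into (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtracting the max improves numerical stability
    return e / e.sum()         # outputs are positive and sum to 1

print(softmax(np.array([1.0, 2.0, 3.0])))  # [0.09 0.245 0.665]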
In Keras, activation functions can also be defined in separate layers:
from keras.layers import Activation

x = Dense(units=200)(x)
x = Activation('relu')(x)
# is equivalent to:
x = Dense(units=200, activation = 'relu')(x)
It is required that the shape of the Input layer matches the shape of each observation in x_train (excluding the batch dimension), and that the number of units in the final Dense layer matches the second dimension of y_train.
model.summary()
Compiling the Model
Keras provides many built-in loss functions to choose from, or you can create your own. Three of the most commonly used are mean squared error, categorical cross-entropy, and binary cross-entropy. If your model is designed to solve a regression problem, then you can use mean squared error loss.
If you are working on a classification problem where each observation only belongs to one class, then categorical cross-entropy is the correct loss function:

$$-\sum_{i=1}^{n} y_i \log(p_i)$$
If you are working on a binary classification problem with one output unit, or a multilabel problem where each observation can belong to multiple classes simultaneously, you should use binary cross-entropy:

$$-\frac{1}{n}\sum_{i=1}^{n} \left( y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right)$$
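As an illustrative numpy sketch of these two loss functions for a single observation (not the book's code; Keras computes them per observation and averages over the batch):

import numpy as np

def categorical_crossentropy(y_true, p_pred):
    # y_true is one-hot, so only the log-probability of the true class counts
    return -np.sum(y_true * np.log(p_pred))

def binary_crossentropy(y_true, p_pred):
    # y_true is 0 or 1 for each output unit
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([0, 0, 1])        # one-hot label for class 2
p_pred = np.array([0.1, 0.2, 0.7])  # predicted class probabilities
print(categorical_crossentropy(y_true, p_pred))  # -log(0.7) ≈ 0.357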
The optimizer is the algorithm that will be used to update the weights in the neural network based on the gradient of the loss function. One of the most commonly used and stable optimizers is Adam. In most cases, you shouldn't need to tweak the default parameters of the Adam optimizer, except for the learning rate. Another common optimizer that you may come across is RMSProp.
"""
Compiling the model
"""
from keras.optimizers import Adam
opt = Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
"""
Training the Model
batch_size determines how many observations will be passed
to the network at each training step.
epochs determines how many times the network will be shown the
full training data.
If shuffle=True, the batches will be drawn randomly without replacement from the training data at each training step.
"""
model.fit(x_train,y_train,batch_size=32,epochs=10,shuffle=True)
At each training step, one batch of images is passed through the network and the errors are backpropagated to update the weights. The batch_size determines how many images are in each training step batch. The larger the batch size, the more stable the gradient calculation, but the slower each training step. A batch size between 32 and 256 is generally used. It is also now recommended practice to increase the batch size as training progresses. This continues until all observations in the dataset have been seen once, which completes the first epoch; with 50,000 training images and batch_size=32, that is ceil(50,000 / 32) = 1,563 training steps per epoch. The process then repeats until the specified number of epochs has elapsed.
Evaluating the Model
Use the evaluate method provided by Keras. The output for this method is the list of metrics we are monitoring, in this case categorical cross-entropy and accuracy.
model.evaluate(x_test, y_test)
CLASSES = np.array(['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'])
# preds is an array of shape [10_000, 10] - a vector of 10 class probabilities
# for each observation
preds = model.predict(x_test)
# Convert the array of probabilities back into a single class prediction
# using the argmax function. preds_single.shape = (10_000,)
preds_single = CLASSES[np.argmax(preds, axis = -1)]
actual_single = CLASSES[np.argmax(y_test, axis = -1)]
"""
View Images Alongside their labels and predictions
"""
import matplotlib.pyplot as plt
n_to_show = 10
indices = np.random.choice(range(len(x_test)), n_to_show)
fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i, idx in enumerate(indices):
img = x_test[idx]
ax = fig.add_subplot(1, n_to_show, i+1)
ax.axis('off')
ax.text(0.5, -0.35, 'pred = ' + str(preds_single[idx]), fontsize=10, ha='center', transform=ax.transAxes)
ax.text(0.5, -0.7, 'act = ' + str(actual_single[idx]), fontsize=10, ha='center', transform=ax.transAxes)
ax.imshow(img)
plt.show()
Improving the Model
One of the reasons the network is not performing as well as it might is that there isn't anything in the network that takes into account the spatial structure of the input images. To take this structure into account, we need to use convolutional layers.
Convolutional Layers
Consider a 3 x 3 x 1 portion of a grayscale image being convolved with a 3 x 3 x 1 filter (or kernel).
The convolution is performed by multiplying the filter pixelwise with the portion of the image and summing the result. The output is more positive when the portion of the image closely matches the filter and more negative when the portion of the image is the inverse of the filter.
If we move the filter across the entire image, from left to right and top to bottom, recording the convolutional output as we go, we obtain a new array that picks out a particular feature of the input, depending on the values in the filter. This is exactly what a convolutional layer is designed to do, but with multiple filters rather than just one. If we are working with color images, then each filter would have three channels rather than one (i.e., each having shape 3 x 3 x 3) to match the three channels (red, green, blue) of the image.
Strides
The strides parameter is the step size used by the layer to move the filters across the input. Increasing the stride therefore reduces the size of the output tensor. When strides = 2, the height and width of the output tensor will be half the size of the input tensor.
Padding
The padding = "same" input parameter pads the input data with zeros so that the output size from the layer is exactly the same as the input size when strides = 1.
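To make the mechanics concrete, here is a minimal single-channel numpy sketch (not the book's code) of a convolution with configurable stride and zero padding:

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Zero-pad the image, then slide the kernel across it from left to
    # right and top to bottom, recording the pixelwise product-and-sum
    image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(6, 6)
kernel = np.random.randn(3, 3)
# stride 2 with one pixel of zero padding halves the spatial dimensions
print(conv2d(image, kernel, stride=2, padding=1).shape)  # (3, 3)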
"""
Convolutional Model
"""
from keras.layers import Conv2D
input_layer = Input(shape=(32,32,3))
conv_layer_1 = Conv2D(filters = 10, kernel_size = (4,4), strides = 2, padding = 'same')(input_layer)
conv_layer_2 = Conv2D(filters = 20, kernel_size = (3,3), strides = 2, padding = 'same')(conv_layer_1)
flatten_layer = Flatten()(conv_layer_2)
output_layer = Dense(units=10, activation = 'softmax')(flatten_layer)
model = Model(input_layer, output_layer)
model.summary()
The output of a Conv2D layer is another four-dimensional tensor, now of shape (batch_size, height, width, filters), so we can stack Conv2D layers on top of each other to grow the depth of our network. It is really important to understand how the shape of the tensor changes as data flows through from one convolutional layer to the next.
The depth of the filters in a layer is always the same as the number of channels in the preceding layer. The number of parameters in a convolutional layer is equal to (filter width x filter height x input channels + 1 bias term) x number of filters. In general, the shape of the output from a convolutional layer with padding = "same" is:

(batch_size, ceil(input height / stride), ceil(input width / stride), filters)
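As a worked check of these formulas against the model defined above (these numbers can be compared with the output of model.summary()):

# conv_layer_1: output (None, 16, 16, 10), params = (4*4*3 + 1) * 10 = 490
# conv_layer_2: output (None, 8, 8, 20),   params = (3*3*10 + 1) * 20 = 1,820
# flatten_layer: 8 * 8 * 20 = 1,280 units
# output_layer:  params = (1280 + 1) * 10 = 12,810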
Batch Normalization
One common problem when training a deep neural network is ensuring that the weights of the network remain within a reasonable range of values. If they start to become too large, this is a sign that your network is suffering from what is known as the exploding gradients problem. As errors are propagated backward through the network, the calculation of the gradient in the earlier layers can sometimes grow exponentially large, causing wild fluctuations in the weight values. The exploding gradients problem doesn't necessarily appear early in training; it can surface at any point.
One of the reasons for scaling input data into a neural network is to ensure a stable start to training over the first few iterations. As training proceeds and the weights move farther and farther away from their random initial values, the distribution of the inputs to each layer can drift; this phenomenon is known as covariate shift.
Batch normalization is a solution that drastically reduces the exploding gradients problem. A batch normalization layer calculates the mean and standard deviation of each of its input channels across the batch and normalizes by subtracting the mean and dividing by the standard deviation. There are then two learned (trainable) parameters for each channel, the scale (gamma) and shift (beta). The output is simply the normalized input, scaled by gamma and shifted by beta.
We can place batch normalization layers after dense or convolutional layers to normalize the output from those layers. During training, a batch normalization layer also calculates a moving average of the mean and standard deviation of each channel and stores these values as part of the layer, to use at test time.
The calculated mean and standard deviation are called nontrainable parameters because they are derived from data passing through the layer rather than trained through backpropagation.
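A minimal numpy sketch of the training-time computation (illustrative only; Keras handles the channel axes and moving statistics internally):

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    # x has shape (batch_size, channels); statistics are per channel
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize across the batch
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.randn(8, 4)  # batch of 8 observations, 4 channels
print(batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4)).shape)  # (8, 4)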
In Keras, the BatchNormalization layer implements the batch normalization functionality:
BatchNormalization(momentum = 0.9)
The momentum parameter is the weight given to the previous value when calculating the moving average and moving standard deviation.
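Schematically, the update rule implied by this description (not the exact Keras internals):

def update_moving_average(moving, batch_value, momentum=0.9):
    # Higher momentum gives more weight to the previously stored value
    return momentum * moving + (1 - momentum) * batch_value

print(update_moving_average(moving=0.5, batch_value=0.8))  # ≈ 0.53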
Dropout Layers
Any successful machine learning algorithm must ensure that it generalizes to unseen data, rather than simply remembering the training dataset. If an algorithm performs well on the training dataset, but not the test dataset, we say that it is suffering from overfitting. To counteract this problem, we use regularization techniques, which ensure that the model is penalized if it starts to overfit.
There are many ways to regularize a machine learning algorithm, but for deep learning, one of the most common is by using dropout layers. During training, each dropout layer chooses a random set of units from the preceding layer and sets their output to zero. Incredibly, this simple addition drastically reduces overfitting by ensuring that the network doesn't become overdependent on certain units or groups of units that, in effect, just remember observations from the training set.
The Dropout layer in Keras implements this functionality, with the rate parameter specifying the proportion of units to drop from the preceding layer: Dropout(rate=0.25). Dropout layers are most commonly used after Dense layers since these are most prone to overfitting due to their higher number of weights.
Batch normalization also has been shown to reduce overfitting, and therefore many modern deep learning architectures don't use dropout at all, and rely solely on batch normalization for regularization.
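A minimal numpy sketch of what a dropout layer does at training time (illustrative; this uses the "inverted dropout" convention, where outputs are scaled up during training so nothing needs to change at test time):

import numpy as np

def dropout_train(x, rate=0.25):
    mask = np.random.rand(*x.shape) >= rate  # keep each unit with probability 1 - rate
    return x * mask / (1 - rate)             # rescale so the expected output is unchanged

x = np.random.rand(5)
print(dropout_train(x))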
Putting it all Together
"""
Putting it all Together
- The order in which to use BatchNormalization and Activation layers is a matter of preference.
"""
from keras.layers import BatchNormalization, Dropout, LeakyReLU, Activation
input_layer = Input((32,32,3))
x = Conv2D(filters = 32, kernel_size = 3, strides = 1, padding = 'same')(input_layer)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 32, kernel_size = 3, strides = 2, padding = 'same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 64, kernel_size = 3, strides = 1, padding = 'same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 64, kernel_size = 3, strides = 2, padding = 'same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
x = Flatten()(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
x = Dropout(rate = 0.5)(x)
x = Dense(NUM_CLASSES)(x)
output_layer = Activation('softmax')(x)
model = Model(input_layer, output_layer)
model.summary()
opt = Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
model.fit(x_train,y_train,batch_size=32,epochs=10,shuffle=True)
model.evaluate(x_test,y_test,batch_size=1000)