Generative Deep Learning - Generative Modeling and Deep Learning

Jupyter notebook for chapters 1 and 2 of Generative Deep Learning by David Foster. These first two chapters go over some generative modeling terminology and give a quick overview of deep learning - Dense Layers, CNNs, etc.


Chapter 1: Generative Modeling

What is Generative Modeling?

A generative model describes how a dataset is generated, in terms of a probabilistic model. By sampling from this model, we are able to generate new data.

Generative Modeling Process: First, we require a dataset consisting of many examples of the entity we are trying to generate. This is known as the training data, and one such data point is called an observation.

The Generative Modeling Process

Each observation consists of many features. It is our goal to build a model that can generate new sets of features that look as if they have been created using the same rules as the original data. A generative model must also be probabilistic rather than deterministic. The model must include a stochastic (random) element that influences the individual samples generated by the model.

Generative Versus Discriminative Modeling

Most machine learning problems are discriminative modeling. The image below shows the discriminative modeling process.

Discriminative Modeling

When performing discriminative modeling, each observation in the training data has a label. Discriminative modeling is synonymous with supervised learning, or learning a function that maps an input to an output using a labeled dataset. Generative modeling is usually performed with an unlabeled dataset.


Discriminative modeling estimates $p(y \mid \textbf{x})$, the probability of a label $y$ given observation $\textbf{x}$.

Generative modeling estimates $p(\textbf{x})$, the probability of observing observation $\textbf{x}$. If the dataset is labeled, we can also build a generative model that estimates the distribution $p(\textbf{x} \mid y)$.

Advances in Machine Learning

Generative modeling is harder to evaluate than discriminative modeling. Discriminative modeling has also historically been more readily applicable to business problems than generative modeling.

The Rise of Generative Modeling

Improvement of Face Generation

We should not be content with only being able to categorize data but should also seek a more complete understanding of how the data was generated in the first place. It is highly likely that generative modeling will be central to driving future developments in other fields of machine learning, such as reinforcement learning (the study of teaching agents to optimize a goal in an environment through trial and error). Current neuroscientific theory suggests that our perception of reality is a generative model that is trained from birth to produce simulations of our surroundings that accurately match the future.

The Generative Modeling Framework

  • We have a dataset of observations $\textbf{X}$.
  • We assume that the observations have been generated according to some unknown distribution, $p_{data}$.
  • A generative model $p_{model}$ tries to mimic $p_{data}$. If we achieve this goal, we can sample from $p_{model}$ to generate observations that appear to have been drawn from $p_{data}$.
  • We are impressed by $p_{model}$ if:
    • Rule 1: It can generate examples that appear to have been drawn from $p_{data}$.
    • Rule 2: It can generate examples that are suitably different from the observations in $\textbf{X}$. In other words, the model shouldn't simply reproduce things it has already seen.

Probabilistic Generative Models

Terms:

  • The sample space is the complete set of all values an observation $\textbf{x}$ can take.
  • A probability density function (or simply density function), $p(\textbf{x})$, is a function that maps a point $\textbf{x}$ in the sample space to a non-negative number (a probability between 0 and 1 when the sample space is discrete). The integral of the density function over all points in the sample space (a sum, in the discrete case) must equal 1, so that it is a well-defined probability distribution.

While there is only one true density function $p_{data}$ that is assumed to have generated the observable dataset, there are infinitely many density functions $p_{model}$ that we can use to estimate $p_{data}$.

  • A parametric model, $p_{\theta}(\textbf{x})$, is a family of density functions that can be described using a finite number of parameters, $\theta$.
  • The likelihood $\mathscr{L}(\theta \mid \textbf{x})$ of a parameter set $\theta$ is a function that measures the plausibility of $\theta$, given some observed point $\textbf{x}$. It is defined as $\mathscr{L}(\theta \mid \textbf{x}) = p_{\theta}(\textbf{x})$. That is, the likelihood of $\theta$ given some observed point $\textbf{x}$ is defined to be the value of the density function parameterized by $\theta$, at the point $\textbf{x}$. In other words, we define the likelihood of a set of parameters $\theta$ to be equal to the probability of seeing the data under the model parameterized by $\theta$.

The focus of parametric modeling should be to find the optimal value $\hat{\theta}$ of the parameter set that maximizes the likelihood of observing the dataset $\textbf{X}$. This technique is called maximum likelihood estimation.

  • Maximum likelihood estimation is a technique that allows us to estimate $\hat{\theta}$, the set of parameters $\theta$ of a density function, $p_{\theta}(\textbf{x})$, that are most likely to explain some observed data $\textbf{X}$.

$$\hat{\theta} = \underset{\theta}{\text{argmax}} \ \mathscr{L}(\theta \mid \textbf{X})$$
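As a minimal sketch of maximum likelihood estimation, assume a hypothetical one-parameter model that is not taken from the book: a biased coin where $\theta$ is the probability of heads. The data and function names below are illustrative. The sketch evaluates the log-likelihood on a grid of candidate values and compares the maximizer with the closed-form answer for this model (the fraction of heads).

```python
import math

# Hypothetical observations: 1 = heads, 0 = tails (illustrative data only).
observations = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def log_likelihood(theta, data):
    """Log of L(theta | X) for independent Bernoulli observations."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)


# Evaluate the log-likelihood on a grid of candidates, excluding 0 and 1
# so that the logarithm is always defined.
candidates = [i / 1000 for i in range(1, 1000)]
theta_hat = max(candidates, key=lambda t: log_likelihood(t, observations))

# For this simple model the maximizer has a closed form: the fraction of heads.
closed_form = sum(observations) / len(observations)
print(theta_hat, closed_form)
```

Working with the log-likelihood rather than the likelihood itself is a common choice: the logarithm turns the product over observations into a sum and does not change which $\theta$ is the maximizer.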

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided die rolled n times. For n independent trials, each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

Naive Bayes

The Naive Bayes parametric model makes use of a simple assumption (the Naive Bayes assumption) that drastically reduces the number of parameters we need to estimate. It makes the naive assumption that each feature $x_j$ is independent of every other feature $x_k$. For all features $x_j$, $x_k$ with $j \neq k$:

$$p(x_j \mid x_k) = p(x_j)$$

To apply the Naive Bayes assumption, we first use the chain rule of probability to write the density function as a product of conditional probabilities:

$$\begin{align}
p(\textbf{x}) &= p(x_1, \ldots, x_K) \\[0.25em]
&= p(x_2, \ldots, x_K \mid x_1)\, p(x_1) \\[0.25em]
&= p(x_3, \ldots, x_K \mid x_1, x_2)\, p(x_2 \mid x_1)\, p(x_1) \\[0.25em]
&= \prod_{k=1}^K p(x_k \mid x_1, \ldots, x_{k-1})
\end{align}$$

Apply the Naive Bayes assumption to simplify the last line and arrive at:

$$p(\textbf{x}) = \prod_{k=1}^K p(x_k)$$
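The factorized density above can be sketched in a few lines. This is an illustration under stated assumptions, not the book's code: the tiny dataset, the feature values, and the helper names `density` and `sample` are all hypothetical. Each marginal $p(x_k)$ is estimated by relative frequency, the density of an observation is the product of its marginals, and new observations are generated by sampling each feature independently.

```python
import random
from collections import Counter

# Hypothetical dataset: each observation has K = 3 categorical features
# (illustrative values only, not the dataset used in the book).
X = [
    ("red", "round", "small"),
    ("red", "square", "small"),
    ("blue", "round", "large"),
    ("red", "round", "large"),
]
K = len(X[0])

# Maximum likelihood estimate of each marginal p(x_k): relative frequencies.
marginals = []
for k in range(K):
    counts = Counter(obs[k] for obs in X)
    marginals.append({value: c / len(X) for value, c in counts.items()})


def density(obs):
    """p(x) = prod_k p(x_k) under the Naive Bayes assumption."""
    p = 1.0
    for k, value in enumerate(obs):
        p *= marginals[k].get(value, 0.0)
    return p


def sample(rng):
    """Draw each feature independently from its estimated marginal."""
    return tuple(
        rng.choices(list(m.keys()), weights=list(m.values()))[0]
        for m in marginals
    )


rng = random.Random(0)
print(density(("blue", "square", "small")))  # a combination not in X
print(sample(rng))
```

Note that the model assigns non-zero probability to a combination it has never seen, which is what allows it to generate observations that differ from the training data.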

The Challenges of Generative Modeling

The Naive Bayes assumption does not hold for problems where features are not independent of one another or where there is an incomprehensibly vast number of possible observations in the sample space.

Generative Modeling Challenges

  • How does the model cope with the high degree of conditional dependence between features?
  • How does the model find one of the tiny proportion of satisfying possible generated observations among a high-dimensional sample space?

Deep learning is the key to solving both of these challenges. The fact that deep learning can form its own features in a lower-dimensional space means that it is a form of representation learning.

Representation Learning

The core idea behind representation learning is that instead of trying to model the high-dimensional sample space directly, we should instead describe each observation in the training set using some low-dimensional latent space and map it to a point in the original domain. In other words, each point in the latent space is the representation of some high-dimensional image.

Deep learning gives us the ability to learn the often highly complex mapping function $f$ in a variety of ways.

Latent Space Mapping

The power of representation learning is that it actually learns which features are most important for it to describe the given observations and how to generate those features from the raw data. Mathematically speaking, it tries to find the highly nonlinear manifold on which the data lies and then establish the dimensions required to fully describe this space.

Idea of Manifolds

In summary, representation learning establishes the most relevant high-level features that describe how groups of pixels are displayed so that it is likely that any point in the latent space is the representation of a well-formed image. By tweaking the values of features in the latent space, we can produce representations that, when mapped back to the original image domain, have a much better chance of looking real than if we'd tried to work directly with the individual raw pixels.

# Setting up the environment for this textbook
!git clone https://github.com/davidADSP/GDL_code.git
# Make sure that you have the most up-to-date version of the codebase.
# Each `!` command runs in its own subshell, so change into the repo first.
!cd GDL_code && git pull
!pip install virtualenv virtualenvwrapper
# The location where the virtual environments will be stored
!export WORKON_HOME=$HOME/.virtualenvs
# The default version of Python to use when a virtual environment
# is created
!export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/python3
# Reloads the virtualenvwrapper script
!source /usr/local/bin/virtualenvwrapper.sh
# Install the packages that we'll be using in this book
# (requirements.txt lives inside the cloned repository)
!pip install -r GDL_code/requirements.txt
out[2]

Cloning into 'GDL_code'...
remote: Enumerating objects: 394, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 394 (delta 0), reused 1 (delta 0), pack-reused 391 (from 1)
Receiving objects: 100% (394/394), 22.13 MiB | 16.10 MiB/s, done.
Resolving deltas: 100% (237/237), done.
fatal: not a git repository (or any of the parent directories): .git
Collecting virtualenv
Downloading virtualenv-20.26.4-py3-none-any.whl.metadata (4.5 kB)
Collecting virtualenvwrapper
Downloading virtualenvwrapper-6.1.0-py3-none-any.whl.metadata (5.1 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
Downloading distlib-0.3.8-py2.py3-none-any.whl.metadata (5.1 kB)
Requirement already satisfied: filelock<4,>=3.12.2 in /usr/local/lib/python3.10/dist-packages (from virtualenv) (3.16.0)
Requirement already satisfied: platformdirs<5,>=3.9.1 in /usr/local/lib/python3.10/dist-packages (from virtualenv) (4.3.2)
Collecting virtualenv-clone (from virtualenvwrapper)
Downloading virtualenv_clone-0.5.7-py3-none-any.whl.metadata (2.7 kB)
Collecting stevedore (from virtualenvwrapper)
Downloading stevedore-5.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting pbr>=2.0.0 (from stevedore->virtualenvwrapper)
Downloading pbr-6.1.0-py2.py3-none-any.whl.metadata (3.4 kB)
Downloading virtualenv-20.26.4-py3-none-any.whl (6.0 MB)
Downloading virtualenvwrapper-6.1.0-py3-none-any.whl (22 kB)
Downloading distlib-0.3.8-py2.py3-none-any.whl (468 kB)
Downloading stevedore-5.3.0-py3-none-any.whl (49 kB)
Downloading virtualenv_clone-0.5.7-py3-none-any.whl (6.6 kB)
Downloading pbr-6.1.0-py2.py3-none-any.whl (108 kB)
Installing collected packages: distlib, virtualenv-clone, virtualenv, pbr, stevedore, virtualenvwrapper
Successfully installed distlib-0.3.8 pbr-6.1.0 stevedore-5.3.0 virtualenv-20.26.4 virtualenv-clone-0.5.7 virtualenvwrapper-6.1.0
virtualenvwrapper.user_scripts creating /root/.virtualenvs/premkproject
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postmkproject
virtualenvwrapper.user_scripts creating /root/.virtualenvs/initialize
virtualenvwrapper.user_scripts creating /root/.virtualenvs/premkvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postmkvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/prermvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postrmvirtualenv
virtualenvwrapper.user_scripts creating /root/.virtualenvs/predeactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postdeactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/preactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/postactivate
virtualenvwrapper.user_scripts creating /root/.virtualenvs/get_env_details
ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


Deep Learning

Deep learning is a class of machine learning algorithm that uses multiple stacked layers of processing units to learn high-level representations from unstructured data.

Many types of machine learning algorithms require structured, tabular data as input, arranged into columns of features that describe each observation. Unstructured data refers to any data that is not naturally arranged into columns of features, such as images, audio, and text.

Deep Neural Networks

The majority of deep learning systems are artificial neural networks (ANNs, or just neural networks for short) with multiple stacked hidden layers. For this reason, deep learning has now almost become synonymous with deep neural networks.

A deep neural network consists of a series of stacked layers. Each layer contains units that are connected to the previous layer's units through a set of weights. As we shall see, there are many different types of layer, but one of the most common is the dense layer, which connects all units in the layer directly to every unit in the previous layer. By stacking layers, the units in each subsequent layer can represent increasingly sophisticated aspects of the original input.

The magic of deep neural networks lies in finding the set of weights for each layer that results in the most accurate predictions. The process of finding these weights is what we mean by training the network.

Your First Deep Neural Network

"""
Using the CIFAR-10 dataset
A collection of 60,000 32x32 pixel color images that come
bundled with Keras out of the box. Each image is classified
into exactly one of 10 classes.
"""
import numpy as np
from keras.utils import to_categorical
from keras.datasets import cifar10
"""
Load the CIFAR-10 dataset
"""
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print("x_train Shape:",x_train.shape)
print("y_train Shape:",y_train.shape)
NUM_CLASSES=10

# Scale the images
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Change the integer labeling of the images to one-hot-encoded vectors
y_train = to_categorical(y_train, NUM_CLASSES)
y_test = to_categorical(y_test, NUM_CLASSES)

out[4]

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170498071/170498071 ━━━━━━━━━━━━━━━━━━━━ 13s 0us/step
x_train Shape: (50000, 32, 32, 3)
y_train Shape: (50000, 1)

print("y_train Shape:",y_train.shape)
out[5]

y_train Shape: (50000, 10)
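The change in label shape from (50000, 1) to (50000, 10) is the effect of one-hot encoding. As a rough illustration of what that conversion does, the sketch below reproduces it with plain NumPy on three made-up labels. It is an equivalent formulation chosen for clarity, not necessarily how `to_categorical` is implemented.

```python
import numpy as np

# Integer labels shaped (n, 1), the same layout as the raw CIFAR-10 labels.
labels = np.array([[3], [0], [9]])
num_classes = 10

# Index the identity matrix by label: row i is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels.ravel()]

print(labels.shape, "->", one_hot.shape)
print(one_hot[0])
```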

"""
Building the Model
"""
from keras.models import Sequential
from keras.layers import Flatten, Dense

"""
Building the model using the Functional API -
the Functional API is recommended because it is easier
to build complex architectures with it.
"""
from keras.layers import Input, Flatten, Dense
from keras.models import Model
# Entry point into the network
# We tell the network the shape of each data element to expect as a tuple
# We don't specify the batch size
input_layer = Input(shape=(32,32, 3))
# Flatten this input to a vector using a Flatten layer
x = Flatten()(input_layer)
# The Dense layer is the most fundamental layer type in any
# neural network. It contains a given number of units that
# are densely connected to every unit in the previous layer.
# The output from a given unit is the weighted sum of the input it receives
# from the previous layer, which is then passed through a nonlinear activation
# function before being sent to the next layer.
x = Dense(units=200, activation = 'relu')(x)
x = Dense(units=150, activation = 'relu')(x)
output_layer = Dense(units=10, activation = 'softmax')(x)
model = Model(input_layer, output_layer)
out[6]

The ReLU (Rectified Linear Unit) activation function is defined to be zero if the input is negative and is otherwise equal to the input. The LeakyReLU activation function is very similar to ReLU, with one key difference: whereas the ReLU activation function returns zero for input values less than zero, the LeakyReLU function returns a small negative number proportional to the input. A ReLU unit that always outputs zero has a zero gradient, so no error is propagated back through it and it stops learning. LeakyReLU mitigates this by keeping the gradient nonzero for negative inputs.
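To make the difference concrete, here is a minimal NumPy sketch of both functions. The function names and the slope value of 0.2 are assumptions chosen for illustration; Keras exposes its own implementations with their own defaults.

```python
import numpy as np


def relu(x):
    """Zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)


def leaky_relu(x, negative_slope=0.2):
    """Like ReLU, but negative inputs are scaled by a small slope
    (0.2 here is an assumed illustrative value) instead of being zeroed."""
    return np.where(x > 0, x, negative_slope * x)


x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))
print(leaky_relu(x))
```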

The sigmoid activation is useful if you wish the output from the layer to be scaled between 0 and 1, for example, for binary classification problems with one output unit or multilabel classification problems, where each observation can belong to more than one class.

The softmax activation function is useful if you want the total sum of the output from the layer to equal 1, for example, for multiclass classification problems where each observation only belongs to exactly one class:

$$y_i = \cfrac{e^{x_i}}{\sum_{j=1}^J e^{x_j}}$$

where $J$ is the total number of units in the layer.
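The sigmoid and softmax activations can be sketched in NumPy as follows. The function names are illustrative, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result of the softmax formula.

```python
import numpy as np


def sigmoid(x):
    """Squash each value independently into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    """Exponentiate and normalize so the outputs sum to 1."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())
print(sigmoid(logits))
```

Sigmoid outputs are independent of one another, so several can be close to 1 at once (multilabel), whereas softmax outputs compete for a total of 1 (exactly one class).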

In Keras, activation functions can also be defined in separate layers:

from keras.layers import Activation, Dense

x = Dense(units=200)(x)
x = Activation('relu')(x)
# is equivalent to:
x = Dense(units=200, activation = 'relu')(x)

The shape of the Input layer must match the shape of each observation in x_train, and the shape of the Dense output layer must match the shape of each label in y_train (the batch dimension is not specified in either case).

model.summary()
out[8]

Model: "functional"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Layer (type)               ┃ Output Shape       ┃  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ input_layer (InputLayer)   │ (None, 32, 32, 3)  │        0 │
├────────────────────────────┼────────────────────┼──────────┤
│ flatten (Flatten)          │ (None, 3072)       │        0 │
├────────────────────────────┼────────────────────┼──────────┤
│ dense (Dense)              │ (None, 200)        │  614,600 │
├────────────────────────────┼────────────────────┼──────────┤
│ dense_1 (Dense)            │ (None, 150)        │   30,150 │
├────────────────────────────┼────────────────────┼──────────┤
│ dense_2 (Dense)            │ (None, 10)         │    1,510 │
└────────────────────────────┴────────────────────┴──────────┘

 Total params: 646,260 (2.47 MB)
 Trainable params: 646,260 (2.47 MB)
 Non-trainable params: 0 (0.00 B)
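The parameter counts in the summary can be checked by hand: each unit in a dense layer has one weight per input plus one bias term. The short sketch below (the helper name `dense_params` is illustrative) reproduces the numbers reported for the three Dense layers.

```python
def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units


# Flattened 32 x 32 x 3 image, then the three Dense layers of the model.
layer_sizes = [32 * 32 * 3, 200, 150, 10]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)
print(sum(counts))
```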

Compiling the Model

Keras provides many built-in loss functions to choose from, or you can create your own. Three of the most commonly used are mean squared error, categorical cross-entropy, and binary cross-entropy. If your model is designed to solve a regression problem, then you can use mean squared error loss.

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - p_i)^2$$

If you are working on a classification problem where each observation only belongs to one class, then categorical cross-entropy is the correct loss function:

$$-\sum_{i=1}^n y_i \log(p_i)$$

If you are working on a binary classification problem with one output unit, or a multilabel problem where each observation can belong to multiple classes simultaneously, you should use binary cross-entropy:

$$-\frac{1}{n}\sum_{i=1}^n \left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right)$$
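The three loss formulas above translate directly into NumPy. This is a sketch for a single observation (or a single vector of outputs), with illustrative function names. The `eps` clipping is an added safeguard against taking the logarithm of zero and is not part of the formulas themselves.

```python
import numpy as np


def mean_squared_error(y, p):
    """Average squared difference between targets y and predictions p."""
    return np.mean((y - p) ** 2)


def categorical_cross_entropy(y, p, eps=1e-7):
    """y is one-hot, p is a vector of class probabilities."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(y * np.log(p))


def binary_cross_entropy(y, p, eps=1e-7):
    """y holds 0/1 targets, p holds independent probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


y_onehot = np.array([0.0, 1.0, 0.0])
p_softmax = np.array([0.1, 0.7, 0.2])
print(categorical_cross_entropy(y_onehot, p_softmax))  # equals -log(0.7)

y_multi = np.array([1.0, 0.0, 1.0])
p_sigmoid = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y_multi, p_sigmoid))

print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```

Because the target is one-hot, only the predicted probability of the true class contributes to the categorical cross-entropy.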

The optimizer is the algorithm that will be used to update the weights in the neural network based on the gradient of the loss function. One of the most commonly used and stable optimizers is Adam. In most cases, you shouldn't need to tweak the default parameters of the Adam optimizer, except for the learning rate. Another common optimizer that you may come across is RMSProp.

"""
Compiling the model
"""
from keras.optimizers import Adam
opt = Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
out[10]
"""
Training the Model

batch_size determines how many observations will be passed
to the network at each training step.
epochs determines how many times the network will be shown the
full training data.
If shuffle=True, the batches will be drawn randomly without
replacement from the training data at each training step.
"""
model.fit(x_train,y_train,batch_size=32,epochs=10,shuffle=True)
out[11]

Epoch 1/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 12s 4ms/step - accuracy: 0.2940 - loss: 1.9548
Epoch 2/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.3936 - loss: 1.6942
Epoch 3/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.4321 - loss: 1.5947
Epoch 4/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.4504 - loss: 1.5386
Epoch 5/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.4664 - loss: 1.5031
Epoch 6/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.4802 - loss: 1.4623
Epoch 7/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.4905 - loss: 1.4263
Epoch 8/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.4989 - loss: 1.4074
Epoch 9/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.5104 - loss: 1.3845
Epoch 10/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.5158 - loss: 1.3638

<keras.src.callbacks.history.History at 0x7aab4d4e11e0>

At each training step, one batch of images is passed through the network and errors are backpropagated to update the weights. The batch_size determines how many images are in each training step batch. The larger the batch size, the more stable the gradient calculation, but the slower each training step. A batch size between 32 and 256 is generally used. Some practitioners also recommend increasing the batch size as training progresses.

This continues until all observations in the dataset have been seen once. This completes the first epoch. The process repeats until the specified number of epochs has elapsed.
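The figure of 1563 steps per epoch in the training log follows from the dataset size and the batch size. The sketch below works this out and shows, in simplified form, how one epoch can be split into shuffled batches drawn without replacement. The helper `iterate_batches` is a hypothetical illustration, not how Keras implements batching internally.

```python
import math
import random

n_observations = 50_000
batch_size = 32

# The final batch is smaller because 50,000 is not a multiple of 32.
steps_per_epoch = math.ceil(n_observations / batch_size)
print(steps_per_epoch)


def iterate_batches(n, batch_size, rng):
    """Yield lists of indices covering 0..n-1 once, in random order."""
    indices = list(range(n))
    rng.shuffle(indices)
    for start in range(0, n, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)
batches = list(iterate_batches(n_observations, batch_size, rng))
print(len(batches), len(batches[0]), len(batches[-1]))
```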

Evaluating the Model

Use the evaluate method provided by Keras. The output for this method is the list of metrics we are monitoring, in this case categorical cross-entropy and accuracy.

model.evaluate(x_test, y_test)
out[13]

313/313 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.4949 - loss: 1.4124

[1.4161943197250366, 0.4952999949455261]

CLASSES = np.array(['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'])
# preds is an array of shape (10_000, 10) - a vector of 10 class probabilities
# for each observation
preds = model.predict(x_test)
# Convert array of probabilities back into single prediction using
# argmax function. preds_single.shape = (10_000,)
preds_single = CLASSES[np.argmax(preds, axis = -1)]

actual_single = CLASSES[np.argmax(y_test, axis = -1)]
out[14]

313/313 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step
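The argmax step can be illustrated on a toy example. The arrays below are made-up stand-ins for `preds` and `y_test` with three classes instead of ten. The sketch also shows how accuracy follows from comparing predicted and actual labels.

```python
import numpy as np

# Made-up class names, predicted probabilities, and one-hot labels.
classes = np.array(["airplane", "automobile", "bird"])
toy_preds = np.array([[0.1, 0.7, 0.2],
                      [0.6, 0.3, 0.1],
                      [0.2, 0.2, 0.6]])
toy_y = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1]])

# argmax over the last axis picks the most probable class per observation.
pred_labels = classes[np.argmax(toy_preds, axis=-1)]
true_labels = classes[np.argmax(toy_y, axis=-1)]

accuracy = np.mean(pred_labels == true_labels)
print(pred_labels, true_labels, accuracy)
```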

"""
View Images Alongside their labels and predictions
"""
import matplotlib.pyplot as plt
n_to_show = 10
indices = np.random.choice(range(len(x_test)), n_to_show)
fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i, idx in enumerate(indices):
  img = x_test[idx]
  ax = fig.add_subplot(1, n_to_show, i+1)
  ax.axis('off')
  ax.text(0.5, -0.35, 'pred = ' + str(preds_single[idx]), fontsize=10, ha='center', transform=ax.transAxes)
  ax.text(0.5, -0.7, 'act = ' + str(actual_single[idx]), fontsize=10, ha='center', transform=ax.transAxes)
  ax.imshow(img)
plt.show()
out[15]

<Figure size 1500x300 with 10 Axes>

Improving the Model

One of the reasons the network is not performing as well as it might is because there isn't anything in the network that takes into account the spatial structure of the input images. To do this, we need to use a convolutional layer.

Convolutional Layers

The image below shows a 3 x 3 x 1 portion of a grayscale image being convolved with a 3 x 3 x 1 filter (or kernel).

The Convolution Operator

The convolution is performed by multiplying the filter pixelwise with the portion of the image, and summing the result. The output is more positive when the portion of the image closely matches the filter and more negative when the portion of the image is the inverse of the filter.

If we move the filter across the entire image, from left to right and top to bottom, recording the convolutional output as we go, we obtain a new array that picks out a particular feature of the input, depending on the values in the filter. This is exactly what a convolutional layer is designed to do, but with multiple filters rather than just one. If we are working with color images, then each filter would have three channels rather than one (i.e., each having shape 3 x 3 x 3) to match the three channels (red, green, blue) of the image.
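The sliding multiply-and-sum operation described above can be sketched with NumPy for a single-channel image and a single filter. The function name, the edge-detecting filter, and the toy image are illustrative assumptions. As in deep learning libraries, the filter is applied without flipping, and no padding is used here.

```python
import numpy as np


def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, multiplying elementwise and summing."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            output[i, j] = np.sum(patch * kernel)
    return output


# Hypothetical filter that responds to a bright-above-dark horizontal edge.
kernel = np.array([[1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0],
                   [-1.0, -1.0, -1.0]])

# Toy 5 x 5 grayscale image: two bright rows on top, dark rows below.
image = np.zeros((5, 5))
image[:2, :] = 1.0

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The output is largest where the patch contains the bright-to-dark transition and zero over the uniformly dark region, which is the sense in which a filter picks out a particular feature.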

Convolutional Filter Applied to Grayscale Image

Strides

The strides parameter is the step size used by the layer to move the filters across the input. Increasing the stride therefore reduces the size of the output tensor. When strides = 2, the height and width of the output tensor will be half the size of the input tensor.

Padding

The padding = "same" input parameter pads the input data with zeros so that the output size from the layer is exactly the same as the input size when strides = 1.

Same Padding

"""
Convolutional Model
"""
# Import from the same `keras` package used for the earlier layers
from keras.layers import Conv2D
input_layer = Input(shape=(32,32,3))
conv_layer_1 = Conv2D(filters = 10, kernel_size = (4,4), strides = 2, padding = 'same')(input_layer)
conv_layer_2 = Conv2D(filters = 20, kernel_size = (3,3), strides = 2, padding = 'same')(conv_layer_1)
flatten_layer = Flatten()(conv_layer_2)
output_layer = Dense(units=10, activation = 'softmax')(flatten_layer)
model = Model(input_layer, output_layer)
model.summary()
out[17]

Model: "functional_1"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Layer (type)               ┃ Output Shape       ┃  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ input_layer_1 (InputLayer) │ (None, 32, 32, 3)  │        0 │
├────────────────────────────┼────────────────────┼──────────┤
│ conv2d (Conv2D)            │ (None, 16, 16, 10) │      490 │
├────────────────────────────┼────────────────────┼──────────┤
│ conv2d_1 (Conv2D)          │ (None, 8, 8, 20)   │    1,820 │
├────────────────────────────┼────────────────────┼──────────┤
│ flatten_1 (Flatten)        │ (None, 1280)       │        0 │
├────────────────────────────┼────────────────────┼──────────┤
│ dense_3 (Dense)            │ (None, 10)         │   12,810 │
└────────────────────────────┴────────────────────┴──────────┘

 Total params: 15,120 (59.06 KB)
 Trainable params: 15,120 (59.06 KB)
 Non-trainable params: 0 (0.00 B)

The output of a Conv2D layer is another four-dimensional tensor, now of shape (batch_size, height, width, filters), so we can stack Conv2D layers on top of each other to grow the depth of our network. It is really important to understand how the shape of the tensor changes as data flows through from one convolutional layer to the next.

Diagram of Convolutional Neural Network

The depth of the filters in a layer is always the same as the number of channels in the preceding layer. The number of parameters in a convolutional layer is equal to (filter width x filter height x channels in input + 1 bias term) x number of filters. In general, the shape of the output from a convolutional layer with padding="same" is:

$$\left(\text{None}, \cfrac{\text{input height}}{\text{stride}}, \cfrac{\text{input width}}{\text{stride}}, \text{filters}\right)$$
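Both rules can be checked against the model summary of the convolutional network above. The helper names in this sketch are illustrative. The output size is rounded up, which is how "same" padding behaves when the input size is not an exact multiple of the stride.

```python
import math


def conv2d_same_output_shape(height, width, stride, filters):
    """Output (height, width, channels) of a Conv2D layer with padding='same'."""
    return (math.ceil(height / stride), math.ceil(width / stride), filters)


def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


# First layer: 10 filters of size 4 x 4 over a 32 x 32 x 3 input, stride 2.
shape_1 = conv2d_same_output_shape(32, 32, stride=2, filters=10)
params_1 = conv2d_params(4, 4, in_channels=3, filters=10)

# Second layer: 20 filters of size 3 x 3 over the 10-channel output, stride 2.
shape_2 = conv2d_same_output_shape(shape_1[0], shape_1[1], stride=2, filters=20)
params_2 = conv2d_params(3, 3, in_channels=shape_1[2], filters=20)

print(shape_1, params_1)
print(shape_2, params_2)
```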

Batch Normalization

One common problem when training a deep neural network is ensuring that the weights of the network remain within a reasonable range of values. If the weights start to become too large, this is a sign that your network is suffering from what is known as the exploding gradients problem. As errors are propagated backward through the network, the calculation of the gradient in the earlier layers can sometimes grow exponentially large, causing wild fluctuations in weight values. Exploding gradients don't necessarily appear early in training.

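One of the reasons for scaling input data into a neural network is to ensure a stable start to training over the first few iterations. This stability can break down as the network trains and the weights move farther and farther away from their random initial values, which changes the distribution of the inputs that each layer receives. This phenomenon is known as covariate shift.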

Batch normalization is a solution that drastically reduces the exploding gradients problem. A batch normalization layer calculates the mean and standard deviation of each of its input channels across the batch and normalizes by subtracting the mean and dividing by the standard deviation. There are then two learned (trainable) parameters for each channel, the scale (gamma) and shift (beta). The output is simply the normalized input, scaled by gamma and shifted by beta.
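As a minimal sketch of the training-time computation just described, the NumPy function below normalizes a batch of shape (batch, channels) and then applies gamma and beta. The function name and toy data are illustrative, and the small `epsilon` added inside the square root is an assumed constant to avoid division by zero.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, epsilon=1e-3):
    """Normalize each channel across the batch, then scale and shift."""
    mean = x.mean(axis=0)   # per-channel mean over the batch
    var = x.var(axis=0)     # per-channel variance over the batch
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta, mean, var


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # 64 observations, 4 channels

gamma = np.ones(4)   # scale, learned during training (initial value shown)
beta = np.zeros(4)   # shift, learned during training (initial value shown)

out, batch_mean, batch_var = batch_norm_train(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))
```

With gamma at 1 and beta at 0 the output of each channel has mean close to 0 and standard deviation close to 1, regardless of the scale of the input.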

Batch Normalization Process

We can place batch normalization layers after dense or convolutional layers to normalize the output from those layers. During training, a batch normalization layer calculates the moving average of the mean and standard deviation of each channel and stores this value as part of the layer to use at test time.

The calculated mean and standard deviation are called nontrainable parameters because they are derived from data passing through the layer rather than trained through backpropagation.

In Keras, the BatchNormalization layer implements the batch normalization functionality:

BatchNormalization(momentum = 0.9)

The momentum parameter is the weight given to the previous value when calculating the moving average and moving standard deviation.
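The role of momentum can be sketched as an exponential moving average. The helper name and the batch means below are illustrative. The starting value of 0 is an assumption made for the example.

```python
def update_moving_average(moving, batch_value, momentum=0.9):
    """Keep `momentum` of the previous value and blend in the new batch value."""
    return momentum * moving + (1.0 - momentum) * batch_value


# Hypothetical per-batch means for one channel over three training steps.
moving_mean = 0.0
for batch_mean in [2.0, 2.0, 2.0]:
    moving_mean = update_moving_average(moving_mean, batch_mean)
    print(round(moving_mean, 3))
```

With momentum = 0.9 the stored statistic moves only 10% of the way toward each new batch value, so it changes smoothly and is not dominated by any single batch.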

Dropout Layers

Any successful machine learning algorithm must ensure that it generalizes to unseen data, rather than simply remembering the training dataset. If an algorithm performs well on the training dataset, but not the test dataset, we say that it is suffering from overfitting. To counteract this problem, we use regularization techniques, which ensure that the model is penalized if it starts to overfit.

There are many ways to regularize a machine learning algorithm, but for deep learning, one of the most common is by using dropout layers. During training, each dropout layer chooses a random set of units from the preceding layer and sets their output to zero. Incredibly, this simple addition drastically reduces overfitting, by ensuring that the network doesn't become overdependent on certain units or groups of units that, in effect, just remember observations from the training set.

The Dropout layer in Keras implements this functionality, with the rate parameter pecifying the proportion of units to drop from the preceding layer: Dropout(rate=0.25). Dropout layers are most commonly used after Dense layers since these are most prone to overfitting due to their higher number of weights.
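The training-time behavior of dropout can be sketched in NumPy as follows. This is an illustration, not the Keras source: the function name `dropout_train` and the toy input are assumptions. The rescaling of the surviving units by 1 / (1 - rate) is not described in the text above; it is how the Keras Dropout layer keeps the expected sum of its inputs unchanged.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Zero a random subset of units and rescale the rest (training-time behavior)."""
    keep_mask = rng.random(x.shape) >= rate          # True for units that survive
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(42)
layer_output = np.ones((4, 1000))                    # toy output of a preceding layer
dropped = dropout_train(layer_output, rate=0.25, rng=rng)

zero_fraction = np.mean(dropped == 0.0)
print(zero_fraction)   # roughly 0.25
print(dropped.mean())  # roughly 1.0, because surviving units are scaled up
```

At test time no units are dropped and the layer passes its input through unchanged.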

Batch normalization also has been shown to reduce overfitting, and therefore many modern deep learning architectures don't use dropout at all, and rely solely on batch normalization for regularization.

Putting it all Together

"""
Putting it all Together

- The order in which to use BatchNormalization and Activation layers is a matter of preference.
"""
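from tensorflow.keras import layers, activations
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model

# NUM_CLASSES is assumed to be defined in an earlier cell (the summary below shows 10 output units).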

input_layer = Input((32,32,3))
x = Conv2D(filters = 32, kernel_size = 3, strides = 1, padding = 'same')(input_layer)
x = BatchNormalization()(x)
x = layers.Activation(activations.leaky_relu)(x)
x = Conv2D(filters = 32, kernel_size = 3, strides = 2, padding = 'same')(x)
x = BatchNormalization()(x)
x = layers.Activation(activations.leaky_relu)(x)
x = Conv2D(filters = 64, kernel_size = 3, strides = 1, padding = 'same')(x)
x = BatchNormalization()(x)
x = layers.Activation(activations.leaky_relu)(x)
x = Conv2D(filters = 64, kernel_size = 3, strides = 2, padding = 'same')(x)
x = BatchNormalization()(x)
x = layers.Activation(activations.leaky_relu)(x)
x = Flatten()(x)
x = Dense(128)(x)
x = BatchNormalization()(x)
x = layers.Activation(activations.leaky_relu)(x)
x = Dropout(rate = 0.5)(x)
x = Dense(NUM_CLASSES)(x)
output_layer = layers.Activation('softmax')(x)
model = Model(input_layer, output_layer)
model.summary()
out[19]

Model: "functional_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_2 (InputLayer)           │ (None, 32, 32, 3)           │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv2d_2 (Conv2D)                    │ (None, 32, 32, 32)          │             896 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization                  │ (None, 32, 32, 32)          │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ activation (Activation)              │ (None, 32, 32, 32)          │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv2d_3 (Conv2D)                    │ (None, 16, 16, 32)          │           9,248 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_1                │ (None, 16, 16, 32)          │             128 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ activation_1 (Activation)            │ (None, 16, 16, 32)          │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv2d_4 (Conv2D)                    │ (None, 16, 16, 64)          │          18,496 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_2                │ (None, 16, 16, 64)          │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ activation_2 (Activation)            │ (None, 16, 16, 64)          │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv2d_5 (Conv2D)                    │ (None, 8, 8, 64)            │          36,928 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_3                │ (None, 8, 8, 64)            │             256 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ activation_3 (Activation)            │ (None, 8, 8, 64)            │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ flatten_2 (Flatten)                  │ (None, 4096)                │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_4 (Dense)                      │ (None, 128)                 │         524,416 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_4                │ (None, 128)                 │             512 │
│ (BatchNormalization)                 │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ activation_4 (Activation)            │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout (Dropout)                    │ (None, 128)                 │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_5 (Dense)                      │ (None, 10)                  │           1,290 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ activation_5 (Activation)            │ (None, 10)                  │               0 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 592,554 (2.26 MB)
 Trainable params: 591,914 (2.26 MB)
 Non-trainable params: 640 (2.50 KB)
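The parameter counts in the summary above can be reproduced by hand. The sketch below uses plain Python with hypothetical helper names (`conv2d_params`, `dense_params`, `batch_norm_params`). It assumes every convolutional and dense layer has a bias term, and that each batch normalization layer holds four values per channel: gamma and beta (trainable) plus the moving mean and moving variance (nontrainable).

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square kernel across all input channels, plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(in_units, units):
    """One weight per input-output pair, plus one bias per output unit."""
    return in_units * units + units

def batch_norm_params(channels):
    """gamma, beta, moving mean, and moving variance for each channel."""
    return 4 * channels

bn_channels = [32, 32, 64, 64, 128]  # channel count seen by each BatchNormalization layer

total = (
    conv2d_params(3, 3, 32)          # conv2d_2: 896
    + conv2d_params(3, 32, 32)       # conv2d_3: 9,248
    + conv2d_params(3, 32, 64)       # conv2d_4: 18,496
    + conv2d_params(3, 64, 64)       # conv2d_5: 36,928
    + dense_params(8 * 8 * 64, 128)  # dense_4: 524,416
    + dense_params(128, 10)          # dense_5: 1,290
    + sum(batch_norm_params(c) for c in bn_channels)
)
non_trainable = sum(2 * c for c in bn_channels)  # moving mean and variance only
trainable = total - non_trainable

print(total, trainable, non_trainable)  # 592554 591914 640
```

Most of the weights sit in the first Dense layer, which is consistent with the earlier remark that dense layers are the most prone to overfitting.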

opt = Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
model.fit(x_train,y_train,batch_size=32,epochs=10,shuffle=True)
model.evaluate(x_test,y_test,batch_size=1000)
out[20]

Epoch 1/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 16s 5ms/step - accuracy: 0.3864 - loss: 1.7922
Epoch 2/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.5847 - loss: 1.1693
Epoch 3/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.6570 - loss: 0.9779
Epoch 4/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.6993 - loss: 0.8689
Epoch 5/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.7197 - loss: 0.8062
Epoch 6/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.7441 - loss: 0.7376
Epoch 7/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.7610 - loss: 0.6872
Epoch 8/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.7784 - loss: 0.6347
Epoch 9/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.7873 - loss: 0.5996
Epoch 10/10
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.8044 - loss: 0.5591
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.7169 - loss: 0.8352

[0.8304747343063354, 0.7206000089645386]
