Deep Learning with Python - Chapters 1 and 2

The first two chapters go over what is deep learning / the mathematical building blocks of neural networks.

What is Deep Learning

AI, ML, and DL

AI can be described as the effort to automate intellectual tasks normally performed by humans. Symbolic AI is the approach of achieving human-level artificial intelligence by having programmers handcraft a sufficiently large set of explicit rules for manipulating knowledge stored in explicit databases. This was the dominant approach from the 1950s to the 1980s. Machine learning looks at the input data and the corresponding answers and figures out what the rules should be. A machine learning system is trained rather than explicitly programmed: it is presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task. Unlike statistics, machine learning tends to deal with large, complex datasets (such as a dataset of millions of images, each consisting of tens of thousands of pixels) for which classical statistical analysis such as Bayesian analysis would be impractical.

The central problem of machine learning and deep learning is to *meaningfully transform data*: in other words, to learn useful representations of the input data at hand - representations that get us closer to the expected output. Learning, in the context of machine learning, describes an automatic search process for data transformations that produce useful representations of some data, guided by some feedback signal - representations that are amenable to simpler rules solving the task at hand.

Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The "deep" in deep learning refers to this idea of successive layers of representations. How many layers contribute to a model of the data is called the depth of the model. Other appropriate names for the field could have been layered representations learning or hierarchical representations learning. In deep learning, these layered representations are learned via models called neural networks, structured in literal layers stacked on top of each other. Although the name neural network refers to neurobiology, deep learning models are not models of the brain, and there is no evidence that the brain implements anything like the learning mechanisms used in modern deep learning models.

Deep Learning Network for Digit Recognition

Deep Neural Network for Digit Classification

[The Deep Neural Network for Digit Classification above] transforms the digit into representations that are increasingly different from the original image and increasingly informative about the final result. You can think of a deep network as a multistage information-distillation process, where information goes through successive filters and comes out increasingly purified (that is, useful with regard to some task).

Data Representations Learned by Digit Classification

Deep learning is technically a multistage way to learn data representations. The specification of what a layer does to its input data is stored in the layer's weights, which in essence are a bunch of numbers. In technical terms, we'd say that the transformation implemented by a layer is parameterized by its weights. In this context, learning means finding a set of values for the weights of all layers in a network such that the network will correctly map example inputs to their associated targets. To control the output of a neural network, you must be able to measure how far this output is from what you expected. This is the job of the loss function of the network, also sometimes called the objective function or cost function. The loss function takes the predictions of the network and the true target and computes a distance score, capturing how well the network has done on this specific example. The fundamental trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example. This adjustment is the job of the optimizer, which implements what's called the Backpropagation algorithm: the central algorithm in deep learning.

Deep Learning Diagram

A Brief History of Machine Learning

Probabilistic modeling is the application of the principles of statistics to data analysis. It is one of the earliest forms of machine learning, and it's still widely used to this day. One of the best known algorithms in this category is the Naive Bayes algorithm. Naive Bayes is a type of classifier based on applying Bayes' theorem while assuming that the features in the input data are all independent. A closely related model is logistic regression.

Kernel methods are a group of classification algorithms, best known of which is the Support Vector Machine (SVM). SVM is a classification algorithm that works by finding decision boundaries. SVMs find these boundaries in two steps:

  1. The data is mapped to a new high-dimensional representation where the decision boundary can be expressed as a hyperplane.
  2. A good decision boundary (separation hyperplane) is computed by trying to maximize the distance between the hyperplane and the closest data points from each class, a step called maximizing the margin. This allows the boundary to generalize well to new samples outside of the training set.

The gist of the kernel trick: to find a good decision hyperplane in the new representation space, you just need to compute the distance between pairs of points in that space, which can be done efficiently using a kernel function. A kernel function is a computationally tractable operation that maps two points in your initial space to the distance between these points in your target representation space, completely bypassing the explicit computation of the new representation. Kernel functions are typically crafted by hand rather than learned from data. SVMs proved hard to scale to large datasets and didn't provide good results for perceptual problems like image classification.
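Not from the book, but as a minimal sketch of what a kernelized SVM looks like in practice with scikit-learn (the SVC class, the RBF kernel choice, and the make_moons toy dataset are my own assumptions here):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy, non-linearly separable 2D dataset
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
# The RBF kernel scores similarity between pairs of points directly,
# so the high-dimensional representation is never computed explicitly (the kernel trick)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))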

Decision trees are flowchart-like structures that let you classify input data points or predict output values given inputs. They're easy to visualize and interpret. The Random Forest algorithm introduced a robust, practical take on decision-tree learning that involves building a large number of specialized trees and then ensembling their outputs. A gradient boosting machine is a machine learning technique based on ensembling weak prediction models, generally decision trees. It uses gradient boosting, a way to improve any machine learning model by iteratively training new models that specialize in addressing the weak points of the previous models. Applied to decision trees, the gradient boosting technique results in models that strictly outperform random forests most of the time, while having similar properties. It [gradient boosting] may be one of the best, if not the best, algorithm for dealing with non-perceptual data today. Alongside deep learning, it's one of the most commonly used techniques in Kaggle competitions.
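For illustration only (again my own sketch, not the book's code), a gradient boosting machine can be fit with scikit-learn's GradientBoostingClassifier; in Kaggle practice XGBoost or LightGBM would be the more common choices:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Each new tree is trained to correct the mistakes of the ensemble built so far
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("Test accuracy:", gbm.score(X_test, y_test))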

Since 2012, deep convolutional neural networks (convnets) have become the go-to algorithm for all computer vision tasks; more generally, they work well on all perceptual tasks.

Deep learning completely automates what used to be the most crucial step in a machine learning workflow: feature engineering, the manual engineering of good layers of representations for the data. With deep learning, you learn all features in one pass rather than having to engineer them yourself. This has greatly simplified machine learning workflows, often replacing sophisticated multistage pipelines with a single, simple, end-to-end deep learning model. What is transformative about deep learning is that it allows a model to learn all layers of representation jointly, at the same time, rather than in succession (greedily, as it's called).

With joint feature learning, whenever the model adjusts one of its internal features, all other features that depend on it automatically adapt to the change, without requiring human intervention. Everything is supervised by a single feedback signal: every change in the model serves the end goal. This is much more powerful than greedily stacking shallow methods, because it allows complex, abstract representations to be learned by breaking them down into long series of intermediate spaces (layers); each space is only a simple transformation away from the previous one.

Two essential characteristics of how deep learning learns from data:

  1. the incremental, layer-by-layer way in which increasingly complex representations are developed
  2. the fact that these intermediate incremental representations are learned jointly

In 2019, Kaggle ran a survey asking teams that ended up in the top five of any competition which primary software tool they used. It turns out that top teams either use deep learning methods (via Keras) or gradient boosted trees (via XGBoost).

Software Used in Kaggle Competitions

Results of Kaggle survey among machine learning and data science professionals worldwide:

Results of Kaggle Survey from ML and DS Professionals

From 2016 to 2020, the entire machine learning and data science industry was dominated by these two approaches: deep learning and gradient boosted trees. Specifically, gradient boosted trees are used for problems where structured data is available, whereas deep learning is used for perceptual problems such as image classification.

GPU = graphics processing unit: fast, massively parallel chips. In 2016, at its annual I/O convention, Google revealed its Tensor Processing Unit (TPU) project: a new chip design developed from the ground up to run deep neural networks significantly faster and far more energy-efficiently than top-of-the-line GPUs. In the late 2000s, the key algorithmic issue for deep learning was gradient propagation through deep stacks of layers: the feedback signal used to train neural networks would fade away as the number of layers increased. Several algorithmic improvements helped gradient propagation:

  • Better activation functions for neural layers
  • Better weight initialization schemes, starting with layer-wise pre-training, which was then quickly abandoned
  • Better optimization schemes, such as RMSProp and Adam

From 2014 to 2016, more advanced ways to improve gradient propagation were discovered: batch normalization, residual connections, and depthwise separable convolutions. The three most important properties of deep learning:

  1. Simplicity - removes the need for feature engineering
  2. Scalability - Highly amenable to parallelization on GPUs or TPUs, so it can take full advantage of Moore's law. Deep learning models are trained by iterating over small batches of data, allowing them to be trained on datasets of arbitrary size
  3. Versatility and reusability - Unlike many prior machine learning approaches, deep learning models can be trained on additional data without restarting from scratch, making them viable for continuous online learning - an important property for very large production models. Trained deep learning models are repurposable and reusable.

The Mathematical Building Blocks of Neural Networks

Let's look at a concrete example of a neural network that uses the Python library Keras to learn to classify handwritten digits. Solving MNIST is the "Hello World" of deep learning. The core building block of neural networks is the layer. You can think of a layer as a filter for data: some data goes in, and it comes out in a more useful form. Layers extract representations out of the data fed into them. Most of deep learning consists of chaining together simple layers that implement a form of progressive data distillation. A deep learning model is like a sieve for data processing, made of a succession of increasingly refined data filters - the layers. A softmax classification layer returns an array of probability scores that sum to one. As part of the compilation step, the model needs an optimizer (the mechanism through which the model will update itself based on the training data it sees, to improve performance), a loss function (how the model will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction), and metrics to monitor during training and testing (here we only care about accuracy).

from tensorflow.keras.datasets import mnist 

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print("Train Images Shape:",train_images.shape)
print("Train Labels Shape",len(train_labels))
print("Train Labels",train_labels)

print("Test Images Shape:",test_images.shape)
print("Test Labels",len(test_labels))
print("Test Labels",test_labels)
out[3]

Train Images Shape: (60000, 28, 28)
Train Labels Shape 60000
Train Labels [5 0 4 ... 5 6 8]
Test Images Shape: (10000, 28, 28)
Test Labels 10000
Test Labels [7 2 1 ... 4 5 6]

from tensorflow import keras 
from tensorflow.keras import layers 

model = keras.Sequential([
    layers.Dense(512,activation="relu"),
    layers.Dense(10,activation="softmax")
])

model.compile(optimizer="rmsprop",loss="sparse_categorical_crossentropy",metrics=["accuracy"])

"""
Preprocessing the data by reshaping it into the shape that the model expects and scaling it so that all values are in the [0,1] interval 
"""
train_images = train_images.reshape((60_000,28*28))
train_images = train_images.astype("float32") / 255 
test_images = test_images.reshape((10_000,28*28))
test_images = test_images.astype('float32') / 255
"""
Fit the model
"""
model.fit(train_images,train_labels,epochs=5,batch_size=128)
out[4]

Epoch 1/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8784 - loss: 0.4283
Epoch 2/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9670 - loss: 0.1117
Epoch 3/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9799 - loss: 0.0693
Epoch 4/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9857 - loss: 0.0479
Epoch 5/5
469/469 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9889 - loss: 0.0380

<keras.src.callbacks.history.History at 0x2b12b33b710>

"""
Using the model to make predictions
"""
import matplotlib.pyplot as plt
import numpy as np 
test_digits = test_images[0:10]
predictions = model.predict(test_digits)
fig, ax = plt.subplots(2,5,layout="constrained",figsize=(16,8))
for i in range(10):
    digit = test_images[i]
    if i < 5:
        row = 0
        col = i
    else:
        row = 1
        col = i - 5
    ax[row,col].imshow(digit.reshape(28,28),cmap="gray")
    # Hide the axis ticks so only the digit image and title are shown
    ax[row,col].axes.get_yaxis().set_visible(False)
    ax[row,col].axes.get_xaxis().set_visible(False)
    prediction_proba_max = np.max(predictions[i])
    prediction_value_max = np.argmax(predictions[i])
    ax[row,col].set_title("Prediction: {} \nwith Probability {:2.2f}%".format(prediction_value_max,prediction_proba_max*100))
fig.suptitle("MNIST DNN Classification Prediction / Probabilities Examples")
plt.show()
out[5]

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step

Jupyter Notebook Image

<Figure size 1600x800 with 10 Axes>

"""
How good is the model at classifying never-before-seen digits? Check by computing the average accuracy over the entire test set:
"""
test_loss, test_acc = model.evaluate(test_images, test_labels)
print("test_acc: {:1.4f}".format(test_acc))
out[6]

313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 840us/step - accuracy: 0.9772 - loss: 0.0716
test_acc: 0.9811

A gap between the training accuracy and the test accuracy, when the training accuracy is higher than the test accuracy, is an example of overfitting: the fact that machine learning models tend to perform worse on new data than on their training data.

Data stored in multidimensional NumPy arrays are called tensors. All current machine learning systems use tensors as their basic data structure. A tensor is a container for data - usually numeric data. Tensors are a generalization of matrices to an arbitrary number of dimensions. The number of axes of a tensor is called its rank (ndarray.ndim == rank). A scalar is a rank-0 tensor. A vector is a rank-1 tensor. The number of entries in a vector is called its dimensionality. Dimensionality can denote either the number of entries along a specific axis or the number of axes in a tensor, so it is less ambiguous to talk about the rank of a tensor, meaning the number of axes it has. A matrix is a rank-2 (2D) tensor; it has two axes (often referred to as rows and columns). If matrices are packed into a new array, you obtain a rank-3 (3D) tensor. Key attributes of a tensor: the number of axes (rank); the shape, a tuple of integers that describes how many dimensions the tensor has along each axis; and the data type (usually called dtype in Python libraries), the type of the data contained in the tensor. Selecting specific elements in a tensor is called tensor slicing.
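A minimal sketch of these attributes and of tensor slicing on the raw MNIST images (the variable names are mine; the data is re-loaded here so the shapes match the description above):

from tensorflow.keras.datasets import mnist

(raw_train_images, _), _ = mnist.load_data()
print(raw_train_images.ndim)    # number of axes (rank): 3
print(raw_train_images.shape)   # (60000, 28, 28)
print(raw_train_images.dtype)   # uint8 before any preprocessing

my_slice = raw_train_images[10:100]      # slicing: select digits #10 to #99
print(my_slice.shape)                    # (90, 28, 28)
corner = raw_train_images[:, 14:, 14:]   # bottom-right 14 x 14 pixels of every image
print(corner.shape)                      # (60000, 14, 14)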

In general, the first axis (axis 0) in all data tensors you'll come across in deep learning will be the samples axis (or samples dimension). When considering a batch tensor, the first axis is called the batch axis or batch dimension.

  • Vector data - Rank-2 tensors of shape (samples, features), where each sample is a vector of numerical attributes (think of a DataFrame)
  • Timeseries data or sequence data - Rank-3 tensors of shape (samples, timesteps, features), where each sample is a sequence (of length timesteps) of feature vectors
  • Images - Rank-4 tensors of shape (samples, height, width, channels), where each sample is a 2D grid of pixels, and each pixel is represented by a vector of values ("channels"); see the sketch after this list
    • By convention, image tensors are always rank-3, with a one-dimensional color channel for grayscale images
  • Video - Rank-5 tensors of shape (samples, frames, height, width, channels), where each sample is a sequence (of length frames) of images
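A quick sketch (my own, not from the book) building dummy arrays with these conventional shapes, just to make the axis ordering concrete:

import numpy as np

vector_data = np.zeros((1000, 16))           # (samples, features)
timeseries  = np.zeros((250, 100, 3))        # (samples, timesteps, features)
images      = np.zeros((64, 32, 32, 3))      # (samples, height, width, channels)
video       = np.zeros((4, 16, 32, 32, 3))   # (samples, frames, height, width, channels)
for name, tensor in [("vector data", vector_data), ("timeseries", timeseries),
                     ("images", images), ("video", video)]:
    print(name, "shape:", tensor.shape, "rank:", tensor.ndim)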

All transformations learned by deep neural networks can be reduced to a handful of tensor operations (or tensor functions) applied to tensors of numeric data. For instance, it's possible to add tensors, multiply tensors, and so on.

keras.layers.Dense(512,activation="relu")

# can be expressed as:
output = relu(dot(input, W) + b)

The relu operation and addition are element-wise operations: operations that are applied independently to each entry in the tensors being considered. This means these operations are highly amenable to massively parallel implementations (vectorized implementations, a term that comes from the vector processor supercomputer architecture from the 1970-90 period).

In practice, when dealing with NumPy arrays, these operations [element-wise operations] are available as well-optimized built-in NumPy functions, which themselves delegate the heavy lifting to a Basic Linear Algebra Subprograms (BLAS) implementation. BLAS are low-level, highly parallel, efficient tensor-manipulation routines that are typically implemented in Fortran or C.
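The book illustrates this with naive Python loop implementations of element-wise operations; along those lines, here is a sketch of my own comparing a pure-Python relu with the vectorized NumPy equivalent (the array size is arbitrary):

import time
import numpy as np

def naive_relu(x):
    assert x.ndim == 2               # only handles rank-2 tensors
    x = x.copy()                     # avoid mutating the input
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0.)
    return x

x = np.random.random((200, 100)) - 0.5   # values in [-0.5, 0.5]

t0 = time.time()
naive_out = naive_relu(x)
print("Naive Python loops:", time.time() - t0, "s")

t0 = time.time()
vectorized_out = np.maximum(x, 0.)       # delegates the work to optimized low-level routines
print("Vectorized NumPy:", time.time() - t0, "s")

assert np.allclose(naive_out, vectorized_out)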

Dot product of two matrices: dot(X, Y) has shape (X.shape[0], Y.shape[1]). In general, elementary geometric operations such as translation, rotation, scaling, skewing, and so on can be expressed as tensor operations:

  • Translation - tensor addition
  • Rotation - can be achieved via a dot product with a 2 x 2 matrix: R = [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]]
  • Scaling - a vertical and horizontal scaling of an image can be achieved via a dot product with a 2 x 2 matrix: S = [[horizontal_factor, 0], [0, vertical_factor]] (a diagonal matrix)
  • Linear transform - a dot product with an arbitrary matrix implements a linear transform. Note that scaling and rotation, listed previously, are by definition linear transforms
  • Affine transform - an affine transform (see below) is the combination of a linear transform (achieved via a dot product with some matrix) and a translation (achieved via a vector addition). A Dense layer without an activation function is an affine layer; a small NumPy sketch of these transforms follows this list
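A small NumPy sketch of these transforms applied to a 2D point (the angle, scaling factors, and translation vector are arbitrary choices of mine):

import numpy as np

theta = np.pi / 2                                 # rotate by 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[2.0, 0.0],                         # horizontal_factor = 2
              [0.0, 0.5]])                        # vertical_factor = 0.5
b = np.array([1.0, -1.0])                         # translation vector

p = np.array([1.0, 0.0])                          # a point in the plane
print(np.dot(R, p))        # rotation: ~[0, 1]
print(np.dot(S, p))        # scaling: [2, 0]
print(np.dot(R, p) + b)    # affine transform: linear transform (rotation) + translation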

Affine Transformation

  • Dense layer with relu activation: An important observation about affine transforms is that if you apply many of them repeatedly, you still end up with an affine transform. As a consequence, a multilayer neural network made entirely of Dense layers without activations would be equivalent to a single Dense layer. This "Deep" neural network would act like a linear model in disguise. This is why we need activation functions, like relu. Thanks to activation functions, a chain of Dense layers can be made to implement very complex, non-linear geometric transformations, resulting in very rich hypothesis spaces for your deep neural networks.

Affine Transform followed by relu activation

You can interpret a neural network as a very complex geometric transformation in a high-dimensional space, implemented via a series of simple steps. Machine learning is about finding neat representations for complex, highly folded data manifolds in high-dimensional spaces (a manifold is a continuous surface, like a crumpled sheet of paper).

The Engine of Neural Networks: Gradient-Based Optimization

Each neural layer from the first model example transforms its input data as follows: output = relu(dot(input, W) + b). W and b are attributes of the layer. They're called the weights or trainable parameters of the layer (the kernel and bias attributes, respectively). These weights contain the information learned by the model from exposure to training data. Initially, these weights are filled with small random values (a step called weight initialization). What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called training, is the learning that machine learning is all about. It happens within what's called a training loop, which works as follows:

  1. Draw a batch of training samples, x, and corresponding targets y_true
  2. Run the model on x (a step called the forward pass) to obtain predictions, y_pred.
  3. Compute the loss of the model on the batch, a measure of the mismatch between y_true and y_pred
  4. Update all the weights of the model in a way that slightly reduces the loss on this batch.

The difficult part is step 4: updating the model's weights. Given an individual weight coefficient in the model, how can you compute whether the coefficient should be increased or decreased, and by how much? This is done with gradient descent, the optimization technique that powers modern neural networks. You can use a mathematical operator called the gradient to describe how the loss varies as you move the model's coefficients (all at once in a single update, rather than one at a time) in a direction that decreases the loss. The derivative of a tensor operation (or tensor function) is called a gradient. Gradients are just the generalization of the concept of derivatives to functions that take tensors as inputs. The gradient of a tensor function represents the curvature of the multidimensional surface described by the function: it characterizes how the output of the function varies when its input parameters vary.

y_pred = dot(W,x) # Use the model weights W, to make a prediction for x
loss_value = loss(y_pred, y_true) # We estimate how far off the prediction was 
# The preceding function can be interpreted as a function mapping values of W to loss values 
loss_value = f(W) # f describes the curve (or high-dimensional surface) formed by loss values when W varies 

The tensor grad(loss_value, W0) is the gradient of the function f(W) = loss_value in W0, also called the "gradient of loss_value with respect to W around W0".

[F]or a function f(x), you can reduce the value of f(x) by moving x a little in the opposite direction from the derivative, with a function f(W) of a tensor, you can reduce loss_value = f(W) by moving W in the opposite direction from the gradient: for example, W1 = W0 - step * grad(f(W0), W0) (where step is a small scaling factor). That means going against the direction of steepest ascent of f, which intuitively should put you lower on the curve. Note that the scaling factor step is needed because grad(loss_value, W0) only approximates the curvature when you’re close to W0, so you don’t want to get too far from W0.

Stochastic Gradient Descent

  1. Draw a batch of training samples, x, and corresponding targets, y_true
  2. Run the model on x to obtain predictions y_pred (forward pass)
  3. Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true
  4. Compute the gradient of the loss with regard to the model's parameters (this is called the backward pass)
  5. Move the parameters a little in the opposite direction from the gradient- for example, W -= learning_rate*gradient - thus reducing the loss on the batch a bit. The learning rate would be a scalar factor modulating the "speed" of the gradient descent process.

The process described above is called mini-batch stochastic gradient descent (mini-batch SGD). The term stochastic refers to the fact that each batch of data is drawn at random (stochastic is a scientific synonym of random). Batch gradient descent runs every sample at once for each update; true SGD runs one sample at a time.
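To make the update rule concrete, here is a toy gradient descent sketch of my own (a single scalar parameter and a hand-derived gradient, not a neural network):

# Toy loss: loss(w) = (w - 3) ** 2, whose minimum is at w = 3
w = 0.0
learning_rate = 0.1
for step in range(100):
    gradient = 2 * (w - 3)            # d(loss)/dw, computed analytically for this toy loss
    w -= learning_rate * gradient     # move the parameter against the gradient
print(w)                              # close to 3, the minimizer of the loss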

Visualization of Gradient Descent along 2D loss Surface

There are variants of SGD that differ by taking into account previous weight updates when computing the next weight update. There is, for instance, SGD with momentum, as well as Adagrad, RMSprop, and several others. Such variants are known as optimization methods or optimizers. In particular, the concept of momentum, which is used in many of these variants, deserves attention: it addresses two main issues with SGD, convergence speed and local minima. Momentum helps you break out of local minima.
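The book sketches momentum with a simple update loop; a hedged Python version along those lines, reusing the toy loss from the previous sketch (the learning rate and momentum value are arbitrary):

def toy_gradient(w):
    return 2 * (w - 3)        # gradient of the toy loss (w - 3) ** 2

w = 0.0
velocity = 0.0                # accumulates a running "velocity" from past updates
momentum = 0.9
learning_rate = 0.1
for step in range(200):
    velocity = momentum * velocity - learning_rate * toy_gradient(w)
    w += velocity             # the update depends on past updates, not just the current gradient
print(w)                      # spirals in toward the minimum at w = 3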

Chaining Derivatives: The Backpropagation Algorithm

The Backpropagation algorithm is how the gradients of complex expressions are computed in practice. Backpropagation is a way to use the derivatives of simple operations (such as addition, relu, or tensor product) to easily compute the gradient of arbitrarily complex combinations of atomic operations. Crucially, a neural network consists of many tensor operations chained together, each of which has a simple, known derivative. Applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called backpropagation. A useful way to think about backpropagation is in terms of computation graphs.

Computational Graph Representation of Two Layer Model

A computation graph is the data structure at the heart of TensorFlow and of the deep learning revolution in general. It's a directed acyclic graph of operations - in this case, tensor operations. What the chain rule says about the backward graph is that you can obtain the derivative of a node with respect to another node by multiplying the derivatives for each edge along the path linking the two nodes. For instance, grad(loss_val, w) = grad(loss_val, x2) * grad(x2, x1) * grad(x1, w).

Path from loss_val to w in backwards graph

Backpropagation is simply the application of the chain rule to a computation graph. Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, computing the contribution that each parameter had in the loss value. That's where the name "backpropagation" comes from: we "backpropagate" the loss contributions of different nodes in a computation graph. Nowadays people implement neural networks in modern frameworks that are capable of automatic differentiation, such as TensorFlow. Automatic differentiation is implemented with the kind of computation graph you've just seen. It makes it possible to retrieve the gradients of arbitrary compositions of differentiable tensor operations without doing any extra work besides writing down the forward pass.
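A tiny numeric check of the chain-rule statement above, using plain Python (the two-operation chain and its values are my own example):

# Tiny chain of simple operations: x1 = w * 2, x2 = x1 + 3, loss_val = x2 ** 2
w = 1.5
x1 = w * 2
x2 = x1 + 3
loss_val = x2 ** 2

# Derivative of each edge in the backward graph, each one simple and known analytically
grad_loss_x2 = 2 * x2     # d(x2 ** 2) / d(x2)
grad_x2_x1 = 1.0          # d(x1 + 3) / d(x1)
grad_x1_w = 2.0           # d(w * 2) / d(w)

# Chain rule: multiply the derivatives along the path from loss_val back to w
grad_loss_w = grad_loss_x2 * grad_x2_x1 * grad_x1_w
print(grad_loss_w)        # 24.0

# Sanity check with a finite-difference approximation
eps = 1e-6
loss_shifted = ((w + eps) * 2 + 3) ** 2
print((loss_shifted - loss_val) / eps)   # ~24.0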

The Gradient Tape in TensorFlow

The API through which you can leverage TensorFlow’s powerful automatic differentiation capabilities is the GradientTape. It’s a Python scope that will “record” the tensor operations that run inside it, in the form of a computation graph (sometimes called a “tape”). This graph can then be used to retrieve the gradient of any output with respect to any variable or set of variables (instances of the tf.Variable class). A tf.Variable is a specific kind of tensor meant to hold mutable state—for instance, the weights of a neural network are always tf.Variable instances.

Conclusion

Each full pass over the training data is called an epoch, regardless of whether weight updates are computed per mini-batch of batch_size samples (mini-batch SGD), per single sample (true SGD), or over the whole training set at once (batch gradient descent).

import tensorflow as tf

HYPHEN_LENGTH = 50
def make_section(title):
    # Convenience helper: print a section header followed by a rule of hyphens
    print('\n' + title + '\n' + HYPHEN_LENGTH * '-')

make_section("GradientTape on Scalar Value")
x = tf.Variable(0.)  # Instantiate a scalar Variable with an initial value of 0
with tf.GradientTape() as tape:  # Open a GradientTape scope
    y = 2 * x + 3  # Inside the scope, apply some tensor operations to the variable
grad_of_y_wrt_x = tape.gradient(y, x)  # Use the tape to retrieve the gradient of the output y with respect to the variable x
print("Gradient of y wrt x", grad_of_y_wrt_x)

make_section("Works with Tensor Operations")
x = tf.Variable(tf.random.uniform((2, 2)))  # Instantiate a Variable of shape (2,2) with random initial values
with tf.GradientTape() as tape:
    y = 2 * x + 3
grad_of_y_wrt_x = tape.gradient(y, x)
print("Gradient of y wrt x", grad_of_y_wrt_x)  # A tensor of shape (2,2) (like x) describing the curvature of y = 2 * x + 3 around the current value of x

make_section("Works on a List of Variables")
W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2,)))
x = tf.random.uniform((2, 2))
with tf.GradientTape() as tape:
    y = tf.matmul(x, W) + b  # matmul is how you say "dot product" in TensorFlow
grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])  # A list of two tensors with the same shapes as W and b, respectively
print("Gradient of y wrt W and b", grad_of_y_wrt_W_and_b)
out[8]

GradientTape on Scalar Value
--------------------------------------------------
Gradient of y wrt x tf.Tensor(2.0, shape=(), dtype=float32)

Works with Tensor Operations
--------------------------------------------------
Gradient of y wrt x tf.Tensor(
[[2. 2.]
[2. 2.]], shape=(2, 2), dtype=float32)

Works on a List of Variables
--------------------------------------------------
Gradient of y wrt x tf.Tensor(
[[2. 2.]
[2. 2.]], shape=(2, 2), dtype=float32)

Reimplementing the MNIST Example in TensorFlow

"""
A `Dense` layer implements the following input transformation, where W and b are model parameters, and `activation` is an element-wise function (usually `relu`, but it would be `softmax` for the last layer):

output = activation(dot(W,input) + b)
"""
class NaiveDense:
    """
    This class creates two variables, W and b, and exposes a __call__() method that applies the preceding transformation
    """
    def __init__(self, input_size, output_size, activation):
        self.activation = activation
        w_shape = (input_size, output_size) # Create a Matrix, W, of shape (input_size, output_size) initialized with random values 
        w_initial_value = tf.random.uniform(w_shape,minval=0,maxval=1e-1)
        self.W = tf.Variable(w_initial_value)

        b_shape = (output_size,) # Create a vector, b, of shape (output_size, ) initialized with zeros
        b_initial_value = tf.zeros(b_shape)
        self.b = tf.Variable(b_initial_value)

    def __call__(self, inputs): 
        """
        Apply the forward pass
        """
        return self.activation(tf.matmul(inputs, self.W) + self.b)
    @property
    def weights(self): # Convenience method for retrieving the layer's weights
        return [self.W,self.b]

class NaiveSequential:
    """
    Creating a NaiveSequential class to chain NaiveDense layers. It wraps a list of layers and exposes a __call__() method that simply calls the underlying layers on the inputs, in order. It also features a weights property to easily keep track of the layers' parameters
    """
    def __init__(self,layers):
        self.layers = layers 
    
    def __call__(self,inputs):
        x = inputs 
        for layer in self.layers:
            x = layer(x)
        return x
    
    @property
    def weights(self):
        weights = []
        for layer in self.layers:
            weights += layer.weights
        return weights


model = NaiveSequential([
    NaiveDense(input_size=28*28, output_size=512, activation=tf.nn.relu),
    NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
])

assert len(model.weights)  == 4

"""
Next, we need to iterate over the MNIST data in mini-batches. This is easy:
"""
import math 
class BatchGenerator:
    def __init__(self, images, labels, batch_size=128):
        assert len(images) == len(labels)
        self.index = 0
        self.images = images 
        self.labels = labels 
        self.batch_size = batch_size
        self.num_batches = math.ceil(len(images) / batch_size)
    def next(self):
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index: self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels 
    


def one_training_step(model, images_batch, labels_batch):
    """
    The most difficult part of the process is the "training step": updating the weights of the model after running it on one batch of data. We need to:

    1. Compute the predictions of the model for the images in the batch
    2. Compute the loss value for these predictions, given the actual labels
    3. Compute the gradient of the loss with regard to the model's weights
    4. Move the weights by a small amount in the direction opposite to the gradient
    """
    with tf.GradientTape() as tape:
        """
        Run the "foward pass" (compute the model's predictions under a GradientTape scope)
        """
        predictions = model(images_batch)
        per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(labels_batch, predictions)
        average_loss = tf.reduce_mean(per_sample_losses)
    """
    Compute the gradient of the loss with regard to the weights. The output gradients is a list where each entry corresponds to a weight from the model.weights list
    """
    gradients = tape.gradient(average_loss,model.weights) 
    update_weights(gradients,model.weights) # Update the weights using the gradients
    return average_loss

learning_rate = 1e-3
def update_weights(gradients,weights):
    """
    The purpose of the "weight update" step is to move the weights by "a bit" in a direction that reduces the loss on the batch. The magnitude of the move is determined by the "learning rate", typically a small quantity. The simplest way to implement this update_weights function is to subtract gradient * learning_rate from each weight
    """
    for g, w in zip(gradients,weights):
        w.assign_sub(g*learning_rate) # assign_sub is the equivalent of -= for TensorFlow variables

"""
The step above would almost never be implemented by hand. Instead, you would use an `Optimizer` instance from Keras:
"""
from tensorflow.keras import optimizers

optimizer = optimizers.SGD(learning_rate=1e-3)

def update_weights(gradients,weights):
    optimizer.apply_gradients(zip(gradients, weights))

"""
The full training loop
----------------------------------------------------

An epoch of training simply consists of repeating the training step for each batch in the training data, and the full training loop is simply the repetition of one epoch
"""
def fit(model, images, labels, epochs, batch_size=128):
    for epoch_counter in range(epochs):
        print(f"Epoch {epoch_counter}")
        batch_generator = BatchGenerator(images, labels, batch_size=batch_size)
        for batch_counter in range(batch_generator.num_batches):
            images_batch, labels_batch = batch_generator.next()
            loss = one_training_step(model, images_batch, labels_batch)
            if batch_counter % 100 == 0:
                print(f"loss at batch {batch_counter}: {loss:.2f}")
out[9]
"""
Testing Out Our Implementation
"""
from tensorflow.keras.datasets import mnist 

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

fit(model, train_images, train_labels, epochs=10, batch_size=128)
out[10]

Epoch 0
loss at batch 0: 4.81
loss at batch 100: 2.25
loss at batch 200: 2.26
loss at batch 300: 2.10
loss at batch 400: 2.22
Epoch 1
loss at batch 0: 1.91
loss at batch 100: 1.89
loss at batch 200: 1.88
loss at batch 300: 1.72
loss at batch 400: 1.83
Epoch 2
loss at batch 0: 1.59
loss at batch 100: 1.59
loss at batch 200: 1.55
loss at batch 300: 1.43
loss at batch 400: 1.52
Epoch 3
loss at batch 0: 1.33
loss at batch 100: 1.35
loss at batch 200: 1.28
loss at batch 300: 1.21
loss at batch 400: 1.29
Epoch 4
loss at batch 0: 1.13
loss at batch 100: 1.16
loss at batch 200: 1.08
loss at batch 300: 1.05
loss at batch 400: 1.12
Epoch 5
loss at batch 0: 0.99
loss at batch 100: 1.02
loss at batch 200: 0.93
loss at batch 300: 0.92
loss at batch 400: 1.01
Epoch 6
loss at batch 0: 0.88
loss at batch 100: 0.92
loss at batch 200: 0.83
loss at batch 300: 0.83
loss at batch 400: 0.92
Epoch 7
loss at batch 0: 0.79
loss at batch 100: 0.83
loss at batch 200: 0.75
loss at batch 300: 0.76
loss at batch 400: 0.85
Epoch 8
loss at batch 0: 0.73
loss at batch 100: 0.77
loss at batch 200: 0.68
loss at batch 300: 0.71
loss at batch 400: 0.80
Epoch 9
loss at batch 0: 0.68
loss at batch 100: 0.71
loss at batch 200: 0.63
loss at batch 300: 0.66
loss at batch 400: 0.75

"""
Evaluating the Model
"""
predictions = model(test_images)
predictions = predictions.numpy() # Calling .numpy() on a TensorFlow tensor converts it to a NumPy tensor
predicted_labels = np.argmax(predictions,axis=1)
matches = predicted_labels == test_labels
print(f"Accuracy: {matches.mean():.2f}")
out[11]

Accuracy: 0.82