Review of Machine Learning / Deep Learning Notes: Part 2

I want to review my machine learning / deep learning notes before beginning my New Year's resolution of completing one ML/DL project per day. I will be re-reading my Jupyter notebook notes from August-September 2024. These notebooks are my notes from reading a few textbooks on machine learning / deep learning. I have to split up the notes because they are too long.

Deep Learning with Python


Introduction to Keras and TensorFlow


TensorFlow is a Python-based, open source machine learning platform, developed primarily by Google.
  • It can automatically compute the gradient of any differentiable expression, making it highly suitable for machine learning
  • It can run not only on CPUs, but also on GPUs and TPUs, highly parallelized hardware accelerators
  • TensorFlow programs can be exported to other runtimes, such as C++, JavaScript, or TensorFlow Lite.

Keras is a deep learning API for Python, built on top of TensorFlow, that provides a convenient way to define and train any kind of deep learning model. It is highly recommended to run DL systems on GPU rather than CPU.

Training a neural network revolves around the following concepts:

  • First, low-level tensor manipulation - the infrastructure that underlies all modern machine learning. This translates to TensorFlow APIs: tensors (including variables, which store the network's state), tensor operations, and gradient computation via GradientTape (see the sketch after this list)
  • Second, high-level deep learning concepts. This translates to Keras APIs:
    • Layers: combined to make a model
    • Loss Function: defines the feedback signal used for learning
    • Optimizer: determines how learning proceeds
    • Metrics: evaluate model performance
    • Training Loop: performs mini-batch stochastic gradient descent
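
A minimal sketch tying these concepts together in Keras; it assumes train_images and train_labels are already loaded and vectorized (e.g. MNIST digits flattened to float vectors):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([                  # layers, combined into a model
        layers.Dense(512, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",          # optimizer: how learning proceeds
                  loss="sparse_categorical_crossentropy",  # loss: the feedback signal
                  metrics=["accuracy"])         # metrics: how performance is judged
    model.fit(train_images, train_labels,       # training loop: mini-batch SGD
              epochs=5, batch_size=128)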

The fundamental data structure in neural networks is the layer. A layer is a data processing module that takes as input one or more tensors and outputs one or more tensors. Some layers are stateless, but more frequently layers have a state: the layer's weights, one or several tensors learned with stochastic gradient descent, which together contain the network's knowledge. Different types of layers are appropriate in different situations:

  • Densely Connected Layers are appropriate for vector data (Dense)
  • Sequence data is typically processed by recurrent layers, such as an LSTM layer, or by 1D convolution layers
  • Image data, stored in rank-4 tensors, is usually processed by 2D convolutional layers

Building deep learning models in Keras is done by clipping together compatible layers to form useful data-transformation pipelines. Everything in Keras is either a Layer or something that closely interacts with a Layer. A Layer is an object that encapsulates some state (weights) and some computation (a forward pass). The notion of layer compatibility refers specifically to the fact that every layer will only accept tensors of a certain shape and will return output tensors of a certain shape. A deep learning model is a graph of layers. In Keras, that's the Model class.

The topology of a model defines a hypothesis space. By choosing a network topology, you constrain your space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data. To learn from data, you have to make assumptions about it. The structure of your hypothesis space is extremely important, and it encodes the assumptions you make about the problem, the prior knowledge that the model starts with.

The Fundamentals of Machine Learning


The fundamental issue in machine learning is the tension between optimization and generalization. Optimization refers to the process of adjusting a model to get the best performance possible on the training data (the learning in machine learning) whereas generalization refers to how well the trained model performs on data it has never seen before.

The manifold hypothesis posits that all natural data lies on a low-dimensional manifold within the high-dimensional space where it is encoded. That's a strong statement about the structure of information, and as far as we know, it's accurate and the reason why deep learning works.

The manifold hypothesis implies that:

  • Machine learning models only have to fit relatively simple, low-dimensional, highly structured subspaces within their potential input space
  • Within one of these manifolds, it's always possible to interpolate between two inputs, that is, to morph one into another via a continuous path along which all points fall on the manifold.

The ability to interpolate between samples is the key to understanding generalization in deep learning. While deep learning achieves generalization via interpolation on a learned approximation of the data manifold, that is not all there is to generalization. Interpolation helps you make sense of things that are very close to what you've seen before: it enables local generalization. Humans are capable of extreme generalization, which is enabled by cognitive mechanisms other than interpolation: abstraction, symbolic model of the world, reasoning, logic, common sense, innate priors about the world - what we call reason, as opposed to intuition and pattern recognition.

Properties of DL models that make them well-suited for learning latent manifolds:

  • Deep learning models implement a smooth, continuous mapping from their inputs to their outputs. They have to be smooth and continuous because they must be differentiable, by necessity. This smoothness helps approximate latent manifolds, which have the same properties.
  • Deep learning models tend to be structured in a way that mirrors the shape of the information in their training data. This is particularly the case for image-processing models and sequence-processing models. More generally, deep neural networks structure their learned representations in a hierarchical and modular way, which echoes the way natural data is organized.

Data curation and feature engineering are essential to generalization. Because deep learning is curve fitting, for a model to perform well it needs to be trained on a dense sampling of its input space. The best way to improve a deep learning model is to train it on more data or better data. The process of fighting overfitting by only focusing on more prominent (regular) patterns is called regularization.

Evaluating a model always boils down to splitting the available data into three sets: training, validation, and test. Tuning the configuration of the model based on its performance on the validation set can result in overfitting to the validation set, even though the model was never directly trained on it. Central to this is the notion of information leaks: every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model. Methods for splitting data into training, validation, and test sets:

  • Simple Holdout validation - sets apart some fraction of your data as the test set and the validation set.
  • K-Fold Validation - split your data into K partitions of equal size; for each partition i, train a model on the remaining K-1 partitions and evaluate it on partition i. The final score is the average of the K scores obtained (see the sketch after this list).
  • Iterated K-Fold Validation with Shuffling - for situations in which relatively little data is available and you need to evaluate your model as precisely as possible; it consists of applying K-fold validation multiple times, shuffling the data before each split.
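
A minimal sketch of plain K-fold validation, assuming data and targets are NumPy arrays and build_model() is a hypothetical helper that returns a fresh, compiled model:

    import numpy as np

    k = 4
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        # Partition i is held out for validation; the rest is used for training.
        val_data = data[i * fold_size : (i + 1) * fold_size]
        val_targets = targets[i * fold_size : (i + 1) * fold_size]
        train_data = np.concatenate([data[: i * fold_size], data[(i + 1) * fold_size :]])
        train_targets = np.concatenate([targets[: i * fold_size], targets[(i + 1) * fold_size :]])
        model = build_model()  # a fresh, untrained model for every fold
        model.fit(train_data, train_targets, epochs=10, batch_size=16, verbose=0)
        scores.append(model.evaluate(val_data, val_targets, verbose=0))
    final_score = np.mean(scores)  # the average of the K validation scores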

In deep learning, we always use models that are vastly overparameterized: they have way more degrees of freedom than the minimum necessary to fit the latent manifold of the data. This overparameterization is not an issue, because you never fully fit a deep learning model. Finding the exact point during training where you've reached the most generalizable fit - the exact boundary between an underfit curve and an overfit curve - is one of the most effective things you can do to improve generalization. This can be done with early stopping.

A common form of regularization is to put constraints on the weights of the model - forcing them to take small values, which makes the distribution of weight values more regular. This is called weight regularization, and it's done by adding to the loss function of the model a cost associated with having large weights (a sketch follows the list):

  • L1 Regularization: The cost added is proportional to the absolute value of the weight coefficients.
  • L2 Regularization: The cost added is proportional to the square of the value of weight coefficients.
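
A hedged sketch of weight regularization in Keras via the built-in regularizers module (the 0.002 factor is illustrative):

    from tensorflow.keras import layers, regularizers

    layer = layers.Dense(16, activation="relu",
        kernel_regularizer=regularizers.l2(0.002))  # adds 0.002 * weight ** 2 to the loss
    # regularizers.l1(...) and regularizers.l1_l2(l1=..., l2=...) work the same way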

Dropout is one of the most effective and most commonly used regularization techniques for neural networks. Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. The dropout rate is the fraction of the features that are zeroed out - usually set between 0.2 and 0.5. The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that aren't significant, which the model will start memorizing if no noise is present.
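
In Keras this is done via the Dropout layer, applied to the output of the layer right before it; a minimal sketch, assuming x is the output tensor of a previous layer in a Functional model:

    from tensorflow.keras import layers

    x = layers.Dense(16, activation="relu")(x)
    x = layers.Dropout(0.5)(x)  # zeroes out 50% of the output features during training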

The Universal Workflow of Machine Learning


  1. Define the task
  2. Develop a model
  3. Deploy the model

Concept drift - the properties of production data change over time, causing model accuracy to gradually decay. Sampling bias occurs when your data collection process interacts with what you are trying to predict, resulting in biased measurements. The hardest things in machine learning are framing problems and collecting, annotating, and cleaning data.

All inputs and targets in neural networks must typically be tensors of floating-point data. Whatever you need to process - sound, images, text - you must turn into tensors, a step called data vectorization. Data should have the following properties:

  • Small values: most values should be in the 0-1 range
  • Homogeneous: all features should take values roughly in the same range

The following stricter normalization process is common and can help, though it isn't always necessary (a sketch follows the list):

  • Normalize each feature independently to have a mean of 0
  • Normalize each feature independently to have a standard deviation of 1.
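
A minimal NumPy sketch of this feature-wise normalization, assuming x is a float array of shape (samples, features):

    x -= x.mean(axis=0)  # each feature now has mean 0
    x /= x.std(axis=0)   # each feature now has standard deviation 1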

Activation and Last Layer for DL Model

The standard pairings for common problem types are:

  • Binary classification: sigmoid last-layer activation, binary_crossentropy loss
  • Multiclass, single-label classification: softmax activation, categorical_crossentropy loss
  • Multiclass, multilabel classification: sigmoid activation, binary_crossentropy loss
  • Regression to arbitrary values: no last-layer activation, mse loss
  • Regression to values in the 0-1 range: sigmoid activation, mse or binary_crossentropy loss

Deploying a smaller version of a model (a sketch follows the list):

  • Weight pruning: only keep the most significant weights
  • Weight quantization: decreasing the size of the weights by changing their data type to something that takes fewer bytes
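
As one concrete example of the second point, TensorFlow Lite supports post-training weight quantization when converting a trained Keras model (a sketch; model is assumed to be already trained):

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
    tflite_model = converter.convert()                    # smaller model, same architecture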

Working with Keras: A Deep Dive


The Keras API is guided by the principle of progressive disclosure of complexity: make it easy to get started, yet make it possible to handle high-complexity use cases, only requiring incremental learning at each step. Three APIs for building models in Keras (a sketch of the first two follows the list):

  1. The Sequential model: the most approachable API. It's limited to a simple stack of layers.
  2. The Functional API: focuses on graph-like model architectures. It represents a nice mid-point between usability and flexibility and as such, it's the most commonly used model-building API.
    1. In general, the Functional API provides you with a good trade-off between ease of use and flexibility. It also gives you direct access to layer connectivity, which is very powerful for use cases such as model plotting and feature extraction.
  3. Model subclassing, a low-level option where you write everything yourself from scratch. This is ideal if you want full control over every little thing.
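
The same two-layer classifier written against the first two APIs, as a sketch (shapes are illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    # 1. Sequential: a plain stack of layers.
    model = keras.Sequential([
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

    # 2. Functional: explicit inputs/outputs, so graph-like topologies are possible.
    inputs = keras.Input(shape=(784,))
    features = layers.Dense(64, activation="relu")(inputs)
    outputs = layers.Dense(10, activation="softmax")(features)
    model = keras.Model(inputs=inputs, outputs=outputs)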

Introduction to Deep Learning for Computer Vision


Computer vision is the earliest and biggest success story of deep learning. The fundamental difference between a densely connected layer and a convolutional layer is this: Dense layers learn global patterns in their input feature space, whereas convolutional layers learn local patterns - in the case of images, patterns found in small 2D windows of the inputs.

Convnets have some interesting properties:

  • The patterns they learn are translation-invariant. After learning a pattern in the lower-right corner of a picture, a convnet can recognize it anywhere. Convnets are efficient when processing images because the visual world is fundamentally translation-invariant.
  • They can learn spatial hierarchies of patterns. A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made up of features of the first layers, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts, because the visual world is fundamentally spatially hierarchical.

Convolutions operate over rank-3 tensors called feature maps, with two spatial axes (width and height) as well as a depth axis (also called the channels axis). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. The output feature map is still a rank-3 tensor, but the different layers in its depth axis no longer stand for specific colors as in the RGB input; rather, they stand for filters. Filters encode specific aspects of the input data: a single filter could encode the concept "presence of a face in the input." Each channel of the output is a response map of one filter over the input, indicating the response of that filter pattern at different locations in the input.

A convolution works by sliding windows of size W×H over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features, of shape (window_height, window_width, input_depth). Each such 3D patch is then transformed into a 1D vector of shape (output_depth,), which is done via a tensor product with a learned weight matrix called the convolution kernel - the same kernel is reused across every patch. The vectors are then spatially reassembled into a 3D output of shape (height, width, output_depth). Padding consists of adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit centered convolution windows around every input tile. The distance between two successive windows is a parameter of the convolution, called its stride, which defaults to 1.
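
For reference, the standard output-size arithmetic (a textbook formula, not from these notes): for input size n, window size w, padding p per side, and stride s, each spatial dimension of the output is

\left\lfloor \frac{n + 2p - w}{s} \right\rfloor + 1

so a 3×3 convolution with stride 1 and no padding maps a 28×28 input to a 26×26 output.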

How Convolution Works

The reason to use downsampling (pooling layers) is to reduce the number of feature-map coefficients to process, as well as to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows.

There are two ways to use a pretrained model: feature extraction and fine-tuning.

Feature extraction consists of using the representations learned by a previously trained model to extract features from new samples. Representations learned by the convolutional base are likely to be more generic and therefore more reusable. Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction and jointly training both the newly added part of the model and these top layers.
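
A hedged Keras sketch of feature extraction with a frozen convolutional base (VGG16 is just one possible choice; the input size and classifier head are illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    conv_base = keras.applications.VGG16(weights="imagenet", include_top=False)
    conv_base.trainable = False  # freeze the base: only the new classifier is trained

    inputs = keras.Input(shape=(180, 180, 3))
    x = conv_base(inputs)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)

    # For fine-tuning, you would later unfreeze a few top layers of conv_base
    # and retrain everything with a very low learning rate.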

Advanced Deep Learning for Computer Vision


Three essential computer vision tasks:

  1. Image Classification: the goal is to assign one or more labels to an image. It can be single label or multi-label classification
  2. Image Segmentation: The goal is to segment or partition an image into several different areas, with each area usually representing a category.
    1. Semantic segmentation: each pixel is independently classified into a semantic category, like cat. If there are two cats in the image, the corresponding pixels are all mapped to the generic cat category.
    2. Instance segmentation: seeks not only to classify image pixels by category, but also to parse out individual object instances.
    3. A segmentation mask is the image-segmentation equivalent of a label: it's an image the same size as the input image, with a single color channel where each integer value corresponds to the class of the corresponding pixel in the input image.
  3. Object Detection: The goal is to draw rectangles (called bounding boxes) around objects of interest in an image, and associate each rectangle with a class.

A model's architecture is the sum of the choices that went into crafting it: which layers to use, how to configure them, and in what arrangement to connect them. These choices define the hypothesis space of your model: the space of possible functions that gradient descent can search over, parameterized by the model's weights. Like feature engineering, a good hypothesis space encodes prior knowledge that you have about the problem at hand and its solution. A good model architecture is one that reduces the size of the search space or otherwise makes it easier to converge to a good point of the search space. Model architecture is about making the problem simpler for gradient descent to solve.

If you want to make a complex system simpler, there's a universal recipe you can apply: just structure your amorphous soup of complexity into modules, organize the modules into a hierarchy, and start reusing the same modules as appropriate.

In general, a deep stack of narrow layers performs better than a shallow stack of large layers. There's a limit on how deep you can stack layers due to the problem of vanishing gradients. A way to fix vanishing gradients is by adding a residual connection. The residual connection acts as an information shortcut around destructive or noisy blocks, enabling error gradient information from earlier layers to propagate noiselessly through a deep network.
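
A minimal sketch of a residual connection in the Keras Functional style, assuming x is an intermediate tensor; the 1×1 convolution projects the shortcut so the shapes match before adding:

    from tensorflow.keras import layers

    residual = x                               # save the input of the block
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    residual = layers.Conv2D(64, 1)(residual)  # project the shortcut to 64 channels
    x = layers.add([x, residual])              # the residual connection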

The depthwise separable convolution layer is a drop-in replacement for Conv2D that will make the model smaller and leaner and cause it to perform a few percentage points better on its task. This layer performs spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution (a 1×1 convolution). Depthwise separable convolution relies on the assumption that spatial locations in intermediate activations are highly correlated, but different channels are highly independent. This assumption is generally true for image representations learned by deep neural networks.

The representations learned by convnets are highly amenable to visualization, in part because they're representations of visual concepts. Techniques for visualizing and interpreting these representations:

  • Visualizing intermediate convnet outputs: useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters
  • Visualizing convnet filters: useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to
  • Visualizing heatmaps of class activation in an image: Useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in an image.

Things to note about convnets:

  • The first layers act as collections of edge detectors.
  • As you go deeper, the activations become increasingly abstract and less visually interpretable. Deeper representations carry less information about the visual contents of the image, and increasingly more information about the class of the image
  • The sparsity of the activations increases with the depth of the layer
  • Each layer in a convnet learns a collection of filters such that their inputs can be expressed as a combination of the filters - this is similar to how the Fourier transform decomposes signals into a bank of cosine functions. The filters get increasingly complex and refined as you go deeper into the model.

Periodicity over multiple timescales is an important and very common property of timeseries data. When exploring your data, make sure to look for these patterns.

A major characteristic of non-recurrent neural networks is that they have no memory. Each input shown to them is processed independently, with no state kept between inputs. With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once: turn it into a single data point. A recurrent neural network (RNN) processes sequences by iterating through the sequence elements and maintaining a state that contains information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop. The state of the RNN is reset between processing two different, independent sequences (such as two samples in a batch), so you still consider one sequence to be a single data point: a single input to the network. The data point is no longer processed in a single step, however; the network internally loops over the sequence elements.

In summary, an RNN is a for loop that reuses quantities computed during the previous iteration of the loop, nothing more. The LSTM algorithm was developed to address the vanishing-gradient problem of simple RNNs. The LSTM adds a way to carry information across many timesteps. It saves information for later, preventing older signals from gradually vanishing during processing - this should remind you of residual connections.
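
The "RNN as a for loop" idea, as a runnable NumPy sketch of a simple RNN forward pass (sizes are illustrative):

    import numpy as np

    timesteps, input_features, output_features = 100, 32, 64
    inputs = np.random.random((timesteps, input_features))
    state_t = np.zeros((output_features,))       # the initial state: all zeros

    W = np.random.random((output_features, input_features))
    U = np.random.random((output_features, output_features))
    b = np.random.random((output_features,))

    outputs = []
    for input_t in inputs:                       # the RNN is just a for loop
        output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
        outputs.append(output_t)
        state_t = output_t                       # reuse the previous output as state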

Advanced use of Recurrent Neural Networks

  • Recurrent Dropout: a variant of dropout, used to fight overfitting in recurrent layers
  • Stacking Recurrent Layers: increases the representational power of the model (at the cost of higher computational loads)
  • Bidirectional Recurrent Layers: present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues

Deep Learning for Text


NLP is pattern recognition applied to words, sentences, and paragraphs.

Vectorizing text is the process of transforming text into numeric tensors.

  • First, you standardize the text to make it easier to process, such as by converting it to lowercase or removing punctuation
  • You split the text into units (called tokens), such as characters, words, or groups of words. This is called tokenization.
  • You convert each such token into a numerical vector. This will usually involve indexing all tokens present in the data.

Once you tokenize text, you need to encode each token into a numerical representation. The way you'd go about this is to build an index of all terms found in the training data (the vocabulary) and assign a unique integer to each entry in the vocabulary. You can then convert each integer into a vector encoding that can be processed by a neural network, for instance by one-hot encoding it. It's common to restrict the vocabulary to the top 20,000 or 30,000 most common words found in the training data.
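
A bare-bones sketch of that indexing step in plain Python (toy corpus, no standardization; index 1 is reserved for out-of-vocabulary tokens):

    vocabulary = {"": 0, "[UNK]": 1}   # reserve indices for padding and unknown tokens
    for text in ["the cat sat", "the dog ran"]:
        for token in text.split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)

    encoded = [vocabulary.get(tok, 1) for tok in "the cat ran".split()]
    # encoded == [2, 3, 6]: one integer per token, ready for an Embedding layer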

Bag-of-words models discard word order and treat text as an unordered set of words. Because they take word order into account, both RNNs and Transformers are called sequence models. To implement a sequence model, you'd start by representing your input as sequences of integer indices (one integer standing for one word). Then, you'd map each integer to a vector to obtain vector sequences. Finally, you'd feed these sequences of vectors into a stack of layers that can cross-correlate features from adjacent vectors, such as a 1D convnet, an RNN, or a Transformer.

The geometric relationship between two word vectors should reflect the semantic relationship between two words. Word embeddings are vector representations of words that map human language into a structured geometric space. Word embeddings are low-dimensional floating-point vectors (dense vectors). Word embeddings are structured representations - their structure is learned from data. Two ways to learn word embeddings:

  • Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction)
  • Load into your model word embeddings that were precomputed using a different machine learning task than the one you are trying to solve. These are called pretrained word embeddings.
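
In Keras, the first option is essentially one layer: an Embedding layer learns the token-to-vector mapping jointly with the task (the sizes below are illustrative):

    from tensorflow.keras import layers

    # Maps integer token indices (up to a 20,000-word vocabulary) to
    # 256-dimensional dense vectors, learned by backpropagation with the model.
    embedding_layer = layers.Embedding(input_dim=20000, output_dim=256)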

The Transformer Architecture

The gist of Attention Is All You Need: a simple mechanism called neural attention could be used to build powerful sequence models that didn't feature any recurrent or convolutional layers. The purpose of self-attention is to modulate the representation of a token by using the representations of related tokens in the sequence. This produces context-aware token representations.

A Transformer is a sequence-to-sequence model: it was designed to convert one sequence into another. For each element in the query, compute how much the element is related to every key and use these scores to weight a sum of values.

The multi-head moniker refers to the fact that the output space of the self-attention layer gets factored into a set of independent sub-spaces, learned separately: the initial query, key, and value are sent through three independent sets of dense projections, resulting in three separate vectors. Each vector is processed via neural attention, and the three outputs are concatenated back together into a single output sequence. Each such subspace is called a head.

The idea behind positional encoding is very simple: to give the model access to word-order information.

Bag-of-words is still a valid and relevant approach in many cases. When deciding between bag-of-words and Transformers: when approaching a new text-classification task, pay close attention to the ratio between the number of samples in your training data and the mean number of words per sample. If that ratio is small - less than 1,500 - the bag-of-bigrams model will perform better. If that ratio is higher than 1,500, go with a sequence model. (Remember, this rule applies to text classification only.)

Generative Deep Learning


The universal way to generate sequence data in deep learning is to train a model (usually a Transformer or an RNN) to predict the next token or next few tokens in a sequence, using the previous tokens as input. When working with text data, tokens are typically words or characters, and any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language: its statistical structure.

Once you have a trained language model, you can sample from it (generate new sequences): you feed it an initial string of text (called conditioning data), ask it to generate the next character or next word, add the generated output back to the input data, and repeat the process many times.

When generating text, the way you choose the next token is critically important. A naive approach is greedy sampling - always choosing the most likely next character (this doesn't work well). Stochastic sampling introduces randomness into the process by sampling from the probability distribution of the next character. On its own, this strategy doesn't offer a way to control the amount of randomness in the sampling process.

The softmax temperature parameter characterizes the entropy of the probability distribution used for sampling: it controls how surprising or predictable the choice of the next word will be.
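
A small NumPy sketch of this reweighting, along the lines of the book's approach (assumes probabilities is a 1D array of strictly positive values summing to 1):

    import numpy as np

    def sample_next(probabilities, temperature=1.0):
        # Reweight the distribution by the temperature, then renormalize.
        logits = np.log(probabilities) / temperature
        exp_logits = np.exp(logits)
        reweighted = exp_logits / np.sum(exp_logits)
        # Stochastic sampling: draw the next token index from the new distribution.
        return np.argmax(np.random.multinomial(1, reweighted, 1))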

Natural Language Processing with PyTorch


Introduction


Natural Language Processing refers to a set of techniques involving the application of statistical methods, with or without insights from linguistics, to understand text for the sake of solving real-world tasks. This "understanding" of text is mainly derived by transforming texts to usable computational representations, which are discrete or continuous combinatorial structures such as vectors or tensors, graphs, or trees. Deep learning enables one to efficiently learn representations from data using an abstraction called the computational graph and numerical optimization techniques.

  • Supervised Learning refers to cases where the ground truth for targets (what's being predicted) is available for the observations.
  • Observations are items about which we want to predict something.
  • A loss function is a function that compares how far off a prediction is from its target for observations in the training data.
  • The goal of supervised learning is to pick values of parameters that minimize the cost function for a given dataset. Gradient descent is a common technique for this kind of minimization; it can be viewed as searching for roots of the gradient of the cost function.
  • Due to memory constraints, an approximation of gradient descent called stochastic gradient descent, in which data points are picked at random and the gradient is computed for that subset, is used. When a subset of more than one data point is used, we call it minibatch SGD (see the sketch after this list).
  • The process of iteratively updating the parameters is called backpropagation.
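
A minimal PyTorch sketch of one minibatch-SGD step for a supervised classifier (the model and data here are toy placeholders):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)                  # a toy model: 10 features -> 2 classes
    loss_fn = nn.CrossEntropyLoss()           # compares predictions to targets
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(32, 10)                   # one minibatch of 32 observations
    y = torch.randint(0, 2, (32,))            # ground-truth targets

    optimizer.zero_grad()                     # clear gradients from the last step
    loss = loss_fn(model(x), y)               # forward pass + loss
    loss.backward()                           # backpropagation: compute gradients
    optimizer.step()                          # update the parameters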

Observation and Target Encoding

The one-hot representation starts with a zero vector, and sets to 1 the corresponding entry in the vector if the word is present in the sentence or document. The term frequency (TF) of a phrase, sentence, or document is simply the sum of the one-hot representations of its constituent words.

A computational graph is an abstraction that models mathematical expressions.

Natural Language Processing (NLP) aims to develop methods for solving problems involving language, such as information extraction, automatic speech recognition, machine translation, sentiment analysis, question answering, and summarization.

Computational linguistics employs computational methods to understand properties of the human language.

All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus. The raw text is a sequence of characters (bytes), but most times it is useful to group those characters into contiguous units called tokens. The metadata could be any auxiliary piece of information associated with the text, like identifiers, labels, and timestamps. In machine learning parlance, the text along with its metadata is called an instance or data point. The collection of instances is known as a dataset.

The process of breaking text down into tokens is called tokenization. Types are unique tokens present in a corpus. The set of all types in a corpus is called its vocabulary or lexicon. Words can be distinguished as content words and stop words. Stop words, such as articles and prepositions, serve mostly a grammatical purpose. N-grams are fixed-length (n) consecutive token sequences occurring in the text. Lemmas are root forms of words: fly is the lemma of flew, flies, flown, flying, and so on. The reduction of tokens to their lemmas, to keep the dimensionality of the vector representation low, is called lemmatization. Stemming is poor-man's lemmatization: it involves the use of handcrafted rules to strip the endings of words, reducing them to a common form called stems.

The task of identifying the relationships between the phrasal units produced by shallow parsing is called parsing. Parse trees indicate how different grammatical units in a sentence are related hierarchically.

Activation Functions

Loss Functions

  • Mean Squared Error / Root Mean Squared Error: used in regression problems
  • Categorical Cross Entropy Loss: used in multiclass classification setting in which the outputs are interpreted as predictions of class membership probabilities
  • Binary Cross Entropy: used when distinguishing between two classes (see the sketch after this list)
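
In PyTorch these correspond roughly to the following loss modules (a sketch; note that CrossEntropyLoss expects raw logits rather than probabilities):

    import torch.nn as nn

    mse_loss = nn.MSELoss()            # regression
    ce_loss = nn.CrossEntropyLoss()    # multiclass classification (takes logits)
    bce_loss = nn.BCEWithLogitsLoss()  # binary classification (takes a single logit)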

The most common method [to use to know when to stop training] is to use a heuristic called early stopping. Early stopping works by keeping track of the performance on the validation dataset from epoch to epoch and noticing when the performance no longer improves. Then, if the performance continues to not improve, the training is terminated. The number of epochs to wait before terminating the training is referred to as the patience. In general, the point at which a model stops improving on some dataset is said to be when the model has converged.
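
A sketch of the patience heuristic, where train_one_epoch and evaluate are hypothetical helpers:

    best_loss, patience, epochs_without_improvement = float("inf"), 5, 0
    for epoch in range(100):
        train_one_epoch(model)                 # hypothetical training helper
        val_loss = evaluate(model)             # hypothetical validation helper
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                              # validation stopped improving: stop early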

Learning intermediate representations that have specific properties, like being linearly separable for a classification task, is one of the most profound capabilities of using neural networks and is quintessential to their modeling capabilities.

Generative Deep Learning


Generative Modeling


A generative model describes how a dataset is generated, in terms of a probabilistic model. By sampling from this model, we are able to generate new data. The goal is to build a model that can generate new sets of features that look as if they have been created using the same rules as the original data. Discriminative modeling estimates p(y|x) - the probability of a label y given an observation x. Generative modeling estimates p(x) - the probability of observing observation x.

Generative Modeling Framework

  • We have a dataset of observations X
  • We assume that the observations have been generated according to some unknown distribution, p_data
  • A generative model p_model tries to mimic p_data. If we achieve this goal, we can sample from p_model to generate observations that appear to have been drawn from p_data
  • We are impressed by p_model if:
    • It can generate examples that appear to have been drawn from p_data
    • It can generate examples that are suitably different from the observations in X. In other words, the model shouldn't simply reproduce things it has already seen
  • The sample space is the complete set of all values an observation x can take
  • A probability density function (or simply density function), p(x), is a function that maps a point x in the sample space to a number between 0 and 1. The sum of the density function over all points in the sample space must equal 1, so that it is a well-defined probability distribution.

While there is only one true density function p_data(x) that is assumed to have generated the observable dataset, there are infinitely many density functions that we can use to estimate p_data(x).

  • A parametric model, p_θ(x), is a family of density functions that can be described using a finite number of parameters, θ
  • The likelihood L(θ|x) of a parameter set θ is a function that measures the plausibility of θ, given some observed point x. It is defined as L(θ|x) = p_θ(x). That is, the likelihood of θ given some observed point x is defined to be the value of the density function parameterized by θ, at point x. We are simply defining the likelihood of the set of parameters θ to be equal to the probability of seeing the data under the model parameterized by θ.

The focus of parametric modeling should be to find the optimal value θ̂ of the parameter set that maximizes the likelihood of observing the dataset X. This technique is called maximum likelihood estimation.

  • Maximum likelihood estimation is a technique that allows us to estimate θ̂ - the set of parameters of a density function, p_θ(x), that is most likely to explain some observed data X. Formally, θ̂ = argmax_θ L(θ|X).
In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided dice rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
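
A tiny worked MLE example under a multinomial model (a standard result, not from the notes): if a k-sided die rolled n times shows face j a total of n_j times, the likelihood-maximizing parameters are the observed frequencies,

\hat{\theta}_j = \frac{n_j}{n}

so counts of (2, 3, 5) over n = 10 rolls give θ̂ = (0.2, 0.3, 0.5).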

The Naive Bayes parametric model makes use of a simple assumption (the Naive Bayes assumption) that drastically reduces the number of parameters we need to estimate: it naively assumes that each feature x_i is independent of every other feature x_j. The Naive Bayes assumption does not hold for problems where features are not independent of one another, or where there is an incomprehensibly vast number of possible observations in the sample space.

Generative Modeling Challenges:

  • How does the model cope with the high degree of conditional dependence between features?
  • How does the model find one of the tiny proportion of satisfying possible generated observations among a high-dimensional sample space?

Deep learning is the key to solving both of these challenges. The fact that deep learning can form its own features in a lower-dimensional space means that it is a form of representation learning. The core idea behind representation learning is that instead of trying to model the high-dimensional sample space directly, we should instead describe each observation in the training set using some low-dimensional latent space and map it to a point in the original domain. In other words, each point in the latent space is the representation of some high-dimensional image.

Deep Learning


Deep learning is a class of machine learning algorithms that uses multiple stacked layers of processing units to learn high-level representations from unstructured data. The majority of deep learning systems are artificial neural networks. For this reason, deep learning has now almost become synonymous with deep neural networks.

Much of this content has been covered multiple times in my review notes, so I am skipping over it.

Variational Autoencoders


The variational autoencoder is now one of the most fundamental and well-known deep learning architectures for generative modeling. An autoencoder is a neural network made up of two parts:

  1. An encoder that compresses high-dimensional input data into a lower-dimensional vector
  2. A decoder network that decompresses a given representation vector back to the original domain.

The network is trained to find weights for the encoder and decoder that minimize the loss between the original input and the reconstruction after it has passed through the encoder and decoder. The representational vector is a compression of the original image into a lower-dimensional, latent space. The idea is that by choosing any point in the latent space, we should be able to generate novel images by passing this point through the decoder, since the decoder has learned how to convert points in the latent space into viable images.
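
A compact Keras sketch of the two halves, for intuition (flattened 28×28 inputs and a 2-dimensional latent space; sizes are illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    encoder = keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(784,)),
        layers.Dense(2),                          # the low-dimensional latent vector
    ])
    decoder = keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(2,)),
        layers.Dense(784, activation="sigmoid"),  # back to the original domain
    ])
    autoencoder = keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    # Trained with the inputs as their own targets: autoencoder.fit(x, x, ...)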

The convolutional transpose layer uses the same principle as a standard convolutional layer (passing a filter across the image), but is different in that setting strides = 2 doubles the size of the input tensor in both height and width.

In an autoencoder, each image is mapped directly to one point in the latent space. In a variational autoencoder, each image is instead mapped to a multivariate normal distribution around a point in latent space.

Difference between Autoencoder and Variational Autoencoder

Variational autoencoders assume that there is no correlation between any of the dimensions in the latent space and therefore that the covariance matrix is diagonal. This means the encoder only needs to map each input to a mean vector and a variance vector and does not need to worry about covariance between dimensions.
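
The sampling step this implies, as a hedged TensorFlow sketch of the reparameterization trick (z_mean and z_log_var are the encoder's two output vectors):

    import tensorflow as tf

    def sample_latent(z_mean, z_log_var):
        # Draw epsilon from a standard normal, then shift and scale it so the
        # result is a sample from N(z_mean, exp(z_log_var)) with diagonal covariance.
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon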

Generative Adversarial Networks


A GAN is a battle between two adversaries, the generator and the discriminator. The generator tries to convert random noise into observations that look as if they have been sampled from the original dataset, and the discriminator tries to predict whether an observation comes from the original dataset or is one of the generator's forgeries.

GAN

The key to GANs lies in how we alternate the training of the two networks: as the generator becomes more adept at fooling the discriminator, the discriminator must adapt in order to maintain its ability to correctly identify which observations are fake. This drives the generator to find new ways to fool the discriminator, and so the cycle continues.

The input to the generator is a vector, usually drawn from a multivariate normal distribution. The output is an image the same size as an image in the original training data. The generator of a GAN fulfills exactly the same purpose as the decoder of a VAE: converting a vector in the latent space to an image. The goal of the discriminator is to predict whether an image is real or fake, which is a supervised image-classification problem. It is commonplace to use convolutional layers in GANs, even though the original paper used dense layers.

We can train the discriminator by creating a training set where some of the images are randomly selected real observations from the training set and some are outputs from the generator. The response would be 1 for the true images and 0 for the generated images.

To train the generator, we must first connect it to the discriminator to create a Keras model that we can train. Specifically, we feed the output from the generator into the discriminator so that the output from this combined model is the probability that the generated image is real, according to the discriminator. We must freeze the weights of the discriminator while we are training the combined model, so that only the generator's weights are updated.
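
A schematic sketch of one alternating training step, assuming generator, discriminator, and a combined gan model (with the discriminator frozen inside it) have already been built and compiled, and x_train holds the real images:

    import numpy as np

    batch_size, latent_dim = 64, 100

    # 1. Train the discriminator on a half-real, half-fake batch.
    z = np.random.normal(size=(batch_size, latent_dim))
    fake_images = generator.predict(z)
    real_images = x_train[np.random.randint(0, len(x_train), batch_size)]
    discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))   # real -> 1
    discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))  # fake -> 0

    # 2. Train the generator through the combined model (discriminator frozen):
    #    the generator is rewarded when the discriminator says "real".
    z = np.random.normal(size=(batch_size, latent_dim))
    gan.train_on_batch(z, np.ones((batch_size, 1)))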

Training GAN

Natural Language Processing with Transformers


Transformer Models


NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is to not only understand single words individually, but to be able to understand the context of those words.

The Transformer Architecture was introduced in June 2017. Broadly, the kinds of transformer models can be grouped into three categories:

  • GPT-like: auto-regressive Transformer models
  • BERT-like: auto-encoding Transformer models
  • BART/T5-like: sequence-to-sequence Transformer models

The Transformer models above are trained as language models - they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. A model of this type develops a statistical understanding of the language it has been trained on, and it is then fine-tuned on a task in a supervised way, a process called transfer learning.

Encoder Models

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having bi-directional attention, and are often called auto-encoding models. The pretraining of these models revolves around somehow corrupting a given sentence (e.g., masking some words in it) and tasking the model with finding or reconstructing the initial sentence. These models are best suited for sentence classification, named entity recognition (NER), and extractive question answering.

Decoder Models

Decoder models use only the decoder of a Transformer model. At each stage, for a given word, the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models. The pretraining of decoder models usually revolves around predicting the next word in the sentence. These models are best suited for text generation.

Sequence-to-Sequence Models

Encoder-decoder/sequence-to-sequence models use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input. The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. These models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.

Transformer Anatomy


The original transformer is based on the encoder-decoder architecture that is widely used for tasks such as machine translation. The architecture consists of two components:

  1. Encoder: Converts an input sequence of tokens into a sequence of embedding vectors, often called the hidden state or context
  2. Decoder: Uses the encoder's hidden state to iteratively generate an output sequence of tokens, one token at a time

Transformer Architecture

Things that characterize the transformer architecture:

  • The input text is tokenized and converted to token embeddings using the techniques covered in the previous chapter. Since the attention mechanism is not aware of the relative positions of the tokens, we need a way to inject information about token positions into the input to model the sequential nature of the text. The token embeddings are thus combined with positional embeddings that contain positional information for each token.
  • The encoder is composed of a stack of encoder layers or blocks which is analogous to stacking convolutional layers in computer vision. The same is true of the decoder, which has its own stack of decoder layers.
  • The encoder's output is fed to each decoder layer, and the decoder then generates a prediction for the most probable next token in the sequence. The output of this next step is then fed back into the decoder to generate the next token, and so on until a special end-of-sequence (EOS) token is received.

Most transformer models fall under three categories:

  • Encoder-only: These models convert an input sequence of text into a rich numerical representation that is well suited for tasks like text classification or named entity recognition. BERT and its variants belong to this class of architecture. The representation computed for a given task in this architecture depends both on the left (before the token) and the right (after the token) contexts. This is often called bidirectional attention.
  • Decoder-only: These models will autocomplete a sequence by iteratively predicting the most probable next word. The GPT family of models belongs to this class. The representation computed for a given token in this architecture depends only on the left context. This is called causal or autoregressive attention.
  • Encoder-decoder: These are used for modeling complex mappings from one sequence of text to another; they're suitable for machine translation and summarization tasks.

In reality, the distinction between encoder-only and decoder-only tasks is a bit blurry.

The Encoder

The encoder consists of many encoder layers stacked next to each other. Each encoder layer receives a sequence of embeddings and feeds them through the following sublayers:

  • A multi-head attention layer
  • A fully connected forward layer that is applied to each input embedding

The output embeddings of each encoder layer have the same size as its inputs, and we'll soon see that the main role of the encoder stack is to update the input embeddings to produce representations that encode some contextual information in the sequence.

Self-Attention

Attention is a mechanism that allows neural networks to assign a different amount of weight or attention to each element in a sequence. The "self" part of self-attention refers to the fact that these weights are computed for all hidden states in the same set. The main idea behind self-attention is that instead of using a fixed embedding for each token, we use the whole sequence to compute a weighted average of each embedding. Embeddings that are generated using the words around them are called contextual embeddings, and they predate the invention of transformers.

Scaled Dot-Product Attention

There are several ways to implement a self-attention layer, but the most common one is scaled dot-product attention. The four main steps to implement this mechanism:

  1. Project each token embedding into three vectors called query, key, and value.
  2. Compute attention scores. Determine how much the query and key relate to each other using a similarity function. The similarity function for scaled dot-product attention is the dot product, computed efficiently using matrix multiplication of the embeddings. Queries and keys that are similar will have a large dot product, while those that don't share much in common will have little to no overlap. The outputs of this step are called attention scores, and for a sequence with n input tokens there is a corresponding n × n matrix of attention scores.
  3. Compute attention weights. Attention scores are first multiplied by a scaling factor to normalize their variance and then normalized with a softmax to ensure all the column values sum to 1. The resulting n × n matrix contains all the attention weights, w_ji.
  4. Update the token embeddings. Once the attention weights are computed, we multiply them by the value vectors to obtain an updated representation for each embedding: x_i' = Σ_j w_ji v_j (a sketch follows this list).
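
A minimal PyTorch sketch of scaled dot-product attention over a batch of sequences (in the full layer, query, key, and value come from three learned linear projections; here they are taken as given):

    import torch
    import torch.nn.functional as F
    from math import sqrt

    def scaled_dot_product_attention(query, key, value):
        # query, key, value: tensors of shape (batch, seq_len, head_dim)
        dim_k = query.size(-1)
        scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)  # attention scores
        weights = F.softmax(scores, dim=-1)                           # attention weights
        return torch.bmm(weights, value)                              # updated embeddings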

Multi-headed Attention

In practice, the self-attention layer applies three independent linear transformations to each embedding to generate the query, key, and value vectors. These transformations project the embeddings, and each projection carries its own set of learnable parameters, which allows the self-attention layer to focus on different semantic aspects of the sequence. It turns out to be beneficial to have multiple sets of linear projections, each one representing a so-called attention head.

The Transformer architecture makes use of layer normalization and skip connections. The former normalizes each input in the batch to have zero mean and unit variance. Skip connections pass a tensor to the next layer of the model without processing, and add it to the processed tensor.

Positional Embeddings

Positional embeddings are based on a simple, yet effective idea: augment the token embeddings with a position-dependent pattern of values arranged in a vector. If the pattern is characteristic for each position, the attention heads and feed-forward layers in each stack can learn to incorporate positional information into their transformations.
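
A common way to implement learnable positional embeddings in PyTorch (a sketch; vocabulary size, sequence length, and hidden size are illustrative): a second embedding table indexed by position, added to the token embeddings:

    import torch
    import torch.nn as nn

    token_emb = nn.Embedding(30000, 768)   # vocabulary size x hidden size
    pos_emb = nn.Embedding(512, 768)       # max sequence length x hidden size

    input_ids = torch.tensor([[101, 2009, 2001, 102]])        # a toy tokenized sequence
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]
    embeddings = token_emb(input_ids) + pos_emb(positions)    # position-aware embeddings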

The Decoder

The main difference between a decoder and encoder is that the decoder has two attention sublayers:

  • Masked multi-head self-attention layer: ensures that the tokens we generate at each timestep are based only on the past outputs and the current token being predicted.
  • Encoder-decoder attention layer: Performs multi-head attention over the output key and value vectors of the encoder stack, with the intermediate representations of the decoder acting as the queries. This way the encoder-decoder attention layer learns how to relate tokens from two different sequences, such as two different languages.

Text Generation


Converting a model's probabilistic output to text requires a decoding method, which introduces a few challenges that are unique to text generation:

  • The decoding is done iteratively and thus involves significantly more compute than simply passing inputs once through the forward pass of a model
  • The quality and diversity of the generated text depend on the choice of decoding method and associated hyperparameters

Beam Search Decoding

Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-b most probable next tokens, where b is referred to as the number of beams or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the most likely extensions.

Beam Search

Beam search with n-gram penalty is a good way to find a tradeoff between high-probability tokens while reducing repetitions, and it's commonly used in applications such as summarization or machine translation where factual correctness is important. Sampling is another method used.
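
With the Hugging Face transformers library, both beam search and the n-gram penalty are arguments to generate(); a sketch using GPT-2 as a stand-in model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    input_ids = tokenizer("Transformers are", return_tensors="pt").input_ids
    output = model.generate(input_ids, max_length=50,
                            num_beams=5,             # track the 5 best partial hypotheses
                            no_repeat_ngram_size=2,  # n-gram penalty: never repeat a bigram
                            do_sample=False)
    print(tokenizer.decode(output[0]))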

Sampling

The simplest sampling method is to randomly sample from the probability distribution of the model's outputs over the full vocabulary at each timestep. You can control the diversity of the output by adding a temperature parameter that rescales the logits before taking the softmax. The main lesson we can draw from temperature is that it allows us to control the quality of the samples, but there's always a trade-off between coherence (low temperature) and diversity (high temperature) that one has to tune to the use case at hand.
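
The same generate() call switched to sampling (a sketch reusing the model, tokenizer, and input_ids from the previous block; temperature rescales the logits before the softmax):

    output = model.generate(input_ids, max_length=50,
                            do_sample=True,      # sample instead of searching
                            temperature=0.7,     # <1: more coherent, >1: more diverse
                            top_k=0)             # disable top-k so temperature acts alone
    print(tokenizer.decode(output[0]))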
