Review of Machine Learning / Deep Learning Notes: Part 1

I want to review my machine learning / deep learning notes before beginning my New Year's resolution of completing one ML/DL project per day. I will be re-reading my Jupyter notebook notes from August-September 2024. These notebooks contain my notes from reading a few textbooks on machine learning / deep learning. I have to split up the notes because they are too long.


References


  • See Jupyter Notebooks Page, notebooks on Hands-On Machine Learning, Deep Learning with Python, Natural Language Processing with PyTorch, Generative Deep Learning, and Natural Language Processing with Transformers
    • These notebooks range from August to September 2024


Libraries Used in Machine Learning / Deep Learning


  • Scikit-Learn
    • Very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learn Machine Learning
  • TensorFlow
    • More complex library for distributed numerical computation. It makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially hundreds of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale Machine Learning applications. It was open sourced in November 2015.
  • Keras
    • High-level Deep Learning API that makes it simple to train and run neural networks. It can run on top of either TensorFlow, Theano, or Microsoft Cognitive Toolkit (formerly known as CNTK). TensorFlow comes with its own implementation of this API, called tf.keras, which provides support for some advanced TensorFlow features.
  • NumPy
    • The fundamental package for scientific computing with Python
  • Pandas
    • A fast, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language
  • Matplotlib
    • Comprehensive library for creating static, animated, and interactive visualizations in Python.


Hands-On Machine Learning


Machine Learning Landscape


Machine Learning is the science of programming computers so they can learn from data. Use machine learning for:

  • Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one machine learning algorithm can often simplify code and perform better
  • Complex problems for which a traditional approach yields no good solution at all
  • Fluctuating environments
  • Getting insights about complex problems and large amounts of data

Types of Machine Learning Systems

  • In Supervised Learning, the training data you feed to the algorithm includes the desired solutions, called labels
    • Typical supervised learning tasks are classification and regression
  • In unsupervised learning, the training data is unlabeled. The system tries to learn without a teacher.
  • Visualization algorithms try to preserve as much structure as they can, so you can understand how the data is organized and identify unsuspected patterns.
  • Dimensionality Reduction's goal is to simplify the data without losing too much information.
  • Anomaly Detection and novelty detection are used to detect data points that deviate strongly from the normal data points seen during training.
  • Reinforcement Learning: the learning system, called an agent, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself the best strategy, called a policy, to get the most reward over time.
  • In Batch Learning, the system is incapable of learning incrementally: it must be trained using all the available data.
  • In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches.

Some Definitions

  • Model Selection - choosing a function to represent the model
  • Model Parameters - can be tweaked, affects the model
  • Utility Function - measures how good the model is
  • Cost Function - measures how bad the model is
  • Training the Model - feeding examples to the model to find parameters that best fit the data

Main Challenges of Machine Learning

  • Bad Data: nonrepresentative data or poor-quality data
  • Irrelevant features
  • Overfitting the training data
  • Underfitting the data

Hyperparameter Tuning and Model Selection

  • Holdout validation - holding out part of the training set to evaluate several candidate models and select the best one. The held-out set is called the validation set. More specifically, you train multiple models with various hyperparameters on the reduced training set, and you select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set, and this gives you the final model.
  • Cross Validation - each model is evaluated once per validation set, after it is trained on the rest of the data. By averaging out all the evaluations of a model, we get a much more accurate measure of its performance (see the sketch below).
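
A minimal Scikit-Learn sketch of k-fold cross-validation; the dataset and model here are arbitrary placeholders, not from the original notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Evaluate the model on 5 validation folds: each fold is held out once
# while the model is trained on the remaining data.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```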

End-to-End Machine Learning Project


  1. Frame the problem
    1. Unsupervised vs Supervised; Regression or Classification; Batch Learning vs Online Learning
  2. Select Performance Measure
    1. Select loss function - typical performance measure for regression problems is Root Mean Square Error (RMSE)
  3. Get the data, plot the data, investigate the data
    1. Look for correlations, missing values
  4. Generate a test set, train set, and validation set
  5. Clean the data
    1. Impute / drop missing values
    2. Handle Text and Categorical Attributes
    3. Scale Numerical Columns
  6. Select and Train a Model
  7. Fine Tune Hyperparameters
  8. Evaluate on Test Set
  9. Launch, monitor, and maintain system

Classification


A binary classifier can distinguish between two classes. Accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (some classes much more frequent than others). A much better way to evaluate the performance of a classifier is to look at the confusion matrix - count the number of times instances of class A are classified as class B, and vice versa. You can use the cross_val_predict() function to get clean out-of-fold predictions and then compute the confusion matrix from them.

Precision: The Accuracy of the positive predictions

Recall: The ratio of positive instances that are correctly detected by the classifier

It is often convenient to combine precision and recall into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall - the harmonic mean gives more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high (f1_score()).

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to precision/recall curve, but instead of plotting precision vs recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate.
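
A hedged Scikit-Learn sketch of these evaluation tools; the dataset and classifier are placeholders chosen only to make the example runnable:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))

# Out-of-fold predictions: every instance is predicted by a model
# that never saw that instance during training.
y_pred = cross_val_predict(clf, X, y, cv=3)

print(confusion_matrix(y, y_pred))   # rows = actual class, columns = predicted class
print(precision_score(y, y_pred))    # accuracy of the positive predictions
print(recall_score(y, y_pred))       # ratio of positive instances correctly detected
print(f1_score(y, y_pred))           # harmonic mean of precision and recall

# The ROC curve / ROC AUC needs decision scores rather than hard predictions.
y_scores = cross_val_predict(clf, X, y, cv=3, method="decision_function")
print(roc_auc_score(y, y_scores))
```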

Whereas binary classifiers distinguish between two classes, multiclass classifiers (also known as multinomial classifiers) can distinguish between more than two classes. Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly. Others are strictly binary classifiers. There are various strategies that you can use to perform multiclass classification using multiple binary classifiers:

  1. One-versus-all (OvA): Train N binary classifiers, one for each class (belongs to the class vs doesn't belong to the class). Then classify the instance based on which class gets the highest decision score.
    1. Preferred for most binary classification algorithms.
  2. One-versus-one (OvO): Train a binary classifier for each pair of classes. If there are N classes, you need to train N × (N − 1) / 2 classifiers.

One way to improve a classification model is to analyze the types of errors it makes. The confusion matrix is helpful for this. It is often convenient to look at an image representation of the confusion matrix, using Matplotlib's matshow() function.

A multilabel classification system outputs multiple binary tags.

Training Models


There are two ways to train linear regression models:

  • Using a direct closed-form equation that directly computes the model parameters that best fit the model to the training set.
  • Using an iterative approach, called Gradient Descent (GD), that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method.

A linear model makes a prediction simply by computing a weighted sum of the input features, plus a bias term (intercept term):

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ

Training a model means setting its parameters so that the model best fits the training set. For this purpose, we need a measure of how well (or poorly) the model fits the training data. A common performance measure of a regression model is the Root Mean Square Error (RMSE). Therefore, to train a Linear Regression model, you need to find the value of θ that minimizes the RMSE.

There is a closed-form solution that minimizes the cost function above, called the Normal Equation:

θ̂ = (XᵀX)⁻¹ Xᵀ y

Because the computational complexity of inverting a matrix is typically about O(n^2.4) to O(n^3), where n is the number of features, the Normal Equation and SVD approaches get very slow when the number of features grows large. On the positive side, both are linear with respect to the number of instances in the training set.
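
A small NumPy sketch of the Normal Equation on synthetic linear data (the data-generating function is made up purely for illustration):

```python
import numpy as np

# Synthetic linear data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]                      # add x0 = 1 (bias feature) to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y    # the Normal Equation
print(theta_best)                                      # should be close to [[4.], [3.]]
```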

Gradient Descent

Better suited for cases where there are a large number of features, or too many training instances to fit into memory, Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea behind gradient descent is to tweak parameters iteratively in order to minimize the cost function. Gradient Descent measures the local gradient of the error function with respect to the parameter vector θ, and it goes in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum. Concretely, you start by filling θ with random values (called random initialization) and then improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function, until the algorithm converges to a minimum.

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If it is too small, convergence takes forever; if it is too high, the algorithm might diverge.

Gradient Descent might converge to a local minimum, which is not as preferable as a global minimum.

To implement gradient descent, you need to compute the gradient of the cost function with regard to each model parameter θⱼ. In other words, you need to calculate how much the cost function will change if you change θⱼ by a small amount. This is called a partial derivative. Batch Gradient Descent involves calculating the gradient vector, ∇θ MSE(θ), over the full training set at every Gradient Descent step. As a result, Batch GD scales well with the number of features, but not with the number of instances.
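
A minimal NumPy sketch of Batch Gradient Descent on the same kind of synthetic data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]          # add the bias feature x0 = 1

eta = 0.1                                  # learning rate
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)              # random initialization
for _ in range(n_iterations):
    # gradient of the MSE cost function, computed over the full training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients
print(theta)                               # converges to roughly [[4.], [3.]]
```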

A simple solution to decrease the amount of time training takes is to set a tolerance ε: when the norm of the gradient vector becomes smaller than ε, gradient descent has almost reached a minimum and training can stop.

Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based on that instance. This makes the algorithm much faster and able to train on very large data sets. The cost function will bounce around in this case. The randomness of SGD is good because it can escape from local optima, but bad because it means the algorithm can never settle at the minimum. One solution to this problem is to gradually decrease the learning rate - start large then get smaller in a process called simulated annealing. The function that determines the learning rate at each iteration is called the learning schedule.

Mini-batch Gradient Descent: At each step, instead of computing the gradients based on the full training set or just one instance, Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over SGD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

Polynomial Regression

Even if the data is not linear, a linear model can still be used to fit the data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression. Learning curves are plots of the model's performance on the training set and the validation set as a function of the training set size.
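
A short Scikit-Learn sketch of Polynomial Regression (the quadratic toy data and the degree are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data that a plain straight line cannot fit well
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)

# Add x^2 as a new feature, then fit an ordinary linear model on the extended features
poly_reg = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LinearRegression())
poly_reg.fit(X, y.ravel())
print(poly_reg.predict([[1.5]]))
```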

The Bias / Variance Tradeoff

An important theoretical result of statistics and Machine Learning is the fact that a model's generalization error can be expressed as the sum of three very different errors:

  1. Bias: A high-bias model is likely to underfit the data. This is the part of the generalization error due to wrong assumptions.
  2. Variance: This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (parameters) is likely to have high variance, and thus overfit the training data.
  3. Irreducible Error: error due to the noisiness of the training data itself.

Regularizing Linear Models

A good way to reduce overfitting is to regularize the model (to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data.

  • Ridge Regression: a regularization term equal to α Σᵢ θᵢ² is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.
  • Lasso Regression is another regularized version of Linear Regression. It's just like Ridge Regression except it uses the ℓ₁ norm of the weight vector instead of half the square of the ℓ₂ norm.
    • Lasso Regression tends to completely eliminate the weights of the least important features. Lasso Regression automatically performs feature selection and outputs a sparse model.
  • Elastic Net is the middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio r (see the sketch below).
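
A quick Scikit-Learn sketch of the three regularized linear models (the toy data, alpha, and l1_ratio values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

np.random.seed(42)
X = 2 * np.random.rand(100, 3)
y = 1 + X @ np.array([3.0, 0.0, -2.0]) + 0.1 * np.random.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)                        # l2 penalty keeps weights small
lasso = Lasso(alpha=0.1).fit(X, y)                        # l1 penalty zeroes out weak features
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio is the mix ratio r

print(ridge.coef_, lasso.coef_, elastic.coef_)
```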

Logistic Regression

Logistic Regression is a regression algorithm that is commonly used to estimate the probability that an instance belongs to a particular class. The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression.

Support Vector Machines


A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox.

You can think of an SVM classifier as fitting the widest possible street between the classes. This is called large margin classification. Adding more instances off the street will not affect the decision boundary. Support vectors are the instances that help define the decision boundary. SVMs are sensitive to feature scales. The two main problems with hard-margin classification are that it only works when the data is linearly separable and that it is quite sensitive to outliers. Margin violations are instances that end up in the middle of the street or on the wrong side. Soft margin classification means finding a good balance between keeping the street as large as possible and limiting the number of margin violations. Reducing the C hyperparameter of SVM models can reduce overfitting.

The kernel trick and the Gaussian Radial Basis Function (RBF) are two techniques used to tackle nonlinear problems. The RBF is a function that can be used as a similarity function that measures how much each instance resembles a particular landmark. For the RBF kernel, you can use the gamma hyperparameter to control the model's variance: to reduce overfitting, reduce gamma; to reduce underfitting, increase it.
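
A hedged Scikit-Learn sketch of a nonlinear SVM classifier with the RBF kernel (the dataset, gamma, and C values are arbitrary examples):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# SVMs are sensitive to feature scales, so scale first.
# gamma controls the width of the RBF similarity function,
# C controls the soft-margin trade-off (lower C = more regularization).
svm_clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", gamma=5, C=0.001))
svm_clf.fit(X, y)
print(svm_clf.predict([[0.5, 0.2]]))
```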

SVM is versatile: it supports linear and nonlinear classification and regression. The hard margin and soft margin problems are both convex quadratic optimization problems with linear constraints. Such problems are known as Quadratic Programming (QP) problems.

Decision Trees


Like SVMs, decision trees are versatile machine learning algorithms that can perform both classification and regression tasks, even multioutput tasks. They are very powerful algorithms, capable of fitting complex datasets. Decision trees are also the fundamental components of Random Forests, which are among the most powerful Machine Learning algorithms available today.

Decision Tree

Making Predictions

You start at the root node (depth=0), then move left or right depending on the petal length. You continue this process until you reach a leaf node (one that has no child nodes), which gives you a predicted class. A node's samples attribute counts how many training instances it applies to. A node's value attribute tells you how many instances of each class this node applies to. Finally, a node's gini attribute measures its impurity: a node is pure (gini=0) if all training instances it applies to belong to the same class. Gini impurity is one example of an impurity measure. Decision Trees are fairly intuitive and their decisions are easy to interpret; such models are called white box models. In contrast, Random Forests or Neural Networks are called black box models. They make good predictions, and you can check the calculations they performed to make these predictions; nevertheless, it is usually hard to explain in simple terms why those predictions were made. A Decision Tree can also estimate the probability that an instance belongs to a particular class.

CART Training Algorithm

Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train (grow) Decision Trees.

Gini Impurity or Entropy

You can select an entropy impurity measure different than Gini impurity by setting the criterion hyperparameter to entropy. In Shannon's information theory, entropy measures the average information content of a message: entropy is zero when all messages are identical. In Machine Learning, it is frequently used as an impurity measure.

Regularization Hyperparameters

To avoid overfitting the training data, you need to restrict the Decision Tree's freedom during training. This is called regularization. The regularization hyperparameters depend on the model, but in general you can always limit the maximum depth (max_depth), as in the sketch below.
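
A small Scikit-Learn sketch comparing an unconstrained tree with a regularized one (the dataset and the hyperparameter values are placeholders):

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

# Unconstrained tree: free to grow until every leaf is pure, so it tends to overfit
deep_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Regularized tree: limiting depth and leaf size restricts its freedom
shallow_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                      random_state=42).fit(X, y)
print(deep_tree.get_depth(), shallow_tree.get_depth())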

Ensemble Learning and Random Forests


The wisdom of the crowd - if you aggregate the answers of a random group of people you will often find that it is better than the expert's answer. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method. An ensemble of Decision Trees where you predict the class that gets the most votes from the predictions of the trees is called a Random Forest. Despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.

A very simple way to create a better classifier is to aggregate the predictions of each classifier to predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. Surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble.

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy. If all the classifiers are able to predict probabilities (they have the predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
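
A hedged Scikit-Learn sketch of a soft voting classifier built from three diverse predictors (the dataset and estimators are placeholder choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Soft voting averages the predicted class probabilities of the three classifiers
voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(random_state=42)),
                ("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    voting="soft")
voting_clf.fit(X, y)
print(voting_clf.predict(X[:5]))
```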

Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms; another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting. The aggregation function is typically the statistical mode for classification, or the average for regression. The resulting ensemble typically has a similar bias but a lower variance than a single predictor trained on the original training set.

A great feature of Random Forests is that they expose the relative importance of each feature (see the sketch below).
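
A short Scikit-Learn sketch of bagging and of a Random Forest's feature importances (the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: 500 trees, each trained on a bootstrap sample of 100 instances
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X, y)

# A Random Forest is roughly equivalent, and exposes feature importances directly
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(X, y)
print(rnd_clf.feature_importances_)
```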

Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. AdaBoost and Gradient Boosting both work by sequentially adding predictors to an ensemble, each one correcting its predecessor.
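
A hedged Scikit-Learn sketch of both boosting flavors (the dataset and hyperparameter values are arbitrary examples):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# AdaBoost: each new stump focuses on the instances its predecessors got wrong
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200, learning_rate=0.5,
                             random_state=42)
ada_clf.fit(X, y)

# Gradient Boosting: each new tree is fit to the residual errors of the ensemble so far
gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=100,
                                    learning_rate=0.1, random_state=42)
gb_clf.fit(X, y)
print(ada_clf.score(X, y), gb_clf.score(X, y))
```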

Dimensionality Reduction


Having too many features can make training extremely slow and can make it harder to find a good solution. This problem is often referred to as the curse of dimensionality. In real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. Reducing dimensionality does lose some information, so it may make your system perform slightly worse. Dimensionality reduction also makes your pipelines more complex and difficult to maintain. In some cases, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance. Dimensionality reduction is also extremely useful for data visualization.

The Curse of Dimensionality

In high-dimensional space, it is very likely that a random point will be extreme along at least one dimension (close to the border), whereas in low-dimensional space this is much less likely. High-dimensional datasets are also at risk of being very sparse - training instances tend to be very far away from each other. The more dimensions the training set has, the greater the risk of overfitting.

Main Approaches to Dimensionality Reduction

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances lie within a lower-dimensional subspace of the higher-dimensional space.

Manifold Learning

A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold assumption (or manifold hypothesis), which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. If you reduce the dimensionality of your training set before training a model, it will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.

PCA

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First, it identifies the hyperplane that lies closest to the data, and then it projects the data onto it. The simple idea behind PCA is to keep the axes that preserve the most variance. It is generally preferable to choose the number of dimensions that adds up to a sufficiently large portion of the variance (e.g. 95%).

Kernel PCA makes it possible to perform complex nonlinear projections for dimensionality reduction.
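
A short Scikit-Learn sketch of PCA and Kernel PCA (the dataset and the gamma value are placeholders):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA

X, _ = load_digits(return_X_y=True)

# Keep however many components are needed to preserve 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Kernel PCA performs a nonlinear projection (here with an RBF kernel)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```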

LLE

Locally Linear Embedding is another very powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections like the previous algorithms.

Other Techniques

  • Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between instances
  • Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between the instances
  • t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space.
  • Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane on which to project the data.

Unsupervised Learning Techniques


The majority of data is unlabeled: we have the input features X, but we do not have the labels y. Dimensionality reduction is one unsupervised learning task. Other unsupervised learning tasks:

  • Clustering: the goal is to group similar instances together into clusters.
  • Anomaly Detection: the objective is to learn what normal data looks like and use this to detect abnormal instances, such as defective items on a production line or a new trend in a time series
  • Density Estimation: this is the task of estimating the probability density function (PDF) of the random process that generated the dataset.

Clustering

Clustering - the task of identifying similar instances and assigning them to clusters, i.e. groups of similar instances. Like in classification, each instance gets assigned to a group. However, this is an unsupervised task. There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters. For example, some algorithms look for instances centered around a particular point, called a centroid. Others look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters.

K-Means

The K-Means algorithm is a simple algorithm capable of clustering datasets very quickly and efficiently, often in just a few iterations. You have to specify how many clusters to find. The K-Means algorithm does not behave well when the blobs have very different diameters, since all it cares about when assigning an instance to a cluster is the distance to the centroid. Instead of assigning each instance to a single cluster, which is called hard clustering, it can be useful to give each instance a score per cluster: this is called soft clustering.

The K-Means Algorithm

Place the centroids randomly, then label the instances, update the centroids, label the instances, update the centroids, and so on until the centroids stop moving. The algorithm is guaranteed to converge in a finite number of steps.
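
A minimal Scikit-Learn sketch of K-Means, showing both hard and soft clustering (the blob dataset and cluster count are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)          # hard clustering: one cluster per instance

print(kmeans.cluster_centers_)          # the final centroids
print(kmeans.transform(X[:3]))          # soft clustering: distance of each instance to each centroid
```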

DBSCAN

DBSCAN allows the algorithm to identify clusters of arbitrary shapes. It defines clusters as continuous regions of high density:

  • For each instance, the algorithm counts how many instances are located within a small distance ε from it. This region is called the instance's ε-neighborhood.
  • If an instance has at least min_samples instances in its ε-neighborhood, then it is considered a core instance. Core instances are those that are located in dense regions.
  • Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.

The algorithm works well if all the clusters are dense enough and they are well separated by low-density regions. DBSCAN is a very simple yet powerful algorithm, capable of identifying any number of clusters of any shape; it is robust to outliers, and it has just two hyperparameters (eps and min_samples). However, if the density varies significantly across the clusters, it can be impossible for it to capture all the clusters properly.
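
A short Scikit-Learn sketch of DBSCAN (the dataset and the eps / min_samples values are arbitrary examples):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the core-instance threshold
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

print(set(labels))                       # cluster ids; -1 marks anomalies/outliers
print(len(dbscan.core_sample_indices_))  # number of core instances found
```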

Gaussian Mixtures

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid; each cluster can have a different shape, size, density, and orientation.

Anomaly Detection

Anomaly detection is the task of detecting instances that deviate strongly from the norm. These instances are of course called anomalies or outliers. Using a Gaussian Mixture model for anomaly detection is simple: any instance located in a low-density region can be considered an anomaly. You must define what density threshold to use.
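
A hedged Scikit-Learn sketch of Gaussian Mixture anomaly detection (the dataset, component count, and the 2% density threshold are placeholder choices):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)

# Flag the 2% of instances located in the lowest-density regions as anomalies
densities = gm.score_samples(X)
threshold = np.percentile(densities, 2)
anomalies = X[densities < threshold]
print(len(anomalies))
```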

Anomaly Detection using Gaussian Mixture

Introduction to Neural Networks with Keras

Looking at the brain's architecture to build an intelligent machine is the key idea that sparked Artificial Neural Networks (ANNs). The earliest artificial neuron has one or more binary inputs and one binary output. The Perceptron is one of the simplest ANN architectures. It is based on a slightly different artificial neuron called a threshold logic unit (TLU): the inputs and the output are now numbers (instead of binary values) and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs, then applies a step function to that sum and outputs the result.

Threshold Logic Unit

The most common step function in Perceptrons is the Heaviside step function. A single TLU with a threshold can be used for simple linear binary classification. Training a TLU in this case means finding the right weights. A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer, it is called a fully connected layer or dense layer. All the input neurons form the input layer.

Computing the Outputs of a Fully Connected Layer: h_W,b(X) = φ(XW + b), where X is the matrix of inputs, W the weight matrix, b the bias vector, and φ the activation function.
Perceptron Learning Rule: w_i,j(next step) = w_i,j + η (y_j − ŷ_j) x_i, where η is the learning rate.
  • The Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. See the learning rule above.

Perceptrons are incapable of learning complex patterns because the decision boundary of each output neuron is linear. The Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. Perceptron convergence theorem: if the training instances are linearly separable, the Perceptron will converge to a solution.

Multi-Layer Perceptron

A MLP is composed of one (passthrough) input layer, one or more layers of TLUs, called hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are called the lower layers, and the ones close to the output are called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. MLP is an example of a feedforward neural network because the signal flows only in one direction. When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of Deep Learning studies DNNs and more generally models containing deep stacks of computations.

Backpropagation is Gradient Descent using an efficient technique for computing gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to a solution. Automatically computing gradients is called automatic differentiation, or autodiff.

Backpropagation
  • Handles one batch at a time, and it goes through the full training set multiple times. Each pass is called an epoch
  • Each mini-batch is passed to the network's input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer. The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
  • The algorithm measures the network's output error (using a loss function that compares the desired output and the actual output).
  • It computes how much each output connection contributed to the error. This is done analytically by applying the chain rule, which makes this step fast and precise.
  • The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network.
  • The algorithm then performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
This algorithm is so important, it’s worth summarizing it again: for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).

It's important to initialize all hidden layers' connection weights randomly, or else training will fail. When you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse set of neurons. In order for backpropagation to work properly, modern NNs replace the step function of the MLP with other activation functions:

  • The logistic (sigmoid) function: σ(z) = 1 / (1 + exp(−z))
    • Nonzero derivative everywhere
  • The hyperbolic tangent function: tanh(z)
    • Just like the logistic function: S-shaped, continuous, and differentiable, but its output ranges from -1 to 1 instead of 0 to 1 like the logistic function, which helps speed up convergence
  • The Rectified Linear Unit function: ReLU(z) = max(0, z)
    • Works well in practice and is fast to compute. Does not have a maximum output value, which helps reduce some issues during Gradient Descent

MLPs can be used for regression tasks. If you just need to predict a single value, then you just need a single output neuron. For multivariate regression, you need one output neuron per output dimension. In general, when building a MLP for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values. If you want to guarantee that the output will be positive, then you can use the ReLU activation function, or the softplus activation function in the output layer.

MLPs can also be used for classification tasks. For a binary classification problem, you only need a single output neuron using the logistic activation function: the output is a number between 0 and 1, which you can interpret as the estimated probability of the positive class. MLPs can also handle multilabel binary classification tasks: you dedicate one output neuron for each positive class you want to predict.
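
A minimal tf.keras sketch of an MLP classifier along these lines (the Fashion-MNIST dataset, layer sizes, and epoch count are placeholder choices):

```python
from tensorflow import keras

# Fashion-MNIST: 28x28 grayscale images, 10 clothing classes
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0   # scale pixels to the 0-1 range

model = keras.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),                        # turn each image into a 784-long vector
    keras.layers.Dense(300, activation="relu"),    # hidden layers
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),  # one output neuron per class
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(X_test, y_test))
```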

Keras is a high-level Deep Learning API that allows you to easily build, train, evaluate, and execute all sorts of neural networks. It was developed by François Chollet as part of a research project and released as an open source project in March 2015. It quickly gained popularity owing to its ease of use, flexibility, and beautiful design. To perform the heavy computations required by neural networks, Keras relies on a computation backend. At present, you can choose from three popular open source deep learning libraries: TensorFlow, Microsoft Cognitive Toolkit, or Theano.

Number of Hidden Layers

Deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data.

Real-world data is often structured in such a hierarchical way, and DNNs automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (faces).

Number of Neurons per Hidden Layer

The number of neurons in the input and output layers is constrained by the task. Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting.

Other Hyperparameters

  • The learning rate is arguably the most important hyperparameter. In general, the optimal learning rate is about half the maximum learning rate (the learning rate above which the algorithm diverges).
  • Choosing a better optimizer than plain Mini-batch GD is also important.
  • The batch size will also have a significant impact on the model's performance.
  • The choice of activation function: ReLU activation function will be a good default for hidden layers.

Training Deep Neural Networks


Deep Neural Networks: 10 layers or more, each containing hundreds of neurons, connected by hundreds of thousands of connections. Problems:

  1. Vanishing gradients problem / exploding gradients problem
    1. During backpropagation, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. In some cases, the gradients can instead grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. The combination of the sigmoid (logistic) activation function and standard random initialization is suspected to cause this problem.
  2. Might not have enough data for large network
  3. Training may be extremely slow
  4. A model with too many parameters risks overfitting the data

Glorot and He Initialization

The authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction. It is not actually possible to guarantee this, so the authors proposed Glorot Initialization.

Non-saturating Activation Functions

Leaky ReLU can also be used to address the vanishing gradients problem. With leaky ReLU, the function outputs a small value (a fraction of the input) instead of 0 when the input is less than 0, so neurons never completely die. There is also the exponential linear unit (ELU) activation function, which is similar to leaky ReLU.

What activation function should you choose for Deep Neural Networks? In general: SELU > ELU > leaky ReLU > ReLU > tanh > logistic.

Batch Normalization

Batch Normalization (BN), designed to address vanishing/exploding gradients, consists of zero-centering and normalizing each layer's inputs, then scaling and shifting the result using two new parameter vectors per layer: one for scaling, the other for shifting. Four parameter vectors are involved in each batch-normalized layer: the output scale vector and the output offset vector are learned through regular backpropagation, and the final input mean vector and the final input standard deviation vector are estimated using an exponential moving average. BN also acts as a regularizer, reducing the need for other regularization techniques.
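
A small tf.keras sketch showing where Batch Normalization layers are typically inserted (the architecture itself is a placeholder):

```python
from tensorflow import keras

# Same kind of MLP, with a Batch Normalization layer after each hidden layer
model = keras.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),   # learns a scale and an offset, tracks mean/std
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd",
              metrics=["accuracy"])
model.summary()
```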

Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing model that accomplishes a similar task to the one you are trying to tackle and then just reuse the lower layers of this network in a process called transfer learning. Replace the output layers, maybe add some upper layers, freeze the lower layers during initial training, etc...

Faster Optimizers

These optimizers aim to train faster than regular Gradient Descent:

  • Momentum Optimization
    • Cares a great deal about what the previous gradients were
  • Nesterov Accelerated Gradient
    • Measures the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum
  • AdaGrad
    • Achieves faster convergence by pointing a bit more directly toward the global optimum than Gradient Descent, by scaling down the gradient vector along the steepest dimensions
  • RMSProp
    • Fixes AdaGrad's problem of never converging to the global optimum by accumulating only the gradients from the most recent iterations
  • Learning Rate Scheduling
    • Changing the learning rate during training using various techniques

Avoid Overfitting

  • ℓ₁ and ℓ₂ regularization
  • Dropout
  • Monte-Carlo Dropout
  • Max-Norm Regularization

Default DNN Configuration

Deep Computer Vision using Convolutional Neural Networks


Convolutional Neural Networks (CNNs) emerged from the study of the brain's visual cortex, and they have been used in image recognition since the 1980s. In the last few years, CNNs have achieved superhuman performance on complex visual tasks. CNNs are also good at voice recognition and natural language processing.

Convolutional Layer

The most important building block of a CNN is the convolutional layer. Neurons in the first convolutional layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields. In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on small low-level features in the first hidden layer, then assemble them into larger higher-level features in the next hidden layer, and so on.

  • stride
    • spacing out receptive fields by some amount
  • padding
    • adding zeros around the input so that a layer has the same height and width as the previous layer
  • filters (or convolutional kernels)
    • A neuron's weights. A layer full of neurons using the same filter outputs a feature map, which highlights the areas in an image that activate the filter the most.

Connections between Layers and Padding

A convolutional layer has multiple filters in reality, and it outputs one feature map per filter. A neuron's receptive field extends across all the previous layer's feature maps. A convolutional layer simultaneously applies multiple trainable filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

The goal of a pooling layer is to subsample (shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters. The pooling layer is just like the convolutional layer, except that a pooling layer neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or the mean. Max pooling layer - only the max value in each receptive field makes it to the next layer. Max pooling also introduces some level of translational and rotational invariance.
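
A hedged tf.keras sketch of this typical conv/pool stack (the filter counts and layer sizes are arbitrary):

```python
from tensorflow import keras

# A small CNN in the classic conv -> pool pattern for 28x28 grayscale images
model = keras.Sequential([
    keras.layers.Input(shape=[28, 28, 1]),
    keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),   # subsample by keeping only the max value
    keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()
```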

Typical CNN Architecture

Processing Sequences using RNNs and CNNs

A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward. At each time step t, also called a frame, a recurrent neuron receives the inputs x(t) as well as its own output from the previous time step, ŷ(t−1). In a recurrent layer of neurons, at each time step t every neuron receives both the input vector x(t) and the output vector from the previous time step, ŷ(t−1).

Each recurrent neuron has two sets of weights: one for the inputs x(t) and the other for the outputs of the previous time step, ŷ(t−1). These weight vectors are denoted wx and wŷ. For the whole recurrent layer, the corresponding weight matrices are denoted Wx and Wŷ. The output of a recurrent layer, with φ as the activation function and b as the bias vector, is:

ŷ(t) = φ(Wxᵀ x(t) + Wŷᵀ ŷ(t−1) + b)

You can compute a recurrent layer's output in one shot for an entire mini-batch by placing all the inputs at time step t into an input matrix X(t):

Ŷ(t) = φ(X(t) Wx + Ŷ(t−1) Wŷ + b)

Since the output of a recurrent neuron at time step t is a function of all the inputs from the previous time steps, you could say that it has a form of memory. A part of a neural network that preserves some state across time steps is called a memory cell. A cell's state at time step t, h(t) (where h stands for hidden), is a function of some inputs at that time step and of its state at the previous time step: h(t) = f(h(t−1), x(t)). It is not always the case that the hidden state at a time step is equal to the output at that time step.

An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs. This type of sequence-to-sequence network is useful for forecasting time series. Alternatively, you could feed the network a sequence of inputs and ignore all outputs except for the last one (sequence-to-vector network). Conversely, you could feed the network the same input vector over and over again at each time step and let it output a sequence (vector-to-sequence network). Lastly, you could have a sequence-to-vector network, called an encoder, followed by a vector-to-sequence network, called a decoder.

Types of RNN Networks

To train an RNN, the trick is to unroll it through time and use regular backpropagation. This strategy is called backpropagation through time (BPTT). The output sequence is evaluated using a loss function L(Y(0), …, Y(T); Ŷ(0), …, Ŷ(T)), where Y(t) is the target, Ŷ(t) is the prediction, and T is the max time step. This loss function may ignore some outputs.

Time series Data: data with values at different time steps, usually at regular intervals. If there are multiple values per time step, this is called multivariate time series. Time series with one value per time step is called univariate time series. When a time series is correlated with a lagged version of itself, we say that the time series is autocorrelated. Differencing, subtracting a past value from a more recent value in the series, is a common technique used to remove trend and seasonality from a time series. The autoregressive moving average (ARMA) model computes forecasts using a simple weighted average of lagged variables and corrects these forecasts by adding a moving average.

The most basic RNN contains a single recurrent layer with just one recurrent neuron. Stacking multiple layers of cells gives you a deep RNN. Many of the tricks used in DNNs to alleviate unstable gradients can be used for RNNs. However, a nonsaturating activation function may lead the RNN to be even more unstable during training. Batch Normalization doesn't work well with RNNs. A form of normalization that works well with RNNs is layer normalization. This form of normalization is similar to batch normalization, but instead of normalizing across the batch dimension, layer normalization normalizes across the feature dimension.

Due to the transformations that the data goes through when traversing an RNN, some information is lost at each time step. After a while, an RNN's state contains virtually no trace of the first inputs. LSTM cells were introduced to address this. The Long Short-Term Memory (LSTM) cell performs much better than a simple RNN cell and converges faster.

LSTM Cell

The LSTM cell looks exactly like a regular cell, except that its state is split into two vectors: h(t) and c(t) (c stands for cell). You can think of h(t) as the short-term state and c(t) as the long-term state. The key idea of LSTM is that the network can learn what to store in the long-term state, what to throw away, and what to read from it. As the long-term state c(t−1) traverses the cell from left to right, it first goes through a forget gate, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an input gate). The result c(t) is sent straight out, without any further transformation. At each time step, some memories are dropped and others are added. After the addition operation, the long-term state is also copied and passed through the tanh function, and then the result is filtered by the output gate. This produces the short-term state h(t), which is equal to the cell's output for this time step, y(t).

The current input vector x(t) and the previous short-term state h(t−1) are fed to four different fully connected layers:

  1. The main layer is the one that outputs g(t). It has the usual role of analyzing the current inputs x(t) and the previous short-term state h(t−1). In an LSTM cell, this layer's output does not go straight out; instead, its most important parts are stored in the long-term state (and the rest is dropped).
  2. The three other layers are gate controllers. Their outputs are fed to element-wise multiplication operations: if they output 0s they close the gate, and if they output 1s they open it:
    1. The forget gate (controlled by f(t)) controls which parts of the long-term state should be erased.
    2. The input gate (controlled by i(t)) controls which parts of g(t) should be added to the long-term state.
    3. Finally, the output gate (controlled by o(t)) controls which parts of the long-term state should be read and output at this time step, both to h(t) and to y(t).

The gated recurrent unit (GRU) cell is a simplified version of the LSTM cell, and it seems to perform just as well. Main simplifications:

  • Both state vectors are merged into a single vector h(t).
  • A single gate controller controls the input gate and the forget gate.
  • There is no output gate: the full state vector is output at every time step (see the sketch below).
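
A small tf.keras sketch of a recurrent sequence-to-vector model using LSTM and GRU layers (the toy sine-wave data and layer sizes are made up for illustration):

```python
import numpy as np
from tensorflow import keras

# Toy univariate time series: predict the next value from the previous 50 steps
def make_series(n_series, n_steps):
    np.random.seed(42)
    t = np.linspace(0, 10, n_steps + 1)
    offsets = np.random.rand(n_series, 1)
    series = np.sin(10 * (t + offsets))
    return series[:, :n_steps, np.newaxis], series[:, -1]

X_train, y_train = make_series(1000, 50)

model = keras.Sequential([
    keras.layers.Input(shape=[None, 1]),         # any sequence length, 1 feature per step
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.GRU(20),                        # GRU as a simplified drop-in recurrent cell
    keras.layers.Dense(1),                       # sequence-to-vector: one forecast value
])
model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=3)
```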

GRU Cell

Natural Language Processing with RNNs and Attention


This content will be a summary - it is covered more in-depth by a later section of notes.

Attention Mechanisms

The image below shows an encoder-decoder model with an added attention mechanism. Instead of just sending the encoder's final hidden state to the decoder (as well as the previous target word at each step), we now send all of the encoder's outputs to the decoder. Since the decoder cannot deal with all these encoder outputs at once, they need to be aggregated: at each time step, the decoder's memory cell computes a weighted sum of all the encoder outputs. This determines which words it will focus on at each step. The weight α(t,i) is the weight of the ith encoder output at the tth decoder time step. The rest of the decoder works just like earlier: at each time step the memory cell receives these inputs plus the hidden state from the previous time step, and finally it receives the target word from the previous time step. The α(t,i) weights are generated by a small neural network called the attention layer, which is trained jointly with the rest of the encoder-decoder model.

Attention Model

The attention layer provides a way to focus the attention of the model on part of the inputs. It acts as a differentiable memory retrieval mechanism.

Attention is All you Need: The Original Transformer Architecture

In 2017, the transformer was introduced, which improved the state of Neural Machine Translation without using any recurrent or convolutional layers, just attention mechanisms (and some other things). Because the model is not recurrent, it can be trained in fewer steps, it's easier to parallelize across multiple GPUs, and it can better capture long-range patterns than RNNs.

Original Transformer

The image above:

  • Notice that the encoder and the decoder contain modules that are stacked N times
  • The encoder's multi-head attention layer updates each word representation by paying attention to all other words in the same sentence. This is where word semantics become more well-defined.
  • The decoder's masked multi-head attention layer does the same thing, but when it processes a word, it doesn't pay attention to words located after it.
  • The decoder's upper multi-head attention layer is where the decoder pays attention to the words in the English sentence. This is called cross-attention.
  • The positional encodings are dense vectors that represent the position of each word in the sentence. The positional encoding of the ith word is added to the word embedding of the ith word in each sentence. This is needed because all layers in the transformer architecture ignore word positions: without positional encodings, you could shuffle the input sequences, and it would just shuffle the output sequences in the same way. The order of the words matters, and positional encodings are a good way to give position information to the transformer.
Multi-head Attention

Scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √d_keys) V, where:

  • Q is a matrix containing one row per query. Its shape is [n_queries, d_keys], where n_queries is the number of queries and d_keys is the number of dimensions of each query and of each key.
  • K is a matrix containing one row per key. Its shape is [n_keys, d_keys], where n_keys is the number of keys and values.
  • V is a matrix containing one row per value. Its shape is [n_keys, d_values], where d_values is the number of dimensions of each value.
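
A plain NumPy sketch of scaled dot-product attention following the formula above (the matrix sizes are arbitrary):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_keys)) V"""
    d_keys = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_keys)                  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # weighted sum of the values

Q = np.random.rand(3, 8)     # 3 queries of dimension d_keys = 8
K = np.random.rand(5, 8)     # 5 keys, same dimension as the queries
V = np.random.rand(5, 16)    # 5 values of dimension d_values = 16
print(scaled_dot_product_attention(Q, K, V).shape)      # -> (3, 16)
```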

Multi-head Attention Architecture

The multi-head attention layer is a set of scaled dot-product attention layers, each preceded by a linear transformation of the values, keys, and queries (a time-distributed Dense layer with no activation function). All the outputs are simply concatenated, and they go through a final linear transformation. The multi-head attention layer applies multiple different linear transformations of the values, keys, and queries: this allows the model to apply many different projections of the word representations into different subspaces, each focusing on a subset of the word's characteristics.

Autoencoders, GANs, and Diffusion Models


Autoencoders are artificial neural networks capable of learning dense representations of the input data, called latent representations or codings. Autoencoders can be useful for visualization, dimensionality reduction, unsupervised pretraining of deep neural networks, and as generative models. They work by learning to copy their inputs to their outputs.

Generative Adversarial Networks (GANs) are neural networks capable of generating data. GANs are now widely used for super resolution (increasing image resolution), colorization, powerful image editing, turning simple sketches into photorealistic images, predicting the next frames in a video, augmenting a dataset, and generating other types of data. They are composed of a generator that tries to generate data close to the training data and a discriminator that tries to tell real data from fake data. The generator and discriminator compete against each other during training. Diffusion models are a recent addition to the generative modeling family. A denoising diffusion probabilistic model (DDPM) is trained to remove a tiny bit of noise from an image.

An autoencoder looks at the inputs, converts them to an efficient latent representation, and then spits out something that looks very close to the inputs. An autoencoder is composed of two parts: an encoder (or recognition network) that converts the inputs to a latent representation, followed by a decoder (or generative network) that converts the internal representation to the outputs. An autoencoder typically looks like an MLP, except that the number of neurons in the output layer must be equal to the number of inputs.

  • Stacked Autoencoders: autoencoders with multiple hidden layers that are usually symmetrical with respect to the central hidden layer (see the sketch after this list)
  • Convolutional Autoencoders: autoencoders for images
  • Denoising Autoencoders: autoencoders trained by adding noise to the inputs and learning to recover the original inputs
  • Sparse Autoencoder
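
A minimal tf.keras sketch of a stacked autoencoder for 28×28 images (the layer sizes and the 30-dimensional coding size are placeholders):

```python
from tensorflow import keras

# Encoder: compress each 28x28 image down to a 30-dimensional coding
encoder = keras.Sequential([
    keras.layers.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(30, activation="relu"),
])
# Decoder: reconstruct the image from the coding
decoder = keras.Sequential([
    keras.layers.Input(shape=[30]),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28]),   # output has the same shape as the input
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(loss="binary_crossentropy", optimizer="adam")
autoencoder.summary()
```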

Variational Autoencoders (VAE) are different from all other autoencoders in that:

  1. They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training
  2. More importantly, they are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set

A Generative Adversarial Network (GAN) is composed of two neural networks:

  1. Generator: Takes a random distribution as input (typically Gaussian) and outputs some data.
  2. Discriminator: Takes either a fake image from the generator or a real image from the training set as input, and must guess whether the input image is fake or real.

During training, the generator and the discriminator have opposite goals: the discriminator tries to tell fake images from real images, while the generator tries to produce images that look real enough to trick the discriminator.

  • First, train the discriminator. A batch of real images is sampled from the training set and is completed with an equal number of fake images produced by the generator.
  • Second, train the generator. Use it to produce another batch of fake images, and once again the discriminator is used to tell whether the images are fake or real. We want the generator to produce images that the discriminator will wrongly believe to be real.
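
Putting the two phases together, a rough sketch of one training iteration; here generator, discriminator, gan (the generator chained into the discriminator), X_batch, batch_size, and codings_size are assumed to be defined elsewhere:

    import tensorflow as tf

    def train_gan_step(generator, discriminator, gan, X_batch, batch_size, codings_size):
        # Phase 1: train the discriminator on half fake, half real images
        # (label 0 for fake, 1 for real).
        noise = tf.random.normal([batch_size, codings_size])
        fake_images = generator(noise)
        X_fake_and_real = tf.concat([fake_images, X_batch], axis=0)
        y1 = tf.constant([[0.]] * batch_size + [[1.]] * batch_size)
        discriminator.trainable = True
        discriminator.train_on_batch(X_fake_and_real, y1)

        # Phase 2: train the generator through the frozen discriminator,
        # asking it to make the discriminator label fake images as real.
        noise = tf.random.normal([batch_size, codings_size])
        y2 = tf.constant([[1.]] * batch_size)
        discriminator.trainable = False
        gan.train_on_batch(noise, y2)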

Reinforcement Learning


Reinforcement learning (RL) is one of the most exciting fields of machine learning today, and also one of the oldest. In reinforcement learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards from the environment. Its objective is to learn to act in a way that will maximize its expected rewards over time. The algorithm a software agent uses to determine its actions is called a policy. This policy could be a neural network taking observations as inputs and outputting an action to take.

A stochastic policy is a policy that involves some randomness. Policy parameters can be tweaked to change the policy. Policy search means trying out different values for the policy parameters. The policy space is the space of all possible policy parameters. Using policy gradients, you can tweak the policy parameters by following the gradients toward higher rewards.
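
A minimal sketch of a neural-network policy of this kind; the observation and action sizes are illustrative (e.g. a CartPole-like environment):

    import tensorflow as tf

    n_inputs, n_actions = 4, 2   # illustrative: 4 observation dims, 2 discrete actions

    # The policy maps observations to a probability for each action.
    policy = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=[n_inputs]),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])

    obs = tf.random.normal([1, n_inputs])
    action_probas = policy(obs)
    # A stochastic policy: sample the action instead of always taking the argmax.
    action = tf.random.categorical(tf.math.log(action_probas), num_samples=1)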


Deep Learning with Python


What is Deep Learning


AI can be described as the effort to automate intellectual tasks normally performed by humans. Symbolic AI is the approach of achieving human-level artificial intelligence by having programmers handcraft a sufficiently large set of explicit rules for manipulating knowledge stored in explicit databases. Machine learning looks at the input data and the corresponding answers and figures out what the rules should be. A machine learning system is trained rather than explicitly programmed. The central problem of machine learning and deep learning is to meaningfully transform data: to learn useful representations of the input data at hand.

Deep Learning is a specific subfield in machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The deep in deep learning refers to the idea of successive layers of representations. How many layers contribute to a model of the data is called the depth of the model. In deep learning, these layered representations are learned via models called neural networks structured in literal layers stacked on top of each other.

Deep learning is technically a multistage way to learn data representations. The specification of what a layer does to its input data is stored in the layer's weights, which in essence are a bunch of numbers. In technical terms, we'd say that the transformation implemented by a layer is parameterized by its weights. Learning means finding a set of values for the weights of all layers in a network such that the network will correctly map example inputs to their associated targets. To control the output of a neural network, you must be able to measure how far this output is from what you expected. This is the job of the loss function of the network, also sometimes called the objective function or cost function. The loss function takes the predictions of the network and the true targets and computes a distance score, capturing how well the network has done on this specific example. The fundamental trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example. This adjustment is the job of the optimizer, which implements what's called the backpropagation algorithm: the central algorithm in deep learning.

Probabilistic modeling is the application of the principles of statistics to data analysis. Kernel methods are a group of classification algorithms, the best known of which is the Support Vector Machine (SVM). The kernel trick: to find a good decision hyperplane in the new representation space, you just need to compute the distance between pairs of points in that space, which can be done efficiently using a kernel function. A kernel function is a computationally tractable operation that maps two points in your initial space to the distance between these points in your target representation space, completely bypassing the explicit computation of the new representation.
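
A small scikit-learn sketch of an SVM with an RBF kernel, where the kernel function computes pairwise similarities without ever building the high-dimensional feature space explicitly; the dataset and hyperparameters are illustrative:

    from sklearn.datasets import make_moons
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

    # kernel="rbf" applies the kernel trick: only pairwise kernel values are computed.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
    clf.fit(X, y)
    print(clf.score(X, y))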

Deep learning completely automates what used to be the most crucial step in machine learning workflows: feature engineering. Feature engineering is manually engineering good layers of representations of data. Deep learning completely automates this step: with deep learning, you learn all features in one pass rather than having to engineer them yourself. This has greatly simplified ML workflows; what is transformative about deep learning is that it allows a model to learn all layers of representation jointly, at the same time, rather than in succession. With joint feature learning, whenever the model adjusts one of its internal features, all other features that depend on it automatically adapt to the change, without requiring human intervention.

Two essential characteristics of how deep learning learns from data:

  1. the incremental, layer-by-layer way in which increasingly complex representations are developed
  2. the fact that these intermediate incremental representations are learned jointly

Gradient boosted trees and deep learning dominated the machine learning and data science industry from roughly 2017 to 2020.

The Mathematical Building Blocks of Neural Networks


Neural network layers extract representations out of the data fed into them. Most of deep learning consists of chaining together simple layers that implement a progressive form of data distillation. A softmax classification layer returns an array of probability scores that sum to one. As part of the compilation step, the model needs an optimizer (the mechanism through which the model will update itself based on the training data it sees), a loss function (how the model will measure its performance on the training data), and metrics to monitor during training and testing.
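
A minimal sketch of that compilation step in Keras; the architecture and hyperparameters are illustrative:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),  # probability scores that sum to one
    ])
    model.compile(optimizer="rmsprop",                     # how the model updates itself
                  loss="sparse_categorical_crossentropy",  # how performance is measured
                  metrics=["accuracy"])                    # what to monitor during training/testing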

When the training accuracy is higher than the test accuracy, the model is overfitting. In general, the first axis in all data tensors you'll come across in deep learning will be the samples axis. When considering a batch tensor, the first axis is called the batch axis or batch dimension.

Training Loop

  1. Draw a batch of samples, x, and corresponding targets, y_true
  2. Run the model on x (a step called the forward pass) to obtain predictions, y_pred
  3. Compute the loss of the model on the batch, a measure of the mismatch between y_true and y_pred
  4. Update all the weights of the model in a way that slightly reduces the loss on this batch

Updating the model's weights is done by gradient descent, the optimization technique that powers modern neural networks. You can use a mathematical operator called the gradient to describe how the loss varies as you move the model's coefficients, and then move all the coefficients at once, in a single update, in a direction that decreases the loss. The derivative of a tensor operation (or function) is called a gradient. Gradients are just a generalization of the concept of derivatives to functions that take tensors as inputs.

Stochastic Gradient Descent

  1. Draw a batch of training samples, x, and corresponding targets y_true
  2. Run the model on x to obtain predictions y_pred (forward pass)
  3. Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true
  4. Compute the gradient of the loss with regard to the model's parameters (this is called the backward pass)
  5. Move the parameters a little in the opposite direction of the gradient (W -= learning_rate*gradient) - thus reducing the loss on the batch a bit.
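
A rough sketch of one such step with TensorFlow's GradientTape; here model, loss_fn, x_batch, and y_true are assumed to be defined elsewhere:

    import tensorflow as tf

    learning_rate = 1e-3

    def training_step(model, loss_fn, x_batch, y_true):
        with tf.GradientTape() as tape:
            y_pred = model(x_batch)             # forward pass
            loss = loss_fn(y_true, y_pred)      # mismatch between y_true and y_pred
        # Backward pass: gradient of the loss with regard to the model's parameters.
        gradients = tape.gradient(loss, model.trainable_weights)
        for w, g in zip(model.trainable_weights, gradients):
            w.assign_sub(learning_rate * g)     # W -= learning_rate * gradient
        return loss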

A neural network consists of many tensor operations chained together, each of which has a simple, known derivative. Applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called backpropagation. A useful way to think about backpropagation is in terms of computation graphs: backpropagation is simply the application of the chain rule to a computation graph. Backpropagation starts with the final loss value and works backwards from the top layers to the bottom layers, computing the contribution that each parameter had to the loss value.
