Mathematics For Machine Learning Notes

I am reading this because I hope that learning some of the math behind Machine Learning will help me gain a better intuition on how to store and process data in a way that utilizes machine learning. I hope to understand the different machine learning methods better as well.

Date Created:
Last Edited:


Preface


Machine learning is the latest in a long line of attempts to distill human knowledge and reasoning into a form that is suitable for constructing machines and engineering automated systems. As machine learning becomes more ubiquitous and its software packages become easier to use, it is natural and desirable that the low-level technical details are hidden from the practitioner. However, this brings with it the danger that the practitioner becomes unaware of the design decisions and hence, the limits of machine learning algorithms.

Why another Machine Learning Textbook?

We do not aim to write a classical machine learning book. Instead, our intention is to provide the mathematical background, applied to four central machine learning problems, to make it easier to read other machine learning textbooks.



Mathematical Foundations


Introduction and motivation


Machine learning is about designing algorithms that automatically extract valuable information from data. The emphasis here is on automatic, i.e., machine learning is concerned with general-purpose methodologies that can be applied to many datasets, while producing something that is meaningful. There are three concepts that are at the core of machine learning: data, a model, and learning.
  1. Data
    1. In this book, we assume that data has already been appropriately converted into a numerical representation suitable for reading into a computer program. Therefore, we think of data as vectors.
  2. Model
    1. A model is said to learn from data if its performance on a given task improves after the data is taken into account.
    2. A model is typically used to describe a process for generating data, similar to the dataset at hand.
  3. Learning
    1. A way to automatically find patterns and structure in data by optimizing the parameters of the model.
    2. Training the model means to use the data available to optimize some parameters of the model with respect to a utility function that evaluates how well the model predicts the training data.

Two things can be meant by algorithm in the machine learning sense:

  1. A system that makes predictions based on input data - predictors
  2. A system that adapts some internal parameters of the predictor so that it performs well on future unseen input data. We refer to this adaptation as training a system.

Three ways to think about vectors:

  1. Vector as an array of numbers - computer science view
  2. Vector as an arrow with a direction and magnitude - a physics view
  3. Vector as an object that obeys addition and scaling - a mathematical view

Foundations and Pillars of Machine Learning

  • Part 1 of this book is about the foundations of machine learning, and part 2 is about the pillars of machine learning.


Math Review


I just reviewed linear algebra, so I am not sure how much I need this chapter. I will take note of things I want to remind myself of.

  • I am including all of the Math section (Part 1) of Mathematics for Machine Learning here because I am tired of looking at equations and would rather watch videos on topics that I don't understand and link them here.

Vectors and Matrices

  • Vectors
  • Polynomials are vectors.
  • Audio signals are vectors.
  • Elements of $\mathbb{R}^n$ are vectors.

Mind Map of Linear Algebra and ML

  • In general, for a real-valued system of linear equations, we obtain either no, exactly one, or infinitely many solutions.

Some Properties of Matrices:

  • Consider a square matrix $A \in \mathbb{R}^{n \times n}$. Let $B \in \mathbb{R}^{n \times n}$ have the property that $AB = I_n = BA$. $B$ is called the inverse of $A$ and denoted by $A^{-1}$.
  • Not every matrix has an inverse. If an inverse for $A$ does exist, we call $A$ regular/invertible/nonsingular; otherwise, singular/noninvertible. When the matrix inverse exists, it is unique.
  • For $A \in \mathbb{R}^{m \times n}$, the matrix $B \in \mathbb{R}^{n \times m}$ with $b_{ij} = a_{ji}$ is called the transpose of $A$. We write $B = A^{\top}$.
    • Important properties of matrices and their transposes/inverses, with $A, B \in \mathbb{R}^{n \times n}$:

Important Properties of Matrices and Their Transposes/Inverses

  • $AA^{-1} = I = A^{-1}A$
  • $(AB)^{-1} = B^{-1}A^{-1}$
  • $(A^{\top})^{\top} = A$
  • $(A + B)^{\top} = A^{\top} + B^{\top}$
  • $(AB)^{\top} = B^{\top}A^{\top}$

  • A matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^{\top}$.
  • Square Matrices contain the same number of rows and columns.
  • Properties of scalar multiplication, with $\lambda, \psi \in \mathbb{R}$:

Properties of Scalar Multiplication for Matrices

  • Associativity: $(\lambda\psi)C = \lambda(\psi C)$
  • $(\lambda C)^{\top} = \lambda C^{\top}$
  • Distributivity: $(\lambda + \psi)C = \lambda C + \psi C$ and $\lambda(B + C) = \lambda B + \lambda C$

  • Systems of linear equations can have particular solutions and general solutions.
  • There exists a constructive algorithmic way of transforming any system of linear equations into a form that makes it easy to solve: Gaussian elimination. The key to Gaussian elimination is elementary transformations of systems of linear equations, which transform the equation system into a simple form.
  • Row Echelon Form
    • All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one nonzero element are on top of rows that contain only zeros.
    • Looking at nonzero rows only, the first nonzero number from the left is always strictly to the right of the pivot of the row above it.
  • The leading coefficient of a row (the first nonzero number from the left) is called the pivot and is always strictly to the right of the pivot of the row above it. The variables corresponding to the pivots in row-echelon form are called basic variables and the other variables are free variables.
  • Reduced Row Echelon Form
    • It is in row-echelon form.
    • Every pivot is a 1.
    • The pivot is the only nonzero entry in its column.
  • Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form.
  • The solution space of $Ax = 0$ is called the kernel or null space.
  • In practice, systems of linear equations are solved indirectly (rather than by Gaussian elimination, due to the number of variables), by either stationary iterative methods, such as the Richardson method, the Jacobi method, the Gauss-Seidel method, and the successive over-relaxation method, or Krylov subspace methods, such as conjugate gradients, generalized minimal residual, or biconjugate gradients.
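
As a concrete illustration of Gaussian elimination and reduced row-echelon form, here is a minimal sketch assuming SymPy is available; the example system is made up.

```python
# A minimal sketch (assuming SymPy) of bringing an augmented matrix [A | b]
# into reduced row-echelon form, as Gaussian elimination does.
from sympy import Matrix

# Hypothetical system:  x + 2y + z = 2,  2x + 4y = 2,  3y + z = 4
augmented = Matrix([
    [1, 2, 1, 2],
    [2, 4, 0, 2],
    [0, 3, 1, 4],
])

rref_matrix, pivot_columns = augmented.rref()
print(rref_matrix)    # the reduced row-echelon form
print(pivot_columns)  # indices of the pivot (basic-variable) columns
```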

Vector Spaces

  • Vector space - structured space in which vectors live.
  • Groups:
    • Play an important role in computer science. They provide a fundamental framework for operations on sets, and are heavily used in cryptography, coding theory, and graphics.

Groups

I am not sure how much reviewing these mathematical definitions actually helps me. For the rest of the math section, if I come across a concept that I don't understand, I'll try to link a YouTube video on it.


Vector Spaces


  • Have the following properties (Closure properties):
    • given $\mathbf{x} \in V$ and scalar $\lambda \in \mathbb{R}$, then $\lambda \mathbf{x} \in V$
    • given $\mathbf{x}, \mathbf{y} \in V$, then $\mathbf{x} + \mathbf{y} \in V$


Vector Subspaces


  • A subspace is a smaller space within the vector space that is itself a vector space.
  • The span of any number of elements of vector space V is also a subspace of V. It is the smallest subspace of V that contains that set of elements.


Linear Independence


  • If none of a group of vectors can be written as a linear combination of the others, then they are linearly independent.
  • To check for linear dependence, we look for a set of scalars, not all zero, that makes the linear combination of the vectors equal to 0.
  • If the scalars all have to be zero for the linear combination to be 0, then the group of vectors is linearly independent.
  • Finding the determinant is something we can only do with square matrices.
  • If the determinant is zero, then the vectors are linearly dependent.
  • If the determinant is nonzero, then the vectors are linearly independent.
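
A minimal sketch of the determinant test for linear independence, assuming NumPy; the vectors are made up, with the third deliberately the sum of the first two.

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([0.0, 1.0, 4.0])
v3 = v1 + v2  # deliberately linearly dependent on v1 and v2

# Stack the vectors as columns of a square matrix and test the determinant.
det = np.linalg.det(np.column_stack([v1, v2, v3]))
print("independent" if not np.isclose(det, 0.0) else "dependent")  # dependent
```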


Basis and Rank


  • A basis is a set of linearly independent vectors that can be used as building blocks to make any other vector in the vector space.
  • A group of vectors forms a basis for V if and only if they are linearly independent and they span the vector space V.
  • If a vector space has a basis with n elements, the vector space has dimension n.
  • Matrix Rank
    • The rank of a matrix = the number of pivots = the number of linearly independent columns = the number of linearly independent rows = the dimension of the image (column space).


Linear Mappings


  • Multiplication by a matrix is a linear map: it preserves vector addition and scalar multiplication. You can look up the formal definition.

GIF Explaining Linear Map

Basis Change


  • You are basically writing vectors in terms of a new coordinate system.


Affine Spaces


  • Affine lines need not pass through the origin.
  • The solution sets of systems of linear equations are affine subspaces.
  • A linear map keeps the origin fixed; an affine map moves the origin. Affine maps are linear maps plus translations.
  • Affine spaces allow for different origins. Different perspectives are related by translation.

Affine Space

Affine Space Definition


Vector Norms


  • L2 norm (Euclidean): $\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$
  • L1 norm (Manhattan): $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
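
A quick numerical check of the two norms above, assuming NumPy.

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, ord=2))  # L2 norm: sqrt(9 + 16) = 5
print(np.linalg.norm(x, ord=1))  # L1 norm: |3| + |-4| = 7
```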


Inner Products


Inner product: a symmetric, positive definite bilinear mapping $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$. The dot product $\langle x, y \rangle = x^{\top} y$ is the standard example.


Orthogonal Complement


  • The orthogonal complement $U^{\perp}$ of a subspace $U$ is the set of vectors such that each vector in $U^{\perp}$ is orthogonal (inner product equal to 0) to every vector in $U$.


Projections


  • $\pi_L(x) = \dfrac{\langle x, b \rangle}{\|b\|^2} b$ = projection onto the line $L$ (spanned by $b$) of some other vector $x$
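
A minimal sketch of this projection formula, assuming NumPy; the vectors x and b are made up.

```python
import numpy as np

b = np.array([1.0, 1.0])  # spans the line L
x = np.array([2.0, 0.0])  # vector to be projected

# pi_L(x) = (<x, b> / ||b||^2) * b
projection = (x @ b) / (b @ b) * b
print(projection)  # [1. 1.]: the component of x along b
```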


Gram-Schmidt Orthogonalization


  • The Gram-Schmidt orthogonalization process takes a set of basis vectors for a vector space and forms a new orthogonal basis for the same vector space.
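
A minimal sketch of classical Gram-Schmidt, assuming NumPy; gram_schmidt is my own helper name and the basis B is a made-up example.

```python
import numpy as np

def gram_schmidt(basis):
    """Turn the rows of `basis` into an orthogonal basis for the same space."""
    orthogonal = []
    for v in basis:
        # Subtract the projection of v onto every vector found so far.
        w = v.astype(float)
        for u in orthogonal:
            w = w - (v @ u) / (u @ u) * u
        orthogonal.append(w)
    return np.array(orthogonal)

B = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])  # hypothetical basis vectors as rows
Q = gram_schmidt(B)
print(np.round(Q @ Q.T, 10))  # off-diagonal entries are 0: rows are orthogonal
```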

Orthogonalization Gram-Schmidt


Matrix Rotations


  • Matrix multiplication can be used to rotate a figure. To rotate a figure is to move it around a center point.
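
A minimal sketch of rotation via matrix multiplication, assuming NumPy; it rotates the point (1, 0) by 90 degrees about the origin.

```python
import numpy as np

theta = np.pi / 2  # 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # standard 2D rotation matrix

point = np.array([1.0, 0.0])
print(np.round(R @ point, 10))  # [0. 1.]: (1, 0) rotated to (0, 1)
```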

Matrix Rotation


Determinant and Trace


  • The trace of a square matrix is the sum of its diagonal entries. It is also the sum of its eigenvalues.
  • The determinant of a square matrix is the product of its eigenvalues.
  • Markov matrix - the sum of each column is 1 and each entry is between 0 and 1.


Eigenvalues and Eigenvectors


  • Eigenvectors - vectors that maintain the same span after a linear transformation. Each eigenvector has an associated eigenvalue, which is the factor by which the vector is stretched or squished during the transformation.
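
A minimal sketch, assuming NumPy, that computes eigenvalues and eigenvectors and checks $Av = \lambda v$ for one pair; the symmetric matrix A is made up.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]  # eigenvectors are the *columns* of this matrix
lam = eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True: v keeps its span under A
```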


Cholesky Decomposition


  • If $A$ is symmetric and positive definite, then $A = LL^{\top}$, where $L$ is the lower triangular factor of $A$, which can be obtained in a manner similar to LU decomposition.
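
A minimal sketch of the factorization, assuming NumPy; the matrix A is a made-up symmetric positive definite example.

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # symmetric positive definite

L = np.linalg.cholesky(A)       # lower triangular Cholesky factor
print(np.allclose(L @ L.T, A))  # True: A = L L^T
```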


Eigendecomposition and Diagonalization


Diagonalization


Singular Value Decomposition


  • The eigenvectors of a symmetric matrix are orthogonal to each other.

Some Intuition About Matrices


Matrix Approximation


  • You're approximating the SVD by truncating the matrices on the right-hand side of $A = U \Sigma V^{\top}$.
  • You are removing some columns on the right.
  • This assumes there are many more columns than rows.
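
A minimal sketch of a rank-k approximation by truncating the SVD, assuming NumPy; the data matrix is random and k = 2 is an arbitrary choice.

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 8))  # hypothetical data matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, the spectral-norm error is the next singular value.
print(np.linalg.norm(A - A_k, 2), s[k])
```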


Taylor Series


  • A Taylor series takes a non-polynomial function and approximates it with a polynomial near some input.
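
A minimal sketch, assuming NumPy, comparing e^x with its degree-4 Taylor polynomial around 0; taylor_exp is my own helper name.

```python
import math
import numpy as np

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of e^x around 0: sum of x^k / k!."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 0.5
print(np.exp(x), taylor_exp(x, 4))  # very close near the expansion point
```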

Taylor Series


Partial Differentiation and Gradients


Differentiation reminder

  • Partial derivatives - differentiate with respect to one variable while treating the other variables as constants.

Partial Derivatives Reminder

Gradient Formula / Example

Gradient of a Function

  • $\nabla f$ is how many people express the gradient


Gradients of Vector-Valued Functions


Vector Valued Functions

The Jacobian


Gradients of Matrices


  • You just take the partial derivative of each row with respect to that row's variables


Backpropagation and Automatic Differentiation


Gradients in Machine Learning

Why Learn about Backpropagation

Chain Rule

Autodifferentiation

Machine Learning Backprop Summary


Construction of a Probability Space


Construction of Probability Space


Discrete and Continuous Probabilities


  • Pretty self-explanatory


Sum Rule, Product Rule, and Bayes' Theorem


Sum Rule

Product Rule


Bayes Theorem

Summary Statistics and Independence


  • This is self-explanatory / you know most of this


Gaussian Distribution



Conjugacy and the Exponential Family


  • The Conjugate Prior
    • A prior is conjugate for a likelihood if the posterior has the same form as the prior. For example, if you assume the prior density in Bayes' Theorem to be normal (with a normal likelihood), then the resulting posterior density is normal as well.
  • If you can write the distribution as the product of two functions times the exponential of an inner product between the natural parameters and the sufficient statistics, then that distribution is in the exponential family.

Exponential Family


Change of Variables / Inverse Transform


The Jacobian


Optimization Using Gradient Descent


  • Optimization is about finding the maximum or minimum of a function. Gradient descent minimizes a function by repeatedly stepping in the direction of the negative gradient.
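
A minimal sketch of gradient descent on the one-dimensional function f(x) = (x - 3)^2, whose gradient is 2(x - 3); the step size and iteration count are arbitrary choices.

```python
x = 0.0             # starting point
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (x - 3)  # derivative of (x - 3)^2
    x -= learning_rate * gradient

print(x)  # converges toward the minimizer x = 3
```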

Constrained Optimization and Lagrange Multipliers

Convex Optimization


Central Machine Learning Problems


When Models Meet Data


  • Four pillars of machine learning:
    • Regression
    • Dimensionality Reduction
    • Density Estimation
    • Classification

Data, Models, and Learning


  • Data is assumed to be tabular, where we think of each row of the table as representing a particular instance or example, and each column to be a particular feature.
  • Each input (row) is a D-dimensional vector of real numbers, which are called features, attributes, or covariates.
  • Each column represents a particular feature of interest about the example, and we index the features as $d = 1, \ldots, D$.
  • A predictor is a function that, when given a particular input example (in our case, a vector of features), produces an output.
  • The goal of learning is to find a model and its corresponding parameters such that the resulting predictor will perform well on unseen data. There are conceptually three distinct algorithmic phases when discussing machine learning algorithms:
  1. Prediction or inference
    1. We use a trained predictor on previously unseen test data
  2. Training or parameter estimation
    1. We adjust our predictive model based on training data. Two main strategies for doing so:
      1. Finding the best predictor based on some measure of quality.
      2. Using Bayesian inference
    2. For non-probabilistic models, we use the principle of empirical risk minimization, which directly provides an optimization problem for finding good parameters.
    3. With a statistical model, maximum likelihood is used to find a good set of parameters. We can additionally model the uncertainty of parameters using a probabilistic model.
    4. We simulate the behavior of our predictor on future unseen data using cross-validation.
    5. Abduction is the process of inference to the best explanation.
  3. Hyperparameter tuning or model selection
The distinction between parameters and hyperparameters is somewhat arbitrary, and is mostly driven by the distinction between what can be numerically optimized versus what needs to use search techniques.



Empirical Risk Minimization


  • Questions to ask:
    • What is the set of functions we allow the predictor to take?
    • How do we measure how well the predictor performs on the training data?
    • How do we construct predictors from only training data that perform well on unseen test data?
    • What is the procedure for searching over the space of models?
What is the set of functions we allow the predictor to take? The hypothesis class of functions:
  • We are trying to find good parameters so that our prediction is not far off from the real output for all data points.
How do we measure how well the predictor performs on the training data?
  • The loss function defines what it means to fit the data well: $\ell(y_n, \hat{y}_n)$, where $\hat{y}_n$ is the prediction of the output and $y_n$ is the observed output.
  • One assumption that is commonly made in machine learning is that the set of examples is independent and identically distributed. The word independent means that two data points $(x_i, y_i)$ and $(x_j, y_j)$ do not statistically depend on each other, meaning that the empirical mean is a good estimate of the population mean.


How do we construct predictors from only training data that perform well on unseen test data?
  • We are not interested in a predictor that only performs well on the training data. Instead, we seek a predictor that performs well on unseen data. More formally, we are interested in finding a predictor that minimizes the expected risk $\mathbb{E}_{x,y}[\ell(y, f(x))]$.
  • We simulate unseen data by holding out a proportion of the whole dataset. This held-out portion is referred to as the test set. In practice, we have only a finite set of data, and hence we split our data into a training and a test set. The training set is used to fit the model, and the test set is used to evaluate generalization performance.
  • Empirical risk minimization can lead to overfitting.
  • We need to somehow bias the search for the minimizer of empirical risk by introducing a penalty term, which makes it harder for the optimizer to return an overly flexible predictor. In machine learning, the penalty term is referred to as regularization.
  • The data held out for choosing hyperparameters is referred to as the validation set. The validation set is a subset of the available training data that we keep aside.

Cross Validation

  • K-fold cross-validation partitions the data into K chunks, K-1 of which form the training set R, while the last chunk serves as the validation set V. Cross-validation iterates through all K choices of the validation set, and the performance of the model from the K runs is averaged.
  • We partition our dataset into two sets $\mathcal{D} = \mathcal{R} \cup \mathcal{V}$ that do not overlap ($\mathcal{R} \cap \mathcal{V} = \emptyset$), where $\mathcal{V}$ is the validation set, and train our model on $\mathcal{R}$. After training, we assess the performance of the predictor on the validation set $\mathcal{V}$.
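
A minimal sketch of K-fold cross-validation using only NumPy; k_fold_scores and score_model (the train-and-evaluate routine) are my own hypothetical names.

```python
import numpy as np

def k_fold_scores(X, y, k, score_model):
    """Average validation performance over K train/validation splits."""
    indices = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                                   # chunk V
        train_idx = np.concatenate(folds[:i] + folds[i+1:])  # chunks R
        scores.append(score_model(X[train_idx], y[train_idx],
                                  X[val_idx], y[val_idx]))
    return np.mean(scores)

# Dummy usage: the "metric" just reports the validation fold size.
X = np.arange(20, dtype=float).reshape(10, 2)
y = X.sum(axis=1)
print(k_fold_scores(X, y, k=5,
                    score_model=lambda Xr, yr, Xv, yv: float(len(Xv))))  # 2.0
```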

K-fold Cross Validation

  • One can use nested cross-validation to search for good hyperparameters.
  • Cross-validation is an embarrassingly parallel problem - little effort is needed to separate the problem into a number of parallel tasks.

Parameter Estimation

  • The idea behind maximum likelihood estimation (MLE) is to define a function for the parameters that enables us to find a model that fits the data well. The estimation problem is focused on the likelihood function, or more precisely its negative logarithm.
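
A minimal sketch, assuming NumPy, of maximum likelihood by minimizing the negative log-likelihood of a Gaussian with known variance over a grid of candidate means; the data and helper names are made up.

```python
import numpy as np

data = np.array([1.2, 0.8, 1.1, 0.9, 1.0])  # hypothetical observations

def negative_log_likelihood(mu, x, sigma=1.0):
    """Negative log of the Gaussian likelihood of x given mean mu."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu)**2 / (2 * sigma**2))

candidates = np.linspace(0.0, 2.0, 201)
best_mu = min(candidates, key=lambda m: negative_log_likelihood(m, data))
print(best_mu, data.mean())  # the MLE coincides with the sample mean
```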

I am going to try going through the MIT OpenCourseWare on Machine Learning. I am tired of slogging through this text.
