Hands On Machine Learning Chapter 4 - Training Models

I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped neural networks entirely on that first pass.

Training Models

Having a good understanding of how things [machine learning models] work can help you quickly home in on the appropriate model, the right training algorithm to use, and a good set of hyperparameters for your task. Understanding what's under the hood will also help you debug issues and perform error analysis more efficiently. Lastly, most of the topics discussed in this chapter will be essential in understanding, building, and training neural networks.

We will start this chapter by looking at the Linear Regression model, one of the simplest models there is. There are two very different ways to train it:

  1. Using a direct "closed-form" equation that directly computes the model parameters that best fit the model to the training set (i.e., the model parameters that minimize the cost function over the training set).
  2. Using an iterative approach, called Gradient Descent (GD), that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method.

Next, we will look at Polynomial Regression, a more complex model that can fit non-linear datasets. Since this model has more parameters, it is more prone to overfitting, so we will look at how to prevent this. Finally, we will look at two more models that are commonly used for classification tasks: Logistic Regression and Softmax Regression.

Linear Regression

A linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term):

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

where $\hat{y}$ is the predicted value, $n$ is the number of features, $x_i$ is the $i^{th}$ feature value, and $\theta_j$ is the $j^{th}$ model parameter (including the bias term $\theta_0$ and the feature weights $\theta_1, \theta_2, \ldots, \theta_n$).

This equation can be written more concisely in the vectorized form:

$$\hat{y} = h_{\bm{\theta}}(\textbf{x}) = \bm{\theta} \cdot \textbf{x}$$

where $\bm{\theta}$ is the model's *parameter vector*, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$; $\textbf{x}$ is the instance's *feature vector*, containing $x_0$ to $x_n$, with $x_0$ always equal to 1; $\bm{\theta} \cdot \textbf{x}$ is the dot product of the vectors $\bm{\theta}$ and $\textbf{x}$; and $h_{\bm{\theta}}$ is the hypothesis function, using the model parameters $\bm{\theta}$.

Training a model means setting its parameters so that the model best fits the training set. For this purpose, we need a measure of how well (or poorly) the model fits the training data. A common performance measure of a regression model is the Root Mean Square Error (RMSE). Therefore, to train a Linear Regression model, you need to find the value of $\bm{\theta}$ that minimizes the RMSE. (In practice, it is simpler to minimize the MSE, which leads to the same result.)

$$\text{MSE}(\textbf{X}, h_{\bm{\theta}}) = \cfrac{1}{m} \sum_{i=1}^{m} \left( \bm{\theta}^T \textbf{x}^i - y^i \right)^2$$

The Normal Equation

To find the value of $\bm{\theta}$ that minimizes the cost function, there is a closed-form solution - in other words, a mathematical equation that gives the result directly. This is called the Normal Equation:

$$\hat{\bm{\theta}} = \left( \textbf{X}^T \textbf{X} \right)^{-1} \textbf{X}^T \textbf{y}$$

where $\hat{\bm{\theta}}$ is the value of $\bm{\theta}$ that minimizes the cost function, and $\textbf{y}$ is the vector of target values, containing $y^1$ to $y^m$.

Below, there is code that computes the best parameter vector using NumPy, the LinearRegression class of Scikit-Learn, and SciPy. The SciPy and Scikit-Learn implementations do not invert $\textbf{X}^T \textbf{X}$ directly; instead they calculate the pseudoinverse of $\textbf{X}$ (specifically, the Moore-Penrose inverse). The pseudoinverse itself is computed using a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix $\textbf{X}$ into the matrix multiplication of three matrices $\textbf{U} \bm{\Sigma} \textbf{V}^T$. This approach handles edge cases nicely: the Normal Equation may not work if the matrix $\textbf{X}^T \textbf{X}$ is not invertible, but the pseudoinverse is always defined.

Computational Complexity

The computational complexity of inverting a matrix is typically about $O(n^{2.4})$ to $O(n^3)$ (where $n$ is the number of features). In other words, if you double the number of features, you multiply the computation time by roughly 5.3 to 8. The SVD approach used by Scikit-Learn's LinearRegression class is about $O(n^2)$, so if you double the number of features, you multiply the computation time by 4. Both the Normal Equation and the SVD approach get very slow when the number of features grows large. On the positive side, both are linear with regard to the number of instances in the training set, so they can handle large training sets well, provided they fit in memory. Predictions are very fast with a linear model once you have trained it: the computational complexity is linear with regard to the number of instances you want to make predictions on.

import numpy as np
import matplotlib.pyplot as plt
# Generate some random-looking data
X = 2 * np.random.rand(100,1)
y = 4 + 3 * X + np.random.randn(100,1)
plt.scatter(X,y,c='blue',label="Data",marker='.')
plt.xlabel("X_1")
plt.ylabel("y")
plt.axis((0,2,0,15))

# Computing the parameter vector using the normal equation
# inv() function computes the inverse of a matrix
# dot() method is used for matrix multiplication
X_b = np.c_[np.ones((100,1)),X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print("theta_best =",theta_best)
X_new = np.array([[0], [2]])
print("X_new =",X_new)
X_new_b = np.c_[np.ones((2,1)),X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
print("Prediction of X_new with Manual Computation of Closed-Form Solution =",y_predict) 
plt.plot(X_new,y_predict,"r-",label="Predictions")
plt.legend()
plt.show()

# Performing Linear Regression Using Scikit-Learn is Simple
print("Linear Regression with Scikit-Learn\n--------------------------------")
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X,y)
print("Intercept = ",lin_reg.intercept_,"\n","Coefficients =",lin_reg.coef_)
print("Prediction of X_new =",lin_reg.predict(X_new))

theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
print("The LinearRegression class is based on the scipy.linalg.lstsq() function => theta_best_svd =",theta_best_svd)
print("Pseudoinverse of X (the Moore-Penrose inverse) =",np.linalg.pinv(X_b).dot(y))

out[2]

theta_best = [[3.73436804]
[3.2568451 ]]
X_new = [[0]
[2]]
Prediction of X_new with Manual Computation of Closed-Form Solution = [[ 3.73436804]
[10.24805824]]

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Linear Regression with Scikit-Learn
--------------------------------
Intercept = [3.73436804]
Coefficients = [[3.2568451]]
Prediction of X_new = [[ 3.73436804]
[10.24805824]]
The LinearRegression class is based on the scipy.linalg.lstsq() function => theta_best_svd = [[3.73436804]
[3.2568451 ]]
Pseudoinverse of X (the Moore-Penrose inverse) = [[3.73436804]
[3.2568451 ]]

Gradient Descent

Gradient Descent is better suited for cases where there are a large number of features, or too many training instances to fit in memory. It is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function. Gradient Descent measures the local gradient of the error function with regard to the parameter vector $\bm{\theta}$, and it goes in the direction of the descending gradient. Once the gradient is 0, you have reached a minimum. Concretely, you start by filling $\bm{\theta}$ with random values (this is called random initialization) and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.

Gradient Descent

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge (too slow). If the learning rate is too high, the algorithm might diverge, with larger and larger values, failing to find a good solution.

Learning Rate Too Small

Learning Rate Too Large

Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. You might converge to a local minimum, which is not as good as a global minimum.

Gradient Descent Pitfalls

The MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve - this implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum. In fact, the cost function has the shape of a bowl, but it can be an elongated bowl if the features have different scales. The image below shows why you should ensure that all features have a similar scale when using Gradient Descent - it will converge faster. The diagram also illustrates the fact that training a model means searching for a combination of model parameters that minimizes a cost function (over the training set). It is a search in the model's parameter space: the more parameters a model has, the more dimensions this space has, and the harder the search is: searching for a needle in a 300-dimensional haystack is much trickier than in three dimensions. Fortunately, since the cost function is convex in the case of Linear Regression, the needle is simply at the bottom of the bowl.

Gradient Descent With and Without Feature Scaling

Batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter $\theta_j$. In other words, you need to calculate how much the cost function will change if you change $\theta_j$ just a little bit. This is called the partial derivative.

Partial Derivative of the Cost Function:

$$\cfrac{\partial}{\partial \theta_j} \text{MSE}(\bm{\theta}) = \cfrac{2}{m} \sum_{i=1}^{m} \left( \bm{\theta}^T \textbf{x}^i - y^i \right) x_j^i$$

Instead of computing the partial derivatives individually, you can use the equation below. The gradient vector, denoted $\nabla_{\bm{\theta}} \text{MSE}(\bm{\theta})$, contains all of the partial derivatives of the cost function (one for each model parameter). This formula involves calculations over the full training set $\textbf{X}$ at every Gradient Descent step, which is why the algorithm is called Batch Gradient Descent. As a result, it is terribly slow on very large training sets, but it scales well with the number of features.

Gradient Vector Cost Function
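For reference, the gradient vector shown in the image above can be written in one shot as a matrix product:

$$\nabla_{\bm{\theta}} \text{MSE}(\bm{\theta}) = \cfrac{2}{m} \textbf{X}^T \left( \textbf{X} \bm{\theta} - \textbf{y} \right)$$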

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_{\bm{\theta}} \text{MSE}(\bm{\theta})$ from $\bm{\theta}$. This is where the learning rate $\eta$ comes into play: multiply the gradient vector by $\eta$ to determine the size of the downhill step:

$$\bm{\theta}^{\text{next step}} = \bm{\theta} - \eta \nabla_{\bm{\theta}} \text{MSE}(\bm{\theta})$$

Below you can see a quick implementation of this algorithm:

eta = 0.1 # learning rate
n_iterations = 1000
m = 100

theta = np.random.randn(2,1) # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients 
theta
out[4]

array([[3.73436804],
       [3.2568451 ]])

Gradient Descent With Different Learning Rates

The image above shows Gradient Descent with different learning rates. On the left, the learning rate is too small: the algorithm will eventually reach the solution, but it will take a long time. On the right, the learning rate is too high and the algorithm diverges. In the middle, the learning rate converges to a solution in just a few iterations. To find a good learning rate, you can use grid search. However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge. Setting the number of iterations: a simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny - that is, when its norm becomes smaller than a number $\epsilon$, called the tolerance - because this happens when Gradient Descent has (almost) reached the minimum.
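Below is a minimal sketch of that stopping rule, reusing X_b, y, and m from the earlier cells; the epsilon value and the iteration cap are illustrative choices, not values from the book:

# Batch GD with a tolerance-based stopping rule (sketch)
eta = 0.1                       # learning rate
epsilon = 1e-6                  # tolerance: stop when the gradient norm is below this
theta = np.random.randn(2,1)    # random initialization
for iteration in range(1_000_000):  # "very large number of iterations"
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < epsilon:
        break  # gradient vector is tiny => we are (almost) at the minimum
    theta = theta - eta * gradients
print(iteration, theta)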

Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that instance. This makes the algorithm much faster and able to train on very large datasets. Due to its random (stochastic) nature, this algorithm is much less regular than Batch Gradient Descent. Instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Once the algorithm stops, the final parameter values are good, but not optimal.

Stochastic Gradient Descent

When the cost function is very irregular, the algorithm can jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent. Randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution is to gradually decrease the learning rate - start large, then get smaller - in a process called simulated annealing. The function that determines the learning rate at each iteration is called the learning schedule.

If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early. [...] By convention, we iterate by rounds of $m$ iterations; each round is called an epoch.

Stochastic Gradient Descent First 20 Steps

Since the instances are picked randomly, some instances may be picked several times per epoch while others may not be picked at all. If you want to be sure that the algorithm goes through every instance at each epoch, another approach is to shuffle the training set at each epoch and then walk through it instance by instance. This generally converges more slowly.

To perform Linear Regression using SGD with Scikit-Learn, you can use the SGDRegressor class, which defaults to optimizing the squared error cost function.

Mini-batch Gradient Descent

Mini-batch Gradient Descent: at each step, instead of computing the gradients based on the full training set or based on just one instance, Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs. The algorithm's progress in parameter space is less erratic than with SGD (see the image below and the sketch after it), especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. On the other hand, it may be harder for it to escape from local minima.

Gradient Descent in Parameter Space
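Here is a minimal sketch of Mini-batch GD, again reusing X_b, y, and m from above; the mini-batch size and the schedule constants are illustrative choices, not values from the book:

# Mini-batch Gradient Descent (sketch)
n_epochs = 50
minibatch_size = 20
t0, t1 = 200, 1000  # illustrative learning schedule hyperparameters
def learning_schedule(t):
    return t0 / (t + t1)
theta = np.random.randn(2,1)  # random initialization
t = 0
for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)  # shuffle once per epoch
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/len(xi) * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
print(theta)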

Summary of Linear Regression

Note that $m$ is the number of training instances, and $n$ is the number of features.

Comparison of Linear Regression Algorithms

# Implements Stochastic Gradient Descent using a simple learning schedule

n_epochs = 50
t0, t1 = 5, 50
def learning_schedule(t):
    return t0 / (t + t1)
theta = np.random.randn(2,1) # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch*m+i)
        theta = theta - eta * gradients
print(theta)

# Runs a maximum of 1000 epochs
# until loss drops less than 1e-3 during one epoch
# starts with a learning rate of 0.1
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000,tol=1e-3,penalty=None,eta0=0.1)
sgd_reg.fit(X,y.ravel())
print("Intercept =",sgd_reg.intercept_,"Coefficients =",sgd_reg.coef_)
out[6]

[[3.73253048]
[3.23729649]]

Polynomial Regression

Even if the data is not linear (see the graph below), a linear model can still be used to fit it. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression. You can use Scikit-Learn's PolynomialFeatures class to transform the training data, then perform Linear Regression on the extended training data.

m = 100
X = 6 * np.random.rand(m,1) - 3
y = 0.5 * X **2 + X + 2 + np.random.randn(m,1)
plt.scatter(X,y,c='blue',label="Data",marker='.')
plt.xlabel("X_1")
plt.ylabel("y")
plt.axis((-3,3,0,10))
out[8]

(-3.0, 3.0, 0.0, 10.0)

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

# Transforming Data - Adding the Square of each Feature in the Training Set as New Features
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print("X[0] =",X[0])
print("X_poly[0] =",X_poly[0])
# X_poly now contains the original feature of X plus the square of this feature. You can fit a Linear Regression model to this extended training data
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print("Intercept =",lin_reg.intercept_,"Coefficients =",lin_reg.coef_)

x_plot = np.linspace(-3,3,num=200)
y_plot = [lin_reg.coef_[0][1]*(i**2) + lin_reg.coef_[0][0]*i + lin_reg.intercept_ for i in x_plot]
plt.scatter(X,y,c='blue',label="Data",marker='.')
plt.xlabel("X_1")
plt.ylabel("y")
plt.axis((-3,3,0,10))
plt.plot(x_plot,y_plot,c="red")
plt.show()
out[9]

X[0] = [1.05418078]
X_poly[0] = [1.05418078 1.11129711]
Intercept = [2.16142843] Coefficients = [[1.02479608 0.47362887]]

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Not bad: the model estimates $\hat{y} = 0.4736 x_1^2 + 1.0248 x_1 + 2.1614$ when the original function was $y = 0.5 x_1^2 + x_1 + 2.0 + \text{Gaussian noise}$. Note that when there are multiple features, Polynomial Regression is capable of finding relationships between features (which is something plain Linear Regression cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features $a$ and $b$, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2 b$, and $a b^2$. Beware of the combinatorial explosion of the number of features:

Combinatorial Explosion of Features

Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression (see the image below). Of course, this high-degree Polynomial Regression severely overfits the training data, while the linear model underfits it. How can you decide how complex your model should be? You could use cross-validation to get an estimate of the model's generalization performance. Another way is to look at learning curves: these are plots of the model's performance on the training set and the validation set as a function of the training set size (or the training iteration). To generate the plots, simply train the model several times on different-sized subsets of the training data.

High Degree Polynomial Regression

The learning curves below are typical of an underfitting model:

First, let’s look at the performance on the training data: when there are just one or two instances in the training set, the model can fit them perfectly, which is why the curve starts at zero. But as new instances are added to the training set, it becomes impossible for the model to fit the training data perfectly, both because the data is noisy and because it is not linear at all. So the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set doesn’t make the average error much better or worse. Now let’s look at the performance of the model on the validation data. When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error is initially quite big. Then as the model is shown more training examples, it learns and thus the validation error slowly goes down. However, once again a straight line cannot do a good job modeling the data, so the error ends up at a plateau, very close to the other curve.

If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features. The second learning curve below shows two things:

  1. The error on the training data is much lower than with the Linear Regression model.
  2. There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data - the hallmark of an overfitting model.

One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.

The Bias / Variance Tradeoff

An important theoretical result of statistics and Machine Learning is the fact that a model's generalization error can be expressed as the sum of three very different errors:

  1. Bias
  • This part of the generalization error is due to wrong assumptions, such as assuming the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.
  2. Variance
  • This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.
  3. Irreducible Error
  • This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data.

Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance. This is why it is called a tradeoff.

# Plot the Learning Curves of a Model given some training data
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
def plot_learning_curves(model,X,y):
    X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m],y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m],y_train_predict))
        val_errors.append(mean_squared_error(y_val,y_val_predict))
    plt.plot(np.sqrt(train_errors),"r-+",linewidth=2,label="train")
    plt.plot(np.sqrt(val_errors),"b-",linewidth=3,label="val")
lin_reg = LinearRegression()
plot_learning_curves(lin_reg,X,y)
plt.xlabel("Training Set Size")
plt.ylabel("RMSE")
plt.legend()
plt.axis((0,80,0,3))
plt.show()

from sklearn.pipeline import Pipeline 
polynomial_regression = Pipeline([
    ("poly_features",PolynomialFeatures(degree=10,include_bias=False)),
    ("lin_reg",LinearRegression())
])
plot_learning_curves(polynomial_regression,X,y)
plt.xlabel("Training Set Size")
plt.ylabel("RMSE")
plt.axis((0,80,0,3))
plt.legend()
plt.show()
out[11]
Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Jupyter Notebook Image

<Figure size 640x480 with 1 Axes>

Regularizing Linear Models

A good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data. For a linear model, regularization is typically achieved by constraining the weights of the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.

Ridge Regression

Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term equal to $\alpha \sum_{i=1}^n \theta_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. The regularization term should only be added during training; once the model is trained, you want to evaluate the model's performance using the unregularized performance measure. The hyperparameter $\alpha$ controls how much you want to regularize the model. If $\alpha = 0$, then Ridge Regression is just Linear Regression. If $\alpha$ is very large, then all weights end up very close to zero and the result is a flat line going through the data's mean. Note that the bias term $\theta_0$ is not regularized and that the regularization term is equal to $\frac{1}{2}$ times the square of the $\ell_2$ norm of the weight vector (where the weight vector contains the feature weights $\theta_1$ to $\theta_n$). It is important to scale the data before performing Ridge Regression.

Ridge Regression Cost Function

$$J(\bm{\theta}) = \text{MSE}(\bm{\theta}) + \alpha \cfrac{1}{2} \sum_{i=1}^{n} \theta_i^2$$

The image below shows Ridge Regression with different $\alpha$ values. Note how increasing $\alpha$ leads to flatter (less extreme, more reasonable) predictions; this reduces the model's variance but increases its bias.

Ridge Regression Closed-Form Solution

$$\hat{\bm{\theta}} = \left( \textbf{X}^T \textbf{X} + \alpha \textbf{A} \right)^{-1} \textbf{X}^T \textbf{y}$$

where $\textbf{A}$ is the $(n+1) \times (n+1)$ identity matrix, except with a 0 in the top-left cell, corresponding to the bias term.

Ridge Regression

Lasso Regression

Least Absolute Shrinkage and Selection Operator Regression (usually just called Lasso Regression) is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the $\ell_1$ norm of the weight vector instead of half the square of the $\ell_2$ norm. The image below shows Lasso Regression in action.

Lasso Regression Cost Function

$$J(\bm{\theta}) = \text{MSE}(\bm{\theta}) + \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert$$

Lasso Regression

An important feature of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).

[Description of the image below]: on the top-left plot, the background contours (ellipses) represent an unregularized MSE cost function ($\alpha = 0$), and the white circles show the Batch Gradient Descent path with that cost function. The foreground contours (diamonds) represent the $\ell_1$ penalty, and the triangles show the BGD path for this penalty only ($\alpha \rightarrow \infty$). Notice how the path first reaches $\theta_1 = 0$, then rolls down a gutter until it reaches $\theta_2 = 0$. On the top-right plot, the contours represent the same cost function plus an $\ell_1$ penalty with $\alpha = 0.5$. The global minimum is on the $\theta_2 = 0$ axis. BGD first reaches $\theta_2 = 0$, then rolls down the gutter until it reaches the global minimum. The two bottom plots show the same thing but use an $\ell_2$ penalty instead. The regularized minimum is closer to $\bm{\theta} = \textbf{0}$ than the unregularized minimum, but the weights do not get fully eliminated.

You need to gradually reduce the learning rate in order to actually converge to the global minimum with Lasso Regression.

Lasso Ridge Regularization

Lasso Cost Function Consideration
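The consideration the image above refers to: the Lasso cost function is not differentiable at $\theta_i = 0$, but Gradient Descent still works if you use a subgradient vector instead whenever any $\theta_i = 0$. One subgradient that works is:

$$g(\bm{\theta}, J) = \nabla_{\bm{\theta}} \text{MSE}(\bm{\theta}) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix} \quad \text{where} \quad \operatorname{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}$$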

Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio $r$. When $r = 0$, Elastic Net is equivalent to Ridge Regression, and when $r = 1$, it is equivalent to Lasso Regression.

Elastic Net Cost Function

$$J(\bm{\theta}) = \text{MSE}(\bm{\theta}) + r\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert + \cfrac{1-r}{2} \alpha \sum_{i=1}^{n} \theta_i^2$$

It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net. In general, Elastic Net is preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping. It is such a simple and efficient regularization technique that Geoffrey Hinton called it a "beautiful free lunch".

Early Stopping Regularization

## Perform Ridge Regression with Scikit-Learn Using a Closed-Form Solution
from sklearn.linear_model import Ridge, SGDRegressor
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X,y)
print("Ridge Regression Closed Form Prediction =",ridge_reg.predict([[1.5]]))
## And using Stochastic Gradient Descent
sgd_reg = SGDRegressor(penalty="l2") # l2 penalty = Ridge Regression
sgd_reg.fit(X,y.ravel())
print("Ridge Regression SGD Prediction =",sgd_reg.predict([[1.5]]))

## Scikit-Learn Example using the Lasso Class
from sklearn.linear_model import Lasso 
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X,y)
print("Lasso Reg Prediction =",lasso_reg.predict([[1.5]]))

## Elastic Net Example
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1,l1_ratio=0.5)
elastic_net.fit(X,y)
print("ElasticNet Prediction =",elastic_net.predict([[1.5]]))

## Early Stopping Implementation
# prepare the data
from copy import deepcopy
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# extra code – creates the same quadratic dataset as earlier and splits it
np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)
X_train, y_train = X[: m // 2], y[: m // 2, 0]
X_valid, y_valid = X[m // 2 :], y[m // 2 :, 0]

preprocessing = make_pipeline(PolynomialFeatures(degree=90, include_bias=False),
                              StandardScaler())
X_train_prep = preprocessing.fit_transform(X_train)
X_valid_prep = preprocessing.transform(X_valid)
sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
n_epochs = 500
best_valid_rmse = float('inf')
train_errors, val_errors = [], []  # extra code – it's for the figure below

for epoch in range(n_epochs):
    sgd_reg.partial_fit(X_train_prep, y_train)
    y_valid_predict = sgd_reg.predict(X_valid_prep)
    val_error = mean_squared_error(y_valid, y_valid_predict, squared=False)
    if val_error < best_valid_rmse:
        best_valid_rmse = val_error
        best_model = deepcopy(sgd_reg)

    # extra code – we evaluate the train error and save it for the figure
    y_train_predict = sgd_reg.predict(X_train_prep)
    train_error = mean_squared_error(y_train, y_train_predict, squared=False)
    val_errors.append(val_error)
    train_errors.append(train_error)

# extra code – this section generates and saves Figure 4–20
best_epoch = np.argmin(val_errors)
plt.figure(figsize=(6, 4))
plt.annotate('Best model',
             xy=(best_epoch, best_valid_rmse),
             xytext=(best_epoch, best_valid_rmse + 0.5),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.plot([0, n_epochs], [best_valid_rmse, best_valid_rmse], "k:", linewidth=2)
plt.plot(val_errors, "b-", linewidth=3, label="Validation set")
plt.plot(best_epoch, best_valid_rmse, "bo")
plt.plot(train_errors, "r--", linewidth=2, label="Training set")
plt.legend(loc="upper right")
plt.xlabel("Epoch")
plt.ylabel("RMSE")
plt.axis([0, n_epochs, 0, 3.5])
plt.grid()
plt.show()
out[13]

Ridge Regression Closed Form Prediction = [[4.91398185]]
Ridge Regression SGD Prediction = [4.88075379]
Lasso Reg Prediction = [4.85924079]
ElasticNet Prediction = [4.85795756]

Jupyter Notebook Image

<Figure size 600x400 with 1 Axes>

Logistic Regression

Some regression algorithms can be used for classification as well. Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (the positive class), or else it predicts that it does not. This makes it a binary classifier.

Estimating Probabilities

Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.

Logistic Regression Model Estimated Probability (Vectorized Form)

$$\hat{p} = h_{\bm{\theta}}(\textbf{x}) = \sigma\left( \textbf{x}^T \bm{\theta} \right)$$

The logistic - noted $\sigma(\cdot)$ - is a sigmoid function that outputs a number between 0 and 1.

Logistic Function

$$\sigma(t) = \cfrac{1}{1 + e^{-t}}$$

Logistic Function

Once the Logistic Regression model has estimated the probability $\hat{p}$, it makes its prediction $\hat{y}$ easily:

Logistic Model Prediction
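For reference, the prediction rule shown in the image above is:

$$\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases}$$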

Training and Cost Function

The objective of training is to set the parameter vector $\bm{\theta}$ so that the model estimates high probabilities for positive instances and low probabilities for negative instances. This idea is captured by the cost function below:

Cost Function of a Single Training Instance
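For reference, the single-instance cost shown in the image above is:

$$c(\bm{\theta}) = \begin{cases} -\log\left( \hat{p} \right) & \text{if } y = 1 \\ -\log\left( 1 - \hat{p} \right) & \text{if } y = 0 \end{cases}$$

This makes sense: $-\log(t)$ grows very large as $t$ approaches 0, so the cost is large when the model estimates a probability close to 0 for a positive instance (or close to 1 for a negative instance).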

The cost function over the whole training set is simply the average cost over all training instances. It can be written in a single expression called the log loss:

Log Loss
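For reference, the log loss shown in the image above is:

$$J(\bm{\theta}) = -\cfrac{1}{m} \sum_{i=1}^{m} \left[ y^i \log\left( \hat{p}^i \right) + \left( 1 - y^i \right) \log\left( 1 - \hat{p}^i \right) \right]$$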

The bad news is that there is no known closed-form equation to compute the value of $\bm{\theta}$ that minimizes this cost function. But the good news is that this cost function is convex, so Gradient Descent is guaranteed to find the global minimum. The partial derivative of the cost function with regard to the $j^{th}$ model parameter $\theta_j$ is given by:

Cost Function Partial
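For reference, the partial derivative shown in the image above is:

$$\cfrac{\partial}{\partial \theta_j} J(\bm{\theta}) = \cfrac{1}{m} \sum_{i=1}^{m} \left( \sigma\left( \bm{\theta}^T \textbf{x}^i \right) - y^i \right) x_j^i$$

Note how similar this looks to the MSE partial derivative from earlier: for each instance it computes the prediction error, multiplies it by the $j^{th}$ feature value, and then averages over all training instances.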

Once you have the gradient vector containing all the partial derivatives, you can use it in the Batch Gradient Descent algorithm.

Decision Boundaries

We will use the iris dataset to illustrate Logistic Regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.

Flowers of Iris

There is a decision boundary at around 1.6 cm where both probabilities are equal to 50%: if the petal width is greater than 1.6 cm, the classifier will predict that the flower is an Iris-Virginica, or else it will predict that it is not (even if it is not very confident). The dashed line in the image below represents the model's decision boundary. Note that this is a linear boundary. Each parallel line represents the points where the model outputs a specific probability, from 15% to 90%.

Estimated Probability Decision Boundary

Linear Decision Boundary

Softmax Regression

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression, or Multinomial Logistic Regression. The idea is quite simple: when given an instance $\textbf{x}$, the Softmax Regression model first computes the score $s_k(\textbf{x})$ for each class $k$, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.

Softmax Score for Class k

$$s_k(\textbf{x}) = \textbf{x}^T \bm{\theta}^k$$

Note that each class has its own dedicated parameter vector $\bm{\theta}^k$. All these vectors are typically stored as rows in a parameter matrix $\Theta$. Once you have computed the score of every class for the instance $\textbf{x}$, you can estimate the probability $\hat{p}_k$ that the instance belongs to class $k$ by running the scores through the softmax function: it computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials). The scores are generally called logits or log-odds.

Softmax Function
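For reference, the softmax function shown in the image above is:

$$\hat{p}_k = \sigma\left( \textbf{s}(\textbf{x}) \right)_k = \cfrac{\exp\left( s_k(\textbf{x}) \right)}{\sum_{j=1}^{K} \exp\left( s_j(\textbf{x}) \right)}$$

where $K$ is the number of classes and $\textbf{s}(\textbf{x})$ is the vector containing the scores of each class for the instance $\textbf{x}$.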

Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes, so minimizing the cross entropy should lead to a model that estimates a high probability for the target class.

Cross Entropy
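For reference, the cross entropy cost function shown in the image above is:

$$J(\Theta) = -\cfrac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^i \log\left( \hat{p}_k^i \right)$$

where $y_k^i$ is equal to 1 if the target class of the $i^{th}$ instance is $k$, and 0 otherwise. Note that when there are just two classes ($K = 2$), this is equivalent to the log loss above.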

Cross Entropy Gradient
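For reference, the gradient of the cross entropy with regard to the parameter vector of class $k$, shown in the image above, is:

$$\nabla_{\bm{\theta}^k} J(\Theta) = \cfrac{1}{m} \sum_{i=1}^{m} \left( \hat{p}_k^i - y_k^i \right) \textbf{x}^i$$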

Scikit-Learn's LogisticRegression uses one-versus-all by default when you train it on more than two classes, but you can set the multi_class hyperparameter to "multinomial" to switch it to Softmax Regression instead. You must also specify a solver that supports Softmax Regression, such as the "lbfgs" solver. It applies $\ell_2$ regularization by default.

[The image below] shows the resulting decision boundaries, represented by the background colors. Notice that the decision boundaries between any two classes are linear. The figure also shows the probabilities for the Iris-Versicolor class, represented by the curved lines (e.g., the line labeled with 0.450 represents the 45% probability boundary). Notice that the model can predict a class that has an estimated probability below 50%. For example, at the point where all the decision boundaries meet, all classes have an equal estimated probability of 33%.

Softmax Regression Decision Boundary

Cross Entropy

Cross entropy originated from information theory. It measures the average number of bits you actually send per option when you encode options using a code based on an estimated probability distribution instead of the true one.

from sklearn import datasets 
iris = datasets.load_iris()
print(list(iris.keys()))
X = iris["data"][:,3:] # petal width 
y = (iris["target"]==2).astype(np.int8) # 1 if Iris-Virginica, else 0
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X,y)
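
# Hypothetical extra step (not in the original notebook): visualize the
# estimated probabilities over petal widths from 0 to 3 cm and locate the
# ~1.6 cm decision boundary described above.
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]  # first width predicted positive
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")
plt.axvline(decision_boundary, color="k", linestyle=":", label="Decision boundary")
plt.xlabel("Petal width (cm)")
plt.ylabel("Probability")
plt.legend()
plt.show()
print("Decision boundary =", decision_boundary)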

## Softmax Regression

X = iris["data"][:,(2,3)] # petal length, petal width 
y = iris["target"]
softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10)
softmax_reg.fit(X, y)
softmax_reg.predict([[5, 2]])
softmax_reg.predict_proba([[5, 2]])
out[15]

['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']

array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])

out[16]