Training Models Exercises Answers
This chapter covers several gradient descent algorithms and several linear regression algorithms. It also discusses polynomial regression and regularization.
Question 1
What Linear Regression training algorithm can you use if you have a training set with millions of features?
Use Stochastic Gradient Descent or Mini-batch Gradient Descent (or Batch Gradient Descent if the training set fits in memory). Gradient Descent scales well with the number of features because each step only moves the weights a little in the direction of steepest descent of the loss function, so the cost of a step grows roughly linearly with the number of features. The Normal Equation and the SVD approach, by contrast, become very slow as the number of features grows, since their computational cost increases much faster than linearly with the number of features.
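As a rough illustration, here is a minimal NumPy sketch of Stochastic Gradient Descent for Linear Regression. The function name `sgd_linear_regression`, the constant learning rate, and the small noiseless synthetic dataset are assumptions made for this example; they are not part of the exercise.

```python
import numpy as np

def sgd_linear_regression(X, y, n_epochs=50, eta=0.01, seed=0):
    """Fit y ~ X @ theta with SGD; each step looks at a single instance."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for idx in rng.permutation(m):
            xi, yi = X[idx], y[idx]
            # Gradient of the squared error on one instance: O(n) work per step
            gradient = 2 * xi * (xi @ theta - yi)
            theta -= eta * gradient
    return theta

# Hypothetical noiseless data, so the true weights can be recovered exactly
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 10))
true_theta = rng.standard_normal(10)
y = X @ true_theta

theta = sgd_linear_regression(X, y)
print(np.max(np.abs(theta - true_theta)))  # largest weight error
```

No step ever builds or inverts a features-by-features matrix, which is what makes this family of algorithms practical when the number of features is very large.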
Question 2
Suppose the features in your training set have very different scales. What algorithms might suffer from this, and how? What can you do about it?
Gradient Descent algorithms will suffer from this: when the features have very different scales, the cost function has the shape of an elongated bowl, so Gradient Descent takes a long time to converge. Regularized models such as Ridge Regression are also affected, because regularization penalizes large weights, so features with smaller values tend to be ignored compared to features with larger values. The fix is to scale the features before training, e.g. with StandardScaler() or MinMaxScaler(). Note that Normalizer() rescales each sample (row) rather than each feature (column), so it does not address this problem.
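The effect of scaling can be sketched with plain NumPy. The helper `standardize` below is a hypothetical name; it applies the same zero-mean, unit-variance transformation that StandardScaler uses by default, and the two-feature dataset is made up for the example.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

rng = np.random.default_rng(0)
# Two made-up features on very different scales
X = np.c_[rng.uniform(0, 1, 100), rng.uniform(0, 10_000, 100)]
X_scaled, mean, std = standardize(X)

# The condition number of X^T X hints at how elongated the cost "bowl" is
cond_raw = np.linalg.cond(X.T @ X)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)
print(cond_raw, cond_scaled)
```

The `mean` and `std` computed on the training set should be reused to transform the validation and test sets, rather than being recomputed on them.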
Question 3
Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?
Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model - the cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough).
Question 4
Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?
Not necessarily. If the optimization problem is convex (as with Linear Regression or Logistic Regression) and the learning rate is not too high, all Gradient Descent algorithms will approach the global minimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic and Mini-batch Gradient Descent never truly converge: they keep jumping around the global minimum, so the resulting models will differ slightly. If the cost function has local minima, Gradient Descent can get stuck in one of them. Stochastic Gradient Descent has a better chance of escaping because it picks a random instance at every step and computes the gradients based only on that single instance, which makes its path irregular, whereas Batch Gradient Descent uses the whole training set to compute the gradients at every step.
Question 5
Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?
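The learning rate is likely too high, which makes the parameters at each step worse than the parameters at the previous step, so the algorithm diverges. If the training error also goes up, this is clearly the problem, and you can fix it by decreasing the learning rate. If the training error is not going up, the model is instead overfitting the training set and you should stop training.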
Figure: the learning rate being too high.
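A toy example makes the divergence concrete. The sketch below minimizes f(θ) = θ² with a fixed learning rate; the function name `gradient_descent` and the two learning rates are assumptions chosen for illustration.

```python
def gradient_descent(eta, theta=1.0, n_steps=20):
    """Minimize f(theta) = theta**2 with a fixed learning rate eta."""
    for _ in range(n_steps):
        gradient = 2 * theta           # derivative of theta**2
        theta = theta - eta * gradient
    return theta

small = gradient_descent(eta=0.1)  # each step multiplies theta by 0.8
large = gradient_descent(eta=1.1)  # each step multiplies theta by -1.2

print(abs(small), abs(large))
```

With the small learning rate the parameter shrinks toward the minimum at 0; with the large one it overshoots further on every step and its magnitude keeps growing.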
Question 6
Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?
Not immediately. With Stochastic and Mini-batch Gradient Descent, the error curves are not as smooth as with Batch Gradient Descent, so a single increase in the validation error does not tell you whether you have reached the minimum. One solution is to stop only after the validation error has been above its minimum for some time (when you are confident that the model will not do any better), then roll back the model parameters to the point where the validation error was at its minimum.
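The "wait for some time, then roll back" rule can be sketched as follows. The function name `early_stopping_epoch`, the `patience` parameter, and the list of validation errors are hypothetical and only serve the illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation errors.

    Training stops once the error has not improved on its best value for
    `patience` consecutive epochs; the caller then rolls back to best_epoch.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch

# Made-up noisy validation curve: it goes up at epoch 2, then improves again
val_errors = [1.00, 0.80, 0.85, 0.70, 0.72, 0.71, 0.74, 0.90]
stop_epoch, best_epoch = early_stopping_epoch(val_errors, patience=3)
print(stop_epoch, best_epoch)
```

Stopping at the first increase (epoch 2) would have kept the parameters from epoch 1, with an error of 0.80, and missed the better model at epoch 3.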
Question 7
Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?
Stochastic Gradient Descent has the fastest training iterations, since it considers only one instance at a time, so it will generally reach the vicinity of the global minimum first (Mini-batch Gradient Descent with a very small batch size is close behind). Given enough training time, only Batch Gradient Descent will actually converge. You can make the other algorithms converge as well by gradually reducing the learning rate. The steps start out large (which helps the algorithm make progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is akin to simulated annealing. The function that determines the learning rate at each iteration is called the learning schedule.
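One simple learning schedule can be sketched like this. The functional form t0 / (t + t1) and the values of the hypothetical hyperparameters `t0` and `t1` are assumptions for the example; many other schedules are possible.

```python
def learning_schedule(t, t0=5.0, t1=50.0):
    """Learning rate at iteration t: starts at t0 / t1 and decays toward 0."""
    return t0 / (t + t1)

# Learning rate at the first iteration and at two later iterations
rates = [learning_schedule(t) for t in (0, 50, 450)]
print(rates)
```

If the learning rate is reduced too quickly, the algorithm may get stuck before reaching the minimum; if it is reduced too slowly, it may keep jumping around the minimum for a long time.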
Question 8
Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?
If the training error is small and the validation error is much larger, the model is most likely overfitting the training set. Ways to solve this:
- Choose a lower-degree polynomial. A model with fewer degrees of freedom is less able to overfit the data.
- Regularize the model by constraining the weights of the model (or using Ridge regularization).
- You could remove some features that you don't think matter (or use Lasso regularization).
The image below shows a Polynomial regression model that has Ridge regularization with varying values for α (the regularization hyperparameter).
The image below shows a Polynomial regression model that has Lasso regularization with varying values for α (the regularization hyperparameter).
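The effect of α on a Ridge-regularized polynomial model can be sketched with the closed-form Ridge solution. The function name `ridge_fit`, the synthetic quadratic data, the polynomial degree, and the α values are all assumptions made for this example; Lasso has no closed form and is not covered by this sketch.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form Ridge solution; the bias (first column of X) is not penalized."""
    A = np.eye(X.shape[1])
    A[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + alpha * A, X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 0.3, 30)  # made-up noisy quadratic data

degree = 8
# Columns: 1, x, x**2, ..., x**degree
X_poly = np.vander(x, degree + 1, increasing=True)

norms = []
for alpha in (1e-3, 1.0, 100.0):
    theta = ridge_fit(X_poly, y, alpha)
    norms.append(np.linalg.norm(theta[1:]))  # size of the penalized weights
print(norms)
```

As α grows, the penalized weights shrink toward zero, which flattens the fitted curve and reduces the model's ability to overfit.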
Question 9
Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
High bias - if both the validation and training error are high then it is likely that the model is underfitting the data. You should decrease the regularization parameter, and if that doesn't work, you should choose a more complex model or get better features.
Question 10
Why would you want to use:
- Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
- You should use Ridge Regression instead of Linear Regression if your model is overfitting the training data, i.e. generalizing poorly to new instances. Ridge regularization constrains the parameter weights, which makes it harder for the model to overfit the data.
- Lasso instead of Ridge Regression?
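- Use Lasso when you have a lot of features and you suspect that only some of them are actually useful. Lasso tends to push the weights of the least important features all the way to zero, so it effectively performs feature selection.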
- Elastic Net instead of Lasso?
- Elastic Net is generally preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated (multicollinearity).
Question 11
Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?
You should implement two Logistic Regression classifiers. The Softmax Regression classifier predicts only one class at a time, so "it should be used only with mutually exclusive classes such as different types of plants. You cannot use it to recognize multiple people [or multiple features of an image] in one picture."
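The difference can be seen by feeding the same scores to two independent sigmoid outputs and to a softmax. The scores below are made up for one hypothetical picture, and the helper names `sigmoid` and `softmax` are only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exps = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exps / exps.sum()

# Hypothetical scores for one picture: [outdoor, daytime]
scores = np.array([2.0, 1.5])

independent = sigmoid(scores)  # two separate binary decisions
exclusive = softmax(scores)    # probabilities forced to sum to 1

print(independent)
print(exclusive)
```

With two sigmoid outputs, both "outdoor" and "daytime" can be probable at the same time; with softmax, raising one probability necessarily lowers the other, which is wrong for labels that are not mutually exclusive.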
Question 12
Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn).
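import numpy as np

class BatchGradientDescent:
    """
    Softmax Regression trained with Batch Gradient Descent and early stopping.
    X has shape (m, n); y holds integer class labels 0..K-1 with shape (m,).
    """
    def __init__(self, train_val_split=0.3, learning_rate=0.001, n_iter=1000, random_state=None):
        self.train_val_split = train_val_split
        self.learning_rate = learning_rate
        self.n_iter = n_iter
        self.random_state = random_state  # seed for shuffling and initialization
        self.theta = None
        self.last_error = np.inf

    @staticmethod
    def _add_bias(X):
        # Prepend a column of ones for the bias term
        return np.c_[np.ones((X.shape[0], 1)), X]

    @staticmethod
    def _softmax(logits):
        # Subtract the row-wise max for numerical stability
        exps = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exps / exps.sum(axis=1, keepdims=True)

    @staticmethod
    def _cross_entropy(Y, proba):
        # Mean cross-entropy between one-hot targets Y and predicted probabilities
        return -np.mean(np.sum(Y * np.log(proba + 1e-7), axis=1))

    def batch_grad_desc_early_stopping(self, X_train, X_val, Y_train, Y_val):
        new_theta = self.theta
        for _ in range(self.n_iter):
            # Gradient of the cross-entropy cost over the whole training set
            proba = self._softmax(X_train @ new_theta)
            gradients = X_train.T @ (proba - Y_train) / X_train.shape[0]
            new_theta = new_theta - self.learning_rate * gradients
            val_error = self._cross_entropy(Y_val, self._softmax(X_val @ new_theta))
            if val_error < self.last_error:
                self.last_error = val_error
                self.theta = new_theta  # keep the best parameters seen so far
            else:
                return  # validation error went up: stop early

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        X = self._add_bias(np.asarray(X, dtype=float))
        y = np.asarray(y)
        Y = np.eye(int(y.max()) + 1)[y]  # one-hot encode the labels
        # Shuffle, then hold out a fraction of the instances for validation
        idx = rng.permutation(X.shape[0])
        idx_split = int(X.shape[0] * self.train_val_split)
        val_idx, train_idx = idx[:idx_split], idx[idx_split:]
        # Random initialization of theta, shape (n + 1, K)
        self.theta = rng.standard_normal((X.shape[1], Y.shape[1]))
        self.last_error = np.inf
        self.batch_grad_desc_early_stopping(X[train_idx], X[val_idx], Y[train_idx], Y[val_idx])
        return self

    def predict_proba(self, X):
        return self._softmax(self._add_bias(np.asarray(X, dtype=float)) @ self.theta)

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)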