Hands-On Machine Learning with Scikit-Learn and TensorFlow Notes, Chapters 1-4
Here are some notes from trying to learn machine learning by going through "Hands-On Machine Learning with Scikit-Learn and TensorFlow". I took notes in markdown and used Jupyter notebooks for chapters 1-3, but for the rest of the book I will probably just use Jupyter notebooks and build a place on this site to store them.
Note: sometimes the θ symbols in equations should be bold even though they are not.
Chapter 1 - What is Machine Learning
Notes
- Machine learning is the science and art of programming computers so that they can learn from data
- Why Use Machine Learning?
- For projects that change over time
- For projects where traditional approaches have no known algorithm or that are too complex for traditional approaches
- Machine learning can help humans learn
- Types of Machine Learning Systems Categorized By:
- Whether or not they are trained with human supervision
- Whether or not they can learn incrementally on the fly (online vs batch learning)
- Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do
- Supervised vs Unsupervised Learning
Supervised Learning
- The training data that you feed the algorithm includes the desired solutions - called labels
- A typical supervised learning task is classification
- Another typical task is to predict a target numeric value, such as the price of a car, given a set of features called predictors.
- Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class
- Important Learning Algorithms:
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural Networks
Unsupervised Learning
- In unsupervised learning, the training data is unlabeled
- The system tries to learn without a teacher
- Clustering
- K-Means
- DBSCAN
- Hierarchical Cluster Analysis (HCA)
- Anomaly Detection and Novelty Detection
- One-class SVM
- Isolation Forest
- Visualization and Dimensionality Reduction: good examples of unsupervised learning algorithms; you feed them a lot of complex, unlabeled data and they output a 2D or 3D representation of your data that can easily be plotted
- Principal component analysis (PCA)
- Kernel PCA
- Locally-Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association Rule Learning
- Apriori
- Eclat
Semi-supervised Learning
- Algorithms that can deal with partially labeled training data, usually a lot of unlabeled training data and a little bit of labeled data. This is called semi-supervised learning.
Reinforcement Learning
- Reinforcement learning is a different beast. The learning system, called an agent in this context, can observe an environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must learn by itself over time. A policy defines what action the agent should choose when it is in a given situation
- Batch vs Online Learning
- Batch learning
- The system is incapable of learning incrementally: it must be trained using all the available data. Then the system is used in production without learning any more. This is called offline learning.
- Online Learning
- You train the system incrementally by feeding it data instances sequentially or individually or by small groups called mini batches. Each learning step is fast and cheap, so the whole system can learn about new data on the fly, as it arrives.
- Instance Based vs Model Based Learning
- Instance-Based Learning: the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples using a similarity measure
- Model-Based Learning: another way to generalize from a set of examples is to build a model of these examples and then use the model to make predictions. This is called model-based learning (see the sketch below contrasting it with instance-based learning).
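To make the contrast concrete, here is a minimal sketch (with made-up toy data) that fits an instance-based learner and a model-based learner to the same examples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Made-up toy data: one feature, four training instances
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.9])

# Model-based: fits parameters (a slope and an intercept) to the data
lin_reg = LinearRegression().fit(X, y)

# Instance-based: memorizes the examples, predicts from the nearest neighbors
knn_reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)

print(lin_reg.predict([[2.5]]), knn_reg.predict([[2.5]]))
```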
- What A Typical Machine Learning Project Looks Like:
- You studied the data
- You selected the model
- You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function)
- Finally, you applied the model to make predictions on new cases (this is called inference), hoping this model will generalize well
- Main Challenges of Machine Learning
- Bad Data
- Insufficient Quantity of Training Data
- Nonrepresentative Training Data
- Poor Quality Data
- Irrelevant Features Included in the Model
- Bad Model
- Over fitting the training data
- Under fitting the Training Data
- Bad Data
- Testing and Validation
- Hyperparameter Tuning and Model Selection
Exercises
- How would you define Machine Learning?
- Machine learning is the science and art of programming computers so that they can learn from data
- Can you name four types of problems where it shines.
- For problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better
- Complex problems for which there is no good solution at all using the traditional approach: the best machine learning techniques can find a solution
- Fluctuating environments: a machine learning system can adapt to new data
- Getting insights about complex problems and large amounts of data
- What is a labeled training set?
- A set of data used to train a ML model that includes the desired solution
- What are the two most common supervised tasks?
- Classification and regression
- Can you name four common unsupervised tasks?
- Clustering
- Anomaly Detection and Novelty Detection
- Visualization and Dimensionality Reduction
- Association Rule Learning
- What type of machine learning algorithm would you use to allow a robot to walk in unknown territories?
- A reinforcement learning algorithm
- What type of algorithm you use to segment your customers into multiple groups?
- A clustering algorithm
- Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
- Supervised learning, because you use data that the users have already labeled as spam
- What is an online learning system?
- A learning system where you train the system incrementally by feeding it data instances sequentially or by small groups called minibatches.
- What is out of core learning?
- Technique in machine learning that allows for the processing of data that cannot fit into a computer's main memory
- What type of learning algorithm relies on a similarity measure to make predictions?
- Instance based learning
- What is the difference between a model parameter and a learning algorithm's hyperparameter?
- A hyperparameter is a parameter of a learning algorithm, not of the model
- What do model based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
- Model-based learning algorithms search for optimal values of the model parameters such that the model generalizes well to new instances. The most common strategy is to train the system to minimize a cost function that measures how bad the model is at making predictions. They make predictions by feeding the new instance's features into the model's prediction function, using the learned parameter values.
- Can you name four of the main challenges in Machine Learning?
- Bad Data (Insufficient Quantity, nonrepresentative data, poor quality data) and bad algorithms/models (irrelevant features, over fitting, under fitting)
- If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
- If the model performs great on the training data but generalizes poorly to new instances, it has overfit the training data. To solve this, we can do any of the following: get more data, use a simpler model, or reduce the noise in the existing data set (e.g., remove outliers and fix data errors).
- What is a test set and why would you want to use it?
- A data set used to provide an unbiased evaluation of a final model fit on the training data set.
- What is the purpose of a validation set?
- The validation set uses a subset of the training data to provide an unbiased evaluation of a model. The validation set contrasts with the training and test sets in that it is an intermediate phase used for choosing the best model and optimizing it. It is in this phase that hyperparameter tuning occurs.
- What can go wrong if you tune hyperparameters using the test set?
- The hyperparameters may be tuned so that the model performs well on that particular set, but it is unlikely that this performance will extend to new data
- What is repeated cross validation and why would you prefer it to using a single validation set?
- It involves splitting the data into multiple folds, with each fold used as a validation set once, and the remaining folds used as a training set. This process is repeated multiple times, with different random splits each time, and the results are averaged to give a more robust estimate of the model's performance.
Chapter 2 - End to End Machine Learning Project
Notes
In this chapter, you will go through an example project end to end as a data scientist, following these steps:
- Look at the big picture
- Get the data
- Discover and visualize the data to gain insights
- Prepare the data for machine learning algorithms
- Select the model and train it
- Fine-tune your model
- Present the solution
- Launch, monitor, and maintain your system
Build a model of housing prices in California using the California Census Data.
Your model should learn from the data and be able to predict the median housing price in any district, given all other metrics.
A sequence of data processing components is called a data pipeline. Pipelines are very common in machine learning systems, since there is a lot of data to manipulate and many data transforms to apply
Multiple Regression: the system will use multiple factors to make a prediction
Frame the Problem: This problem is a supervised learning task because the data is labeled. It is a regression task since you are asked to predict a value, and a univariate regression problem since we are only trying to predict a single value for each district. There is no continuous flow of data coming into the system, no need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.
Select a Performance Measure: A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for larger errors.
- If there are many outliers in your data, you might consider using the Mean Absolute Error as the performance measure instead:
- When you load data from a CSV file with pandas and the type shows object, you know it must be a text attribute
- Working with preprocessed attributes is common in Machine Learning (it helps to visualize the data before processing it)
- Creating a test set is theoretically quite simple: just pick some instances randomly, typically 20% of the dataset (or less if your dataset is large), and set them aside
- Purely randomized sampling of data is sometimes not enough - you want the data to be representative of the whole population
- Since the US population is 51.3% female and 48.7% male, a well-conducted survey in the US would try to maintain this ratio in the sample (51.3% female and 48.7% male)
- This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the entire population.
- Use pandas.cut when you want to transform a continuous variable into a categorical variable, which might help when trying to get a sample representative of the population
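As a concrete sketch, the snippet below buckets a median_income column with pandas.cut and then uses Scikit-Learn's StratifiedShuffleSplit to sample proportionally from each stratum. The column name and bin edges follow the book's housing example, but the data here is made up:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the housing data (values are made up)
housing = pd.DataFrame({"median_income": np.random.uniform(0.5, 10.0, 1000)})

# Bucket the continuous income into 5 categories for stratification
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
```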
Split the Test Set and Train Set ensuring that you get Sets Representative of the Population
Discover and Visualize the Data to Gain Insights - Look for correlations between data
- The correlation coefficient only measures linear correlations. It might completely miss out on nonlinear relationships
- The correlation coefficient tells you nothing about the slope
Prepare the Data for the Machine Learning Algorithm - Instead of transforming data manually, you should write functions to transform the data
Data Cleaning Options:
- Get rid of the rows with the missing values:
```python
housing.dropna(subset=["total_bedrooms"])  # option 1
```
- Get rid of the whole column:
```python
housing.drop("total_bedrooms", axis=1)  # option 2
```
- Set the missing values to some value (zero, the mean, the median, etc.):
```python
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)  # option 3
```
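For option 3, Scikit-Learn also offers SimpleImputer, which learns the replacement values with fit and applies them with transform. A minimal sketch on a made-up numeric frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with a missing value (data is made up)
housing_num = pd.DataFrame({"total_bedrooms": [120.0, np.nan, 85.0],
                            "households": [100.0, 90.0, 80.0]})

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)  # NaNs replaced by each column's median
```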
Handling Text and Categorical Attributes
- You can use OrdinalEncoder and OneHotEncoder
- One issue with the ordinal encoder is that it will assume that two nearby values are more similar than two distant values
- OneHotEncoder creates one binary attribute per category. This is called one-hot encoding because only one attribute will be equal to one (hot) while the others will be zero (cold). The new attributes are sometimes called dummy attributes
If a categorical attribute has a large number of possible categories (e.g., country code, profession, species, etc.), then one-hot encoding will result in a large number of input features. This may slow down training and degrade performance. If this happens, you may want to replace the categorical input with useful numerical features related to the categories: for example, you could replace the ocean_proximity feature with the distance to the ocean (similarly, a country code could be replaced with the country's population and GDP per capita). Alternatively, you could replace each category with a learnable low-dimensional vector called an embedding. Each category's representation would be learned during training: this is an example of representation learning (see Chapter 13 for more details).
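A quick sketch of the two encoders on a made-up ocean_proximity column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (values are made up)
housing_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

ordinal = OrdinalEncoder().fit_transform(housing_cat)  # one integer per category
onehot = OneHotEncoder().fit_transform(housing_cat)    # sparse one-hot matrix
print(ordinal)
print(onehot.toarray())
```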
- The more you automate the data preparation steps, the more combinations you can automatically try out, making it much more likely that you will find a great combination (and saving you a lot of time)
Feature Scaling
With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. There are two ways to get all attributes to have the same scale:
- min-max scaling
- Min-max scaling just ensures that each value ends up ranging from 0 to 1. We do this by subtracting the min from each value and dividing by the range: $x' = \frac{x - \min}{\max - \min}$
- standardization
- Subtract the mean of the series from each value and divide by the standard deviation of the series so that the resulting distribution has unit variance: $x' = \frac{x - \mu}{\sigma}$
- Standardization does not bound values to a specific range, which might be a problem for some algorithms (e.g., neural networks often expect a value ranging from 0 to 1)
As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
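A minimal sketch of that fit-on-training-only discipline, using made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [5.0], [10.0]])  # made-up training data
X_test = np.array([[2.0], [12.0]])

# Fit on the training set only, then transform both sets
scaler = StandardScaler().fit(X_train)      # or MinMaxScaler() for 0-1 scaling
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test values can land outside the
                                            # range seen during training
```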
Select and Train A Model
You should save every model you experiment with, so you can come back easily to any model you want. Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to easily compare scores across model types, and compare the types of errors they make. You can easily save Scikit-Learn models by using Python’s pickle module, or using sklearn.externals.joblib, which is more efficient at serializing large NumPy arrays:
- The point of this step is to try out several different models and shortlist the ones that work best (3-5). As mentioned above, it is best to save your models so that you don't have to re-run them
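Note that sklearn.externals.joblib has been removed from recent Scikit-Learn versions; the standalone joblib package does the same job. A minimal save/load sketch with made-up data:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(20, 2)  # made-up data
y = X @ np.array([2.0, -1.0])
model = LinearRegression().fit(X, y)

joblib.dump(model, "my_model.pkl")          # saves hyperparameters + trained parameters
model_loaded = joblib.load("my_model.pkl")  # reload later for comparison
```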
Fine Tune Your Model
- Tune hyperparameters using GridSearchCV and RandomizedSearchCV
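A minimal GridSearchCV sketch on made-up data (the parameter grid here is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(100, 8)  # made-up prepared training data
y_train = np.random.rand(100)

param_grid = [{"n_estimators": [10, 30], "max_features": [2, 4, 8]}]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)  # tries every combination with 5-fold CV
print(grid_search.best_params_)
```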
Deploy Your Model - maintain, update data, retrain, and the rest
Chapter 3 - Classification
- We will be using the MNIST dataset in this chapter to learn about classification.
- Scikit-Learn provides helper functions to download popular datasets through the sklearn.datasets module; this includes MNIST (e.g., fetch_openml('mnist_784'))
- Each image is a 28x28 array, so each image has 784 features, each of which is a number from 0 (white) to 255 (black)
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g., differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning.
- Accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets, i.e., when some classes are much more frequent than others
- A better way to evaluate a classifier is to look at the confusion matrix: count the number of times instances of class A are classified as class B
- The accuracy of the positive predictions is called the precision of the classifier: $\text{precision} = \frac{TP}{TP + FP}$, where TP is the number of true positives and FP the number of false positives
- Precision is typically used along with another metric named recall, also called the true positive rate (TPR): the ratio of positive instances that are correctly detected by the classifier, $\text{recall} = \frac{TP}{TP + FN}$, where FN is the number of false negatives
- It is often useful to combine precision and recall into a single score called the $F_1$ score, which is the harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. Where the regular mean treats all values equally, the harmonic mean gives more weight to lower values.
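A small sketch computing the confusion matrix, precision, recall, and F1 with sklearn.metrics on made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # made-up labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```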
The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier's video selection).
- How does the SGDClassifier make its classification decisions?
- For each instance, it computes a score based on a decision function, and if that score is greater than a threshold, it assigns the new instance to the positive class; otherwise it assigns it to the negative class.
- Raising the threshold generally increases precision but decreases recall, and vice versa
- The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision vs recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate. The False Positive Rate (FPR) is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The True Negative Rate is also called specificity, so the ROC curve plots sensitivity (recall) versus 1 - specificity.
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Scikit-Learn provides a function to compute the ROC AUC:
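A minimal sketch with made-up labels and decision scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]             # made-up labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # made-up decision scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(roc_auc_score(y_true, y_scores))  # 1.0 = perfect, 0.5 = purely random
```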
- As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise.
Training Binary Classifiers:
- Choose the appropriate metric for a task
- Evaluate the classifiers using cross validation
- Select the precision/recall tradeoff that fits your needs
- Compare various models using ROC curves and ROC AUC scores
- Whereas binary classifiers distinguish between two classes, multiclass classifiers can distinguish between more than two classes
- One strategy for multiclass classification is to train ten binary classifiers (one for each digit); then, when you want to classify an image, you get the decision score from each classifier for that image and select the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest, OvR)
- Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N x (N-1) / 2 classifiers. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish.
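Scikit-Learn exposes both strategies through wrapper classes. The sketch below uses the small built-in 8x8 digits dataset as a stand-in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_digits(return_X_y=True)  # small 8x8 digits, a stand-in for MNIST

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_), len(ovo.estimators_))  # 10 vs 10*9/2 = 45 classifiers
```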
Chapter 4 - Training Models
- So far in this book, we have implemented models without knowing how they actually work
- Knowing how things actually work can help you choose appropriate models, choose the right training algorithm, and choose a good set of hyperparameters, and perform error analysis more efficiently.
- Linear models can be trained using a direct "closed-form" equation that computes the model parameters that best fit the model to the training set (the model parameters that minimize the cost function over the training set)
- Using an iterative optimization approach, called Gradient Descent (GD), that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method.
- Polynomial Regression is a more complex model that can fit nonlinear training sets. Since the model has more parameters than Linear Regression, it is more prone to overfitting the training data, so we will look at how to detect this.
- Finally, we will look at two more models that are commonly used for classification tasks: Logistic Regression and Softmax Regression.
Linear Models
Linear Regression Model Prediction
- A linear model makes a prediction by simply computing a weighted sum of the input features plus a constant called the bias term (also called the intercept term)
Linear Regression Model Prediction Vectorized Form
$\hat{y} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta} \cdot \mathbf{x}$

where:
- $\boldsymbol{\theta}$ is the model's parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$
- $\mathbf{x}$ is the instance's feature vector, containing $x_0$ to $x_n$, with $x_0$ always equal to 1
- $\boldsymbol{\theta} \cdot \mathbf{x}$ is the dot product of the vectors $\boldsymbol{\theta}$ and $\mathbf{x}$, which is equal to $\sum_{i=0}^{n} \theta_i x_i$
- $h_{\boldsymbol{\theta}}$ is the hypothesis function, using the model parameters $\boldsymbol{\theta}$
Training a model means setting its parameters so that the model best fits the training set. For this purpose, we need a measure of how well the model fits the training data.
The most common performance measure for regression is the Root Mean Square Error, so we look for the value of the vector $\boldsymbol{\theta}$ that minimizes the RMSE.
To find the value of $\boldsymbol{\theta}$ that minimizes the cost function, there is a closed-form solution, a mathematical equation that gives the result directly, called the Normal Equation:

$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$

where $\hat{\boldsymbol{\theta}}$ is the value of $\boldsymbol{\theta}$ that minimizes the cost function and $\mathbf{y}$ is the vector of target values from $y^{(1)}$ to $y^{(m)}$
- Computing the parameter vector this way has a computational complexity of roughly $O(n^{2.4})$ to $O(n^3)$ with respect to the number of features $n$, so it can get very slow when the number of features grows large (e.g., > 100,000)
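A minimal NumPy sketch of the Normal Equation on made-up linear data:

```python
import numpy as np

# Made-up linear data: y = 4 + 3x + Gaussian noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to every instance
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # the Normal Equation
print(theta_hat)  # should be close to [[4.], [3.]]
```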
Gradient Descent
- Gradient descent is better suited for cases where there are a large number of features, or too many training instances to fit in memory
- Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize cost function
- Gradient Descent measures the local gradient of the error function with regard to the parameter vector $\boldsymbol{\theta}$, and it goes in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum. Concretely, you start by filling $\boldsymbol{\theta}$ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.
An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time. If the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
- The cost function may also have holes, ridges, plateaus, local minima, and other irregular terrain, which makes finding the global minimum difficult.
- The Mean Squared Error cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.
Training a model means searching for a combination of model parameters that minimizes a cost function (over the training set). It is a search of the model's parameter space
- To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter. In other words, you need to calculate how much the cost function will change if you change one model parameter just a little bit; this is called a partial derivative.
- Gradient Descent scales well with the number of features, but Batch Gradient Descent is very slow on large training sets
- The Batch Gradient Descent formula involves calculations over the full training set $\mathbf{X}$ at every Gradient Descent step
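A minimal NumPy sketch of Batch Gradient Descent on the same kind of made-up linear data, computing the MSE gradient over the full training set at each step:

```python
import numpy as np

# Same made-up linear data as in the Normal Equation sketch
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

eta = 0.1            # learning rate
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)  # random initialization
for _ in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # MSE gradient over the full set
    theta = theta - eta * gradients
print(theta)  # converges toward the Normal Equation solution
```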
- Stochastic Gradient Descent just picks a random instance in the training set at every step and calculates the gradient based only on that single instance. This makes the algorithm faster since it has very little data to manipulate at every iteration.
- When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.
- Stochastic Gradient Descent's randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum.
- One solution to this dilemma is to gradually decrease the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is akin to simulated annealing.
- The function that determines the learning rate at each iteration is called the learning schedule
- Each round of $m$ iterations (one full pass over the training set) in Stochastic Gradient Descent is called an epoch
- Mini-batch Gradient Descent computes gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
Polynomial Regression
- If your data is more complex than a simple straight line, you can add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression.
- Note that when there are multiple features, Polynomial regression is capable of finding relationships between features.
This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features $a$ and $b$, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2b$, and $ab^2$.
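A quick sketch confirming this behavior on a single made-up instance:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

a_b = np.array([[2.0, 3.0]])  # one instance with features a=2, b=3
poly = PolynomialFeatures(degree=3, include_bias=False)
print(poly.fit_transform(a_b))
# Columns: a, b, a^2, ab, b^2, a^3, a^2*b, a*b^2, b^3
```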
- If a model performs well on the training data but generalizes poorly according to cross validation metrics, then the model is overfitting
Learning Curves
- Another way to tell whether your model is overfitting or underfitting the data is to look at the learning curves: plots of the model's performance on the training set and the validation set as a function of the training set size.
- Look at how the learning curves for overfitting and underfitting models differ
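A sketch of plotting learning curves, along the lines of the book's approach (the data is made up):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])          # train on growing subsets
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="training set")
    plt.plot(np.sqrt(val_errors), "b-", label="validation set")
    plt.xlabel("training set size"); plt.ylabel("RMSE"); plt.legend()

X = 2 * np.random.rand(100, 1)  # made-up linear data
y = 4 + 3 * X.ravel() + np.random.randn(100)
plot_learning_curves(LinearRegression(), X, y)
plt.show()
```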
- An important theoretical result of statistics and Machine Learning is the fact that a model's generalization error can be expressed as the sum of three very different errors:
- Bias: This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quite quadratic. A high bias model is most likely to underfit the training data.
- Variance: This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.
- Irreducible Error: This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove the outliers)
Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a tradeoff.
Regularized Linear Models
- A good way to reduce overfitting is to regularize the model: the fewer degrees of freedom it has, the harder it will be for it to overfit the data.
- For a polynomial model, this typically involves reducing the number of polynomial degrees
- For linear models, regularization is typically achieved by constraining the weights of the model
- Ridge Regression is a regularized version of Linear Regression: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.
The hyperparameter $\alpha$ controls how much you want to regularize the model. If $\alpha = 0$, then Ridge Regression is just like Linear Regression. If $\alpha$ is very large, then all weights end up very close to 0 and the result is a flat line going through the data's mean.
It is important to scale the data before performing Ridge Regression; this is true of most regularized models
Least Absolute Shrinkage and Selection Operator Regression (also called Lasso Regression) is another regularized version of Linear Regression. Just like Ridge Regression, it adds a regularization term to the cost function, but it uses the $\ell_1$ norm of the weight vector instead of half the square of the $\ell_2$ norm.
An important feature of Lasso Regression is that it tends to completely eliminate the weights of the least important features - set them to zero. In other words, Lasso regression performs feature selection and outputs a sparse model - a model with few nonzero feature weights.
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio r.
So when should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features' weights down to zero, as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
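A minimal sketch of all three regularized models on made-up data where one feature is useless; note how Lasso tends to zero out that feature's weight:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Made-up data where the second feature contributes nothing
X = np.random.rand(50, 3)
y = X @ np.array([1.5, 0.0, -2.0]) + 0.1 * np.random.randn(50)

ridge = Ridge(alpha=1.0).fit(X, y)                       # l2 penalty: small weights
lasso = Lasso(alpha=0.1).fit(X, y)                       # l1 penalty: sparse weights
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix controlled by l1_ratio
print(ridge.coef_, lasso.coef_, elastic.coef_)
```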
- A different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.
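A minimal early-stopping sketch using SGDRegressor with warm_start=True so each fit call continues training, keeping a copy of the model with the lowest validation error (the data is made up and the hyperparameters are illustrative):

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = 2 * np.random.rand(200, 1)  # made-up linear data
y = 4 + 3 * X.ravel() + np.random.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# warm_start=True makes each fit() continue from where it left off
sgd = SGDRegressor(max_iter=1, warm_start=True, penalty=None,
                   learning_rate="constant", eta0=0.0005, tol=None)

best_val_error, best_model = float("inf"), None
for epoch in range(1000):
    sgd.fit(X_train, y_train)  # one more epoch of training
    val_error = mean_squared_error(y_val, sgd.predict(X_val))
    if val_error < best_val_error:  # keep the model with the lowest
        best_val_error = val_error  # validation error seen so far
        best_model = deepcopy(sgd)
```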
Logistic Regression
- Some regression algorithms can be used for classification as well. Logistic Regression is commonly used to estimate the probability that an instance belongs to a particular class.
- Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of the result. The logistic, noted σ(·), is a sigmoid function that outputs a number between 0 and 1.
There is no closed-form equation to compute the value of theta that minimizes the cost function, but the cost function is convex, so Gradient Descent is guaranteed to find the global minimum.
Logistic Regression can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression or Multinomial Logistic Regression.
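With the default lbfgs solver, Scikit-Learn's LogisticRegression trains a Softmax Regression model when there are more than two classes. A minimal sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # three classes, so a multiclass problem

# With the default lbfgs solver, multiclass problems are handled
# with Softmax (multinomial) Logistic Regression out of the box
softmax_reg = LogisticRegression(max_iter=1000).fit(X, y)
print(softmax_reg.predict_proba(X[:1]))  # per-class probabilities, summing to 1
```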
Exercises
- What Linear Regression training algorithm can you use if you have a training set with millions of features?
- You would want to use Gradient Descent (Batch, Mini-batch, or Stochastic), because it scales much better with respect to the number of features than the Normal Equation approach.
- Suppose the features in your training set have very different scales. What algorithms might suffer from this, and how? What can you do about it?
- If the features in your training set have very different scales, the Gradient Descent algorithms will take a long time to converge. To solve this, you should scale the data before training the model. Note that the Normal Equation will work just fine without scaling.
- Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?
- Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex
(If you draw a straight line between any two points on the curve, the line never crosses the curve.)
- Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?
- If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead, they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.
- Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice the validation error consistently goes up, what is likely going on? How can you fix this?
- If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
- Is it a good idea to stop Mini-batch gradient descent immediately when the validation error goes up?
- Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to save the model at regular intervals, and when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best saved model.
- Which Gradient Descent algorithm, among those discussed, will reach the vicinity of the optimal solution fastest? Which will actually converge? How can you make others converge as well?
- Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
- Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?
- If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model, for example by adding an $\ell_2$ penalty (Ridge) or an $\ell_1$ penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.
- Suppose you are using Polynomial Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
- If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has high bias. You should try reducing the regularization hyperparameter α.
- Why would you want to use:
- Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
- A model with some regularization typically performs better than a model without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression (Moreover, the Normal Equation requires computing the inverse of a matrix, but that matrix is not always invertible. In contrast, the matrix for Ridge regression is always invertible.)
- Lasso instead of Ridge Regression?
- Lasso Regression uses an $\ell_1$ penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for the most important weights. This is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge Regression.
- Elastic Net instead of Lasso?
- Elastic Net is generally preferred over Lasso since Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances). However, it does add an extra hyperparameter to tune. If you just want Lasso without the erratic behavior, you can use Elastic Net with an l1_ratio close to 1.
- Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?
- If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.