Hands-On Machine Learning with Scikit-Learn and TensorFlow Notes, Chapters 1-4

Here are some notes from trying to learn machine learning by working through "Hands-On Machine Learning with Scikit-Learn and TensorFlow". I took notes in Markdown and used Jupyter notebooks for chapters 1-3, but for the rest of the book I will probably just use Jupyter notebooks and create a place on this site where I can store them.

Hands-On Machine Learning with Scikit-Learn and TensorFlow Notes

Note: sometimes the $\theta$ symbols in equations should be bold even though they are not.

Chapter 1 - What is Machine Learning

Notes

Exercises

  1. How would you define Machine Learning?
  1. Can you name four types of problems where it shines?
  1. What is a labeled training set?
  1. What are the two most common supervised tasks?
  1. Can you name four common unsupervised tasks?
  1. What type of machine learning algorithm would you use to allow a robot to walk in unknown territories?
  1. What type of algorithm would you use to segment your customers into multiple groups?
  1. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
  1. What is an online learning system?
  1. What is out-of-core learning?
  1. What type of learning algorithm relies on a similarity measure to make predictions?
  1. What is the difference between a model parameter and a learning algorithm's hyperparameter?
  1. What do model based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
  1. Can you name four of the main challenges in Machine Learning?
  1. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
  1. What is a test set and why would you want to use it?
  1. What is the purpose of a validation set?
    - The validation set uses a subset of the training data to provide an unbiased evaluation of a model. It contrasts with the training and test sets in that it is used in an intermediate phase for choosing the best model and optimizing it; this is the phase where hyperparameter tuning occurs.
  1. What can go wrong if you tune hyperparameters using the test set?
  1. What is repeated cross-validation and why would you prefer it to using a single validation set?

Chapter 2 - End to End Machine Learning Project

Notes

$$\text{RMSE}(\textbf{X},h) = \sqrt{\frac{1}{m}\sum^{m}_{i=1}\left( h(\textbf{x}^{(i)})-y^{(i)}\right)^{2}}$$

$$\text{MAE}(\textbf{X},h) = \frac{1}{m} \sum^{m}_{i=1}\left|h\left(\textbf{x}^{(i)}\right)-y^{(i)}\right|$$
  1. Get rid of the rows with the missing values: `housing.dropna(subset=["total_bedrooms"])` (option 1)
  2. Get rid of the column with the missing attribute: `housing.drop("total_bedrooms", axis=1)` (option 2)
  3. Set the missing values to some value (zero, the mean, the median, etc.): `median = housing["total_bedrooms"].median(); housing["total_bedrooms"].fillna(median, inplace=True)` (option 3; see the sketch below)
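A minimal sketch of the three options, using a toy stand-in for the book's housing DataFrame (the column names and values here are made up for illustration); the `SimpleImputer` at the end is Scikit-Learn's way of applying option 3 to every numerical column at once:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the book's housing DataFrame (assumption: only the
# "total_bedrooms" column has missing values).
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan],
})

# Option 1: drop rows whose total_bedrooms value is missing.
option1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute (column).
option2 = housing.drop("total_bedrooms", axis=1)

# Option 3: fill missing values with the column's median.
median = housing["total_bedrooms"].median()
option3 = housing.copy()
option3["total_bedrooms"] = option3["total_bedrooms"].fillna(median)

# SimpleImputer applies option 3 to every numerical column and remembers
# the medians, so the test set can later be transformed the same way.
imputer = SimpleImputer(strategy="median")
housing_filled = imputer.fit_transform(housing)
```
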
Handling Text and Categorical Attributes

If a categorical attribute has a large number of possible categories (e.g., country code, profession, species, etc.), then one-hot encoding will result in a large number of input features. This may slow down training and degrade performance. If this happens, you may want to replace the categorical input with useful numerical features related to the categories: for example, you could replace the ocean_proximity feature with the distance to the ocean (similarly, a country code could be replaced with the country's population and GDP per capita). Alternatively, you could replace each category with a learnable, low-dimensional vector called an embedding. Each category's representation would be learned during training: this is an example of representation learning (see Chapter 13 and ??? for more details).
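A short sketch of the two standard encoders for a categorical column, using a toy column named after the book's ocean_proximity feature (the values here are just examples):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column standing in for the housing data's ocean_proximity.
housing_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND",
                                                "NEAR OCEAN", "<1H OCEAN"]})

# Ordinal encoding: one integer per category (implies an ordering between them).
ordinal_encoder = OrdinalEncoder()
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)

# One-hot encoding: one binary column per category (no implied ordering).
# This is what can blow up the feature count when there are many categories.
onehot_encoder = OneHotEncoder()
housing_cat_1hot = onehot_encoder.fit_transform(housing_cat)  # SciPy sparse matrix
print(onehot_encoder.categories_)
```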

Feature Scaling

With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. There are two ways to get all attributes to have the same scale:

  1. Min-max scaling: ensure each value ends up ranging from 0 to 1 by subtracting the minimum from each value and dividing by the range.

$$x_{i,\text{new}} = \frac{x_{i}-x_{\min}}{x_{\max}-x_{\min}}$$

  2. Standardization: subtract the mean of the series from each value and divide by the standard deviation of the series, so that the resulting distribution has zero mean and unit variance. Standardization does not bound values to a specific range, which might be a problem for some algorithms (e.g., neural networks often expect a value ranging from 0 to 1).

As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
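A minimal sketch of both scalers with Scikit-Learn on made-up data, following the fit-on-the-training-set-only rule from the paragraph above (the data and its shape are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data standing in for two numeric housing features with very different scales.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=[3.0, 500.0], scale=[1.0, 120.0], size=(100, 2))
X_test = rng.normal(loc=[3.0, 500.0], scale=[1.0, 120.0], size=(20, 2))

# Min-max scaling to [0, 1]: (x - min) / (max - min).
minmax = MinMaxScaler()
X_train_minmax = minmax.fit_transform(X_train)  # fit on the training set only
X_test_minmax = minmax.transform(X_test)        # reuse the training-set min/max

# Standardization: subtract the mean, divide by the standard deviation.
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)
```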

Select and Train A Model

You should save every model you experiment with, so you can come back easily to any model you want. Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to easily compare scores across model types, and compare the types of errors they make. You can easily save Scikit-Learn models by using Python’s pickle module, or using sklearn.externals.joblib, which is more efficient at serializing large NumPy arrays:
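A hedged sketch of saving and restoring a fitted model. In newer Scikit-Learn versions `sklearn.externals.joblib` is no longer available, so this uses the standalone `joblib` package; the throwaway `LinearRegression` model and file name are just placeholders:

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Train a throwaway model just to have something to save; any fitted
# Scikit-Learn estimator works the same way.
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=42)
model = LinearRegression().fit(X, y)

# Persist the fitted model (hyperparameters + learned parameters) to disk...
joblib.dump(model, "my_model.pkl")

# ...and load it back later to compare models or make predictions.
model_loaded = joblib.load("my_model.pkl")
print(model_loaded.predict(X[:5]))
```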

Fine Tune Your Model

Chapter 3 - Classification

$$\text{precision} = \frac{TP}{TP + FP}$$

where TP is the number of true positives and FP is the number of false positives.

$$\text{recall} = \frac{TP}{TP+FN}$$

where FN is the number of false negatives.

$$F_1 = \frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{FN+FP}{2}}$$

The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier's video selection).

One way to compare classifiers is to measure the area under the ROC curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Scikit-Learn provides a function to compute the ROC AUC:
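That function is `roc_auc_score` in `sklearn.metrics`; a tiny sketch with made-up labels and decision scores (in the book these scores would come from a real classifier):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground-truth labels and classifier decision scores, purely for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([-2.1, -0.3, 1.5, 0.2, -1.0, 2.3, 0.1, -0.2])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("ROC AUC:", roc_auc_score(y_true, y_scores))  # 1.0 = perfect, 0.5 = random
```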

Training Binary Classifiers:

  1. Choose the appropriate metric for a task
  2. Evaluate the classifiers using cross validation
  3. Select the precision/recall tradeoff that fits your needs
  4. Compare various models using ROC curves and ROC AUC scores
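A compact sketch of that workflow on synthetic data; the `SGDClassifier`, the class imbalance, and the 90% precision target are stand-ins for the book's MNIST "is it a 5?" example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced binary classification data.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)

# Step 2: evaluate with cross-validation, collecting decision scores.
clf = SGDClassifier(random_state=42)
y_scores = cross_val_predict(clf, X, y, cv=3, method="decision_function")

# Step 3: pick the lowest threshold that gives at least 90% precision.
precisions, recalls, thresholds = precision_recall_curve(y, y_scores)
threshold_90 = thresholds[np.argmax(precisions[:-1] >= 0.90)]
y_pred_90 = (y_scores >= threshold_90)

# Step 4: compare models with ROC AUC.
print("ROC AUC:", roc_auc_score(y, y_scores))
```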

Chapter 4 - Training Models

Linear Models

Linear Regression Model Prediction

$$\hat{y} = h_{\boldsymbol{\theta}}\left(\textbf{x}\right) = \boldsymbol{\theta} \cdot \textbf{x}$$
Equation 4-4: Normal Equation

$$\hat{\boldsymbol{\theta}} = \left(\textbf{X}^{T}\textbf{X}\right)^{-1}\textbf{X}^{T}\textbf{y}$$

where $\hat{\boldsymbol{\theta}}$ is the value of $\boldsymbol{\theta}$ that minimizes the cost function and $\textbf{y}$ is the vector of target values from $y^{(1)}$ to $y^{(m)}$.
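A quick check of the Normal Equation with NumPy on made-up linear data (the true parameters 4 and 3 are arbitrary choices for this toy example):

```python
import numpy as np

# Generate roughly linear data: y = 4 + 3*x + noise.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

# Add x0 = 1 to every instance so theta_0 acts as the bias term.
X_b = np.c_[np.ones((100, 1)), X]

# Normal Equation: theta_hat = (X^T X)^-1 X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)  # should end up close to [[4.], [3.]]
```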

Gradient Descent

Training a model means searching for a combination of model parameters that minimizes a cost function (over the training set); in other words, it is a search through the model's parameter space.
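A bare-bones Batch Gradient Descent sketch for Linear Regression on the same kind of toy data, assuming the MSE cost function (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

# Toy data: y = 4 + 3*x + noise.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add the bias feature x0 = 1

eta = 0.1            # learning rate
n_iterations = 1000
m = len(X_b)

theta = rng.normal(size=(2, 1))  # random initialization of the parameters
for _ in range(n_iterations):
    # Gradient of the MSE cost function over the whole training set.
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4.], [3.]]
```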

Polynomial Regression

Learning Curves

  1. Bias: This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quite quadratic. A high bias model is most likely to underfit the training data.
  2. Variance: This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.
  3. Irreducible Error: This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove the outliers)

Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a tradeoff.
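A sketch of how learning curves can help spot these cases: train the model on increasingly large subsets of the training set and plot the training and validation RMSE. The quadratic toy data is an assumption; a plain LinearRegression on it should plateau at a fairly high error, i.e., high bias:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    """Plot training and validation RMSE as the training set grows."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                      random_state=42)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="validation")
    plt.xlabel("training set size")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()

# Toy quadratic data; a straight line cannot fit it well (high bias).
rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.normal(size=(200, 1))
plot_learning_curves(LinearRegression(), X, y)
```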

Regularized Linear Models

$$\alpha \sum_{i=1}^{n} \theta_{i}^{2}$$

is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.

So when should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features' weights down to zero, as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
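A side-by-side sketch of the four estimators on toy data; the `alpha` and `l1_ratio` values are arbitrary and would normally be tuned:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Toy linear data just to show the APIs side by side.
rng = np.random.default_rng(42)
X = 3 * rng.random((100, 2))
y = 1 + 2 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

models = {
    "plain":   LinearRegression(),
    "ridge":   Ridge(alpha=1.0),                       # l2 penalty
    "lasso":   Lasso(alpha=0.1),                       # l1 penalty, can zero out weights
    "elastic": ElasticNet(alpha=0.1, l1_ratio=0.5),    # mix of l1 and l2
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_)
```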

Logistic Regression

Logistic Regression model estimated probability (vectorized form):

$$\hat{p} = h_{\boldsymbol{\theta}}(\textbf{x}) = \sigma\left(\textbf{x}^{T}\boldsymbol{\theta}\right)$$

The logistic, noted $\sigma(\cdot)$, is a sigmoid function that outputs a number between 0 and 1.

$$\sigma(t) = \frac{1}{1+\exp(-t)}$$
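A small sketch of the sigmoid plus Scikit-Learn's `LogisticRegression` on the iris petal-width setup used in the book's example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def sigmoid(t):
    """The logistic function: squashes any real number into (0, 1)."""
    return 1 / (1 + np.exp(-t))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # roughly [0.12, 0.5, 0.88]

# Classify iris flowers as Virginica or not, using petal width only.
iris = load_iris()
X = iris.data[:, 3:]                 # petal width (cm)
y = (iris.target == 2).astype(int)   # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict_proba([[1.7], [1.5]]))
```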

Exercises

  1. What Linear Regression training algorithm can you use if you have a training set with millions of features?
  1. Suppose the features in your training set have very different scales. What algorithms might suffer from this, and how? What can you do about it?
  1. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?
  1. Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?
  1. Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice the validation error consistently goes up, what is likely going on? How can you fix this?
  1. Is it a good idea to stop Mini-batch gradient descent immediately when the validation error goes up?
  1. Which Gradient Descent algorithm, among those discussed, will reach the vicinity of the optimal solution fastest? Which will actually converge? How can you make others converge as well?
  1. Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?
  1. Suppose you are using Polynomial Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter $\alpha$ or reduce it?
  1. Why would you want to use: Ridge Regression instead of plain Linear Regression (i.e., without any regularization)? Lasso instead of Ridge Regression? Elastic Net instead of Lasso?
  1. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?