Hands On Machine Learning Chapter 7 - Ensemble Learning and Random Forests

I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel I got a good grasp of machine learning the first time through, and I skipped neural networks on that first read.

Ensemble Learning and Random Forests

The wisdom of the crowd - if you aggregate the answers of a random group of people, you will often find that it is better than the expert's answer. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method. An ensemble of Decision Trees where you predict the class that gets the most votes from the predictions of the trees is called a Random Forest. Despite its simplicity, this is one of the most powerful Machine Learning algorithms available today. The winning solutions in Machine Learning competitions often involve Ensemble methods.

Voting classifiers

Suppose you have trained a few classifiers, each achieving about an 80% accuracy.

Training Diverse Classifiers

A very simple way to create a better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.

Hard Voting Classifier Prediction

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning that the classifier only does slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse. This is due to the law of large numbers: if you keep flipping a coin that has a 51% chance of landing on heads, the ratio of heads gets closer and closer to that 51% probability the more you flip it.

The Law of Large Numbers
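
To get a feel for why this works, here is a quick back-of-the-envelope check of the coin analogy (my own sketch, not from the book). It assumes the 1,000 voters are fully independent, an assumption real ensembles never satisfy:

# Probability that a majority vote of 1,000 independent classifiers, each correct
# 51% of the time, is correct: P(more than 500 correct votes out of 1,000)
from scipy.stats import binom
print(binom.sf(500, 1000, 0.51))  # roughly 0.73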

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy. The code below creates and trains a voting classifier in Scikit-Learn, composed of three diverse classifiers. As seen in the output, the voting classifier significantly outperforms all the individual classifiers.

If all the classifiers are able to predict probabilities (they have the predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities. (This is not the case for SVC by default; you need to set its probability hyperparameter to True.)

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
# Hard voting: predict the class that gets the most votes
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting="hard")
voting_clf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
print("Hard Voting\n-----------------------------------------------")
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

# Soft voting: average the predicted class probabilities (SVC needs probability=True)
svm_clf_soft = SVC(probability=True)
log_clf_soft = LogisticRegression()
rnd_clf_soft = RandomForestClassifier()
voting_clf_soft = VotingClassifier(
    estimators=[('lr', log_clf_soft), ('rf', rnd_clf_soft), ('svc', svm_clf_soft)],
    voting="soft")
print("\nSoft Voting\n-----------------------------------------------")
for clf in (log_clf_soft, rnd_clf_soft, svm_clf_soft, voting_clf_soft):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
out[2]

Hard Voting
-----------------------------------------------
LogisticRegression 0.864
RandomForestClassifier 0.888
SVC 0.896
VotingClassifier 0.904

Soft Voting
-----------------------------------------------
LogisticRegression 0.864
RandomForestClassifier 0.888
SVC 0.896
VotingClassifier 0.92

Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms; another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting. Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.

Pasting/Bagging Training Set Sampling and Training

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set. Bagging and pasting are such popular methods because they can be run simultaneously on different cores: they scale well.

Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor class for regression). If you want to use pasting, set the bootstrap keyword argument to False. The n_jobs keyword argument tells Scikit-Learn the number of CPU cores to use (-1 means all available cores). The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities.
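
For example, a pasting ensemble would look something like the sketch below (my own illustration, reusing the moons training set from the earlier cells); it mirrors the bagging ensemble trained later in this notebook, except that bootstrap is set to False:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Pasting: 500 trees, each trained on 100 instances sampled WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1)
paste_clf.fit(X_train, y_train)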

The image below compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees, both trained on the moons dataset. The ensemble's predictions generalize much better than the single Decision Tree's predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular).

Single Decision Tree Versus Bagging Ensemble 500 Trees

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best.

Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default, a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set. This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors. Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor. In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training. The oob decision function for each training instance is also available through the oob_decision_function_ variable.

Random Patches and Random Subspaces

The BaggingClassifier class also supports sampling the features. This is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling. Thus, each predictor will be trained on a random subset of the input features. This is particularly useful when you are dealing with high-dimensional inputs (such as images). Sampling both training instances and features is called the Random Patches method. Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method.
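
As a rough sketch (my own, with illustrative hyperparameter values), a Random Subspaces ensemble could be configured like this. Feature sampling only really pays off with many features, so on the two-feature moons data this is purely for illustration:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Random Subspaces: keep all training instances, sample a random subset of features per predictor
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    n_jobs=-1)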

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Bagging ensemble: 500 trees, each trained on 100 instances sampled with replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=1, oob_score=True)
bag_clf.fit(X_train,y_train)
y_pred = bag_clf.predict(X_test)
print("Out of Bag Score (The Classifier is likely to achieve about this accuracy on the test set) =",bag_clf.oob_score_)

from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
print("Accuracy Score =",accuracy_score(y_test,y_pred))
out[4]

Out of Bag Score (The Classifier is likely to achieve about this accuracy on the test set) = 0.9253333333333333
Accuracy Score = 0.92
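
The oob decision function mentioned above can also be inspected directly; a quick sketch reusing the bag_clf trained in the cell above:

# Class probabilities estimated from out-of-bag predictions for the first 5 training instances
# (one row per training instance, one column per class)
print(bag_clf.oob_decision_function_[:5])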

Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. You can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees (similarly, there is a RandomForestRegressor for regression tasks). The code below trains a Random Forest classifier with 500 trees, each limited to 16 leaf nodes, using all available CPU cores. With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model. The BaggingClassifier below is roughly equivalent to a RandomForestClassifier.

Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. A forest of such extremely random trees is called an Extremely Randomized Trees ensemble (Extra-Trees for short). Once again, this trades more bias for lower variance, and it also makes Extra-Trees much faster to train, since searching for the best threshold at every node is one of the most time-consuming parts of growing a tree. You can create an Extra-Trees classifier using Scikit-Learn's ExtraTreesClassifier class. It's hard to tell in advance which will perform better (ExtraTreesClassifier or RandomForestClassifier) - the only way to know is to try them both.
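
A minimal sketch (mine, with the same illustrative settings as the Random Forest trained below) of an Extra-Trees classifier on the moons split from earlier:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
# Extra-Trees: random feature subsets AND random split thresholds at each node
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
ext_clf.fit(X_train, y_train)
print(ext_clf.__class__.__name__, accuracy_score(y_test, ext_clf.predict(X_test)))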

Feature Importance

Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it. Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the result using the feature_importances_ variable. See the code below for an example. The image below comes from training a Random Forest classifier on the MNIST dataset and plotting each pixel's importance:

MNIST Pixel Importance
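
Something along the lines of the sketch below (not the book's exact code) could produce that image: train a Random Forest on MNIST and reshape feature_importances_ into a 28x28 heat map.

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
mnist_clf.fit(mnist.data, mnist.target)
# Each of the 784 feature importances corresponds to one pixel
plt.imshow(mnist_clf.feature_importances_.reshape(28, 28), cmap='hot')
plt.axis('off')
plt.colorbar()
plt.show()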

# Train Random Forest Classifier with 500 Trees, Each Limited to 16 Nodes, Using all CPU Cores
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

# Bagging Classifier Equivalent to Random Forest Classifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name,score)
out[6]

sepal length (cm) 0.09118288240939618
sepal width (cm) 0.02281136315230843
petal length (cm) 0.4421182667960137
petal width (cm) 0.4438874876422817

Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost. For example (shown in the picture below), to build an AdaBoost classifier, a first base classifier is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased, a second classifier is trained using the updated weights, and so on...

AdaBoost Sequential Training with Instance Weight Updates

The image below shows the decision boundaries of five consecutive predictors on the moons dataset. The first classifier gets many instances wrong, so their weights get boosted, and so on. The plot on the right has its learning rate halved (the weights of misclassified instances are boosted half as much at every iteration). AdaBoost is similar to Gradient Descent, except that instead of tweaking a single predictor's parameters, it adds predictors to the ensemble, gradually making it better. Once all predictors are trained, the ensemble makes predictions much like bagging or pasting, except that the predictors have different weights depending on their overall accuracy on the weighted training set. One important drawback of this sequential learning technique is that it cannot be parallelized (or only partially). As a result, it does not scale as well as bagging or pasting. The AdaBoost algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found. Scikit-Learn uses a multiclass version of AdaBoost called SAMME, which stands for Stagewise Additive Modeling using a Multiclass Exponential Loss Function.

Decision Boundaries of Consecutive Predictors

See the code below for an example of an AdaBoost classifier based on 200 Decision Stumps using Scikit-Learn's AdaBoostClassifier class (there is also an AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1 - a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class. If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator.

from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

Gradient Boosting

Another very popular boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. Instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor. See the code below for a regression example using Decision Trees as the base predictors; this technique is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT).

The image below shows the predictions of the three trees in the left column and the ensemble's predictions in the right column. On the right, you can see that the ensemble's predictions are equal to the sum of the predictions of the trees trained so far. A simpler way to train GBRT ensembles is to use Scikit-Learn's GradientBoostingRegressor class. It has hyperparameters to control the growth of the trees (such as max_depth), as well as hyperparameters to control the ensemble training (such as n_estimators). The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage.

Gradient Boosting
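
As a sketch of shrinkage (my own illustrative values, reusing the X and y from the earlier cells), a lower learning rate simply gets paired with more trees:

from sklearn.ensemble import GradientBoostingRegressor
# Shrinkage: scale each tree's contribution down, compensate with more trees
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1)
gbrt_slow.fit(X, y)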

The image below shows two GBRT ensembles trained with a low learning rate: the one on the left does not have enough trees to fit the training set, while the one on the right has too many trees and overfits the training set.

GBRT ensembles with not enough predictors (left) and too many (right)

In order to find the optimal number of trees, you can use early stopping. A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training. The code below (# Optimal Trees) trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees. The validation errors are represented on the left of the image below, and the model's best predictions are represented on the right.

Tuning the Number of Trees Using Early Stopping

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees and then looking back to find the optimal number). You can do this by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The code below (# Early Stopping) stops training when the validation error does not improve for five iterations in a row. The subsample hyperparameter can be used to specify the fraction of training instances to use for each tree. This can speed up training considerably and is called Stochastic Gradient Boosting.
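
For instance (an illustrative sketch, not from the book's code), Stochastic Gradient Boosting with each tree seeing a random 25% of the training instances:

from sklearn.ensemble import GradientBoostingRegressor
# subsample=0.25: each tree is trained on a random 25% of the training instances
sgb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=120, subsample=0.25)
sgb_reg.fit(X_train, y_train)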

XGBoost

An optimized implementation of Gradient Boosting is available in the popular Python library XGBoost, which stands for Extreme Gradient Boosting. XGBoost is often an important component of the winning entries in ML competitions. See the example code below.

# Gradient Boosted Regression Trees Manual
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)
# Train the second tree on the residual errors of the first
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)
# Train the third tree on the residual errors of the second
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)
# This ensemble containing three trees can make predictions on a new instance by adding up the predictions of all the trees
# y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

# Gradient Boosted Regression Trees with GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X,y)

# Optimal Trees
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)
# staged_predict() yields the ensemble's predictions after 1, 2, ..., 120 trees
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1  # +1 because stage i uses i+1 trees
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

# Early Stopping
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)  # warm_start=True keeps the existing trees
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping
    
import xgboost
xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)
xgb_reg.fit(X_train, y_train,eval_set=[(X_val, y_val)])
y_pred = xgb_reg.predict(X_val)
out[8]

[0] validation_0-rmse:0.40028
[1] validation_0-rmse:0.35054
[2] validation_0-rmse:0.32707
[3] validation_0-rmse:0.31359
[4] validation_0-rmse:0.31298
[5] validation_0-rmse:0.31117
[6] validation_0-rmse:0.31294
[7] validation_0-rmse:0.31429
[8] validation_0-rmse:0.31454
[9] validation_0-rmse:0.31578
[10] validation_0-rmse:0.31673
[11] validation_0-rmse:0.31883
[12] validation_0-rmse:0.31939
[13] validation_0-rmse:0.31936
[14] validation_0-rmse:0.31931
[15] validation_0-rmse:0.31941
[16] validation_0-rmse:0.31979
[17] validation_0-rmse:0.32028
[18] validation_0-rmse:0.32042
[19] validation_0-rmse:0.32297
[20] validation_0-rmse:0.32366
[21] validation_0-rmse:0.32464
[22] validation_0-rmse:0.32556
[23] validation_0-rmse:0.32586
[24] validation_0-rmse:0.32858
[25] validation_0-rmse:0.32901
[26] validation_0-rmse:0.32973
[27] validation_0-rmse:0.33042
[28] validation_0-rmse:0.33287
[29] validation_0-rmse:0.33289
[30] validation_0-rmse:0.33331
[31] validation_0-rmse:0.33447
[32] validation_0-rmse:0.33492
[33] validation_0-rmse:0.33474
[34] validation_0-rmse:0.33489
[35] validation_0-rmse:0.33460
[36] validation_0-rmse:0.33486
[37] validation_0-rmse:0.33517
[38] validation_0-rmse:0.33533
[39] validation_0-rmse:0.33575
[40] validation_0-rmse:0.33569
[41] validation_0-rmse:0.33661
[42] validation_0-rmse:0.33653
[43] validation_0-rmse:0.33676
[44] validation_0-rmse:0.33713
[45] validation_0-rmse:0.33728
[46] validation_0-rmse:0.33757
[47] validation_0-rmse:0.33780
[48] validation_0-rmse:0.33835
[49] validation_0-rmse:0.33855
[50] validation_0-rmse:0.33874
[51] validation_0-rmse:0.33891
[52] validation_0-rmse:0.33905
[53] validation_0-rmse:0.33906
[54] validation_0-rmse:0.33916
[55] validation_0-rmse:0.33934
[56] validation_0-rmse:0.33975
[57] validation_0-rmse:0.33980
[58] validation_0-rmse:0.33989
[59] validation_0-rmse:0.33992
[60] validation_0-rmse:0.33993
[61] validation_0-rmse:0.34025
[62] validation_0-rmse:0.34032
[63] validation_0-rmse:0.34081
[64] validation_0-rmse:0.34070
[65] validation_0-rmse:0.34063
[66] validation_0-rmse:0.34066
[67] validation_0-rmse:0.34067
[68] validation_0-rmse:0.34074
[69] validation_0-rmse:0.34102
[70] validation_0-rmse:0.34097
[71] validation_0-rmse:0.34097
[72] validation_0-rmse:0.34098
[73] validation_0-rmse:0.34105
[74] validation_0-rmse:0.34109
[75] validation_0-rmse:0.34119
[76] validation_0-rmse:0.34119
[77] validation_0-rmse:0.34137
[78] validation_0-rmse:0.34142
[79] validation_0-rmse:0.34143
[80] validation_0-rmse:0.34143
[81] validation_0-rmse:0.34148
[82] validation_0-rmse:0.34146
[83] validation_0-rmse:0.34145
[84] validation_0-rmse:0.34148
[85] validation_0-rmse:0.34150
[86] validation_0-rmse:0.34147
[87] validation_0-rmse:0.34145
[88] validation_0-rmse:0.34160
[89] validation_0-rmse:0.34175
[90] validation_0-rmse:0.34181
[91] validation_0-rmse:0.34177
[92] validation_0-rmse:0.34173
[93] validation_0-rmse:0.34178
[94] validation_0-rmse:0.34189
[95] validation_0-rmse:0.34197
[96] validation_0-rmse:0.34205
[97] validation_0-rmse:0.34201
[98] validation_0-rmse:0.34209
[99] validation_0-rmse:0.34209

Stacking

The last Ensemble method to discuss is stacking (short for stacked generalization). It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why not train a model to perform this aggregation? The image below shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).

Aggregating Predictions Using a Blending Predictor

To train the blender, a common approach is to use a hold-out set. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer.

Training the First Layer

Next, the first layer predictors are used to make predictions on the second (held-out) set. Now, for each instance in the hold-out set, there are three predicted values. We can create a new training set using these predicted values as input features and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer's predictions.

Training the Blender

Predictions in Multilayer Stacking Ensemble

Unfortunately, Scikit-Learn does not support blending directly.
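
Below is a rough sketch of how the hold-out blending described above could be rolled by hand (my own illustration; the layer-one models and the Logistic Regression blender are arbitrary choices). Note that recent versions of Scikit-Learn (0.22 and later) do ship StackingClassifier and StackingRegressor classes, which use cross-validation rather than a single hold-out set.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, random_state=42)
# Split the training set: subset one trains the first layer, the hold-out set trains the blender
X_sub1, X_hold, y_sub1, y_hold = train_test_split(X_train_full, y_train_full, random_state=42)

layer_one = [RandomForestClassifier(random_state=42),
             ExtraTreesClassifier(random_state=42),
             SVC(random_state=42)]
for clf in layer_one:
    clf.fit(X_sub1, y_sub1)

# Each hold-out instance becomes a row of first-layer predictions (the blender's input features)
X_blend = np.column_stack([clf.predict(X_hold) for clf in layer_one])
blender = LogisticRegression()
blender.fit(X_blend, y_hold)

# To predict on new data: run the first layer, then feed its outputs to the blender
X_test_blend = np.column_stack([clf.predict(X_test) for clf in layer_one])
print(accuracy_score(y_test, blender.predict(X_test_blend)))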