Hands-On Machine Learning with Scikit-Learn and TensorFlow, Chapter 6 - Decision Trees

I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped neural networks on that first read.

Decision Trees

Like SVMs, Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They are powerful algorithms, capable of fitting complex datasets. Decision Trees are also the fundamental components of Random Forests, which are among the most powerful Machine Learning algorithms available today.

Training and Visualizing a Decision Tree

The code below trains a DecisionTreeClassifier on the iris dataset. You can visualize the trained Decision Tree using the export_graphviz() function, then convert the outputted .dot file into a variety of formats such as PNG using the dot command-line tool from the graphviz package. The resulting Decision Tree looks like the image below:

Iris Decision Tree

Making Predictions

You start at the root node (depth 0), then move left or right depending on the petal length. You continue this process until you reach a leaf node (one that has no child nodes), which gives you a predicted class. A node's samples attribute counts how many training instances it applies to. For example, 100 training instances have a petal length greater than 2.45 cm, among which 54 have a petal width smaller than 1.75 cm. A node's value attribute tells you how many instances of each class the node applies to: for example, the bottom-right node above applies to 0 Iris-Setosa, 1 Iris-Versicolor, and 45 Iris-Virginica. Finally, a node's gini attribute measures its impurity: a node is "pure" (gini = 0) if all the training instances it applies to belong to the same class. Gini impurity is one example of an impurity measure:

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$$

where $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i^{\text{th}}$ node.
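For example, a node whose value is [0, 49, 5] (like the depth-2 left node discussed below) has a Gini impurity of about 0.168:

# Gini impurity of a node with value [0, 49, 5]
counts = [0, 49, 5]
total = sum(counts)
gini = 1 - sum((c / total) ** 2 for c in counts)
print(gini)  # ~0.168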

Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have exactly two children. However, other algorithms such as ID3 can produce Decision Trees with nodes that have more than two children.

[The image below] shows this Decision Tree's decision boundaries. The thick vertical line represents the decision boundary of the root node (depth 0): petal length = 2.45 cm. Since the left area is pure (only Iris-Setosa), it cannot be split any further. However, the right area is impure, so the depth-1 right node splits it at petal width = 1.75 cm (represented by the dashed line). Since max_depth was set to 2, the Decision Tree stops right there. However, if you set max_depth to 3, then the two depth-2 nodes would each add another decision boundary (represented by the dotted lines).

Decision Tree Boundary
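Reusing tree_clf, X, and y from the training code shown further below, a decision-boundary plot like this one can be sketched roughly as follows (the grid ranges and styling are my own choices, not the book's):

import numpy as np
import matplotlib.pyplot as plt

# evaluate the trained classifier on a grid over the petal length/width plane
x1s = np.linspace(0, 7.5, 200)
x2s = np.linspace(0, 3, 200)
x1, x2 = np.meshgrid(x1s, x2s)
y_pred = tree_clf.predict(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)

plt.contourf(x1, x2, y_pred, alpha=0.3)   # colored decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)  # training instances
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()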

Model Interpretation: White Box Versus Black Box

As you can see, Decision Trees are fairly intuitive, and their decisions are easy to interpret. Such models are often called white box models. In contrast, Random Forests and neural networks are generally considered black box models: they make great predictions, and you can easily check the calculations they performed to make these predictions; nevertheless, it is usually hard to explain in simple terms why the predictions were made.

Estimating Class Probabilities

A Decision Tree can also estimate the probability that an instance belongs to a particular class $k$: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class $k$ in this node. For example, if you find a flower whose petals are 5 cm long and 1.5 cm wide, the corresponding leaf node is the depth-2 left node, so the Decision Tree would output 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,y)

# You can visualize the Decision Tree by using the export_graphviz() function to output a graph definition file
from sklearn.tree import export_graphviz
export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
print("Predicting Probabilities =",tree_clf.predict_proba([[5,1.5]]))
out[2]

Predicting Probabilities = [[0. 0.90740741 0.09259259]]
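Calling predict() on the same flower returns the class with the highest estimated probability, which here is class 1 (Iris-Versicolor):

print("Predicted class =", tree_clf.predict([[5, 1.5]]))  # -> [1], i.e. Iris-Versicolor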

CART Training Algorithm

Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train Decision Trees (also called "growing" trees). The algorithm first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$. How does it choose $k$ and $t_k$? It searches for the pair $(k, t_k)$ that produces the purest subsets (weighted by their size). The cost function that the algorithm tries to minimize is:

$$J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}} \quad \text{where } \begin{cases} G_{\text{left/right}} \text{ measures the impurity of the left/right subset} \\ m_{\text{left/right}} \text{ is the number of instances in the left/right subset} \end{cases}$$

Once it has successfully split the training set in two, it splits the subsets using the same logic, and so on, recursively, until it reaches the maximum depth (defined by the max_depth hyperparameter) or cannot find a split that will reduce impurity. A few other hyperparameters control additional stopping conditions (min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes).

The CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each subsequent level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the optimal one. Finding the optimal tree is known to be an NP-Complete problem: it requires $O(\exp(m))$ time, making the problem intractable even for fairly small training sets.
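To make this split search concrete, here is a minimal brute-force sketch (my own code, not Scikit-Learn's actual implementation) of finding one node's best $(k, t_k)$ pair by minimizing the weighted Gini cost $J$ defined above:

import numpy as np

def gini(y):
    # Gini impurity of a set of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)

def best_split(X, y):
    # brute-force search for the (feature, threshold) pair minimizing
    # J = (m_left/m) * G_left + (m_right/m) * G_right
    m, n = X.shape
    best_k, best_t, best_J = None, None, float("inf")
    for k in range(n):
        for t_k in np.unique(X[:, k]):
            left = X[:, k] <= t_k
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            J = (left.sum() / m) * gini(y[left]) + (right.sum() / m) * gini(y[right])
            if J < best_J:
                best_k, best_t, best_J = k, t_k, J
    return best_k, best_t, best_J

On the iris petal features this recovers the root node's partition (all Iris-Setosa on one side), although the reported threshold differs from 2.45 cm because this sketch only tests observed feature values rather than midpoints between them.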

Computational Complexity

Decision Trees are generally approximately balanced, so traversing a Decision Tree requires going through roughly $O(\log_2(m))$ nodes. Since each node only requires checking the value of one feature, this is also the overall prediction complexity. The training algorithm compares all features (or fewer if max_features is set) on all samples at each node, which results in a training complexity of $O(n \times m \log(m))$. For small training sets, Scikit-Learn can speed up training by presorting the data (presort=True), but this slows down training on larger training sets.

Gini Impurity or Entropy

By default the Gini impurity measure is used, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to "entropy". In Shannon's information theory, entropy measures the average information content of a message: entropy is zero when all messages are identical. In Machine Learning, it is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class:

Entropy

$$H_i = - \sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k} \log_2\left(p_{i,k}\right)$$
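For the same node with value [0, 49, 5], the entropy works out to about 0.445 (versus a Gini impurity of about 0.168):

import numpy as np

# entropy of a node with value [0, 49, 5]
counts = np.array([0, 49, 5])
p = counts / counts.sum()
p = p[p > 0]                      # the sum skips classes with p_{i,k} = 0
entropy = -np.sum(p * np.log2(p))
print(entropy)  # ~0.445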

Most of the time, it does not make a big difference whether you use Gini impurity or entropy. Gini impurity is slightly faster to compute, and entropy tends to produce slightly more balanced trees.

Regularization hyperparameters

Decision Trees make very few assumptions about the training data. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it. Such a model is often called a nonparametric model, not because it does not have any parameters, but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. In contrast, a parametric model such as a linear model has a predetermined number of parameters, so its degrees of freedom are limited, reducing the risk of overfitting.

To avoid overfitting the training data, you need to restrict the Decision Tree's freedom during training. This is called regularization. The regularization hyperparameters depend on the model, but in general you can at least limit the maximum depth of the Decision Tree: reducing max_depth will regularize the model.

Other regularization hyperparameters for DecisionTreeClassifier:

  • min_samples_split: the minimum number of samples a node must have before it can be split.
  • min_samples_leaf: the minimum number of samples a leaf node must have.
  • min_weight_fraction_leaf: same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances.
  • max_leaf_nodes: the maximum number of leaf nodes.
  • max_features: the maximum number of features that are evaluated for splitting at each node.

Other algorithms work by training the Decision Tree without regularization, and then pruning (deleting) unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant.
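The pruning described here is based on statistical significance tests; Scikit-Learn itself (version 0.22 and later) offers a related mechanism, minimal cost-complexity pruning, through the ccp_alpha hyperparameter. A small sketch of how it is typically used (the alpha value here is arbitrary, not from the book):

from sklearn.tree import DecisionTreeClassifier

# grow the tree, then prune it back with cost-complexity pruning;
# larger ccp_alpha values prune more aggressively (0.0, the default, disables pruning)
pruned_clf = DecisionTreeClassifier(ccp_alpha=0.01)
pruned_clf.fit(X, y)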

The image below shows two Decision Trees trained on the moons dataset: the tree on the left is trained with the default hyperparameters (no regularization), while the one on the right is trained with min_samples_leaf=4.

Regularization Using Min Sample Leaf
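A sketch of how this comparison can be set up (the dataset size and noise level are my guesses at the figure's settings, not taken from the book's code):

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# noisy half-moons dataset
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

# left tree: no regularization at all; right tree: each leaf must cover at least 4 samples
deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)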

Regression

Decision Trees are also capable of performing regression tasks. The code below uses DecisionTreeRegressor, which produces the tree shown below. The tree looks similar to the classification tree, except that instead of predicting a class, it predicts a value. The prediction for each leaf node is simply the average target value of the training instances associated with that leaf node.

Decision Tree For Regression

The graphs below show predictions using DecisionTreeRegressor with different regularization hyperparameters (max_depth):

Predictions of Two Regression Models

The CART algorithm works mostly the same way as before, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.

CART Cost Function for Regression

$$J(k, t_k) = \frac{m_{\text{left}}}{m} \text{MSE}_{\text{left}} + \frac{m_{\text{right}}}{m} \text{MSE}_{\text{right}} \quad \text{where } \begin{cases} \text{MSE}_{\text{node}} = \sum_{i \in \text{node}} \left( \hat{y}_{\text{node}} - y^{(i)} \right)^2 \\ \hat{y}_{\text{node}} = \dfrac{1}{m_{\text{node}}} \sum_{i \in \text{node}} y^{(i)} \end{cases}$$
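To make the notation concrete, here is a small sketch (my own helper functions, not Scikit-Learn's implementation) of what this cost looks like for a single candidate split:

import numpy as np

def node_mse(y):
    # MSE_node: sum of squared deviations from the node's mean prediction y_hat
    y_hat = y.mean()
    return np.sum((y_hat - y) ** 2)

def regression_cost(X, y, k, t_k):
    # J(k, t_k) = (m_left/m) * MSE_left + (m_right/m) * MSE_right
    # (assumes the split leaves both subsets non-empty)
    left = X[:, k] <= t_k
    right = ~left
    m = len(y)
    return (left.sum() / m) * node_mse(y[left]) + (right.sum() / m) * node_mse(y[right])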

Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. See the image below.

Regularizing Decision Tree Regressor

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X,y)
out[6]

DecisionTreeRegressor(max_depth=2)
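Note that, as written above, X and y are still the iris petal features from earlier; the regression figures in the chapter use a noisy quadratic dataset instead. A sketch of that setup (the dataset size and noise level are my assumptions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# noisy quadratic training set for the regression figures
m = 200
X_quad = np.random.rand(m, 1)
y_quad = 4 * (X_quad - 0.5) ** 2 + np.random.randn(m, 1) / 10

# two regressors with different amounts of regularization (max_depth)
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg2 = DecisionTreeRegressor(max_depth=3)
tree_reg1.fit(X_quad, y_quad)
tree_reg2.fit(X_quad, y_quad)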

Instability

Decision Trees have a lot going for them: they are simple to understand and interpret, easy to use, versatile, and powerful. However, they do have some limitations:

  1. They love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. (See the image below: the Decision Tree fits the data easily in the picture on the left, but on the right, where the data has been rotated by 45 degrees, the decision boundary is unnecessarily convoluted and the model is unlikely to generalize well.) One way to limit this problem is to use PCA, which often results in a better orientation of the training data (see the sketch at the end of this section).

Sensitivity to Training Set Rotation

  2. They are sensitive to small variations in the training data. Random Forests can limit this instability by averaging predictions over many trees, as we will see in the next chapter.

Sensitivity to Training Set Details
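As mentioned in the first point above, rotating the data with PCA before fitting the tree is one way to mitigate this sensitivity. A minimal sketch of that idea, reusing the iris features X and y from earlier (the max_depth value is just an example):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# rotate/decorrelate the features with PCA, then fit the tree on the rotated data
pca_tree_clf = Pipeline([
    ("pca", PCA()),
    ("tree", DecisionTreeClassifier(max_depth=2)),
])
pca_tree_clf.fit(X, y)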