Hands On Machine Learning Chapter 3 - Classification
I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow because I don't feel that I got a good grasp of machine learning the first time through, and I skipped the neural network chapters entirely on that first read.
Classification
We now turn our attention to classification systems.
MNIST
In this chapter, we will be using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.
Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. Datasets loaded by Scikit-Learn generally have a similar dictionary structure including:
- A DESCR key describing the dataset
- A data key containing an array with one row per instance and one column per feature
- A target key containing an array with the labels
There are 70,000 images, and each image has 784 features. This is because each image is 28x28 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black).
Training a Binary Classifier
A binary classifier can distinguish between two classes. A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class. The SGDClassifier relies on randomness during training. If you want reproducible results, you should set the random_state parameter.
Performance Measures
Evaluating a classifier is significantly trickier than evaluating a regressor, so we will spend a large part of this chapter on this topic.
If you look at the code below, the classifier that we trained with SGDClassifier had an average accuracy of approximately 96%, while the classifier that we trained to always return False had an accuracy of about 90%. (This is for the binary classifier testing whether a digit represents a '5' or not, and we measured performance using 3-fold cross-validation with scoring="accuracy".) This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (when some classes are much more frequent than others).
A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. [...] To compute the confusion matrix, you first need a set of predictions; you can get them with the cross_val_predict() function.
The cross_val_predict() function works like cross_val_score() in that it performs K-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means that you get a clean prediction for each instance in the training set ("clean" means that the prediction is made by a model that never saw the data during training).
Each row in a confusion matrix represents an actual class, while each column represents a predicted class. A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal. The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier:
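precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives.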
Precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): this is the ratio of positive instances that are correctly detected by the classifier:
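recall = TP / (TP + FN), where FN is the number of false negatives.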
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784',version=1,parser='auto')
mnist.keys()
X, y = mnist["data"], mnist["target"]
print("X Shape = ",X.shape,"; y Shape = ",y.shape,sep="")
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
some_digit = X.loc[0].values
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image, cmap=mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
print("The Image Looks like a 5, and the corresponding label is",y.loc[0])
y = y.astype(np.uint8)
# Creating the train set and test set. MNIST is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)
X_train, X_test, y_train, y_test = X.iloc[:60000], X.iloc[60000:], y.iloc[:60000], y.iloc[60000:]  # use iloc: label-based .loc slicing is inclusive, which would put row 60000 in both sets
print("Training a Binary Classifier\n-----------------------------")
y_train_5 = (y_train ==5)
y_test_5 = (y_test==5)
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train,y_train_5)
sgd_clf.predict([some_digit])
from sklearn.model_selection import cross_val_score
from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
never_5_clf = Never5Classifier()
print("Stochastic Gradient Descent Classifier -> Average Accuracy according to 3-Fold Cross Validation:",np.average(cross_val_score(sgd_clf,X_train,y_train_5,cv=3,scoring="accuracy")))
print("Never 5 Classifier -> Average Accuracy according to 3-Fold Cross Validation:",np.average(cross_val_score(never_5_clf,X_train,y_train_5,cv=3,scoring="accuracy")))
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_train_5, y_train_pred))
print("The first row in the matrix above considers non-5 images (the negative class). The second row considers images of 5s (positive class).")
y_train_perfect_predictions = y_train_5 # pretend we reached perfection
print(confusion_matrix(y_train_5,y_train_perfect_predictions))
Scikit-Learn provides several functions to compute classifier metrics, including precision and recall (precision_score and recall_score). It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall. The harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high. To compute the F1 score, simply call the f1_score() function. The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts what you really care about is recall. Increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff.
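For reference, the harmonic-mean formula works out to: F1 = 2 × precision × recall / (precision + recall) = TP / (TP + (FN + FP) / 2).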
The SGDClassifier computes a score based on a decision function. If the score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class. Scikit-Learn gives you access to the decision scores that it uses to make predictions with the decision_function() method, which returns a score for each instance. You can then make predictions based on those scores using any threshold you want.
from sklearn.metrics import precision_score, recall_score, f1_score
print("Precision Score = ",precision_score(y_train_5,y_train_pred),", when the classifier predicts a 5, it is only right 80.9% of the time.",sep="")
print("Recall Score = ",recall_score(y_train_5,y_train_pred),", the classifier only detects 79.6% of the 5s.",sep="")
print("F_1 Score =",f1_score(y_train_5,y_train_pred))
y_scores = sgd_clf.decision_function([some_digit])
print("Descision Function Scores =",y_scores)
threshold = 0
y_some_digit_pred = (y_scores > threshold)
print("Prediction with Threshold=0 => ",y_some_digit_pred)
y_scores = cross_val_predict(sgd_clf,X_train,y_train_5,cv=3,method="decision_function")
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5,y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds, threshold):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="Recall")
    plt.grid(visible=True, which="major", axis="both")
    plt.xlabel("Threshold")
    plt.axis((-50000, 50000, 0, 1))
    idx = (thresholds >= threshold).argmax()
    plt.plot(thresholds[idx], precisions[idx], "ro")
    plt.plot(thresholds[idx], recalls[idx], "ro")
    plt.plot([thresholds[idx], thresholds[idx]], [0, max(precisions[idx], recalls[idx])], "r--")
    plt.plot([-50000, thresholds[idx]], [precisions[idx], precisions[idx]], "r--")
    plt.plot([-50000, thresholds[idx]], [recalls[idx], recalls[idx]], "r--", label="Threshold")
    plt.legend()
    plt.show()
plot_precision_recall_vs_threshold(precisions,recalls,thresholds,10000)
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls[:-1], precisions[:-1], "b-")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.axis((0, 1, 0, 1))
    plt.show()
plot_precision_vs_recall(precisions,recalls)
The reason [that the precision curve is bumpier than the recall curve] is that precision may sometimes go down when you raise the threshold. On the other hand, recall can only go down when the threshold is increased.
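To see why, consider an illustrative example (the numbers are made up for the sake of the argument): suppose at some threshold the classifier makes 5 positive predictions, 4 of them correct, so precision is 4/5 = 0.80. Raise the threshold just enough to drop one of the true positives and precision falls to 3/4 = 0.75, even though the threshold went up. Recall, by contrast, has a fixed denominator (the total number of actual positives), so it can only fall or stay the same as the threshold rises.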
The ROC Curve
The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision vs recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate. The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1-specificity. You can get the TPR and FPR for various threshold values using the roc_curve() function.
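In formula form: FPR = FP / (FP + TN) = 1 − TNR, where TN is the number of true negatives.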
Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely random classifier: A good classifier stays as far away from that line as possible.
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. As a rule of thumb, you should prefer the precision/recall curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. The RandomForestClassifier class does not have a decision_function() method. Instead, it has a predict_proba() method. Scikit-Learn classifiers generally have one or the other. The predict_proba() method returns an array containing a row per instance and a column per class, each containing the probability that the given instance belongs to the given class.
As you can see below, the RandomForestClassifier's ROC curve looks much better than the SGDClassifier's.
# Aiming for a 90% precision, get the lowest threshold that gets you that precision:
# thresholds_90_precision = thresholds[np.argmax(precisions >= 0.90)]
# print("Lowest Threshold with 90% Precision =",thresholds_90_precision)
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.axis((0, 1, 0, 1))
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], "k--")  # dashed diagonal
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
plot_roc_curve(fpr,tpr)
plt.show()
print("ROC AUC Score =",roc_auc_score(y_train_5,y_scores))
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf,X_train,y_train_5,cv=3,method="predict_proba")
y_scores_forest = y_probas_forest[:,1] # Score = probability of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)
plt.plot(fpr,tpr,"b:",label="SGD")
plot_roc_curve(fpr_forest,tpr_forest,"Random Forest")
plt.legend(loc="lower right")
plt.show()
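To quantify the visual comparison, we can also compute the Random Forest's ROC AUC from the same cross-validated probabilities. This is a small addition of mine, reusing the roc_auc_score function imported above:
print("Random Forest ROC AUC Score =", roc_auc_score(y_train_5, y_scores_forest))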
Multiclass Classification
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes. Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly. Others (such as Support Vector Machine classifiers or linear classifiers) are strictly binary classifiers. There are various strategies that you can use to perform multiclass classification using multiple binary classifiers:
- One-versus-all (OvA), also called one-versus-the-rest: one way to create a system that can classify the digit images into 10 classes is to train 10 binary classifiers, one for each digit. Then, when you want to classify an image, you get the decision score from each classifier and you select the class whose classifier outputs the highest score.
- One-versus-one (OvO): Train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 1s and 2s, and so on. If there are N classes, you need to train N×(N−1)/2 classifiers.
Some algorithms, such as Support Vector Machine classifiers, scale poorly with the size of the training set, so for these algorithms OvO is preferred, since it is faster to train many classifiers on small training sets than to train a few classifiers on large training sets. For most binary classification algorithms, however, OvA is preferred.
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO).
If you want to force Scikit-Learn to use one-versus-one or one-versus-the-rest, you can use the OneVsOneClassifier or OneVsRestClassifier classes. Scikit-Learn does not have to run OvA or OvO for a RandomForestClassifier, because Random Forest classifiers can directly classify instances into multiple classes.
sgd_clf.fit(X_train,y_train)
sgd_clf.predict([some_digit])
some_digit_scores = sgd_clf.decision_function([some_digit])
print("The SGD classifier actually trained 10 binary classifiers, as evidenced by 10 returned scores:")
print(some_digit_scores)
print("The highest score is the one corresponding to class 5:")
print(np.argmax(some_digit_scores))
print("SGDClassifier classes =",sgd_clf.classes_)
print(sgd_clf.classes_[5])
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train,y_train)
ovo_clf.predict([some_digit])
print("Number of OneVsOne Classifier Estimators",len(ovo_clf.estimators_))
forest_clf.fit(X_train,y_train)
print("RandomForestClassifier Prediction for some_digit =",forest_clf.predict([some_digit]))
print("Random Forest predict_proba for some_digit =",forest_clf.predict_proba([some_digit]))
print("SGDClassifier Multiclass Accuracy (3-Fold CV) =", cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy"))
# Simply scaling the inputs improves accuracy
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
print("SGDClassifier Multiclass Accuracy with Scaled Inputs (3-Fold CV) =", cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy"))
Error Analysis
One way to tune a classification model is to analyze the types of errors it makes. The confusion matrix is helpful for this. It is often convenient to look at an image representation of the confusion matrix, using Matplotlib's matshow() function. The confusion matrix below looks fairly good, since most images are on the main diagonal, which means they were classified correctly. The 5s look slightly darker than the other digits, which could mean that there are fewer images of 5s in the dataset or that the classifier does not perform as well on 5s as on other digits. Comparing error rates instead of the absolute number of errors is also a good idea. Keep in mind that rows represent the actual classes, while columns represent the predicted classes. Analyzing the confusion matrix often gives you insights on ways to improve your classifier. Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing, but it is more difficult and time-consuming.
The ConfusionMatrixDisplay from sklearn.metrics is worth looking into for plotting confusion matrices.
y_train_pred = cross_val_predict(sgd_clf,X_train_scaled,y_train,cv=3)
conf_mx = confusion_matrix(y_train,y_train_pred)
print(conf_mx)
plt.matshow(conf_mx,cmap=plt.cm.gray)
plt.show()
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums  # compare error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)  # keep only the errors
plt.matshow(norm_conf_mx,cmap=plt.cm.gray)
plt.show()
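As noted above, ConfusionMatrixDisplay is worth a look; a minimal sketch using the predictions we already have (this needs a fairly recent Scikit-Learn, 1.0 or later, if I'm not mistaken):
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred, normalize="true", values_format=".0%")  # row-normalized, shown as percentages
plt.show()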
Multilabel Classification
In some cases you may want your classifier to output multiple classes for each instance. Such a classification system that outputs multiple binary tags is called a multilabel classification system. There are many ways to evaluate a multilabel classifier, and selecting the right metric really depends on your project.
Multioutput Classification
The last type of classification task we are going to discuss is called multioutput-multiclass classification (or simply multioutput classification). It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).
## Multilabel Classification
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train,y_multilabel)
print("Multilabel Classification\n---------------------------------\nHere we are building a multilabel classifier that\n1. Classifies an image as representing a large number (greater than or equal to 7).]\n2. Classifies an image as being odd.")
print("[Prediction that the Number is Greater than or Equal to 7, Prediction that the Number is Odd]",knn_clf.predict([some_digit]))
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
print("F_1 Score for Multilabel Classifier (Computes F_1 score for each individual label, and computes the average score) =",f1_score(y_multilabel, y_train_knn_pred, average="macro"))
## Multioutput Classification
print("Multioutput Classification\n---------------------------------\nHere we are building a system that removes noise from images.")
some_digit_test = X_train.loc[0].values
some_digit_image_test = some_digit_test.reshape(28,28)
plt.imshow(some_digit_image_test, cmap=mpl.cm.binary, interpolation="nearest")
plt.show()
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
some_digit_test_2 = X_train_mod.loc[0].values
some_digit_image_test_2 = some_digit_test_2.reshape(28,28)
plt.imshow(some_digit_image_test_2, cmap=mpl.cm.binary, interpolation="nearest")
plt.show()
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train  # the targets are the original (clean) images
y_test_mod = X_test
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod.iloc[0]])
clean_digit_test = clean_digit.reshape(28,28)
plt.imshow(clean_digit_test, cmap=mpl.cm.binary, interpolation="nearest")
plt.show()
print("I don't know what went wrong here.")