Hands On Machine Learning Chapter 5 - Support Vector Machines
I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped neural networks on that first pass.
Support Vector Machines
A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.
Linear SVM classification
The fundamental idea behind SVMs is best explained with some pictures. The image below shows the Iris dataset. The two classes can be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers (one works horribly, and the other two are so close to the data that it is unlikely they will continue to perform well on new data). The solid line on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification. Adding more instances "off the street" will not affect the decision boundary. The instances circled in red are called the support vectors.
SVMs are sensitive to feature scales, as you can see below. After feature scaling (e.g., using Scikit-Learn's StandardScaler), the decision boundary looks much better on the right.
Soft Margin Classification
If we strictly impose that all instances must be off the street and on the correct side, this is called hard margin classification. There are two main issues with hard margin classification:
- It only works when the data is linearly separable.
- It is quite sensitive to outliers.
The image below shows hard margin classification's sensitivity to outliers: on the left it is impossible to find a hard margin at all, and on the right the decision boundary ends up very different from the one found without the outlier, so it will probably not generalize as well. To avoid these issues, it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the number of margin violations (instances that end up in the middle of the street or on the wrong side). This is called soft margin classification.
In Scikit-Learn's SVM classes, you can control this balance using the C hyperparameter: a smaller C value leads to a wider street but more margin violations.
In the image below: on the left, using a low C value, the margin is quite large, but many instances end up on the street. On the right, using a high C value, the classifier makes fewer margin violations but ends up with a smaller margin. However, it seems likely that the first classifier will generalize better: in fact, even on this training set it makes fewer prediction errors, since most of the margin violations are actually on the correct side of the decision boundary.
If your SVM model is overfitting, you can try regularizing it by reducing C. Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class. You can train a linear SVM model with LinearSVC(loss="hinge", C=[float]), with SVC(kernel="linear", C=[float]) (much slower and not recommended), or with SGDClassifier(loss="hinge", alpha=1/(m*C)), which applies regular Stochastic Gradient Descent to train a linear SVM classifier. It does not converge as fast, but it is useful for huge datasets that do not fit in memory.
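Below is a minimal sketch of these three options side by side; the toy dataset (make_blobs), the C value, and the classifier variable names are made up for illustration:
## Sketch: Three Ways to Train a Linear SVM Classifier (toy data for illustration)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # made-up separable data
X_scaled = StandardScaler().fit_transform(X)

C = 1
m = len(X_scaled)  # number of training instances

lin_clf = LinearSVC(loss="hinge", C=C)                 # recommended option
svc_clf = SVC(kernel="linear", C=C)                    # same model, but much slower
sgd_clf = SGDClassifier(loss="hinge", alpha=1/(m*C))   # useful for huge datasets (supports out-of-core learning via partial_fit)

for clf in (lin_clf, svc_clf, sgd_clf):
    clf.fit(X_scaled, y)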
## Load the Iris Dataset, Scale the Features, Train a Linear SVM Model
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

svm_clf.predict([[5.5, 1.7]])  # predict whether this flower is Iris-Virginica
Nonlinear SVM Classification
Many datasets are not even close to being linearly separable. One approach is to add more features, such as polynomial features; in some cases, this can result in a linearly separable dataset. The image below shows the effect of adding a second feature $x_2 = (x_1)^2$, resulting in a 2D dataset that is perfectly linearly separable. The code below shows adding more features. The make_moons() dataset is a toy dataset for binary classification in which the data points are shaped as two interleaving half circles.
Polynomial Kernel
Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.
When using SVMs, you can apply an almost miraculous technique called the kernel trick. It makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them, so there is no combinatorial explosion of features. This trick is implemented by the SVC class and can be seen in the code below. The image below shows, on the left, an SVM classifier using a 3rd-degree polynomial kernel and, on the right, an SVM classifier using a 10th-degree polynomial kernel. If your model is overfitting, you may want to reduce the polynomial degree; if your model is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials. A common approach to finding the right hyperparameter values is to use grid search. It is often faster to first do a very coarse grid search, then a finer grid search around the best values found.
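Here is a rough sketch of that coarse grid search idea for the polynomial kernel; the dataset and the parameter ranges are placeholder choices, not values from the book:
## Sketch: Coarse Grid Search over Polynomial-Kernel Hyperparameters (assumed ranges)
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

param_grid = {
    "svm_clf__degree": [2, 3, 10],   # placeholder values for a coarse first pass
    "svm_clf__coef0": [0, 1, 100],
    "svm_clf__C": [0.1, 1, 10],
}
poly_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])
grid_search = GridSearchCV(poly_pipeline, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)  # then refine the search around these values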
Adding Similarity Features
Another technique to tackle nonlinear problems is to add features computed using a similarity function, which measures how much each instance resembles a particular landmark. The Gaussian Radial Basis Function (RBF) can be used as the similarity function.
Gaussian RBF
The Gaussian RBF is a bell-shaped function varying from 0 (very far from the landmark) to 1 (at the landmark). The plot on the right below shows the transformed dataset (dropping the original features and adding the similarity features). As you can see, the data is now linearly separable. How do you select landmarks? The simplest approach is to create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features), so if your training set is very large, you end up with an equally large number of features.
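As a small sketch of this idea, the Gaussian RBF similarity $\phi_{\gamma}(\mathbf{x}, \ell) = \exp(-\gamma\,\|\mathbf{x} - \ell\|^2)$ can be computed by hand; the 1D data points, landmarks, and gamma value below are made up for illustration:
## Sketch: Computing Gaussian RBF Similarity Features by Hand (made-up 1D data)
import numpy as np

X_1d = np.array([[-4.0], [-2.0], [0.0], [1.0], [3.0]])  # toy 1D instances
landmarks = np.array([[-2.0], [1.0]])                    # hand-picked landmarks
gamma = 0.3

# similarity[i, j] = exp(-gamma * (x_i - landmark_j)^2)
sq_dists = (X_1d - landmarks.T) ** 2        # shape (5, 2): one column per landmark
similarity_features = np.exp(-gamma * sq_dists)
print(similarity_features)                  # near 1 close to a landmark, near 0 far away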
Gaussian RBF Kernel
The similarity features method can be computationally expensive, since it requires computing all of the additional features, especially on large training sets. However, once again, the kernel trick makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them. See the code below (Kernel Trick - Adding Similarity Features). The model produced by that code is shown on the bottom left. The image below shows different results with different hyperparameters. Increasing gamma makes the bell-shaped curve narrower, and as a result each instance's range of influence is smaller: the decision boundary ends up being more irregular, wiggling around individual instances. A small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence, and the decision boundary ends up smoother. gamma acts like a regularization parameter: if your model is overfitting, reduce it, and if it is underfitting, increase it.
Other kernels exist but are used much more rarely. String kernels are sometimes used when classifying text documents or DNA sequences. As a rule of thumb, you should always try the linear kernel first, especially if the training set is very large or if it has plenty of features. If the training set is not too large, you should try Gaussian RBF kernel as well: it works well in most cases. Then, if you have time and computing power, you can experiment with other kernels using cross validation and grid search.
Computational Complexity
The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It scales almost linearly with the number of instances and features: $O(m \times n)$. The SVC class is based on the libsvm library, which implements an algorithm that supports the kernel trick. The training time complexity is usually between $O(m^2 \times n)$ and $O(m^3 \times n)$. This means that it gets slow when the number of training instances is large (hundreds of thousands). This algorithm is perfect for complex but small or medium-sized training sets. It scales well with the number of features, especially with sparse features (when each instance has few nonzero features).
Scikit-Learn Classification Classes
## Adding More Features
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15)  # two interleaving half circles

polynomial_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_clf.fit(X, y)
## Kernel Trick - Adding Polynomial Features
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)
## Kernel Trick - Adding Similarity Features
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)
SVM Regression
SVM is versatile: it supports linear and nonlinear classification as well as linear and nonlinear regression. The trick with SVM Regression is to try to fit as many instances as possible on the street while limiting margin violations (instances off the street). The width of the street is controlled by the hyperparameter $\epsilon$. The image below shows linear SVM Regression models trained with different $\epsilon$ values.
Adding more training instances within the margin does not affect the model's predictions, so the model is said to be $\epsilon$-insensitive. You can use Scikit-Learn's LinearSVR class to perform linear SVM Regression. The code below generates the model on the left of the image above.
from sklearn.svm import LinearSVR
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X,y)
To tackle nonlinear regression tasks, you can use a kernelized SVM model. The image below shows SVM Regression on a random quadratic training set, using a 2nd-degree polynomial kernel. There is little regularization on the left plot (a large C value), and much more regularization on the right plot (a small C value). The code below produces the model on the left of the image, using Scikit-Learn's SVR class (which supports the kernel trick). The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the regression equivalent of the LinearSVC class.
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel="poly",degree=2,C=100,epsilon=0.1)
svm_poly_reg.fit(X,y)
Under the Hood
In this chapter, the bias term is called $b$ and the feature weights vector is called $\mathbf{w}$. No bias feature will be added to the input feature vectors.
Decision Function and Predictions
The linear SVM classifier model predicts the class of a new instance $\mathbf{x}$ by simply computing the decision function $\mathbf{w}^T\mathbf{x} + b$: if the result is positive, the predicted class $\hat{y}$ is the positive class; otherwise it is the negative class.
The image below shows a two-dimensional plane since the dataset (the Iris dataset with petal width and petal length) has two features. The decision boundary is the set of points where the decision function is equal to 0: it is the intersection of two planes, which is a straight line. The dashed lines represent the points where the decision function is equal to 1 or -1: they are parallel and at equal distance to the decision boundary, forming a margin around it. Training a linear SVM classifier means finding the values of $\mathbf{w}$ and $b$ that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).
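As a quick sketch, you can inspect the decision function of the svm_clf pipeline trained earlier in these notes (the second petal measurement is a made-up example):
## Sketch: Inspecting the Decision Function of the Linear SVM Trained Earlier
some_petals = [[5.5, 1.7], [4.0, 1.0]]          # petal length, petal width
print(svm_clf.decision_function(some_petals))   # positive => Iris-Virginica, negative => not
print(svm_clf.predict(some_petals))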
Training Objective
The slope of the decision function is equal to the norm of the weight vector, $\|\mathbf{w}\|$. Dividing the slope by 2 multiplies the margin by 2 (see image below), so we want to minimize $\|\mathbf{w}\|$ to get a large margin. If we also want to avoid any margin violations, we need the decision function to be greater than 1 for all positive training instances and less than -1 for all negative training instances. If we define $t_i = -1$ for negative training instances and $t_i = 1$ for positive training instances, we can express this constraint as $t_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ for all training instances. We can then express the hard margin linear SVM classifier objective as the constrained optimization problem:
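Written out with the notation above, this is the standard hard margin formulation:

$$
\min_{\mathbf{w},\,b} \;\; \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{subject to} \quad t_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \geq 1 \;\; \text{for } i = 1, 2, \dots, m
$$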
To get the soft margin objective, we need to introduce a slack variable $\zeta_i \geq 0$ for each instance: $\zeta_i$ measures how much the $i$th instance is allowed to violate the margin. We now have two conflicting objectives: making the slack variables as small as possible to reduce margin violations, and making $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ as small as possible to increase the margin. This is where the C hyperparameter comes in: it lets us define the trade-off between these two objectives.
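The corresponding soft margin formulation adds the slack terms, weighted by C:

$$
\min_{\mathbf{w},\,b,\,\zeta} \;\; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{m}\zeta_i \quad \text{subject to} \quad t_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \geq 1 - \zeta_i \;\text{ and }\; \zeta_i \geq 0 \;\; \text{for } i = 1, 2, \dots, m
$$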
Quadratic Programming
The hard margin and soft margin problems are both convex quadratic optimization problems with linear constraints. Such problems are known as Quadratic Programming (QP) problems. Many off-the-shelf solvers are available to solve QP problems using a variety of techniques that are outside the scope of this book. The general problem formulation is given below:
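In its generic form (the notation here is the standard one and may differ slightly from the book's), a QP problem looks like this:

$$
\min_{\mathbf{p}} \;\; \frac{1}{2}\mathbf{p}^T\mathbf{H}\mathbf{p} + \mathbf{f}^T\mathbf{p} \quad \text{subject to} \quad \mathbf{A}\mathbf{p} \leq \mathbf{b}
$$

where $\mathbf{p}$ is the vector of parameters to optimize, $\mathbf{H}$ is a square matrix of quadratic coefficients, $\mathbf{f}$ is a vector of linear coefficients, and $\mathbf{A}$ and $\mathbf{b}$ encode the linear constraints.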
The Dual Problem
Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called the dual problem. The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can even have the same solution as the primal problem. Luckily, the SVM problem happens to meet those conditions, so you can choose to solve either the primal problem or the dual problem; both will have the same solution.
Kernelized SVM
The kernel trick, in essence, is the fact that you don't need to transform the training instances at all: just replace the dot product by its square in the dual form of the linear SVM objective. The result will be strictly the same as if you went through the trouble of actually transforming the training set and then fitting a linear SVM algorithm, but this trick makes the whole process much more computationally efficient. The function $K(\mathbf{a}, \mathbf{b}) = (\mathbf{a}^T\mathbf{b})^2$ is called a 2nd-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product $\phi(\mathbf{a})^T\phi(\mathbf{b})$ based only on the original vectors $\mathbf{a}$ and $\mathbf{b}$, without having to compute the transformation $\phi$.
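A tiny numeric sketch of this fact for the 2nd-degree polynomial kernel, using the explicit mapping $\phi(\mathbf{x}) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right)$ on made-up 2D vectors:
## Sketch: Verifying the 2nd-Degree Polynomial Kernel Trick on Toy Vectors
import numpy as np

def phi(x):
    # explicit 2nd-degree polynomial mapping of a 2D vector
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([2.0, 3.0])   # made-up vectors
b = np.array([4.0, 5.0])

lhs = phi(a) @ phi(b)      # dot product in the transformed space
rhs = (a @ b) ** 2         # K(a, b) = (a.T b)^2, computed on the original vectors
print(lhs, rhs)            # both equal 529.0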
Mercer's Theorem
According to Mercer's theorem, if a function $K(\mathbf{a}, \mathbf{b})$ respects a few mathematical conditions called Mercer's conditions ($K$ must be continuous and symmetric in its arguments, so $K(\mathbf{a}, \mathbf{b}) = K(\mathbf{b}, \mathbf{a})$), then there exists a function $\phi$ that maps $\mathbf{a}$ and $\mathbf{b}$ into another space such that $K(\mathbf{a}, \mathbf{b}) = \phi(\mathbf{a})^T\phi(\mathbf{b})$. So you can use $K$ as a kernel since you know $\phi$ exists, even if you don't know what $\phi$ is.
Online SVMs
Hinge Loss
The function $\max(0, 1 - t)$ is called the hinge loss function.
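A one-line NumPy sketch of this function (the sample t values are arbitrary):
## Sketch: Hinge Loss in NumPy
import numpy as np

def hinge_loss(t):
    # max(0, 1 - t): zero once t >= 1, linear penalty otherwise
    return np.maximum(0, 1 - t)

print(hinge_loss(np.array([-1.0, 0.0, 0.5, 1.0, 2.0])))  # [2.  1.  0.5 0.  0. ]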