Support Vector Machines Exercises Answers
This chapter covers Support Vector Classification and Regression.
Question 1
What is the fundamental idea behind Support Vector Machines?
The fundamental idea behind Support Vector Machines is to fit the widest possible "street" between classes of objects: that is, to find the separating hyperplane that maximizes the margin between the classes.
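For reference, a standard way to write this objective (not part of the original answer, but the usual formulation): for a linear SVM with weight vector $\bm{w}$ and bias $b$, the hard margin problem is $ \min_{\bm{w},b} \ \frac{1}{2}\bm{w}^T\bm{w} \quad \text{subject to} \quad t^{(i)}(\bm{w}^T\bm{x}^{(i)} + b) \ge 1 $ for every training instance $i$, where $t^{(i)}$ is $1$ for the positive class and $-1$ for the negative class. Maximizing the width of the street is equivalent to minimizing $\lVert\bm{w}\rVert$.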
Question 2
What is a Support Vector?
A support vector is an instance that lies on the edge of the "street" referenced above (or inside it, when margin violations are allowed). These instances determine the decision boundary; any instance that lies fully off the street has no influence on it.
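In Scikit-Learn the support vectors of a fitted SVC are exposed directly; a minimal sketch on made-up toy data (the blob dataset here is just for illustration):
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # hypothetical toy data
svm_clf = SVC(kernel="linear", C=1).fit(X, y)
print(svm_clf.support_vectors_)  # the instances that lie on or inside the street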
Question 3
Why is it important to scale the inputs when using SVM?
SVMs try to fit the largest possible "street" between the classes (see the first answer), so if the training set is not scaled, features with small values are effectively neglected compared to features with large values.
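In practice this just means putting a scaler in front of the classifier; a minimal sketch:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features before the SVM sees them
scaled_svm_clf = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("fit", SVC(kernel="rbf")),
])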
Question 4
Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?
An SVM classifier can output a confidence score in the form of the signed distance between an instance and the decision boundary, but it cannot directly output a probability. To get probability estimates you need a calibration technique such as Platt scaling; in Scikit-Learn, setting probability=True when creating an SVC fits such a calibration using cross-validation, which makes predict_proba() available (at the cost of slower training).
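A sketch of the two kinds of output on made-up toy data (the blob dataset here is just for illustration):
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=42)  # hypothetical toy data
svm_clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(svm_clf.decision_function(X[:3]))  # signed distances to the decision boundary
print(svm_clf.predict_proba(X[:3]))      # calibrated class probabilities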
Question 5
Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
The primal form should be used. The dual form scales with the number of training instances, while the primal form scales with the number of features, so with millions of instances and only hundreds of features the primal is much faster to train. In Scikit-Learn's LinearSVC this corresponds to setting the dual hyperparameter to False (kernelized SVMs, by contrast, can only use the dual form).
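A minimal sketch of what that looks like in Scikit-Learn:
from sklearn.svm import LinearSVC

# Primal formulation - appropriate when the number of instances far exceeds the number of features
clf = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, C=1)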
Question 6
Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease γ(gamma) ? What about C?
You should increase γ and/or C if your SVM with an RBF kernel is underfitting the training set: both act as regularizers when small (a small γ widens each instance's bell-shaped region of influence and smooths the decision boundary, while a small C allows more margin violations), so an underfitting model needs larger values.
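A sketch of the kind of change this implies (the specific values are arbitrary):
from sklearn.svm import SVC

underfit_clf = SVC(kernel="rbf", gamma=0.1, C=0.1)  # heavily regularized - likely to underfit
better_clf = SVC(kernel="rbf", gamma=5, C=100)      # less regularization - fits the training set more closely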
Question 7
How should you set the QP parameters ($\bm{H}$, $\bm{f}$, $\bm{A}$, and $\bm{b}$) to solve the soft margin linear SVM classifier problem using an off-the-shelf QP solver?
Note that $n_p$ stands for the number of parameters (one per feature, plus one bias term, plus one slack variable per training instance for the soft margin problem) and $n_c$ for the number of constraints.
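For reference, a generic off-the-shelf QP solver expects the problem in roughly this standard form, onto which the parameters above are mapped: $ \min_{\bm{p}} \ \frac{1}{2}\bm{p}^T \bm{H} \bm{p} + \bm{f}^T \bm{p} \quad \text{subject to} \quad \bm{A}\bm{p} \le \bm{b} $, where $\bm{p}$ is the vector of the $n_p$ parameters being optimized.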
Question 8
Train a LinearSVC on a linearly separable dataset. Then train an SVC and a SGDClassifier on the same dataset. See if you can get them to produce roughly the same model.
from sklearn.datasets import load_iris
dataset = load_iris()
data, target, target_names, descr = dataset["data"], dataset["target"], dataset["target_names"], dataset["DESCR"]
print(descr)
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(42)
random_state = np.random.randint(100)
X_train, X_test, y_train,y_test = train_test_split(data,target,test_size=0.2,random_state=random_state)
def get_flower_name(num):
return target_names[num]
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,3,layout="constrained",figsize=(16,8))
sepal_length = X_train[:,0]
sepal_width = X_train[:,1]
petal_length = X_train[:,2]
petal_width = X_train[:,3]
print("""According to the Description of the Iris dataset: "One class is linearly separable from the other 2; the latter are NOT linearly separable from each other."
According to the graph below, the setosa flower is linearly separable from the other two, so a setosa vs. not-setosa classifier should be trained.""")
print(target_names)
plots = [
{ "ax": [0,0], "x_label": "Sepal Length (cm)", "y_label": "Sepal Width (cm)", "x": sepal_length, "y": sepal_width },
{ "ax": [0,1], "x_label": "Sepal Length (cm)", "y_label": "Petal Length (cm)", "x": petal_length, "y": sepal_width },
{ "ax": [0,2], "x_label": "Sepal Length (cm)", "y_label": "Petal Width (cm)", "x": sepal_length, "y": petal_width },
{ "ax": [1,0], "x_label": "Sepal Width (cm)", "y_label": "Petal Length (cm)", "x": sepal_width, "y": petal_length },
{ "ax": [1,1], "x_label": "Sepal Width (cm)", "y_label": "Petal Width (cm)", "x": sepal_width, "y": petal_width },
{ "ax": [1,2], "x_label": "Petal Length (cm)", "y_label": "Petal Width (cm)", "x": petal_length, "y": petal_width },
]
import pandas as pd
y_ser = pd.Series(y_train,dtype=int)
cm = [{"c": "b", "m": 's'}, {"c": "y", "m": "v"},{"c": "r", "m": "d"}]
for d in plots:
Ax = ax[d["ax"][0],d["ax"][1]]
for i in range(3):
c = cm[i]["c"]
m = cm[i]["m"]
y_indices = y_ser[y_ser==i].index.to_numpy()
X_values = d["x"][y_indices]
y_values = d["y"][y_indices]
Ax.scatter(X_values,y_values,c=c,marker=m,label=get_flower_name(i))
Ax.set_xlabel(d["x_label"])
Ax.set_ylabel(d["y_label"])
Ax.set_title(d["y_label"] + " vs. " + d["x_label"])
Ax.legend()
plt.show()
y_train_setosa = np.where(y_train==0,1,0)
y_test_setosa = np.where(y_test==0,1,0)
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
"""
Linear SVC
------------------------------------------
Linear Support Vector Classification. Similar to SVC with parameter kernel="linear", but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
The main differences between LinearSVC and SVC lie in the loss function used by default, and in the handling of intercept regularization between the two implementations.
"""
lin_svm_params = [
{"fit__C": [0.1,0.5,1,10]},
]
lin_svm_pipe = Pipeline(
steps=[
("inpute",KNNImputer(n_neighbors=5)), # All features will need to be imputed so
("scale",StandardScaler()), # Important that
("fit",LinearSVC(penalty="l2",dual=False,max_iter=10000))
]
)
lin_svm = HalvingGridSearchCV(lin_svm_pipe,param_grid=lin_svm_params,cv=5,verbose=3,refit=True)
"""
SGD Classifier
------------------------------------------
Linear classifiers (SVM, logistic regression, etc.) with SGD training.
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). For best results with the default learning rate, the data should be scaled.
"""
"""
l1_ratio=0 gives a pure L2 penalty (the same as a linear SVM)
tol=1e-4 matches the LinearSVC default; a smaller tol is also tried to get the SGD model closer to the LinearSVC solution
"""
sgd_params = [
{"fit__l1_ratio": [0,0.15], "fit__tol": [1e-4, 1e-5]},
]
sgd_pipe = Pipeline(
steps=[
("inpute",KNNImputer(n_neighbors=5)),
("scale",StandardScaler()),
("fit",SGDClassifier(loss="hinge",penalty="l2",tol=1e-4,n_jobs=-1)) # hinge="loss" gives a Linear SVM, l2 penalty is the default for linear SVM models,
]
)
sgd = HalvingGridSearchCV(sgd_pipe,param_grid=sgd_params,cv=5,verbose=3,refit=True)
lin_svm.fit(X_train,y_train_setosa)
sgd.fit(X_train,y_train_setosa)
lin_svm_df = pd.DataFrame(lin_svm.cv_results_)
print("Linear SVM (constant in decision function, weights assigned to features)",(lin_svm.best_estimator_.named_steps["fit"].intercept_,lin_svm.best_estimator_.named_steps["fit"].coef_))
sgd_df = pd.DataFrame(sgd.cv_results_)
print("Stochastic Gradient Descent (constant in decision function, weights assigned to features)",(sgd.best_estimator_.named_steps["fit"].intercept_,sgd.best_estimator_.named_steps["fit"].coef_))
lin_svm_pred = lin_svm.predict(X_test)
sgd_pred = sgd.predict(X_test)
from sklearn.metrics import accuracy_score
print("Linear SVC Accuracy:",accuracy_score(lin_svm_pred,y_test_setosa))
print("SGD Accuracy:",accuracy_score(sgd_pred,y_test_setosa))
Question 9
Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
data, target, description = mnist["data"], mnist["target"], mnist["DESCR"]
print(description)
data = data.to_numpy()
target = target.to_numpy()
X_train, X_test, y_train, y_test = data[:60000], data[60000:], target[:60000], target[60000:]
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
"""
I am going to try both linear and RBF kernel models: the book's recommendation is to start with a linear kernel and move to the RBF kernel if there is time. The RBF kernel will take longer to train, but a linear model may not be the best fit for this dataset.
"""
svc_param_grid = [
{"fit__kernel": ["linear"], "fit__C": [0.5,1,10,100]},
{"fit__kernel": ["rbf"], "fit__C": [0.5,1,10] }
]
svc_pipe = Pipeline(
steps=[
("impute", KNNImputer(n_neighbors=5)),
("scale", StandardScaler()), # Scaling important for SVC
("fit", SVC(cache_size=1000,decision_function_shape='ovr')), # incrase cache size (try to speed up), decision_function_shape = "ovr" - multi label classifications "Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes)"
]
)
svc = HalvingGridSearchCV(svc_pipe,param_grid=svc_param_grid,cv=5,verbose=3,refit=True)
svc.fit(X_train,y_train)
print(svc.best_params_)
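The question asks what accuracy you can reach, so the refitted best model should also be scored on the held-out test set; a minimal sketch (the number you get depends on the search results above):
from sklearn.metrics import accuracy_score

y_pred = svc.predict(X_test)  # HalvingGridSearchCV with refit=True predicts with the best model
print("Test Accuracy:", accuracy_score(y_test, y_pred))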
Question 10
Train an SVM regressor on the California housing dataset.
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(42)
random_state = np.random.randint(100)
data = pd.read_csv(os.path.join(os.getcwd(),'..','housing.csv'))
target = data["median_house_value"]
data = data.drop(columns="median_house_value")
X_train, X_test, y_train, y_test = train_test_split(data,target,test_size=0.2,random_state=random_state)
print("X_train Shape:",X_train.shape,"\nX_test Shape:",X_test.shape,"\ny_train Shape:",y_train.shape,"\ny_test Shape:",y_test.shape)
X_train.head()
X_train.describe()
X_train.info()
import matplotlib.pyplot as plt
n, bins, patches = plt.hist(X_train["housing_median_age"],bins=7)
plt.gca().set_title("Test Bins for Housing Median Age")
plt.show()
print(bins)
from sklearn.preprocessing import StandardScaler
handling_multi_collin = X_train.iloc[:,:-1].copy()
corr = handling_multi_collin.corr()
print("""
total_rooms, total_bedrooms, population, and households demonstrate multicollinearity.
""")
corr.style.background_gradient(cmap='coolwarm')
print("""
Attempt at removing multicollinearity while keeping information
""")
handling_multi_collin["people_per_bedroom"] = handling_multi_collin["population"]/handling_multi_collin["total_bedrooms"]
handling_multi_collin["bedrooms_per_rooms"] = handling_multi_collin["total_bedrooms"]/handling_multi_collin["total_rooms"]
handling_multi_collin = handling_multi_collin.drop(columns=["total_bedrooms","total_rooms","population"])
corr2 = handling_multi_collin.corr()
corr2.style.background_gradient(cmap='coolwarm')
fig, ax = plt.subplots(1,2,layout="constrained")
ax[0].hist(X_train["households"])
ax[0].set_title("Households Histogram")
ax[1].hist(X_train["population"])
ax[1].set_title("Population Histogram")
plt.show()
from sklearn.neighbors import KNeighborsTransformer, KernelDensity
knnTrans = KNeighborsTransformer(mode="distance",n_neighbors=6 )
out = knnTrans.fit_transform(X_train.loc[:,["longitude","latitude"]])
print("KNeighborsTransformer Shape:",out.shape,". This is out for being too sparse")
kernDens = KernelDensity(kernel="gaussian")
kernDens.fit(X_train.loc[:,["longitude","latitude"]])
out = kernDens.score_samples(X_train.loc[:,["longitude","latitude"]])
print(out,out.shape,X_train.loc[:,["latitude"]].shape)
fig, ax = plt.subplots(1,1,layout="constrained")
s = ax.scatter(X_train.loc[:,["longitude"]],X_train.loc[:,["latitude"]],cmap=plt.cm.hsv,c=out)
ax.set_title("Kernel Density Estimation for Latitude / Longitude")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
fig.colorbar(s,cmap=plt.cm.hsv)
plt.show()
print("""
I've tried other ways of doing longitude / latitude - I think I'll give this a shot.
""")
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
class BinThenOhe(BaseEstimator,TransformerMixin):
"""
    - Transformer that first puts values into bins and then one-hot encodes the bin index
    - fit and transform should receive an array of shape (n_samples, 1)
"""
def __init__(self,bins_num=7,bins=None):
self.ohe = OneHotEncoder(sparse_output=False,handle_unknown="ignore")
self.bins_num = bins_num
        if bins is not None:
            self.bins = bins
    def fit(self,X,y=None):
        X = pd.DataFrame(X,columns=["Col"])
        if getattr(self,"bins",None) is None:
            # Compute the bin edges from the training data
            self.bins = np.histogram_bin_edges(X["Col"],bins=self.bins_num)
        # Assign each value to a bin index, then fit the encoder on those indices
        binned = np.digitize(X["Col"].to_numpy(),self.bins[1:-1])
        self.ohe.fit(binned.reshape(-1,1))
        return self
    def transform(self,X,y=None):
        X = pd.DataFrame(X,columns=["Col"])
        # Bin with the edges learned during fit, then one-hot encode the bin index
        binned = np.digitize(X["Col"].to_numpy(),self.bins[1:-1])
        X = self.ohe.transform(binned.reshape(-1,1))
        return X
class TransformHouseValues(BaseEstimator,TransformerMixin):
"""
- I am not sure whether to just scale the housing age or put it into bins and ohe it, so I will try both
    - This estimator expects a one-column array - the housing_median_age column from the original DataFrame
    - If bin_it=True, the ages are put into bins and one-hot encoded; otherwise the values are scaled with StandardScaler
- You can manipulate bins as well - the number of bins to use (default = 7)
"""
def __init__(self,bin_it=True,bins=7):
self.bin_it=bin_it
self.bins=bins
        if not isinstance(self.bins,int):
            raise TypeError("Bins should be an integer.")
def reshapeX(self,X):
if isinstance(X,pd.Series):
X = X.to_numpy().reshape(-1,1)
elif isinstance(X,pd.DataFrame):
X = X.iloc[:,0].to_numpy().reshape(-1,1)
elif len(X.shape)==1:
            # If X is a 1-D array, reshape it into a column vector
X = X.reshape(-1,1)
return X
def fit(self,X,y=None):
X = self.reshapeX(X)
if self.bin_it:
self.transformer = BinThenOhe(bins_num=self.bins)
self.transformer.fit(X,y)
return self
else:
self.transformer = StandardScaler()
self.transformer.fit(X)
return self
def transform(self,X,y=None):
X = self.reshapeX(X)
X = self.transformer.transform(X)
return X
class HandleRoomsBedroomsPopulation(BaseEstimator,TransformerMixin):
"""
- Trying to remove multicollinearity between these attributes
- Columns in this order:
- total_rooms total_bedrooms population
- Compute/Return:
- people_per_bedroom
- bedrooms_per_rooms
        - population (optional - controlled by the keep_pop kwarg)
    - Should be receiving a NumPy array since this runs after the imputer
"""
def __init__(self,keep_pop=False):
self.keep_pop = keep_pop
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
if self.keep_pop:
ret = np.zeros((X.shape[0],3),dtype=np.float64)
ret[:,2] = X[:,2]
else:
ret = np.zeros((X.shape[0],2),dtype=np.float64)
ret[:,0] = np.divide(X[:,2],X[:,1],out=np.zeros_like(X[:,2],dtype=np.float64),where=X[:,1]!=0)
ret[:,1] = np.divide(X[:,1],X[:,0],out=np.zeros_like(X[:,1],dtype=np.float64),where=X[:,0]!=0)
return ret
class LngLatHandler(BaseEstimator,TransformerMixin):
"""
Handling Longitude and Latitude to get some data from them
    - Using kernel density estimation - assuming the houses had an equal chance of being sampled from anywhere in California, this should give a good estimate of population density
- https://scikit-learn.org/stable/modules/density.html#kernel-density-estimation
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity
"""
def __init__(self,kernel="gaussian",bandwidth=1,algorithm="auto"):
self.kernel = kernel
self.bandwidth = bandwidth
self.algorithm = algorithm
self.kern_func = KernelDensity(kernel=self.kernel,bandwidth=self.bandwidth,algorithm=self.algorithm)
def fit(self,X,y=None):
self.kern_func.fit(X)
return self
def transform(self,X,y=None):
out = self.kern_func.score_samples(X)
out = out.reshape(-1,1)
return out
rooms_bed_pop_pipeline = Pipeline(
steps=[
('impute',KNNImputer(n_neighbors=5,weights="uniform")),
("rem_collin",HandleRoomsBedroomsPopulation()),
("scale",StandardScaler())
]
)
ocean_proximity_pipeline = Pipeline(
steps=[
('impute',SimpleImputer(strategy="most_frequent")),
('ohe',OneHotEncoder(sparse_output=False,handle_unknown="ignore"))
]
)
housing_median_age_pipeline = Pipeline(
steps=[
('impute',SimpleImputer(strategy="median")),
("bohe_or_std",TransformHouseValues())
]
)
median_income_households_pipeline = Pipeline(
steps=[
('impute',SimpleImputer(strategy="median")),
("scale",StandardScaler())
]
)
lat_lng_pipeline = Pipeline(
steps=[
('impute',SimpleImputer(strategy="median")),
('kernel',LngLatHandler()),
("scale",StandardScaler())
]
)
col_transformer = ColumnTransformer(
transformers=[
("ocp",ocean_proximity_pipeline,["ocean_proximity"]), # 0-4
("mi_h",median_income_households_pipeline,["median_income","households"]), # 5-6
('rbpop',rooms_bed_pop_pipeline,["total_rooms","total_bedrooms","population"]), # 7-8
("hma",housing_median_age_pipeline,['housing_median_age']), #
("lnglat",lat_lng_pipeline,["longitude","latitude"]) # -1
]
)
from sklearn.svm import SVR, NuSVR
from sklearn.model_selection import GridSearchCV
param_grid = [
{"fit__kernel": ["linear"], "fit__C": [0.5,1,10,100], "transform__hma__bohe_or_std__bin_it": [True, False]},
{"fit__kernel": ["poly"], "fit__degree": [2,3], "fit__coef0": [0,1], "fit__C": [0.1,1,10], "transform__hma__bohe_or_std__bin_it": [True, False] },
{"fit__kernel": ["rbf"], "fit__C": [0.1,1,10], "transform__hma__bohe_or_std__bin_it": [True, False] },
{"fit__kernel": ["sigmoid"], "fit__coef0": [0,1], "fit__C": [0.5,1,10,100], "transform__hma__bohe_or_std__bin_it": [True, False] }
]
svr_pipe = Pipeline(
steps=[
('transform',col_transformer),
('fit',SVR(verbose=True))
]
)
svr = GridSearchCV(svr_pipe,param_grid=param_grid,cv=5,verbose=3,refit=True)
svr.fit(X_train,y_train)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, explained_variance_score, mean_pinball_loss, d2_pinball_score, d2_absolute_error_score
pred = svr.predict(X_test)
print("Mean Squared Error Score:",mean_squared_error(y_test,pred))
print("Mean Absolute Error Score:",mean_absolute_error(y_test,pred))
print("R2 Score Score:",r2_score(y_test,pred))
print("Explained Variance Score:",explained_variance_score(y_test,pred))
print("Mean Pinball Loss Score:",mean_pinball_loss(y_test,pred))
print("D2 Pinball Score:",d2_pinball_score(y_test,pred))
print("D2 Absolute Error Score:",d2_absolute_error_score(y_test,pred))
# explicitly require this experimental feature
from sklearn.experimental import enable_halving_search_cv # noqa
# now you can import normally from model_selection
from sklearn.model_selection import HalvingGridSearchCV
param_grid = [
{"fit__kernel": ["linear"], "fit__C": [100,1000], "transform__hma__bohe_or_std__bin_it": [True, False]},
{"fit__kernel": ["poly"], "fit__degree": [3,4], "fit__coef0": [1], "fit__C": [10,100,1000], "transform__hma__bohe_or_std__bin_it": [True, False] },
{"fit__kernel": ["rbf"], "fit__C": [100,1000], "transform__hma__bohe_or_std__bin_it": [True, False] },
{"fit__kernel": ["sigmoid"], "fit__C": [100,1000], "transform__hma__bohe_or_std__bin_it": [True, False] }
]
svr_pipe = Pipeline(
steps=[
('transform',col_transformer),
('fit',SVR(verbose=True))
]
)
svr = HalvingGridSearchCV(svr_pipe,param_grid=param_grid,cv=5,verbose=3,refit=True)
svr.fit(X_train,y_train)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, explained_variance_score, mean_pinball_loss, d2_pinball_score, d2_absolute_error_score
pred = svr.predict(X_test)
print("Mean Squared Error Score:",mean_squared_error(y_test,pred))
print("Mean Absolute Error Score:",mean_absolute_error(y_test,pred))
print("R2 Score Score:",r2_score(y_test,pred))
print("Explained Variance Score:",explained_variance_score(y_test,pred))
print("Mean Pinball Loss Score:",mean_pinball_loss(y_test,pred))
print("D2 Pinball Score:",d2_pinball_score(y_test,pred))
print("D2 Absolute Error Score:",d2_absolute_error_score(y_test,pred))
fig, ax = plt.subplots(1,1,layout="constrained")
ax.plot(y_test,y_test,c='k') # identity line - perfect predictions would fall on it
ax.scatter(y_test,pred,c='r',marker="+")
plt.show()
from sklearn.ensemble import IsolationForest
out = col_transformer.fit_transform(X_train)
clf = IsolationForest(random_state=random_state).fit(out)
predictions = clf.predict(out)
print("I still am not really sure where I am going wrong here")