Rest of Notes for Hands-On Machine Learning with Scikit-Learn and TensorFlow
I decided to just skim the rest of the textbook and try to become more aware of concepts and tools that may be useful. I think I will learn more if I start training models on Kaggle competitions and other datasets.
Chapter 5 - Support Vector Machines
A support vector machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.
- You are basically trying to find a way to split datasets along some line / surface
- What is the fundamental idea behind Support Vector Machines?
- Find the widest possible "street" between two classes, that is, the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street. Another key idea is to use kernels when training on nonlinear datasets.
- What is a support vector?
- A support vector is any instance that is located on the "street", including its border. The decision boundary is entirely determined by the support vectors. Any instance that is not a support vector (off the street) has no influence whatsoever. Computing the predictions only involves the support vectors, not the whole training set.
- Why is it important to scale the inputs when using SVMs?
- SVMs try to fit the largest possible "street" between the classes, so if the training set is not scaled, the SVM will tend to neglect small features.
- Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?
- An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, the score cannot be directly converted into an estimate of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM's scores.
- Should you use the primal or dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
- This question applies only to linear SVMs, since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m² and m³. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow. The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features.
- Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease gamma? What about C?
- Underfitting suggests there is too much regularization. To decrease it, you need to increase gamma or C (or both).
- Train a LinearSVC on a linearly separable dataset. Then train an SVC and a SGDClassifier on the same dataset. See if you can get them to produce roughly the same model.
- Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?
- Answers to questions 8 and 9:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris, load_digits, fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
from typing import Tuple
import os
np.random.seed(42) # make random number reproducible
rand_num = np.random.randint(0,100)
os.system('cls' if os.name == 'nt' else 'clear') # clear the terminal (Windows or Unix)
print("-"*20 + "QUESTION 1" + "-" * 20)
iris = load_iris(as_frame=True)
iris: pd.DataFrame = iris.frame
target = iris['target']
data = iris.drop(columns=['target'])
# Pass in random_state for reproducible output across multiple function calls
X_train, X_test, y_train, y_test = train_test_split(data,target,random_state=rand_num,test_size=0.2)
C = 5
alpha = 1 / (C * len(X_train))
# You could perform gridsearchcv to improve the models by tuning the hyperparameters
svc_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel="linear", C=C, random_state=rand_num)),
])
lin_svc_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('lin_svc', LinearSVC(loss="hinge", C=C, random_state=rand_num)),
])
sgd_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss="hinge", learning_rate="constant", eta0=0.001,
                          alpha=alpha, max_iter=1000, tol=1e-3, random_state=rand_num)),
])
svc_clf.fit(X_train,y_train)
lin_svc_clf.fit(X_train,y_train)
sgd_clf.fit(X_train,y_train)
print("SVC Classifier Score: ",svc_clf.score(X_test,y_test))
print("Linear SVC Classifier Score: ",lin_svc_clf.score(X_test,y_test))
print("SGD Classifier Score: ",sgd_clf.score(X_test,y_test))
print("-"*20 + "QUESTION 2" + "-" * 20)
digits_data = load_digits(as_frame=True)
digits_frame: pd.DataFrame = digits_data.frame
digits_series = digits_frame['target']
digits_frame = digits_frame.drop(columns=["target"])
X_train, X_test, y_train, y_test = train_test_split(digits_frame,digits_series,random_state=rand_num,test_size=0.2)
base_model = Pipeline([
    ('std', StandardScaler()),
    ('svc', SVC(kernel="linear", C=C, random_state=rand_num)),
])
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [5, 10, 20],
    "svc__gamma": ['scale', 'auto'],
}
digit_clf = GridSearchCV(base_model,param_grid=param_grid,cv=5)
digit_clf.fit(X_train,y_train)
print(digit_clf.best_params_)
print(digit_clf.score(X_test,y_test))
print("-"*20 + "QUESTION 3" + "-" * 20)
housing = fetch_california_housing()
X = housing['data']
y = housing['target']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=rand_num)
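To make the confidence score versus probability point from the questions above concrete, here is a minimal sketch. It assumes a two-class subset of the iris data and an RBF kernel with default-ish settings; those choices are mine for illustration, not something from the book.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: keep only two of the three iris classes so the
# decision function is a single score per instance.
X, y = load_iris(return_X_y=True)
mask = y != 0
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling first, since SVMs are sensitive to feature scales.
# probability=True adds an extra calibration step during fit,
# so training is slower than a plain SVC.
svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=42),
)
svm_clf.fit(X_train, y_train)

# Confidence score: signed score relative to the decision boundary
# (positive means the second class listed in classes_).
scores = svm_clf.decision_function(X_test)

# Calibrated class probabilities, one column per class.
probas = svm_clf.predict_proba(X_test)

print("decision scores:", scores[:3])
print("probabilities:  ", probas[:3])
```

The scores are unbounded, while each row of the probabilities sums to 1. Because the calibration is fit separately, the class with the highest probability can occasionally disagree with the sign of the score.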
Chapter 6 - Decision Trees
Like SVMs, Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They are very powerful algorithms, capable of fitting complex datasets.
One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don’t require feature scaling or centering at all.
Decision Trees are fairly intuitive, and their decisions are easy to interpret. Such models are often called white box models. In contrast, as we will see, Random Forests or neural networks are generally considered black box models. They make great predictions, and you can easily check the calculations that they performed to make these predictions; nevertheless, it is usually hard to explain in simple terms why the predictions were made.
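One way to see the white box property is to print the rules a fitted tree has learned. This is a small sketch using Scikit-Learn's export_text on the iris petal features; the feature name strings and random_state are my own choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# export_text renders the learned splits as nested if/else style rules.
rules = export_text(
    tree_clf, feature_names=["petal length (cm)", "petal width (cm)"]
)
print(rules)
```

Each printed branch is a threshold test on one feature, and each leaf shows the predicted class, so a prediction can be traced by hand.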
- A decision tree can estimate the probability that any instance belongs to a particular class k
- Decision trees are prone to overfitting because they are nonparametric models, so you need to regularize them, for example by restricting max_depth
- Decision Tree Pitfalls:
- They love orthogonal decision boundaries
- They are sensitive to small variations in training data
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
import os
os.system('cls' if os.name == 'nt' else 'clear') # clear the terminal (Windows or Unix)
iris = load_iris()
X = iris.data[:,2:] # petal length and width
y = iris.target
# Decision Tree Classifier
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,y)
print(tree_clf.predict_proba([[5, 1.5]]))
# Decision Tree Regressor
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X,y)
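To see why restricting max_depth matters, here is a rough sketch comparing an unrestricted tree with a shallow one. It assumes a noisy two-moons toy dataset (my choice, not the iris data used above), and the depth limit of 3 is arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a noisy toy dataset where overfitting is easy to provoke.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No restrictions: the tree keeps splitting until every leaf is pure.
deep_tree = DecisionTreeClassifier(random_state=42)
# Regularized: at most three levels of splits.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)

for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    tree.fit(X_train, y_train)
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

The unrestricted tree memorizes the training set, which is the overfitting symptom; whether the shallow tree does better on the test set depends on the data, so it is worth comparing the two numbers rather than assuming.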
Chapter 7 - Ensemble Learning and Random Forests
- Ensemble Learning means aggregating the predictions of several predictors
- You can train different classifiers/predictors on the same training data and take the class that gets the most votes, or you can train the same predictor on different random subsets of the training data
- Random Forest is an ensemble of Decision Trees, generally trained using the bagging method: different predictors are trained on random subsets of the training data, then their predictions are aggregated in some way (most frequent class for classification, average for regression)
- Random forests are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection
- Boosting refers to any Ensemble method that can combine several weak learners into a strong learner
- Check out the xgboost library
- Stacking is when we train a model to do the aggregation of the outputs of the sub-models rather than taking the average
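The three ideas above (voting, bagging with a Random Forest, and stacking) can be sketched side by side. This is only an illustration on the iris dataset; the base estimators and hyperparameters are assumptions I picked, not recommendations from the book.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Assumption: three deliberately different base models.
base_estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svc", SVC(random_state=42)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
]

ensembles = {
    # Hard voting: every base classifier votes, the majority class wins.
    "voting": VotingClassifier(estimators=base_estimators, voting="hard"),
    # Random Forest: many trees, each trained on a bootstrap sample.
    "forest": RandomForestClassifier(n_estimators=200, random_state=42),
    # Stacking: a final model learns how to combine the base outputs.
    "stacking": StackingClassifier(
        estimators=base_estimators,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

scores = {}
for name, clf in ensembles.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")

# Feature importances from the forest (they sum to 1).
importances = ensembles["forest"].feature_importances_
for feature, importance in zip(iris.feature_names, importances):
    print(f"{feature}: {importance:.3f}")
```

On a dataset this small the three ensembles will probably score about the same; the point is the API shape. The xgboost library mentioned above offers a similar fit/predict interface for boosting, but it is a separate install and is not used here.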
Chapter 8 - Dimensionality Reduction
Many machine learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality.
- Dimensionality reduction is also useful for data visualization.
- Main approaches to Dimensionality Reduction:
- Projection: projecting the points onto a lower-dimensional subspace
- Manifold Learning: modeling the lower-dimensional manifold on which the training instances lie
- Principal Component Analysis (PCA) is the most popular dimensionality reduction algorithm. It identifies the hyperplane that lies closest to the data, then it projects the data onto it
- You can drop the components that account for little variance in the data
- Dimensionality reduction is often a preparation step for a supervised learning algorithm
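A short PCA sketch, assuming Scikit-Learn's small digits dataset (64 pixel features) and a target of 95% preserved variance. Both are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 images, 64 pixel features each

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (here 95%).
# svd_solver="full" is set explicitly because this mode relies on it.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("original shape:", X.shape)
print("reduced shape: ", X_reduced.shape)
print("variance kept: ", pca.explained_variance_ratio_.sum())

# Project back to 64 dimensions; the difference from X is the
# information lost by dropping the low-variance components.
X_recovered = pca.inverse_transform(X_reduced)
```

When PCA is a preparation step for a supervised model, it can go in a Pipeline before the estimator so the projection is learned from the training folds only.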
Chapter 9 - Unsupervised Learning Techniques
Although most of the applications of Machine Learning today are based on supervised learning (and as a result, this is where most of the investments go), the vast majority of available data is actually unlabeled: we have the input features X, but do not have the labels y.
Unsupervised Learning Tasks and Algorithms:
- Clustering: the goal is to group similar instances together into clusters. This is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more
- Anomaly Detection: the objective is to learn what "normal" data looks like, and use this to detect abnormal instances, such as defective items on a production line or a new trend in a time series
- Density Estimation: task of estimating the probability density function of the random process that generated the dataset. This is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It also can be useful for data analysis and visualization
DBSCAN clustering:
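A minimal sketch of DBSCAN in Scikit-Learn, assuming a two-moons toy dataset. The eps and min_samples values are guesses that suit this toy data and would need tuning elsewhere.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

# eps: radius of the neighborhood around each instance.
# min_samples: neighbors needed within eps to count as a core instance.
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)

labels = dbscan.labels_  # cluster index per instance, -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise instances:", n_noise)
print("core instances:", len(dbscan.core_sample_indices_))
```

Unlike K-Means, the number of clusters is not given up front; it falls out of the density settings. DBSCAN has no predict method, so labeling new instances needs a separate classifier trained on the core instances.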
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
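Here is a small sketch of fitting a GMM and using it for density-based anomaly detection. It assumes synthetic blob data with three clusters, and the 4% anomaly threshold is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumption: synthetic data drawn from three roughly Gaussian blobs.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=42)

# n_init=10 reruns EM from several starts and keeps the best fit.
gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)

print("weights:", gm.weights_)
print("means:", gm.means_)

# Soft clustering: probability that each instance came from each Gaussian.
membership = gm.predict_proba(X)

# Density estimation: log of the estimated density at each instance.
log_density = gm.score_samples(X)

# Flag the lowest-density 4% of instances as anomalies.
threshold = np.percentile(log_density, 4)
anomalies = X[log_density < threshold]
print("anomalies flagged:", len(anomalies))
```

The number of components has to be chosen; comparing gm.bic(X) or gm.aic(X) across several values is one common way to pick it.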
Chapter 10 - Introduction to Artificial Neural Networks with Keras
Keras is a high-level Deep Learning API that allows you to easily build, train, evaluate and execute all sorts of neural networks. Its documentation (or specification) is available at Keras.io