Scikit-Learn Docs

Wanted to learn more about scikit-learn before continuing to learn about machine learning. These notes aren't comprehensive, and I should return to the docs to take more detailed notes on available classes/functions eventually.

Reviewing Scikit-Learn User Guide

Go over everything in the user guide

Supervised Learning

Linear Models

The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the features. In mathematical notation, if $\hat{y}$ is the predicted value:

$\hat{y}(w, x) = w_0 + w_1 x_1 + \ldots + w_p x_p$

Across linear models, the vector $w = (w_1, \ldots, w_p)$ is the coef_ and $w_0$ is the intercept_.

Ordinary Least Squares

LinearRegression fits a linear model with coefficients $w = (w_1, \ldots, w_p)$ to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. It solves a problem of the form:

$\underset{w}{\text{min}} \lVert \bm{X}w - y \rVert _2 ^2$

LinearRegression will take, in its fit method, arrays X and y and will store the coefficients $w$ of the linear model in its coef_ member. The situation of multicollinearity can arise when features are correlated with one another.

Non-negative Least Squares

It is possible to constrain all the coefficients to be non-negative, which may be useful when they represent physical quantities: set the positive keyword argument of LinearRegression to True.

Least Squares Complexity

The complexity of the least squares method is $O(n_{\text{samples}} n_{\text{features}}^2)$. It relies on the singular value decomposition of X.

Ridge Regression and Classification

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

$\underset{w}{\text{min}} \lVert \bm{X}w - y \rVert _2 ^2 + \alpha \lVert w \rVert _2 ^2$

The complexity parameter $\alpha \geq 0$ controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage and thus the more robust the coefficients become to collinearity. You can set the solver parameter of Ridge, or it chooses automatically.

Ridge Regression Solver

Classification

Ridge has a classifier variant: RidgeClassifier. The classifier first converts binary targets to {-1, 1} and treats the problem as a regression task. The predicted class corresponds to the sign of the regressor's prediction. For multiclass classification, the problem is treated as multi-output regression, and the predicted class corresponds to the output with the highest value.

Ridge Complexity

The complexity is of the same order as Ordinary Least Squares.

Setting the Regularization Parameter: Leave-one-out Ridge Regression

RidgeCV and RidgeClassifierCV implement ridge regression/classification with built-in cross-validation of the alpha parameter. They work in the same way as GridSearchCV except that they default to efficient Leave-One-Out cross-validation. Specifying the value of the cv attribute will trigger the use of cross-validation with GridSearchCV, for example cv=10 for 10-fold cross-validation, rather than Leave-One-Out cross-validation.

Lasso

The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. Mathematically, it consists of a linear model with an added regularization term. The objective function to minimize is:

$\underset{w}{\text{min}} \cfrac{1}{2n_{\text{samples}}} \lVert \bm{X}w - y \rVert _2 ^2 + \alpha \lVert w \rVert _1$

Setting Regularization Parameter

The alpha parameter controls the degree of sparsity of the estimated coefficients.

Using Cross-Validation

Scikit-Learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV. For high-dimensional datasets with many collinear features, LassoCV is most often preferable. However, LassoLarsCV has the advantage of exploring more relevant values of the alpha parameter, and if the number of samples is small compared to the number of features, it is often faster than LassoCV.
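A minimal sketch of setting alpha by cross-validation with LassoCV (the synthetic data and parameter values below are illustrative assumptions, not from the docs):

from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

# toy regression problem, purely for illustration
X, y = make_regression(n_samples=100, n_features=20, noise=1.0, random_state=0)

# LassoCV searches a grid of alpha values (its default grid here) with 5-fold CV
lasso_cv = LassoCV(cv=5, random_state=0).fit(X, y)
print("Best alpha:", lasso_cv.alpha_)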

Information-Criteria Based Model Selection

Alternatively, the estimator LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayes information criterion (BIC).

Comparison with Regularization Parameter of SVM

The equivalence between alpha and the regularization parameter of SVM, C, is given by alpha = 1 / C or alpha = 1 / (n_samples * C), depending on the estimator.

Multi-Task Lasso

The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.

Elastic-Net

Elastic-Net is a linear regression model trained with both $\ell_1$ and $\ell_2$-norm regularization of the coefficients. This combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of $\ell_1$ and $\ell_2$ using the l1_ratio parameter. A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge's stability under rotation. The objective function to minimize is:

$\underset{w}{\text{min}} \cfrac{1}{2n_{\text{samples}}} \lVert \bm{X}w - y \rVert _2 ^2 + \alpha \rho \lVert w \rVert _1 + \cfrac{\alpha (1 - \rho)}{2} \lVert w \rVert _2 ^2$

The ElasticNetCV class can be used to set the parameters alpha and l1_ratio ($\rho$) by cross-validation.
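As a rough sketch (the data and candidate values below are assumptions for illustration), ElasticNet can be fit with fixed hyperparameters or tuned with ElasticNetCV:

from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=0.5, random_state=0)

# fixed hyperparameters
enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)

# or tune alpha and l1_ratio over a candidate grid by cross-validation
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5).fit(X, y)
print("alpha:", enet_cv.alpha_, "l1_ratio:", enet_cv.l1_ratio_)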

Multi-task Elastic-Net

The MultiTaskElasticNet is an elastic-net model that estimates sparse coefficients for multiple regression problems jointly: Y is a 2D array of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks. The class MultiTaskElasticNetCV can be used to set the parameters alpha and l1_ratio by cross-validation.

Least Angle Regression

Least Angle Regression (LARS) is a regression algorithm for high-dimensional data. LARS is similar to forward stepwise regression. At each step, it finds the feature most correlated with the target. When there are multiple features having equal correlation, instead of continuing along the same feature, it proceeds in a direction equiangular between the features. The LARS model can be used via the estimator Lars.

  • The advantages of LARS:
    • Numerically efficient when the number of features is significantly greater than the number of samples
    • same order complexity as least squares
    • It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model
    • If two features are almost equally correlated with the target, then their coefficients should increase at approximately the same rate.
    • Easily modified to produce solutions for other estimators
  • The disadvantages of LARS:
    • It may be especially sensitive to noise

LARS Lasso

LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on coordinate descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.

Orthogonal Matching Pursuit (OMP)

OrthogonalMatchingPursuit or orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients. Orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements:

$\underset{w}{\text{arg min}} \lVert y - \bm{X}w \rVert _2 ^2 \text{ subject to } \lVert w \rVert _0 \leq n_{\text{nonzero-coefs}}$

Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as:

$\underset{w}{\text{arg min}} \lVert w \rVert _0 \text{ subject to } \lVert y - \bm{X}w \rVert _2 ^2 \leq \text{tol}$

OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual.
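A minimal sketch of capping the number of non-zero coefficients with OrthogonalMatchingPursuit (the synthetic data below is an assumption for illustration):

from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.datasets import make_regression

# toy problem where only a few features are truly informative
X, y = make_regression(n_samples=100, n_features=30, n_informative=5, random_state=0)

# constrain the solution to at most 5 non-zero coefficients
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)
print("Non-zero coefficients:", (omp.coef_ != 0).sum())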

Bayesian Regression

Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.

  • The advantages of Bayesian Regression:
    • It adapts to the data at hand
    • It can be used to include regularization parameters in the estimation procedure
  • The disadvantages of Bayesian Regression:
    • Inference of the model can be time consuming

Bayesian Ridge Regression

BayesianRidge estimates a probabilistic model of the regression problem as described above. Due to the Bayesian framework, the weights found are slightly different from the ones found by Ordinary Least Squares. However, Bayesian Ridge Regression is more robust to ill-posed problems.

Automatic Relevance Determination - ARD

Automatic Relevance Determination (as implemented in ARDRegression) is a kind of linear model similar to Bayesian Ridge Regression, but it leads to sparser coefficients $w$. ARD is also known in the literature as sparse Bayesian learning and the Relevance Vector Machine.

Logistic Regression

Logistic regression is implemented in LogisticRegression. Despite its name, it is implemented as a linear model for classification rather than regression in terms of the Scikit-Learn/ML nomenclature. It is also known as logit regression, maximum-entropy classification, or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional l1, l2, or Elastic-Net regularization. In the binary case, the predict_proba method of LogisticRegression predicts the probability of the positive class. You can specify regularization with the penalty argument. There are different solvers available for LogisticRegression, each with different penalty support, abilities to do multiclass classification, and behaviors.

Solvers for Logistic Regression
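A minimal sketch of a penalized logistic regression fit (the synthetic data and parameter choices here are assumptions for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# l2-penalized logistic regression with the lbfgs solver
clf = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs").fit(X, y)
print(clf.predict(X[:5]))
print(clf.predict_proba(X[:5]))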

Generalized Linear Models

Generalized Linear Models (GLM) extend linear models in two ways. First, the predicted values $\hat{y}$ are linked to a linear combination of the input variables $\bm{X}$ via an inverse link function $h$ as

$\hat{y}(w, \bm{X}) = h(\bm{X}w)$

Second, the squared loss function is replaced by the unit deviance $d$ of a distribution in the exponential family (or more precisely, a reproductive exponential dispersion model (EDM)). The minimization problem becomes:

$\underset{w}{\text{min}} \cfrac{1}{2n_{\text{samples}}} \sum_{i} d(y_i, \hat{y}_i) + \cfrac{\alpha}{2} \lVert w \rVert _2 ^2$

where $\alpha$ is the L2 regularization penalty. When sample weights are provided, the average becomes a weighted average.

EDMs and their Unit Deviance

The choice of distribution function depends on the problem at hand:

  • If the target values $y$ are counts (non-negative integer valued) or relative frequencies (non-negative), you might use a Poisson distribution with a log link
  • If the target values are positive valued and skewed, you might try a Gamma distribution with a log link
  • If the target values seem to be heavier tailed than a Gamma distribution, you might try an Inverse Gaussian distribution (or even higher variance powers of the Tweedie family)
  • If the target values $y$ are probabilities, you can use the Bernoulli distribution. The Bernoulli distribution with a logit link can be used for binary classification. The Categorical distribution with a softmax link can be used for multiclass classification.

Usage

  • TweedieRegressor implements a generalized linear model for the Tweedie distribution, which allows modeling any of the above-mentioned distributions using the appropriate power parameter:
    • power=1: Poisson distribution (PoissonRegressor)
    • power=2: Gamma distribution (GammaRegressor)
    • power=3: Inverse Gaussian distribution

Stochastic Gradient Descent

Stochastic Gradient Descent is a simple yet efficient approach to fit linear models. It is particularly useful when the number of samples is very large. The partial_fit method allows online/out-of-core learning. The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties.

Perceptron

The Perceptron is another simple classification algorithm suitable for large scale learning. By default:

  • It does not require a learning rate
  • It is not regularized (penalized)
  • It updates its model only on mistakes

The Perceptron is a wrapper around the SGDClassifier class using a perceptron loss and a constant learning rate.

Passive Aggressive Algorithms

The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization parameter C. (PassiveAggressiveClassifier and PassiveAggressiveRegressor)

Robustness Regression: Outliers and Modeling Errors

Robust regression aims to fit a regression model in the presence of corrupt data: either outliers, or errors in the model. Note that robust fitting in high-dimensional settings is very hard; the robust models will probably not work in these settings.

Things to Remember About Outliers

Huber Regression

HuberRegressor differs from Ridge because it applies a linear loss to samples that are classified as outliers. A sample is classified as an inlier if the absolute error of that sample is less than a certain threshold. Huber regression is scaling invariant. It should be more efficient to use on data with a small number of samples, while SGDRegressor needs a number of passes on the training data to produce the same robustness.
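A minimal sketch of fitting HuberRegressor on data with a few injected outliers (the data generation below is an assumption for illustration):

from sklearn.linear_model import HuberRegressor
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=100)
y[:5] += 20  # corrupt a few targets so they act as outliers

# samples with absolute error beyond the epsilon threshold get the linear (robust) loss
huber = HuberRegressor(epsilon=1.35).fit(X, y)
print("Huber coefficient:", huber.coef_)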

Quantile Regression

Quantile regression estimates the median or other quantiles of $y$ conditional on $\bm{X}$, while ordinary least squares (OLS) estimates the conditional mean. Quantile regression may be useful if one is predicting an interval instead of a point prediction.

Polynomial Regression: Extending Linear Models with Basis Functions

One common pattern in machine learning is to use linear models trained on nonlinear functions of the data. This approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range of data. Using the PolynomialFeatures transformer, you can transform an input matrix to a new data matrix of a given degree (e.g., with degree=2, $[x_1, x_2]$ becomes $[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$). A linear model trained on polynomial features is able to exactly recover the coefficients of the input polynomial.

from sklearn import linear_model 
from sklearn import preprocessing
from sklearn import pipeline
import numpy as np 
import matplotlib.pyplot as plt
## Linear Regression
reg = linear_model.LinearRegression().fit(np.array([[0, 0], [1, 1], [2, 2]]), np.array([0, 1, 2])) 
print("Lienar Regression:",reg.coef_)

## Ridge Regression
ridge_reg = linear_model.Ridge(alpha=0.5).fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print("Ridge Regression:",ridge_reg.coef_)
ridge_reg_cv = linear_model.RidgeCV(alphas=np.logspace(-6,6,13))
ridge_reg_cv.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print(r"Ridge Hyperparameter Search $ \alpha $:",ridge_reg_cv.alpha_)

## Lasso Regression
lasso_reg = linear_model.Lasso(alpha=0.1)
lasso_reg.fit([[0, 0], [1, 1]], [0, 1])
print("Lasso Regression:",lasso_reg.coef_)

## LARS Lasso 
lars_lasso = linear_model.LassoLars(alpha=0.1)
lars_lasso.fit([[0, 0], [1, 1]], [0, 1])
print("LARS Lasso:",lars_lasso.coef_)

## Bayesian Ridge Regression
bay_ridge = linear_model.BayesianRidge()
bay_ridge.fit([[0., 0.], [1., 1.], [2., 2.], [3., 3.]],[0., 1., 2., 3.])
print("Bayesian Ridge Regression:",bay_ridge.coef_)

# Tweedie Regressor
tweedie = linear_model.TweedieRegressor(power=1,alpha=0.5,link="log")
tweedie.fit([[0, 0], [0, 1], [2, 2]], [0, 1, 2])
print("Tweedie Regressor:",tweedie.coef_)

## Polynomial Regression
model = pipeline.Pipeline([('poly', preprocessing.PolynomialFeatures(degree=3)),
                  ('linear', linear_model.LinearRegression(fit_intercept=False))])
# fit to an order-3 polynomial data
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
model = model.fit(x[:, np.newaxis], y)
print("Polynomial Regresison:",model.named_steps['linear'].coef_)
Output:

Linear Regression: [0.5 0.5]
Ridge Regression: [0.34545455 0.34545455]
Ridge Hyperparameter Search $ \alpha $: 0.01
Lasso Regression: [0.6 0. ]
LARS Lasso: [0.6 0. ]
Bayesian Ridge Regression: [0.49999993 0.49999993]
Tweedie Regressor: [0.24631611 0.43370317]
Polynomial Regression: [ 3. -2. 1. -1.]

Linear and Quadratic Discriminant Analysis

Linear Discriminant Analysis (LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (QuadraticDiscriminantAnalysis) are two classic classifiers with, as their names suggest, a linear and a quadratic decision surface, respectively. These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no hyperparameters to tune. (Image below shows decision boundaries - the bottom row shows that Linear Discriminant Analysis can only learn linear boundaries.)

Linear Discriminant and Quadratic Discriminant Analysis

Dimensionality Reduction using Linear Discriminant Analysis

LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the output is necessarily less than the number of classes, so this is in general a rather strong dimensionality reduction. This is implemented in the transform method. The desired dimensionality can be set using the n_components parameter.

Mathematical Formulation of the LDA and QDA Classifiers

Both LDA and QDA can be derived from simple probabilistic models which model the class conditional distribution of the data $P(\bm{X} \mid y = k)$ for each class $k$.

Shrinkage and Covariance Estimator

Shrinkage is a form of regularization used to improve the estimation of covariance matrices in situations where the number of training samples is small compared to the number of features. In this scenario, the empirical sample covariance is a poor estimator, and shrinkage helps improve the generalization performance of the classifier. Shrinkage LDA can be used by setting the shrinkage parameter of the LinearDiscriminantAnalysis class to 'auto'.
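A minimal sketch of both uses on the iris dataset (the dataset and parameter values here are illustrative assumptions):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# supervised dimensionality reduction to at most n_classes - 1 directions
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit(X, y).transform(X)
print(X_reduced.shape)

# shrinkage requires the lsqr or eigen solver
lda_shrink = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
print(lda_shrink.score(X, y))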

Kernel Ridge Regression

Kernel Ridge Regression (KRR) combines Ridge Regression and classification (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space. The form of the model learned by KernelRidge is identical to support vector regression SVR, but different loss functions are used.
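A minimal sketch of KernelRidge with an RBF kernel (the synthetic sine data and hyperparameters are assumptions for illustration):

from sklearn.kernel_ridge import KernelRidge
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)

# the RBF kernel makes ridge regression a non-linear regressor in the original space
krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
print(krr.predict(X[:3]))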

Support Vector Machines

Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression, and outliers detection.

  • Advantages of Support Vector Machines:
    • Effective in high dimensional space
    • Still effective in cases where number of dimensions is greater than the number of samples
    • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient
    • Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but you can specify custom kernels
  • Disadvantages of Support Vector Machines:
    • If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial
    • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation

Classification

SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification on a dataset.

Support Vector Classification

SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations. On the other hand, LinearSVC is another, faster implementation of Support Vector Classification for the case of a linear kernel. Like other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of shape (n_samples, n_features) holding the training samples, and an array y of class labels of shape (n_samples). An SVM's decision function depends on some subset of the training data, called the support vectors. Some properties of these support vectors can be found in the attributes support_vectors_, support_, and n_support_.

Multi-class Classification

SVC and NuSVC implement the "one-versus-one" approach for multi-class classification. In total n_classes * (n_classes - 1) / 2 classifiers are constructed and each one trains on data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows monotonically transforming the results of the "one-versus-one" classifiers to a "one-vs-rest" decision function of shape (n_samples, n_classes). LinearSVC implements a "one-vs-the-rest" multi-class strategy, thus training n_classes models.

Scores and Probabilities

The decision_function method of SVC and NuSVC gives per-class scores for each sample (or a single score per sample in the binary case). When the constructor option probability is set to True, class membership probability estimates (from the methods predict_proba and predict_log_proba) are enabled. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM's scores, fit by an additional cross-validation on the training data. The cross-validation involved in Platt scaling is an expensive operation for large datasets.

Unbalanced Problems

In problems where it is desired to give more importance to certain classes or certain individual samples, the parameters class_weight and sample_weight can be used. SVC (but not NuSVC) implements the parameter class_weight in the fit method.

SVC, NuSVC, SVR, NuSVR, LinearSVC, LinearSVR and OneClassSVM implement also weights for individual samples in the fit method through the sample_weight parameter. Similar to class_weight, this sets the parameter C for the i-th example to C * sample_weight[i], which will encourage the classifier to get these samples right.

Regression

The method of Support Vector Classification can be extended to solve regression problems. This method is called Support Vector Regression. The model produced by support vector classification depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function ignores samples whose prediction is close to their target. There are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers the linear kernel, while NuSVR implements a slightly different formulation than SVR and LinearSVR.

Density Estimation, Novelty Detection

The class OneClassSVM implements a One-Class SVM which is used in outlier detection

Complexity

Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating support vectors from the rest of the training data.

Tips on Practical Use

  • Avoid Data Copy: Make sure data passed in is C-ordered (check NumPy flags attribute) and double precision
  • Kernel Cache Size: the size of the kernel Cache has a strong impact on run times for larger problems
  • Setting C: C is 1 by default and it's a reasonable default choice. If you have a lot of noisy observations you should try to decrease it: decreasing C corresponds to more regularization.
  • Support Vector Machines are not scale invariant, so it is highly recommended to scale your data.
  • Regarding shrinkage: "We found that if the number of iterations is large, then shrinking can shorten the training time. However, if we loosely solve the optimization problem (e.g., by using a large stopping tolerance), the code without using shrinking may be much faster"
  • Parameter nu in NuSVC/OneClassSVM/NuSVR approximates the fraction of training errors and support vectors
  • In SVC, if the data is unbalanced (many positive and few negative), set class_weight='balanced'
  • Randomness of the underlying implementations: set the random_state parameter to control randomness
  • Using L1 penalization as provided by LinearSVC(penalty='l1', dual=False) yields a sparse solution - only a subset of feature weights is different from zero and contribute to the decision function

Kernel Functions

Different kernel functions can be set with the kernel parameter

Kernel Functions

from sklearn import svm
rbf_svc = svm.SVC(kernel='rbf')

Parameters of the RBF Kernel

When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. gamma defines how much influence a single training example has. Proper choice of C and gamma is critical to the SVM's performance.
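A small sketch of searching over C and gamma with GridSearchCV (the grid values and synthetic data are illustrative assumptions):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# a small, illustrative grid over C and gamma
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)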

Mathematical Formulation

A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier. The figure below shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called “support vectors”:

Support vectors

from sklearn import svm
# Support Vector Classifier
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
print("Support Vector Classifer:",clf.predict([[0,1],[1,0],[0.5,0.5],[0,0]]))

# Multiclass Classification
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X, Y)
dec = clf.decision_function([[1]])
print("SVM MultiOutput Classification Shape: ",dec.shape[1]) # 6 classes: 4*3/2 = 6

clf.decision_function_shape = "ovr"
dec = clf.decision_function([[1]])
print("SVM MultiOutput Classification Shape: ",dec.shape[1]) # 4 classes

# Linear SVC
lin_clf = svm.LinearSVC(dual="auto")
lin_clf.fit(X, Y)
dec = lin_clf.decision_function([[1]])
print("Linear SVM MultiOutput Classification Shape: ",dec.shape[1]) 

## SVR
from sklearn import svm
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
regr = svm.SVR()
regr.fit(X, y)
print("Linear SVR:",regr.predict([[1, 1]]))
Output:

Support Vector Classifier: [1 1 1 0]
SVM OvO Decision Function Shape: 6
SVM OvR Decision Function Shape: 4
Linear SVC Decision Function Shape: 4
SVR: [1.5]

from sklearn.datasets import make_moons
X_moons, y_moons = make_moons()

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning. It is often used in text classification and natural language processing. SGD is merely an optimization technique and does not correspond to a specific family of machine learning models. It is only a way to train a model.

  • The advantages of Stochastic Gradient Descent:
    • Efficiency
    • Ease of implementation (lots of opportunity for code tuning)
  • The disadvantages of Stochastic Gradient Descent:
    • Requires a number of hyperparameters
    • Sensitive to feature scaling

Classification

The class SGDClassifier implements a plain Stochastic Gradient Descent learning routine which supports different loss functions and penalties for classification. The concrete loss function can be set via the loss parameter. The concrete penalty can be set via the penalty parameter. SGDClassifier supports multi-class classification by combining multiple binary classifiers in a "one versus all" (OVA) scheme. For each of the $K$ classes, a binary classifier is learned that discriminates between that class and all other $K - 1$ classes. SGDClassifier supports both weighted classes and weighted instances via the fit parameters class_weight and sample_weight.
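A minimal sketch of SGDClassifier with an explicit loss and penalty (the data and settings below are assumptions for illustration):

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# hinge loss + l2 penalty corresponds to a linear SVM trained by SGD
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000, tol=1e-3).fit(X, y)
print(clf.predict(X[:5]))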

Regression

The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet. The concrete loss can be set via the loss parameter. The epsilon and penalty hyperparameters can also be set.

Online One-Class SVM

The class sklearn.linear_model.SGDOneClassSVM implements an online linear version of the One Class SVM using a stochastic gradient descent.

Complexity

The major advantage of SGD is efficiency, which is basically linear in the number of training examples.

Stopping Criteria

The classes SGDClassifier and SGDRegressor provide two criteria to stop the algorithm when a given level of convergence is reached:

  1. With early_stopping=True, the input data is split into a training set and a validation set. The model is then fitted on the training set, and the stopping criterion is based on the prediction score computed on the validation set. The size of the validation set can be set with validation_fraction (see the sketch after this list)
  2. With early_stopping=False, the model is fitted on the entire input data and the stopping criterion is based on the objective function computed on the training data.
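A minimal sketch of the first criterion, early stopping on a held-out validation fraction (the parameter values are illustrative assumptions):

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# hold out 20% of the data and stop when the validation score stops improving
clf = SGDClassifier(early_stopping=True, validation_fraction=0.2,
                    n_iter_no_change=5, max_iter=1000).fit(X, y)
print("Epochs actually run:", clf.n_iter_)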

Tips on Practical Use

  • Sensitive to feature scaling, so it is recommended to scale the data to have a variance of 1 and a mean of 0
  • Finding a reasonable regularization term α\alphaα is best done using automatic hyper-parameter search, GridSearchCV or RandomizedSearchCV
  • Empirically, we found that SGD converges after observing approximately 10^6 training samples. Thus, a reasonable first guess for the number of iterations is max_iter = np.ceil(10**6 / n), where n is the size of the training set.

Nearest Neighbors

sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, such as manifold learning and spectral clustering. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning) or vary based on the local density of points (radius-based neighbor learning). The distance can be anything - most commonly Euclidean. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data. Being non-parametric, they are often successful in classification situations where the decision boundary is very irregular.

Unsupervised Nearest Neighbors

NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the algorithm parameter. The kneighbors method can be used to query for the nearest neighbors.

Nearest Neighbor Classification

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

Scikit-Learn implements two different nearest neighbor classifiers: KNeighborsClassifier implements learning based on the $k$ nearest neighbors of each query point, where $k$ is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius $r$ of each training point, where $r$ is a floating-point value specified by the user. The weights keyword can be set to control the weights of neighbors.
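A minimal sketch of both classifiers on the iris dataset (the dataset and parameter values are chosen here just for illustration):

from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# weight neighbors by the inverse of their distance instead of uniformly
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)
print(knn.predict(X[:3]))

# use all neighbors within a fixed radius instead of a fixed count
rnn = RadiusNeighborsClassifier(radius=1.0).fit(X, y)
print(rnn.predict(X[:3]))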

Nearest Neighbor Regression

Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.

Scikit-Learn implements KNeighborsRegressor and RadiusNeighborsRegressor. The weights parameter can again be set.

Nearest Centroid Classifier

The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. It has no parameters to choose, making it a good baseline classifier. It has a shrink_threshold parameter, which implements the nearest shrunken centroid classifier.

Nearest Neighbors Transformer

You can compute a sparse nearest neighbors graph with KNeighborsTransformer and RadiusNeighborsTransformer. This has multiple benefits.

Neighborhood Components Analysis

Neighborhood Component Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning algorithm which aims to improve the accuracy of a nearest neighbor classification compared to the standard Euclidean distance. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear projection of data that can be used for data visualization and fast classification. It can be used for classification or dimensionality reduction.
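A rough sketch of NCA used as a supervised dimensionality reduction step in front of a KNN classifier (the pipeline and parameters below are illustrative assumptions):

from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learn a 2D projection with NCA, then classify with KNN in the projected space
nca_knn = Pipeline([
    ("nca", NeighborhoodComponentsAnalysis(n_components=2, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
nca_knn.fit(X_train, y_train)
print(nca_knn.score(X_test, y_test))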

from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
print(distances, indices)
Output:

[[0. 1. ]
[0. 1. ]
[0. 1.41421356]
[0. 1. ]
[0. 1. ]
[0. 1.41421356]] [[0 1]
[1 0]
[2 1]
[3 4]
[4 3]
[5 4]]

Gaussian Processes

Gaussian Processes (GP) are a nonparametric learning algorithm used to solve regression and probabilistic classification problems.

  • Advantages
    • prediction interpolates the observations
    • the prediction is probabilistic (Gaussian)
    • Versatile: different kernels can be specified
  • disadvantages
    • Implementation is not sparse
    • Lose efficiency in high-dimensional spaces

GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. GaussianProcessClassifier implements Gaussian processes (GP) for classification purposes, more specifically for probabilistic classification, where test predictions take the form of class probabilities. Kernels are a crucial ingredient of GPs, which determine the shape of the prior and posterior of the GP. They encode the assumptions on the function being learned by defining the "similarity" of two datapoints combined with the assumption that similar datapoints should have similar target values.
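A minimal sketch of GaussianProcessRegressor with an explicit kernel (the toy sine data, kernel choice, and noise level are assumptions for illustration):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel()

# constant * RBF is a common starting choice of kernel
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2).fit(X, y)
mean, std = gpr.predict(X[:3], return_std=True)  # probabilistic predictions
print(mean, std)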

Cross decomposition

The cross decomposition module contains supervised estimators for dimensionality reduction and regression, belonging to the "Partial Least Squares" family.

Cross decomposition algorithms find the fundamental relations between two matrices (X and Y). They are latent variable approaches to modeling the covariance structures in these two spaces. They will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. In other words, PLS projects both X and Y into a lower-dimensional subspace such that the covariance between transformed(X) and transformed(Y) is maximal.

Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable.

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

  • Advantages
    • Simple to understand and interpret
    • Requires little data preparation (other techniques require data normalization)
    • Cost of using the tree is logarithmic in the number of data points used to train the tree
    • Can handle numerical and categorical data
    • Able to handle multi-output classification
    • Uses a white box model
    • Possible to validate model using statistical tests
    • Performs well even if its assumptions are somewhat violated by the true model
  • disadvantages
    • Can create models that do not generalize to data well
    • Trees can be unstable because small variations in the data might result in a completely different tree being generated
    • Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations
    • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR
    • Create biased trees if some classes dominate

Use DecisionTreeClassifier for classification and DecisionTreeRegressor for regression; the sklearn.tree.plot_tree(classifier) function plots a fitted tree.
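A minimal sketch of fitting and plotting a classification tree (the iris dataset and max_depth value are illustrative choices):

from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# limit the depth to keep the tree small and interpretable
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict(X[:3]))
# tree.plot_tree(clf)  # renders the fitted tree with matplotlib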

Tips on Practical Use

  • They can overfit on data with a large number of features
  • Consider performing dimensionality reduction beforehand
  • Visualize your tree as you are training by using the export function
  • Remember that the number of samples required to populate the tree doubles for each additional level the tree grows to. Use max_depth to control the size of the tree and prevent overfitting
  • Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered
  • Balance the dataset before training to prevent the tree from being biased toward the classes that are dominant
  • If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criteria such as min_weight_fraction_leaf
  • All decision trees use np.float32 arrays internally
  • If the input matrix X is very sparse, it is recommended to convert it to a sparse csc_matrix before calling fit and a sparse csr_matrix before calling predict

Ensembles: Gradient Boosting, Random Forests, Bagging, Voting, Stacking

Ensemble methods combine the predictions of several base estimators built with a given learning algorithm to improve generalizability / robustness over a single estimator. Two famous examples of ensemble methods are gradient boosted trees and random forests. Ensemble models can be applied to any base learner beyond trees, in averaging methods such as bagging, model stacking, or voting, or in boosting, such as AdaBoost.

Gradient Boosting Trees

Gradient Tree Boosting or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differentiable loss functions. GBDT is an excellent model for both regression and classification, in particular for tabular data. HistGradientBoostingClassifier and HistGradientBoostingRegressor can be orders of magnitude faster than GradientBoostingClassifier and GradientBoostingRegressor when the number of samples is larger than tens of thousands. These histogram-based predictors have built-in support for missing values. They also have support for categorical data that is often better than one-hot encoding. The two most important parameters for GradientBoostingClassifier and GradientBoostingRegressor are n_estimators and learning_rate. You can also control the tree size and loss function.
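A minimal sketch of both flavors of gradient boosting (the synthetic data and parameter values are assumptions for illustration):

from sklearn.ensemble import HistGradientBoostingClassifier, GradientBoostingRegressor
from sklearn.datasets import make_classification, make_regression

# histogram-based classifier, well suited to larger sample counts
X, y = make_classification(n_samples=1000, random_state=0)
hgb = HistGradientBoostingClassifier(max_iter=100).fit(X, y)
print(hgb.score(X, y))

# classic gradient boosting regressor with its two key parameters
Xr, yr = make_regression(n_samples=500, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1).fit(Xr, yr)
print(gbr.score(Xr, yr))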

Random Forests and Other Randomized Tree Ensembles

The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. In random forests (RandomForestClassifier and RandomForestRegressor), each tree in the ensemble is built from a sample drawn with replacement from the training set.

The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yield decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant hence yielding an overall better model.

A competitive alternative to random forests are histogram-based gradient boosting models. In extremely randomized trees (ExtraTreesClassifier and ExtraTreesRegressor), randomness goes one step further in the way splits are computed. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging the estimates of predictive ability over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.
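A minimal sketch of both forest variants and their feature importances (the synthetic data and settings are chosen here just for illustration):

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# impurity-based feature importances averaged over the trees
print(rf.feature_importances_)
print(et.feature_importances_)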

Bagging Meta-Estimator

In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction. These methods are used as a way to reduce the variance of a base estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it. In many cases, bagging methods constitute a very simple way to improve with respect to a single model, without making it necessary to adapt the underlying base algorithm. As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).

  • Bagging methods differ from each other in the way that they draw random subsets of the training set:
    • When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting
    • When samples are drawn with replacement, then the method is known as Bagging
    • When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces
    • Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches

Bagging methods are offered as a unified BaggingClassifier meta-estimator (BaggingRegressor), taking as input a user-specified estimator along with parameters specifying the strategy to draw random subsets.
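A rough sketch of the random-patches flavor, bagging KNN classifiers on random halves of the samples and features (the base estimator and fractions are illustrative assumptions):

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# draw random subsets of both samples and features for each base estimator
bagging = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5,
                            max_features=0.5, random_state=0).fit(X, y)
print(bagging.score(X, y))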

Voting Classifier

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

In contrast to majority voting (hard voting), soft voting returns the class label as argmax of the sum of predicted probabilities.

Voting Regressor

The idea behind the VotingRegressor is to combine conceptually different machine learning regressors and return the average predicted values. Such a regressor can be useful for a set of equally well performing models in order to balance out their individual weaknesses.

Stacked Generalization

Stacked generalization is a method for combining estimators to reduce their biases. More precisely, the predictions of each individual estimator are stacked together and used as input to a final estimator to compute the prediction. This final estimator is trained through cross-validation. The StackingClassifier and StackingRegressor provide such strategies which can be applied to classification and regression problems.
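A minimal sketch of a stacked classifier (the choice of base estimators and final estimator is an illustrative assumption):

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# cross-validated predictions of the base estimators feed the final estimator
estimators = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
              ("knn", KNeighborsClassifier())]
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression()).fit(X, y)
print(stack.score(X, y))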

AdaBoost

The module sklearn.ensemble includes the popular boosting algorithm AdaBoost, introduced in 1995 by Freund and Schapire. The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.

Multiclass and Multioutput Algorithms

Multi Output Classifier

The modules in this section implement meta-estimators, which require a base estimator to be provided in their constructor. Meta-estimators extend the functionality of the base estimator to support multi-learning problems, which is accomplished by transforming the multi-learning problem into a set of simpler problems, then fitting one estimator per problem.

This section covers two modules: sklearn.multiclass and sklearn.multioutput.

Multiclass classification is a classification task with more than two classes. Each sample can only be labeled as one class. The one-versus-rest strategy, also known as one-versus-all, is implemented in OneVsRestClassifier. The strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice. OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received the most votes is selected. In the event of a tie (among two classes with an equal number of votes), it selects the class with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels computed by the underlying binary classifiers. Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity.
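A minimal sketch of the one-vs-rest wrapper (the base estimator and dataset are illustrative assumptions):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# one binary classifier is fitted per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_), ovr.predict(X[:3]))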

Multilabel classification (closely related to multioutput classification) is a classification task labeling each sample with m labels from n_classes possible classes, where m can be 0 to n_classes inclusive. This can be thought of as predicting properties of a sample that are not mutually exclusive.

Multiclass-multioutput classification (also known as multitask classification) is a classification task which labels each sample with a set of non-binary properties. Both the number of properties and the number of classes per property is greater than 2. A single estimator thus handles several joint classification tasks. This is both a generalization of the multilabel classification task, which only considers binary attributes, as well as a generalization of the multiclass classification task, where only one property is considered.

Multioutput regression predicts multiple numerical properties for each sample. Each property is a numerical variable and the number of properties to be predicted for each sample is greater than or equal to 2.

Feature Selection

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets. VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
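A minimal sketch of VarianceThreshold with its default threshold of zero (the toy matrix is an assumption for illustration):

from sklearn.feature_selection import VarianceThreshold

# the first and last columns are constant across all samples
X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

# with the default threshold of 0.0, zero-variance features are dropped
selector = VarianceThreshold()
print(selector.fit_transform(X))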

Semi-supervised Learning

Semi-supervised learning is a situation in which some of the samples in your training data are not labeled. The semi-supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data to better capture the shape of the underlying data distribution and generalize better to new samples.

Isotonic Regression

The class IsotonicRegression fits a non-decreasing real function to 1-dimensional data

Probability Calibration

When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not support probability prediction (e.g., some instances of SGDClassifier). The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.

Neural Network Models (Supervised)

Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function $f: \mathbb{R}^m \rightarrow \mathbb{R}^o$ by training on a dataset, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output. Given a set of features $X = x_1, x_2, \ldots, x_m$ and a target $y$, it can learn a non-linear function approximator for either classification or regression. It is different from logistic regression in that between the input and the output layer there can be one or more non-linear layers, called hidden layers. Below is a one-hidden-layer MLP with scalar output.

One Hidden Layer MLP
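A minimal sketch of MLPClassifier with a single hidden layer (the layer size, iteration budget, and synthetic data are illustrative assumptions):

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# one hidden layer of 50 units
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0).fit(X, y)
print(mlp.predict(X[:5]))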

Unsupervised Learning

Gaussian Mixture models

sklearn.mixture is a package which enables one to learn Gaussian Mixture Models (diagonal, spherical, tied and full covariance matrices supported), sample them, and estimate them from data. A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.

Manifold Learning

Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.

Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.

Clustering

Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on train data, and a function, that, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.

Biclustering

Biclustering algorithms simultaneously cluster rows and columns of a data matrix. These clusters of rows and columns are known as biclusters. Each determines a submatrix of the original data matrix with some desired properties.

Covariance Estimation

Many statistical problems require the estimation of a population’s covariance matrix, which can be seen as an estimation of data set scatter plot shape. Most of the time, such an estimation has to be done on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation’s quality. The sklearn.covariance package provides tools for accurately estimating a population’s covariance matrix under various settings.

Novelty and Outlier Detection

outlier detection: The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.

novelty detection: The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.

Density Estimation

Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (GaussianMixture), and neighbor-based approaches such as the kernel density estimate (KernelDensity). Gaussian Mixtures are discussed more fully in the context of clustering, because the technique is also useful as an unsupervised clustering scheme.

Model Selection and Evaluation

Cross-validation: Evaluation Estimator Performance

Flowchart of Cross Validation Workflow in Model Training

In Scikit-Learn, a random split into training and test sets can be quickly computed with the train_test_split helper function.

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set. However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following procedure is performed for each of the k "folds":

  • A model is trained using $k - 1$ of the folds as training data
  • The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy)

The performance measure reported by k-fold cross validation is then the average of the values computed in the loop.

Cross Validation Visualization

Computing Cross-Validated Metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset. Set the folds with the cv parameter. The cross_validate function differs from cross_val_score in two ways:

  • It allows multiple metrics for evaluation
  • It returns a dict containing fit-times, score-times (and optionally training scores, fitted estimators, train-test split indices) in addition to the test score

The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only validation strategies that assign all elements to a test set exactly once can be used. This function is appropriate for visualization of predictions and for ensemble methods.
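
A minimal sketch of how these three helpers might be used (the iris data and the SVC settings are just illustrative choices, not prescribed by the notes above):

# Cross-validated metrics
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)                                     # one score per fold
results = cross_validate(clf, X, y, cv=5, scoring=['accuracy', 'f1_macro'])   # dict of fit/score times and test scores
preds = cross_val_predict(clf, X, y, cv=5)                                    # one prediction per sample
print(scores.mean(), sorted(results.keys()), preds.shape)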

Cross Validation Iterators

  • KFold divides all the samples into k groups of samples, called folds, of equal sizes. The prediction function is learned using k − 1 folds, and the fold left out is used for testing.
  • RepeatedKFold repeats K-Fold n times.
  • LeaveOneOut is a simple cross validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets.
  • LeavePOut is very similar to LeaveOneOut as it creates all the possible training/test sets by removing p samples from the complete set.
  • The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.
  • StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
  • StratifiedShuffleSplit is a variation of ShuffleSplit, which returns stratified splits, i.e which creates splits by preserving the same percentage for each target class as in the complete set.
  • GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training sets.
  • There are others...

Cross validation iterators can be used to split test and train sets. train_test_split is a wrapper around ShuffleSplit and thus only allows for stratified splitting (using the class labels) and cannot account for groups. Use TimeSeriesSplit for time series data.
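
A rough sketch of iterating over the splits produced by KFold and StratifiedKFold (the toy arrays are made up for illustration):

# Cross validation iterators
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):   # y is needed so each fold keeps the class balance
    print("train:", train_idx, "test:", test_idx)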

A Note on Shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result.

Permutation Test Score

permutation_test_score offers another way to evaluate the performance of classifiers. It provides a permutation based p-value, which represents how likely an observed performance of the classifier would be obtained by chance.

Tuning the hyper-parameters of an estimator

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.

It is possible to search the hyper-parameter space for the best cross-validation score. Any parameter provided when constructing an estimator may be optimized in this manner. To get the names and current values for all parameters for a given estimator, use estimator.get_params(). A search consists of:

  • an estimator (regressor or classifier)
  • a parameter space
  • a method for searching and sampling candidates
  • a cross-validation scheme
  • a score function

Two generic approaches to parameter search are provided in scikit-learn: for given values, GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution. Both of these tools have successive halving counterparts: HalvingGridSearchCV and HalvingRandomSearchCV, which can be much faster at finding a good parameter combination. Note that it is common that a small subset of those parameters can have a large impact on the predictive or computation performance of the model while others can be left to their default values.

Exhaustive Grid Search

The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter.
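
A minimal sketch of a grid search (the grid values and the iris/SVC choice are arbitrary examples):

# Exhaustive grid search
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV
X, y = datasets.load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)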

Randomized Parameter Optimization

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favorable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values (a brief sketch follows the list below). This has two main benefits over exhaustive search:

  • A budget can be chosen independent of the number of parameters and possible values
  • Adding parameters that do not influence the performance does not decrease efficiency
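
As mentioned above, here is a brief sketch of a randomized search; the distributions and the n_iter budget are illustrative, and loguniform assumes a SciPy version that provides it:

# Randomized parameter optimization
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV
X, y = datasets.load_iris(return_X_y=True)
param_distributions = {'C': loguniform(1e-2, 1e2), 'gamma': loguniform(1e-4, 1e0)}
search = RandomizedSearchCV(svm.SVC(), param_distributions, n_iter=20, cv=5, random_state=0)  # n_iter sets the budget
search.fit(X, y)
print(search.best_params_)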

Searching for Optimal Parameters with Successive Halving

HalvingGridSearchCV and HalvingRandomSearchCV search the parameter space with successive halving. Successive halving (SH) is like a tournament among candidate parameter combinations. SH is an iterative selection process where all candidates (the parameter combinations) are evaluated with a small amount of resources at the first iteration. Only some of these candidates are selected for the next iteration, which will be allocated more resources. For parameter tuning, the resource is typically the number of training samples, but it can be any numeric parameter.

Tips for Parameter Search

  • Specifying an objective metric: Parameter search uses the score function of the estimator to evaluate a parameter setting by default. For some applications, other scoring functions are better suited. You can set the scoring parameter.
  • Specifying multiple metrics for evaluation
  • Composite estimators and parameter spaces: You can search over parameters of composite or nested estimators (like Pipeline) using the <estimator>__<parameter> syntax.
  • Model selection: development and evaluation: It is important to evaluate models on held-out samples that were not seen during the grid search process.
  • Parallelism: The parameter search tools evaluate each parameter combination on each data fold independently. Computation can run in parallel by using n_jobs=-1.
  • Robustness to failure: Setting error_score=0 will make the procedure robust to a failure of some parameter settings to fit one or more folds of the data.

Alternatives to Brute Force Parameter Search

Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for a single value of the parameter.

Some Models Implement Own Search of Parameter Space

Information Criteria

Some models can offer an information-theoretic closed-form formula of the optimal estimate of the regularization parameter by computing a single regularization path (instead of several when using cross-validation). LassoLarsIC is one example.

Out of Bag Estimates

When using ensemble methods based upon bagging, i.e. generating new training sets using sampling with replacement, part of the training set remains unused. For each classifier in the ensemble, a different part of the training set is left out.

Tuning the decision threshold for class prediction

Classification is best divided into two parts:

  • a statistical problem of learning a model to predict, ideally, class probabilities (predict_proba or decision_function)
  • the decision problem to take concrete action based on those probability predictions (predict)

In binary classification, a decision rule or action is then defined by thresholding the scores, leading to the prediction of a single class label for each sample. For binary classification in scikit-learn, class label predictions are obtained by hard-coded cut-off rules: the positive class is predicted when the conditional probability P(y | X) is greater than 0.5 (obtained with predict_proba) or when the decision score is greater than 0 (obtained with decision_function). These hard-coded rules are not ideal for most use cases, e.g. when you want a high recall rate.

Post Tuning the Decision Threshold

One solution to address the problem stated in the introduction is to tune the decision threshold of the classifier once the model has been trained. The TunedThresholdClassifierCV tunes this threshold using an internal cross-validation. The optimum threshold is chosen to maximize a given metric.
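
A sketch of what this might look like, assuming a recent scikit-learn release that ships TunedThresholdClassifierCV; the imbalanced toy data and the scoring choice are illustrative:

# Post-tuning the decision threshold
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV
X, y = make_classification(weights=[0.9, 0.1], random_state=0)   # imbalanced toy problem
tuned = TunedThresholdClassifierCV(LogisticRegression(), scoring="balanced_accuracy", cv=5)
tuned.fit(X, y)
print(tuned.best_threshold_)   # threshold chosen to maximize the scoring metric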

Metrics and Scoring: Qualifying the Quality of Predictions

There are three different APIs for evaluating the quality of a model's predictions:

  • Estimator score method
    • Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve.
  • scoring parameter
    • Model-evaluation tools using cross-validation rely on an internal scoring strategy
  • Metric Functions
    • The sklearn.metrics module implements functions assessing prediction error for specific purposes

This Page Mainly Goes Over Scoring Functions for Evaluating Models

The confusion_matrix function evaluates classification accuracy by computing the confusion matrix with each row corresponding to the true class. ConfusionMatrixDisplay can be used to visually represent a confusion matrix as shown in the Confusion matrix example. The classification_report function builds a text report showing the main classification metrics. Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples. There is a lot of stuff here for Regression metrics, classification metrics, and clustering metrics.
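
A minimal sketch of the confusion matrix and classification report on made-up labels:

# Classification metrics
from sklearn.metrics import confusion_matrix, classification_report
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
print(confusion_matrix(y_true, y_pred))       # each row corresponds to a true class
print(classification_report(y_true, y_pred))  # precision, recall, f1 per class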

Validation Curves: Plotting Scores to Evaluate Models

Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance, and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data.

Validation Curve

It is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values. The function validation_curve can help in this case.
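
A rough sketch of validation_curve over the SVC gamma parameter (the estimator and parameter range are illustrative):

# Validation curve
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import validation_curve
X, y = datasets.load_iris(return_X_y=True)
param_range = np.logspace(-6, -1, 5)
train_scores, valid_scores = validation_curve(
    svm.SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)
print(train_scores.mean(axis=1), valid_scores.mean(axis=1))   # one mean score per gamma value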

Learning Curve

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

# Hold out 40% of the iris data as a test set, then fit and score a linear SVC
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
out[15]

(150, 4) (150,)
(90, 4) (90,)

0.9666666666666667

Inspection

Predictive performance is often the main goal of developing machine learning models. Yet summarizing performance with an evaluation metric is often insufficient: it assumes that the evaluation metric and test dataset perfectly reflect the target domain, which is rarely true. In certain domains, a model needs a certain level of interpretability before it can be deployed. The sklearn.inspection module provides tools to help understand the predictions from a model and what affects them. This can be used to evaluate assumptions and biases of a model, design a better model, or to diagnose issues with model performance.

Partial Dependence and Individual Conditional Expectation Plots

Partial dependence plots (PDP) and individual conditional expectation (ICE) plots can be used to visualize and analyze interaction between the target response and a set of input features of interest. Partial Dependence plots (PDP) show the dependence between the target response and a set of input features of interest, marginalizing over the values of all other input features (the 'complement' features). An individual conditional expectation (ICE) plot shows the dependence between the target function and an input feature of interest. However, unlike a PDP, which shows the average effect of the input feature, an ICE plot visualizes the dependence of the prediction on a feature for each sample separately with one line per sample.
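
A sketch of plotting PDP and ICE curves together (assumes matplotlib is installed; the dataset, model, and feature indices are arbitrary):

# Partial dependence and ICE plots
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay
X, y = make_friedman1(random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1], kind="both")  # overlay the average PDP and per-sample ICE lines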

Permutation Feature Importance

Permutation feature importance is a model inspection technique that measures the contribution of each feature to a fitted model's statistical performance on a given tabular dataset. This technique is particularly useful for non-linear or opaque estimators and involves randomly shuffling the values of a single feature and observing the resulting degradation of the model's score. The permutation_importance function calculates the feature importance of estimators for a given dataset.
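
A minimal sketch of permutation_importance on a held-out validation split (the diabetes data and Ridge model are illustrative):

# Permutation feature importance
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = Ridge().fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)   # mean drop in score when each feature is shuffled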

Visualizations

Scikit-Learn defines a simple API for creating visualizations for machine learning. The key feature of this API is to allow for quick plotting and visual adjustments without recalculation. It provides Display classes that expose two methods for creating plots: from_estimator and from_predictions. The from_estimator method takes a fitted estimator and some data (X and y) and creates a Display object. Sometimes, we would like to only compute the predictions once, in which case we should use from_predictions instead.
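
A sketch of the two Display entry points using RocCurveDisplay (assumes matplotlib; the data and estimator are illustrative):

# Display objects
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
RocCurveDisplay.from_estimator(clf, X_test, y_test)   # computes predictions internally
y_score = clf.decision_function(X_test)
RocCurveDisplay.from_predictions(y_test, y_score)     # reuses precomputed scores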

Available Plotting Utilities / Display Objects

Dataset Transformations

Pipelines and Composite Estimators

To build a composite estimator, transformers are usually combined with other transformers or with predictors (such as classifiers or regressors). The most common tool used for composing estimators is a Pipeline. Pipelines require all steps except the last to be a transformer. The pipeline exposes the methods provided by its last step: if the last step provides a transform method, the pipeline has a transform method; if the last step provides a predict method, the pipeline exposes a predict method that, given data X, uses all steps except the last to transform the data and then passes the transformed data to the predict method of the last step. The Pipeline class is often used in combination with ColumnTransformer or FeatureUnion which concatenate the output of transformers into a composite feature space.

Pipeline: chaining Estimators

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization, and classification. Multiple purposes are served here:

  • Convenience and encapsulation: only call fit and predict on your data to fit a whole sequence of estimators
  • Joint Parameter Selection: grid search over parameters of all estimators in the pipeline at once
  • Safety: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors

Usage

The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object. The estimators are stored as a list in the steps attribute. A sub-pipeline can be extracted using the slicing notation commonly used for Python sequences such as lists or strings - convenient for performing only some transformations. It is common to adjust the parameters of an estimator within a pipeline. The parameter is therefore nested because it belongs to a particular sub-step. Parameters of the estimators in the pipeline are accessible using the <estimator>__<parameter> syntax.

Transforming Target in Regression

TransformedTargetRegressor transforms the targets y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as arguments the regressor that will be used for prediction and the transformer that will be applied to the target values.
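
A minimal sketch with an artificially skewed target (the make_regression data and the log1p/expm1 pair are illustrative choices):

# Transforming the regression target
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
X, y = make_regression(n_samples=100, noise=10, random_state=0)
y = np.exp((y + abs(y.min())) / 200)   # make the target positive and skewed for illustration
regr = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print(regr.score(X, y))                # predictions are mapped back through inverse_func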

FeatureUnion: Composite Feature Spaces

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix. When you want to apply different transformations to each field of the data, see the related class ColumnTransformer. A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation and value is an estimator object.
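
A sketch of a FeatureUnion combining PCA components with a univariate selection (the iris data and component counts are arbitrary):

# FeatureUnion
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
X, y = load_iris(return_X_y=True)
union = FeatureUnion([("pca", PCA(n_components=2)), ("kbest", SelectKBest(k=1))])
X_features = union.fit(X, y).transform(X)
print(X_features.shape)   # 2 PCA components + 1 selected feature = 3 columns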

ColumnTransformer for heterogeneous data

The ColumnTransformer helps perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and can be parameterized. Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, a boolean mask, or with a make_column_selector.
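
A minimal sketch on a tiny pandas DataFrame with one categorical and one numeric column (the data is made up):

# ColumnTransformer
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
df = pd.DataFrame({"city": ["London", "Paris", "London"],
                   "temperature": [21.0, 25.0, 19.0]})
ct = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["city"]),          # categorical column
    ("scale", StandardScaler(), ["temperature"]),   # numeric column
])
print(ct.fit_transform(df))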

Visualizing Composite Estimators

Estimators are displayed with an HTML representation when shown in a jupyter notebook. You can use estimator_html_repr to get the HTML representation of a pipeline.

Feature Extraction

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image. The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators. The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the "hashing trick". Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample matrices directly.
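
A minimal sketch of DictVectorizer and FeatureHasher on a couple of made-up measurement dicts:

# DictVectorizer and FeatureHasher
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
measurements = [{"city": "Dubai", "temperature": 33.0},
                {"city": "London", "temperature": 12.0}]
vec = DictVectorizer()
print(vec.fit_transform(measurements).toarray())
print(vec.get_feature_names_out())
hasher = FeatureHasher(n_features=8)   # fixed-width output, no vocabulary is stored
print(hasher.transform(measurements).shape)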

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content:

  • tokenizing strings and giving an integer id for each possible token
  • counting the occurrences of tokens in each document
  • normalizing and weighing with diminishing importance tokens that occur in the majority of samples / documents

In this scheme, features and samples are defined as follows:

  • each individual token occurrence frequency (normalized or not) is treated as a feature
  • the vector of all the token frequencies for a given document is considered a multivariate sample

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus. We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting, and normalization) is called the Bag of Words representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document. The resulting feature matrices are usually sparse.

CountVectorizer implements both tokenization and occurrence counting in a single class. There are some useful methods and attributes here. Stop words are words like "and" and "the" which are presumed to be uninformative in representing the content of a text, and which may be removed to avoid them being construed as signal for prediction. In particular, in a supervised setting the bag of words representation can be successfully combined with fast and scalable linear models to train document classifiers. In an unsupervised setting, the bag of words representation can be used to group similar documents together by applying clustering algorithms such as K-Means. There are other considerations here that I may come back to.

Preprocessing Data

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. In general, many learning algorithms such as linear models benefit from standardization of the data set.

Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in Scikit-Learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance. In practice, we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation. The preprocessing module provides the StandardScaler for this. An alternative to standardization is scaling features to lie between a given minimum and maximum value, often between 0 and 1, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

Centering sparse data would destroy the sparseness structure in the data, and this rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales. If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use RobustScaler as a drop-in replacement instead. If you have a kernel K that computes a dot product in a feature space defined by a function φ, a KernelCenterer can transform the kernel matrix so that it contains inner products in the feature space defined by φ followed by the removal of the mean in that space.

Non-linear transformation

Two types of transformations are available: quantile transforms and power transforms. Both quantile and power transforms are based on monotonic transformations of the features and thus preserve the rank of values along each feature. QuantileTransformer provides a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1. By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness. PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform and the Box-Cox transform.
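
A rough sketch of both transforms on skewed toy data (the lognormal data and the n_quantiles value are illustrative):

# Quantile and power transforms
import numpy as np
from sklearn.preprocessing import QuantileTransformer, PowerTransformer
rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 2))                # skewed features
qt = QuantileTransformer(n_quantiles=50, output_distribution="uniform", random_state=0)
X_uniform = qt.fit_transform(X)                 # values mapped into [0, 1]
pt = PowerTransformer(method="yeo-johnson")     # "box-cox" also works for strictly positive data
X_gaussian = pt.fit_transform(X)
print(X_uniform.min(), X_uniform.max(), X_gaussian.mean(axis=0))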

Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1, l2, or max norms. The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API.

Encoding Categorical Features

To convert categorical features to such integer codes, we can use the OrdinalEncoder. This estimator transforms each categorical feature to one new feature of integers (0 to n_categories - 1). Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0. OneHotEncoder and OrdinalEncoder support aggregating infrequent categories into a single output for each feature. The parameters to enable the gathering of infrequent categories are min_frequency and max_categories. The TargetEncoder uses the target mean conditioned on the categorical feature for encoding unordered categories, i.e. nominal categories. This encoding scheme is useful with categorical features with high cardinality, where one-hot encoding would inflate the feature space, making it more expensive for a downstream model to process.
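
A minimal sketch of both encoders on a tiny made-up categorical dataset:

# Encoding categorical features
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
X = [["male", "from US"], ["female", "from Europe"], ["female", "from US"]]
ord_enc = OrdinalEncoder()
print(ord_enc.fit_transform(X))                 # one integer code per category
onehot = OneHotEncoder(handle_unknown="ignore")
print(onehot.fit_transform(X).toarray())        # one binary column per category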

Discretization

Discretization provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization because it can transform the dataset of continuous attributes to one with only nominal attributes. KBinsDiscretizer discretizes features into k bins. Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution. The Binarizer is meant to be used in the early stages of a Pipeline.

Polynomial Features

PolynomialFeatures implements a simple and common method to generate polynomial features, which can capture features' high-order and interaction terms.

Custom Transformer

Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing. You can implement a transformer from an arbitrary function with FunctionTransformer.

Imputation of Missing Values

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. See the glossary entry on imputation.

Univariate vs Multivariate Imputation

One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. SimpleImputer). By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (IterativeImputer).

Univariate Feature Imputation

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using statistics (mean, median, or most frequent) of each column in which the missing values are located.
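
A minimal sketch of SimpleImputer with the mean strategy:

# Univariate feature imputation
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imp.fit_transform(X))   # the NaN is replaced by the column mean, 4.0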

Multivariate Feature Imputation

The IterativeImputer class models each feature with missing values as a function of the other features, and uses that estimate for imputation.

It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.
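
A sketch of IterativeImputer; note that (at the time of writing) scikit-learn still requires the explicit experimental import shown here:

# Multivariate feature imputation
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
imp = IterativeImputer(max_iter=10, random_state=0)
print(imp.fit_transform(X))   # each missing value is predicted from the other column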

Nearest Neighbors Imputation

The KNNImputer class provides imputation for missing values using the k-nearest neighbors approach. Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for the feature. The features of the neighbors are averaged uniformly or weighted by distance to each neighbor.

Unsupervised Dimensionality Reduction

If your number of features is high, it may be useful to reduce it with an unsupervised step prior to supervised steps. Many of the unsupervised learning methods implement a transform method that can be used to reduce the dimensionality. Below are some specific examples of this pattern that are heavily used:

  • Principal Component Analysis
  • Random Projections
  • Feature agglomeration

Random Projection

The sklearn.random_projection module implements a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy for faster processing times and smaller model sizes. This module implements two types of unstructured random matrix: Gaussian random matrix and sparse random matrix. The dimensions and distribution of random projection matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset. Thus random projection is a suitable approximation technique for distance-based methods.

The Johnson-Lindenstrauss lemma

In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.

Knowing only the number of samples, the johnson_lindenstrauss_min_dim estimates conservatively the minimal size of the random subspace to guarantee a bounded distortion introduced by some random projection.
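
A minimal sketch of querying the bound (the sample count and eps values are arbitrary):

# Johnson-Lindenstrauss bound
from sklearn.random_projection import johnson_lindenstrauss_min_dim
print(johnson_lindenstrauss_min_dim(n_samples=1_000_000, eps=0.5))               # minimal n_components for eps=0.5 distortion
print(johnson_lindenstrauss_min_dim(n_samples=1_000_000, eps=[0.5, 0.1, 0.01]))  # one value per eps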

Gaussian Random Projection

The GaussianRandomProjection reduces the dimensionality by projecting the original input space onto a randomly generated matrix whose components are drawn from the distribution N\left( 0, \frac{1}{n_{\text{components}}} \right).

Sparse Random Projection

The SparseRandomProjection reduces the dimensionality by projecting the original input space using a sparse random matrix.

Kernel Approximation

This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines. The following feature functions perform non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms. The advantage of using approximate feature maps is that they can be better suited for online learning and can significantly reduce the cost of learning with very large datasets. Standard kernelized SVMs do not scale well to large datasets, but using an approximate kernel map it is possible to use much more efficient linear SVMs.

The Nystroem method, as implemented in Nystroem, is a general method for reduced-rank approximation of kernels. The RBFSampler constructs an approximate mapping for the radial basis function kernel, also known as Random Kitchen Sinks. This transformation can be used to explicitly model a kernel map prior to applying a linear model.
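
A sketch of the usual pattern: map the data with RBFSampler, then train a linear model on the mapped features (the dataset and hyperparameters are illustrative):

# Kernel approximation
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
X, y = make_classification(n_samples=200, random_state=0)
rbf_feature = RBFSampler(gamma=1, random_state=1)       # Random Kitchen Sinks features
X_features = rbf_feature.fit_transform(X)
clf = SGDClassifier(max_iter=1000).fit(X_features, y)   # linear model on the approximate kernel map
print(clf.score(X_features, y))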

Pairwise Metrics, Affinities, and Kernels

The sklearn.metrics.pairwise submodule implements utilities to evaluate pairwise distances or affinity of sets of samples. This module contains both distance metrics and kernels. Distance metrics are functions d(a, b) such that d(a, b) < d(a, c) if objects a and b are considered "more similar" than objects a and c. Two objects exactly alike would have a distance of 0. Kernels are measures of similarity: s(a, b) > s(a, c) if objects a and b are considered "more similar" than objects a and c. The cosine_similarity computes the L2-normalized dot product of vectors. The function linear_kernel computes the linear kernel, that is, a special case of polynomial_kernel with degree=1 and coef0=0. The function polynomial_kernel computes the degree-d polynomial kernel between two vectors. The function sigmoid_kernel computes the sigmoid kernel between two vectors. The rbf_kernel function computes the radial basis function (RBF) kernel between two vectors. The laplacian_kernel function is a variant of the radial basis function kernel.

Common Kernel Functions

Transforming the Prediction Target

There are transformations that are not intended to be used on features, only on supervised learning targets. LabelBinarizer is a utility class to help create a label indicator matrix from a list of multiclass labels. The MultiLabelBinarizer transformer can be used to convert between a collection of collections of labels and the indicator format. LabelEncoder is a utility class to help normalize labels such that they contain values only between 0 and n_classes - 1.

# Pipeline 
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
iris = load_iris()
pipe = Pipeline(steps=[
   ('select', SelectKBest(k=2)),
   ('clf', LogisticRegression())])
pipe.fit(iris.data, iris.target)
print(pipe[-1:])
print(pipe[:-1].get_feature_names_out())
pipe.set_params(clf__C=10)

# Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
print("Count Vectorizer:",type(X),X.size)
## Standard Scaler
import numpy as np
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
# three samples with 100 features each, drawn from a scaled normal distribution
X_train = np.array([[np.random.randn()*100 for i in range(100)], [np.random.randn()*100 for i in range(100)], [np.random.randn()*100 for i in range(100)]])
print(X_train[0:5,0:5])
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(X_scaled[0:5,0:5])
print(X_scaled.mean(axis=0)[0:5])
print(X_scaled.std(axis=0)[0:5])
out[19]

Pipeline(steps=[('clf', LogisticRegression())])
['x2' 'x3']
Count Vectorizer: <class 'scipy.sparse._csr.csr_matrix'> 19
[[ 49.6714153 -13.82643012 64.76885381 152.30298564 -23.41533747]
[-141.53707421 -42.06453228 -34.27145165 -80.22772692 -16.12857117]
[ 35.77873603 56.07845264 108.30512432 105.3802052 -137.7669368 ]]
[[ 0.78540412 -0.33667855 0.3101197 0.92779459 0.64069022]
[-1.41120845 -1.02119243 -1.34999471 -1.38823376 0.77150517]
[ 0.62580433 1.35787097 1.03987502 0.46043917 -1.41219539]]
[ 1.11022302e-16 0.00000000e+00 0.00000000e+00 -1.85037171e-17
7.40148683e-17]
[1. 1. 1. 1. 1.]

Dataset Loading Utilities

Toy Datasets

Scikit-Learn comes with a few small standard datasets that do not require downloading any file from an external website.

  • load_iris
    • Load the famous iris dataset. This dataset contains 3 classes of 50 instances each, where each class refers to a type of Iris plant.
  • load_diabetes
    • Ten baseline variables were obtained for 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline
  • load_digits
    • Load the handwritten digits dataset: 1,797 8x8 images of the digits 0-9 (a small dataset, not the full-size MNIST)
  • load_linnerud
    • The Linnerud dataset is a multi-output regression dataset. It consists of three exercise (data) and three physiological (target) variables collected from twenty middle-aged men in a fitness club
  • load_wine
    • The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.
  • load_breast_cancer
    • features are computed from a digitized image of a fine needle aspirate of breast mass. They describe characteristics of the cell nuclei present in the image.

Real World Datasets

  • fetch_olivetti_faces
    • Contains a set of face images taken between 1992 and 1994. 10 different images of each of 40 different subjects. All the images were taken against a dark background, with various expressions and varying lighting.
  • fetch_20newsgroups
    • Load the filenames and data from the 20 newsgroups dataset (classification).
    • The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation).
  • fetch_20newsgroups_vectorized
    • Load and vectorize the 20 newsgroups dataset (classification).
  • fetch_lfw_pairs
    • This dataset is a collection of JPEG pictures of famous people collected over the internet, all details are available on the official website. The typical task is called Face Verification.
  • fetch_covtype
    • The samples in this dataset correspond to 30x30m patches of forest in the US, collected for the task of predicting each patch's cover type - the dominant species of tree
  • fetch_rcv1
    • Over 800,000 manually categorized newswire stories made available by Reuters, Ltd. for research purposes
  • fetch_kddcup99
    • The artificial data was generated using a closed network and hand-injected attacks to produce a large number of different types of attack with normal activity in the background. As the initial goal was to produce a large training set for supervised learning algorithms, there is a large portion (80%) of abnormal data that is unrealistic in the real world.
  • fetch_california_housing
    • The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
  • fetch_species_distributions
    • This dataset represents the geographic distribution of two species in Central and South America.

Generated Datasets

Generators for Classification and Clustering

make_blobs and make_classification create multiclass datasets by allocating each class one or more normally distributed clusters of points. make_circles and make_moons generate 2d binary classification datasets that are challenging to certain algorithms. make_multilabel_classification generates random samples with multiple labels, reflecting a bag of words drawn from a mixture of topics. make_biclusters and make_checkerboard generate data for biclustering.

Generators for Regression

make_regression produces regression targets as an optionally-sparse random linear combination of random features, with noise. make_sparse_uncorrelated, make_friedman1, make_friedman2, and make_friedman3 are other examples.

Generators for Manifold Learning

  • make_s_curve - generates an S curve dataset
  • make_swiss_roll - Generates a swiss roll dataset

Generators for Decomposition

  • make_low_rank_matrix: generate mostly low rank matrix with bell-shaped singular values
  • make_sparse_coded_signal: generate a signal as a sparse combination of dictionary elements
  • make_spd_matrix: generate a random symmetric, positive-definite matrix
  • make_sparse_spd_matrix: generate a sparse symmetric positive matrix

Loading Other Datasets

Use load_sample_images() and load_sample_image(index) to load images / image. You can use the fetch_openml function to download datasets from the openml.org repository.

Computing with scikit-learn

Strategies to Scale Computationally: Bigger Data

For some applications, the amount of examples, features (or both) and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases, Scikit-Learn has a number of options you can consider to make your system scale.

Scaling with Instances Using Out-of-Core Learning

Out-of-core learning is a technique used to learn from data that cannot fit in a computer's main memory (RAM). Here is a sketch of a system designed to achieve this goal (a minimal code example follows the list):

  1. a way to stream instances
  2. a way to extract features from instances
  3. an incremental algorithm
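
A minimal sketch of the incremental-learning part, simulating the streamed instances with slices of an in-memory array:

# Out-of-core learning with partial_fit
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
X, y = make_classification(n_samples=10_000, random_state=0)
classes = np.unique(y)                   # all classes must be declared up front
clf = SGDClassifier()
for start in range(0, len(X), 1000):     # pretend each slice is a mini-batch streamed from disk
    clf.partial_fit(X[start:start + 1000], y[start:start + 1000], classes=classes)
print(clf.score(X, y))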

Computational performance

For some applications, the performance (mainly latency and throughput at prediction time) of estimators is crucial. It may also be of interest to consider the training throughput but this is often less important in a production setup (where it often takes place offline). Prediction latency is measured as the elapsed time necessary to make a prediction (e.g. in micro-seconds). Prediction throughput is defined as the number of predictions that software can deliver in a given amount of time.

Model Persistence

After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to retrain. You probably want to use pickle. pickle is a module from the Python standard library. It can serialize and deserialize any Python object, including custom Python classes and objects. While pickle can be used to easily save and load scikit-learn models, it may trigger malicious code while loading a model from an untrusted source.
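
A minimal sketch of pickling a fitted model; the usual caveat applies, so only unpickle models from sources you trust:

# Model persistence with pickle
import pickle
from sklearn import datasets, svm
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)
blob = pickle.dumps(clf)      # serialize the fitted model to bytes
clf2 = pickle.loads(blob)     # deserialize; never do this with untrusted input
print(clf2.score(X, y))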

Common Pitfalls and Recommended Practices

Inconsistent Preprocessing

Use the same data transforms that you use to train the model on subsequent datasets, whether it's test data or data in production systems. If you do not, the feature space will change, and the model will not be able to perform effectively.

Data Leakage

Data leakage occurs when information that would not be available at prediction time is used when building the model. This results in overly optimistic performance estimates, and thus poorer performance when the model is used on actually novel data. A common cause is not keeping the test and train subsets separate. Test data should never be used to make choices about the model. The general rule is to never call fit on the test data (including preprocessing steps). Although both train and test data subsets should receive the same preprocessing transformation, it is important that these transformations are only learnt from the training data. The scikit-learn Pipeline is a great way to prevent data leakage as it ensures that the appropriate method is performed on the correct data subset. Feature selection should only use the training data.

Some scikit-learn objects are inherently random. These are usually estimators (RandomForestClassifier) and cross-validation splitters (KFold). The randomness of these objects is controlled via their random_state parameter. In order to obtain reproducible (constant) results across multiple program executions, we need to remove all uses of random_state=None, which is the default. The recommended way is to declare an rng variable at the top of the program and pass it down to any object that accepts a random_state parameter. It is preferable to evaluate the cross-validation performance by letting the estimator use a different RNG on each fold. This is done by passing a RandomState instance (or None) to the estimator initialization.

# Declare one RandomState instance and pass it to every object that accepts random_state
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

rng = np.random.RandomState(0)
X, y = make_classification(random_state=rng)
rf = RandomForestClassifier(random_state=rng)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=rng)
rf.fit(X_train, y_train).score(X_test, y_test)
out[24]

0.84

Choosing the Right Estimator

Here is a helpful chart on choosing the right estimator:

Choosing the Right Estimator