API design for machine learning software: experiences from the scikit-learn project
Looking for more research papers to read, I scanned my Hands-On Machine Learning notes for the many papers referenced there; this is one of them. These papers are mainly on machine learning and deep learning topics.
Reference: API design for machine learning software: experiences from the scikit-learn project (paper)
scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. The paper discusses the design choices behind the project's API, in particular the simple interface shared by all learning and processing units in the library and its advantages in terms of composition and reusability.
The scikit-learn project provides an open-source machine learning library for the Python programming language. The ambition of the project is to provide efficient and well-established machine learning tools within a programming environment that is accessible to non-machine learning experts and reusable in various scientific areas. scikit-learn is a library - a collection of classes and functions that users import into Python programs. NumPy augments Python with a contiguous numeric array datatype and fast array computing primitives, while SciPy extends it further with common numerical operations, either by implementing these in Python / NumPy or by wrapping existing C/C++/Fortran implementations.
All objects within scikit-learn share a uniform basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions, and a transformer interface for converting data. Design choices were guided by the goal of avoiding the proliferation of framework code. The API is designed to adhere to the following broad principles:
- Consistency: All objects share a consistent interface composed of a limited set of methods
- Inspection: Constructor parameters and parameter values determined by learning algorithms are stored and exposed as public attributes
- Non-proliferation of classes: Learning algorithms are the only objects to be represented using custom classes. Datasets are represented as NumPy arrays or SciPy sparse matrices. Hyper-parameter names and values are represented as standard Python strings whenever possible.
- Composition: Whenever feasible, machine learning tasks are expressed as sequences or combinations of existing building blocks
- Sensible defaults: Whenever an operation requires a user-defined parameter, an appropriate default value is provided by the library
In most machine learning tasks, data is modeled as a set of variables. For example, in a supervised learning task, the goal is to find a mapping from input variables, called features, to some output variable. A sample is defined as a pair of values of these variables. A widely used representation of a dataset, a collection of such samples, is a pair of matrices with numerical values: one for the input values and one for the output values. Each row of these matrices corresponds to one sample of the dataset and each column to one variable of the problem. NumPy multidimensional arrays are used for dense data, and SciPy sparse matrices are used for sparse data. This choice lets the library take advantage of vectorized operations and builds on formats that most scientific Python users are already familiar with. For tasks where the inputs are text files or semi-structured objects, vectorizer objects are provided that efficiently convert such data to the NumPy or SciPy formats. The API is oriented towards processing batches of samples rather than single samples.
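As a quick illustration of this representation (the values below are made up for the example):

```python
import numpy as np
from scipy import sparse

# X is an (n_samples, n_features) array, y an (n_samples,) array of targets.
X = np.array([[0.0, 1.5],
              [2.0, 0.3],
              [1.1, 4.2]])      # 3 samples, 2 features
y = np.array([0, 1, 0])         # one output value per sample

# Sparse data uses SciPy sparse matrices with the same (samples x features) layout.
X_sparse = sparse.csr_matrix(X)
```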
The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a `fit` method for learning a model from training data. An estimator is initialized from a set of named constant hyper-parameter values and can be considered a function that maps these values to an actual learning algorithm. The constructor of an estimator does not see any actual data, nor does it perform any actual learning; all it does is attach the given parameters to the object. Actual learning is performed by the `fit` method, whose task is to run a learning algorithm, determine model-specific parameters from the training data, and set them as attributes on the estimator object. The parameters learned by an estimator are exposed as public attributes whose names carry a trailing underscore, to facilitate model inspection. The choice to let a single object serve the dual purpose of estimator and model has mostly been driven by usability and technical considerations. In scikit-learn, classical learning algorithms are not the only objects implemented as estimators: preprocessing routines (such as the scaling of features) and feature extraction techniques (such as the vectorization of text documents) also implement the estimator interface.
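A minimal sketch of this workflow, using `LogisticRegression` as an example estimator (the data and hyper-parameter value are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1, 1])

# The constructor only attaches hyper-parameters to the object; it sees no data
# and performs no learning.
clf = LogisticRegression(C=1.0)

# fit() runs the learning algorithm and stores the learned parameters as public
# attributes whose names end with an underscore.
clf.fit(X_train, y_train)
print(clf.coef_, clf.intercept_)
```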
The predictor interface extends the notion of an estimator by adding a `predict` method that takes an array `X_test` and produces predictions for it, based on the parameters learned by the estimator. Apart from `predict`, predictors may also implement methods that quantify the confidence of predictions. In the case of linear models, the `decision_function` method returns the distance of samples to the separating hyperplane. Some predictors also provide a `predict_proba` method which returns class probabilities. Finally, predictors provide a `score` method to assess their performance on a batch of input data.
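A short sketch of the predictor interface, again using `LogisticRegression` on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.5], [2.5, 2.5]])
y_test = np.array([0, 1])

clf = LogisticRegression().fit(X_train, y_train)

print(clf.predict(X_test))            # predicted class labels
print(clf.predict_proba(X_test))      # class probabilities
print(clf.decision_function(X_test))  # signed distance to the separating hyperplane
print(clf.score(X_test, y_test))      # mean accuracy on this test batch
```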
Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a transformer interface which defines a `transform` method. It takes as input some new data `X_test` and yields as output a transformed version of it. Preprocessing, feature selection, feature extraction, and dimensionality reduction are all provided as transformers within the library. Every transformer also allows `fit(X_train).transform(X_train)` to be written as `fit_transform(X_train)`; the combined `fit_transform` method avoids repeated computations.
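For example, a sketch using `StandardScaler` as a typical transformer (made-up data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
# Equivalent to scaler.fit(X_train).transform(X_train), without the duplicated work.
X_scaled = scaler.fit_transform(X_train)

# New data is transformed with the statistics learned from the training data.
X_new = np.array([[2.5, 15.0]])
print(scaler.transform(X_new))
```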
Some machine learning methods, such as ensemble methods or multiclass and multilabel classification schemes built on top of a binary classifier, are implemented as meta-estimators. They take an existing base estimator as input and use it internally for learning and making predictions.
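A brief sketch of the idea, using `OneVsRestClassifier` wrapping a base estimator (the choice of classifier and dataset is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# The meta-estimator receives a base estimator and uses it internally,
# fitting one binary classifier per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```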
Composition of estimators can be done in scikit-learn either sequentially, through `Pipeline` objects, or in a parallel fashion, through `FeatureUnion` objects. `Pipeline` objects chain multiple estimators into a single one. `FeatureUnion` objects combine multiple transformers into a single one that concatenates their outputs.
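A small sketch combining both (the specific transformers and classifier are illustrative choices, not prescribed by the text):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# FeatureUnion concatenates the outputs of PCA and univariate feature selection.
features = FeatureUnion([("pca", PCA(n_components=2)),
                         ("kbest", SelectKBest(k=1))])

# Pipeline chains scaling, the feature union, and a classifier into one estimator.
model = Pipeline([("scale", StandardScaler()),
                  ("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])

model.fit(X, y)
print(model.score(X, y))
```

Because the composite object is itself an estimator, it can be fitted, used for prediction, or passed to other tools exactly like a single built-in estimator.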
The problem of model selection is to find, within some hyper-parameter space, the best combination of hyper-parameters with respect to a user-specified criterion. In scikit-learn, model selection is supported by two distinct meta-estimators, `GridSearchCV` and `RandomizedSearchCV`. They take as input an estimator (basic or composite) whose hyper-parameters must be optimized, and a set of hyper-parameter settings to search through. Randomized search avoids the combinatorial explosion of grid search by sampling a fixed number of candidates from its parameter distributions.
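A minimal sketch of a grid search; `RandomizedSearchCV` is used the same way but takes parameter distributions and a sampling budget (`n_iter`). The estimator and grid below are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)

# GridSearchCV is itself an estimator: fit() runs the cross-validated search and
# exposes the results through attributes such as best_params_ and best_score_.
search.fit(X, y)
print(search.best_params_, search.best_score_)
```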
Estimators are defined by interface, not by inheritance, and the interface is entirely implicit as far as the programming language is concerned. Duck typing allows both for extensibility and flexibility: as long as an estimator follows the API and conventions outlined above, it can be used in lieu of a built-in estimator.
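For instance, a hand-rolled object that merely follows the `fit`/`predict` conventions can stand in for a built-in estimator (a deliberately trivial, hypothetical majority-class classifier; fuller compatibility with tools like grid search would also require `get_params`/`set_params`):

```python
import numpy as np


class MajorityClassifier:
    """Predicts the most frequent class seen during fit."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]   # learned attribute, trailing underscore
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 1, 1])
print(MajorityClassifier().fit(X, y).predict(X))
```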