API design for machine learning software: experiences from the scikit-learn project
Looking for more research papers to read, I scanned my Hands-On Machine Learning notes for the many papers referenced there; this is one of them. These papers are mainly on machine learning and deep learning topics.
Reference: API design for machine learning software: experiences from the scikit-learn project (paper)
scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. The paper discusses the design choices behind the project's API, in particular the simple interface shared by all learning and processing units in the library and its advantages in terms of composition and reusability.
The scikit-learn project provides an open-source machine learning library for the Python programming language. The ambition of the project is to provide efficient and well-established machine learning tools within a programming environment that is accessible to non-machine learning experts and reusable in various scientific areas. scikit-learn is a library - a collection of classes and functions that users import into Python programs. NumPy augments Python with a contiguous numeric array datatype and fast array computing primitives, while SciPy extends it further with common numerical operations, either by implementing these in Python / NumPy or by wrapping existing C/C++/Fortran implementations.
All objects within scikit-learn share a uniform basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions, and a transformer interface for converting data. Design choices were guided by the goal of avoiding the proliferation of framework code. The API is designed to adhere to the following broad principles:
- Consistency: All objects share a consistent interface composed of a limited set of methods
- Inspection: Constructor parameters and parameter values determined by learning algorithms are stored and exposed as public attributes
- Non-proliferation of classes: Learning algorithms are the only objects to be represented using custom classes. Datasets are represented as NumPy arrays or SciPy sparse matrices. Hyper-parameter names and values are represented as standard Python strings whenever possible.
- Composition: Whenever feasible, machine learning tasks are expressed as sequences or combinations of existing building blocks
- Sensible defaults: Whenever an operation requires a user-defined parameter, an appropriate default value is provided by the library
In most machine learning tasks, data is modeled as a set of variables. For example, in a supervised learning task, the goal is to find a mapping from input variables, called features, to some output variable. A sample is defined as a pair of values of these variables. A widely used representation of a dataset, a collection of such samples, is a pair of matrices with numerical values: one for the input values and one for the output values. Each row of these matrices corresponds to one sample of the dataset and each column to one variable of the problem. NumPy multidimensional arrays are used for dense data, and SciPy sparse matrices are used for sparse data. This choice lets the library take advantage of vectorized operations and builds on formats that most scientific Python users are already familiar with. For tasks where the inputs are text files or semi-structured objects, vectorizer objects are provided that efficiently convert such data to the NumPy or SciPy formats. The API is oriented towards processing batches of samples rather than single samples.
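As a quick illustration of this representation (the values below are made up for the example):

```python
import numpy as np
from scipy import sparse

# X is an (n_samples, n_features) array, y an (n_samples,) array of targets.
X = np.array([[0.0, 1.5],
              [2.0, 0.3],
              [1.1, 4.2]])      # 3 samples, 2 features
y = np.array([0, 1, 0])         # one output value per sample

# Sparse data uses SciPy sparse matrices with the same (samples x features) layout.
X_sparse = sparse.csr_matrix(X)
```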
The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a `fit` method for learning a model from training data. An estimator is initialized from a set of named constant hyper-parameter values and can be considered a function that maps these values to an actual learning algorithm. The constructor of an estimator does not see any actual data, nor does it perform any actual learning; all it does is attach the given parameters to the object. Actual learning is performed by the `fit` method, whose task is to run a learning algorithm, determine model-specific parameters from the training data, and set them as attributes on the estimator object. The parameters learned by an estimator are exposed as public attributes whose names carry a trailing underscore, to facilitate model inspection. The choice to let a single object serve the dual purpose of estimator and model has mostly been driven by usability and technical considerations. In scikit-learn, classical learning algorithms are not the only objects implemented as estimators: preprocessing routines (such as the scaling of features) and feature extraction techniques (such as the vectorization of text documents) also implement the estimator interface.
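A minimal sketch of this workflow, using `LogisticRegression` as an example estimator (the data and hyper-parameter value are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1, 1])

# The constructor only attaches hyper-parameters to the object; it sees no data
# and performs no learning.
clf = LogisticRegression(C=1.0)

# fit() runs the learning algorithm and stores the learned parameters as public
# attributes whose names end with an underscore.
clf.fit(X_train, y_train)
print(clf.coef_, clf.intercept_)
```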
The predictor interface extends the notion of an estimator by adding a `predict` method that takes an array `X_test` and produces predictions for it, based on the parameters learned by the estimator. Apart from `predict`, predictors may also implement methods that quantify the confidence of predictions. In the case of linear models, the `decision_function` method returns the distance of samples to the separating hyperplane. Some predictors also provide a `predict_proba` method which returns class probabilities. Finally, predictors provide a `score` method to assess their performance on a batch of input data.
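A short sketch of the predictor interface, again using `LogisticRegression` on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.5], [2.5, 2.5]])
y_test = np.array([0, 1])

clf = LogisticRegression().fit(X_train, y_train)

print(clf.predict(X_test))            # predicted class labels
print(clf.predict_proba(X_test))      # class probabilities
print(clf.decision_function(X_test))  # signed distance to the separating hyperplane
print(clf.score(X_test, y_test))      # mean accuracy on this test batch
```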
Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a transformer interface which defines a `transform` method. It takes as input some new data `X_test` and yields as output a transformed version of it. Preprocessing, feature selection, feature extraction, and dimensionality reduction are all provided as transformers within the library. Every transformer also allows `fit(X_train).transform(X_train)` to be written as `fit_transform(X_train)`; the combined `fit_transform` method avoids repeated computations.
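For example, a sketch using `StandardScaler` as a typical transformer (made-up data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
# Equivalent to scaler.fit(X_train).transform(X_train), without the duplicated work.
X_scaled = scaler.fit_transform(X_train)

# New data is transformed with the statistics learned from the training data.
X_new = np.array([[2.5, 15.0]])
print(scaler.transform(X_new))
```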
Some machine learning methods, such as ensemble methods or multiclass and multilabel classification schemes built on top of a binary classifier, are implemented as meta-estimators. They take an existing base estimator as input and use it internally for learning and making predictions.
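A brief sketch of the idea, using `OneVsRestClassifier` wrapping a base estimator (the choice of classifier and dataset is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# The meta-estimator receives a base estimator and uses it internally,
# fitting one binary classifier per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```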
Composition of estimators can be done in scikit-learn either sequentially, through `Pipeline` objects, or in a parallel fashion, through `FeatureUnion` objects. `Pipeline` objects chain multiple estimators into a single one. `FeatureUnion` objects combine multiple transformers into a single one that concatenates their outputs.
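A small sketch combining both (the specific transformers and classifier are illustrative choices, not prescribed by the text):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# FeatureUnion concatenates the outputs of PCA and univariate feature selection.
features = FeatureUnion([("pca", PCA(n_components=2)),
                         ("kbest", SelectKBest(k=1))])

# Pipeline chains scaling, the feature union, and a classifier into one estimator.
model = Pipeline([("scale", StandardScaler()),
                  ("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])

model.fit(X, y)
print(model.score(X, y))
```

Because the composite object is itself an estimator, it can be fitted, used for prediction, or passed to other tools exactly like a single built-in estimator.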
The problem of model selection is to find, within some hyper-parameter space, the best combination of hyper-parameters with respect to a user-specified criterion. In scikit-learn, model selection is supported by two distinct meta-estimators, `GridSearchCV` and `RandomizedSearchCV`. They take as input an estimator (basic or composite) whose hyper-parameters must be optimized, and a set of hyper-parameter settings to search through. Randomized search avoids the combinatorial explosion of grid search by sampling a fixed number of candidates from its parameter distributions.
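A minimal sketch of a grid search; `RandomizedSearchCV` is used the same way but takes parameter distributions and a sampling budget (`n_iter`). The estimator and grid below are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)

# GridSearchCV is itself an estimator: fit() runs the cross-validated search and
# exposes the results through attributes such as best_params_ and best_score_.
search.fit(X, y)
print(search.best_params_, search.best_score_)
```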
Estimators are defined by interface, not by inheritance, and the interface is entirely implicit as far as the programming language is concerned. Duck typing allows both for extensibility and flexibility: as long as an estimator follows the API and conventions outlined above, it can be used in lieu of a built-in estimator.
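For instance, a hand-rolled object that merely follows the `fit`/`predict` conventions can stand in for a built-in estimator (a deliberately trivial, hypothetical majority-class classifier; fuller compatibility with tools like grid search would also require `get_params`/`set_params`):

```python
import numpy as np


class MajorityClassifier:
    """Predicts the most frequent class seen during fit."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]   # learned attribute, trailing underscore
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 1, 1])
print(MajorityClassifier().fit(X, y).predict(X))
```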