API design for machine learning software: experiences from the scikit-learn project

While looking for more research papers to read, I went through my Hands-On Machine Learning notes and collected the many papers referenced there. This is one of those papers; most of them cover machine learning and deep learning topics.

Reference: API design for machine learning software: experiences from the scikit-learn project (paper)


scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. The paper discusses the design choices behind the project's API, in particular the simple interface shared by all learning and processing units in the library and its advantages in terms of composition and reusability.

The scikit-learn project provides an open-source machine learning library for the Python programming language. The ambition of the project is to provide efficient and well-established machine learning tools within a programming environment that is accessible to non-machine learning experts and reusable in various scientific areas. scikit-learn is a library - a collection of classes and functions that users import into Python programs. NumPy augments Python with a contiguous numeric array datatype and fast array computing primitives, while SciPy extends it further with common numerical operations, either by implementing these in Python / NumPy or by wrapping existing C/C++/Fortran implementations.

All objects within scikit-learn share a uniform basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions, and a transformer interface for converting data. Design choices were guided by the goal of avoiding the proliferation of framework code. The API is designed to adhere to the following broad principles:

  • Consistency: All objects share a consistent interface composed of a limited set of methods
  • Inspection: Constructor parameters and parameter values determined by learning algorithms are stored and exposed as public attributes
  • Non-proliferation of classes: Learning algorithms are the only objects to be represented using custom classes. Datasets are represented as NumPy arrays or SciPy sparse matrices. Hyper-parameter names and values are represented as standard Python strings whenever possible.
  • Composition: Compose from existing building blocks
  • Sensible Defaults: Provide appropriate defaults where required

In most machine learning tasks, data is modeled as a set of variables. For example, in a supervised learning task, the goal is to find a mapping from input variables X_1, ..., X_p, called features, to some output variable Y. A sample is defined as a pair of values ([x_1, ..., x_p]^T, y) of these variables. A widely used representation of a dataset, a collection of such samples, is a pair of matrices with numerical values: one for the input values and one for the output values. Each row of these matrices corresponds to one sample of the dataset and each column to one variable of the problem. NumPy multidimensional arrays are used for dense data, and SciPy sparse matrices are used for sparse data. This allows the library to take advantage of vectorized operations and the fact that many people are familiar with the libraries. For tasks where the inputs are text files or semi-structured objects, vectorizer objects are provided that efficiently convert such data to the NumPy or SciPy formats. The API is oriented towards processing batches of samples rather than single samples.
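
This data representation can be sketched with a tiny toy dataset (the values here are made up for illustration):

```python
import numpy as np

# A toy dataset of 4 samples and 2 features: rows are samples,
# columns are variables, following the (n_samples, n_features) convention.
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 1.0],
              [3.0, 0.0]])

# One output value per sample, aligned row-by-row with X.
y = np.array([0, 0, 1, 1])

assert X.shape == (4, 2)          # (n_samples, n_features)
assert y.shape[0] == X.shape[0]   # one label per row of X
```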

The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a fit method for learning a model from training data. An estimator is initialized from a set of named constant hyper-parameter values and can be considered as a function that maps these values to actual learning algorithms. The constructor of an estimator does not see any actual data, nor does it perform any actual learning. All it does is attach the given parameters to the object. Actual learning is performed by the fit method. Its task is to run a learning algorithm, determine model-specific parameters from the training data, and set these as attributes on the estimator object. The parameters learned by an estimator are exposed as public attributes with names suffixed with a trailing underscore to facilitate model inspection. The choice to let a single object serve dual purposes as estimator and model has mostly been driven by usability and technical considerations. In scikit-learn, classical learning algorithms are not the only objects to be implemented as estimators. Preprocessing routines (the scaling of features) or feature extraction techniques (vectorization of text documents) are implemented using the estimator interface as well.
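
A minimal sketch of the estimator contract, using LogisticRegression on made-up data (the specific estimator and values are just for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])

# The constructor only records hyper-parameters; no data is seen here
# and no learning takes place.
clf = LogisticRegression(C=1.0)

# fit() runs the learning algorithm and stores the model-specific
# parameters as public attributes with a trailing underscore.
clf.fit(X_train, y_train)

assert hasattr(clf, "coef_")       # learned weights
assert hasattr(clf, "intercept_")  # learned bias term
```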

The predictor interface extends the notion of an estimator by adding a predict method that takes an array X_test and produces the predictions for X_test, based on the learned parameters of the estimator. Apart from predict, predictors may also implement methods that quantify the confidence of predictions. In the case of linear models, the decision_function method returns the distance of samples to the separating hyperplane. Some predictors also provide a predict_proba method which returns class probabilities. Predictors provide a score function to assess their performance on a batch of input data.
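
The predictor methods can be sketched as follows, again on made-up data with a linear model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[-1.0], [4.0]])

clf = LogisticRegression().fit(X_train, y_train)

preds = clf.predict(X_test)             # hard class predictions
proba = clf.predict_proba(X_test)       # per-class probabilities
margin = clf.decision_function(X_test)  # signed distance to the hyperplane
acc = clf.score(X_train, y_train)       # mean accuracy on a labelled batch
```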

Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a transformer interface which defines a transform method. It takes as input some new data X_test and yields as output a transformed version of X_test. Preprocessing, feature selection, feature extraction, and dimensionality reduction are all provided as transformers within the library. Every transformer also allows fit(X_train).transform(X_train) to be written as fit_transform(X_train). The combined fit_transform method prevents repeated computations.
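
As a sketch, StandardScaler illustrates the transformer interface (any transformer would do):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 10.0],
                    [1.0, 20.0],
                    [2.0, 30.0]])

scaler = StandardScaler()

# Equivalent to scaler.fit(X_train).transform(X_train), but the column
# statistics are computed only once.
X_scaled = scaler.fit_transform(X_train)
```

After fitting, the learned statistics are exposed as scaler.mean_ and scaler.scale_, following the trailing-underscore convention.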

Some machine learning methods - ensemble methods, or multiclass and multilabel classification schemes built on a binary classifier - are implemented as meta-estimators. They take as input an existing base estimator and use it internally for learning and making predictions.
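
For instance, a one-vs-rest scheme wraps a binary classifier and fits one copy of it per class (the data here is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 1, 2, 2])  # three classes

# The meta-estimator takes a base binary classifier as input and
# internally fits one clone of it per class.
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

assert len(ovr.estimators_) == 3  # one fitted classifier per class
```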

Composition of estimators can be done in scikit-learn sequentially through Pipeline objects or in a parallel fashion through FeatureUnion objects. Pipeline objects chain multiple estimators into a single one. FeatureUnion objects combine multiple transformers into a single one that concatenates their outputs.
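
Both forms of composition can be sketched on random made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = (X[:, 0] > 0.5).astype(int)

# Sequential composition: scale the features, then classify.
# The Pipeline itself behaves like a single estimator.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())]).fit(X, y)

# Parallel composition: run PCA and scaling side by side and
# concatenate their outputs column-wise.
union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("scale", StandardScaler())])
X_combined = union.fit_transform(X)

assert X_combined.shape == (20, 7)  # 2 PCA columns + 5 scaled columns
```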

The problem of model selection is to find, within some hyper-parameter space, the best combination of hyper-parameters with respect to some user-specified criterion. In scikit-learn, model selection is supported by two distinct meta-estimators, GridSearchCV and RandomizedSearchCV. They take as input an estimator (basic or composite), whose hyper-parameters must be optimized, and a set of hyper-parameter settings to search through. Randomized search avoids the combinatorial explosion of grid search by sampling a fixed number of candidates from its parameter distributions.
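
Because the search object is itself an estimator, it is fit like any other; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(30, 3)
y = (X[:, 0] > 0.5).astype(int)

# GridSearchCV evaluates every combination in the grid by
# cross-validation and refits the best one on the full data.
search = GridSearchCV(LogisticRegression(),
                      {"C": [0.1, 1.0, 10.0]},
                      cv=3)
search.fit(X, y)

best_C = search.best_params_["C"]  # the selected hyper-parameter value
```

RandomizedSearchCV has the same interface but takes parameter distributions and an n_iter budget instead of an exhaustive grid.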

Estimators are defined by interface, not by inheritance, where the interface is entirely implicit as far as the programming language is concerned. Duck typing allows both for extensibility and flexibility: as long as an estimator follows the API and conventions outlined, it can be used in lieu of a built-in estimator.
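
A sketch of a hypothetical custom estimator (MeanPredictor is an invented name) that follows the conventions by duck typing alone, with no base class:

```python
import numpy as np

class MeanPredictor:
    """A toy regressor that always predicts the training mean.

    It follows the scikit-learn conventions purely by duck typing:
    no inheritance from any library base class is required.
    """

    def __init__(self, shift=0.0):
        self.shift = shift  # the constructor only stores hyper-parameters

    def fit(self, X, y):
        # Learned parameters get a trailing underscore; fit returns self.
        self.mean_ = float(np.mean(y)) + self.shift
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

    def get_params(self, deep=True):
        return {"shift": self.shift}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

model = MeanPredictor().fit(np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0]))
```

Because it exposes fit, predict, get_params, and set_params, such an object can be dropped into tools like Pipeline or GridSearchCV in place of a built-in estimator.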
