Reading the XGBoost Docs

I wanted to familiarize myself with XGBoost because I read that it is used to win a lot of Kaggle competitions that involve tabular data.

XGBoost Documentation

XGBoost is an optimized gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments and can solve problems beyond billions of examples.

Installation

XGBoost provides binary packages for some language bindings. The binary packages support the GPU algorithm (device=cuda:0) on machines with NVIDIA GPUs. Note that training with multiple GPUs is only supported on Linux.

$ # Pip 21.3+ is required
$ pip install xgboost

For the command above, you might need to add the --user flag or use virtualenv if you run into permission errors. If you don't have access to a GPU and want to reduce the size of the installed package and save disk space, install xgboost-cpu instead:

$ pip install xgboost-cpu

Getting Started with XGBoost

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# read data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size=.2)
# create model instance (iris has three classes, so use a multiclass objective)
bst = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='multi:softprob')
# fit model
bst.fit(X_train, y_train)
# make predictions
preds = bst.predict(X_test)
print(accuracy_score(y_test,preds))
Output:

0.9333333333333333
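To make the printed score concrete: iris has 150 rows, so test_size=.2 holds out 30 of them, and accuracy is the fraction of those rows predicted correctly. The sketch below recomputes the metric by hand. The label lists are made up for illustration and are not the actual split; the helper `accuracy` is a hypothetical stand-in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Illustrative labels only: 30 test rows, 10 per class.
y_true = [0] * 10 + [1] * 10 + [2] * 10
y_pred = list(y_true)
y_pred[12] = 2  # one class-1 row misclassified as class 2
y_pred[25] = 1  # one class-2 row misclassified as class 1

score = accuracy(y_true, y_pred)
print(score)  # 28 correct out of 30
```

Because train_test_split is called without a random_state, the split changes on every run, so the score above will likely differ from one run to the next.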

GPU Support

Most of the algorithms in XGBoost, including training, prediction, and evaluation, can be accelerated with CUDA-capable GPUs. To enable GPU acceleration, specify the device parameter as cuda. In addition, the device ordinal (which GPU to use if you have multiple devices in the same node) can be specified using the cuda:[ordinal] syntax, where [ordinal] is an integer that represents the device ordinal (default 0).

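import xgboost
from xgboost import XGBRegressor
from sklearn.datasets import make_regression

# Placeholder data; any feature matrix X and label vector y will do
X, y = make_regression(n_samples=1000, n_features=10)

# Using GPU
params = dict()
params["device"] = "cuda"
params["tree_method"] = "hist"
Xy = xgboost.QuantileDMatrix(X, y)
xgboost.train(params, Xy)
# With Scikit-Learn Interface
XGBRegressor(tree_method="hist", device="cuda")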
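The cuda:[ordinal] syntax described above can be folded into a small helper. This is only a sketch: `gpu_params` is a hypothetical name, not part of XGBoost, and all it does is build the parameter dictionary that would then be passed to xgboost.train.

```python
def gpu_params(ordinal=None, tree_method="hist"):
    """Build a params dict that targets a CUDA device.

    ordinal=None -> "cuda"   (no explicit ordinal; the default is 0)
    ordinal=1    -> "cuda:1" (second GPU on the node)
    """
    if ordinal is None:
        device = "cuda"
    else:
        if not isinstance(ordinal, int) or ordinal < 0:
            raise ValueError("ordinal must be a non-negative integer")
        device = f"cuda:{ordinal}"
    return {"device": device, "tree_method": tree_method}


print(gpu_params())
print(gpu_params(1))
```

Keeping the device string in one place makes it easy to switch GPUs without touching the rest of the training code.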

XGBoost supports distributed GPU training using Dask, Spark and PySpark.

Introduction to Boosted Trees

XGBoost stands for "Extreme Gradient Boosting", where the term "Gradient Boosting" originates from the paper Greedy Function Approximation, by Friedman.

Gradient boosted trees have been around for a while, and there are a lot of materials on the topic. This tutorial explains boosted trees in a self-contained, principled way using the elements of supervised learning.

Elements of Supervised Learning

XGBoost is used for supervised learning problems, where we use the training data $\textbf{x}_i$ to predict a target variable $y_i$. The model in supervised learning usually refers to the mathematical structure by which the prediction $y_i$ is made from the input $\textbf{x}_i$. The parameters are the undetermined part that we need to learn from data. With judicious choices for $y_i$, we may express a variety of tasks, such as regression, classification, and ranking. The task of training the model amounts to finding the parameters $\theta$ that best fit the training data $\textbf{x}_i$ and labels $y_i$. In order to train the model, we need to define the objective function to measure how well the model fits the training data. A salient characteristic of objective functions is that they consist of two parts, training loss and regularization term:

$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $\Omega$ is the regularization term. The training loss measures how predictive the model is with respect to the training data. The regularization term controls the complexity of the model, which helps us to avoid overfitting.
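As a minimal sketch of the two-part objective, assume a one-parameter linear model y_hat = theta * x, mean squared error as the training loss, and an L2 penalty as the regularization term. These choices, the function names, and the data are illustrative assumptions, not what XGBoost uses internally.

```python
def training_loss(theta, xs, ys):
    """L(theta): mean squared error of the linear model y_hat = theta * x."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def regularization(theta, lam):
    """Omega(theta): L2 penalty that grows with the size of the parameter."""
    return lam * theta ** 2


def objective(theta, xs, ys, lam):
    """obj(theta) = L(theta) + Omega(theta)."""
    return training_loss(theta, xs, ys) + regularization(theta, lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so theta = 2 has zero loss

print(objective(2.0, xs, ys, lam=0.0))  # training loss only
print(objective(2.0, xs, ys, lam=0.1))  # training loss plus penalty 0.1 * 2**2
```

Even a parameter that fits the training data perfectly pays a cost once the penalty is switched on, which is how the regularization term pushes toward simpler models.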

Bias Variance Tradeoff

The general principle is that we want a model that is both simple and predictive. The tradeoff between the two is also referred to as the bias-variance tradeoff in machine learning.
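One way to see the tension between simple and predictive is to vary the strength of a penalty and watch the fit change. The sketch below assumes a one-parameter linear model with an L2 penalty, for which the minimizer has a closed form. The data, the helper names, and the choice of penalty are all illustrative assumptions.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimizer of sum((y - theta * x)**2) + lam * theta**2 (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def sse(theta, xs, ys):
    """Sum of squared errors on the training data."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up noisy data, roughly y = 2x

for lam in (0.0, 1.0, 10.0, 100.0):
    theta = fit_ridge_1d(xs, ys, lam)
    print(f"lam={lam:6.1f}  theta={theta:.3f}  training SSE={sse(theta, xs, ys):.3f}")
```

As lam grows, theta shrinks toward zero (a simpler model) and the training error rises (a less predictive one on this data). Choosing lam is choosing a point on that tradeoff.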

Decision Tree Ensembles

The model choice of XGBoost is decision tree ensembles. The tree ensemble consists of a set of classification and regression trees (CART). Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the predictions of multiple trees together. Mathematically, the decision tree ensemble model can be written:

$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in F$$

where $K$ is the number of trees, $f_k$ is a function in the functional space $F$, and $F$ is the set of all possible CARTs. The objective function to be optimized is given by:

$$\text{obj}(\theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K w(f_k)$$
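The two formulas above can be sketched with hand-written trees. Everything here is an illustrative assumption: the features, the leaf values, squared error as the loss l, and a complexity term w(f_k) that charges a fixed penalty per leaf. The text does not say how w is defined, so this is just one simple choice.

```python
def tree_1(x):
    """Depth-1 tree: split on age."""
    return 2.0 if x["age"] < 15 else -1.0


def tree_2(x):
    """Depth-1 tree: split on daily computer use."""
    return 0.9 if x["uses_computer_daily"] else -0.9


def predict(trees, x):
    """y_hat = sum over k of f_k(x)."""
    return sum(f(x) for f in trees)


def objective(trees, leaf_counts, data, penalty_per_leaf=0.5):
    """Sum of per-example squared errors plus a per-tree complexity penalty."""
    loss = sum((y - predict(trees, x)) ** 2 for x, y in data)
    complexity = sum(penalty_per_leaf * n for n in leaf_counts)
    return loss + complexity


# Made-up examples: (features, target)
data = [
    ({"age": 10, "uses_computer_daily": True}, 3.0),
    ({"age": 40, "uses_computer_daily": False}, -2.0),
]
trees = [tree_1, tree_2]
leaf_counts = [2, 2]  # each depth-1 tree has two leaves

print(predict(trees, data[0][0]))  # 2.0 + 0.9
print(objective(trees, leaf_counts, data))
```

Each tree contributes one leaf value per input, the ensemble prediction is their sum, and adding more or larger trees lowers the loss only at the price of a higher complexity term.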

Random forests and boosted trees are really the same models; the difference arises from how we train them.

Often in the context of information retrieval, learning-to-rank aims to train a model that arranges a set of query results into an ordered list. For supervised learning-to-rank, the predictors are sample documents encoded as a feature matrix, and the labels are the relevance degree for each sample.
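To illustrate the data layout, the sketch below groups documents by query and orders each group by a score. The query ids, document ids, relevance labels, and scores are all made up, and the `scores` dictionary is a stand-in for a trained ranker, which would compute scores from each document's feature vector.

```python
from collections import defaultdict

# Each row: (query_id, document_id, relevance_label). Made-up data.
rows = [
    ("q1", "doc_a", 0),
    ("q1", "doc_b", 2),
    ("q1", "doc_c", 1),
    ("q2", "doc_d", 1),
    ("q2", "doc_e", 0),
]

# Hypothetical model scores, one per document.
scores = {"doc_a": 0.1, "doc_b": 0.8, "doc_c": 0.5, "doc_d": 0.3, "doc_e": 0.7}


def rank_by_query(rows, scores):
    """Group documents by query and order each group by descending score."""
    grouped = defaultdict(list)
    for qid, doc, _label in rows:
        grouped[qid].append(doc)
    return {
        qid: sorted(docs, key=lambda d: scores[d], reverse=True)
        for qid, docs in grouped.items()
    }


ranking = rank_by_query(rows, scores)
print(ranking["q1"])  # agrees with the labels: doc_b, doc_c, doc_a
print(ranking["q2"])  # disagrees with the labels: doc_e is ranked above doc_d
```

Documents are only compared within the same query, so the relevance labels matter relative to each other inside a group rather than across groups.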
