Reading the XGBoost Docs

I wanted to familiarize myself with XGBoost because I read that it is used to win a lot of Kaggle competitions involving tabular data.


XGBoost Documentation

XGBoost is an optimized gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments and can solve problems beyond billions of examples.

Installation

XGBoost provides binary packages for some language bindings. The binary packages support the GPU algorithm (device=cuda:0) on machines with NVIDIA GPUs. Note that training with multiple GPUs is only supported on Linux platforms.

$ # Pip 21.3+ is required
$ pip install xgboost

For the command above, you might need to run it with the --user flag or use virtualenv if you run into permission errors. If you don't have access to a GPU and you want to decrease the size of the installed packages and save disk space:

$ pip install xgboost-cpu

Getting Started with XGBoost

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# read data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size=0.2)
# create model instance
bst = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
# fit model
bst.fit(X_train, y_train)
# make predictions
preds = bst.predict(X_test)
print(accuracy_score(y_test, preds))
Output:

0.9333333333333333

GPU Support

Most of the algorithms in XGBoost, including training, prediction, and evaluation, can be accelerated with CUDA-capable GPUs. To enable GPU acceleration, specify the device parameter as cuda. In addition, the device ordinal (which GPU to use if you have multiple devices in the same node) can be specified using the cuda:[ordinal] syntax, where [ordinal] is an integer that represents the device ordinal (default 0).

# Using the GPU with the native interface
import xgboost

params = {"device": "cuda", "tree_method": "hist"}
# X and y are a feature matrix and labels, e.g. from the example above
Xy = xgboost.QuantileDMatrix(X, y)
booster = xgboost.train(params, Xy)

# With the scikit-learn interface
from xgboost import XGBRegressor
reg = XGBRegressor(tree_method="hist", device="cuda")

XGBoost supports distributed GPU training using Dask, Spark and PySpark.
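
To make the Dask route concrete, here is a minimal sketch using xgboost's dask module. The LocalCUDACluster, the synthetic arrays, and the chunk sizes are all assumptions for illustration; it requires dask, distributed, and dask-cuda to be installed alongside xgboost.

# Sketch: distributed GPU training with Dask (synthetic data)
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

with LocalCUDACluster() as cluster, Client(cluster) as client:
    # chunked arrays that Dask distributes across the GPU workers
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.random(100_000, chunks=(10_000,))
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"device": "cuda", "tree_method": "hist"},
        dtrain,
        num_boost_round=10,
    )
    booster = output["booster"]  # trained model; output also holds eval history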

Introduction to Boosted Trees

XGBoost stands for "Extreme Gradient Boosting", where the term "Gradient Boosting" originates from the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman.

Gradient boosted trees have been around for a while, and there are a lot of materials on the topic. This tutorial explains boosted trees in a self-contained, principled way using the elements of supervised learning.

Elements of Supervised Learning

XGBoost is used for supervised learning problems, where we use the training data $\mathbf{x}_i$ to predict a target variable $y_i$. The model in supervised learning usually refers to the mathematical structure by which the prediction $y_i$ is made from the input $\mathbf{x}_i$. The parameters are the undetermined part that we need to learn from data. With judicious choices for $y_i$, we may express a variety of tasks, such as regression, classification, and ranking. The task of training the model amounts to finding the best parameters $\theta$ that fit the training data $\mathbf{x}_i$ and labels $y_i$. In order to train the model, we need to define an objective function to measure how well the model fits the training data. A salient characteristic of objective functions is that they consist of two parts: training loss and regularization term:

$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $\Omega$ is the regularization term. The training loss measures how predictive the model is with respect to the training data. The regularization term controls the complexity of the model, which helps us to avoid overfitting.
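
This split maps directly onto XGBoost's parameters. As a rough sketch (the specific values below are arbitrary defaults for illustration), the objective parameter selects the training loss, while reg_lambda, reg_alpha, and gamma control the regularization side:

# objective picks the training loss L; the other parameters shape Ω
from xgboost import XGBRegressor

model = XGBRegressor(
    objective="reg:squarederror",  # training loss: squared error
    reg_lambda=1.0,                # L2 penalty on leaf weights
    reg_alpha=0.0,                 # L1 penalty on leaf weights
    gamma=0.0,                     # minimum loss reduction required to make a split
)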

Bias Variance Tradeoff

The general principle is that we want both a simple and predictive model. The tradeoff between the two is also referred to as the bias-variance tradeoff in machine learning.
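
To make the tradeoff concrete, here is a small sketch (reusing the iris data from the getting-started example; the depth values are arbitrary) that sweeps max_depth: shallow trees tend to underfit (high bias), while deep trees can fit the training set much better than the test set (high variance).

# Sketch: train vs. test accuracy as model complexity (max_depth) grows
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], test_size=0.2, random_state=0
)
for depth in (1, 3, 6, 10):
    model = XGBClassifier(n_estimators=50, max_depth=depth).fit(X_train, y_train)
    print(depth, model.score(X_train, y_train), model.score(X_test, y_test))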

Decision Tree Ensembles

The model choice of XGBoost is decision tree ensembles. The tree ensemble consists of a set of classification and regression trees (CART). Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the predictions of multiple trees together. Mathematically, the tree ensemble model can be written as:

$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in F$$

where $K$ is the number of trees, $f_k$ is a function in the functional space $F$, and $F$ is the set of all possible CARTs. The objective function to be optimized is given by:

$$\text{obj}(\theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K w(f_k)$$

Random forests and boosted trees are really the same models; the difference arises from how we train them.
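
To make the ensemble-sum idea concrete, here is a toy illustration, not XGBoost's actual learned trees: tree_1 and tree_2 are hypothetical hand-written stumps, and the prediction for an input is just the sum of their outputs.

# Toy illustration of y_hat = f_1(x) + f_2(x); the stumps are made up
def tree_1(x):
    # hypothetical stump splitting on feature 0
    return 2.0 if x[0] < 5.0 else -1.0

def tree_2(x):
    # hypothetical stump splitting on feature 1
    return 0.5 if x[1] >= 3.0 else -0.5

trees = [tree_1, tree_2]
x = [4.2, 3.1]
y_hat = sum(f(x) for f in trees)  # 2.0 + 0.5 = 2.5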

In the context of information retrieval, learning-to-rank aims to train a model that arranges a set of query results into an ordered list. For supervised learning-to-rank, the predictors are sample documents encoded as a feature matrix, and the labels are relevance degrees for each sample.
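
As a rough sketch of what that looks like with XGBoost's scikit-learn interface: the features, relevance labels, and query ids below are synthetic, with qid marking which query each document belongs to.

# Sketch: learning-to-rank with XGBRanker on synthetic data
import numpy as np
from xgboost import XGBRanker

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))      # 12 documents, 4 features each
y = rng.integers(0, 3, size=12)   # relevance degree per document
qid = np.repeat([0, 1, 2], 4)     # 3 queries, 4 documents each (sorted by qid)

ranker = XGBRanker(objective="rank:ndcg", n_estimators=10)
ranker.fit(X, y, qid=qid)
scores = ranker.predict(X)  # higher score = ranked earlier within its query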
