Bagging Predictors
Looking for more research papers to read, I went through my Hands-On Machine Learning notes and collected the many papers referenced there; this is one of them. These papers are mainly on machine learning and deep learning topics.
Reference: Bagging Predictors Paper
Introduction
Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method: if perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
A learning set \(\mathscr{L}\) consists of data \(\{ (y_n,\textbf{x}_n),\ n=1 , \ldots , N \}\) where the \(y\)’s are either class labels or numerical responses. Assume we have a procedure for using this learning set to form a predictor \(\phi (\textbf{x},\mathscr{L})\) - if the input is \(\textbf{x}\) we predict \(y\) by \(\phi (\textbf{x},\mathscr{L})\). Now suppose we are given a sequence of learning sets \(\{\mathscr{L}_k\}\), each consisting of \(N\) independent observations from the same underlying distribution as \(\mathscr{L}\). Our mission is to use \(\{\mathscr{L}_k\}\) to get a better predictor than the single learning set predictor \(\phi (\textbf{x},\mathscr{L})\). The restriction is that all we are allowed to work with is the sequence of predictors \(\{\phi (\textbf{x},\mathscr{L}_k)\}\).
If \(y\) is numerical, an obvious procedure is to replace \(\phi (\textbf{x},\mathscr{L})\) by the average of \(\phi (\textbf{x},\mathscr{L}_k)\) over \(k\), i.e. by \(\phi_A (\textbf{x})=\text{E}_{\mathscr{L}}\,\phi (\textbf{x},\mathscr{L})\), where \(\text{E}_{\mathscr{L}}\) denotes the expectation over \(\mathscr{L}\) and the subscript \(A\) in \(\phi_A\) denotes aggregation. If \(\phi (\textbf{x},\mathscr{L}_k)\) predicts a class \(j\in \{1 , \ldots , J\}\), then one method of aggregating the \(\phi (\textbf{x}, \mathscr{L}_k)\) is by voting. Let \(N_j = \text{nr}\{k;\ \phi(\textbf{x},\mathscr{L}_k)=j\}\) and take \(\phi_A(\textbf{x})=\text{argmax}_j N_j\), that is, the \(j\) for which \(N_j\) is the maximum.
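As a minimal sketch of these two aggregation rules (the function and variable names are my own for illustration, not from the paper), averaging and plurality voting over a stack of per-replicate predictions could look like this:

```python
import numpy as np

def aggregate_regression(predictions):
    """Average the numerical predictions over the replicate predictors (the regression phi_A)."""
    # predictions: NumPy array of shape (K, N) -- K replicate predictors, N query points
    return predictions.mean(axis=0)

def aggregate_classification(predictions, n_classes):
    """Plurality vote over predicted class labels (the classification phi_A)."""
    # predictions: integer array of shape (K, N) with labels in {0, ..., n_classes - 1}
    votes = np.apply_along_axis(
        lambda column: np.bincount(column, minlength=n_classes), 0, predictions
    )  # votes[j, n] is N_j for query point n
    return votes.argmax(axis=0)
```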
Usually, though, we have a single learning set \(\mathscr{L}\) without the luxury of replicates of \(\mathscr{L}\). Still, an imitation of the process leading to \(\phi_A\) can be done. Take repeated bootstrap samples \(\{\mathscr{L}^{(B)}\}\) from \(\mathscr{L}\) and form \(\{\phi (\textbf{x},\mathscr{L}^{(B)})\}\). If \(y\) is numerical, take \(\phi_B\) as
\[\phi_B(\textbf{x}) = \text{av}_B\, \phi (\textbf{x},\mathscr{L}^{(B)})\]
where \(\text{av}_B\) denotes the average over the bootstrap replicates.
If \(y\) is a class label, let the \(\{\phi (\textbf{x},\mathscr{L}^{(B)}) \}\) vote to form \(\phi_B(\textbf{x})\). We call this procedure bootstrap aggregating and use the acronym bagging. The \(\{\mathscr{L}^{(B)}\}\) form replicate data sets, each consisting of \(N\) cases drawn at random, but with replacement, from \(\mathscr{L}\). Each \((y_n,\textbf{x}_n)\) may appear repeated times or not at all in any particular \(\mathscr{L}^{(B)}\). The \(\{\mathscr{L}^{(B)}\}\) are replicate data sets drawn from the bootstrap distribution approximating the distribution underlying \(\mathscr{L}\). A critical factor in whether bagging will improve accuracy is the stability of the procedure for constructing \(\phi\). If small changes in \(\mathscr{L}\) produce only small changes in \(\phi\), then \(\phi_B\) will be close to \(\phi\). Improvement will occur for unstable procedures where small changes in \(\mathscr{L}\) can result in large changes in \(\phi\).
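Here is a short sketch of the whole bagging loop for a numerical \(y\), assuming scikit-learn decision trees as the unstable base procedure (the tree, the array inputs, and the number of replicates are my illustrative choices, not values prescribed by the paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_tree_predict(X_train, y_train, X_test, n_bootstrap=25, seed=0):
    """Bagging for a numerical y: fit one tree per bootstrap replicate and average.

    X_train, y_train, X_test are assumed to be NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.empty((n_bootstrap, len(X_test)))
    for b in range(n_bootstrap):
        # Each replicate L^(B) has N cases drawn at random, with replacement, from L.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)
    # phi_B(x) = av_B phi(x, L^(B)): average the predictions over the bootstrap predictors.
    return preds.mean(axis=0)
```

For a class label \(y\), the only change is to fit classifiers and replace the final average with the plurality vote shown earlier.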
For unstable procedures - neural nets, classification and regression trees - bagging works well. The evidence, both experimental and theoretical, is that bagging can push a good but unstable procedure a significant step towards optimality. On the other hand, it can slightly degrade the performance of some stable procedures - like \(k\)-nearest neighbors.
Concluding Remarks
The natural competitor to bagging by voting in classification is to average the estimated class probabilities over the bootstrap replicates and predict the class with the highest average estimated probability. This estimate was computed in every classification example, and the resulting misclassification rate was always virtually identical to the voting misclassification rate.
The author claims that more bootstrap replicates are required when there are more classes, and that not as many are needed for regression. "Bagging is almost a dream procedure for parallel computing."
Bagging is a relatively easy way to improve an existing method, since all that needs adding is a loop in the front that selects the bootstrap sample and sends it to the procedure and a backend that does the aggregation. You lose a simple, interpretable structure, but you gain accuracy.
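That loop-plus-aggregation wrapper already exists off the shelf; for example, scikit-learn's BaggingClassifier (my choice of library, not something the paper mentions) bags an existing estimator in a few lines, and since the replicates are fit independently it also illustrates the parallel-computing point:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Default base estimator is a decision tree; the 50 bootstrap replicates are
# fit independently, so n_jobs=-1 spreads them across all available cores.
bagged = BaggingClassifier(n_estimators=50, n_jobs=-1, random_state=0)
print(cross_val_score(bagged, X, y, cv=5).mean())
```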