Posted by dcoetzee on May 7th, 2009
Citation: Leo Breiman. Bagging predictors. Machine Learning, 24, 2 (Aug. 1996), 123-140.
Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
Discussion: This 1996 machine learning paper described a very simple mechanism called bagging (short for “bootstrap aggregating“) for improving the accuracy of predictors by averaging the results of a number of independent predictors trained over samples from the same training set.
Suppose we’re creating a decision tree to predict, based on a number of medical tests, whether or not a patient has cancer. A typical way to do this is to gather some cases from a medical database, then use them to train a predictor model, such as a decision tree or neural network.
The issue with this straightforward approach is that many predictors used in machine learning, including decision trees and neural networks, are unstable: training them on a slightly different data set can produce significantly different results. For example, if we split our training set in half, and use each half to build a decision tree, those two decision trees are likely to be quite different and yield different predictions. This is a result of overfitting, the phenomenon where a model fits itself too well to the training data, and not well enough to the complete population of past and future cases underlying the sample.
To overcome this, we introduce ideas from a statistical technique called bootstrapping. To motivate this concept, suppose you have a random sample of 1000 test scores from a pool of 1 million, and you want to estimate what the median test score is. Because we can’t know this exactly, we instead seek a confidence interval that has a 98% chance of containing the median. Finding a confidence interval for the mean test score is easy using standard techniques, because the sample mean and population mean are analytically closely related, but the same is not true of the median.
When analysis fails, we take a more computational approach. The idea is to use the sample itself as an approximate model of the underlying distribution: we can replicate each score 1000 times to obtain a pool of 1 million scores that will statistically resemble the real 1 million scores. We then take (say) 200 independent samples of size 1000 from this pool, and compute their medians. Finally, we take the 1% and the 99% percentiles of these median scores to get our confidence interval. In practice, rather than duplicating scores many times, it’s more typical to simply sample with replacement, so that each score can be chosen multiple times. Provided the original sample is a good empirical approximation of the complete population, the result is a very good prediction of the actual median.
What does this have to with machine learning? Well, suppose instead of test scores we have medical test results, and instead of the median, we’re estimating whether they have cancer. We draw (with replacement) bootstrapping samples from the original data, and for each one we use it as training data to construct a predictor. Then, for new cases we take a majority vote among the predictions to determine the overall prediction. Provided that the data is a good empirical approximation of the complete population of past and future cases, the result will be a good prediction that resists overfitting.
This technique is extremely general, and can be applied to virtually any type of classification problem and any model. It can even use completely different models for different bootstrap samples, which can make it easy to sidestep difficult questions about choosing good models and model parameters. It has three main disadvantages:
- Performance: For each of the bootstrap samples, we have to construct a separate model from scratch, and each query must bear the overhead of querying each of these models and combining their results. Fortunately, bagging is easily parallelizable – because of this, and because it can extract good accuracy from fast-to-query models like decision trees, it can actually be more efficient in some cases.
- Relies on instability: If the model being used is stable, or in other words does not change significantly between bootstrap samples, the resulting system can actually be less accurate than a model built directly on the original data. An example of a stable model is a nearest-neighbor classifier. To vividly illustrate this point, the paper describes an experiment with a linear regression model where the stability of the model depends on the number of variables used. Sure enough, there is a crossover point in accuracy between the bagged and the unbagged versions of this predictor.
- Interpretability: One of the widely cited advantages of models like the decision tree is that it’s easy for a human to understand how it makes it decisions – like a flowchart, it just performs a series of tests. But when you have twenty decision trees, each making different decisions on the same inputs, it’s harder to get an intuitive feel for their aggregate decision making – it’s like predicting the decision that a committee will make by learning about its members.
Bagging is just one of a family of resampling ensemble methods; another popular one is called boosting, a technique that iteratively combines multiple predictors by encouraging each new model to focus on cases misclassified by previous models. Boosting can learn more quickly than bagging because of its focus on misclassified cases, but is also less robust to errors in training data. A 2005 paper by Kotsiantis and Pintelas described a scheme for combining the two (“Combining Bagging and Boosting”, International Journal of Computational Intelligence, PDF).
Bagging has already played a strong role in the design of machine learning systems. Now that we’ve hit the clock speed barrier and are seeing more highly parallel and multicore systems, it’s bound to become an even more attractive option. Keep it in mind if you ever need to design a simple but accurate machine learning system.
The author releases all rights to all content herein and grants this work into the public domain, with the exception of works owned by others such as abstracts, quotations, and WordPress theme content.