Random Decision Forests
Looking for more research papers to read, I went back through my Hands-On Machine Learning notes and collected the many papers referenced there. This is one of those papers. These papers are mainly on machine learning and deep learning topics.
Reference: Random Decision Forests Paper
Introduction
Decision trees are attractive classifiers due to their high execution speed. But trees derived with traditional methods often cannot be grown to arbitrary complexity without a possible loss of generalization accuracy on unseen data. The limitation on complexity usually means suboptimal accuracy on training data. Following the principles of stochastic modeling, this paper proposes a method to construct tree-based classifiers whose capacity can be arbitrarily expanded for increases in accuracy on both training data and unseen data. The essence of the method is to build multiple trees in randomly selected subspaces of the feature space. Trees in different subspaces generalize their classification in complementary ways, and their combined classification can be monotonically improved.
Decision tree classifiers are attractive because of their many advantages - the idea is intuitively appealing, training is often straightforward, and best of all, classification is extremely fast. Many studies propose heuristics to construct a tree for optimal classification accuracy or to minimize its size. Yet trees constructed from fixed training data are prone to be overly adapted to that data. Pruning back a fully-grown tree may increase generalization accuracy on unseen data, often at the expense of accuracy on the training data.
There is a fundamental limitation on the complexity of tree classifiers - they should not be grown so complex that they overfit the training data. This paper describes a method to overcome this apparent limitation. The paper illustrates the ideas using oblique decision trees, which are convenient for optimizing training set accuracy.
Notes
Binary decision trees studied in prior literature often use a single feature at each nonterminal (decision) node. A test point is assigned to the left or right branch according to its value on that feature. Geometrically this means splitting the feature space with hyperplanes that are parallel to the axes. Oblique decision trees are more general in that the hyperplanes are not necessarily parallel to any of the axes. Each hyperplane is represented by a linear function of the feature components. Using oblique hyperplanes usually yields a smaller tree that can fully split the data into leaves containing a single class. The sizes of the trees may differ drastically depending on how the hyperplanes are selected.
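To make the distinction concrete, here is a minimal sketch (my own illustration, not code from the paper) contrasting the two kinds of split tests: an axis-parallel test on a single feature versus an oblique test given by a linear function of all feature components. The function names and example numbers are made up.

```python
import numpy as np

def axis_parallel_split(x, feature_index, threshold):
    """Single-feature test: the decision boundary is a hyperplane
    parallel to the feature axes."""
    return x[feature_index] <= threshold

def oblique_split(x, weights, bias):
    """Oblique test: the decision boundary w.x + b = 0 is a general
    hyperplane, not necessarily parallel to any axis."""
    return float(np.dot(weights, x) + bias) <= 0.0

# A point the two tests treat differently (illustrative values only).
x = np.array([0.4, 0.9])
print(axis_parallel_split(x, feature_index=0, threshold=0.5))   # True: x[0] <= 0.5
print(oblique_split(x, np.array([1.0, 1.0]), bias=-1.0))        # False: 0.4 + 0.9 - 1.0 > 0
```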
Most of the sophistication in tree-growing algorithms goes into minimizing the tree size, but this offers little promise for generalization accuracy. This paper starts with two simple methods for tree construction, neither of which involves any sophisticated optimization procedure.
The first method for tree growing finds a splitting hyperplane among those that are perpendicular to a line connecting two data clusters. It aims at separating at least two classes at each nonterminal node. The second method uses the fixed-increment perceptron training algorithm to choose the hyperplane at each nonterminal node. Both tree-growing methods are able to grow complex trees that perfectly classify the training data. Yet because of the biases of the particular ways in which the hyperplanes are chosen, the generalization accuracy on unseen data is rarely as good as the accuracy on the training data.
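Here is a rough sketch of the two hyperplane-selection heuristics, under my own simplifying assumptions: I use class centroids as the "two data clusters" and place the hyperplane at the midpoint between them, and I implement a plain fixed-increment perceptron with labels in {-1, +1}. The paper's actual procedures may differ in these details.

```python
import numpy as np

def centroid_split(X_a, X_b):
    """Hyperplane perpendicular to the line joining the centroids of two
    classes, placed at the midpoint of that line (midpoint placement is
    an assumption). Returns (w, b) defining the test w.x + b <= 0."""
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    w = mu_b - mu_a                        # normal vector along the connecting line
    b = -np.dot(w, (mu_a + mu_b) / 2.0)    # plane passes through the midpoint
    return w, b

def fixed_increment_perceptron(X, y, epochs=100):
    """Fixed-increment perceptron: add or subtract each misclassified point
    (y in {-1, +1}) until the data are separated or the epoch budget runs out."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias term
    w = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X_aug, y):
            if yi * np.dot(w, xi) <= 0:            # misclassified (or on the plane)
                w += yi * xi                       # fixed-increment update
                mistakes += 1
        if mistakes == 0:
            break
    return w[:-1], w[-1]                           # (weights, bias)
```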
Randomization has been a powerful tool for introducing differences in classifiers. Previously it has been used to initialize training algorithms with different configurations that eventually yield different classifiers. The method used in this paper to create multiple trees is to construct trees in randomly selected subspaces of the feature space. For a given feature space of m dimensions, there are 2^m subspaces in which a decision tree can be constructed, and the use of randomization in selecting components of the feature space is merely a convenient way to explore these possibilities. A decision tree is constructed in each selected subspace using the entire training set and the algorithms given in the previous section. Notice that each of these trees classifies the training data 100% correctly. Yet the classification is invariant for points that differ from the training points only in the unselected dimensions, so each tree generalizes its classification in a different way. In the theory of stochastic modeling, classification accuracies are related to the statistical properties of the combination function, and it is shown that very high accuracies can be achieved far before all the possible combinations are used.
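As a concrete (and deliberately simplified) sketch of the random subspace idea: each tree sees only a random subset of the feature dimensions but is trained on the entire training set and grown until its leaves are pure. I substitute scikit-learn's standard DecisionTreeClassifier (axis-parallel splits) for the paper's oblique trees and combine the trees by majority vote rather than the paper's averaged discriminant; the class name and parameters are my own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomSubspaceForest:
    """Illustrative sketch: full-depth trees, each built on a random subspace
    of the features, combined by majority vote."""

    def __init__(self, n_trees=50, subspace_dim=None, seed=0):
        self.n_trees = n_trees
        self.subspace_dim = subspace_dim     # how many feature dimensions each tree sees
        self.rng = np.random.default_rng(seed)
        self.trees = []                      # list of (feature_indices, fitted_tree)

    def fit(self, X, y):
        n_features = X.shape[1]
        k = self.subspace_dim or max(1, n_features // 2)
        self.classes_ = np.unique(y)
        for _ in range(self.n_trees):
            dims = self.rng.choice(n_features, size=k, replace=False)
            tree = DecisionTreeClassifier()  # default settings grow until leaves are pure
            tree.fit(X[:, dims], y)          # entire training set, selected dimensions only
            self.trees.append((dims, tree))
        return self

    def predict(self, X):
        # Combine the trees: each votes for a class on its own subspace.
        votes = np.zeros((len(X), len(self.classes_)))
        for dims, tree in self.trees:
            preds = tree.predict(X[:, dims])
            for ci, c in enumerate(self.classes_):
                votes[:, ci] += (preds == c)
        return self.classes_[np.argmax(votes, axis=1)]
```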
The problem of poor generalization can be overcome by the use of multiple trees.