Decision trees are one of the most popular machine learning algorithms in use today. In this post you will learn what the components of a decision tree are, ways to prevent overfitting, and when decision trees are the best choice. In part 2 you will learn how decision trees are implemented.
Decision trees are named as such because the output model takes the form of a tree structure. Classification trees output class labels and regression trees output continuous values. A decision tree asks a series of questions, narrowing down the possible labels at each level of the tree until you reach your answer. Not all features in the data need to be used in the final model.
Components of a Decision Tree
A decision tree is made up of three components:
A node is each question the decision tree asks. The root node is the topmost node and should be the question that results in the most information gain. Each node simply picks an attribute and asks one question about that attribute. How this is done depends on the exact implementation of the decision tree.
Edges are the lines leaving the nodes. These lines represent the path from the node the decision tree will take depending on the answer to the question asked by the node. The edges are the possible values the question asked can result in.
Leaves are at the end of edges and represent the output of that path along the decision tree. When you reach a leaf by working through the decision tree that is what you output for that sample.
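To make the three components concrete, here is a minimal sketch of a decision tree as plain Python data: internal nodes hold a question, the "yes"/"no" branches play the role of edges, and leaves hold the output. The attribute names and thresholds are made up for illustration.

```python
def predict(node, sample):
    """Walk from the root to a leaf, following one edge per answered question."""
    while isinstance(node, dict):               # internal node: ask its question
        attribute, threshold = node["question"]
        # the answer to the question picks which edge (subtree) to follow
        node = node["yes"] if sample[attribute] <= threshold else node["no"]
    return node                                 # a leaf: the output for this path

# The root node asks about petal_length; its edges lead to a leaf and a second node.
tree = {
    "question": ("petal_length", 2.5),
    "yes": "setosa",                            # leaf
    "no": {
        "question": ("petal_width", 1.7),
        "yes": "versicolor",                    # leaf
        "no": "virginica",                      # leaf
    },
}

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.3}))  # virginica
```

Notice that the second sample never needed all of its attributes compared at every level: each path only asks the questions along its own edges.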
Best Time to Use Decision Trees
Decision trees are one of the most popular machine learning algorithms because they can produce very easy to understand models and are therefore easier to explain to non-technical stakeholders. They also implicitly perform feature selection and require less data preparation than other machine learning algorithms.
Feature selection is done by the algorithm that implements a decision tree. In general, the decision tree will determine the best attribute to split on at each level in the tree, meaning the top most node (root node) is often the most important attribute.
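As a quick sketch of this implicit feature selection, scikit-learn's DecisionTreeClassifier (one common implementation; the post doesn't assume a particular library) exposes how much each attribute contributed via feature_importances_. Features the tree never split on score zero.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1; unused features get 0.
for name, importance in zip(load_iris().feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

On this dataset you will typically see the petal measurements dominate while the sepal features contribute little, without any manual feature selection step.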
Decision trees require less data preparation. There is no need to worry about the scale of your numerical variables as you would in other regression models that use coefficient values as weights. Missing values in your data will not prevent the decision tree from running. Decision trees do not make assumptions about the distribution of your data and are therefore "immune" to skewed distributions or outliers in your data.
Decision trees are rather easy to overfit. The topmost attributes tend to perform best, but as you add more levels to the tree, the lower splits are based on fewer and fewer samples, tend to perform weaker, and lead to overfitting. Simply setting a max depth beforehand or pruning the tree afterward is a way to prevent overfitting.
Decision tree algorithms are what is known as 'greedy' algorithms, which means at each iteration they only consider the best attribute to split on at the current step, not how that choice affects splits further down. This means the tree may get stuck at a local optimum, but the upside is that it is much faster to train. To overcome this problem, use an ensemble learner such as a random forest, which creates many different decision trees and aggregates them to determine one output.
By default, a decision tree will grow until every sample in a leaf has the same output or there are no more attributes left to split on. This can result in extremely big and specific decision trees that have overfit the data. In order to make a model that will generalize well to unseen data (have good predictive power), we will want to simplify the decision tree and prevent overfitting. This is largely accomplished by setting a max depth or a minimum sample size before the tree is built, or by pruning the tree after it is made.
Max depth refers to how many levels the tree has. Once the tree reaches the maximum depth, it will stop creating further nodes and aggregate the remaining values into a leaf, in which the majority of values becomes the output for that leaf.
A minimum sample size will cause a leaf to be made as soon as a branch contains that many samples or fewer.
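Both of these pre-construction limits are typically exposed as hyperparameters. A minimal sketch using scikit-learn's DecisionTreeClassifier (one possible implementation; the parameter names are specific to that library):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained: grows until every leaf is pure or attributes run out.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained: stop at depth 3 and require at least 5 samples in every leaf.
small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                               random_state=0).fit(X, y)

print(full.get_depth(), small.get_depth())
```

The constrained tree is guaranteed to be no deeper than 3 levels; the unconstrained one will usually grow deeper on the same data.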
Pruning is done after the full decision tree is made. To prune, branches from the bottom (the ones with relatively low importance) are removed. Continue to remove branches so long as the model accuracy does not decrease. Smaller and simpler trees are preferred.
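One concrete variant of this idea is minimal cost-complexity pruning, which scikit-learn exposes through the ccp_alpha parameter (larger alpha prunes away more of the low-importance branches). This is a sketch of that specific mechanism, not the only way to prune:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# ccp_alpha=0.0 (the default) leaves the fully grown tree untouched.
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)

# A positive ccp_alpha removes branches whose contribution doesn't justify
# the added complexity, yielding a smaller, simpler tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

print(unpruned.get_n_leaves(), pruned.get_n_leaves())
```

In practice you would choose the alpha value with cross-validation, keeping the largest alpha whose accuracy on held-out data has not decreased.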
Ensemble learning is the process by which multiple models are created and combined together to get one output. In machine learning, there are two very common ensemble methods, bootstrap aggregating (aka bagging) and boosting. A random forest is a type of ensemble learner that specifically uses decision trees.
Bootstrap Aggregating (Bagging)
Many models are created by training each model on a randomly drawn subset from the original training data. The models are then combined by having each model vote with equal weight on the output.
- Random Forest – combines bagging with randomly selecting features in a decision tree
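A minimal bagging sketch, using scikit-learn's BaggingClassifier (whose default base model happens to be a decision tree; the dataset is just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 25 trees is trained on a bootstrap sample (a random draw with
# replacement) of the training data; predictions are combined by an
# equal-weight vote across the trees.
bag = BaggingClassifier(n_estimators=25, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```

A random forest goes one step further than plain bagging: in addition to bootstrapping the rows, it also considers only a random subset of features at each split.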
Boosting
Many models are created in sequence, with each new model giving more weight to the training samples that the previous model got wrong. Boosting can often be more accurate than bagging but is also more likely to overfit.
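A sketch of boosting with scikit-learn's AdaBoostClassifier, whose default base model is a depth-1 decision tree (again an illustrative library choice, not the only boosting algorithm):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 shallow trees built in sequence; each round reweights the training
# samples so the next tree focuses on the ones the previous trees missed.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))
```

Because later models chase the hardest (and sometimes noisiest) samples, boosting is the ensemble more likely to overfit, which is why the number of rounds is usually tuned on held-out data.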
In the next post of this series, you will learn how the decision tree algorithm is implemented and what it means to pick the “best” attribute. Click here for part 2.