In part 1 of this series, you learned what the components of a decision tree are, what the advantages of using a decision tree are, how to prevent overfitting, and about ensemble learning. In this post, you will learn how the decision tree algorithm is implemented and what it means to pick the “best” attribute.
There are several different iterations of decision tree algorithms that are common. However, these decision tree algorithms mostly follow this general outline:
- Pick the “best” attribute
- Ask a question
- Follow the answer path
- Repeat until get an answer
What constitutes the “best” attribute to split on varies slightly by each specific algorithm but in general it is the attribute that results in the cleanest split of the data. Decision trees are known as “greedy algorithms.” This means that at each step the algorithm picks the optimally best choice given that step without necessarily thinking steps ahead. This can sometimes result in only obtaining a local optimum. To overcome this it is often better to use combinations of trees such as a random forest.
Getting an answer or reaching the end of the decision tree occurs when there exists a homogeneous set of labels (all data left has the same label), there are no more attributes to split the data on (label usually just the most popular left), or when the tree hits a pre-specified depth level.
The most common specific implementations are:
- ID3 – designed by J.R. Quinlan, this algorithm takes the feature that will yield the largest information gain (greedy) and will grow the tree to maximum size then prune
- C4.5 – expands on ID3 by allowing for numerical values by partitioning a continuous value
- CART- most recent implementation, according to Sklearn doc this is the version of decision tree they use
Picking the “Best” Attribute
To keep the decision tree small, at each step of the tree, choose the attribute to split on that will result in the purest new classes. Continue splitting until all nodes are pure or information gain is 0. Information gain is based on the concept of entropy.
Entropy is a measure of randomness with the units of the measure being bits of information. The value of entropy will range from 0 to log2(number of classes). In the case of a binary classification problem, the value of entropy will range from 0 to log2(2) or from 0 to 1. An outcome that is completely certain will have an entropy of 0 while a completely random output will have maximal entropy or 1 in the case of a coin flip (binary classification). The chart below illustrates how entropy changes in a binary classification problem given the probability of the p+ class.
The best attribute to split on is the attribute that provides the most information gain. Information gain is the reduction in randomness over the labels you have based on splitting on some value of a particular attribute. In other words, information gain measures the reduction in entropy that happens from splitting your data on a certain attribute.
Gini impurity measures the probability that a randomly chosen sample from your data set would be labeled incorrectly if it were simply randomly labeled based on the distribution of labels in the set.
Gini will be at its minimum value of 0 when all samples at a node in the decision tree are of the same class. Gini is the default criterion to use for picking the best attribute when using Sci-kit Learn’s implementation (you can change to entropy).
Decision trees are one of the most commonly used machine learning algorithms. In this series you learned:
- The components of a decision tree – nodes, edges, leaves
- The advantages of decision trees
- Easy to understand
- Can work with erroneous or missing data
- Ways to prevent overfitting (max depth, min sample size, pruning)
- Ensembling (Bagging, Boosting)
- The algorithmic implementation
- How the best attribute is selected