# Decision Trees for Machine Learning

In machine learning, we use past data to predict a future state. When data is labelled based on a desired attribute, we call it *supervised learning*. There are many algorithms facilitating such a learning. **Decision tree** is one such. Decision tree is a directed graph where nodes correspond to some test on attributes, branch represents an outcome of a test and a leaf corresponds to a class label.^{}

## Discussion

Could you explain with an example how a decision tree is formed? Assuming sufficient data has been collected, suppose we want to predict if an individual has risk of high blood pressure. We have two classes, namely, high BP and normal BP. We consider two attributes that may influence BP, overweight and extent of exercise.

The first decision node (topmost in the tree) could be either weight or extent of exercise. Either way, we also decide the threshold by which data is split at the decision node. We could further split the branches based on the second attribute. The idea is to arrive at a grouping that clearly splits those at risk of high BP and those who are not. In the figure, we note that being overweight clearly separates individuals at high risk. Among those who are not overweight, exercise attribute must be used to arrive at a suitable separation.

^{}The process of choosing the decision nodes is iterative. The leaf node is chosen when the data in the node favours one class over all other classes.

How is information entropy related to decision trees? Entropy measures the uncertainty present in a random variable X. For example, an unbiased coin can either show head or tail when tossed. It has maximum entropy. If the coin is biased, say towards head, then it has less entropy since we already know that head is more likely to occur. The uncertainty is quantified in the range of 0 (no randomness) and 1 (absolute randomness). If there is no randomness, the variable does not hold any information.

^{}Consider a binary classification problem with only two classes, positive and negative class. If all samples are positive or all are negative, then entropy is zero or low. If half of the records are of positive class and half are of negative class, then entropy is one or high. Mathematically, entropy is defined as

$$H(X) = - \sum_{i=1}^n p_i(x) * log _2 p_i(x)$$

What is Gini Impurity? Gini Impurity (also called

*Gini Index*) is an alternative to entropy that helps us choose attributes by which we can split the data. It measures the probability of incorrectly identifying a class.^{}Lower the Gini Index, better it is for the split. The idea is to lower the uncertainty and therefore get better in classification.Mathematically, Gini Impurity is defined as

$$Gini(X) = 1 - \sum_{i=1}^n p_i(x) ^2$$

The choice of attribute between 2 classes, \(x_1\) and \(x_2\), with \(N_1\) and \(N_2\) as number of instances, and \(N=N_1+N_2\) is calculated from

$$Gini(X) = \frac{N_1}{N} * Gini(x_1) + \frac{N_2}{N} * Gini(x_2)$$

where

$$ Gini(x1)=1 - p(x_1) ^2$$

What is information gain? Information gain is change in entropy, prior to split and post split, with respect to selected attribute. We select that attribute for splitting data if it gives a high information gain. Information gain is also called as

**Kullback-Leibler Divergence**.^{}Information Gain IG(S,A) for a set S is the effective change in entropy after deciding on a particular attribute A. Mathematically,

$$IG(S,A)=H(S)-H(S,A)$$

What are the pros and cons of decision trees? Among the pros are,

^{}- Decision trees are easily interpretable since it mirrors human decision-making process and a white box model.
- Decision trees can be used for both regression (continuous variables) and classification (categorical variables).
- Decision trees are non-Parametric. This means we make no assumption on linearity or normality of data.

Among the cons are,

^{}- Small changes to data may alter the tree and therefore the results. Decision Trees are therefore not robust.
- When a tree grows large and complex, it tends to overfit. This is overcome by limiting tree depth, boosting or bagging.

## Milestones

Morgan and Sonquist propose **Automatic Interaction Detection (AID)**, the first *regression tree* algorithm. This recursively splits data depending on impurity and stops splitting on reaching a certain level of impurity.^{}

Messenger and Mandell propose **Theta Automatic Interaction Detection (THAID)** , the first *classification tree* algorithm. This recursively splits data to maximize cases in modal category.^{}

Kass propose **Chi-square Automatic Interaction Detection (CHAID)** that splits data into nodes to start with and then merge nodes that have non-significant splits. This is quicker but can be inaccurate.^{}

Researchers improvise AID and THAID to arrive at **Classification and Regression Trees (CART)**. This is done to improve accuracy by pruning the tree. The CART also gives variable importance scores.^{}

Qunilan proposes **Iterative Dichotomiser (ID3)** where he uses *information entropy* to quantify impurity. This is then used to calculate gain ratio for *two-class decisions*.^{}

As an extension of ID3, Quinlan proposes **C4.5**. This can handle *multiple-class decisions*.^{}

## Sample Code

## References

- Hendler, Danny. 2018. "Advanced Topics in on-line Social Network Analysis." SlidePlayer. Accessed 2019-04-11.
- Kosaka, Norio. 2018. "Gini and Entropy Comparison." Graphical Representation of the Gini and Entropy. Accessed 2019-04-11.
- Kulkarni, Mayur. 2017. "Decision Trees for Classification: A Machine Learning Algorithm." Information Gain. Accessed 2019-04-11.
- Loh, Wei-Yin. 2014. "A Brief History of Classification and Regression Trees." Slides. Accessed 2019-04-11.
- Petrenko, Dimitri. 2015. "Decision tree completely different between rpart and party package." Question. Accessed 2019-04-11.
- Shannon, C E. 1948. "A Mathematical Theory of Communication." The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October. Accessed 2019-04-11.
- Sosnovshchenko, Alexander. 2019. "Machine Learning with Swift." Decision tree learning pros and cons. Accessed 2019-04-11.
- Stefanowski, Jerzy. 2010. "Discovering Decision Trees." Lecture 5 SE Master Course. Accessed 2019-04-11.
- Wikipedia. 2018. "Binary Entropy Function." Wikipedia, October 09. Accessed 2019-04-11.

## Further Reading

- Saxena, Rahul. 2017. "How decision tree algorithm works." Information Gain. Accessed 2019-04-11.
- Loh, Wei-Yin. 2014. "International Statistical Review." Fifty Years of Classification and Regression Trees. Accessed 2019-04-11.
- QUINLAN, J R. 1986. "Induction of Decision Trees." Kluwer Academic Publishers, Boston. Accessed 2019-04-11.
- STEFANOWSKI, JERZY. 2010. "Discovering Decision Trees." Lecture 5 SE Master Course. Accessed 2019-04-11.

## Article Stats

## Cite As

### See Also

- Random Forests
- Regression Modelling
- Machine Learning
- Information Entropy