Decision Trees for Machine Learning
In machine learning, we use past data to predict a future state. When data is labelled based on a desired attribute, we call it supervised learning. There are many algorithms facilitating such a learning. Decision tree is one such. Decision tree is a directed graph where nodes correspond to some test on attributes, branch represents an outcome of a test and a leaf corresponds to a class label.^{}
Discussion

Could you explain with an example how a decision tree is formed? Assuming sufficient data has been collected, suppose we want to predict if an individual has risk of high blood pressure. We have two classes, namely, high BP and normal BP. We consider two attributes that may influence BP, overweight and extent of exercise.
The first decision node (topmost in the tree) could be either weight or extent of exercise. Either way, we also decide the threshold by which data is split at the decision node. We could further split the branches based on the second attribute. The idea is to arrive at a grouping that clearly splits those at risk of high BP and those who are not. In the figure, we note that being overweight clearly separates individuals at high risk. Among those who are not overweight, exercise attribute must be used to arrive at a suitable separation.^{}
The process of choosing the decision nodes is iterative. The leaf node is chosen when the data in the node favours one class over all other classes.

How is information entropy related to decision trees? Entropy measures the uncertainty present in a random variable X. For example, an unbiased coin can either show head or tail when tossed. It has maximum entropy. If the coin is biased, say towards head, then it has less entropy since we already know that head is more likely to occur. The uncertainty is quantified in the range of 0 (no randomness) and 1 (absolute randomness). If there is no randomness, the variable does not hold any information.^{}
Consider a binary classification problem with only two classes, positive and negative class. If all samples are positive or all are negative, then entropy is zero or low. If half of the records are of positive class and half are of negative class, then entropy is one or high. Mathematically, entropy is defined as
$$H(X) =  \sum_{i=1}^n p_i(x) * log _2 p_i(x)$$

What is Gini Impurity? Gini Impurity (also called Gini Index) is an alternative to entropy that helps us choose attributes by which we can split the data. It measures the probability of incorrectly identifying a class.^{} Lower the Gini Index, better it is for the split. The idea is to lower the uncertainty and therefore get better in classification.
Mathematically, Gini Impurity is defined as
$$Gini(X) = 1  \sum_{i=1}^n p_i(x) ^2$$
The choice of attribute between 2 classes, \(x_1\) and \(x_2\), with \(N_1\) and \(N_2\) as number of instances, and \(N=N_1+N_2\) is calculated from
$$Gini(X) = \frac{N_1}{N} * Gini(x_1) + \frac{N_2}{N} * Gini(x_2)$$
where
$$ Gini(x1)=1  p(x_1) ^2$$

What is information gain? Information gain is change in entropy, prior to split and post split, with respect to selected attribute. We select that attribute for splitting data if it gives a high information gain. Information gain is also called as KullbackLeibler Divergence.^{}
Information Gain IG(S,A) for a set S is the effective change in entropy after deciding on a particular attribute A. Mathematically,
$$IG(S,A)=H(S)H(S,A)$$

What are the pros and cons of decision trees? Among the pros are,^{}
 Decision trees are easily interpretable since it mirrors human decisionmaking process and a white box model.
 Decision trees can be used for both regression (continuous variables) and classification (categorical variables).
 Decision trees are nonParametric. This means we make no assumption on linearity or normality of data.
Among the cons are,^{}
 Small changes to data may alter the tree and therefore the results. Decision Trees are therefore not robust.
 When a tree grows large and complex, it tends to overfit. This is overcome by limiting tree depth, boosting or bagging.
Milestones
Morgan and Sonquist propose Automatic Interaction Detection (AID), the first regression tree algorithm. This recursively splits data depending on impurity and stops splitting on reaching a certain level of impurity.^{}
Messenger and Mandell propose Theta Automatic Interaction Detection (THAID) , the first classification tree algorithm. This recursively splits data to maximize cases in modal category.^{}
Kass propose Chisquare Automatic Interaction Detection (CHAID) that splits data into nodes to start with and then merge nodes that have nonsignificant splits. This is quicker but can be inaccurate.^{}
Researchers improvise AID and THAID to arrive at Classification and Regression Trees (CART). This is done to improve accuracy by pruning the tree. The CART also gives variable importance scores.^{}
Qunilan proposes Iterative Dichotomiser (ID3) where he uses information entropy to quantify impurity. This is then used to calculate gain ratio for twoclass decisions.^{}
As an extension of ID3, Quinlan proposes C4.5. This can handle multipleclass decisions.^{}
Sample Code
References
 Hendler, Danny. 2018. "Advanced Topics in online Social Network Analysis." SlidePlayer. Accessed 20190411.
 Kosaka, Norio. 2018. "Gini and Entropy Comparison." Graphical Representation of the Gini and Entropy. Accessed 20190411.
 Kulkarni, Mayur. 2017. "Decision Trees for Classification: A Machine Learning Algorithm." Information Gain. Accessed 20190411.
 Loh, WeiYin. 2014. "A Brief History of Classification and Regression Trees." Slides. Accessed 20190411.
 Petrenko, Dimitri. 2015. "Decision tree completely different between rpart and party package." Question. Accessed 20190411.
 Shannon, C E. 1948. "A Mathematical Theory of Communication." The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October. Accessed 20190411.
 Sosnovshchenko, Alexander. 2019. "Machine Learning with Swift." Decision tree learning pros and cons. Accessed 20190411.
 Stefanowski, Jerzy. 2010. "Discovering Decision Trees." Lecture 5 SE Master Course. Accessed 20190411.
 Wikipedia. 2018. "Binary Entropy Function." Wikipedia, October 09. Accessed 20190411.
Further Reading
 Saxena, Rahul. 2017. "How decision tree algorithm works." Information Gain. Accessed 20190411.
 Loh, WeiYin. 2014. "International Statistical Review." Fifty Years of Classification and Regression Trees. Accessed 20190411.
 QUINLAN, J R. 1986. "Induction of Decision Trees." Kluwer Academic Publishers, Boston. Accessed 20190411.
 STEFANOWSKI, JERZY. 2010. "Discovering Decision Trees." Lecture 5 SE Master Course. Accessed 20190411.
Article Stats
Cite As
See Also
 Random Forests
 Regression Modelling
 Machine Learning
 Information Entropy