
Decision Trees for Machine Learning

raam.raam
arvindpdmn
2 authors have contributed to this article
Created by raam.raam on 2019-04-15
Last updated by raam.raam on 2019-04-25

Summary

Illustrating a decision tree. Source: Navlani 2018.

In machine learning, we use past data to predict a future state. When data is labelled based on a desired attribute, we call it supervised learning. Many algorithms facilitate such learning, and the decision tree is one of them. A decision tree is a directed graph in which each internal node corresponds to a test on an attribute, each branch represents an outcome of that test, and each leaf corresponds to a class label.

Milestones

1963

Morgan and Sonquist propose Automatic Interaction Detection (AID), the first regression tree algorithm. This recursively splits data depending on impurity and stops splitting on reaching a certain level of impurity.

1972

Messenger and Mandell propose Theta Automatic Interaction Detection (THAID), the first classification tree algorithm. This recursively splits data to maximize the number of cases in the modal category.

1980

Kass proposes Chi-square Automatic Interaction Detection (CHAID), which splits data into nodes to start with and then merges nodes whose splits are not statistically significant. This is quicker but can be inaccurate.

1984

Breiman and colleagues improve upon AID and THAID to arrive at Classification and Regression Trees (CART). This improves accuracy by pruning the tree. CART also gives variable importance scores.

1986

Quinlan proposes Iterative Dichotomiser (ID3), in which he uses information entropy to quantify impurity. This is then used to calculate information gain for two-class decisions.

1993

As an extension of ID3, Quinlan proposes C4.5. This can handle multiple-class decisions.

Discussion

  • Could you explain with an example how a decision tree is formed?
    Decision tree based on two decision nodes. Source: Petrenko 2015.

    Assuming sufficient data has been collected, suppose we want to predict if an individual has risk of high blood pressure. We have two classes, namely, high BP and normal BP. We consider two attributes that may influence BP, overweight and extent of exercise.

    The first decision node (topmost in the tree) could be either weight or extent of exercise. Either way, we also decide the threshold by which data is split at the decision node. We could further split the branches based on the second attribute. The idea is to arrive at a grouping that clearly splits those at risk of high BP and those who are not. In the figure, we note that being overweight clearly separates individuals at high risk. Among those who are not overweight, exercise attribute must be used to arrive at a suitable separation.

    The process of choosing the decision nodes is iterative. A node becomes a leaf when the data in it clearly favours one class over all other classes.
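    As a hedged sketch, the BP example above can be reproduced with scikit-learn on made-up data. The feature values and labels below are assumptions chosen only so that both attributes (overweight, hours of exercise) are needed to separate the classes.

    ```python
    # A minimal sketch of the BP example with scikit-learn, on hypothetical data.
    # Features: [overweight (0/1), exercise hours per week]; label: 1 = high BP.
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[1, 6], [1, 7], [1, 1], [0, 1], [0, 2], [0, 6], [0, 7]]
    y = [1, 1, 1, 1, 1, 0, 0]  # assumed labels: overweight or low-exercise -> high BP

    tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
    # Print the learned decision nodes as text
    print(export_text(tree, feature_names=['overweight', 'exercise_hours']))
    ```

    The printed tree shows two decision nodes, mirroring the figure: one attribute splits the data first, and the other attribute resolves the remaining mixed branch.
    
    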

  • How is information entropy related to decision trees?
    Binary entropy function. Source: Wikipedia 2018, Entropy of a Bernoulli trial.

    Entropy measures the uncertainty present in a random variable X. For example, an unbiased coin can show either head or tail when tossed: it has maximum entropy. If the coin is biased, say towards head, then it has less entropy since we already know that head is more likely to occur. For a binary variable, the uncertainty is quantified in the range of 0 (no randomness) to 1 (maximum randomness). If there is no randomness, the variable does not hold any information.

    Consider a binary classification problem with only two classes, positive and negative. If all samples are positive or all are negative, then entropy is zero. If half of the records are of the positive class and half are of the negative class, then entropy is one, its maximum. Mathematically, entropy is defined as

    $$H(X) = - \sum_{i=1}^n p_i \log_2 p_i$$
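    The formula above can be sketched in a few lines of Python; the coin probabilities below are just the examples from the text.

    ```python
    import math

    def entropy(probs):
        """Shannon entropy in bits for a list of class probabilities."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))  # unbiased coin: maximum entropy, 1.0
    print(entropy([0.9, 0.1]))  # biased coin: less entropy, ~0.469
    print(entropy([1.0]))       # no randomness: 0.0
    ```
    
    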

  • What is Gini Impurity?
    Comparing Gini Impurity and Entropy. Source: Kosaka 2018.

    Gini Impurity (also called Gini Index) is an alternative to entropy that helps us choose attributes by which to split the data. It measures the probability of incorrectly identifying a class. The lower the Gini Index, the better the split. The idea is to lower the uncertainty and therefore classify better.

    Mathematically, Gini Impurity is defined as

    $$Gini(X) = 1 - \sum_{i=1}^n p_i^2$$

    For a split that partitions the data into two subsets, \(x_1\) and \(x_2\), with \(N_1\) and \(N_2\) instances respectively and \(N=N_1+N_2\), the impurity of the split is the weighted sum

    $$Gini(X) = \frac{N_1}{N} Gini(x_1) + \frac{N_2}{N} Gini(x_2)$$

    where each subset's impurity is computed over the class proportions within it:

    $$Gini(x_j) = 1 - \sum_{i=1}^n p_i(x_j)^2$$
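    The Gini calculations can be sketched as follows; the node sizes and class proportions are made up for illustration.

    ```python
    def gini(probs):
        """Gini impurity for a node with the given class probabilities."""
        return 1 - sum(p ** 2 for p in probs)

    def weighted_gini(n1, n2, gini1, gini2):
        """Impurity of a split into two child nodes with n1 and n2 instances."""
        n = n1 + n2
        return (n1 / n) * gini1 + (n2 / n) * gini2

    print(gini([1.0, 0.0]))  # pure node: 0.0
    print(gini([0.5, 0.5]))  # 50/50 node: 0.5, the maximum for two classes
    # A split sending 6 of 10 instances to a pure node and 4 to a 50/50 node
    print(weighted_gini(6, 4, gini([1.0, 0.0]), gini([0.5, 0.5])))  # 0.2
    ```
    
    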

  • What is information gain?
    Example illustrating the calculation of information gain. Source: Hendler 2018, slide 46.

    Information gain is the change in entropy from before a split to after it, with respect to the selected attribute. We select for splitting the attribute that gives the highest information gain. Information gain is also called Kullback-Leibler divergence.

    Information Gain IG(S,A) for a set S is the effective change in entropy after deciding on a particular attribute A. Mathematically,

    $$IG(S,A)=H(S)-H(S,A)$$
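    A minimal sketch of this calculation, where \(H(S,A)\) is taken as the size-weighted entropy of the child nodes; the label values are made up.

    ```python
    import math

    def entropy(labels):
        """Shannon entropy (bits) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in (labels.count(v) for v in set(labels)))

    def information_gain(parent, children):
        """IG(S,A): parent entropy minus size-weighted entropy of children."""
        n = len(parent)
        weighted = sum(len(ch) / n * entropy(ch) for ch in children)
        return entropy(parent) - weighted

    # Splitting a 50/50 parent into two pure children gives the maximum gain
    parent = ['high', 'high', 'normal', 'normal']
    children = [['high', 'high'], ['normal', 'normal']]
    print(information_gain(parent, children))  # 1.0
    ```
    
    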

  • What are the pros and cons of decision trees?

    Among the pros are,

    • Decision trees are easily interpretable, since they mirror the human decision-making process and form a white box model.
    • Decision trees can be used for both regression (continuous variables) and classification (categorical variables).
    • Decision trees are non-parametric. This means we make no assumptions about linearity or normality of the data.

    Among the cons are,

    • Small changes to data may alter the tree and therefore the results. Decision Trees are therefore not robust.
    • When a tree grows large and complex, it tends to overfit. This is overcome by limiting tree depth, boosting or bagging.
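    The depth-limiting remedy mentioned above can be sketched with scikit-learn; the choice of the iris data set here is just an assumption for illustration.

    ```python
    # Limiting tree depth to curb overfitting: compare an unconstrained tree
    # against one capped at depth 2, both fit on the same data.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    deep = DecisionTreeClassifier(random_state=0).fit(X, y)
    shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(deep.get_depth(), shallow.get_depth())  # the capped tree stays small
    ```
    
    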

Sample Code

  • # https://stackabuse.com/decision-trees-in-python-with-scikit-learn/
    # Accessed 2019-04-16.
     
    import numpy as np
    import pandas as pd
     
    # Data: X (features) and y (labels) are assumed to be loaded already
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
     
     
    # Classification
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier()
    #classifier = DecisionTreeClassifier(criterion = "gini")
    #classifier = DecisionTreeClassifier(criterion = "entropy")
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
     
    # Evaluation of Classification
    from sklearn.metrics import classification_report, confusion_matrix
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
     
     
    # Regression
    from sklearn.tree import DecisionTreeRegressor
    regressor = DecisionTreeRegressor()
    regressor.fit(X_train, y_train)
     
    y_pred = regressor.predict(X_test)
    df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
    df
     
    # Evaluation of Regression
    from sklearn import metrics
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

References

  1. Hendler, Danny. 2018. "Advanced Topics in on-line Social Network Analysis." SlidePlayer. Accessed 2019-04-11.
  2. Kosaka, Norio. 2018. "Gini and Entropy Comparison." Graphical Representation of the Gini and Entropy. Accessed 2019-04-11.
  3. Kulkarni, Mayur. 2017. "Decision Trees for Classification: A Machine Learning Algorithm." Information Gain. Accessed 2019-04-11.
  4. Loh, Wei-Yin. 2014. "A Brief History of Classification and Regression Trees." Slides. Accessed 2019-04-11.
  5. Petrenko, Dimitri. 2015. "Decision tree completely different between rpart and party package." Question. Accessed 2019-04-11.
  6. Shannon, C. E. 1948. "A Mathematical Theory of Communication." The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October. Accessed 2019-04-11.
  7. Sosnovshchenko, Alexander. 2019. "Machine Learning with Swift." Decision tree learning pros and cons. Accessed 2019-04-11.
  8. Stefanowski, Jerzy. 2010. "Discovering Decision Trees." Lecture 5 SE Master Course. Accessed 2019-04-11.
  9. Wikipedia. 2018. "Binary Entropy Function." Wikipedia, October 09. Accessed 2019-04-11.

Further Reading

  1. Saxena, Rahul. 2017. "How decision tree algorithm works." Information Gain. Accessed 2019-04-11.
  2. Loh, Wei-Yin. 2014. "International Statistical Review." Fifty Years of Classification and Regression Trees. Accessed 2019-04-11.
  3. Quinlan, J. R. 1986. "Induction of Decision Trees." Kluwer Academic Publishers, Boston. Accessed 2019-04-11.
  4. Stefanowski, Jerzy. 2010. "Discovering Decision Trees." Lecture 5 SE Master Course. Accessed 2019-04-11.


Cite As

Devopedia. 2019. "Decision Trees for Machine Learning." Version 7, April 25. Accessed 2019-11-13. https://devopedia.org/decision-trees-for-machine-learning