# Decision Trees for Machine Learning

In machine learning, we use past data to predict a future state. When data is labelled based on a desired attribute, we call it supervised learning. There are many algorithms that facilitate such learning; the decision tree is one of them. A decision tree is a directed graph in which each internal node corresponds to a test on an attribute, each branch represents an outcome of that test, and each leaf corresponds to a class label.

## Discussion

• Could you explain with an example how a decision tree is formed?

Assuming sufficient data has been collected, suppose we want to predict whether an individual is at risk of high blood pressure. We have two classes, namely, high BP and normal BP. We consider two attributes that may influence BP: being overweight and extent of exercise.

The first decision node (topmost in the tree) could test either weight or extent of exercise. Either way, we also decide the threshold by which data is split at the decision node. We could further split the branches based on the second attribute. The idea is to arrive at a grouping that clearly separates those at risk of high BP from those who are not. In this example, being overweight clearly separates individuals at high risk. Among those who are not overweight, the exercise attribute must be used to arrive at a suitable separation.

The process of choosing the decision nodes is iterative. The leaf node is chosen when the data in the node favours one class over all other classes.
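As a minimal sketch of this process, the snippet below fits a small tree with scikit-learn; the feature names (overweight, weekly exercise hours), values, labels and the max_depth setting are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: [overweight (0/1), weekly exercise hours]
X = [[1, 0], [1, 2], [1, 4], [1, 6], [0, 0], [0, 3], [0, 4], [0, 6]]
# Labels: 1 = high BP, 0 = normal BP
y = [1, 1, 1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Show the decision nodes and thresholds the algorithm picked:
# the root tests overweight, and the non-overweight branch tests exercise
print(export_text(tree, feature_names=['overweight', 'exercise_hours']))
```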

• How is information entropy related to decision trees?

Entropy measures the uncertainty present in a random variable X. For example, an unbiased coin can show either head or tail when tossed: it has maximum entropy. If the coin is biased, say towards head, then it has less entropy since we already know that head is more likely to occur. For a two-outcome variable, the uncertainty is quantified in the range 0 (no randomness) to 1 (maximum randomness). If there is no randomness, the variable does not hold any information.

Consider a binary classification problem with only two classes, positive and negative. If all samples belong to one class, entropy is zero. If half of the records belong to the positive class and half to the negative class, entropy is at its maximum of one. Mathematically, entropy is defined as

$$H(X) = - \sum_{i=1}^{n} p_i(x) \log_2 p_i(x)$$
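As a rough illustration, the entropy of a set of class labels can be computed directly from this formula; the helper below is a sketch, not a library function.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

print(entropy(['pos'] * 10))               # pure node: entropy is 0
print(entropy(['pos'] * 5 + ['neg'] * 5))  # 50/50 split: entropy is 1
```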

• What is Gini Impurity?

Gini Impurity (also called Gini Index) is an alternative to entropy that helps us choose attributes by which to split the data. It measures the probability of incorrectly classifying a randomly chosen sample. The lower the Gini Index, the better the split. The idea is to lower the uncertainty and thereby improve classification.

Mathematically, Gini Impurity is defined as

$$Gini(X) = 1 - \sum_{i=1}^{n} p_i(x)^2$$

When a candidate attribute splits the data into two subsets, $$x_1$$ and $$x_2$$, with $$N_1$$ and $$N_2$$ instances respectively, and $$N=N_1+N_2$$, the impurity of the split is calculated from

$$Gini(X) = \frac{N_1}{N} \cdot Gini(x_1) + \frac{N_2}{N} \cdot Gini(x_2)$$

where

$$Gini(x_1) = 1 - \sum_{i=1}^{n} p_i(x_1)^2$$
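The sketch below applies these formulas to plain Python lists of class labels: `gini` gives the impurity of one node and `split_gini` the weighted impurity of a two-way split (both are illustrative helpers, not library functions).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Weighted Gini impurity of a split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A split that cleanly separates the classes has zero impurity ...
print(split_gini(['high BP'] * 4, ['normal BP'] * 4))                          # 0.0
# ... while a split that leaves both children mixed scores poorly
print(split_gini(['high BP', 'normal BP'] * 2, ['high BP', 'normal BP'] * 2))  # 0.5
```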

• What is information gain?

Information gain is the change in entropy from before the split to after the split with respect to the selected attribute. We select for splitting the attribute that gives the highest information gain. Information gain is also known as Kullback-Leibler divergence.

Information Gain IG(S,A) for a set S is the effective change in entropy after splitting S on a particular attribute A, where H(S|A) is the weighted average entropy of the subsets produced by the split. Mathematically,

$$IG(S,A) = H(S) - H(S \mid A)$$
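Continuing the earlier entropy sketch, information gain can be computed as the parent's entropy minus the weighted entropy of the child nodes produced by the split; the data below is invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    weighted = sum((len(child) / n) * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ['high BP'] * 4 + ['normal BP'] * 4
# A split that separates the classes perfectly recovers the full 1 bit of entropy
print(information_gain(parent, [['high BP'] * 4, ['normal BP'] * 4]))  # 1.0
```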

• What are the pros and cons of decision trees?

Among the pros are,

• Decision trees are easily interpretable since they mirror the human decision-making process and are a white-box model.
• Decision trees can be used for both regression (continuous variables) and classification (categorical variables).
• Decision trees are non-parametric. This means we make no assumptions about linearity or normality of the data.

Among the cons are,

• Small changes to data may alter the tree and therefore the results. Decision Trees are therefore not robust.
• When a tree grows large and complex, it tends to overfit. This can be mitigated by limiting tree depth, bagging or boosting, as sketched below.
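As a brief sketch of those mitigations, scikit-learn lets us cap tree depth directly and wrap trees in a bagging ensemble; the dataset (Iris) and parameter values here are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Limit tree depth so the tree cannot grow arbitrarily complex
shallow_tree = DecisionTreeClassifier(max_depth=3)
print(cross_val_score(shallow_tree, X, y, cv=5).mean())

# Bagging: average many trees, each trained on a bootstrap sample of the data
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```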

## Milestones

1963

Morgan and Sonquist propose Automatic Interaction Detection (AID), the first regression tree algorithm. This recursively splits data depending on impurity and stops splitting on reaching a certain level of impurity.

1972

Messenger and Mandell propose Theta Automatic Interaction Detection (THAID), the first classification tree algorithm. This recursively splits data to maximize the number of cases in the modal category.

1980

Kass proposes Chi-square Automatic Interaction Detection (CHAID), which first splits the data into nodes and then merges nodes whose splits are not statistically significant. This is quicker but can be inaccurate.

1984

Breiman, Friedman, Olshen and Stone improve upon AID and THAID to arrive at Classification and Regression Trees (CART). Accuracy is improved by pruning the tree. CART also gives variable importance scores.

1986

Quinlan proposes Iterative Dichotomiser 3 (ID3), in which information entropy is used to quantify impurity and the attribute with the highest information gain is chosen for splitting.

1993

As an extension of ID3, Quinlan proposes C4.5. This can handle multiple-class decisions.

## Sample Code

# https://stackabuse.com/decision-trees-in-python-with-scikit-learn/
# Accessed 2019-04-16.

# Data: X (feature matrix) and y (target labels) are assumed to be
# loaded beforehand, e.g. from a CSV file via pandas
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Classification
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
#classifier = DecisionTreeClassifier(criterion = "gini")
#classifier = DecisionTreeClassifier(criterion = "entropy")
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Evaluation of Classification
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Regression (assumes y is a continuous target; in practice a separate dataset would be used)
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df)

# Evaluation of Regression
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
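If desired, the fitted classifier from the code above can also be visualized; this is an optional addition, not part of the original example, and assumes matplotlib is installed.

```python
# Visualize the fitted classification tree (optional)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(classifier, filled=True)
plt.show()
```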
