
Entropy and Information Gain

Note
Updated: April 28, 2025

Overview

When building a Decision Tree, we aim to split data into increasingly pure groups.
But how can we measure "purity" or "impurity" mathematically?

One powerful way is using entropy, a concept from information theory.
Entropy measures how mixed or uncertain a group is.
When we split data, we use information gain — the reduction in entropy — to decide the best split.

1. Understanding Purity and Impurity

Imagine a bucket of marbles where each color = a class.

  • If all marbles are blue, the bucket is perfectly pure — you can always guess “blue” and be right.
  • If the marbles are half blue, half red, the bucket is impure — your guess will be wrong half the time.

The more mixed the bucket, the higher the uncertainty.
We want to find questions (splits) that move us toward lower uncertainty — smaller entropy.

2. What is Entropy?

Plain language definition:

"Entropy measures how much surprise or uncertainty there is when picking an item at random."

  • If the bucket is pure (all one class), no surprise → entropy = 0.
  • If the bucket is highly mixed, lots of surprise → higher entropy.

Entropy Formula

Let $p_k$ be the proportion of samples in class $k$.
If there are $K$ classes, the entropy $H$ is:

$$
H = -\sum_{k=1}^{K} p_k \log_2 p_k
$$

where:

  • $\log_2$ is the logarithm base 2 (information is measured in "bits"),
  • By convention, $0 \log_2 0 = 0$.
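
As a concrete illustration (a minimal sketch, not code from this note), the formula maps directly onto a few lines of Python; the function name `entropy` and the use of NumPy are my own choices:

```python
import numpy as np

def entropy(proportions):
    """Shannon entropy (in bits) of a list of class proportions.

    Follows the convention 0 * log2(0) = 0 by skipping zero proportions.
    """
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]  # drop zeros so log2 is well-defined
    return float(-np.sum(p * np.log2(p)))
```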

Examples

  • Pure bucket (all one class, say $p_k = 1$):

$$
H = -1 \times \log_2 1 = 0
$$

✅ No uncertainty.

  • Two classes evenly mixed ($p_1 = p_2 = 0.5$):

$$
H = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1.0
$$

✅ Maximum uncertainty for 2 classes.
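
Plugging the two buckets above into the `entropy` sketch reproduces the hand-computed values:

```python
print(entropy([1.0]))       # 0.0 -> pure bucket, no uncertainty
print(entropy([0.5, 0.5]))  # 1.0 -> evenly mixed, maximum for 2 classes
```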

3. How Splitting Affects Entropy

When a question splits a parent group into left and right children:

  • $N_P$: number of samples in the parent
  • $N_L$, $N_R$: number of samples in the left/right child
  • $H_P$: entropy of the parent
  • $H_L$, $H_R$: entropies of the left/right children

The weighted average entropy after the split is:

$$
\text{Weighted Entropy} = \frac{N_L}{N_P} H_L + \frac{N_R}{N_P} H_R
$$
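
A small sketch of this computation, reusing the `entropy` helper above and assuming each child is given as a plain list of class labels (the helper names are illustrative, not from the note):

```python
from collections import Counter

def class_proportions(labels):
    """Turn a list of class labels into class proportions."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def weighted_entropy(left_labels, right_labels):
    """Weighted average entropy of the two children of a split."""
    n_left, n_right = len(left_labels), len(right_labels)
    n_parent = n_left + n_right
    return (n_left / n_parent) * entropy(class_proportions(left_labels)) \
         + (n_right / n_parent) * entropy(class_proportions(right_labels))
```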

4. What is Information Gain?

Information gain measures how much entropy decreases after a split:

$$
\text{Information Gain} = H_P - \left( \frac{N_L}{N_P} H_L + \frac{N_R}{N_P} H_R \right)
$$

✅ Interpretation:

  • High information gain → split made the groups much purer.
  • Low information gain → split did not improve purity much.

When building a tree, we choose the split with the highest information gain at each step.
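
Putting the pieces together, here is a sketch of the gain computation built on the helpers defined above (again, the names are my own, not from the note):

```python
def information_gain(parent_labels, left_labels, right_labels):
    """Entropy of the parent minus the weighted entropy of its children."""
    return entropy(class_proportions(parent_labels)) \
         - weighted_entropy(left_labels, right_labels)

# A 50/50 parent split into two pure children gives the maximum gain of 1 bit.
parent = ["blue", "blue", "red", "red"]
print(information_gain(parent, ["blue", "blue"], ["red", "red"]))  # 1.0
```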

5. Growing a Decision Tree with Entropy and Information Gain

  1. Start with all data in the root node.
  2. For every possible question (feature + threshold):
    • Calculate how the split would divide the data.
    • Calculate the information gain from the split.
  3. Choose the question with the highest information gain.
  4. Split the node into left/right children.
  5. Recursively repeat steps 2–4 for each child.
  6. Stop when:
    • A node is pure (entropy = 0), or
    • Other limits are reached (e.g., max_depth, min_samples_leaf).
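
To make steps 2 and 3 concrete, here is a sketch of the exhaustive search for the best question at one node, assuming numeric features in a NumPy array `X` (rows = samples) and labels in an array `y`; it reuses the helpers above, and the structure mirrors the list, but none of this code comes from the note itself:

```python
def best_split(X, y):
    """Return (feature_index, threshold, gain) of the question with the highest information gain."""
    best = (None, None, 0.0)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold   # the yes/no question
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue                        # skip splits that leave a child empty
            gain = information_gain(list(y), list(left), list(right))
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array(["blue", "blue", "red", "red"])
print(best_split(X, y))  # feature 0, threshold 2.0, gain 1.0
```

Growing the full tree is then a matter of calling `best_split` on each node's subset of the data and recursing until one of the stopping conditions above is met.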

6. Making Predictions with the Tree

To classify a new input:

  1. Start at the root node.
  2. Ask the stored yes/no question.
  3. Follow the yes or no branch.
  4. Repeat until reaching a leaf node.
  5. Output the most common class label in that leaf.

✅ Like playing "20 Questions" — each answer brings you closer to the final class.
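
As an illustration of the traversal (not code from this note), assume each internal node is a dict with the keys `feature`, `threshold`, `left`, and `right`, and each leaf is a dict holding the majority `label`; this node layout is an assumption of the sketch:

```python
def predict_one(node, x):
    """Walk from the root to a leaf, answering each stored yes/no question."""
    while "label" not in node:                       # internal nodes store a question
        if x[node["feature"]] <= node["threshold"]:  # "yes" branch
            node = node["left"]
        else:                                        # "no" branch
            node = node["right"]
    return node["label"]                             # majority class stored at the leaf

# Tiny hand-built tree: "Is feature 0 <= 2.5?"
tree = {"feature": 0, "threshold": 2.5,
        "left": {"label": "blue"},
        "right": {"label": "red"}}
print(predict_one(tree, [1.0]))  # blue
print(predict_one(tree, [4.0]))  # red
```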

Final Thoughts

  • Entropy measures how mixed a group is.
  • Information gain tells us how much a split reduces that mixing.
  • Decision Trees use information gain to decide which questions to ask.

Building trees with entropy and information gain helps models split the data intelligently, leading to better classification accuracy.