Entropy and Information Gain
Overview
When building a Decision Tree, we aim to split data into increasingly pure groups.
But how can we measure "purity" or "impurity" mathematically?
One powerful way is using entropy, a concept from information theory.
Entropy measures how mixed or uncertain a group is.
When we split data, we use information gain — the reduction in entropy — to decide the best split.
1. Understanding Purity and Impurity
Imagine a bucket of marbles where each color = a class.
- If all marbles are blue, the bucket is perfectly pure — you can always guess “blue” and be right.
- If the marbles are half blue, half red, the bucket is impure — your guess will be wrong half the time.
The more mixed the bucket, the higher the uncertainty.
We want to find questions (splits) that move us toward lower uncertainty — smaller entropy.
2. What is Entropy?
Plain language definition:
"Entropy measures how much surprise or uncertainty there is when picking an item at random."
- If the bucket is pure (all one class), no surprise → entropy = 0.
- If the bucket is highly mixed, lots of surprise → higher entropy.
Entropy Formula
Let $p_i$ be the proportion of samples in class $i$.
If there are $C$ classes, the entropy is:
$$H = -\sum_{i=1}^{C} p_i \log_2 p_i$$
where:
- $\log_2$ is the logarithm base 2 (information measured in "bits"),
- By convention, $0 \log_2 0 = 0$.
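As a quick sanity check, here is a minimal Python sketch of this formula; the function name `entropy` and the use of `collections.Counter` are just one way to write it:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    # Counter only yields classes that actually occur, so the 0 * log2(0)
    # convention never comes up here.
    return sum(-count / n * log2(count / n) for count in Counter(labels).values())

print(entropy(["blue"] * 8))                # 0.0 -> pure bucket, no surprise
print(entropy(["blue"] * 4 + ["red"] * 4))  # 1.0 -> evenly mixed, maximum surprise
```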
Examples
- Pure bucket (all one class, say $p_1 = 1$):
  $$H = -1 \log_2 1 = 0$$
  ✅ No uncertainty.
- Two classes evenly mixed ($p_1 = p_2 = 0.5$):
  $$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}$$
  ✅ Maximum uncertainty for 2 classes.
3. How Splitting Affects Entropy
When a question splits a parent group into left and right children:
- $N$: number of samples in the parent
- $N_L$, $N_R$: numbers of samples in the left/right child
- $H_{\text{parent}}$: entropy of the parent
- $H_L$, $H_R$: entropies of the left/right children

The weighted average entropy after the split is:
$$H_{\text{split}} = \frac{N_L}{N} H_L + \frac{N_R}{N} H_R$$
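A small sketch of this weighted average, assuming the child entropies have already been computed (the function name and the example numbers are illustrative):

```python
def weighted_split_entropy(n_left, h_left, n_right, h_right):
    """Average of the child entropies, weighted by child size."""
    n = n_left + n_right
    return (n_left / n) * h_left + (n_right / n) * h_right

# Parent with 10 samples split into a pure left child (4 samples, H = 0.0)
# and a mixed right child (6 samples, e.g. 4 of one class and 2 of another,
# so H is about 0.918):
print(weighted_split_entropy(4, 0.0, 6, 0.918))  # ~0.551
```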
4. What is Information Gain?
Information gain measures how much entropy decreases after a split:
$$IG = H_{\text{parent}} - H_{\text{split}}$$
✅ Interpretation:
- High information gain → split made the groups much purer.
- Low information gain → split did not improve purity much.
When building a tree, we choose the split with the highest information gain at each step.
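A minimal sketch combining the two formulas; the hypothetical `information_gain` helper below compares a perfect split against a useless one on a toy set of labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    h_split = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - h_split

parent = ["red", "red", "blue", "blue"]  # H = 1.0
print(information_gain(parent, ["red", "red"], ["blue", "blue"]))  # 1.0 -> groups made pure
print(information_gain(parent, ["red", "blue"], ["red", "blue"]))  # 0.0 -> no improvement
```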
5. Growing a Decision Tree with Entropy and Information Gain
1. Start with all data in the root node.
2. For every possible question (feature + threshold):
   - Calculate how the split would divide the data.
   - Calculate the information gain from the split.
3. Choose the question with the highest information gain.
4. Split the node into left/right children.
5. Recursively repeat steps 2–4 for each child.
6. Stop when:
   - A node is pure (entropy = 0), or
   - Other limits are reached (e.g., `max_depth`, `min_samples_leaf`).
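Below is a self-contained sketch of this greedy procedure in plain Python. The names (`best_split`, `build_tree`, and the nested-dict node format with `"feature"`, `"threshold"`, and `"leaf"` keys) are illustrative choices for this example, not a library API:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Try every (feature, threshold) question and return the one with the
    highest information gain, or None if no valid split exists."""
    parent_h, n = entropy(y), len(y)
    best = None  # (gain, feature_index, threshold)
    for f in range(len(X[0])):
        for threshold in sorted(set(row[f] for row in X)):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            h_split = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            gain = parent_h - h_split
            if best is None or gain > best[0]:
                best = (gain, f, threshold)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    # Stop if the node is pure, the depth limit is reached, or no split helps:
    # return a leaf holding the most common class label.
    if entropy(y) == 0 or depth == max_depth:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    split = best_split(X, y)
    if split is None or split[0] == 0:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, f, t = split
    left_idx = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i, row in enumerate(X) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx], depth + 1, max_depth),
    }

# Tiny toy dataset with one feature; the classes separate around 2.5.
X = [[1.0], [1.5], [2.0], [3.0], [3.5], [4.0]]
y = ["A", "A", "A", "B", "B", "B"]
print(build_tree(X, y))
# {'feature': 0, 'threshold': 2.0, 'left': {'leaf': 'A'}, 'right': {'leaf': 'B'}}
```

Each recursive call repeats the same greedy search (steps 2–4) on a smaller and purer subset of the data.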
6. Making Predictions with the Tree
To classify a new input:
- Start at the root node.
- Ask the stored yes/no question.
- Follow the yes or no branch.
- Repeat until reaching a leaf node.
- Output the most common class label in that leaf.
✅ Like playing "20 Questions" — each answer brings you closer to the final class.
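Assuming the nested-dict node format from the sketch above, prediction is a short loop; `predict` and the hand-built `tree` below are illustrative:

```python
def predict(node, sample):
    """Walk from the root to a leaf, answering one stored question per level."""
    while "leaf" not in node:
        # The stored question: is this feature's value <= the threshold?
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

# A hand-built one-question tree in the same nested-dict form:
tree = {
    "feature": 0, "threshold": 2.0,
    "left": {"leaf": "A"},
    "right": {"leaf": "B"},
}
print(predict(tree, [1.2]))  # 'A'
print(predict(tree, [3.7]))  # 'B'
```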
Final Thoughts
- Entropy measures how mixed a group is.
- Information gain tells us how much a split reduces that mixing.
- Decision Trees use information gain to decide which questions to ask.
Building trees with entropy and information gain helps models split the data intelligently, leading to better classification accuracy.
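If you use scikit-learn, its `DecisionTreeClassifier` can score splits with entropy instead of its default Gini impurity; a minimal usage sketch, assuming scikit-learn is installed:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[1.0], [1.5], [2.0], [3.0], [3.5], [4.0]]
y = ["A", "A", "A", "B", "B", "B"]

# criterion="entropy" makes the tree choose splits by information gain;
# max_depth and min_samples_leaf are the stopping limits mentioned above.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=1)
clf.fit(X, y)
print(clf.predict([[1.2], [3.7]]))  # ['A' 'B']
```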