
Entropy and Information Gain

Note
Updated: April 28, 2025

Overview

When building a Decision Tree, we aim to split data into increasingly pure groups.
But how can we measure "purity" or "impurity" mathematically?

One powerful way is using entropy, a concept from information theory.
Entropy measures how mixed or uncertain a group is.
When we split data, we use information gain — the reduction in entropy — to decide the best split.

1. Understanding Purity and Impurity

Imagine a bucket of marbles where each color = a class.

  • If all marbles are blue, the bucket is perfectly pure — you can always guess “blue” and be right.
  • If the marbles are half blue, half red, the bucket is impure — your guess will be wrong half the time.

The more mixed the bucket, the higher the uncertainty.
We want to find questions (splits) that move us toward lower uncertainty — smaller entropy.

2. What is Entropy?

Plain language definition:

"Entropy measures how much surprise or uncertainty there is when picking an item at random."

  • If the bucket is pure (all one class), no surprise → entropy = 0.
  • If the bucket is highly mixed, lots of surprise → higher entropy.

Entropy Formula

Let $p_k$ be the proportion of samples in class $k$.
If there are $K$ classes, the entropy $H$ is:

$$
H = -\sum_{k=1}^{K} p_k \log_2 p_k
$$

where:

  • $\log_2$ is the logarithm base 2 (information is measured in "bits"),
  • By convention, $0 \log_2 0 = 0$.
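
As a concrete illustration (a minimal sketch, not code from this note), the formula maps directly onto a few lines of Python; the function name `entropy` and the use of NumPy are my own choices:

```python
import numpy as np

def entropy(proportions):
    """Shannon entropy (in bits) of a list of class proportions.

    Follows the convention 0 * log2(0) = 0 by skipping zero proportions.
    """
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]  # drop zeros so log2 is well-defined
    return float(-np.sum(p * np.log2(p)))
```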

Examples

  • Pure bucket (all one class, say $p_k = 1$):

$$
H = -1 \times \log_2 1 = 0
$$

✅ No uncertainty.

  • Two classes evenly mixed ($p_1 = p_2 = 0.5$):

$$
H = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1.0
$$

✅ Maximum uncertainty for 2 classes.
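
Plugging the two buckets above into the `entropy` sketch reproduces the hand-computed values:

```python
print(entropy([1.0]))       # 0.0 -> pure bucket, no uncertainty
print(entropy([0.5, 0.5]))  # 1.0 -> evenly mixed, maximum for 2 classes
```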

3. How Splitting Affects Entropy

When a question splits a parent group into left and right children:

  • $N_P$: number of samples in the parent
  • $N_L$, $N_R$: number of samples in the left/right child
  • $H_P$: entropy of the parent
  • $H_L$, $H_R$: entropies of the left/right children

The weighted average entropy after the split is:

$$
\text{Weighted Entropy} = \frac{N_L}{N_P} H_L + \frac{N_R}{N_P} H_R
$$
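
A small sketch of this computation, reusing the `entropy` helper above and assuming each child is given as a plain list of class labels (the helper names are illustrative, not from the note):

```python
from collections import Counter

def class_proportions(labels):
    """Turn a list of class labels into class proportions."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def weighted_entropy(left_labels, right_labels):
    """Weighted average entropy of the two children of a split."""
    n_left, n_right = len(left_labels), len(right_labels)
    n_parent = n_left + n_right
    return (n_left / n_parent) * entropy(class_proportions(left_labels)) \
         + (n_right / n_parent) * entropy(class_proportions(right_labels))
```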

4. What is Information Gain?

Information gain measures how much entropy decreases after a split:

$$
\text{Information Gain} = H_P - \left( \frac{N_L}{N_P} H_L + \frac{N_R}{N_P} H_R \right)
$$

✅ Interpretation:

  • High information gain → split made the groups much purer.
  • Low information gain → split did not improve purity much.

When building a tree, we choose the split with the highest information gain at each step.
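
Putting the pieces together, here is a sketch of the gain computation built on the helpers defined above (again, the names are my own, not from the note):

```python
def information_gain(parent_labels, left_labels, right_labels):
    """Entropy of the parent minus the weighted entropy of its children."""
    return entropy(class_proportions(parent_labels)) \
         - weighted_entropy(left_labels, right_labels)

# A 50/50 parent split into two pure children gives the maximum gain of 1 bit.
parent = ["blue", "blue", "red", "red"]
print(information_gain(parent, ["blue", "blue"], ["red", "red"]))  # 1.0
```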

5. Growing a Decision Tree with Entropy and Information Gain

  1. Start with all data in the root node.
  2. For every possible question (feature + threshold):
    • Calculate how the split would divide the data.
    • Calculate the information gain from the split.
  3. Choose the question with the highest information gain.
  4. Split the node into left/right children.
  5. Recursively repeat steps 2–4 for each child.
  6. Stop when:
    • A node is pure (entropy = 0), or
    • Other limits are reached (e.g., max_depth, min_samples_leaf).
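
To make steps 2 and 3 concrete, here is a sketch of the exhaustive search for the best question at one node, assuming numeric features in a NumPy array `X` (rows = samples) and labels in an array `y`; it reuses the helpers above, and the structure mirrors the list, but none of this code comes from the note itself:

```python
def best_split(X, y):
    """Return (feature_index, threshold, gain) of the question with the highest information gain."""
    best = (None, None, 0.0)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold   # the yes/no question
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue                        # skip splits that leave a child empty
            gain = information_gain(list(y), list(left), list(right))
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array(["blue", "blue", "red", "red"])
print(best_split(X, y))  # feature 0, threshold 2.0, gain 1.0
```

Growing the full tree is then a matter of calling `best_split` on each node's subset of the data and recursing until one of the stopping conditions above is met.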

6. Making Predictions with the Tree

To classify a new input:

  1. Start at the root node.
  2. Ask the stored yes/no question.
  3. Follow the yes or no branch.
  4. Repeat until reaching a leaf node.
  5. Output the most common class label in that leaf.

✅ Like playing "20 Questions" — each answer brings you closer to the final class.
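
As an illustration of the traversal (not code from this note), assume each internal node is a dict with the keys `feature`, `threshold`, `left`, and `right`, and each leaf is a dict holding the majority `label`; this node layout is an assumption of the sketch:

```python
def predict_one(node, x):
    """Walk from the root to a leaf, answering each stored yes/no question."""
    while "label" not in node:                       # internal nodes store a question
        if x[node["feature"]] <= node["threshold"]:  # "yes" branch
            node = node["left"]
        else:                                        # "no" branch
            node = node["right"]
    return node["label"]                             # majority class stored at the leaf

# Tiny hand-built tree: "Is feature 0 <= 2.5?"
tree = {"feature": 0, "threshold": 2.5,
        "left": {"label": "blue"},
        "right": {"label": "red"}}
print(predict_one(tree, [1.0]))  # blue
print(predict_one(tree, [4.0]))  # red
```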

Final Thoughts

  • Entropy measures how mixed a group is.
  • Information gain tells us how much a split reduces that mixing.
  • Decision Trees use information gain to decide which questions to ask.

Building trees with entropy and information gain helps models split the data intelligently, leading to better classification accuracy.