Taiju Sanagi: Experiments

Multinomial Naive Bayes

Note
Updated: April 21, 2025

This note introduces the Multinomial Naive Bayes algorithm using scikit‑learn, explains the step‑by‑step logic behind how it works, and then demonstrates a from‑scratch implementation to show that the core idea is simple and easy to build.

What is Multinomial Naive Bayes?

Multinomial Naive Bayes is like keeping a bag-of-words count for every email. For example:

  • How many times does the word money appear in the message?

This single‑word example is just one feature — in practice the model counts how often each word in the vocabulary appears in the email. Each word’s count adds a score toward spam or ham.

After summing all those scores, the class with the higher total wins.

It learns these scores from past data — how often each word appears in spam and ham messages. Using counts (not just presence/absence) makes this model better for longer documents where the same word might appear multiple times.

This notebook will:

  • Use scikit‑learn to demonstrate how Multinomial Naive Bayes works in practice
  • Explain the logic behind it in an intuitive way (scorecard view, Laplace smoothing, using logs instead of tiny products)
  • Show how to implement the same idea step by step from scratch

Let’s dive into the details to understand how it works and how to implement it ourselves.

Preparation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_curve,
    auc,
)

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
X_text = newsgroups.data
y = newsgroups.target
target_names = newsgroups.target_names
print(f"Loaded {len(X_text)} documents across {len(target_names)} categories")

vectorizer = CountVectorizer(stop_words='english', max_features=10000)  # limit vocab size
X = vectorizer.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"X shape: {X.shape}")  # sparse matrix shape (n_docs, n_words)
print("Sample vocabulary words:", list(vectorizer.vocabulary_.keys())[:10])  # show a few words

Outputs:

Loaded 18846 documents across 20 categories
X shape: (18846, 10000)
Sample vocabulary words: ['sure', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'massacre']

Data Observation

X is a sparse matrix with shape (18846, 10000), where each row represents a document and each column represents a word from the 10,000 most frequent vocabulary terms (after removing stopwords).

Each value in X is the raw count of how many times that word appears in the corresponding document.

For example:

  • X[0, 5123] = 2 means the word at vocabulary index 5123 appears twice in document 0.
  • Most values are zero due to the sparsity of natural language.

The target y contains the category labels (0–19), each corresponding to a newsgroup topic such as 'rec.autos', 'sci.space', or 'talk.politics.misc'.

By learning how often each word appears across different categories, Multinomial Naive Bayes estimates class‑conditional probabilities and predicts the most likely category for each new document.
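
As a quick sanity check, here is one way to look at the nonzero counts for a single document. This is a small sketch reusing X and vectorizer from the preparation cell; get_feature_names_out requires scikit-learn 1.0 or newer.

feature_names = vectorizer.get_feature_names_out()   # column index -> word
row = X[0]                                           # first document, as a sparse row
for idx, count in list(zip(row.indices, row.data))[:10]:
    print(f"{feature_names[idx]}: {count}")          # word and its raw count in document 0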

Implement with Scikit-Learn

sk_model = MultinomialNB(alpha=1.0)  # Laplace smoothing
sk_model.fit(X_train, y_train)

y_pred_sk = sk_model.predict(X_test)
y_proba_sk = sk_model.predict_proba(X_test)  # shape: (n_samples, n_classes)

acc_sk = accuracy_score(y_test, y_pred_sk)
print(f"scikit-learn accuracy = {acc_sk:.4f}")

Outputs:

scikit-learn accuracy = 0.6719
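
Since predict_proba was already computed above, an optional way to inspect the model's confidence is to look at the top predicted categories for one test document (a small sketch reusing y_proba_sk, y_test, and target_names from earlier cells):

top3 = np.argsort(y_proba_sk[0])[::-1][:3]           # indices of the 3 highest probabilities
for c in top3:
    print(f"{target_names[c]:25s} {y_proba_sk[0, c]:.3f}")
print("true label:", target_names[y_test[0]])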

Behind the Scenes: Multinomial Naive Bayes

Multinomial Naive Bayes is a way to classify documents by looking at the words inside them and how many times each word appears.

Unlike Bernoulli Naive Bayes (which only checks if a word is present), Multinomial Naive Bayes uses raw word counts.

Bayes' Theorem for Classification

We want to know:

“What is the probability this document belongs to a class, given the words in it?”

We write this as:

P(\text{class} \mid \text{words}) \propto P(\text{words} \mid \text{class}) \cdot P(\text{class})

We calculate this score for every possible class and pick the one with the highest result.
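
In symbols, the predicted class is the one with the largest score:

\hat{c} = \arg\max_{c} \; P(c) \cdot P(\text{words} \mid c)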

The Naive Assumption

We assume all words in a document are conditionally independent, given the class.

That means:

P(\text{words} \mid \text{class}) = \prod_{i=1}^{n} P(x_i \mid \text{class})^{x_i}

Where:

  • x_i is how many times word i appears in the document
  • P(x_i \mid \text{class}) is the probability of that word in that class
  • \prod means multiply all these probabilities (a small numeric sketch follows below)
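
Here is that product written out as a tiny numeric sketch. The vocabulary, word probabilities, and counts below are made-up values, used only to show the mechanics:

import numpy as np

# Hypothetical vocabulary: ["game", "code", "team"]
theta_class = np.array([0.611, 0.056, 0.333])   # assumed P(word | class)
x = np.array([2, 0, 1])                         # word counts in a new document

likelihood = np.prod(theta_class ** x)          # prod_i P(x_i | class)^{x_i}
print(likelihood)                               # ~0.124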

Input Format

Each input document is turned into a list of word counts:

  • "data science is fun" → [1, 1, 1, 0, 0, 0, …]
  • "science science science" → [0, 3, 0, 0, 0, …]

This is why we write:

x_i \in \mathbb{N}

Which means each feature x_i is a whole number (0 or more).
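
This conversion is exactly what CountVectorizer did in the preparation step. A minimal sketch on a toy corpus (note that the learned vocabulary is sorted alphabetically, so the column order may differ from the hand-written vectors above):

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["data science is fun", "science science science"]
toy_vec = CountVectorizer()                 # no stop-word removal for this toy example
toy_X = toy_vec.fit_transform(toy_docs)

print(toy_vec.get_feature_names_out())      # ['data' 'fun' 'is' 'science']
print(toy_X.toarray())                      # [[1 1 1 1]
                                            #  [0 0 0 3]]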

Estimating Word Probabilities from Training Data

To calculate P(x_i \mid \text{class}), we look at all documents in a class and count how often each word appears.

Let:

  • N_{i,c} = total number of times word i appears in class c
  • N_c = total number of all word occurrences in class c
  • V = vocabulary size (number of different words)

Then:

P(x_i \mid \text{class}) = \frac{N_{i,c} + \alpha}{N_c + \alpha V}

Where \alpha is a smoothing value (usually 1) to prevent zero probabilities.

This tells us: “If we randomly pick a word from all the training documents in this class, what's the chance that it's word i?”

In other words, it’s the relative frequency of the word in that class.

Example

Imagine we are training a classifier for two classes: sports and tech.

We focus on just 3 words in the vocabulary: game, code, team.

In all training documents for sports:

  • game appears 10 times
  • code appears 0 times
  • team appears 5 times

So:

  • N_{\text{game, sports}} = 10
  • N_{\text{code, sports}} = 0
  • N_{\text{team, sports}} = 5
  • Total N_{\text{sports}} = 10 + 0 + 5 = 15

Using \alpha = 1 and V = 3:

P(\text{game} \mid \text{sports}) = \frac{10 + 1}{15 + 3} = \frac{11}{18} \approx 0.611
P(\text{code} \mid \text{sports}) = \frac{0 + 1}{15 + 3} = \frac{1}{18} \approx 0.056
P(\text{team} \mid \text{sports}) = \frac{5 + 1}{15 + 3} = \frac{6}{18} \approx 0.333

This tells the model:

  • "game" is very common in sports documents
  • "code" is almost never seen in sports
  • "team" is fairly common

Combine with Prior Probability

We also multiply by the prior probability of each class:

P(\text{class}) = \frac{\text{Number of training documents in this class}}{\text{Total number of training documents}}

So the full expression for the class score becomes:

P(\text{class} \mid \text{words}) \propto P(\text{class}) \cdot \prod_{i=1}^{n} P(x_i \mid \text{class})^{x_i}

🔄 This is where Multinomial Naive Bayes differs from Bernoulli Naive Bayes:

  • BernoulliNB multiplies:

    \prod_{i=1}^{n} \theta_i^{x_i} \cdot (1 - \theta_i)^{1 - x_i}

    (for binary word presence: 1 or 0)

  • MultinomialNB multiplies:

    \prod_{i=1}^{n} P(x_i \mid \text{class})^{x_i}

    (where x_i is a count: 0, 1, 2, 3, ...)

So instead of using (1 - \theta_i) for absent words like in BernoulliNB,
MultinomialNB just ignores words that don’t appear (x_i = 0) and focuses more on words that repeat.

🧠 Intuition:

  • For BernoulliNB, every word either contributes a "yes" or "no" (present or absent)
  • For MultinomialNB, only present words contribute — and more frequent words contribute more

We now have a complete score expression in probability form.
Before calculating it on a computer, we’ll apply logarithms to avoid underflow.
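
To see why the logarithm step matters, here is a quick illustration of the underflow problem (the numbers are arbitrary):

import numpy as np

probs = np.full(500, 0.01)        # 500 small word probabilities
print(np.prod(probs))             # 0.0 -> the product underflows float64
print(np.log(probs).sum())        # about -2302.6 -> easily representable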

Final Scoring Formula (with Logs)

When classifying a new document, we compute:

\log P(\text{class} \mid X) \propto \log P(\text{class}) + \sum_{i=1}^{n} x_i \cdot \log P(x_i \mid \text{class})

Each word’s count x_i acts like a weight, multiplying the impact of that word’s log-probability.

The model picks the class with the highest final score.
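
Continuing the sports/tech toy example, here is a sketch of this scoring for one short document. The equal priors and the tech word probabilities are assumptions made up for illustration:

import numpy as np

x = np.array([2, 0, 1])                         # counts of game, code, team
log_prior = np.log([0.5, 0.5])                  # assumed equal class priors
theta = np.array([
    [0.611, 0.056, 0.333],                      # P(word | sports), from the example above
    [0.056, 0.611, 0.333],                      # P(word | tech), made up
])

scores = log_prior + x @ np.log(theta).T        # log P(c) + sum_i x_i * log P(x_i | c)
print(scores)                                   # sports scores higher -> predict sports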

Let's Code It

Now that we understand how it works, let’s implement it from scratch!

class MyMultinomialNB:
    def __init__(self, alpha: float = 1.0):
        # α = 1 → classic Laplace smoothing (“add-one”)
        # prevents zero probabilities for unseen words
        self.alpha = alpha

    # ==================== TRAIN ====================
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        X ─ shape (n_docs, n_words)
            Each value = how many times word i appears in the document
        y ─ shape (n_docs,)
            Each value = class label (e.g. 0 = ham, 1 = spam)
        """
        n_docs, n_words = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # ---------- 1. PRIOR P(class) ----------
        class_counts = np.bincount(y, minlength=n_classes)
        self.log_prior_ = np.log(class_counts / n_docs)  # shape (n_classes,)

        # ---------- 2. WORD COUNTS ----------
        # word_counts[c, j] = total number of times word j appears in class c
        word_counts = np.zeros((n_classes, n_words))
        for c in self.classes_:
            X_class = X[y == c]
            word_counts[c] = X_class.sum(axis=0)  # sum over documents in class

        # total_words[c] = total number of all word tokens in class c
        total_words = word_counts.sum(axis=1, keepdims=True)  # shape (n_classes, 1)

        # ---------- 3. SMOOTHED WORD PROBABILITIES ----------
        # θ = probability of word j given class c
        # shape: (n_classes, n_words)
        self.log_theta_ = np.log(
            (word_counts + self.alpha) / (total_words + self.alpha * n_words)
        )
        return self

    # =================== PREDICT ===================
    def predict(self, X: np.ndarray):
        """
        For each document:
          - Compute log-score for each class:
              log P(class) + sum_i x_i * log P(word_i | class)
          - Choose class with highest score
        """
        # X: shape (n_test_docs, n_words)
        # log_theta_: shape (n_classes, n_words)
        # Compute score matrix: (n_test_docs, n_classes)
        scores = X @ self.log_theta_.T + self.log_prior_
        return self.classes_[scores.argmax(axis=1)]


my_nb = MyMultinomialNB(alpha=1.0).fit(X_train, y_train)
y_pred_my = my_nb.predict(X_test)
acc_my = accuracy_score(y_test, y_pred_my)
print(f"scratch accuracy = {acc_my:.4f}")

Outputs:

scratch accuracy = 0.6719

It Works!

The scratch model hits an accuracy of 0.6719, matching scikit-learn.

This confirms that the logic behind Multinomial Naive Bayes — counting word frequencies, applying Laplace smoothing, computing log-scores, and picking the class with the highest total — behaves exactly as expected.
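
As an optional extra check, the two models' predictions can be compared directly; with the same smoothing and data they should agree on virtually every test document:

agreement = np.mean(y_pred_my == y_pred_sk)
print(f"agreement with scikit-learn predictions: {agreement:.4f}")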

We’ve successfully built Multinomial Naive Bayes from the ground up!