Taiju Sanagi: Experiments

Gaussian Naive Bayes

Note
Updated: April 21, 2025

This note introduces the Gaussian Naive Bayes algorithm using scikit‑learn, explains the step‑by‑step logic behind how it works, and then demonstrates a from‑scratch implementation to show that the core idea is simple and easy to build.

What is Gaussian Naive Bayes?

Gaussian Naive Bayes is a classifier designed for continuous numerical features, such as height, weight, or petal length.

Instead of counting how often a word appears (like in text classification), it assumes that each feature follows a normal distribution for each class.

For example, if we want to classify flowers as Setosa or Versicolor based on petal width, Gaussian NB models how petal widths are distributed within each class:

  • It learns the mean and variance of each feature for each class
  • It then uses the Gaussian probability density function to score how likely a new value is under each class

After summing those per-feature log-probabilities and adding the class prior, the class with the highest total wins.

It learns these statistics from past data — how features are distributed in each class. This makes the model well-suited for real‑valued, continuous input data.

This notebook will:

  • Use scikit‑learn to demonstrate how Gaussian Naive Bayes works in practice
  • Explain the logic behind it in an intuitive way (mean/variance estimation, Gaussian formula, using logs for numerical stability)
  • Show how to implement the same idea step by step from scratch

Let’s dive into the details to understand how it works and how to implement it ourselves.

Preparation

# --------------------------------------------------
# 0. Imports
# --------------------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_curve,
    auc,
)

# --------------------------------------------------
# 1. Load the dataset
# --------------------------------------------------
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target
target_names = iris.target_names

print(f"Loaded {X.shape[0]} samples × {X.shape[1]} features")
print("Classes:", list(target_names))
print("\nSample feature values:")
print(X.head())

# --------------------------------------------------
# 2. Train-test split
# --------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

Outputs:

Loaded 150 samples × 4 features
Classes: ['setosa', 'versicolor', 'virginica']

Sample feature values:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Data Observation

These real-valued features describe physical measurements of iris flowers. Since the features are continuous and follow roughly Gaussian distributions per class, this dataset is ideal for Gaussian Naive Bayes.
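As a quick sanity check, we can look at each feature's per-class mean and standard deviation. This is a minimal sketch, assuming the X, y, and target_names objects from the preparation step above:

# Per-class mean / std of each feature: a rough check that a Gaussian per class is reasonable
summary = X.assign(species=y.map(dict(enumerate(target_names))))
print(summary.groupby("species").agg(["mean", "std"]).round(2))

A per-class histogram of each feature would make the bell shapes more visible, but the summary statistics already show that each class clusters around its own mean.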

Implement with Scikit-Learn

# --------------------------------------------------
# 3. Train Gaussian Naive Bayes
# --------------------------------------------------
model = GaussianNB()
model.fit(X_train, y_train)

# --------------------------------------------------
# 4. Predict and evaluate
# --------------------------------------------------
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Gaussian Naive Bayes accuracy = {acc:.4f}")

Outputs:

Gaussian Naive Bayes accuracy = 0.9211

Behind the Scenes: Gaussian Naive Bayes

Gaussian Naive Bayes is a way to classify data by looking at real-valued features and modeling them with normal (Gaussian) distributions.

Unlike Bernoulli Naive Bayes (which works on binary features) or Multinomial Naive Bayes (which uses word counts), Gaussian Naive Bayes assumes that each feature is continuous and normally distributed within each class.

Bayes' Theorem for Classification

We want to know:

“What is the probability this input belongs to a class, given its feature values?”

We write this as:

P(\text{class} \mid \text{features}) \propto P(\text{features} \mid \text{class}) \cdot P(\text{class})

We calculate this score for every class and choose the one with the highest value.

The Naive Assumption

We assume each feature is conditionally independent given the class:

P(x_1, x_2, ..., x_n \mid \text{class}) = \prod_{i=1}^{n} P(x_i \mid \text{class})
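For the four iris measurements, for example, this factorization reads:

P(x_{\text{sep len}}, x_{\text{sep wid}}, x_{\text{pet len}}, x_{\text{pet wid}} \mid \text{class}) = P(x_{\text{sep len}} \mid \text{class}) \cdot P(x_{\text{sep wid}} \mid \text{class}) \cdot P(x_{\text{pet len}} \mid \text{class}) \cdot P(x_{\text{pet wid}} \mid \text{class})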

Input Format

Each input is a vector of continuous values like:

  • [5.1, 3.5, 1.4, 0.2] → sepal/petal measurements of a flower

This is why we write:

x_i \in \mathbb{R}

This means each feature x_i is a real number.

Estimating Feature Probabilities from Training Data

To calculate P(x_i \mid \text{class}), we assume each feature follows a Gaussian distribution for each class.

We look at all training samples in a class and calculate:

  • μ = the mean of feature x_i in this class
  • σ² = the variance of feature x_i in this class

Then we use the Gaussian PDF:

P(x_i \mid \text{class}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(x_i - \mu)^2}{2\sigma^2} \right)

This tells us:
“If we randomly pick a value of feature x_i from this class, how likely is it to be near the value we just observed?”

In other words, it’s the likelihood of seeing x_i under the bell curve for that class.
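As a rough sketch, the formula is a one-liner in NumPy. The numbers below are purely illustrative, and SciPy is assumed to be available only for the cross-check:

from scipy.stats import norm  # only used to cross-check the manual formula

def gaussian_pdf(x, mu, var):
    # P(x_i | class) = (1 / sqrt(2πσ²)) * exp( - (x_i - μ)² / (2σ²) )
    return 1.0 / np.sqrt(2 * np.pi * var) * np.exp(-((x - mu) ** 2) / (2 * var))

# Illustrative values: suppose petal width in one class has μ = 0.25 and σ² = 0.01
print(gaussian_pdf(0.3, mu=0.25, var=0.01))          # manual formula
print(norm.pdf(0.3, loc=0.25, scale=np.sqrt(0.01)))  # same value from scipy.stats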

Example

Imagine we are training a classifier for two classes: pass and fail, based on one feature: study_hours.

We analyze the training data and compute:

  • For pass:
    • μ = 8, σ² = 1
  • For fail:
    • μ = 4, σ² = 1

Now for a new sample with study_hours = 6, we compute:

  • For pass:
P(6 \mid \text{pass}) = \frac{1}{\sqrt{2\pi \cdot 1}} \exp\left( - \frac{(6 - 8)^2}{2 \cdot 1} \right) = \frac{1}{\sqrt{2\pi}} \exp(-2) \approx 0.05399
  • For fail:
P(6 \mid \text{fail}) = \frac{1}{\sqrt{2\pi}} \exp\left( - \frac{(6 - 4)^2}{2} \right) = \frac{1}{\sqrt{2\pi}} \exp(-2) \approx 0.05399

This tells the model:

  • 6 is equally likely under both curves
  • So the final prediction would depend on the class priors
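The arithmetic above is easy to verify, reusing the gaussian_pdf sketch from the previous section:

print(gaussian_pdf(6, mu=8, var=1))  # P(6 | pass) ≈ 0.05399
print(gaussian_pdf(6, mu=4, var=1))  # P(6 | fail) ≈ 0.05399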

Why Use a Gaussian?

In real-world data, many measurements (like height, weight, test scores, petal length, etc.) naturally follow a bell-shaped curve — they tend to cluster around an average value, with fewer very small or very large cases.

This is called a normal distribution (or Gaussian), and it’s a good fit for many features in real datasets.

The Gaussian PDF gives us a way to score how typical a value is.
It forms the foundation of Gaussian Naive Bayes — converting raw values into likelihoods, which are used for classification.

This is why Gaussian Naive Bayes works well on numerical, continuous data.

Combine with Prior Probability

We also multiply by the prior probability of each class:

P(\text{class}) = \frac{\text{Number of training samples in this class}}{\text{Total number of training samples}}
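In code, this is just a frequency count over the training labels. A minimal sketch, assuming y_train and target_names from the preparation step:

y_arr = y_train.to_numpy()
class_counts = np.bincount(y_arr)    # samples per class in the training split
priors = class_counts / len(y_arr)   # P(class) for each class
for name, p in zip(target_names, priors):
    print(f"P({name}) = {p:.3f}")    # roughly 1/3 each, thanks to the stratified split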

So the full class score becomes:

P(\text{class} \mid X) \propto P(\text{class}) \cdot \prod_{i=1}^{n} P(x_i \mid \text{class})

🔄 This is where Gaussian Naive Bayes differs from Bernoulli and Multinomial:

  • BernoulliNB: binary word presence (1 or 0)
  • MultinomialNB: word counts (integers ≥ 0)
  • GaussianNB: continuous real-valued features, modeled using normal distributions

🧠 Intuition:

  • GaussianNB assumes each feature in each class forms a bell-shaped curve
  • It uses the curve to measure how "typical" a feature value is for that class

The more typical the values are for a class, the higher the score.

Final Scoring Formula (with Logs)

To avoid tiny numbers and numerical instability, we move everything into log space.

We want to compute:

P(\text{class} \mid X) \propto P(\text{class}) \cdot \prod_{i=1}^{n} P(x_i \mid \text{class})

Taking the logarithm turns the product into a sum:

\log P(\text{class} \mid X) \propto \log P(\text{class}) + \sum_{i=1}^{n} \log P(x_i \mid \text{class})
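A tiny illustration of why this matters numerically, with made-up likelihood values:

small = np.full(1000, 1e-5)   # pretend we have 1000 tiny per-feature likelihoods
print(np.prod(small))         # 0.0 (the product underflows to zero)
print(np.log(small).sum())    # about -11512.9 (the log-sum is perfectly representable)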

Step-by-step: Log of the Gaussian PDF

Recall the Gaussian PDF for a single feature x_i:

P(x_i \mid \text{class}) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left( - \frac{(x_i - \mu)^2}{2\sigma^2} \right)

Taking the log of both sides:

\log P(x_i \mid \text{class}) = \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) + \log \left( \exp\left( - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \right)

Which simplifies to:

\log P(x_i \mid \text{class}) = - \frac{1}{2} \log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}

This gives us the complete formula for each feature’s log-probability.
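A quick sketch of this simplification, cross-checked against scipy.stats.norm.logpdf (SciPy assumed available; the numbers reuse the "pass" class from the earlier example):

from scipy.stats import norm

def log_gaussian_pdf(x, mu, var):
    # -0.5 * log(2πσ²) - (x - μ)² / (2σ²)
    return -0.5 * np.log(2 * np.pi * var) - ((x - mu) ** 2) / (2 * var)

print(log_gaussian_pdf(6, mu=8, var=1))  # manual simplification
print(norm.logpdf(6, loc=8, scale=1))    # same value from scipy.stats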

Final Decision Rule

Putting it all together:

\log P(\text{class} \mid X) \propto \log P(\text{class}) + \sum_{i=1}^{n} \left[ - \frac{1}{2} \log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]

This score is computed for each class, and the model selects the one with the highest log-probability.

Let's Code It

Now that we understand how it works, let’s implement it from scratch!

class MyGaussianNB:
    def __init__(self):
        # No smoothing hyperparameter needed.
        # Gaussian Naive Bayes assumes each feature follows:
        #   P(x_i | class) = (1 / sqrt(2πσ²)) * exp( - (x_i - μ)² / (2σ²) )
        pass

    # ==================== TRAIN ====================
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        X ─ shape (n_samples, n_features)
            Each row is a sample, each column is a real-valued feature.
        y ─ shape (n_samples,)
            Each value is the class label (e.g. 0, 1, 2)
        """
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # ---------- 1. PRIOR ----------
        # P(class) = (# samples in class) / (total samples)
        # log_prior[c] = log P(class = c)
        class_counts = np.bincount(y, minlength=n_classes)
        self.log_prior_ = np.log(class_counts / n_samples)  # shape (n_classes,)

        # ---------- 2. MEAN & VARIANCE ----------
        # For each class c:
        #   μ_c  = mean of feature x_i in class c
        #   σ²_c = variance of feature x_i in class c
        self.mean_ = np.zeros((n_classes, n_features))
        self.var_ = np.zeros((n_classes, n_features))

        for c in self.classes_:
            X_c = X[y == c]                       # Get samples in class c
            self.mean_[c, :] = X_c.mean(axis=0)   # μ_c
            self.var_[c, :] = X_c.var(axis=0)     # σ²_c

        return self

    # ==================== GAUSSIAN LOG-LIKELIHOOD ====================
    def _log_likelihood(self, x: np.ndarray):
        """
        Compute log P(x | class) for each class using:

            log P(x_i | class) = -0.5 * log(2πσ²) - ((x_i - μ)² / (2σ²))

        Final likelihood for a sample x:
            log P(x | class) = sum over all features i of log P(x_i | class)
        """
        eps = 1e-9  # avoid division by zero
        num = -0.5 * ((x - self.mean_) ** 2) / (self.var_ + eps)    # squared deviation
        log_pdf = -0.5 * np.log(2 * np.pi * self.var_ + eps) + num  # full log-PDF
        return log_pdf.sum(axis=1)  # total log-likelihood across features

    # =================== PREDICT ===================
    def predict(self, X: np.ndarray):
        """
        For each input sample x:
            Compute the class score:
                log P(class | x) ∝ log P(class) + log P(x | class)
            Then choose the class with the highest total log-probability.
        """
        predictions = []
        for x in X:
            log_likelihood = self._log_likelihood(x)          # log P(x | class)
            log_posterior = self.log_prior_ + log_likelihood  # total score
            best_class = self.classes_[np.argmax(log_posterior)]
            predictions.append(best_class)
        return np.array(predictions)


my_nb = MyGaussianNB().fit(X_train.to_numpy(), y_train.to_numpy())
y_pred_my = my_nb.predict(X_test.to_numpy())

acc_my = accuracy_score(y_test, y_pred_my)
print(f"scratch accuracy = {acc_my:.4f}")

Outputs:

scratch accuracy = 0.9211

It Works!

The scratch model hits an accuracy of 0.9211, matching scikit-learn.

This confirms that the logic behind Gaussian Naive Bayes — estimating class‑conditional means and variances, applying the Gaussian PDF, summing log-scores, and picking the class with the highest total — behaves exactly as expected.
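As an optional cross-check, the fitted parameters can be compared directly with scikit-learn's. GaussianNB stores the per-class means in theta_ and, in recent versions, the variances in var_; scikit-learn also adds a tiny variance smoothing, so the variances match only approximately:

print(np.allclose(my_nb.mean_, model.theta_))          # per-class means
print(np.allclose(my_nb.var_, model.var_, atol=1e-6))  # variances, up to sklearn's var_smoothing
print((y_pred_my == y_pred).all())                     # do the predictions agree on every test sample?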

We’ve successfully built Gaussian Naive Bayes from the ground up!