Taiju Sanagi: Experiments

Ridge Regression

Note
Updated: April 20, 2025

This note introduces the Ridge Regression algorithm using scikit-learn, explains the step-by-step logic behind how it works, and then demonstrates a from-scratch implementation to show that the core idea is simple and easy to build.

What is Ridge Regression?

Ridge Regression is like Linear Regression with a safety net.

It still tries to draw the best straight line through the data, but it also penalizes large coefficients to prevent the model from overfitting. This makes it more robust, especially when the data is noisy or when features are highly correlated.

It adds a small "cost" for having large weight values — which keeps the model simpler and more generalizable.

This notebook will:

  • Use scikit-learn to demonstrate how Ridge Regression works in practice
  • Explain the logic behind it in an intuitive way
  • Show how to implement the same idea step by step from scratch

Let’s dive into the details to understand how it works and how to implement it ourselves.

Preparation

Before we apply Ridge Regression, let’s create a dataset that has a nonlinear trend and some random noise — so we can see how regularization helps prevent overfitting.

import numpy as np
import matplotlib.pyplot as plt

# Create nonlinear data
np.random.seed(0)
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 0.5 * X**2 - X + 2 + np.random.randn(20, 1) * 4  # add noise

# Visualize the data
plt.scatter(X, y, color='blue', label='Data')
plt.title("Generated Nonlinear Data")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Outputs:

[Output image: "Generated Nonlinear Data" scatter plot]

Implement with Scikit-Learn

We’ll now use scikit-learn to fit a Ridge Regression model on a small, noisy dataset.

Because the relationship is nonlinear, we’ll first expand the input features using PolynomialFeatures. Then we’ll apply Ridge Regression to see how regularization helps prevent overfitting — especially with high-degree polynomials and limited data.

We’ll compare two models:

  • A standard Polynomial Regression (which tends to overfit with small data)
  • A Ridge-regularized Polynomial Regression (which smooths the curve)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge

degree = 20

# Without Ridge (standard polynomial)
poly_model = make_pipeline(PolynomialFeatures(degree), StandardScaler(), LinearRegression())
poly_model.fit(X, y)
y_pred_poly = poly_model.predict(X)

# With Ridge (strong regularization)
ridge_model = make_pipeline(PolynomialFeatures(degree), StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X, y)
y_pred_ridge = ridge_model.predict(X)

# Plot
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred_poly, color='red', label='Polynomial (deg=20)')
plt.plot(X, y_pred_ridge, color='green', linestyle='--', label='Ridge Polynomial (deg=20, alpha=1.0)')
plt.title("Polynomial vs Ridge (High Degree with Scaling)")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Outputs:

[Output image: "Polynomial vs Ridge (High Degree with Scaling)" comparison plot]

Understanding the Visualization

The plot above shows how two models behave when trained on just 20 data points with visible noise.

  • The red line is standard Polynomial Regression (degree 20)
  • The green dashed line is Polynomial + Ridge Regression (degree 20, alpha=1.0)

Both models use the same high-degree polynomial, but Ridge applies regularization to keep the curve smoother and prevent it from overfitting the noise in the data.

Behind the Scenes

1. Polynomial Features = Curve-Friendly Input

A normal linear model like:

\hat{y} = w_0 + w_1 x

can only fit a straight line. To model curves, we expand the input:

x \rightarrow [1, x, x^2, x^3, \dots, x^d]

So the model becomes:

\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d

This gives the model more flexibility to follow curved patterns in the data.

Even though the output is nonlinear in x, it is still linear in the weights, so we can train it using the same techniques as linear regression.
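
As a quick sketch of what this expansion looks like in code (using the PolynomialFeatures transformer already imported above and the X array from the Preparation step):

# Expand x into [1, x, x^2, x^3]; degree 3 kept small for readability
poly = PolynomialFeatures(degree=3)
X_expanded = poly.fit_transform(X)   # shape: (20, 4)
print(X_expanded[:2])                # first two rows: [1, x, x^2, x^3]
# The model is still linear in the weights: y_hat = X_expanded @ w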

2. Why Regularization?

With just 20 points and a 20-degree polynomial, the model can easily overfit — twisting and turning to match every point, even the noisy ones.

Ridge Regression prevents this by adding a penalty to the training process.
This penalty gets larger when the model uses big weights.

So instead of just minimizing prediction error, the model now minimizes:

\text{Loss} = \frac{1}{2n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 + \frac{\alpha}{2} \sum_{j=1}^d w_j^2

This added term:

\frac{\alpha}{2} \sum_{j=1}^d w_j^2

is what encourages smaller weights. Big weights lead to a big penalty. Small weights keep the loss low.

This is the heart of regularization.
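
To make this objective concrete, here is a minimal sketch of the loss as a Python function (the name ridge_loss and its arguments are illustrative; it assumes a feature matrix X_poly whose first column is the bias, and numpy imported as np as above):

def ridge_loss(X_poly, y, w, alpha):
    """Half mean squared error plus the Ridge penalty on the non-bias weights."""
    y_hat = X_poly @ w                                # predictions
    n = len(y)
    error_term = np.sum((y_hat - y) ** 2) / (2 * n)   # (1 / 2n) * sum of squared errors
    penalty = (alpha / 2) * np.sum(w[1:] ** 2)        # (alpha / 2) * sum of w_j^2, bias excluded
    return error_term + penalty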

3. How does it push the weights?

Ridge adds a shrink force during training to stop weights from growing too big.

If we use gradient descent, each weight is updated like this:

w_j := w_j - \eta \cdot \left( \text{error gradient} + \alpha \cdot w_j \right)

Let’s break it down:

  • The error gradient is the usual part that comes from prediction error
  • The \alpha \cdot w_j term is the new Ridge penalty that pulls the weight back toward zero
  • \eta is the learning rate

So the model does two things at once:

  1. Adjusts the weight to reduce prediction error
  2. Pulls it back toward zero if it's getting too large

The bigger the weight, the stronger the pull — so large weights shrink faster.

This small shrink every step keeps the model from becoming too wiggly — especially when fitting high-degree polynomials.
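
Here is a minimal gradient descent sketch of that update (illustrative only; the names are assumptions, and scikit-learn's Ridge typically solves the problem directly rather than by gradient descent):

def ridge_gradient_step(X_poly, y, w, alpha, eta):
    """One gradient descent update for the Ridge loss sketched above."""
    n = len(y)
    error = X_poly @ w - y            # prediction error
    grad = X_poly.T @ error / n       # error gradient for every weight
    grad[1:] += alpha * w[1:]         # the Ridge "shrink force" (bias excluded)
    return w - eta * grad             # bigger weights get pulled back harder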

If we use the closed-form solution instead of gradient descent, Ridge modifies the normal equation.

Normally, we solve for the weights with:

\mathbf{w} = (X^\top X)^{-1} X^\top y

But Ridge adds a penalty directly into the matrix:

\mathbf{w} = (X^\top X + \alpha I)^{-1} X^\top y

  • X is the matrix of polynomial features
  • I is the identity matrix
  • \alpha controls how strong the shrink is

To avoid issues when X^\top X is nearly singular (which is common with high-degree polynomials), we use the Moore–Penrose pseudoinverse:

\mathbf{w} = \left(X^\top X + \alpha I\right)^{+} X^\top y

This ensures numerical stability.

We also exclude the bias term (the first column of X) from regularization by setting I_{00} = 0.
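
Putting the pieces together, here is a minimal sketch of this closed-form solve (assuming a feature matrix X_poly whose first column is the bias; the full from-scratch class in the next section does the same thing with feature scaling added):

def ridge_closed_form(X_poly, y, alpha):
    """Solve w = (X^T X + alpha * I)^+ X^T y, leaving the bias unregularized."""
    I = np.eye(X_poly.shape[1])
    I[0, 0] = 0                        # do not shrink the bias weight
    return np.linalg.pinv(X_poly.T @ X_poly + alpha * I) @ (X_poly.T @ y)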

4. Why Scale Polynomial Features?

Polynomial terms like x^2, x^5, or x^{20} can have vastly different magnitudes.

This causes:

  • Numerical instability
  • Slower convergence
  • Ineffective regularization

That’s why we standardize the features — so each has mean 0 and standard deviation 1:

We scale [x, x^2, \dots, x^d], but leave the bias column (1) untouched.

This makes Ridge regularization more effective, and ensures the model fits more like scikit-learn.
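
As a small sketch of this scaling step (assuming X_poly is a polynomial feature matrix whose first column is the bias, and using the StandardScaler already imported above):

scaler = StandardScaler()
X_bias = X_poly[:, 0:1]                           # keep the bias column as-is
X_scaled = scaler.fit_transform(X_poly[:, 1:])    # standardize x, x^2, ..., x^d
X_ready = np.hstack([X_bias, X_scaled])           # every non-bias feature now has mean 0, std 1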

5. What happens when weights get smaller?

Each weight controls how much the model curves:

  • w_1 controls the slope
  • w_2 controls the curve
  • w_3, w_4, and the higher-degree weights control twists and wiggles

If the weights are big, the model swings wildly — trying to hit every data point exactly.

But when Ridge keeps the weights small:

  • The curve becomes smoother
  • The shape stays closer to the middle of the data
  • It focuses on the overall pattern, not the noise

So instead of chasing every random bump in the data, Ridge helps the model stay calm and centered.

Let's Code It

Now let’s implement Polynomial Regression with Ridge Regularization from scratch,
using the closed-form solution (Normal Equation with regularization) we discussed earlier.

We’ll follow three main steps:

  1. Expand the input into polynomial features
  2. Apply scaling to x, x^2, \dots, x^d (but not the bias)
  3. Solve for the weights using the Ridge-modified normal equation:
\mathbf{w} = \left(X^\top X + \alpha I\right)^{+} X^\top y

This gives the weights that both fit the data and stay small, helping to avoid overfitting — especially with high-degree polynomials.

# STEP 1: Expand input into polynomial features
def add_polynomial_features(X, degree):
    """
    Expand input into: [1, x, x^2, ..., x^d]
    - The first column (1) is the bias term
    - Each subsequent column represents a power of x
    """
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    features = [np.ones((X.shape[0], 1))]  # Bias term
    if degree >= 1:
        features.append(X)                  # x^1
    for d in range(2, degree + 1):
        features.append(X**d)               # x^2 to x^d
    return np.hstack(features)


# STEP 2: Define Ridge-regularized Polynomial Regression class
class MyRidgePolynomialRegression:
    def __init__(self, degree=2, alpha=1.0):
        self.degree = degree
        self.alpha = alpha
        self.coef_ = None
        self.scaler = StandardScaler()  # Only used for non-bias features

    def fit(self, X, y):
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # --- Step 1: Expand input into [1, x, x^2, ..., x^d] ---
        X_poly = add_polynomial_features(X, self.degree)

        # --- Step 2: Separate bias column (do not scale or regularize it) ---
        X_bias = X_poly[:, 0:1]      # shape = (n_samples, 1)
        X_features = X_poly[:, 1:]   # shape = (n_samples, degree)

        # --- Step 3: Scale only the features (x, x^2, ..., x^d) ---
        if X_features.shape[1] > 0:
            X_features_scaled = self.scaler.fit_transform(X_features)

            # --- Step 4: Recombine unscaled bias with scaled features ---
            X_ready = np.hstack([X_bias, X_features_scaled])
        else:
            # No features to scale (e.g., degree = 0)
            X_ready = X_bias

        # --- Step 5: Apply Ridge Regularized Normal Equation ---
        # w = (XᵀX + αI)⁺ Xᵀy, where I[0,0] = 0 to exclude the bias from regularization
        n_features = X_ready.shape[1]
        I = np.eye(n_features)  # Identity matrix
        I[0, 0] = 0             # Exclude the bias term from regularization

        term1 = X_ready.T @ X_ready + self.alpha * I  # → XᵀX + αI
        term2 = X_ready.T @ y                         # → Xᵀy
        self.coef_ = np.linalg.pinv(term1) @ term2    # → w = (term1)⁺ term2

    def predict(self, X):
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # --- Step 1: Generate polynomial features from test data ---
        X_poly = add_polynomial_features(X, self.degree)

        # --- Step 2: Separate bias and features ---
        X_bias = X_poly[:, 0:1]
        X_features = X_poly[:, 1:]

        # --- Step 3: Scale test features using the fitted scaler ---
        if X_features.shape[1] > 0:
            X_features_scaled = self.scaler.transform(X_features)
            X_ready = np.hstack([X_bias, X_features_scaled])
        else:
            X_ready = X_bias

        # --- Step 4: Predict using dot product ---
        return X_ready @ self.coef_


# STEP 3: Train and plot
model_my = MyRidgePolynomialRegression(degree=20, alpha=1.0)
model_my.fit(X, y)
y_pred_my = model_my.predict(X)

plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred_my, color='purple', label='My Ridge Polynomial (deg=20)')
plt.title("Polynomial vs Ridge (High Degree with Scaling)")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Outputs:

[Output image: from-scratch Ridge polynomial fit plot]

It Works!!

The curved regression line produced by our from-scratch Polynomial Regression with Ridge Regularization closely matches the result from scikit-learn.

This confirms that the regularization logic, feature scaling, and solution formula behave as expected.

We've successfully built Polynomial Ridge Regression from the ground up.