Taiju Sanagi: Experiments

Ridge Regression

Note
Updated: April 20, 2025

This note introduces the Ridge Regression algorithm using scikit-learn, explains the step-by-step logic behind how it works, and then demonstrates a from-scratch implementation to show that the core idea is simple and easy to build.

What is Ridge Regression?

Ridge Regression is like Linear Regression with a safety net.

It still tries to draw the best straight line through the data, but it also penalizes large coefficients to prevent the model from overfitting. This makes it more robust, especially when the data is noisy or when features are highly correlated.

It adds a small "cost" for having large weight values — which keeps the model simpler and more generalizable.

This notebook will:

  • Use scikit-learn to demonstrate how Ridge Regression works in practice
  • Explain the logic behind it in an intuitive way
  • Show how to implement the same idea step by step from scratch

Let’s dive into the details to understand how it works and how to implement it ourselves.

Preparation

Before we apply Ridge Regression, let’s create a dataset that has a nonlinear trend and some random noise — so we can see how regularization helps prevent overfitting.

import numpy as np
import matplotlib.pyplot as plt

# Create nonlinear data
np.random.seed(0)
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 0.5 * X**2 - X + 2 + np.random.randn(20, 1) * 4  # add noise

# Visualize the data
plt.scatter(X, y, color='blue', label='Data')
plt.title("Generated Nonlinear Data")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Outputs:

[Output image: "Generated Nonlinear Data" scatter plot]

Implement with Scikit-Learn

We’ll now use scikit-learn to fit a Ridge Regression model on a small, noisy dataset.

Because the relationship is nonlinear, we’ll first expand the input features using PolynomialFeatures. Then we’ll apply Ridge Regression to see how regularization helps prevent overfitting — especially with high-degree polynomials and limited data.

We’ll compare two models:

  • A standard Polynomial Regression (which tends to overfit with small data)
  • A Ridge-regularized Polynomial Regression (which smooths the curve)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge

degree = 20

# Without Ridge (standard polynomial)
poly_model = make_pipeline(PolynomialFeatures(degree), StandardScaler(), LinearRegression())
poly_model.fit(X, y)
y_pred_poly = poly_model.predict(X)

# With Ridge (strong regularization)
ridge_model = make_pipeline(PolynomialFeatures(degree), StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X, y)
y_pred_ridge = ridge_model.predict(X)

# Plot
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred_poly, color='red', label='Polynomial (deg=20)')
plt.plot(X, y_pred_ridge, color='green', linestyle='--', label='Ridge Polynomial (deg=20, alpha=1.0)')
plt.title("Polynomial vs Ridge (High Degree with Scaling)")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Outputs:

[Output image: "Polynomial vs Ridge (High Degree with Scaling)" comparison plot]

Understanding the Visualization

The plot above shows how two models behave when trained on just 20 data points with visible noise.

  • The red line is standard Polynomial Regression (degree 20)
  • The green dashed line is Polynomial + Ridge Regression (degree 20, alpha=1.0)

Both models use the same high-degree polynomial, but Ridge applies regularization to keep the curve smoother and prevent it from overfitting the noise in the data.

Behind the Scenes

1. Polynomial Features = Curve-Friendly Input

A normal linear model like:

\hat{y} = w_0 + w_1 x

can only fit a straight line. To model curves, we expand the input:

x \rightarrow [1, x, x^2, x^3, \dots, x^d]

So the model becomes:

\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d

This gives the model more flexibility to follow curved patterns in the data.

Even though the output is nonlinear in x, it is still linear in the weights, so we can train it using the same techniques as linear regression.
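
As a quick sketch of what this expansion looks like in code (using the PolynomialFeatures transformer already imported above and the X array from the Preparation step):

# Expand x into [1, x, x^2, x^3]; degree 3 kept small for readability
poly = PolynomialFeatures(degree=3)
X_expanded = poly.fit_transform(X)   # shape: (20, 4)
print(X_expanded[:2])                # first two rows: [1, x, x^2, x^3]
# The model is still linear in the weights: y_hat = X_expanded @ w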

2. Why Regularization?

With just 20 points and a 20-degree polynomial, the model can easily overfit — twisting and turning to match every point, even the noisy ones.

Ridge Regression prevents this by adding a penalty to the training process.
This penalty gets larger when the model uses big weights.

So instead of just minimizing prediction error, the model now minimizes:

\text{Loss} = \frac{1}{2n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 + \frac{\alpha}{2} \sum_{j=1}^d w_j^2

This added term:

\frac{\alpha}{2} \sum_{j=1}^d w_j^2

is what encourages smaller weights. Big weights lead to a big penalty. Small weights keep the loss low.

This is the heart of regularization.
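
To make this objective concrete, here is a minimal sketch of the loss as a Python function (the name ridge_loss and its arguments are illustrative; it assumes a feature matrix X_poly whose first column is the bias, and numpy imported as np as above):

def ridge_loss(X_poly, y, w, alpha):
    """Half mean squared error plus the Ridge penalty on the non-bias weights."""
    y_hat = X_poly @ w                                # predictions
    n = len(y)
    error_term = np.sum((y_hat - y) ** 2) / (2 * n)   # (1 / 2n) * sum of squared errors
    penalty = (alpha / 2) * np.sum(w[1:] ** 2)        # (alpha / 2) * sum of w_j^2, bias excluded
    return error_term + penalty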

3. How does it push the weights?

Ridge adds a shrink force during training to stop weights from growing too big.

If we use gradient descent, each weight is updated like this:

w_j := w_j - \eta \cdot \left( \text{error gradient} + \alpha \cdot w_j \right)

Let’s break it down:

  • The error gradient is the usual part that comes from prediction error
  • The \alpha \cdot w_j term is the new Ridge penalty that pulls the weight back toward zero
  • \eta is the learning rate

So the model does two things at once:

  1. Adjusts the weight to reduce prediction error
  2. Pulls it back toward zero if it's getting too large

The bigger the weight, the stronger the pull — so large weights shrink faster.

This small shrink every step keeps the model from becoming too wiggly — especially when fitting high-degree polynomials.
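
Here is a minimal gradient descent sketch of that update (illustrative only; the names are assumptions, and scikit-learn's Ridge typically solves the problem directly rather than by gradient descent):

def ridge_gradient_step(X_poly, y, w, alpha, eta):
    """One gradient descent update for the Ridge loss sketched above."""
    n = len(y)
    error = X_poly @ w - y            # prediction error
    grad = X_poly.T @ error / n       # error gradient for every weight
    grad[1:] += alpha * w[1:]         # the Ridge "shrink force" (bias excluded)
    return w - eta * grad             # bigger weights get pulled back harder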

If we use the closed-form solution instead of gradient descent, Ridge modifies the normal equation.

Normally, we solve for the weights with:

\mathbf{w} = (X^\top X)^{-1} X^\top y

But Ridge adds a penalty directly into the matrix:

\mathbf{w} = (X^\top X + \alpha I)^{-1} X^\top y

  • X is the matrix of polynomial features
  • I is the identity matrix
  • \alpha controls how strong the shrink is

To avoid issues when X^\top X is nearly singular (which is common with high-degree polynomials), we use the Moore–Penrose pseudoinverse:

\mathbf{w} = \left(X^\top X + \alpha I\right)^{+} X^\top y

This ensures numerical stability.

We also exclude the bias term (the first column of X) from regularization by setting I_{00} = 0.
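
Putting the pieces together, here is a minimal sketch of this closed-form solve (assuming a feature matrix X_poly whose first column is the bias; the full from-scratch class in the next section does the same thing with feature scaling added):

def ridge_closed_form(X_poly, y, alpha):
    """Solve w = (X^T X + alpha * I)^+ X^T y, leaving the bias unregularized."""
    I = np.eye(X_poly.shape[1])
    I[0, 0] = 0                        # do not shrink the bias weight
    return np.linalg.pinv(X_poly.T @ X_poly + alpha * I) @ (X_poly.T @ y)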

4. Why Scale Polynomial Features?

Polynomial terms like x^2, x^5, or x^{20} can have vastly different magnitudes.

This causes:

  • Numerical instability
  • Slower convergence
  • Ineffective regularization

That’s why we standardize the features — so each has mean 0 and standard deviation 1:

We scale [x, x^2, \dots, x^d], but leave the bias column (1) untouched.

This makes Ridge regularization more effective, and ensures the model fits more like scikit-learn.
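
As a small sketch of this scaling step (assuming X_poly is a polynomial feature matrix whose first column is the bias, and using the StandardScaler already imported above):

scaler = StandardScaler()
X_bias = X_poly[:, 0:1]                           # keep the bias column as-is
X_scaled = scaler.fit_transform(X_poly[:, 1:])    # standardize x, x^2, ..., x^d
X_ready = np.hstack([X_bias, X_scaled])           # every non-bias feature now has mean 0, std 1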

5. What happens when weights get smaller?

Each weight controls how much the model curves:

  • w_1 controls the slope
  • w_2 controls the curve
  • w_3, w_4, and the higher-degree weights control twists and wiggles

If the weights are big, the model swings wildly — trying to hit every data point exactly.

But when Ridge keeps the weights small:

  • The curve becomes smoother
  • The shape stays closer to the middle of the data
  • It focuses on the overall pattern, not the noise

So instead of chasing every random bump in the data, Ridge helps the model stay calm and centered.

Let's Code It

Now let’s implement Polynomial Regression with Ridge Regularization from scratch,
using the closed-form solution (Normal Equation with regularization) we discussed earlier.

We’ll follow three main steps:

  1. Expand the input into polynomial features
  2. Apply scaling to x, x^2, \dots, x^d (but not the bias)
  3. Solve for the weights using the Ridge-modified normal equation:
\mathbf{w} = \left(X^\top X + \alpha I\right)^{+} X^\top y

This gives the weights that both fit the data and stay small, helping to avoid overfitting — especially with high-degree polynomials.

# STEP 1: Expand input into polynomial features
def add_polynomial_features(X, degree):
    """
    Expand input into: [1, x, x^2, ..., x^d]
    - The first column (1) is the bias term
    - Each subsequent column represents a power of x
    """
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    features = [np.ones((X.shape[0], 1))]  # Bias term
    if degree >= 1:
        features.append(X)                  # x^1
    for d in range(2, degree + 1):
        features.append(X**d)               # x^2 to x^d
    return np.hstack(features)


# STEP 2: Define Ridge-regularized Polynomial Regression class
class MyRidgePolynomialRegression:
    def __init__(self, degree=2, alpha=1.0):
        self.degree = degree
        self.alpha = alpha
        self.coef_ = None
        self.scaler = StandardScaler()  # Only used for non-bias features

    def fit(self, X, y):
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # --- Step 1: Expand input into [1, x, x^2, ..., x^d] ---
        X_poly = add_polynomial_features(X, self.degree)

        # --- Step 2: Separate bias column (do not scale or regularize it) ---
        X_bias = X_poly[:, 0:1]      # shape = (n_samples, 1)
        X_features = X_poly[:, 1:]   # shape = (n_samples, degree)

        # --- Step 3: Scale only the features (x, x^2, ..., x^d) ---
        if X_features.shape[1] > 0:
            X_features_scaled = self.scaler.fit_transform(X_features)

            # --- Step 4: Recombine unscaled bias with scaled features ---
            X_ready = np.hstack([X_bias, X_features_scaled])
        else:
            # No features to scale (e.g., degree = 0)
            X_ready = X_bias

        # --- Step 5: Apply Ridge Regularized Normal Equation ---
        # w = (XᵀX + αI)⁺ Xᵀy, where I[0,0] = 0 to exclude the bias from regularization
        n_features = X_ready.shape[1]
        I = np.eye(n_features)  # Identity matrix
        I[0, 0] = 0             # Exclude the bias term from regularization

        term1 = X_ready.T @ X_ready + self.alpha * I  # → XᵀX + αI
        term2 = X_ready.T @ y                         # → Xᵀy
        self.coef_ = np.linalg.pinv(term1) @ term2    # → w = (term1)⁺ term2

    def predict(self, X):
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        # --- Step 1: Generate polynomial features from test data ---
        X_poly = add_polynomial_features(X, self.degree)

        # --- Step 2: Separate bias and features ---
        X_bias = X_poly[:, 0:1]
        X_features = X_poly[:, 1:]

        # --- Step 3: Scale test features using the fitted scaler ---
        if X_features.shape[1] > 0:
            X_features_scaled = self.scaler.transform(X_features)
            X_ready = np.hstack([X_bias, X_features_scaled])
        else:
            X_ready = X_bias

        # --- Step 4: Predict using dot product ---
        return X_ready @ self.coef_


# STEP 3: Train and plot
model_my = MyRidgePolynomialRegression(degree=20, alpha=1.0)
model_my.fit(X, y)
y_pred_my = model_my.predict(X)

plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred_my, color='purple', label='My Ridge Polynomial (deg=20)')
plt.title("Polynomial vs Ridge (High Degree with Scaling)")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

Outputs:

[Output image: from-scratch Ridge polynomial fit plot]

It Works!!

The curved regression line produced by our from-scratch Polynomial Regression with Ridge Regularization closely matches the result from scikit-learn.

This confirms that the regularization logic, feature scaling, and solution formula behave as expected.

We've successfully built Polynomial Ridge Regression from the ground up.