Taiju Sanagi: Experiments

Linear Regression

Note
Updated: April 18, 2025

This note introduces the Linear Regression algorithm using scikit-learn, explains the step-by-step logic behind how it works, and then demonstrates a from-scratch implementation to show that the core idea is simple and easy to build.

What is Linear Regression?

Linear Regression is like drawing the best straight line through a set of points.

The line represents a relationship between the input feature and the predicted value — like how a person's weight might relate to their height.

It learns from existing data to find the "best fit line" and uses it to make predictions on new data.

This notebook will:

  • Use scikit-learn to demonstrate how Linear Regression works in practice
  • Explain the logic behind it in an intuitive way
  • Show how to implement the same idea step by step from scratch

Let’s dive into the details to understand how it works and how to implement it ourselves.

Preparation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# 1. Load regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X = X.flatten()

# 2. Plot function
def plot_regression_line(X, y, w, b, title="Regression Line", label="Model", color="red"):
    plt.scatter(X, y, label="Data", alpha=0.6)
    x_line = np.linspace(X.min(), X.max(), 100)
    y_line = w * x_line + b
    plt.plot(x_line, y_line, color=color, label=label)
    plt.title(title)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.grid(True)
    plt.show()

Implement with Scikit-Learn

sk_model = LinearRegression()
sk_model.fit(X.reshape(-1, 1), y)

w_sklearn = sk_model.coef_[0]
b_sklearn = sk_model.intercept_

plot_regression_line(X, y, w_sklearn, b_sklearn, title="Scikit-learn LinearRegression", label="sklearn", color="blue")

Outputs:

[Plot: scatter of the data with the fitted scikit-learn regression line]

Behind the Scenes

1. The Goal

We want to find the best-fitting straight line:

\hat{y} = w \cdot x + b

Where:

  • \hat{y} is the predicted value
  • x is the input
  • w is the slope (how steep the line is)
  • b is the intercept (where it crosses the y-axis)
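
To make the equation concrete, here is a tiny sketch with hypothetical values for the slope, intercept, and input (just numbers for illustration, not fitted ones):

# Hypothetical slope, intercept, and input, chosen only for illustration
w_demo, b_demo = 2.0, 1.0
x_new = 3.0
y_hat = w_demo * x_new + b_demo  # 2.0 * 3.0 + 1.0 = 7.0
print(y_hat)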

2. How Good is the Line? (Cost Function)

To measure how well the line fits the data, we use Mean Squared Error (MSE) as our cost function:

J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2

This means:

  • For each data point, we compute the difference between predicted and actual value: (\hat{y}_i - y_i)
  • Then we square the difference to:
    • Make all errors positive (cancel out minus and plus)
    • Make the model more sensitive to larger errors
  • Finally, we average the squared errors across all n samples

We use \frac{1}{2n} instead of \frac{1}{n} to simplify the math when we take derivatives later.
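
As a small sketch, the cost can be computed in a few vectorized lines, reusing the X and y from the preparation cell and an arbitrary guess for w and b:

# J(w, b) = (1 / 2n) * sum((w * x_i + b - y_i)^2) for an arbitrary guess
w_guess, b_guess = 0.0, 0.0
n = len(X)
errors = (w_guess * X + b_guess) - y        # prediction error for every sample
cost = (1 / (2 * n)) * np.sum(errors ** 2)  # squared errors, averaged, with the 1/2 factor
print(cost)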

3. How to Minimize the Cost? (Gradient Descent)

We want to adjust the values of w and b to make the cost function J(w, b) as small as possible.

To do this, we use an algorithm called gradient descent, which:

  • Calculates the slope of the cost function
  • Takes small steps in the direction that reduces the cost
  • Repeats this process over and over again — adjusting w and b a little each time

It keeps checking:

"Which direction should I move to make the cost smaller?"

To move in that direction, we compute the partial derivatives:

\frac{\partial J}{\partial w} \quad \text{and} \quad \frac{\partial J}{\partial b}

And update like this:

w := w - \alpha \cdot \frac{\partial J}{\partial w}

b := b - \alpha \cdot \frac{\partial J}{\partial b}

\alpha is the learning rate, a small number that controls how big each update step is.

This cycle of "calculate slope → move → calculate again" continues until the model improves and the cost becomes low enough.
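
In code, one update step is just two lines. Here is a minimal sketch with made-up numbers for the gradients (the real values of dw and db come from the formulas we derive in the next sections):

# One illustrative gradient descent step (all numbers are made up)
w, b = 0.0, 0.0        # current parameters
alpha = 0.01           # learning rate (assumed value)
dw, db = -35.0, -2.5   # pretend these are dJ/dw and dJ/db at (w, b)
w = w - alpha * dw     # step w downhill -> 0.35
b = b - alpha * db     # step b downhill -> 0.025
print(w, b)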

Before diving into the derivative formulas, let’s quickly review two key tools that make it all work.

4. A Quick Review: Power Rule and Chain Rule

Power Rule:

If a function contains a square like:

f(x) = x^2

Then the derivative is:

\frac{d}{dx} f(x) = 2x

This means:

If your function contains something squared, its derivative will be 2 times that thing.

Chain Rule:

Now, what if the thing being squared is itself a function?
Let’s say:

f(x) = [g(x)]^2

Then the chain rule says:

\frac{d}{dx} f(x) = 2 \cdot g(x) \cdot g'(x)

This means:

  • You treat the whole inner function g(x) like a single variable and apply the power rule: 2 \cdot g(x)
  • Then multiply by the derivative of the inside part: g'(x)

So:

Chain rule = outer derivative × inner derivative
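
One way to convince yourself of the chain rule is a quick numerical check. This sketch uses an arbitrarily chosen inner function g(x) = 3x + 1 (so g'(x) = 3) and compares the chain-rule answer with a finite-difference approximation:

# f(x) = [g(x)]^2 with g(x) = 3x + 1, so f'(x) should be 2 * g(x) * 3
def g(x):
    return 3 * x + 1

def f(x):
    return g(x) ** 2

x = 2.0
chain_rule = 2 * g(x) * 3                      # 2 * 7 * 3 = 42
h = 1e-6
finite_diff = (f(x + h) - f(x - h)) / (2 * h)  # numerical slope, approximately 42
print(chain_rule, finite_diff)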

5. Derivatives Step-by-Step (Putting It All Together)

Now we compute \frac{\partial J}{\partial w} and \frac{\partial J}{\partial b} separately, starting from a single data point to keep things simple.

Let’s say we have just one data point:

  • Input: x
  • Actual output: y
  • Predicted output: \hat{y} = wx + b

So the cost for this single point is:

J = (wx + b - y)^2

We define the inner function:

u = wx + b - y

So the cost becomes:

J = u^2

This prepares us to apply the chain rule to J = u^2.

Derivative with respect to w

Step-by-step:

\frac{dJ}{dw} = \frac{d}{dw} (u^2)

Now apply the chain rule:

  • Outer derivative: \frac{d}{du}(u^2) = 2u
  • Inner derivative:
    • \frac{du}{dw} = x, because when we change w, only the wx term affects the result. x is constant, so the slope is x

So:

\frac{dJ}{dw} = 2u \cdot x = 2(wx + b - y) \cdot x

Derivative with respect to b

Same logic:

\frac{dJ}{db} = \frac{d}{db} (u^2)

Apply the chain rule:

  • Outer derivative: \frac{d}{du}(u^2) = 2u
  • Inner derivative:
    • \frac{du}{db} = 1, because when we change b, it directly adds to u. The slope is 1, and the other terms don't change

So:

\frac{dJ}{db} = 2(wx + b - y)
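
These two single-point formulas translate directly into code. A minimal sketch with made-up numbers for the data point and parameters:

# One data point and one parameter setting, chosen only for illustration
x_i, y_i = 2.0, 5.0
w, b = 1.0, 0.0

u = w * x_i + b - y_i  # inner function u = wx + b - y  -> -3.0
dJ_dw = 2 * u * x_i    # dJ/dw = 2(wx + b - y) * x      -> -12.0
dJ_db = 2 * u          # dJ/db = 2(wx + b - y)          -> -6.0
print(dJ_dw, dJ_db)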

6. Generalizing to All Data Points

Now that we’ve understood the derivatives for a single data point, we can scale this up to the whole dataset.

We repeat the same process for each training example (x_i, y_i) and sum the gradients.
Our full cost function over all points is:

J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2

Derivative with respect to w

Apply the chain rule:

  • Outer derivative of the square: 2(w x_i + b - y_i)
  • Inner derivative with respect to w: x_i

So the full gradient becomes:

\frac{\partial J}{\partial w} = \frac{1}{2n} \sum_{i=1}^{n} 2(w x_i + b - y_i) \cdot x_i

Now cancel the 2s:

\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i) \cdot x_i

Derivative with respect to b

Same idea:

  • Outer derivative: 2(w x_i + b - y_i)
  • Inner derivative with respect to b: 1

So:

\frac{\partial J}{\partial b} = \frac{1}{2n} \sum_{i=1}^{n} 2(w x_i + b - y_i)

Cancel the 2s:

\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)
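
These sums are exactly what NumPy's vectorized operations compute. Here is a sketch using the X and y from the preparation cell and an arbitrary starting point w = b = 0:

# Vectorized gradients over all n samples
w, b = 0.0, 0.0
n = len(X)
error = (w * X + b) - y              # (w * x_i + b - y_i) for every i
dJ_dw = (1 / n) * np.sum(error * X)  # (1/n) * sum(error_i * x_i)
dJ_db = (1 / n) * np.sum(error)      # (1/n) * sum(error_i)
print(dJ_dw, dJ_db)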

7. When to Stop: Convergence Criteria

Gradient descent doesn’t just make one update — it keeps looping:

  1. Calculate the gradient (slope) using the current values of w and b
  2. Use the learning rate \alpha to take a step downhill
  3. Repeat: update w, update b, and calculate again

This cycle continues until one of the following happens:

  • Convergence: Further updates barely change the cost. For example, if the difference in cost between steps is less than a small number like 10^{-6}, we say it's close enough.

  • Maximum iterations: We set a safety limit (e.g. 1000 steps) so it won’t run forever.

  • No significant improvement: If the cost hasn’t improved for many steps, we might be stuck in a flat spot or local minimum. In that case, it’s better to stop early.

So gradient descent is like a climber constantly feeling the slope, moving slowly downhill — and stopping when it’s either flat enough, time’s up, or there's no more progress.

This completes the core learning process — now we’re ready to visualize or implement it in code!

Let's Code It

class MyLinearRegression:
    def __init__(self, learning_rate=0.01, max_iter=1000, tol=1e-6):
        # Learning rate α: controls step size in gradient descent (Section 3)
        self.alpha = learning_rate
        # Maximum number of gradient descent iterations (Section 7)
        self.max_iter = max_iter
        # Tolerance for convergence: stop if cost change is very small (Section 7)
        self.tol = tol
        # Initialize weights (w: slope, b: intercept) to zero (Section 5 start)
        self.w = 0
        self.b = 0

    def fit(self, X, y):
        n = len(X)  # number of training samples

        # Start with w = 0 and b = 0 as initial guesses
        w, b = 0.0, 0.0

        # Track previous cost to check for convergence (Section 7)
        prev_cost = float('inf')

        # Gradient Descent Loop (Section 3)
        for i in range(self.max_iter):
            # Predict y: ŷ = w * x + b (Section 1)
            y_pred = w * X + b

            # Error = prediction - actual (Section 2)
            error = y_pred - y

            # Mean Squared Error cost function (Section 2)
            cost = (1 / (2 * n)) * np.sum(error ** 2)

            # Convergence check: stop if cost doesn't improve much (Section 7)
            if abs(prev_cost - cost) < self.tol:
                print(f"Converged at iteration {i}, cost: {cost:.6f}")
                break
            prev_cost = cost

            # Compute gradients (Section 6)
            # dJ/dw = (1/n) ∑ (w x_i + b - y_i) * x_i
            dw = (1 / n) * np.sum(error * X)
            # dJ/db = (1/n) ∑ (w x_i + b - y_i)
            db = (1 / n) * np.sum(error)

            # Update parameters (Section 3)
            w -= self.alpha * dw
            b -= self.alpha * db

        # Store the learned weights after training
        self.w = w
        self.b = b

    def predict(self, X):
        # Make predictions using learned weights (Section 1 again)
        return self.w * X + self.b

# 5. Fit with custom model and plot
my_model = MyLinearRegression(learning_rate=0.01, max_iter=1000)
my_model.fit(X, y)

plot_regression_line(X, y, my_model.w, my_model.b, title="MyLinearRegression (from scratch)", label="Custom GD", color="green")

Outputs:

[Plot: scatter of the data with the regression line from the from-scratch model]

It Works!!

The regression line produced by our scratch implementation closely matches the result from scikit-learn.

This confirms that the gradient descent logic — computing the cost, applying the chain rule, and updating the parameters — behaves exactly as expected.
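
A quick way to see the agreement numerically is to print both sets of parameters side by side (the values will differ slightly, since gradient descent stops near, not exactly at, the least-squares solution):

# Compare scikit-learn's fit with our gradient descent fit
print("sklearn:", w_sklearn, b_sklearn)
print("scratch:", my_model.w, my_model.b)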

We've successfully built Linear Regression from the ground up!