Step 1: Define the Cost Function
We want to minimize the error between predictions and true values.
$J(w) = \frac{1}{2}\|Xw - t\|^2$
✅ Meaning:
- $Xw$: predicted values
- $t$: true target values
- $Xw - t$: error vector
- $\|\cdot\|^2$: squared Euclidean norm, i.e., the sum of squared errors
- $\frac{1}{2}$: a constant factor that makes the derivative cleaner later
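To make the notation concrete, here is a minimal NumPy sketch of this cost on made-up data (the values and the helper name `cost` are illustrative only, not part of the derivation):

```python
import numpy as np

def cost(X, w, t):
    """Half the sum of squared errors: J(w) = 1/2 * ||Xw - t||^2."""
    residual = X @ w - t                  # error vector Xw - t
    return 0.5 * (residual @ residual)    # ||v||^2 = v^T v

# Toy data: 5 examples, 2 features (values chosen arbitrarily)
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.0],
              [1.0, 3.5],
              [1.0, 4.0]])
t = np.array([1.1, 2.0, 2.4, 3.9, 4.2])
w = np.array([0.5, 1.0])

print(cost(X, w, t))   # a single non-negative number
```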
Step 2: Expand $\|Xw - t\|^2$
Use the rule:
$\|v\|^2 = v^\top v$
Expand:
$\|Xw - t\|^2 = (Xw - t)^\top (Xw - t)$
Distribute:
$= (Xw)^\top (Xw) - (Xw)^\top t - t^\top (Xw) + t^\top t$
Since $(Xw)^\top t = t^\top (Xw)$ (both are scalars, and a scalar equals its own transpose):
$= (Xw)^\top (Xw) - 2t^\top (Xw) + t^\top t$
Step 3: Rewrite in Matrix Terms
Using the standard matrix identities:
- $(Xw)^\top (Xw) = w^\top X^\top X w$
- $t^\top (Xw) = t^\top X w$
Thus:
$J(w) = \frac{1}{2}\left(w^\top X^\top X w - 2t^\top X w + t^\top t\right)$
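As a sanity check, the expanded matrix form can be compared numerically against the original definition. This is just a sketch on randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # arbitrary data, purely for the check
t = rng.normal(size=6)
w = rng.normal(size=3)

direct   = 0.5 * np.sum((X @ w - t) ** 2)                    # (1/2)||Xw - t||^2
expanded = 0.5 * (w @ X.T @ X @ w - 2 * t @ X @ w + t @ t)   # matrix form above

print(np.isclose(direct, expanded))   # True (up to floating-point error)
```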
Step 4: Take the Gradient $\nabla_w J(w)$
Differentiate term-by-term:
- The derivative of $w^\top X^\top X w$ with respect to $w$ is $2X^\top X w$ (using the rule for quadratic forms, since $X^\top X$ is symmetric).
- The derivative of $-2t^\top X w$ with respect to $w$ is $-2X^\top t$.
- The derivative of $t^\top t$ (a constant with respect to $w$) is $0$.
The $\frac{1}{2}$ factor outside cancels the 2's coming from the derivatives.
Thus:
$\nabla_w J(w) = X^\top X w - X^\top t$
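A common way to gain confidence in a derived gradient is a finite-difference check. The sketch below compares the analytic gradient from this step against numerical partial derivatives on randomly generated data (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))   # random data, purely for the check
t = rng.normal(size=8)
w = rng.normal(size=3)

def J(w):
    """Cost from Step 1: (1/2)||Xw - t||^2."""
    return 0.5 * np.sum((X @ w - t) ** 2)

analytic = X.T @ X @ w - X.T @ t   # gradient derived in this step

# Central finite-difference approximation of each partial derivative
eps = 1e-6
numeric = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```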
Step 5: Set the Gradient to Zero (Find the Minimum)
Setting the gradient to zero:
$X^\top X w - X^\top t = 0$
Rearranging:
$X^\top X w = X^\top t$
✅ This is called the normal equation.
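One way to read the normal equation is geometric: at the optimum, the residual $Xw - t$ is orthogonal to every column of $X$. The snippet below sketches this check on synthetic data (the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
t = rng.normal(size=10)

# Solve X^T X w = X^T t (Step 6 covers solving in more detail)
w = np.linalg.solve(X.T @ X, X.T @ t)

residual = X @ w - t
# At the optimum the residual is orthogonal to every column of X,
# which is exactly what X^T X w = X^T t says.
print(X.T @ residual)   # approximately [0, 0, 0]
```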
Step 6: Solve for w
Multiply both sides on the left by $(X^\top X)^{-1}$ (assuming the inverse exists; see the notes below):
$w^* = (X^\top X)^{-1} X^\top t$
✅ This gives the optimal weights.
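Here is a minimal sketch of computing $w^*$ on synthetic data (`true_w` and the noise level are made up). In practice the normal equation is usually solved as a linear system rather than by forming the explicit inverse, which is cheaper and more numerically stable:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
true_w = np.array([2.0, -1.0, 0.5, 3.0])      # hypothetical "true" weights
t = X @ true_w + 0.01 * rng.normal(size=50)   # targets with a little noise

# Closed-form solution: solve the normal equation X^T X w = X^T t directly
w_star = np.linalg.solve(X.T @ X, X.T @ t)

# Cross-check against NumPy's built-in least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w_star)                          # close to true_w
print(np.allclose(w_star, w_lstsq))    # True
```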
Final Steps Summary
- Write the cost function: $J(w) = \frac{1}{2}\|Xw - t\|^2$
- Expand the square: $w^\top X^\top X w - 2t^\top X w + t^\top t$
- Differentiate term-by-term:
  - $w^\top X^\top X w \to 2X^\top X w$
  - $-2t^\top X w \to -2X^\top t$
  - $t^\top t \to 0$
- Set the gradient to zero: $X^\top X w = X^\top t$
- Solve for $w$: $w^* = (X^\top X)^{-1} X^\top t$
Important Notes
✅ $X^\top X$ must be invertible, which requires $X$ to have full column rank.
✅ If $X^\top X$ is singular (not invertible), you can fix this by adding ridge (L2) regularization, which gives (see the sketch after these notes):
$w^* = (X^\top X + \lambda I)^{-1} X^\top t$
where $\lambda > 0$ is the regularization strength and $I$ is the identity matrix.
✅ Linear regression has this direct closed-form solution.
Other models, such as logistic regression or neural networks, require iterative optimization methods (e.g., gradient descent).
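A small sketch of the ridge-regularized solution, using a deliberately rank-deficient $X$ (a duplicated column) to illustrate the singular case described above; the data and $\lambda$ are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
# Build a rank-deficient X by duplicating a column, so X^T X is singular
# and the plain normal equation breaks down.
a = rng.normal(size=(20, 1))
X = np.hstack([a, a, rng.normal(size=(20, 1))])
t = rng.normal(size=20)

lam = 0.1          # regularization strength (made-up value)
d = X.shape[1]

# Ridge solution: (X^T X + lambda * I)^{-1} X^T t, computed via a linear solve
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)
print(w_ridge)
```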
Final Formula
The direct closed-form solution for linear regression is:
$w^* = (X^\top X)^{-1} X^\top t$