linear regression
1. geometry derivation
- Let \(A\) be \(m\times n\) and let \(b\) be \(m \times 1\) (so \(x\) is \(n \times 1\)).
- We want to find \(x\) such that the squared error \(\|Ax - b\|^2\) is minimized. If \(A\) has linearly independent columns, the minimizing \(Ax\) is the projection of \(b\) onto the column space of \(A\); in particular, we want the coordinates \(x\) of that projection.
- These are given by \(x = (A^TA)^{-1}A^Tb\) (see projection matrix); the projection itself is \(Ax\). See the numpy sketch below.
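A minimal numpy sketch of the geometry (numpy and the random \(A\), \(b\) here are my own illustration, not from the notes): compute \(x = (A^TA)^{-1}A^Tb\), check that the residual is orthogonal to the column space, and compare against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))   # m x n; columns are linearly independent w.p. 1
b = rng.standard_normal(m)        # m x 1 target

# Coordinates of the projection of b onto col(A): x = (A^T A)^{-1} A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)
proj = A @ x                      # the projection itself

print(np.allclose(A.T @ (b - proj), 0))                       # residual is orthogonal to col(A)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches least squares
```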
2. calculus derivation
- The squared error can be written as: \((Ax - b)^T(Ax - b)\).
- Expanding: \(x^TA^TAx - 2b^TAx + b^Tb\)
- Note: \(x^TA^Tb = (x^TA^Tb)^T = b^TAx\) because these are scalars (a \(1\times 1\) matrix equals its transpose), which is why the two cross terms combine into \(-2b^TAx\)
- Taking the vector derivative with respect to \(x\) and setting it equal to 0: \(2x^T(A^TA) - 2b^TA = 0\). Transposing gives the normal equations \(A^TAx = A^Tb\), so \(x = (A^TA)^{-1}A^Tb\) when \(A^TA\) is invertible
- Note: \(\frac{\partial x^TAx}{\partial x} = x^T(A + A^T)\). Why: writing \(x^TAx = \sum_{i,j} A_{ij} x_i x_j\), the partial with respect to \(x_k\) is \(\sum_j A_{kj}x_j + \sum_i A_{ik}x_i\), i.e. the \(k\)-th entry of \((A + A^T)x\); collected as a row vector this is \(x^T(A + A^T)\). With the symmetric matrix \(A^TA\) in place of \(A\), this gives the \(2x^TA^TA\) term above. Checked numerically in the sketch below
- See the Vivek Yadav blog post, the textbook, and the lecture notes
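A finite-difference check of the derivative identity above (a sketch with made-up \(A\) and \(x\); assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))   # a general (non-symmetric) square matrix
x = rng.standard_normal(n)

# Claimed gradient of f(x) = x^T A x, written as a row vector: x^T (A + A^T)
grad_claimed = x @ (A + A.T)

# Central finite differences for each partial derivative
f = lambda v: v @ A @ v
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(n)])

print(np.allclose(grad_claimed, grad_numeric, atol=1e-5))   # True
```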
3. relationship with MLE
- Assume the model \(b = Ax + \epsilon\), where the entries of \(\epsilon\) are i.i.d. Gaussian with variance \(\sigma^2\) (for a fitted line \(y = mx + c\), \(A\) has a column of ones for the intercept). Writing out the Gaussian likelihood, the log likelihood of the data is a constant minus \(\frac{1}{2\sigma^2}\sum_i ((Ax)_i - b_i)^2\).
- Maximizing the likelihood is therefore equivalent to minimizing the squared error; see the numeric sketch below
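A numeric sketch of the equivalence (assumes numpy and scipy; the data-generating values below are made up for illustration): minimize the Gaussian negative log likelihood and compare with the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m, n = 50, 2
A = rng.standard_normal((m, n))
x_true = np.array([1.5, -0.7])
sigma = 0.3
b = A @ x_true + sigma * rng.standard_normal(m)   # b = Ax + Gaussian noise

# Negative log likelihood under b_i ~ N((Ax)_i, sigma^2); the only x-dependent
# part is the sum of squared residuals, so the argmin is the least-squares x.
def neg_log_likelihood(x):
    r = A @ x - b
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

x_mle = minimize(neg_log_likelihood, np.zeros(n)).x
x_lsq = np.linalg.solve(A.T @ A, A.T @ b)

print(np.allclose(x_mle, x_lsq, atol=1e-4))   # True: MLE coincides with least squares
```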