linear regression
1. geometry derivation
- Let \(A\) be \(m\times n\) and let \(b\) be \(m \times 1\) (so \(x\) is \(n \times 1\)).
- We want to find \(x\) such that the squared error \(\|Ax - b\|^2\) is minimized. If \(A\) has linearly independent columns, the minimizing \(Ax\) is the projection of \(b\) onto the column space of \(A\); in particular, we want the coordinates \(x\) of that projection.
- These are given by \(x = (A^TA)^{-1}A^Tb\) (see projection matrix); the projection itself is \(Ax\). See the numpy sketch below.
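A minimal numpy sketch of the geometry (numpy and the random \(A\), \(b\) here are my own illustration, not from the notes): compute \(x = (A^TA)^{-1}A^Tb\), check that the residual is orthogonal to the column space, and compare against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))   # m x n; columns are linearly independent w.p. 1
b = rng.standard_normal(m)        # m x 1 target

# Coordinates of the projection of b onto col(A): x = (A^T A)^{-1} A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)
proj = A @ x                      # the projection itself

print(np.allclose(A.T @ (b - proj), 0))                       # residual is orthogonal to col(A)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches least squares
```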
2. calculus derivation
- The squared error can be written as: \((Ax - b)^T(Ax - b)\).
- Expanding: \(x^TA^TAx - 2b^TAx + b^Tb\)
- Note: \(x^TA^Tb = (x^TA^Tb)^T = b^TAx\) because these are scalars (a \(1\times 1\) matrix equals its transpose), which is why the two cross terms combine into \(-2b^TAx\)
- Taking the vector derivative with respect to \(x\) and setting it equal to 0: \(2x^T(A^TA) - 2b^TA = 0\). Transposing gives the normal equations \(A^TAx = A^Tb\), so \(x = (A^TA)^{-1}A^Tb\) when \(A^TA\) is invertible
- Note: \(\frac{\partial x^TAx}{\partial x} = x^T(A + A^T)\). Why: writing \(x^TAx = \sum_{i,j} A_{ij} x_i x_j\), the partial with respect to \(x_k\) is \(\sum_j A_{kj}x_j + \sum_i A_{ik}x_i\), i.e. the \(k\)-th entry of \((A + A^T)x\); collected as a row vector this is \(x^T(A + A^T)\). With the symmetric matrix \(A^TA\) in place of \(A\), this gives the \(2x^TA^TA\) term above. Checked numerically in the sketch below
- See the Vivek Yadav blog post, the textbook, and the lecture notes
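A finite-difference check of the derivative identity above (a sketch with made-up \(A\) and \(x\); assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))   # a general (non-symmetric) square matrix
x = rng.standard_normal(n)

# Claimed gradient of f(x) = x^T A x, written as a row vector: x^T (A + A^T)
grad_claimed = x @ (A + A.T)

# Central finite differences for each partial derivative
f = lambda v: v @ A @ v
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(n)])

print(np.allclose(grad_claimed, grad_numeric, atol=1e-5))   # True
```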
3. relationship with MLE
- Assume the model \(b = Ax + \epsilon\), where the entries of \(\epsilon\) are i.i.d. Gaussian with variance \(\sigma^2\) (for a fitted line \(y = mx + c\), \(A\) has a column of ones for the intercept). Writing out the Gaussian likelihood, the log likelihood of the data is a constant minus \(\frac{1}{2\sigma^2}\sum_i ((Ax)_i - b_i)^2\).
- Maximizing the likelihood is therefore equivalent to minimizing the squared error; see the numeric sketch below
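A numeric sketch of the equivalence (assumes numpy and scipy; the data-generating values below are made up for illustration): minimize the Gaussian negative log likelihood and compare with the least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m, n = 50, 2
A = rng.standard_normal((m, n))
x_true = np.array([1.5, -0.7])
sigma = 0.3
b = A @ x_true + sigma * rng.standard_normal(m)   # b = Ax + Gaussian noise

# Negative log likelihood under b_i ~ N((Ax)_i, sigma^2); the only x-dependent
# part is the sum of squared residuals, so the argmin is the least-squares x.
def neg_log_likelihood(x):
    r = A @ x - b
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

x_mle = minimize(neg_log_likelihood, np.zeros(n)).x
x_lsq = np.linalg.solve(A.T @ A, A.T @ b)

print(np.allclose(x_mle, x_lsq, atol=1e-4))   # True: MLE coincides with least squares
```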