Asta Li

Linear Regression with All the Steps

Common misconceptions, implicit assumptions, and missed steps in linear regression.

Linear regression is kind of a beautiful problem. I wanted to point out a few details that a lot of people miss and provide the derivation of the closed-form solution without skipping any steps.

Problem Set-up

We’re given a dataset of $N$ points $(x_i, y_i)$, where $x_i$ is a $D$-dimensional feature vector and $y_i$ is a scalar. In this setting, $y(x)$ is a random variable. We want to build a model such that, given some new input $x$, we can predict the mean output $\hat{y}(x)$.

[Figure: scatter plot of example data points]

Our model hypothesis is that $\hat{y}$ is linear in $w$, not necessarily in $x$ (a common misconception):

$$\hat{y}(x) = \sum_{j=1}^{D} w_j f_j(x) + w_0$$

where $w_j$ are elements of the weight vector $w$ and $f_j(x)$ is a (potentially non-linear) function applied to $x$. For the rest of this problem, we’ll assume $f(x) = x$ for simplicity. For the convenience of matrix operations, we’ll roll the bias term into $w$ by prepending a dimension to $x$ and defining:

$$f_0(x) \equiv 1, \quad x_0 = 1, \quad D \leftarrow D + 1$$

Note that we are making a number of assumptions about the data and errors here - see this great Stats StackExchange answer for more details.
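As a concrete illustration of rolling the bias into the weights, here is a minimal numpy sketch (the data and variable names are hypothetical, not from the original post):

```python
import numpy as np

# Hypothetical raw data: N = 5 examples, D = 2 features each.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])

# Prepend a column of ones so that w_0 acts as the bias term
# (x_0 = 1 for every example; D effectively becomes D + 1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (5, 3)
```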

Loss Function

We want to choose model parameters $w$ to fit the data well. To do this, we can minimize an error function that is larger when the model fits the data more poorly. We use the sum of squared errors as our loss function for a few reasons:

  • The resulting function is smooth and differentiable, which makes it easier to optimize
  • Overshooting by some amount is penalized the same as undershooting by that amount
  • The least-squares solution is the maximum likelihood estimator (MLE) if the noise is i.i.d. Gaussian, i.e. $y_i = w^T x_i + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$ (made explicit below)
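To make the last point explicit (a quick sketch under the i.i.d. Gaussian noise assumption), the log-likelihood of the targets is

$$\sum_{i=1}^{N} \log \mathcal{N}(y_i \mid w^T x_i, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - w^T x_i)^2 - \frac{N}{2} \log(2\pi\sigma^2)$$

The second term does not depend on $w$, so maximizing the likelihood over $w$ is the same as minimizing the sum of squared errors.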

Matrix Form

Before we dive into the solution, we’ll rewrite the problem formulation and loss function in matrix form, where $\hat{Y}$ is an $N \times 1$ vector, $X$ is an $N \times D$ matrix, and $w$ is a $D \times 1$ vector.

$$\hat{Y} = Xw$$

$$L = (\hat{Y} - Y)^T (\hat{Y} - Y) = (Xw - Y)^T (Xw - Y)$$
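As a sanity check, the matrix form of the loss is a one-liner in numpy (a minimal sketch; the variable names are my own):

```python
import numpy as np

def sse_loss(X, Y, w):
    """Sum of squared errors L = (Xw - Y)^T (Xw - Y)."""
    residual = X @ w - Y
    return (residual.T @ residual).item()

# Quick check against the elementwise form on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = rng.normal(size=(100, 1))
w = rng.normal(size=(3, 1))
print(np.isclose(sse_loss(X, Y, w), np.sum((X @ w - Y) ** 2)))  # True
```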

Closed Form Solution

Now we can solve for the $w$ that minimizes our loss function by taking the first derivative of the loss function $L$ with respect to $w$ and setting it equal to zero. This yields a global minimum because the least-squares loss is convex in $w$: its Hessian is $2X^TX$, which is positive semi-definite.

$$\frac{\partial L}{\partial w} = \frac{\partial}{\partial w} (Xw - Y)^T (Xw - Y) = 0$$

Before taking the derivative, we can expand out the product.

$$\frac{\partial}{\partial w} \left( w^T X^T X w - w^T X^T Y - Y^T X w + Y^T Y \right) = 0$$

Recall our matrix derivative properties (proofs):

$$\frac{\partial}{\partial x} (x^T A x) = (A + A^T)x \quad (1)$$

$$\frac{\partial}{\partial x} (x^T A) = A \quad (2)$$

$$\frac{\partial}{\partial x} (A x) = A^T \quad (3)$$

Note that $(A + A^T)x = 2Ax$ if $A$ is symmetric.
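If you want to convince yourself of property (1) without working through the proof, a finite-difference check is a quick sanity test (my own sketch, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)
eps = 1e-6

# Analytic gradient of x^T A x from property (1).
analytic = (A + A.T) @ x

# Central finite differences, one coordinate at a time.
numeric = np.zeros(4)
for i in range(4):
    e = np.zeros(4)
    e[i] = eps
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (rounding error only)
```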

Take the derivative. For the first term, use property 1 where $A = X^TX$ is symmetric (proof). For the second term, use property 2 where $A = X^TY$. For the third term, use property 3 where $A = Y^TX$.

$$2X^TXw - X^TY - (Y^TX)^T = 0$$

$$2X^TXw - 2X^TY = 0$$

$$w = (X^TX)^{-1}X^TY$$

The solution $w = (X^TX)^{-1}X^TY$ comes from solving what is known as the normal equation, $X^TXw = X^TY$.
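Putting it together, here is a small numpy sketch of the closed-form solution (my own illustration with synthetic data; in practice, solving the linear system is preferred over explicitly forming the inverse):

```python
import numpy as np

def fit_linear_regression(X, Y):
    """Solve the normal equation X^T X w = X^T Y for w."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Recover known weights from noisy synthetic data.
rng = np.random.default_rng(2)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # bias column first
w_true = np.array([[0.5], [2.0], [-1.0]])
Y = X @ w_true + 0.1 * rng.normal(size=(N, 1))

w_hat = fit_linear_regression(X, Y)
print(w_hat.ravel())  # close to [0.5, 2.0, -1.0]
```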

Discussion

Note that $X^TX$ is not invertible when it is singular, i.e. its determinant is zero. This happens if:

  • There are too few examples for the number of features: $N < D$
  • The columns of $X$ are linearly dependent, i.e. there are redundant features
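As a quick illustration of the redundant-feature case (a hypothetical example of my own), a duplicated column makes $X^TX$ rank-deficient; the pseudoinverse (computed via the SVD) still returns the minimum-norm least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
x1 = rng.normal(size=(N, 1))
X = np.hstack([np.ones((N, 1)), x1, 2 * x1])   # third column is redundant
Y = 3 * x1 + 0.1 * rng.normal(size=(N, 1))

print(np.linalg.matrix_rank(X.T @ X))  # 2 < 3, so X^T X is singular

# np.linalg.solve(X.T @ X, X.T @ Y) would fail (or be meaningless) here;
# the pseudoinverse picks the minimum-norm solution among the infinitely
# many weight vectors that fit the data equally well.
w_hat = np.linalg.pinv(X) @ Y
print(w_hat.ravel())
```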

Here I focus on the closed-form Ordinary Least Squares solution, but it’s also important to understand the MLE, SVD, and gradient descent approaches. MLWiki has a great compact summary.
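For completeness, here is a minimal gradient-descent sketch for the same loss (my own illustration; the learning rate and step count are arbitrary, not tuned values from the post):

```python
import numpy as np

def fit_gradient_descent(X, Y, lr=0.1, steps=2000):
    """Minimize the SSE loss iteratively; the gradient is 2 X^T (Xw - Y)."""
    w = np.zeros((X.shape[1], 1))
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - Y) / X.shape[0]  # averaged over N examples
        w -= lr * grad
    return w

# With a well-conditioned X, this converges to the same w as the closed form.
```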

Further Reading

I really like the book Numerical Optimization by Jorge Nocedal and Stephen Wright. I recommend reading at least Chapter 2: Fundamentals of Unconstrained Optimization.

© 2021 by Asta Li. All rights reserved.