$$ \newcommand{\qed}{\tag*{$\square$}} \newcommand{\span}{\operatorname{span}} \newcommand{\dim}{\operatorname{dim}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\norm}[1]{\|#1\|} \newcommand{\grad}{\nabla} \newcommand{\prox}[1]{\operatorname{prox}_{#1}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\mat}[1]{\mathcal{M}[#1]} \newcommand{\null}[1]{\operatorname{null} \left(#1\right)} \newcommand{\range}[1]{\operatorname{range} \left(#1\right)} \newcommand{\rowvec}[1]{\begin{bmatrix} #1 \end{bmatrix}^T} \newcommand{\Reals}{\mathbf{R}} \newcommand{\RR}{\mathbf{R}} \newcommand{\Complex}{\mathbf{C}} \newcommand{\Field}{\mathbf{F}} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}[1]{\operatorname{E}[#1]} \newcommand{\Var}[1]{\operatorname{Var}[#1]} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
This section treats the problem of least squares. We are given a matrix $A \in \Reals^{m \times n}$ and a vector $b \in \Reals^m$, and our goal is to find a vector $x \in \Reals^n$ that minimizes $\norm{Ax - b}$. This problem has “squares” in its name because it is often posed as the equivalent problem of minimizing $\norm{Ax - b}^2$.
In our study of orthogonal projections, we saw that if $m \geq n$ and $A$ is full rank, then this problem has a unique solution $x = A^\dagger b$, where $A^\dagger = (A^TA)^{-1}A^T$ is the pseudoinverse of $A$. The vector in $\range{A}$ that achieves the minimum distance to $b$ is $QQ^Tb$, where $Q$ is the familiar matrix with orthonormal columns from $A$'s QR factorization (its columns form an orthonormal basis for $\range{A}$).
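A quick numerical sketch (mine, not part of the original notes) makes the full-rank case concrete: it compares the formula $(A^TA)^{-1}A^Tb$ against NumPy's least squares routine and against the projection $QQ^Tb$. The sizes and random seed below are arbitrary choices for illustration.

```python
import numpy as np

# A small full-rank instance; the sizes and seed are arbitrary choices.
rng = np.random.default_rng(0)
m, n = 10, 4
A = rng.standard_normal((m, n))   # full column rank with probability one
b = rng.standard_normal(m)

# Pseudoinverse formula for the full-rank, overdetermined case.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Agrees with NumPy's least squares solver.
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x, x_ref)

# The closest point in range(A) to b is the projection Q Q^T b.
Q = np.linalg.qr(A)[0]            # reduced QR: Q has orthonormal columns
assert np.allclose(A @ x, Q @ Q.T @ b)
```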
Regardless of whether $A$ is full rank, and regardless of whether $m \geq n$, the vector in $\range{A}$ achieving minimum distance to $b$ is the orthogonal projection of $b$ onto $\range{A}$. This means that if $Q$ is a matrix with orthonormal columns spanning the range of $A$, then $QQ^Tb$ minimizes $\norm{z - b}$ over all $z \in \range{A}$; this was proved in the previous section. When $A$ has a non-trivial nullspace, however, the least squares problem has infinitely many solutions.
Consider the general least squares problem, with $A \in \Reals^{m \times n}$ and $b \in \Reals^m$ the problem data. Let $Q$ be a matrix with orthonormal columns spanning the range of $A$; such a matrix can be obtained by applying the modified Gram-Schmidt algorithm to the columns of $A$. Let $x^\star$ be a vector in $\Reals^n$ such that $Ax^\star = QQ^Tb$ (such a vector must exist, since $QQ^T$ is the orthogonal projection onto $\range{A}$, which means $QQ^Tb \in \range{A}$). Then the solution set of the least squares problem is

$$X = \{x^\star + v \mid v \in \null{A}\}.$$
The set $X$ always has at least one member, namely $x^\star$. If $A$ has a nontrivial nullspace (i.e., if there is a vector other than $0$ in the null space of $A$), then $X$ has infinitely many members.
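The following sketch (again my own, with an arbitrary rank-deficient instance) illustrates the solution set: it computes one particular solution $x^\star$ and checks that adding any nullspace vector leaves the residual unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 5, 3                  # rank r < n, so null(A) is nontrivial
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
b = rng.standard_normal(m)

# Orthonormal basis for range(A) from the SVD (Gram-Schmidt would also work).
U, s, Vt = np.linalg.svd(A)
Q = U[:, :r]

# A particular solution x_star with A x_star = Q Q^T b.
x_star = np.linalg.lstsq(A, Q @ Q.T @ b, rcond=None)[0]
assert np.allclose(A @ x_star, Q @ Q.T @ b)

# Adding any vector in null(A) gives another solution with the same residual.
v = Vt[r:].T @ rng.standard_normal(n - r)
assert np.allclose(A @ v, 0)
assert np.isclose(np.linalg.norm(A @ (x_star + v) - b),
                  np.linalg.norm(A @ x_star - b))
```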
We can also characterize the residual vector $r = Ax^\star - b$, where $x^\star$ is any particular solution to the least squares problem. Let $Q$ be a matrix whose columns are an orthonormal basis for $\range{A}$ and $\tilde{Q}$ a matrix whose columns are an orthonormal basis for $\range{A}^\perp$. The residual vector is

$$r = Ax^\star - b = QQ^Tb - b = -(I - QQ^T)b = -\tilde{Q}\tilde{Q}^Tb,$$
where the last equality comes from the fact that $I - QQ^T$ is the complementary projection of $QQ^T$, i.e., it is the projection onto $\range{A}^\perp$. Notice that we say the residual vector, because the residual vector is unique; while the least squares problem might have infinitely many solutions, there is a unique vector in $\range{A}$ that is closest to $b$. The optimal squared residual is therefore $\norm{r}^2 = \norm{\tilde{Q}\tilde{Q}^Tb}^2$. Although $\tilde{Q}$ has orthonormal columns, it is not an orthogonal matrix (because it is not square), so we cannot further simplify the righthand side by replacing $\tilde{Q}\tilde{Q}^T$ with the identity.
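As a sanity check on the residual formula (my own sketch; the instance and the SVD-based choice of bases are arbitrary), one can verify numerically that $Ax^\star - b$ equals $-\tilde{Q}\tilde{Q}^Tb$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 8, 5, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank r
b = rng.standard_normal(m)

# Orthonormal bases for range(A) and its orthogonal complement.
U = np.linalg.svd(A)[0]
Q, Q_tilde = U[:, :r], U[:, r:]

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
residual = A @ x_star - b

# The residual is minus the projection of b onto range(A)^perp.
assert np.allclose(residual, -Q_tilde @ Q_tilde.T @ b)
print("optimal squared residual:", np.linalg.norm(residual) ** 2)
```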
Suppose we want to estimate a vector $x \in \Reals^n$. We cannot measure $x$ directly; instead, we observe a vector $y \in \Reals^m$, which we know to be a function of $x$: $y = Ax + z$. The matrix $A \in \Reals^{m \times n}$, with $n < m$ and linearly independent columns, represents a sensor for $x$, and the vector $z$ is a random variable representing measurement noise with mean $0$ and variance $\sigma^2 I$. We seek an unbiased estimate of $x$ that is a linear function of $y$.
One such estimate is $\hat{x} = A^\dagger y$, since $\E{A^\dagger y} = A^\dagger Ax + A^\dagger \E{z} = x$. It turns out that $A^\dagger y$ has lower variance than any other linear unbiased estimator of $x$; in this sense, $A^\dagger y$ is the best linear unbiased estimator of $x$. To see this, let $By$ be some other unbiased estimator of $x$, and note that $BA = I$ (unbiasedness means $\E{By} = BAx = x$ for every $x$, which forces $BA = I$). We will show that $BB^T - A^\dagger (A^\dagger)^T$ is positive semidefinite.
The variance of $By$ is

$$\Var{By} = \E{(By - x)(By - x)^T} = \E{Bzz^TB^T} = \sigma^2 BB^T,$$

and in particular the variance of $A^\dagger y$ is $\sigma^2 A^\dagger (A^\dagger)^T$.
Because $AA^\dagger$ is an orthogonal projection (onto $\range{A}$), $\norm{AA^\dagger w} \leq \norm{w}$ for all $w$, and $AA^\dagger B^T = A(A^TA)^{-1}A^TB^T = A(A^TA)^{-1}(BA)^T = (A^\dagger)^T$. Hence

$$w^TA^\dagger (A^\dagger)^Tw = \norm{(A^\dagger)^Tw}^2 = \norm{AA^\dagger B^Tw}^2 \leq \norm{B^Tw}^2 = w^TBB^Tw \quad \text{for all } w.$$
This implies that $BB^T - A^\dagger(A^\dagger)^T$ is positive semidefinite.
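To see the comparison concretely, the sketch below (an illustration under the assumptions above; the particular alternative left inverse $B$ is constructed arbitrarily) checks numerically that $BB^T - A^\dagger(A^\dagger)^T$ is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 4
A = rng.standard_normal((m, n))       # n < m, full column rank
A_dag = np.linalg.pinv(A)             # equals (A^T A)^{-1} A^T here

# Any other unbiased linear estimator B y must satisfy B A = I.
# Construct one by adding a matrix C with C A = 0 to A_dag.
C = rng.standard_normal((n, m))
C = C - C @ (A @ A_dag)               # A A_dag projects onto range(A), so C A = 0
B = A_dag + C
assert np.allclose(B @ A, np.eye(n))

# The difference of (scaled) variances is positive semidefinite.
D = B @ B.T - A_dag @ A_dag.T
assert np.linalg.eigvalsh((D + D.T) / 2).min() >= -1e-10
```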
The vector $A^\dagger b$ gives the least-norm solution of the general least squares problem (in which $A$ may be singular), where $A^\dagger$ is the Moore-Penrose inverse (see the notes on the SVD). To see why this is the case, let $A = U\Sigma V^T$ be an SVD of $A$, so that $A^\dagger = V\Sigma^\dagger U^T$ and $A^\dagger b = V\Sigma^\dagger U^Tb$ (the matrix $\Sigma^\dagger$ is obtained by transposing $\Sigma$ and inverting the non-zero entries). If $A$ is diagonal, then it is clear that the least-norm minimizer of $\norm{Ax - b}$ is $A^\dagger b$: when $A_{ii} \neq 0$, setting $x_i = b_i / A_{ii}$ zeroes the $i$th residual term, and when $A_{ii} = 0$, $x_i$ does not affect the residual, so the least-norm choice is $x_i = 0$. Otherwise, note that
$$\norm{Ax - b} = \norm{U\Sigma V^Tx - b} = \norm{\Sigma V^Tx - U^Tb},$$
since $U^T$ is an isometry. Let $z = V^Tx$. Note that $\norm{z} = \norm{x}$ and that $V^T$ is invertible, so $x$ is the least-norm minimizer of $\norm{Ax - b}$ if and only if $z = V^Tx$ is the least-norm minimizer of $\norm{\Sigma z - U^Tb}$. Since $\Sigma$ is diagonal, by our previous argument we conclude that the least-norm minimizer of $\norm{\Sigma z - U^Tb}$ is $z = \Sigma^\dagger U^Tb$. Therefore, the least-norm minimizer of $\norm{Ax - b}$ is $x = Vz = V\Sigma^\dagger U^Tb = A^\dagger b$.
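The SVD argument can also be checked numerically; in the sketch below (mine, with an arbitrary rank-deficient instance), the explicit construction $V\Sigma^\dagger U^Tb$ matches `numpy.linalg.pinv`, and adding a nullspace component preserves the residual while increasing the norm.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 8, 5, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank r < n
b = rng.standard_normal(m)

# Thin SVD and the explicit Moore-Penrose construction V Sigma^dagger U^T b.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_dag = np.array([1.0 / si if si > 1e-10 else 0.0 for si in s])
x_ln = Vt.T @ (s_dag * (U.T @ b))
assert np.allclose(x_ln, np.linalg.pinv(A) @ b)

# Any other solution x_ln + v (v in null(A)) has the same residual, larger norm.
v = Vt[r:].T @ rng.standard_normal(n - r)
assert np.isclose(np.linalg.norm(A @ (x_ln + v) - b),
                  np.linalg.norm(A @ x_ln - b))
assert np.linalg.norm(x_ln + v) >= np.linalg.norm(x_ln)
```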