Lecture Notes

15. Singular Value Decomposition

Part of the Series on Linear Algebra.

By Akshay Agrawal. Last updated Dec. 20, 2018.


One of the central aims of linear algebra is to find bases with respect to which the matrices of linear maps are simple. We saw in a previous section that every self-adjoint linear map has a diagonal matrix with respect to an orthonormal basis consisting of eigenvectors. In this section, we will see that every linear map $A : \mathbf{R}^n \to \mathbf{R}^m$ has a diagonal matrix with respect to orthonormal bases $(v_1, \ldots, v_n)$ and $(u_1, \ldots, u_m)$ for the domain and range of $A$, respectively. The expression of the matrix of $A$ with respect to these bases is called the singular value decomposition (SVD) of $A$; it is analogous to the eigenvector decomposition for symmetric matrices. The upshot is this: in terms of the bases $(v_1, \ldots, v_n)$ and $(u_1, \ldots, u_m)$, the linear map simply dilates some components of the input and shrinks others, while possibly truncating components or appending zeros to account for passing from $\mathbf{R}^n$ to $\mathbf{R}^m$ when $m \neq n$.

The SVD

Let $A \in \mathbf{R}^{m \times n}$ be any real matrix. There exist orthogonal matrices $U \in \mathbf{R}^{m \times m}$ and $V \in \mathbf{R}^{n \times n}$ and a diagonal matrix $\Sigma \in \mathbf{R}^{m \times n}$ such that $A = U \Sigma V^T$. It is conventional to arrange the diagonal entries $\sigma_1, \ldots, \sigma_p$ of $\Sigma$, which are nonnegative, in decreasing order $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0$, where $p = \min\{m, n\}$. The positive $\sigma_i$ are called the singular values of $A$, and the columns of $U$ and $V$ are its left and right singular vectors. The expression $A = U \Sigma V^T$ is called the SVD of $A$. This means that with respect to the orthonormal bases $(v_1, \ldots, v_n)$ and $(u_1, \ldots, u_m)$ of $\mathbf{R}^n$ and $\mathbf{R}^m$ respectively, the linear map $x \mapsto Ax$ has a diagonal matrix, namely, $\Sigma$. Geometrically, the SVD shows that a linear map first rotates its input by $V^T$, scales each axis $i$ by $\sigma_i$, appends zeros (if $m > n$) or truncates (if $m < n$) to obtain an $m$-vector, and finally rotates the result by $U$.
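
As a quick numerical illustration (not part of the original notes), the following NumPy sketch computes a full SVD of a small random matrix and checks the factorization and the orthogonality of $U$ and $V$; the dimensions and variable names are arbitrary.

    import numpy as np

    # A small rectangular matrix (m = 4 rows, n = 3 columns).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))

    # Full SVD: U is 4x4 orthogonal, Vt is 3x3 orthogonal, s holds the
    # singular values in decreasing order.
    U, s, Vt = np.linalg.svd(A, full_matrices=True)

    # Embed the singular values in a 4x3 diagonal matrix Sigma.
    Sigma = np.zeros((4, 3))
    Sigma[:3, :3] = np.diag(s)

    # Check A = U Sigma V^T and the orthogonality of U and V.
    assert np.allclose(A, U @ Sigma @ Vt)
    assert np.allclose(U.T @ U, np.eye(4))
    assert np.allclose(Vt @ Vt.T, np.eye(3))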

If we drop the zero rows and columns of $\Sigma$, along with the corresponding columns $u_{r+1}, \ldots, u_m$ of $U$ and $v_{r+1}, \ldots, v_n$ of $V$ (where $r$ is the rank of $A$), we obtain what is called the thin SVD, $A = U_1 \Sigma_1 V_1^T$, where $U_1 \in \mathbf{R}^{m \times r}$, $\Sigma_1 \in \mathbf{R}^{r \times r}$, and $V_1 \in \mathbf{R}^{n \times r}$. In block diagram form, the thin SVD is

    $A = \underbrace{U_1}_{m \times r} \, \underbrace{\Sigma_1}_{r \times r} \, \underbrace{V_1^T}_{r \times n}.$

Note that $U_1$ and $V_1$ have orthonormal columns, so $U_1^T U_1 = I$ and $V_1^T V_1 = I$; but for $r < m$, $U_1 U_1^T \neq I$, and for $r < n$, $V_1 V_1^T \neq I$.
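
A minimal sketch of the thin SVD, assuming NumPy's economy-size option full_matrices=False and an arbitrary tolerance of 1e-10 for deciding which singular values are zero:

    import numpy as np

    rng = np.random.default_rng(0)
    # A rank-2 matrix with m = 5, n = 3.
    A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 3))

    # Economy-size SVD: U1 is 5x3, s has 3 entries, V1t is 3x3.
    U1, s, V1t = np.linalg.svd(A, full_matrices=False)

    # Keep only the r columns/rows associated with nonzero singular values.
    r = int(np.sum(s > 1e-10))
    U1, s1, V1t = U1[:, :r], s[:r], V1t[:r, :]

    # Thin SVD: A = U1 diag(s1) V1^T with U1 (5 x r) and V1 (3 x r).
    assert np.allclose(A, U1 @ np.diag(s1) @ V1t)
    # U1 has orthonormal columns, but U1 U1^T is not the identity.
    assert np.allclose(U1.T @ U1, np.eye(r))
    assert not np.allclose(U1 @ U1.T, np.eye(5))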

Construction

The SVD is closely related to the eigendecomposition of a symmetric matrix. To construct an SVD of a matrix $A \in \mathbf{R}^{m \times n}$ of rank $r$, we need an orthonormal basis $(v_1, \ldots, v_n)$ of $\mathbf{R}^n$ and an orthonormal basis $(u_1, \ldots, u_m)$ of $\mathbf{R}^m$ such that $A v_i = \sigma_i u_i$ for $i = 1, \ldots, p$. We will take $(v_1, \ldots, v_n)$ to be an orthonormal basis for $\mathbf{R}^n$ consisting of eigenvectors of $A^T A$, and we will take $u_1, \ldots, u_r$ to be normalized images of these vectors under $A$.

Let $A^T A = V \Lambda V^T$ be the eigendecomposition of $A^T A$, so that the columns $v_1, \ldots, v_n$ of $V$ are orthonormal eigenvectors of $A^T A$ and the diagonal entries $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ of $\Lambda$ are arranged in decreasing order. Clearly $(v_1, \ldots, v_n)$ is an orthonormal basis for $\mathbf{R}^n$. Notice that

    $(A v_i)^T (A v_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0, \quad i \neq j,$

so $(A v_1, \ldots, A v_n)$ is a list of mutually orthogonal vectors; moreover, the nonzero vectors in this list are a basis for the range of $A$. Since $A^T A$ is positive semidefinite, the $\lambda_i$ are all nonnegative; since $A$ has rank $r$ (and since the rank of a matrix equals the rank of its corresponding Gram matrix), $\lambda_1 \geq \cdots \geq \lambda_r > 0$ and $\lambda_{r+1} = \cdots = \lambda_n = 0$. Also, $A v_i$ is nonzero for $i \leq r$ and zero otherwise, since $\|A v_i\|_2^2 = v_i^T A^T A v_i = \lambda_i$. Hence, $(A v_1, \ldots, A v_r)$ is a basis for the range of $A$. Define

    $\sigma_i = \sqrt{\lambda_i}, \qquad u_i = \frac{A v_i}{\sigma_i}, \qquad i = 1, \ldots, r,$

so that $(u_1, \ldots, u_r)$ is an orthonormal list of vectors spanning $\mathrm{range}(A)$. If $r < m$, we can extend this list to an orthonormal basis $(u_1, \ldots, u_m)$ of $\mathbf{R}^m$. Forming the matrix $U$ with columns $u_1, \ldots, u_m$ and taking $\Sigma \in \mathbf{R}^{m \times n}$ to be the diagonal matrix with diagonal entries $\sigma_1, \ldots, \sigma_r, 0, \ldots, 0$ gives an SVD $A = U \Sigma V^T$ of $A$.
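
The construction can be mirrored numerically. The sketch below is illustrative only and assumes $A$ has full column rank, so that every $\lambda_i$ is positive and no basis extension is needed:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 5, 3
    A = rng.standard_normal((m, n))   # full column rank with probability 1

    # Eigendecomposition of the Gram matrix A^T A (symmetric PSD).
    lam, V = np.linalg.eigh(A.T @ A)
    # eigh returns eigenvalues in increasing order; flip to decreasing.
    lam, V = lam[::-1], V[:, ::-1]

    # Singular values are the square roots of the eigenvalues.
    sigma = np.sqrt(np.clip(lam, 0, None))

    # Left singular vectors are the normalized images A v_i / sigma_i.
    U1 = (A @ V) / sigma              # divides column i by sigma_i

    # Thin SVD assembled from the construction.
    assert np.allclose(A, U1 @ np.diag(sigma) @ V.T)
    assert np.allclose(U1.T @ U1, np.eye(n))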

Relationship to eigenvectors of $A^T A$ and $A A^T$

Because $A^T A = V \Sigma^T \Sigma V^T$, the right singular vectors $v_i$ are orthonormal eigenvectors of $A^T A$, with eigenvalues $\sigma_i^2$.

Similarly, since $A A^T = U \Sigma \Sigma^T U^T$, the left singular vectors $u_i$ are orthonormal eigenvectors of $A A^T$, with eigenvalues $\sigma_i^2$.
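
A numerical spot-check of these identities (illustrative, with an arbitrary random matrix):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 3))
    U, s, Vt = np.linalg.svd(A)

    # Eigenvalues of A^T A are the squared singular values...
    evals_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
    assert np.allclose(evals_AtA, s**2)

    # ...and so are the nonzero eigenvalues of A A^T.
    evals_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
    assert np.allclose(evals_AAt[:3], s**2)
    assert np.allclose(evals_AAt[3:], 0)

    # The right singular vectors diagonalize A^T A.
    assert np.allclose(Vt @ (A.T @ A) @ Vt.T, np.diag(s**2))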

Subspaces spanned by the singular vectors

Since $(v_1, \ldots, v_n)$ is a basis for $\mathbf{R}^n$, $A v_i = \sigma_i u_i$ for $i \leq r$ (where $\sigma_i > 0$), and $A v_i = 0$ for $i > r$, $(u_1, \ldots, u_r)$ is a basis of $\mathrm{range}(A)$. Because $(v_{r+1}, \ldots, v_n)$ is an orthonormal list of $n - r$ vectors lying in the null space of $A$, which has dimension $n - r$, it is a basis for $\mathrm{null}(A)$.

Similarly, $(v_1, \ldots, v_r)$ is a basis of $\mathrm{range}(A^T)$ and $(u_{r+1}, \ldots, u_m)$ is a basis of $\mathrm{null}(A^T)$.
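
These subspace identifications can be verified numerically; the sketch below uses an arbitrary rank-2 matrix and an arbitrary tolerance of 1e-10 for the numerical rank.

    import numpy as np

    rng = np.random.default_rng(3)
    # A 5 x 4 matrix of rank 2.
    A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > 1e-10))        # numerical rank

    # v_{r+1}, ..., v_n span the null space of A: A v_i = 0.
    assert np.allclose(A @ Vt[r:, :].T, 0)
    # u_1, ..., u_r span the range of A: projecting each column of A
    # onto their span leaves A unchanged.
    Ur = U[:, :r]
    assert np.allclose(Ur @ (Ur.T @ A), A)
    # u_{r+1}, ..., u_m span the null space of A^T.
    assert np.allclose(A.T @ U[:, r:], 0)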

Image of the unit ball

The SVD provides a compelling geometric description of linear transformations: every linear transformation from $\mathbf{R}^n$ to $\mathbf{R}^m$ deforms the unit ball in $\mathbf{R}^n$ into an ellipsoid in $\mathbf{R}^m$.

The unit ball in $\mathbf{R}^n$ can be represented as the set of linear combinations of $v_1, \ldots, v_n$ whose coefficient vectors have norm at most one, i.e.,

    $B = \{ x_1 v_1 + \cdots + x_n v_n : \|x\|_2 \leq 1 \}$

(since $\|x_1 v_1 + \cdots + x_n v_n\|_2 = \|x\|_2$ because the $v_i$ are orthonormal). The image of a vector $x_1 v_1 + \cdots + x_n v_n$ is $\sigma_1 x_1 u_1 + \cdots + \sigma_r x_r u_r$. Define $y_i = \sigma_i x_i$ for $i = 1, \ldots, r$. Then the image of $B$ under $A$ contains all vectors of the form $y_1 u_1 + \cdots + y_r u_r$ such that

    $\frac{y_1^2}{\sigma_1^2} + \cdots + \frac{y_r^2}{\sigma_r^2} \leq 1.$

That is, the image of $B$ under $A$ is an ellipsoid with semi-axes $u_1, \ldots, u_r$ of lengths $\sigma_1, \ldots, \sigma_r$, respectively. Hence, every linear transformation collapses $n - r$ dimensions, deforms the unit ball in $\mathbf{R}^n$ into an ellipsoid, and embeds the ellipsoid into $\mathbf{R}^m$. This characterization also shows that $\|A\|_2 = \max_{\|x\|_2 \leq 1} \|Ax\|_2 = \sigma_1$ (which is achieved by $x = v_1$).
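
The claim that $\|A\|_2 = \sigma_1$, attained at $v_1$, can be checked by sampling; this sketch is illustrative only, and the sample size is arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((3, 2))
    U, s, Vt = np.linalg.svd(A)

    # Sample many unit vectors x and record ||Ax||; the maximum should
    # approach sigma_1, attained (up to sampling error) near x = v_1.
    X = rng.standard_normal((2, 10000))
    X /= np.linalg.norm(X, axis=0)      # columns are unit vectors
    norms = np.linalg.norm(A @ X, axis=0)

    print("sigma_1            :", s[0])
    print("max ||Ax|| sampled :", norms.max())
    print("||A v_1||          :", np.linalg.norm(A @ Vt[0]))   # equals sigma_1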

As a sum of outer products

Any matrix product $BC$ can be expressed as a sum of outer products $\sum_i b_i c_i^T$, where the $b_i$ are the columns of $B$ and the $c_i^T$ are the rows of $C$. Since

    $A = U \Sigma V^T = (U \Sigma) V^T,$

and the $i$th column of $U \Sigma$ is $\sigma_i u_i$ while the $i$th row of $V^T$ is $v_i^T$, the matrix $A$ can be expressed as $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$.

Since $Ax = \sum_{i=1}^{r} \sigma_i u_i (v_i^T x)$, $Ax$ expands in the basis $(u_1, \ldots, u_m)$ with components $\sigma_i (v_i^T x)$. The scalars $v_i^T x$ are just the components of $x$ in the orthonormal basis $(v_1, \ldots, v_n)$, so this representation makes clear that $A$ just scales the components of $x$ when passing from the basis $(v_1, \ldots, v_n)$ to the basis $(u_1, \ldots, u_m)$.
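
A short check of the outer-product expansion and of the change-of-basis interpretation (illustrative, with arbitrary dimensions):

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.standard_normal((4, 3))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Rebuild A one outer product at a time: A = sum_i sigma_i u_i v_i^T.
    A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
    assert np.allclose(A, A_rebuilt)

    # Components of Ax in the u-basis are sigma_i times the components
    # of x in the v-basis.
    x = rng.standard_normal(3)
    assert np.allclose(U.T @ (A @ x), s * (Vt @ x))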

Moore-Penrose inverse

The matrix $A^{\dagger} = V \Sigma^{\dagger} U^T$, where $\Sigma^{\dagger}$ is obtained by transposing $\Sigma$ and inverting its nonzero entries, is called the Moore-Penrose inverse of $A$. When $A$ is tall and full-rank, $A^{\dagger} = (A^T A)^{-1} A^T$; when it is short and full-rank, it equals $A^T (A A^T)^{-1}$. The Moore-Penrose inverse can be used to obtain minimum-norm solutions to least-squares problems, as we will see in a subsequent section.
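
The sketch below forms the Moore-Penrose inverse from the SVD and, under the assumption that $A$ is tall with full column rank, compares it against the formula $(A^T A)^{-1} A^T$ and NumPy's built-in pinv:

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.standard_normal((5, 3))       # tall, full column rank (a.s.)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Invert the (all nonzero, by assumption) singular values.
    A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

    # Agrees with NumPy's pinv and, for tall full-rank A, with (A^T A)^{-1} A^T.
    assert np.allclose(A_pinv, np.linalg.pinv(A))
    assert np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T)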

Error analysis

Suppose that $A \in \mathbf{R}^{n \times n}$ is invertible, and that we seek $x$ such that $Ax = y$. Of course, $x = A^{-1} y$. But suppose that our measurement of $y$ is corrupted so that we observe $y + \delta y$, so we compute $x + \delta x = A^{-1}(y + \delta y)$ with $\delta x = A^{-1} \delta y$ ($\delta x$ and $\delta y$ should be interpreted as infinitesimals). Then the error in $x$, $\|\delta x\|_2$, is at most $\|A^{-1}\|_2 \|\delta y\|_2 = \|\delta y\|_2 / \sigma_n$. Since $\|y\|_2 = \|Ax\|_2 \leq \sigma_1 \|x\|_2$, it is also the case that $1 / \|x\|_2 \leq \sigma_1 / \|y\|_2$. Then

    $\frac{\|\delta x\|_2}{\|x\|_2} \leq \frac{\sigma_1}{\sigma_n} \cdot \frac{\|\delta y\|_2}{\|y\|_2},$

or equivalently

    $\frac{\|\delta x\|_2 / \|x\|_2}{\|\delta y\|_2 / \|y\|_2} \leq \frac{\sigma_1}{\sigma_n}.$

The number $\sigma_1 / \sigma_n$ is called the condition number of $A$; it is denoted $\kappa(A)$, and it is equal to $\|A\|_2 \|A^{-1}\|_2$. For nonsquare matrices, the condition number is defined in terms of the pseudoinverse: $\kappa(A) = \|A\|_2 \|A^{\dagger}\|_2$.

If $\kappa(A)$ is small, then we say that $A$ is well-conditioned; otherwise, we say that it is ill-conditioned. When the condition number of a matrix is large, small perturbations in the measurement of $y$ can lead to large perturbations in the estimate of $x$. If $\kappa(A)$ is extremely large, then $A$ should be treated as singular for practical purposes, even if it is invertible.
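
To make the amplification concrete, the following sketch builds an ill-conditioned matrix with singular values 1 and 1e-6 (both chosen arbitrarily) and perturbs $y$ along the left singular vector associated with the smallest singular value:

    import numpy as np

    # An ill-conditioned 2x2 matrix with singular values 1 and 1e-6.
    U, _ = np.linalg.qr(np.array([[1.0, 1.0], [1.0, -1.0]]))
    V, _ = np.linalg.qr(np.array([[2.0, 1.0], [1.0, 3.0]]))
    A = U @ np.diag([1.0, 1e-6]) @ V.T

    kappa = np.linalg.cond(A)             # equals sigma_1 / sigma_2
    print("condition number:", kappa)

    x = np.array([1.0, 1.0])
    y = A @ x

    # Perturb y along u_2, the direction A^{-1} amplifies the most.
    dy = 1e-8 * U[:, 1]
    dx = np.linalg.solve(A, y + dy) - x

    rel_in = np.linalg.norm(dy) / np.linalg.norm(y)
    rel_out = np.linalg.norm(dx) / np.linalg.norm(x)
    print("relative error in y :", rel_in)
    print("relative error in x :", rel_out)
    print("bound kappa * rel_in:", kappa * rel_in)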

Low-rank approximations

Suppose $A \in \mathbf{R}^{m \times n}$, $\mathrm{rank}(A) = r$, and $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$. We seek a matrix $\hat{A}$ of rank $k$, $k < r$, that best approximates $A$ with respect to the operator norm. That is, we seek $\hat{A}$ minimizing $\|A - \hat{A}\|_2$.

It turns out that the best rank-$k$ approximation is $\hat{A} = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ (in fact, $\hat{A}$ is also the best rank-$k$ approximation with respect to the Frobenius norm, though we won't show this here). This means that the outer products $\sigma_i u_i v_i^T$ are ordered by importance; this is intuitive, since $\sigma_1 u_1 v_1^T$ corresponds to the direction of maximal stretch in the domain of $A$. Notice that $\|A - \hat{A}\|_2 = \sigma_{k+1}$. In particular, for square invertible $A$, $\sigma_n$ can be interpreted as the distance from $A$ to the nearest singular matrix; if $\sigma_n$ is small, then $A$ is nearly singular.
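
A numerical check of the best rank-$k$ approximation and of the identity $\|A - \hat{A}\|_2 = \sigma_{k+1}$ (the dimensions and the choice of $k$ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.standard_normal((6, 5))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2
    # Best rank-k approximation: keep the k largest singular triples.
    A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    assert np.linalg.matrix_rank(A_hat) == k
    # The operator-norm error equals the (k+1)-st singular value.
    err = np.linalg.norm(A - A_hat, ord=2)
    assert np.isclose(err, s[k])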

To see why $\hat{A}$ is in fact an optimal rank-$k$ approximation of $A$, let $B$ be any matrix with rank at most $k$. The dimension of the null space of $B$ is at least $n - k$; $(v_1, \ldots, v_{k+1})$ spans a $(k+1)$-dimensional subspace of $\mathbf{R}^n$. Hence there exists a unit vector $z$ that is in the intersection of $\mathrm{null}(B)$ and $\mathrm{span}(v_1, \ldots, v_{k+1})$. Then

    $\|(A - B)z\|_2^2 = \|Az\|_2^2 = \sum_{i=1}^{k+1} \sigma_i^2 (v_i^T z)^2 \geq \sigma_{k+1}^2 \sum_{i=1}^{k+1} (v_i^T z)^2 = \sigma_{k+1}^2,$

so $\|A - B\|_2 \geq \sigma_{k+1} = \|A - \hat{A}\|_2$ (in the above, the second equality uses the fact that $(u_1, \ldots, u_{k+1})$ is an orthonormal list of vectors, the inequality uses $\sigma_i \geq \sigma_{k+1}$ for $i \leq k + 1$, and the last equality holds since $\sum_{i=1}^{k+1} (v_i^T z)^2 = \|z\|_2^2 = 1$).
