$$ \newcommand{\qed}{\tag*{$\square$}} \newcommand{\span}{\operatorname{span}} \newcommand{\dim}{\operatorname{dim}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\norm}[1]{\|#1\|} \newcommand{\grad}{\nabla} \newcommand{\prox}[1]{\operatorname{prox}_{#1}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\mat}[1]{\mathcal{M}[#1]} \newcommand{\null}[1]{\operatorname{null} \left(#1\right)} \newcommand{\range}[1]{\operatorname{range} \left(#1\right)} \newcommand{\rowvec}[1]{\begin{bmatrix} #1 \end{bmatrix}^T} \newcommand{\Reals}{\mathbf{R}} \newcommand{\RR}{\mathbf{R}} \newcommand{\Complex}{\mathbf{C}} \newcommand{\Field}{\mathbf{F}} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}[1]{\operatorname{E}[#1]} \newcommand{\Var}[1]{\operatorname{Var}[#1]} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
In this section, we present an important application of the SVD known as principal component analysis (PCA). It takes $n$ $m$-dimensional vectors and finds a basis consisting of $k$ orthonormal vectors whose span approximates them well. PCA is therefore a linear dimensionality reduction technique. It is used in machine learning, statistics, and exploratory data analysis; PCA can be used to compress data, de-noise them, interpret them, and, when $k = 2$ or $k = 3$, visualize them.
Suppose we have a data matrix $A \in \RR^{n \times m}$, meaning that we have $n$ different observations of $m$ variables, and each observation is a row of $A$. We seek an orthonormal list $w_1, \ldots, w_k$, with $w_i \in \RR^m$ and $k \leq m$, such that the Euclidean distance from the observations $a_1, \ldots, a_n$ (the rows of $A$) to the subspace spanned by the $w_i$ is minimized.
The optimization problem to solve is

$$\optmin{W}{\sum_{i=1}^n \norm{a_i - WW^Ta_i}_2^2}{W^TW = I,}$$
where the matrix $W \in \RR^{m \times k}$, whose columns are $w_1, \ldots, w_k$, is the optimization variable. Or, equivalently,

$$\optmin{W}{\norm{A - AWW^T}_F^2}{W^TW = I.}$$
If $W$ is a solution to this problem, then $AW \in \RR^{n \times k}$ can be interpreted as an embedding of $A$, and the $i$-th row of $AW$ is an embedding of $a_i$ (that is, the $i$-th row of $A$).
It turns out that the solution is to use the first $k$ right singular vectors of $A$.
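For concreteness, here is a minimal NumPy sketch of this recipe; the data matrix `A`, the target dimension `k`, and the other names are illustrative, not fixed by the text.

```python
import numpy as np

# Illustrative data: n = 100 observations of m = 5 variables, one per row of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
k = 2

# The first k right singular vectors of A solve the PCA problem.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = Vt[:k].T                 # columns w_1, ..., w_k

embedding = A @ W            # n x k embedding of the observations
objective = np.linalg.norm(A - A @ W @ W.T, "fro") ** 2   # value of the PCA objective
```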
In applications, the data matrix $A$ is typically centered, so that the sum of its rows equals $0$, and normalized, so that the standard deviation of each column is $1$; these assumptions are useful in practice (see the section on pairwise distances below).
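A sketch of this preprocessing, assuming `A` is an $n \times m$ NumPy array with observations as rows:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5)) * 3.0 + 10.0   # hypothetical raw data

A = A - A.mean(axis=0)    # center: the rows now sum to the zero vector
A = A / A.std(axis=0)     # normalize: each column now has standard deviation 1

assert np.allclose(A.sum(axis=0), 0.0)
assert np.allclose(A.std(axis=0), 1.0)
```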
The distance from each observation $a_i$ to its projection onto the subspace spanned by the $w_i$ is $\norm{a_i - WW^Ta_i}_2$ (see the section on projections), and moreover by the Pythagorean theorem

$$\norm{a_i}_2^2 = \norm{WW^Ta_i}_2^2 + \norm{a_i - WW^Ta_i}_2^2.$$
Since $\norm{a_i}_2^2$ is a constant, minimizing the distance from each observation to its projection is equivalent to maximizing

$$\sum_{i=1}^n \norm{WW^Ta_i}_2^2 = \sum_{i=1}^n \norm{W^Ta_i}_2^2 = \norm{AW}_F^2,$$

where the first equality uses the fact that the columns of $W$ are orthonormal.
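As a quick numerical sanity check of this equivalence (with made-up data), for any $W$ with orthonormal columns the total squared distance to the subspace is $\norm{A}_F^2 - \norm{AW}_F^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 6))
W, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # random orthonormal columns

total_sq_distance = np.linalg.norm(A - A @ W @ W.T, "fro") ** 2
assert np.isclose(
    total_sq_distance,
    np.linalg.norm(A, "fro") ** 2 - np.linalg.norm(A @ W, "fro") ** 2,
)
```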
That is, we can minimize the distance between the observations and the subspace spanned by the $w_i$ by maximizing $\norm{AW}_F^2$, subject to the constraint that the columns of $W$ are orthonormal. Notice that

$$\max_{\norm{w}_2 = 1} \norm{Aw}_2$$
is attained at the first right singular vector of $A$, or equivalently the top eigenvector of $A^TA$. Assume inductively that $w_1 = v_1, \ldots, w_{j-1} = v_{j-1}$, where $v_i$ is the $i$th right singular vector of $A$. Let $W_{j-1}$ be the matrix $\begin{bmatrix} w_1 & w_2 & \cdots & w_{j-1} \end{bmatrix}$; then $w_j$ is a unit vector, orthogonal to the columns of $W_{j-1}$, that attains

$$\max_{\norm{w}_2 = 1, \; W_{j-1}^Tw = 0} \norm{Aw}_2.$$

Since this maximum is $\sigma_j$, attained at the $j$th right singular vector of $A$, it follows that $w_j = v_j$.
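The variational characterization used in this step can be checked numerically; the sketch below (names are illustrative) maximizes $\norm{Aw}_2$ over unit vectors orthogonal to $v_1, \ldots, v_{j-1}$ and confirms that the maximum is $\sigma_j$, attained at $\pm v_j$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

j = 3                                    # check the j-th right singular vector
V_prev = Vt[: j - 1].T                   # v_1, ..., v_{j-1}, as columns

# Orthonormal basis Q for the orthogonal complement of span(v_1, ..., v_{j-1}).
P = np.eye(A.shape[1]) - V_prev @ V_prev.T
Q = np.linalg.svd(P)[0][:, : A.shape[1] - (j - 1)]

# Maximizing ||A w||_2 over unit vectors w in that complement is an SVD problem for A Q.
_, s_sub, Vt_sub = np.linalg.svd(A @ Q, full_matrices=False)
w_best = Q @ Vt_sub[0]                   # the maximizer, mapped back to R^m

assert np.isclose(s_sub[0], s[j - 1])             # the maximum is sigma_j
assert np.isclose(abs(w_best @ Vt[j - 1]), 1.0)   # the maximizer is +/- v_j
```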
The PCA solution, therefore, takes $w_1, \ldots, w_k$ to be the first $k$ right singular vectors of $A$, namely $v_1, \ldots, v_k$. Notice that the components of our data, with respect to the basis $v_1, \ldots, v_k$, can be computed as

$$AV_k = U_k\Sigma_k,$$
where $V_k = \begin{bmatrix} v_1 & \cdots & v_k \end{bmatrix}$, $U_k = \begin{bmatrix} u_1 & \cdots & u_k \end{bmatrix}$, $\Sigma_k$ is the diagonal matrix with entries $\sigma_1, \ldots, \sigma_k$, the $u_i$ are the left singular vectors of $A$, and the $\sigma_i$ are its singular values. The matrix $AV_k$ is sometimes referred to as a score matrix. The vectors $v_1, \ldots, v_k$ are called the principal vectors of $A$, and, for each observation $a_i$, $\inner{a_i}{v_j}$ is called the $j$-th principal component of $a_i$. The coordinates of an observation in the basis $v_1, \ldots, v_k$ are called the principal components of the observation.
The score matrix, or embedding matrix, given by PCA is $AV_k$. Above, we computed it by first computing the top $k$ eigenvectors of $A^TA$ (the first $k$ right singular vectors of $A$), storing them in a matrix $V_k$, and computing $AV_k$. We can instead compute it by taking an eigenvector decomposition of the Gram matrix $AA^T$: since $AA^T = U\Sigma\Sigma^TU^T$, the top $k$ eigenvectors of $AA^T$ and the square roots of the corresponding eigenvalues give $U_k$ and $\Sigma_k$, and hence the score matrix $AV_k = U_k\Sigma_k$.
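A sketch of the two routes (with illustrative names), which agree up to a sign flip in each column:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 6))
A = A - A.mean(axis=0)
k = 2

# Route 1: top-k eigenvectors of A^T A (the first k right singular vectors of A).
evals, evecs = np.linalg.eigh(A.T @ A)      # eigenvalues in ascending order
V_k = evecs[:, ::-1][:, :k]
scores_1 = A @ V_k

# Route 2: eigendecomposition of the Gram matrix A A^T.
gvals, gvecs = np.linalg.eigh(A @ A.T)
U_k = gvecs[:, ::-1][:, :k]
Sigma_k = np.sqrt(gvals[::-1][:k])
scores_2 = U_k * Sigma_k                    # U_k Sigma_k, column-wise scaling

# The two score matrices agree up to the sign of each column.
assert np.allclose(np.abs(scores_1), np.abs(scores_2))
```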
PCA is equivalent to finding an optimal low-rank approximation to the data matrix $A$; instead of outputting the rank-$k$ approximation itself, PCA outputs an orthonormal basis that can be used to construct the low-rank approximation. Because the SVD furnishes optimal low-rank approximations, this is another sense in which PCA and the SVD are intimately related.
The coordinates of the $i$th observation with respect to the basis $v_1, \ldots, v_k$ are stored in the $i$th row of $AV_k$; i.e.,

$$(AV_k)_{ij} = \inner{a_i}{v_j}.$$
In particular, the $i$-th row of the matrix

$$AV_kV_k^T = U_k\Sigma_kV_k^T$$

stores the approximation of $a_i$ computed by PCA. This matrix is nothing other than the optimal rank-$k$ approximation of $A$.
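A short check (again with made-up names) that the PCA reconstruction $AV_kV_k^T$ coincides with the truncated SVD $U_k\Sigma_kV_k^T$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((25, 7))
A = A - A.mean(axis=0)
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_k = Vt[:k].T

pca_reconstruction = A @ V_k @ V_k.T                 # i-th row approximates a_i
truncated_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # optimal rank-k approximation

assert np.allclose(pca_reconstruction, truncated_svd)
```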
When the $a_i$ are centered (i.e., when $\sum_{i=1}^n a_i = 0$), PCA can be equivalently phrased as the following problem:

$$\optmin{W}{\sum_{i < j}\left(\norm{a_i - a_j}_2^2 - \norm{W^Ta_i - W^Ta_j}_2^2\right)}{W^TW = I.}$$
Note that each term in the sum is nonnegative because projections do not increase distances. This problem is in turn equivalent to

$$\optmax{W}{\sum_{i < j}\norm{W^Ta_i - W^Ta_j}_2^2}{W^TW = I,}$$

since the distances $\norm{a_i - a_j}_2$ do not depend on $W$.
The term $\norm{W^Ta_i - W^Ta_j}_2^2$ can be interpreted as the squared distance between the embedded points $W^Ta_i$ and $W^Ta_j$. In this sense, PCA learns a linear (orthogonal projection) embedding that tries to preserve squared pairwise distances.
To see why this is true, note that the objective can be rewritten as

$$\sum_{i < j}\left(\norm{a_i - a_j}_2^2 - \norm{W^Ta_i - W^Ta_j}_2^2\right) = n\sum_{i=1}^n\norm{a_i}_2^2 - n\sum_{i=1}^n\norm{W^Ta_i}_2^2 - \norm{\sum_{i=1}^n a_i}_2^2 + \norm{W^T\sum_{i=1}^n a_i}_2^2.$$

The last two terms vanish because the data is centered, leaving $n\norm{A}_F^2 - n\norm{AW}_F^2$; since $\norm{A}_F^2$ is a constant, minimizing this objective is the same as maximizing $\norm{AW}_F^2$, which is exactly the PCA problem.
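A numerical check of this identity, on illustrative centered data and an arbitrary $W$ with orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k = 40, 5, 2
A = rng.standard_normal((n, m))
A = A - A.mean(axis=0)                                # center the data
W, _ = np.linalg.qr(rng.standard_normal((m, k)))      # any W with orthonormal columns

X = A @ W                                             # embedded points, one per row
objective = sum(
    np.linalg.norm(A[i] - A[j]) ** 2 - np.linalg.norm(X[i] - X[j]) ** 2
    for i in range(n)
    for j in range(i + 1, n)
)
expected = n * (np.linalg.norm(A, "fro") ** 2 - np.linalg.norm(X, "fro") ** 2)
assert np.isclose(objective, expected)
```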