$$ \newcommand{\qed}{\tag*{$\square$}} \newcommand{\span}{\operatorname{span}} \newcommand{\dim}{\operatorname{dim}} \newcommand{\rank}{\operatorname{rank}} \newcommand{\norm}[1]{\|#1\|} \newcommand{\grad}{\nabla} \newcommand{\prox}[1]{\operatorname{prox}_{#1}} \newcommand{\inner}[2]{\langle{#1}, {#2}\rangle} \newcommand{\mat}[1]{\mathcal{M}[#1]} \newcommand{\null}[1]{\operatorname{null} \left(#1\right)} \newcommand{\range}[1]{\operatorname{range} \left(#1\right)} \newcommand{\rowvec}[1]{\begin{bmatrix} #1 \end{bmatrix}^T} \newcommand{\Reals}{\mathbf{R}} \newcommand{\RR}{\mathbf{R}} \newcommand{\Complex}{\mathbf{C}} \newcommand{\Field}{\mathbf{F}} \newcommand{\Pb}{\operatorname{Pr}} \newcommand{\E}[1]{\operatorname{E}[#1]} \newcommand{\Var}[1]{\operatorname{Var}[#1]} \newcommand{\argmin}[2]{\underset{#1}{\operatorname{argmin}} {#2}} \newcommand{\optmin}[3]{ \begin{align*} & \underset{#1}{\text{minimize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optmax}[3]{ \begin{align*} & \underset{#1}{\text{maximize}} & & #2 \\ & \text{subject to} & & #3 \end{align*} } \newcommand{\optfind}[2]{ \begin{align*} & {\text{find}} & & #1 \\ & \text{subject to} & & #2 \end{align*} } $$
In this section, we present an important application of the SVD known as principal component analysis (PCA). It takes $n$ $m$-dimensional vectors and finds a basis consisting of $k$ orthonormal vectors whose span approximates them well. PCA is therefore a linear dimensionality reduction technique. It is used in machine learning, statistics, and exploratory data analysis; PCA can be used to compress data, de-noise them, interpret them, and, when $k = 2$ or $k = 3$, visualize them.
Suppose we have a data matrix $A \in \RR^{n \times m}$, meaning that we have $n$ different observations of $m$ variables, and each observation is a row of $A$. We seek an orthonormal list $w_1, \ldots, w_k$, with $w_i \in \RR^m$ and $k \leq m$, such that the Euclidean distance from the observations $a_1, \ldots, a_n$ (the rows of $A$) to the subspace spanned by the $w_i$ is minimized.
The optimization problem to solve is

$$\optmin{W}{\sum_{i=1}^n \norm{a_i - WW^Ta_i}_2^2}{W^TW = I,}$$
where the matrix $W \in \RR^{m \times k}$, whose columns are $w_1, \ldots, w_k$, is the optimization variable. Or, equivalently,

$$\optmin{W}{\norm{A - AWW^T}_F^2}{W^TW = I.}$$
If $W$ is a solution to this problem, then $AW \in \RR^{n \times k}$ can be interpreted as an embedding of $A$, and the $i$-th row of $AW$ is an embedding of $a_i$ (that is, the $i$-th row of $A$).
It turns out that the solution is to use the first $k$ right singular vectors of $A$.
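For concreteness, here is a minimal NumPy sketch of this recipe; the data matrix `A`, the target dimension `k`, and the other names are illustrative, not fixed by the text.

```python
import numpy as np

# Illustrative data: n = 100 observations of m = 5 variables, one per row of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
k = 2

# The first k right singular vectors of A solve the PCA problem.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = Vt[:k].T                 # columns w_1, ..., w_k

embedding = A @ W            # n x k embedding of the observations
objective = np.linalg.norm(A - A @ W @ W.T, "fro") ** 2   # value of the PCA objective
```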
In applications, the data matrix $A$ is typically centered, so that the sum of its rows equals $0$, and normalized, so that the standard deviation of each column is $1$; these assumptions are useful in practice (see the section on pairwise distances below).
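A sketch of this preprocessing, assuming `A` is an $n \times m$ NumPy array with observations as rows:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5)) * 3.0 + 10.0   # hypothetical raw data

A = A - A.mean(axis=0)    # center: the rows now sum to the zero vector
A = A / A.std(axis=0)     # normalize: each column now has standard deviation 1

assert np.allclose(A.sum(axis=0), 0.0)
assert np.allclose(A.std(axis=0), 1.0)
```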
The distance from each observation $a_i$ to its projection onto the subspace spanned by the $w_i$ is $\norm{a_i - WW^Ta_i}_2$ (see the section on projections), and moreover by the Pythagorean theorem

$$\norm{a_i}_2^2 = \norm{WW^Ta_i}_2^2 + \norm{a_i - WW^Ta_i}_2^2.$$
Since $\norm{a_i}_2^2$ is a constant, minimizing the distance from each observation to its projection is equivalent to maximizing

$$\sum_{i=1}^n \norm{WW^Ta_i}_2^2 = \sum_{i=1}^n \norm{W^Ta_i}_2^2 = \norm{AW}_F^2,$$

where the first equality uses the fact that the columns of $W$ are orthonormal.
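As a quick numerical sanity check of this equivalence (with made-up data), for any $W$ with orthonormal columns the total squared distance to the subspace is $\norm{A}_F^2 - \norm{AW}_F^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 6))
W, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # random orthonormal columns

total_sq_distance = np.linalg.norm(A - A @ W @ W.T, "fro") ** 2
assert np.isclose(
    total_sq_distance,
    np.linalg.norm(A, "fro") ** 2 - np.linalg.norm(A @ W, "fro") ** 2,
)
```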
That is, we can minimize the distance between the observations and the subspace spanned by the $w_i$ by maximizing $\norm{AW}_F^2$, subject to the constraint that the columns of $W$ are orthonormal. Notice that

$$\max_{\norm{w}_2 = 1} \norm{Aw}_2$$
is attained at the first right singular vector of $A$, or equivalently the top eigenvector of $A^TA$. Assume inductively that $w_1 = v_1, \ldots, w_{j-1} = v_{j-1}$, where $v_i$ is the $i$th right singular vector of $A$. Let $W_{j-1}$ be the matrix $\begin{bmatrix} w_1 & w_2 & \cdots & w_{j-1} \end{bmatrix}$; then $w_j$ is a unit vector, orthogonal to the columns of $W_{j-1}$, that attains

$$\max_{\norm{w}_2 = 1, \; W_{j-1}^Tw = 0} \norm{Aw}_2.$$

Since this maximum is $\sigma_j$, attained at the $j$th right singular vector of $A$, it follows that $w_j = v_j$.
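The variational characterization used in this step can be checked numerically; the sketch below (names are illustrative) maximizes $\norm{Aw}_2$ over unit vectors orthogonal to $v_1, \ldots, v_{j-1}$ and confirms that the maximum is $\sigma_j$, attained at $\pm v_j$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

j = 3                                    # check the j-th right singular vector
V_prev = Vt[: j - 1].T                   # v_1, ..., v_{j-1}, as columns

# Orthonormal basis Q for the orthogonal complement of span(v_1, ..., v_{j-1}).
P = np.eye(A.shape[1]) - V_prev @ V_prev.T
Q = np.linalg.svd(P)[0][:, : A.shape[1] - (j - 1)]

# Maximizing ||A w||_2 over unit vectors w in that complement is an SVD problem for A Q.
_, s_sub, Vt_sub = np.linalg.svd(A @ Q, full_matrices=False)
w_best = Q @ Vt_sub[0]                   # the maximizer, mapped back to R^m

assert np.isclose(s_sub[0], s[j - 1])             # the maximum is sigma_j
assert np.isclose(abs(w_best @ Vt[j - 1]), 1.0)   # the maximizer is +/- v_j
```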
The PCA solution, therefore, takes $w_1, \ldots, w_k$ to be the first $k$ right singular vectors of $A$, namely $v_1, \ldots, v_k$. Notice that the components of our data, with respect to the basis $v_1, \ldots, v_k$, can be computed as

$$AV_k = U_k\Sigma_k,$$
where $V_k = \begin{bmatrix} v_1 & \cdots & v_k \end{bmatrix}$, $U_k = \begin{bmatrix} u_1 & \cdots & u_k \end{bmatrix}$, $\Sigma_k$ is the diagonal matrix with entries $\sigma_1, \ldots, \sigma_k$, the $u_i$ are the left singular vectors of $A$, and the $\sigma_i$ are its singular values. The matrix $AV_k$ is sometimes referred to as a score matrix. The vectors $v_1, \ldots, v_k$ are called the principal vectors of $A$, and, for each observation $a_i$, $\inner{a_i}{v_j}$ is called the $j$-th principal component of $a_i$. The coordinates of an observation in the basis $v_1, \ldots, v_k$ are called the principal components of the observation.
The score matrix, or embedding matrix, given by PCA is $AV_k$. Above, we computed it by first computing the top $k$ eigenvectors of $A^TA$ (the first $k$ right singular vectors of $A$), storing them in a matrix $V_k$, and computing $AV_k$. We can instead compute it by taking an eigenvector decomposition of the Gram matrix $AA^T$: since $AA^T = U\Sigma\Sigma^TU^T$, the top $k$ eigenvectors of $AA^T$ and the square roots of the corresponding eigenvalues give $U_k$ and $\Sigma_k$, and hence the score matrix $AV_k = U_k\Sigma_k$.
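A sketch of the two routes (with illustrative names), which agree up to a sign flip in each column:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 6))
A = A - A.mean(axis=0)
k = 2

# Route 1: top-k eigenvectors of A^T A (the first k right singular vectors of A).
evals, evecs = np.linalg.eigh(A.T @ A)      # eigenvalues in ascending order
V_k = evecs[:, ::-1][:, :k]
scores_1 = A @ V_k

# Route 2: eigendecomposition of the Gram matrix A A^T.
gvals, gvecs = np.linalg.eigh(A @ A.T)
U_k = gvecs[:, ::-1][:, :k]
Sigma_k = np.sqrt(gvals[::-1][:k])
scores_2 = U_k * Sigma_k                    # U_k Sigma_k, column-wise scaling

# The two score matrices agree up to the sign of each column.
assert np.allclose(np.abs(scores_1), np.abs(scores_2))
```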
PCA is equivalent to finding an optimal low-rank approximation to the data matrix $A$; instead of outputting the rank-$k$ approximation itself, PCA outputs an orthonormal basis that can be used to construct the low-rank approximation. Because the SVD furnishes optimal low-rank approximations, this is another sense in which PCA and the SVD are intimately related.
The coordinates of the $i$th observation with respect to the basis $v_1, \ldots, v_k$ are stored in the $i$th row of $AV_k$; i.e.,

$$(AV_k)_{ij} = \inner{a_i}{v_j}.$$
In particular, the $i$-th row of the matrix

$$AV_kV_k^T = U_k\Sigma_kV_k^T$$

stores the approximation of $a_i$ computed by PCA. This matrix is nothing other than the optimal rank-$k$ approximation of $A$.
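A short check (again with made-up names) that the PCA reconstruction $AV_kV_k^T$ coincides with the truncated SVD $U_k\Sigma_kV_k^T$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((25, 7))
A = A - A.mean(axis=0)
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_k = Vt[:k].T

pca_reconstruction = A @ V_k @ V_k.T                 # i-th row approximates a_i
truncated_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # optimal rank-k approximation

assert np.allclose(pca_reconstruction, truncated_svd)
```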
When the $a_i$ are centered (i.e., when $\sum_{i=1}^n a_i = 0$), PCA can be equivalently phrased as the following problem:

$$\optmin{W}{\sum_{i < j}\left(\norm{a_i - a_j}_2^2 - \norm{W^Ta_i - W^Ta_j}_2^2\right)}{W^TW = I.}$$
Note that each term in the sum is nonnegative because projections do not increase distances. This problem is in turn equivalent to

$$\optmax{W}{\sum_{i < j}\norm{W^Ta_i - W^Ta_j}_2^2}{W^TW = I,}$$

since the distances $\norm{a_i - a_j}_2$ do not depend on $W$.
The term $\norm{W^Ta_i - W^Ta_j}_2^2$ can be interpreted as the squared distance between the embedded points $W^Ta_i$ and $W^Ta_j$. In this sense, PCA learns a linear (orthogonal projection) embedding that tries to preserve squared pairwise distances.
To see why this is true, note that the objective can be rewritten as

$$\sum_{i < j}\left(\norm{a_i - a_j}_2^2 - \norm{W^Ta_i - W^Ta_j}_2^2\right) = n\sum_{i=1}^n\norm{a_i}_2^2 - n\sum_{i=1}^n\norm{W^Ta_i}_2^2 - \norm{\sum_{i=1}^n a_i}_2^2 + \norm{W^T\sum_{i=1}^n a_i}_2^2.$$

The last two terms vanish because the data is centered, leaving $n\norm{A}_F^2 - n\norm{AW}_F^2$; since $\norm{A}_F^2$ is a constant, minimizing this objective is the same as maximizing $\norm{AW}_F^2$, which is exactly the PCA problem.
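A numerical check of this identity, on illustrative centered data and an arbitrary $W$ with orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k = 40, 5, 2
A = rng.standard_normal((n, m))
A = A - A.mean(axis=0)                                # center the data
W, _ = np.linalg.qr(rng.standard_normal((m, k)))      # any W with orthonormal columns

X = A @ W                                             # embedded points, one per row
objective = sum(
    np.linalg.norm(A[i] - A[j]) ** 2 - np.linalg.norm(X[i] - X[j]) ** 2
    for i in range(n)
    for j in range(i + 1, n)
)
expected = n * (np.linalg.norm(A, "fro") ** 2 - np.linalg.norm(X, "fro") ** 2)
assert np.isclose(objective, expected)
```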