PCA
- Pattern Recognition and Machine Learning: Chapter 12
- Elements of Statistical Learning: Section 14.5.1
- An Introduction to Statistical Learning: Section 6.3.1
Dimensionality reduction
Assume we have a set of \(N\) observations (e.g. cells) of \(p\) variables (e.g. genes).
Dimensionality reduction consists of representing the observations in a lower-dimensional space with dimension \(q\ll p\) while preserving some characteristics (e.g. relative distance) of the original \(p\)-dimensional representation.
For visualizing relationships among observations, we typically use \(q=2\).
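As a minimal numpy sketch (synthetic data, purely for illustration): observations that lie near a \(q\)-dimensional subspace of \(\mathbb{R}^p\) can be mapped to \(q\) dimensions while approximately preserving pairwise distances. The PCA projection used here is derived in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 100, 50, 2

# Synthetic "cells x genes" data lying near a q-dimensional subspace.
Z = rng.normal(size=(N, q))                  # latent coordinates
W = rng.normal(size=(q, p))                  # embedding into p dimensions
X = Z @ W + 0.01 * rng.normal(size=(N, p))   # small observation noise

# Reduce to q = 2 dimensions with PCA (computed via the SVD).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:q].T                            # N x q representation

# Pairwise distances are approximately preserved.
d_full = np.linalg.norm(X[:, None] - X[None], axis=-1).ravel()
d_low = np.linalg.norm(Y[:, None] - Y[None], axis=-1).ravel()
corr = np.corrcoef(d_full, d_low)[0, 1]
```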
Principal component analysis (PCA)
Assume we have a set of \(N\) observations \(x_1,\dots,x_N\in \mathbb{R}^p\) of \(p\) variables, collected in the \(N\times p\) matrix \(\mathbf{X}\).
The principal components of \(\mathbf{X}\) provide a sequence of best linear approximations to \(\mathbf{X}\), representing \(\mathbf{X}\) by a line (\(q=1\)), plane (\(q=2\)), etc. in \(p\)-dimensional space.
A rank-\(q\) linear model in \(p\)-dimensional space takes the form
\[ f(z) = \mu + z_1 v_1 + \dots + z_q v_q = \mu + \mathbf{V}_q z, \]
where \(\mu\in\mathbb{R}^p\) is a location vector, \(z\in\mathbb{R}^q\) is a (low-dimensional) coordinate vector, and \(\mathbf{V}_q=(v_1,\dots,v_q)\) is a \(p\times q\) matrix of \(q\) orthonormal vectors,
\[ \begin{aligned} v_k^T v_{k'} = \sum_{j=1}^p v_{kj}v_{k'j}= \delta_{kk'} \end{aligned} \]
Fitting this model to the data by least squares amounts to minimizing the reconstruction error:
\[ \min_{\mu,\{z_{i}\},\mathbf{V}_q} \sum_{i=1}^N \Bigl\| x_i - \mu - \mathbf{V}_q z_i \Bigr\|^2 \]
Partial optimization for \(\mu\) and the \(z_i\) gives
\[ \begin{aligned} \hat{\mu} &= \bar x = \frac1{N}\sum_{i=1}^N x_i\\ \hat{z}_i &= \mathbf{V}_q^T(x_i-\bar{x}) \end{aligned} \]
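These two estimates can be checked numerically. A numpy sketch, with random data and an arbitrary orthonormal \(\mathbf{V}_q\) (both hypothetical, not from the text): shifting \(\hat{\mu}\) or the \(\hat{z}_i\) away from these values can only increase the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 200, 10, 3
X = rng.normal(size=(N, p))

# Fix an arbitrary orthonormal basis V_q (QR of a random matrix).
Vq, _ = np.linalg.qr(rng.normal(size=(p, q)))

mu_hat = X.mean(axis=0)                # \hat{mu} = xbar
Z_hat = (X - mu_hat) @ Vq              # rows: \hat{z}_i = V_q^T (x_i - xbar)
X_hat = mu_hat + Z_hat @ Vq.T          # reconstructions mu + V_q z_i
err = np.sum((X - X_hat) ** 2)

# Perturbing mu_hat away from xbar increases the error.
err_shifted = np.sum((X - (mu_hat + 0.5) - Z_hat @ Vq.T) ** 2)
```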
Plugging this into the objective above leaves us to find the matrix \(\mathbf{V}_q\) with orthonormal columns:
\[ \begin{aligned} \min_{\mathbf{V}_q} \sum_{i=1}^{N} \Bigl\| (x_i - \bar{x}) - \mathbf{V}_q\mathbf{V}_q^T(x_i-\bar{x}) \Bigr\|^2 \end{aligned} \]
It can be shown that the optimal solution is found when \(\mathbf{V}_q=(v_1,\dots,v_q)\) contains the \(q\) eigenvectors with largest eigenvalues of the \(p\times p\) matrix \(\mathbf{X}^T\mathbf{X}\) (if the data is centred such that \(\bar x=0\)).
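A numpy sketch of this eigenvector solution, on hypothetical random data: take the top-\(q\) eigenvectors of \(\mathbf{X}^T\mathbf{X}\) for centred \(\mathbf{X}\), and compare the resulting reconstruction error against an arbitrary orthonormal basis of the same size.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, q = 100, 8, 2
X = rng.normal(size=(N, p))
Xc = X - X.mean(axis=0)                # centre so that xbar = 0

# Eigenvectors of X^T X, sorted by decreasing eigenvalue.
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(evals)[::-1]
Vq = evecs[:, order[:q]]               # optimal p x q basis

def recon_error(V):
    """Reconstruction error sum_i ||x_i - V V^T x_i||^2 for orthonormal V."""
    return np.sum((Xc - Xc @ V @ V.T) ** 2)

# An arbitrary orthonormal basis of the same size, for comparison.
Vrand, _ = np.linalg.qr(rng.normal(size=(p, q)))
```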
The solution can be expressed using the singular value decomposition of the observed data matrix \(\mathbf{X}\):
\[ \mathbf{X} = \mathbf{U} \Lambda \mathbf{V}^T \]
where \(\mathbf{U}\) is an \(N\times p\) matrix with orthonormal columns (\(\mathbf{U}^T\mathbf{U}=\mathbf{I}_p\)) called the left singular vectors, \(\mathbf{V}\) is a \(p\times p\) orthogonal matrix (\(\mathbf{V}^T\mathbf{V}=\mathbf{I}_p\)) whose columns are called the right singular vectors, and \(\Lambda\) is a \(p\times p\) diagonal matrix with diagonal elements \(\lambda_1\geq \lambda_2 \geq \dots \geq \lambda_p\geq 0\) called the singular values.
The solution \(\mathbf{V}_q\) above consists of the first \(q\) columns of \(\mathbf{V}\).
The columns of \(\mathbf{U}\Lambda\) are called the principal components of \(\mathbf{X}\). The first \(q\) principal components are \(\mathbf{U}_q\Lambda_q\). Each principal component is a vector in \(\mathbb{R}^N\), that is, it takes a value for each observation. The solution \(\hat{z}_i\) above consists of the rows of \(\mathbf{U}_q\Lambda_q\), that is, for each observation \(i\), \(\hat{z}_i\) consists of the values of the first \(q\) principal components for observation \(i\).
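These SVD identities can be verified directly; a minimal numpy sketch on hypothetical random data (note that `np.linalg.svd` returns \(\mathbf{V}^T\), not \(\mathbf{V}\)):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, q = 100, 6, 2
X = rng.normal(size=(N, p))
Xc = X - X.mean(axis=0)                      # centred data matrix

# Thin SVD: U is N x p, lam holds the singular values, Vt is V^T.
U, lam, Vt = np.linalg.svd(Xc, full_matrices=False)

PCs = U * lam            # columns are the principal components (= Xc @ V)
Z_hat = PCs[:, :q]       # row i holds \hat{z}_i, i.e. U_q Lambda_q
```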