8b Dimensionality Reduction with PCA
Aug 13, 2017 14:08 · 720 words · 4 minutes read
Dimensionality Reduction
Motivation
Data Compression
If there's a lot of redundant data, we can find pairs of highly correlated features, plot them, and fit a single new line that describes both features well.
Projecting onto this line reduces the total amount of data we need to store, and speeds up the learning algorithm.
We're reducing the features rather than number of examples. $m$ stays the same, but $n$, the number of features in $x^{(i)}$, will be reduced.
Visualisation
It's difficult to visualise data with more than three dimensions - we can reduce it to three or fewer dimensions in order to plot it.
We need to find new features that can summarise other features.
e.g. hundreds of features related to a country's economy might be combined into one called "Economic Activity".
Principal Component Analysis Problem Formulation
Given two features $x_1$ and $x_2$, we want to find a single line that effectively describes both features at once. The old features are mapped onto this new line to get a new single feature.
The same can be done with three features, which are mapped onto a plane to give two new features.
The more general case is to reduce from n-dimensions to k-dimensions: find $k$ vectors onto which to project the data so as to minimise the projection error.
The goal of PCA is to minimise the average of the squared distances from every data point to the projection line (the projection error).
Differences from linear regression
In linear regression, we're minimising the squared error between the data points and the predictor line. These are vertical distances.
In PCA, we're minimising the orthogonal (shortest) distances from the data points to the line.
Principal Component Analysis Algorithm
Before applying PCA, we need to perform feature scaling and mean normalisation with the standard:
\[x_j^{(i)} = \frac{ x_j^{(i)} - \mu_j}{s_j}\]
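As a minimal Octave sketch of this step (assuming the examples are stored as the rows of a matrix X):

mu = mean(X);            % 1 x n row vector of feature means
s = std(X);              % 1 x n row vector of feature standard deviations
X_norm = (X - mu) ./ s;  % broadcast: subtract each mean, divide by each spread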
PCA has two tasks:
- Find the $k$-dimensional surface (a line for $k = 1$, a plane for $k = 2$) that will minimise the projection error
- Compute the transformed points $z^{(1)}, z^{(2)}, ..., z^{(m)}$
Step 1. Compute covariance matrix, $\Sigma$
\[ \Sigma = \frac{1}{m} \sum_{i=1}^m ( x^{(i)} ) ( x^{(i)} )^T\]
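With the normalised examples as rows of X, this sum vectorises into a single matrix product; as a sketch:

Sigma = (1/m) * (X' * X);  % n x n covariance matrix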
Step 2. Compute eigenvectors of covariance matrix
We can use singular value decomposition in Octave for this:
[U, S, V] = svd(Sigma)
We just want the $U \in \mathbb{R}^{n \times n}$ matrix.
Step 3. Take the first $k$ columns of $U$ and compute $z$
We assign the first $k$ columns of $U$ to $U_{reduce}$, an $n \times k$ matrix. We then compute $z$ with $z^{(i)} = U_{reduce}^T \cdot x^{(i)}$.
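Putting the three steps together, a sketch in Octave (assuming X holds one normalised example per row, and $k$ has been chosen, e.g. as in the next section):

Sigma = (1/m) * (X' * X);  % n x n covariance matrix
[U, S, V] = svd(Sigma);    % columns of U are the eigenvectors
Ureduce = U(:, 1:k);       % first k principal components, n x k
Z = X * Ureduce;           % m x k; row i is z^(i) = Ureduce' * x^(i)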
Choosing the number of principal components
How should we choose $k$, the number of dimensions we're reducing to?
A good way is to choose $k$ to be the smallest value such that the average squared projection error divided by the total variation in the data is $\le 0.01$.
\[\dfrac{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2}{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2} \leq 0.01\]
A slow way would be to try PCA with $k = 1,2,...$, and check whether the formula gives $\le 0.01$ (retains 99% of the variance). If not, increase $k$ and restart.
In Octave, we can use the $S$ matrix from svd() and check for 99% of retained variance using:
\[\frac{ \sum_{i=1}^k S_{ii} }{ \sum_{i=1}^n S_{ii} } \ge 0.99\]
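Since $S$ is diagonal, this check takes only a few lines; a sketch, assuming S comes from svd(Sigma) as above:

sv = diag(S);                     % the singular values S_11, ..., S_nn
retained = cumsum(sv) / sum(sv);  % variance retained for each candidate k
k = find(retained >= 0.99, 1);    % smallest k retaining 99% of the variance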
Reconstructing from Compressed Representation
Once compressed, we may want to map back to our original number of features. To do this, we use:
\[x^{(i)}_{approx} = U_{reduce} \cdot z^{(i)}\]
This can also be vectorised with:
\[X_{approx} = Z \cdot U_{reduce}^T\]
Note this only gives an approximation of the original data $X$.
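A sketch of the full round trip, reusing Ureduce from above; the last line computes the ratio from the previous section, so it should come out $\le 0.01$ if $k$ was chosen that way:

Z = X * Ureduce;          % compress: m x k
X_approx = Z * Ureduce';  % reconstruct: m x n approximation of X
err = sum(sum((X - X_approx).^2)) / sum(sum(X.^2));  % the 1/m factors cancel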
Advice for applying PCA
The most common use of PCA is to speed up supervised learning.
Given a training set with a large number of features, we can use PCA to reduce the number of features in each example of the training set.
Note we should define the PCA mapping (the normalisation parameters and $U_{reduce}$) only on the training set, not on the cross-validation or test sets. We then apply that same mapping to the cross-validation and test sets.
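As a sketch of that workflow (X_train, X_cv, and X_test are hypothetical names; mu, s, and Ureduce all come from the training set only):

Z_train = ((X_train - mu) ./ s) * Ureduce;
Z_cv = ((X_cv - mu) ./ s) * Ureduce;  % reuse the training-set parameters
Z_test = ((X_test - mu) ./ s) * Ureduce;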
Applications:
- Compression
- Reduce space of data
- Speed up algorithm
- Visualisation of data
- Choose $k=2$ or $k=3$
Bad uses of PCA
Trying to prevent overfitting. You might think that reducing the number of features with PCA would be a good way to address overfitting. It might work, but it isn't recommended: PCA throws away information without considering the labels $y$. Regularisation will be at least as effective.
Don't assume that you need to do PCA. Try the full machine learning algorithm without PCA first. Then use PCA if you find that you need it.