A Guide to Principal Component Analysis

Written by Smaran Iyer. An overview of one of the most important data analysis tools.

Introduction

Machine learning deals with lots of data. Oftentimes, we work with many variables, which can hurt the efficiency of a program. One helpful technique for mitigating this problem is dimensionality reduction: we reduce the number of input variables while trying our best not to compromise the results of the program. This reduces the amount of storage space required and cuts down computation time [1]. In this article, I will give a run-down of one algorithm commonly used for dimensionality reduction — Principal Component Analysis (PCA).

Background Math

Most basic linear algebra classes deal with the concept of dimensionality. You might think of it as the space ℝ, ℝ², or even ℝⁿ representing n dimensions. When dealing with data, we like to think of dimensionality as the number of input variables. This matches the linear algebra picture — one input variable can be scaled along the x-axis, two variables give the x-y coordinate system, three give the x-y-z coordinate system, and so on. In dimensionality reduction, we project the original dataset from n dimensions into a new space of m dimensions, where m < n.

Sometimes variables are correlated. Other times, they are not. Sometimes they have a weak correlation, while other times they have an incredibly strong correlation. If two input variables have the strongest correlation possible — i.e., one variable is an exact linear function of the other — then one of the variables is entirely redundant, because all of its information is already contained in the other. This means that, while creating our ML model, we can drop one of the variables, keep the other, and still get equally accurate results. Of course, this assumes an ideal world where two variables are perfectly correlated, which rarely happens in the real world. But the idea sets the stage for PCA.
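To make the redundancy idea concrete, here is a minimal NumPy sketch. The article itself shows no code, so the data and numbers below are invented purely for illustration: one toy variable is an exact linear function of the other, their correlation is exactly 1, and either one can be dropped without losing information.

```python
import numpy as np

# Toy example: x2 depends precisely on x1, so the pair is perfectly
# correlated and one of the two variables is redundant.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 3.0 * x1 + 2.0  # an exact linear function of x1

# The Pearson correlation is 1 (up to floating-point error), so keeping
# only x1 preserves everything x2 could tell our model.
corr = np.corrcoef(x1, x2)[0, 1]
print(f"correlation(x1, x2) = {corr:.6f}")
```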

Normalizing the Data

Linear algebra (and math in general) is a lot simpler when the number zero is involved! Therefore, the first step in PCA is to center our data around the zero vector, achieved by subtracting the mean of the dataset from every datapoint (x_new = x_old − μ). This “shifts” the whole set to be centered around zero. Our next step is to ensure that measurement scales do not bias the process of PCA. This is done by dividing each centered datapoint by the standard deviation of its variable. The entire process detailed in this paragraph is called normalization, and the final formula stands:

x_new = (x_old − μ) / σ

The process of normalization, where x_new is the normalized vector of x_old, μ is the mean of the dataset, and σ is the standard deviation.
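As a rough illustration of this step (the dataset below is made up for the example), normalization in NumPy amounts to subtracting each column’s mean and dividing by its standard deviation:

```python
import numpy as np

# Invented dataset: rows are datapoints, columns are input variables
# measured on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 240.0],
              [3.0, 310.0],
              [4.0, 330.0]])

# Center each variable around zero, then divide by its standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # approximately 0 for every column
print(X_norm.std(axis=0))   # 1 for every column (up to floating point)
```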

The next step is to test the correlation between every pair of input variables. We encode this information in a matrix called the covariance matrix, which looks like this (with three input variables):

    ⎡ var(x)    cov(x,y)   cov(x,z) ⎤
C = ⎢ cov(y,x)  var(y)     cov(y,z) ⎥
    ⎣ cov(z,x)  cov(z,y)   var(z)   ⎦

Building a covariance matrix for three input variables x, y, and z.

Before going further, we should cover variance and covariance. The variance (var) of a variable measures how far, on average, its values spread out from the mean. Covariance (cov), on the other hand, is like variance for two variables — it measures how much the two variables vary away from their means together. Now we can see why it is essential to standardize the dataset: suppose one variable were measured in kilometers while another was measured in centimeters. Their variances and covariances would then reflect the choice of units rather than the actual spread of the data.

cov(x, y) = (1 / (n − 1)) · Σᵢ [(xᵢ − x̄)(yᵢ − ȳ)]

The covariance between two variables x and y, where x̄ and ȳ are their means and n is the number of datapoints.

For variance, we would replace y with x, so the bracketed term gets squared. Note that cov(x,x) = var(x), and cov(x,y) = cov(y,x). From this, we can see that the covariance matrix contains the covariance of every pair of input variables, and that it is always square and symmetric.
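A quick sketch of how the covariance matrix might be computed in NumPy (the data here is a synthetic stand-in; np.cov uses the same 1/(n − 1) convention shown above):

```python
import numpy as np

# Synthetic stand-in data: rows = datapoints, columns = 3 variables,
# normalized as described earlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# np.cov treats rows as variables by default, so set rowvar=False.
cov_matrix = np.cov(X_norm, rowvar=False)

# Square and symmetric, with var(x_i) on the diagonal
# and cov(x_i, x_j) = cov(x_j, x_i) off the diagonal.
print(cov_matrix.shape)                       # (3, 3)
print(np.allclose(cov_matrix, cov_matrix.T))  # True
```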

Making Sense of the Covariance Matrix

The eigenvalues and eigenvectors of the covariance matrix are the key to PCA. Since the covariance matrix is symmetric, the spectral theorem tells us we can diagonalize it into the form P⁻¹DP, where P is the matrix whose rows are the eigenvectors and D is a diagonal matrix whose entries are the corresponding eigenvalues [3]. The eigenvectors of the covariance matrix give the directions of the data’s spread, and they can always be chosen to be orthogonal to each other. The eigenvalues give the magnitude of that spread. They answer the question: how much does the data spread in this particular direction (eigenvector)?
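One way to sketch this decomposition in NumPy (again with made-up data) is np.linalg.eigh, which is intended for symmetric matrices and returns orthonormal eigenvectors:

```python
import numpy as np

# Made-up normalized data and its covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_norm, rowvar=False)

# eigh is designed for symmetric matrices: eigenvalues come back in
# ascending order, eigenvectors as orthonormal columns of Q.
eigvals, Q = np.linalg.eigh(cov_matrix)

# Sanity-check the diagonalization: Q^T C Q is the diagonal matrix D
# of eigenvalues (equivalently, C = Q D Q^T).
D = Q.T @ cov_matrix @ Q
print(np.round(D, 6))
```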

We finally get to the point where we understand what a principal component is. A principal component is just a vector — more precisely, a linear combination of the input variables. Our purpose here is to re-express the original data in a way that makes its variance easy to reason about. The first principal component must capture as much of the variance in the data as possible, so we choose it to point in the direction of maximum variance. For efficiency, the second principal component must be uncorrelated with the first, so we choose a direction orthogonal to the first principal component, again picked so that the variance along it is maximized, like so:

The diagram above shows a set of data with two variables. The first principal component follows the direction of maximum variance, and the second principal component is the direction of maximum variance that is orthogonal to the first. (Source: Analytics Vidhya)

The process is repeated until we get the same number of principal components as the number of input variables. It turns out that the principal components are the same as the eigenvectors of the covariance matrix. In addition, because eigenvalues encode variance, the eigenvectors with higher eigenvalues are considered “more important” principal components than the others.
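In code, ranking the principal components simply means sorting the eigenvectors by their eigenvalues in descending order. A sketch with placeholder data:

```python
import numpy as np

# Placeholder data standing in for a real normalized dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X_norm, rowvar=False))

# Sort from largest eigenvalue (most important principal component)
# down to smallest.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
components = eigvecs[:, order]  # column i is the i-th principal component

print(eigvals)
```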

Principal Components

The most important principal component stores the most information; the least important stores the least. Thus, we finally arrive at the objective of PCA: discard the least important principal components (we will inevitably lose some information, unless we live in an ideal world where some eigenvalues are exactly zero) and keep the rest, reducing the dimensionality of the original dataset. How many eigenvectors to discard is up to the programmer — we are trading accuracy for efficiency. The original dataset is then projected onto the new axes formed by the chosen principal components, and further machine learning algorithms operate on these projected coordinates. The amount of information lost by sacrificing any given eigenvector can be calculated with the formula:

information lost by discarding the i-th eigenvector = λᵢ / (λ₁ + λ₂ + … + λₙ)

The loss of information due to PCA compared to the original information, where λ₁, …, λₙ are the eigenvalues of the covariance matrix.
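Putting the projection and the loss formula together, a minimal sketch might look like the following; the choice of k and the synthetic data are arbitrary, and summing the ratio above over all discarded eigenvectors gives the total fraction of variance lost:

```python
import numpy as np

# Synthetic data with 5 variables that mostly vary along 2 hidden directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X_norm, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                    # how many principal components to keep
W = eigvecs[:, :k]       # the k most important components as columns
X_reduced = X_norm @ W   # project the data onto the new axes

# Total information lost = sum of discarded eigenvalues over sum of all.
info_lost = eigvals[k:].sum() / eigvals.sum()
print(X_reduced.shape, f"information lost: {info_lost:.2%}")
```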

The importance of each principal component can be captured in a graph that looks like this:

An example of how each principal component summarizes the variance in the data. (Source: Stack Overflow)
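Such a plot (often called a scree plot) could be produced along these lines with matplotlib; the data is again synthetic, so the bar heights are only illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a couple of dominant directions, just to have
# a meaningful set of eigenvalues to plot.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(200, 5))
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals = np.sort(np.linalg.eigvalsh(np.cov(X_norm, rowvar=False)))[::-1]
explained = eigvals / eigvals.sum()

# Bars: share of total variance captured by each principal component.
# Line: cumulative share as components are added.
idx = np.arange(1, len(eigvals) + 1)
plt.bar(idx, explained)
plt.plot(idx, np.cumsum(explained), marker="o", color="black")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.show()
```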

Applications

PCA is a handy data analysis tool, so it is useful in any field that works with a lot of data, including (but not limited to) music, marketing, teaching, healthcare, and nuclear science. For example, a study from the Johannes Gutenberg Universitat-Mainz used PCA on data from a mouse experiment [4]. The researchers analyzed the principal components to determine which variables accounted for the most variance and were, therefore, the most important.
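In practice, PCA is rarely coded from scratch. Libraries such as scikit-learn (not used in the study above, and mentioned here only as one common option) bundle the whole pipeline into a few calls:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for experimental measurements
# (rows = samples, columns = measured variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

X_norm = StandardScaler().fit_transform(X)  # center and scale each variable
pca = PCA(n_components=2)                   # keep the two most important components
X_reduced = pca.fit_transform(X_norm)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each kept component
```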

References

[1] https://data-flair.training/blogs/dimensionality-reduction-tutorial/

[2] https://towardsdatascience.com/5-things-you-should-know-about-covariance-26b12a0516f1

[3] https://brilliant.org/wiki/spectral-theorem/

[4] https://d-nb.info/1168548586/34


University of Toronto Machine Intelligence Team

UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify