Principal Component Analysis

Article Info

Contributed by
2 authors

Last updated on
2019-09-23 13:22:18

Eigenvalues and Eigenvectors for Data Scientists
Singular Value Decomposition
Dimensionality Reduction
Feature Engineering
Factor Analysis
Kernel Principal Component Analysis

PCA in a nutshell. Source: Lavrenko and Sutton 2011, slide 13.

Big Data is increasingly becoming the norm and affecting many domains. When there's lots of data involving multiple variables, the work of a data scientist gets difficult. Algorithms also take longer to complete. Wouldn't it be sensible to identify and consider only those variables that have the most influence and discard the others?

Principal Component Analysis (PCA) extracts the most important information. This in turn leads to compression since the less important information is discarded. With fewer variables to consider, it becomes simpler to describe and analyze the dataset.

PCA can be seen as a trade-off between faster computation and less memory consumption versus information loss. It's considered one of the most useful tools for data analysis.
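
The trade-off between compression and information loss can be made concrete with a small sketch. It assumes a synthetic dataset (10 features driven by 2 hidden factors plus noise) and a helper named `reconstruction_error`; both are made up for illustration and are not from any source cited here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 200 samples, 10 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

mean = X.mean(axis=0)
Xc = X - mean

# Rows of Vt are the principal directions, ordered by decreasing variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruction_error(k):
    """Relative squared error after keeping only the first k components."""
    W = Vt[:k]                      # k x 10 projection matrix
    compressed = Xc @ W.T           # 200 x k values stored instead of 200 x 10
    restored = compressed @ W + mean
    return np.sum((X - restored) ** 2) / np.sum(Xc ** 2)

ks = (1, 2, 5, 10)
errors = [reconstruction_error(k) for k in ks]
for k, err in zip(ks, errors):
    print(f"{k} components kept -> relative error {err:.6f}")
```

In this setup the error should drop sharply once two components are kept, since only two factors generate the data, and it should be essentially zero when all ten are kept.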

Discussion

Could you explain PCA with a simple example?
Illustration of principal component analysis. Source: Werner and Friedrich 2014, fig. 1.
We can describe the shape of a fish with two variables: height and width. However, these two variables are not independent of each other. In fact, they have a strong correlation. Given the height, we can probably estimate the width; and vice versa. Thus, we may say that the shape of a fish can be described with a single component.
This doesn't mean that we simply ignore either height or width. Instead, we transform our two original variables into two orthogonal (uncorrelated) components that give a complete alternative description. The first component (blue line) will explain most of the variation in the data. The second component (dotted line) will explain the remaining variation. Note that both components are derived from both height and width.
More intuitively, the first component line can be seen as the best-fit line that minimizes information loss. Alternatively, it can also be seen as the line that maximizes the variation; that is, it tries to explain as much of the variation in the dataset as possible.
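
A minimal sketch of the fish example follows. The numbers are invented: it assumes 200 fish whose width is roughly half their height plus some noise. It uses plain NumPy rather than any particular PCA library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements: width is about half the height, plus noise.
height = rng.normal(loc=10.0, scale=2.0, size=200)
width = 0.5 * height + rng.normal(loc=0.0, scale=0.3, size=200)
data = np.column_stack([height, width])

# Center the data, then eigendecompose the 2x2 covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order

# Reorder so that the first component has the largest variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Share of variance per component:", np.round(explained_ratio, 3))
print("First component weights (height, width):", np.round(eigenvectors[:, 0], 3))
```

With strongly correlated inputs like these, the first component should carry nearly all of the variance, and its weights on both height and width are non-zero, which reflects that each component is derived from both variables.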
Could you mention some real-world use cases of PCA?
Use of PCA for facial recognition. Source: Lipp 2015.
PCA has been applied to facial recognition. To capture 90% of the variance, only a third of the components had to be retained. This may be sufficient for Machine Learning applications. The other two-thirds contain most of the image details.
In another study, the consumption of 17 different food types was studied across 4 countries in the UK. Thus, this problem has 17 features and is hence non-trivial to analyze. With PCA, the first component showed that Northern Ireland was unique. People of Northern Ireland consumed fresh potatoes and fresh fruit differently from other populations.
The lower molar teeth of an ancient mammal named Kuehneotherium were studied using nine variables. PCA showed that just two components are enough to explain over 95% of total variation. When plotted, it was easy to see the clusters and relate them back to the original features. One cluster stood for a species of Kuehneotherium while another broader cluster suggested an unidentified animal.
In a study to detect lactose in lactose-free milk using NIR spectroscopy, where the data had 601 dimensions, PCA identified distinct clusters with just two principal components.
Isn't PCA similar to Dimensionality Reduction?
In a complex data-intensive problem, there are usually many influencing variables. The term variable is equivalent to other commonly used terms: feature or dimension.
The idea of reducing the number of variables or dimensions is called Dimensionality Reduction. This can be done in two ways:
- Feature Elimination: We drop some features that we may consider unimportant. While the approach is simple, we lose useful information present in those dropped features.
- Feature Extraction: We transform the original set of features into another set of features. The idea is to pack the most important information into as few derived features as possible. We can reduce the number of dimensions by dropping some of the derived features. But we don't completely lose the information in the original features: derived features are linear combinations of the original features.
PCA is in fact a method for doing feature extraction. In PCA, derived features are also called composite features or principal components. Moreover, these principal components are uncorrelated with one another.
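
Feature extraction can be sketched as follows, assuming a made-up dataset of three features, two of which are strongly correlated. The sketch uses NumPy's SVD as one possible way to obtain the components; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: 100 samples, 3 features; the first two share a common source.
base = rng.normal(size=(100, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

Xc = X - X.mean(axis=0)

# SVD of the centered data: rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each derived feature is a linear combination of all original features.
scores = Xc @ Vt.T

# Feature extraction: keep only the first two derived features.
reduced = scores[:, :2]

# The derived features are uncorrelated: their covariance matrix is diagonal.
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))
print("Reduced shape:", reduced.shape)
print("Largest off-diagonal covariance:", np.abs(off_diag).max())
```

Unlike feature elimination, the two retained columns still draw on all three original features through the weights in `Vt`.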
What are the advantages of the PCA technique?
PCA minimizes information loss even when fewer principal components are considered for analysis. This is because each principal component is along a direction that maximizes variation, that is, the spread of data. More importantly, the components themselves need not be identified a priori: they are identified by PCA from the dataset. Thus, PCA is an adaptive data analysis technique. In other words, PCA is an unsupervised learning method.
By reducing the number of dimensions, PCA enables easier data visualization. Visualization helps us to identify clusters, patterns and trends more easily. Fewer dimensions mean less computation and can mean a lower error rate. PCA can also reduce noise, which helps downstream algorithms work better.
Finding the principal components is really an eigenvalue/eigenvector problem, which has been well studied with lots of algorithms available for practical use.
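
To illustrate the eigenvalue/eigenvector view, here is a small sketch on random correlated data (an assumption made only for the example). It checks that a principal direction v satisfies C v = λ v for the covariance matrix C, and that the same eigenvalues can be recovered from a singular value decomposition of the centered data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical correlated data: 50 samples, 4 features.
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Covariance matrix and its eigendecomposition (eigh returns ascending order).
C = Xc.T @ Xc / (n - 1)
eigenvalues, eigenvectors = np.linalg.eigh(C)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

# The leading principal direction satisfies C v = lambda v.
v1 = eigenvectors[:, 0]
residual = np.linalg.norm(C @ v1 - eigenvalues[0] * v1)

# The same eigenvalues follow from the singular values s of the centered data:
# lambda_i = s_i^2 / (n - 1).
singular_values = np.linalg.svd(Xc, compute_uv=False)
eigen_from_svd = singular_values ** 2 / (n - 1)

print("Residual of C v - lambda v:", residual)
print("Eigenvalues (eigh):", np.round(eigenvalues, 4))
print("Eigenvalues (SVD): ", np.round(eigen_from_svd, 4))
```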
Although a Gaussian distribution of data is often assumed, as a descriptive tool PCA doesn't need this assumption. It can be used for exploratory analysis on data of any distribution. There are also variations of PCA that cater to different data types and structures.
What are the drawbacks of the PCA technique?
Examples where PCA may not work well. Source: Lever et al. 2017, fig. 4.
Here are some drawbacks of PCA:
- PCA works well only if the observed variables are linearly correlated. If there's no correlation, PCA will fail to capture adequate variance with fewer components.
- PCA is lossy. Information is lost when we discard insignificant components.
- Scaling of variables can yield different results. Hence, the scaling that you use should be documented. Scaling should not be adjusted to match prior knowledge of data.
- Since each principal component is a linear combination of the original features, visualizations are not easy to interpret or relate to original features.
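
The sensitivity to scaling can be seen in a short sketch. It assumes two made-up variables measured in very different units (length in millimetres, weight in kilograms) and a helper named `first_component` written for this example.

```python
import numpy as np

def first_component(X):
    """Return the first principal direction of X (rows are samples)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(3)

# Hypothetical data: length in mm has a far larger numeric spread than weight in kg.
length_mm = rng.normal(1000.0, 100.0, size=300)
weight_kg = 0.005 * length_mm + rng.normal(0.0, 0.5, size=300)
X = np.column_stack([length_mm, weight_kg])

# Without scaling, the variable with the larger numeric spread dominates.
raw_pc1 = first_component(X)

# After standardizing to zero mean and unit variance, both variables contribute.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_pc1 = first_component(X_std)

print("PC1 weights on raw data:         ", np.round(raw_pc1, 3))
print("PC1 weights on standardized data:", np.round(std_pc1, 3))
```

On the raw data the first component is almost entirely length, while on standardized data the two variables receive weights of equal magnitude. Neither answer is wrong, which is why the choice of scaling should be recorded.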

Milestones

1850

Mid-nineteenth century works by Cauchy and Jacobi in classical analytic geometry show that the equations for the principal axes of quadratic forms and surfaces are known.

1889

Francis Galton in his Natural Inheritance connects principal axes for the first time with the correlation ellipsoid.

1901

Karl Pearson invents PCA while working to find the major and minor axes of an ellipse. However, he does not use the term PCA. In his geometric interpretation of the problem, he's trying to find "lines and planes of closest fit to systems of points in space".

1930

Harold Hotelling develops PCA independently and names the technique. His approach is what is familiar to us today, using successive orthogonal linear combinations with maximum variance. The 1930s is also the decade when the development of Factor Analysis, which is closely related to PCA, gets under way.

1960

Around 1960, Malinowski introduces PCA to chemistry. After 1970, many chemical applications of PCA appear in literature.

1966

A scree plot based on eigenvalues shows that three factors will explain most of the data. Source: Statistica Help 2018.

How does one determine how many principal components to retain for analysis? In the context of factor analysis, R.B. Cattell proposes a method called the Scree Test. A Scree Plot is used for this purpose. It graphically represents the eigenvalues, or the percentages of total variation, accounted for by each principal component.
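
The quantities shown on a scree plot can be computed as in the sketch below. The eigenvalues are invented for illustration, and the helper `components_needed` applies a cumulative-variance threshold, which is a related rule of thumb rather than Cattell's visual test of looking for the elbow in the plot.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order.
eigenvalues = np.array([4.0, 2.5, 1.5, 0.6, 0.3, 0.1])

# Percentage of total variation accounted for by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative share reaches the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

# A crude text version of a scree plot: one bar per component.
for i, (ratio, total) in enumerate(zip(explained_ratio, cumulative), start=1):
    bar = "#" * int(round(ratio * 50))
    print(f"PC{i}: {ratio:6.1%} (cumulative {total:6.1%}) {bar}")

print("Components for 85% of variation:", components_needed(cumulative, 0.85))
print("Components for 95% of variation:", components_needed(cumulative, 0.95))
```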

Sample Code

R

# Source: http://r.789695.n4.nabble.com/How-to-comment-in-R-td882882.html
# Accessed 2019-01-12.
 
# Do PCA on iris dataset and plot
library(ggfortify)
df <- iris[c(1, 2, 3, 4)]
autoplot(prcomp(df), data = iris, colour = 'Species')

Python

# Adapted from source: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
# Accessed 2019-01-12.
 
# Do PCA on iris dataset and plot
# ---------------------------------------------------------------------
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
 
# Download and load iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal length','sepal width','petal length','petal width','target']
df = pd.read_csv(url, names=names)
 
# Standardize data to 0 mean and 1 variance
features = names[:-1]
x = df.loc[:, features].values
y = df.loc[:,['target']].values
x = StandardScaler().fit_transform(x)
 
# Perform PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['PC1', 'PC2'])
finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
 
# Plot
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
explained = np.around(pca.explained_variance_ratio_*100, 2)
ax.set_xlabel('PC1 ({}%)'.format(explained[0]), fontsize = 15)
ax.set_ylabel('PC2 ({}%)'.format(explained[1]), fontsize = 15)
ax.set_title('Two-Component PCA', fontsize = 20)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'PC1']
               , finalDf.loc[indicesToKeep, 'PC2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()
# Display the figure when run as a script
plt.show()

Cite As

Devopedia. 2019. "Principal Component Analysis." Version 8, September 23. Accessed 2023-11-12. https://devopedia.org/principal-component-analysis
