Principal Component Analysis (PCA) in Machine Learning Made Easy

shiva gandluri
10 min read · Jul 20, 2019


Check out my GitHub link for the code.

Bird’s-eye view of the project:

In this blog, I’m going to talk about the following things.

  1. Dimensionality reduction
  2. What is PCA
  3. Why is it used
  4. When is it used
  5. How does it work
  6. Limitations of PCA
  7. A toy example of PCA

Dimensionality Reduction:

The higher the number of features, the harder it gets to visualize the training set and work on it. Moreover, many of these features are often correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into:

1. Feature selection:

In feature selection, we reduce the feature space by eliminating features outright. The advantages of feature elimination are its simplicity and the fact that the remaining variables stay interpretable. The disadvantage is that we gain nothing from the variables we have dropped: by eliminating features, we have also entirely eliminated any information those dropped variables would have brought.

Note: keep in mind that the main aim of dimensionality reduction is to reduce the dimensionality of the dataset while retaining as much information as possible.

2. Feature extraction:

In feature extraction, we create new independent variables/features from the old ones (i.e. the X’s) and then eliminate the least important of these new variables/features. Even when we eliminate the least important “new” features, the information is not lost, because the most important qualities of the old features are still present in the new independent variables. In other words, since the new independent variables are combinations of the old ones, we keep the most valuable parts of the old variables even when we drop one or more of the “new” ones.

There are three popular dimensionality reduction techniques:

1. PCA (Principal Component Analysis)

2. LDA (Linear Discriminant Analysis)

3. GDA (Generalized Discriminant Analysis)

What is PCA:

Principal Component Analysis (PCA) is a way of reducing the dimensions of a given dataset by extracting new features from the original features present in the dataset. It combines our input variables (or features) in a specific way and gives us “new” features that retain the most valuable information from all the original features. The “new” variables produced by PCA are all uncorrelated with one another.

Fig 1: “Original” features (Features 1 and 2) and “new” features (Principal components 1 and 2)

Let’s assume that our original features are ‘Feature 1’ and ‘Feature 2’, and that the two-dimensional data is distributed as shown in the first graph of Fig 1. As we can see, the data is spread almost equally across both features, i.e. if we project the data points onto each axis, the spread along each axis is almost equal. Spread (or variance) indicates how much information a particular feature carries about the given data. So in this case, if we want to reduce the dimension from 2 to 1, we can’t simply eliminate one feature, as that could lose a lot of information. That’s where PCA comes into play.

To solve this, PCA tries to find, for the given data, the axis that retains as much of the information (i.e. spread or variance) as possible when the data points are projected onto it. We can clearly see in the first graph of Fig 1 that the two PCA vectors passing through the data (the two black lines denoting the axes) are the axes along which the data has maximum and minimum spread respectively. The adjacent figure shows the PCA-projected data with the two PCA vectors as its axes. These are the new features that PCA builds for us, and they are called principal components.

The axis along which the projected data has maximum variance (or spread) is called the first principal component. The axis along which the projected data has the second-highest variance is called the second principal component, and so on. This extends up to n principal components for n-dimensional data.
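Before getting into the math, here is a minimal NumPy sketch of this idea on synthetic, made-up 2-D data: different axes capture different amounts of variance, and the axis the cloud of points is stretched along captures the most.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data: two correlated features, so the cloud of points is
# stretched along a diagonal direction (roughly 45 degrees).
x = rng.normal(size=500)
data = np.column_stack([x + 0.3 * rng.normal(size=500),
                        x + 0.3 * rng.normal(size=500)])
data = data - data.mean(axis=0)          # centre the data

# Variance of the data when projected onto a few candidate axes (unit vectors).
for angle in [0, 45, 90, 135]:
    theta = np.radians(angle)
    u = np.array([np.cos(theta), np.sin(theta)])   # unit vector for this axis
    projections = data @ u                          # scalar projections onto u
    print(f"axis at {angle:3d} degrees -> variance {projections.var():.3f}")

# The 45-degree axis shows the largest variance: that direction would be the
# first principal component of this toy dataset.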

Why is it used:

PCA is used when we need to tackle the curse of dimensionality in data with linear relationships, i.e. when having too many dimensions (features) in the data introduces noise and difficulties (whether the data is audio, images or text). This gets especially bad when the features are on different scales (e.g. weight, length, area, speed, power, temperature, volume, time, cell count, etc.).

When should I use PCA:

1. Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration?

2. Do you want to ensure your variables are independent of one another?

3. Are you comfortable making your independent variables less interpretable?

If you answered “yes” to all three questions, then PCA is a good method to use. If you answered “no” to question 3, you should not use PCA.

How does it work:

Prerequisite:

The first step we need to perform before PCA is data standardization, i.e. scaling the data so that each feature has mean 0 and variance 1. (Why do we need to do that? Please check this.)
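As a quick illustration of what standardization does (the numbers below are made up), each column is shifted to mean 0 and scaled to unit variance; this is what sklearn’s StandardScaler does for us in the toy example later on.

import numpy as np

# Toy data: two features on very different scales (say, weight in kg and height in cm).
X = np.array([[60.0, 170.0],
              [72.0, 181.0],
              [55.0, 160.0],
              [90.0, 190.0]])

# Standardize each column: subtract its mean and divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # ~[0, 0] -> each feature now has mean 0
print(X_std.std(axis=0))    # [1, 1]  -> and unit variance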

Intuition:

Intuitively, as we have already discussed, PCA tries to find axes with maximum variance. It simply looks something like this.

Fig 2: explaining how PCA tries to find the best axes.

Now, these new axes (or principal components) represent new features, f’1 and f’2, where f’1 is the feature with maximum variance and f’2 the feature with minimum variance. All of this is for a two-dimensional dataset. We can now extend the concept to an n-dimensional dataset, where f’1 represents the first principal component, f’2 the second principal component, and so on.

Let u1 be a unit vector in the direction of f’1, so ||u1|| = 1. The projection of a data point xi onto u1 is then (u1^T xi) · u1, and its length (the scalar projection) is u1^T xi.

Let x̄’ be the mean of all those projected points and x̄ be the mean of the original points in the dataset. The projection of x̄ onto u1 is then x̄’ = u1^T x̄.

Now we need to find the unit vector u1 such that the variance of all these projected points is maximum, i.e. we need to find the best unit vector such that, when all the original data points are projected onto it, the variance along that axis is maximum.

The variance of all these projected points along the axis u1 is

variance = (1/n) · Σ (u1^T xi − u1^T x̄)², where the sum runs over all n data points xi.

As we have column-standardized the data, the mean of the original points is the zero vector, i.e. x̄ = [0, 0, …, 0] (one zero for every feature). So the equation now becomes

variance = (1/n) · Σ (u1^T xi)²

So, the final optimization problem that we need to solve is

maximize over u1:   (1/n) · Σ (u1^T xi)²    such that    u1^T u1 = 1

Here, the “maximize” part is called the objective function and the “such that” part is called the constraint. If there were no constraint, the objective could be made arbitrarily large by letting the entries of u1 blow up towards [∞, ∞, …, ∞]; the constraint rules out that degenerate solution.

We solve this optimization problem using Lagrange multipliers (which I’ll be discussing in another blog). The final takeaways are:

  • If X is the given dataset of size (n × d), with n points and d column-standardized features, its covariance matrix is S = (1/n) · X^T X. S is a d × d matrix in which S(j, j) is the variance of feature j and S(j, i) is the covariance between feature j and feature i.

  • For the covariance matrix S there exist eigenvalues λi and corresponding eigenvectors Vi such that

S · Vi = λi · Vi

and every pair of eigenvectors is perpendicular (orthogonal) to each other.

The solution to the optimization problem is:

  • For the direction of u1: if λ1 > λ2 > λ3 > … > λ10 are the top ten eigenvalues of the covariance matrix S, then the corresponding eigenvectors V1, V2, V3, …, V10 are the first ten principal components of the given dataset, i.e. the direction of V1 is the direction of the unit vector u1 with maximum variance, the direction of V2 is the direction of the unit vector u2 with the second-highest variance, and so on.
  • To know the variance along a particular axis: just as the eigenvectors give the directions, the eigenvalues tell us the variance along the corresponding eigenvectors. The fraction of the total variance explained by Vi is λi / (λ1 + λ2 + … + λd) (see the sketch below).
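Here is a minimal NumPy sketch of these takeaways on a small synthetic dataset (all variable names are my own): it builds the covariance matrix of the standardized data, eigendecomposes it, and projects the data onto the top two eigenvectors.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic dataset: n = 200 points, d = 3 features, the first two strongly correlated.
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
               z + 0.2 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# Column-standardize, then form the covariance matrix S = (1/n) X^T X.
X = (X - X.mean(axis=0)) / X.std(axis=0)
n = X.shape[0]
S = (X.T @ X) / n

# Eigen-decomposition of S (np.linalg.eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # sort in descending order

# The eigenvector with the largest eigenvalue is the first principal component,
# and lambda_i / sum(lambda_j) is the fraction of the variance it explains.
print("eigenvalues:              ", eigvals)
print("explained variance ratios:", eigvals / eigvals.sum())

# Projecting X onto the top-2 eigenvectors gives the PCA-transformed data.
X_pca = X @ eigvecs[:, :2]
print("projected shape:", X_pca.shape)   # (200, 2)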

Geometric interpretation of λi’s and Vi’s:

  • Let λ1, λ2 be the top two eigenvalues and V1, V2 the first two principal components. As the distribution of the data changes, the values of λ1 and λ2 change with it: the more the cloud of points is stretched along V1, the larger λ1 is relative to λ2. And if the distribution is perfectly circular, then λ1 = λ2.
  • We can also use these eigenvalues the other way around, i.e. if we have to preserve 80% of the information, we need to find the least possible dimension k with which we can still preserve that 80%, i.e.

(λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λd) ≥ 0.80

Here, we need to find the smallest value of k for which this inequality holds (a short sketch of this calculation follows below).
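For example, here is a short sketch of that calculation with hypothetical eigenvalues (already sorted in descending order):

import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted in descending order.
eigvals = np.array([4.2, 2.1, 1.0, 0.4, 0.2, 0.1])

# Cumulative fraction of the total variance explained by the top-k components.
explained = np.cumsum(eigvals) / eigvals.sum()
print(explained)   # [0.525  0.7875 0.9125 0.9625 0.9875 1.    ]

# Smallest k that preserves at least 80% of the information.
k = int(np.argmax(explained >= 0.80)) + 1
print("smallest k preserving 80% of the variance:", k)   # 3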

Limitations of PCA:

  • Circular distributions: here PCA fails to find a single best axis along which the spread is maximum, because every axis through the centre, whatever angle it makes with the x-axis, gives the same variance. Hence PCA fails.
  • Clustered distributions: if we project the points onto the axis that passes through the larger clusters, we lose a lot of the information contained in the smaller clusters; and whichever axis we choose, projecting onto it throws away a lot of information.
  • Non-linear curves: unlike with linear distributions, if the data lies along a curve (say a sine-like curve) and we try to find a principal component that maximizes the variance along it, we lose the information and structure of that curve.

A toy example of PCA:

I have taken the breast cancer dataset, which is readily available in the sklearn.datasets package in Python.

Step-1: Loading all necessary packages

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits import mplot3d

Step-2: Loading Dataset

#loads breast cancer dataset into variable by name cancer. 
cancer = load_breast_cancer()
# creating dataframe
df = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])
# checking head of dataframe
df.head()

This piece of code loads the data into a variable named cancer. Then I’m printing the top five rows of the data frame created from this data. The following is what I get as a result.

Fig: top five rows of the dataset.

So clearly, we can see that there are 30 columns and each of them holds numerical values, so we can go ahead and apply PCA.

Step-3: Standardizing and applying PCA

scalar = StandardScaler() 

# Standardizing
scalar.fit(df)
scaled_data = scalar.transform(df)

# applying PCA
pca = PCA(n_components = 3)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
x_pca.shape

In this step, we standardize the data (i.e. df) and apply PCA to it. Here, n_components represents the number of principal components (i.e. new features) that we want to keep.
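As a quick sanity check (this reuses the pca object fitted above), sklearn’s PCA exposes explained_variance_ratio_, which tells us how much of the total variance each of the three components retains:

# Fraction of the total variance captured by each of the three principal components.
print(pca.explained_variance_ratio_)

# Total fraction of the information preserved after reducing 30 features to 3.
print(pca.explained_variance_ratio_.sum())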

Step-4: 2-D results

plt.figure(figsize =(8, 6)) 

plt.scatter(x_pca[:, 0], x_pca[:, 1], c = cancer['target'])

# labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

In this step, I’m just displaying the 2-Dimensional plot of the data and it looks something like this.

Fig 3: 2-D results of PCA applied upon breast cancer dataset.

Step-5: 3-D results

ax = plt.axes(projection='3d')
ax.scatter(x_pca[:, 0], x_pca[:, 1], x_pca[:, 2], c=cancer['target'], cmap='viridis', linewidth=1);

This is where things get interesting. The above is the code for the 3D plot, and the following are the images that I got as output. (These images were taken after running the .py code on my desktop, as Jupyter notebooks’ .ipynb files don’t support interactive 3-dimensional plots.)

Fig 4: 3-D results of PCA applied upon breast cancer dataset.

By rotating the plot, I get the following plots.

Fig-5: Plot emphasizing the second and third principal components
Fig-6: Plot emphasizing the first and third principal components
Fig-7: Plot emphasizing the first and second principal components

Key Takeaways from the above plots(Fig-5, Fig-6 and Fig-7):

  • Variance along the first component > variance along the second component > variance along the third component. Check the values: along the first principal component they range from roughly -7 to 20 (range = 27), along the second from roughly -9 to 13 (range = 22), and along the third from roughly -6 to 10 (range = 16). A quick numerical check follows below.
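To back this observation up numerically, here is a quick check that reuses the x_pca array from Step-3 (np.ptp gives the max-minus-min range of each column):

# Range (max - min) and variance of the projected data along each principal component.
print("ranges:   ", np.ptp(x_pca, axis=0))
print("variances:", x_pca.var(axis=0))
# Both should decrease from the first component to the third.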

Code:

Check out my GitHub link for the code.
