# Principal Component Analysis

The goal of a Principal Component Analysis (PCA) is to identify a linear transformation $Y$ of your observed variables that highlights and repackages the information contained in the initial matrix $X$. The method is especially useful when working with a large number of variables and a lower-dimensional representation is needed.

In this tutorial we will walk through PCA step by step, from its mathematical formulation to its implementation. Python provides many libraries with excellent and efficient PCA routines, but I think that taking the problem apart is the best way to understand its results and pitfalls.

The file `df_pol.pkl` can be downloaded from this repository.
The package `mlxtend` can be installed with `pip`.

# Let's bulk-import everything here; we will use it all later

import os
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import pickle as pkl

from mpl_toolkits.mplot3d import Axes3D
from mlxtend.preprocessing import standardize

with open('df_pol.pkl', 'rb') as file:
    df = pkl.load(file)

Let’s convert the dataframe into a numpy array.

values=df.values

Let’s standardize the array: PCA behaves better when all variables share a similar range of values (after standardization each column has zero mean and unit variance).

values=standardize(values,columns=list(range(len(list(df.columns)))))
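For reference, `mlxtend`'s `standardize` is equivalent (up to the choice of `ddof`) to subtracting each column's mean and dividing by its standard deviation. A minimal NumPy sketch, assuming a 2-D array `X` with variables in columns:

```python
import numpy as np

def standardize_np(X, ddof=0):
    """Center each column to zero mean and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = standardize_np(X)
# each column of Z now has mean 0 and (population) standard deviation 1
```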

Let’s define a general function that builds the covariance matrix for an arbitrary number of variables.

def Cov_Matrix(m):
    m = m.T
    n = len(m)                  # number of variables, i.e. the size of the square matrix
    items = len(m[0])           # number of observations per variable
    cov_mat = np.zeros((n, n))  # n x n square matrix filled with zeros
    means = [sum(x) / items for x in m]  # means of all the variables
    for i in range(n):
        for k in range(n):      # iterate over every pair of variables
            somm = []
            for num_elem in range(items):
                somm.append((m[i][num_elem] - means[i]) * (m[k][num_elem] - means[k]))
            cov_mat[i][k] = sum(somm) / (items - 1)
    return cov_mat

cov_mat = Cov_Matrix(values)
print(cov_mat)

array([[ 1.00209644, -0.99478977,  0.97574551,  0.03934338, -0.10307861],
[-0.99478977,  1.00209644, -0.99532921, -0.07147684,  0.06640371],
[ 0.97574551, -0.99532921,  1.00209644,  0.12582169, -0.02944231],
[ 0.03934338, -0.07147684,  0.12582169,  1.00209644, -0.33724013],
[-0.10307861,  0.06640371, -0.02944231, -0.33724013,  1.00209644]])
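As a sanity check, NumPy computes the same matrix with `np.cov` (note `rowvar=False`, since our variables sit in columns; `np.cov` uses the same $n-1$ denominator). A sketch on stand-in random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # stand-in dataset: 100 observations, 5 variables
n = X.shape[0]
means = X.mean(axis=0)

# same pairwise computation as the hand-rolled function above
manual = np.zeros((5, 5))
for i in range(5):
    for k in range(5):
        manual[i, k] = np.sum((X[:, i] - means[i]) * (X[:, k] - means[k])) / (n - 1)

# NumPy's built-in agrees entry by entry
assert np.allclose(manual, np.cov(X, rowvar=False))
```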

To do the PCA we must find eigenvectors and eigenvalues of the covariance matrix.

Given a square matrix $A$ of size $n$, let $\lambda$ be a number (real or complex) and let $x$ be a non-null vector. They are called respectively eigenvalue and eigenvector of $A$ if the following relation holds:

$Ax = \lambda x$

$x$ is said to be the eigenvector associated with the eigenvalue $\lambda$.

An eigenvector is a vector $x \neq 0$ whose direction is unchanged by the linear transformation represented by the matrix $A$; only its magnitude is scaled, by a factor $\lambda$, its eigenvalue.
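A concrete $2 \times 2$ example makes the definition tangible (the matrix here is an arbitrary symmetric example, not taken from our dataset):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
w, v = np.linalg.eig(A)   # eigenvalues 3 and 1; eigenvectors along (1,1) and (1,-1)

for i in range(2):
    # the defining relation A x = lambda x holds for every pair
    assert np.allclose(A @ v[:, i], w[i] * v[:, i])
```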

w,v=np.linalg.eig(cov_mat)       # w = eigenvalues        v = eigenvectors (one per column)
print(w)
print(v)

[2.99852253e+00 1.32035126e+00 6.74701254e-01 1.65148977e-02
3.92238123e-04]
[[-0.57257338  0.06265364  0.08656371  0.72873719 -0.36011536]
[ 0.57631898 -0.06598755 -0.01167288  0.0584317  -0.81237402]
[-0.57325376  0.05586665 -0.08673052 -0.67136631 -0.45826208]
[-0.07998809 -0.70174666 -0.70208432  0.08916458  0.01675739]
[ 0.07069778  0.70437966 -0.70137429  0.08273332  0.00896814]]

We can verify that the equation holds for each pair $(\lambda, x)$.

print(np.dot(cov_mat,v[:,0]))
print(np.dot(w[0],v[:,0]))

#array([-1.7168742 ,  1.72810546, -1.71891433, -0.2398461 ,  0.21198889])

print(np.dot(cov_mat,v[:,1]))
print(np.dot(w[1],v[:,1]))

#array([ 0.08272481, -0.08712675,  0.07376361, -0.92655209,  0.93002857])

print(np.dot(cov_mat,v[:,2]))
print(np.dot(w[2],v[:,2]))

#array([ 0.05840464, -0.0078757 , -0.05851719, -0.47369717, -0.47321811])

print(np.dot(cov_mat,v[:,3]))
print(np.dot(w[3],v[:,3]))

#array([ 0.01203502,  0.00096499, -0.01108755,  0.00147254,  0.00136633])

print(np.dot(cov_mat,v[:,4]))
print(np.dot(w[4],v[:,4]))

#array([-1.41250974e-04, -3.18644062e-04, -1.79747859e-04,  6.57288863e-06, 3.51764717e-06])

As you can see from the output above, the relation $Ax = \lambda x$ that defines eigenvalues and eigenvectors is satisfied for every computed pair.
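The five pairwise checks above can be collapsed into one vectorized comparison: for the matrix $V$ of column eigenvectors, $AV = V\,\mathrm{diag}(\lambda)$, i.e. `cov_mat @ v` should equal `v * w`. A sketch on a stand-in symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
cov_mat = (M + M.T) / 2              # any symmetric matrix works as a stand-in

w, v = np.linalg.eig(cov_mat)        # w: eigenvalues, v: eigenvectors (columns)
# broadcasting multiplies each column of v by the matching eigenvalue
assert np.allclose(cov_mat @ v, v * w)
```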

These eigenvectors are the axes of the new subspace onto which we will project the dataset to reduce its dimensionality. To decide which eigenvectors to keep, we look at the corresponding eigenvalues: simply put, eigenvectors with smaller eigenvalues carry less information about the data distribution. The standard approach is to sort the eigenvalues in descending order and keep the eigenvectors associated with the first $k$ of them.

w_v=[list(a) for a in zip(w,v.T)]
for elem in w_v:
    print(elem)

Let us define a function that does this sorting for us.

def desc_sort_eig(eig_list):
    # eigenvalues sorted from largest to smallest
    eig_val = np.sort([elem[0] for elem in eig_list])[::-1]
    # reorder the (eigenvalue, eigenvector) pairs to follow that ranking
    sort_list = [elem for x in eig_val for elem in eig_list if x == elem[0]]
    return sort_list

sort_w_c = desc_sort_eig(w_v)
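The same descending sort can be written more idiomatically with `np.argsort`, which also avoids matching eigenvalues by equality (fragile if two eigenvalues coincide). A sketch with stand-in values:

```python
import numpy as np

w = np.array([0.67, 3.00, 1.32, 0.016, 0.0004])  # stand-in eigenvalues
v = np.eye(5)                                    # stand-in eigenvectors (columns)

order = np.argsort(w)[::-1]      # positions of the eigenvalues, largest first
w_sorted = w[order]
v_sorted = v[:, order]           # reorder the eigenvector columns to match
```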

We shall plot our eigenvalues to get a grasp of the situation (a so-called scree plot):

plt.figure(figsize=(20,10))
plt.plot([elem[0] for elem in sort_w_c],marker='+',ms=20)

Points drawn with the ‘+’ marker are the magnitudes of our computed eigenvalues. The highest values, together with their eigenvectors, are called Principal Components (PC). By setting a threshold on the total variability we can select a number $k$ of components such that:

$(\lambda_1 + \dots + \lambda_k) \, / \, (\lambda_1 + \dots + \lambda_n) \sim 90\%$
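This ratio is just a cumulative sum over the sorted eigenvalues. A compact NumPy sketch, using stand-in values rounded from the eigenvalues computed earlier:

```python
import numpy as np

eigvals = np.array([2.9985, 1.3204, 0.6747, 0.0165, 0.0004])  # sorted descending
explained = np.cumsum(eigvals) / eigvals.sum()   # cumulative fraction of total variance

# smallest k whose cumulative explained variance reaches 90%
k = int(np.searchsorted(explained, 0.9)) + 1
# with these values k == 3, i.e. three components are enough
```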

Let us define a function that performs this cut automatically:

def choose_pc(sort_w_c):
    pc = []
    pc.append(sort_w_c[0][0])            # always keep the largest eigenvalue
    sum_all_pc = sum([elem[0] for elem in sort_w_c])
    perc = sum(pc) / sum_all_pc
    print('PC :\t\t{}\n%Variance :\t{}'.format(pc, perc))
    for elem in sort_w_c[1:]:
        if perc <= 0.9:                  # keep adding components until 90% is reached
            pc.append(elem[0])
            perc = sum(pc) / sum_all_pc
            print('PC :\t\t{}\n%Variance :\t{}'.format(pc, perc))
    # return the matrix of selected eigenvectors (one per column) and the reached variance
    return [np.array([list(y[1]) for x in pc for y in sort_w_c if x == y[0]]).T, perc]

pc = choose_pc(sort_w_c)
print("Variance:\t{}".format(pc[1]))

"""
PC :		[2.998522531538498]
%Variance :	0.5984498943698177
PC :		[2.998522531538498, 1.3203512593445306]
%Variance :	0.8619676980130566
PC :		[2.998522531538498, 1.3203512593445306, 0.6747012535703645]
%Variance :	0.9966256469473931
Variance:	0.9966256469473931
"""

To transform the original data we apply the linear transformation given by our eigenvectors: we multiply the original (standardized) dataset by the matrix whose columns are the selected principal components.

trsf_dataset=np.dot(values,pc[0])
trsf_dataset
"""
array([[-5.79517326e-01,  4.75942809e-01,  5.59270844e-04],
[ 4.49798278e-01,  8.67372553e-01,  1.32260514e-01],
[ 6.03057915e-01, -8.59662445e-01,  1.46834214e-01],
...,
[-7.27291336e-01, -5.02569354e-01,  1.06614169e+00],
[-2.80988899e+00, -4.08650989e+00, -2.44953248e+00],
[ 5.03270889e-01,  1.56551128e-02,  1.02652537e+00]])
"""

Since the resulting dataset has only 3 principal components, we can plot the data:

fig=plt.figure(figsize=(20,10))
ax=fig.add_subplot(projection='3d')   # replaces the deprecated Axes3D(fig) constructor
sequence_containing_x_vals = trsf_dataset[:,0]
sequence_containing_y_vals = trsf_dataset[:,1]
sequence_containing_z_vals = trsf_dataset[:,2]
ax.scatter(sequence_containing_x_vals, sequence_containing_y_vals, sequence_containing_z_vals)
plt.show()

If we limit ourselves to the first two PCs, we can draw a 2-D scatter plot.

trsf_dataset2D=np.dot(values,pc[0][:,:2])
fig=plt.figure(figsize=(20,10))
sequence_containing_x_vals = trsf_dataset2D[:,0]
sequence_containing_y_vals = trsf_dataset2D[:,1]
plt.scatter(sequence_containing_x_vals, sequence_containing_y_vals,s=2)
plt.show()

# for each eigenvalue, recover its position in the descending ranking
index_col=[]
for elem in np.array(w):
    index_col.append(list(np.array(sort_w_c)[:,0]).index(elem))
# label the new columns with the original variable names at those positions
columns_df_PCA=[col for col in df.columns for index in index_col[:len(pc[0].T)] if list(df.columns).index(col)==index]
trsf_df=pd.DataFrame(trsf_dataset,index=df.index,columns=columns_df_PCA)
trsf_df

"""this dataframe will have only 3 dimensions"""

We can save the modified dataframe as a Python object with pickle, in order to reuse it in a different analysis.

with open('df_pol_PCA.pkl','wb') as pkl_file:
    pkl.dump(trsf_df,pkl_file)
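A quick round trip confirms the pickled object is recoverable. A sketch with a throwaway DataFrame and a temporary file (names here are illustrative only):

```python
import os
import pickle as pkl
import tempfile

import pandas as pd

df_demo = pd.DataFrame({'PC1': [0.1, 0.2], 'PC2': [0.3, 0.4]})
path = os.path.join(tempfile.gettempdir(), 'df_demo.pkl')

with open(path, 'wb') as f:          # 'with' closes the file even on errors
    pkl.dump(df_demo, f)
with open(path, 'rb') as f:
    df_back = pkl.load(f)

assert df_demo.equals(df_back)       # identical content after the round trip
os.remove(path)
```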