Principal Component Analysis

The goal of a Principal Component Analysis (PCA) is to identify linear transformations $Y_i$ of the observed variables that concentrate and reorganize the information contained in the initial data matrix $X$. The method is especially useful when working with a large number of variables and a lower-dimensional representation is needed.

In this tutorial we will go through PCA step by step, following its mathematical formulation all the way to its implementation. Python provides many libraries with excellent and efficient PCA methods, but I think that taking the problem apart is the best way to understand its results and its quirks.

The file $df\_pol.pkl$ can be downloaded from this repository.
The package mlxtend can be installed with pip.

# let's bulk import all these things; we will use them later

import os
import itertools 

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import pickle as pkl 

from mpl_toolkits.mplot3d import Axes3D
from mlxtend.preprocessing import standardize

# load the dataframe from the pickle file
with open('df_pol.pkl','rb') as file:
    df=pkl.load(file)

Let’s convert the dataframe into a numpy array.

values=df.values

Let’s standardize our array (PCA works best when all variables share a similar range of values).

values=standardize(values,columns=list(range(len(list(df.columns)))))
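Under the hood, standardize simply z-scores each selected column (it subtracts the column mean and divides by its standard deviation). A roughly equivalent NumPy one-liner, assuming mlxtend's default population standard deviation (ddof=0) and using an illustrative variable name, would be:

values_np = (df.values - df.values.mean(axis=0)) / df.values.std(axis=0)  # z-score every column (ddof=0)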

Let’s define a general function that creates a Covariance matrix starting from a custom number of variables.
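Concretely, the entry in row $i$ and column $k$ of this matrix is the sample covariance between variables $i$ and $k$, computed over the $N$ observations (this is exactly what the function below implements):

$cov(X_i, X_k) = \frac{1}{N-1}\sum_{j=1}^{N}(x_{ij}-\bar{x}_i)(x_{kj}-\bar{x}_k)$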

def Cov_Matrix(m):
    m=m.T
    n=len(m) # number of variables, i.e. the size of the square matrix
    items=len(m[0]) # number of observations per variable
    cov_mat=np.zeros((n,n)) # create an n x n square matrix filled with zeros
    means=[sum(x)/items for x in m] # mean of every variable
    for i in list(range(n)):
        for k in list(range(n)): # iterate over every possible pair of variable indices
            somm=[]
            for num_elem in list(range(items)):
                somm.append((m[i][num_elem]-means[i])*(m[k][num_elem]-means[k]))
            cov_mat[i][k]=(sum(somm))/(items-1)
    return cov_mat
cov_mat=Cov_Matrix(values)
print(cov_mat)

array([[ 1.00209644, -0.99478977,  0.97574551,  0.03934338, -0.10307861],
       [-0.99478977,  1.00209644, -0.99532921, -0.07147684,  0.06640371],
       [ 0.97574551, -0.99532921,  1.00209644,  0.12582169, -0.02944231],
       [ 0.03934338, -0.07147684,  0.12582169,  1.00209644, -0.33724013],
       [-0.10307861,  0.06640371, -0.02944231, -0.33724013,  1.00209644]])
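As a quick sanity check (not part of the original walkthrough), the hand-rolled matrix should essentially coincide with NumPy's built-in covariance routine:

# compare our covariance matrix with NumPy's implementation (columns as variables)
print(np.allclose(cov_mat, np.cov(values, rowvar=False)))   # expected: True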

To do the PCA we must find eigenvectors and eigenvalues of the covariance matrix.

Given a square matrix $A$ of size $n$, let $\lambda$ be a number (real or complex) and let $x$ be a non-null vector. They are called respectively eigenvalue and eigenvector of $A$ if the following relation holds:

$Ax = \lambda x$

$x$ is said to be the eigenvector associated with the eigenvalue $\lambda$.

An eigenvector is a vector $x \neq 0$ whose direction is not modified by the linear transformation represented by the matrix $A$; only its magnitude changes, by a factor $\lambda$, its eigenvalue.
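A minimal worked example: take $A = \begin{pmatrix}2 & 0\\0 & 3\end{pmatrix}$. The vector $x = (1,0)^T$ satisfies $Ax = (2,0)^T = 2x$, so $\lambda = 2$ is an eigenvalue with eigenvector $x$; likewise $(0,1)^T$ is an eigenvector with eigenvalue $3$.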

w,v=np.linalg.eig(cov_mat)       # w = eigenvalues        v = eigenvectors (one per column)
print(w)
print(v)

[2.99852253e+00 1.32035126e+00 6.74701254e-01 1.65148977e-02
 3.92238123e-04]
[[-0.57257338  0.06265364  0.08656371  0.72873719 -0.36011536]
 [ 0.57631898 -0.06598755 -0.01167288  0.0584317  -0.81237402]
 [-0.57325376  0.05586665 -0.08673052 -0.67136631 -0.45826208]
 [-0.07998809 -0.70174666 -0.70208432  0.08916458  0.01675739]
 [ 0.07069778  0.70437966 -0.70137429  0.08273332  0.00896814]]

We can verify that the equation holds for each pair $(\lambda, x)$.

print(np.dot(cov_mat,v[:,0]))
print(np.dot(w[0],v[:,0]))

#array([-1.7168742 ,  1.72810546, -1.71891433, -0.2398461 ,  0.21198889])


print(np.dot(cov_mat,v[:,1]))
print(np.dot(w[1],v[:,1]))

#array([ 0.08272481, -0.08712675,  0.07376361, -0.92655209,  0.93002857])

print(np.dot(cov_mat,v[:,2]))
print(np.dot(w[2],v[:,2]))

#array([ 0.05840464, -0.0078757 , -0.05851719, -0.47369717, -0.47321811])

print(np.dot(cov_mat,v[:,3]))
print(np.dot(w[3],v[:,3]))

#array([ 0.01203502,  0.00096499, -0.01108755,  0.00147254,  0.00136633])

print(np.dot(cov_mat,v[:,4]))
print(np.dot(w[4],v[:,4]))

#array([-1.41250974e-04, -3.18644062e-04, -1.79747859e-04,  6.57288863e-06, 3.51764717e-06])

As you can see by the previous code, the equation that determines eigenvalues and eigenvectors is satisfied for every computed value.
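The same check can be performed in a single pass; a small sketch (not in the original code) using np.allclose:

# verify A·x = λ·x for every (eigenvalue, eigenvector) pair
for i in range(len(w)):
    print(np.allclose(np.dot(cov_mat, v[:,i]), w[i]*v[:,i]))   # expected: True for every i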

These eigenvectors are the axes of the new subspace onto which we will project our dataset to reduce its dimensionality. To decide which eigenvectors to keep, we look at the corresponding eigenvalues. In simple terms, eigenvectors with smaller eigenvalues carry less information about the data distribution. The standard approach is to rank the eigenvalues from highest to lowest and keep the eigenvectors associated with the first $k$ of them.

w_v=[list(a) for a in zip(w,v.T)]
for elem in w_v:
    print(elem)

Let us define a function which does this sorting for us.

def desc_sort_eig(eig_list):
    # eigenvalues sorted in descending order
    eig_val=np.sort([elem[0] for elem in eig_list])[::-1]
    # reorder the (eigenvalue, eigenvector) pairs to follow that order
    sort_list=[elem for x in eig_val for elem in eig_list if x==elem[0]]
    return(sort_list)
sort_w_c=desc_sort_eig(w_v)
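As an aside, the same descending ordering can be obtained more compactly with NumPy's argsort; a small sketch equivalent to desc_sort_eig:

# indices of the eigenvalues from largest to smallest
order = np.argsort(w)[::-1]
sorted_pairs = [[w[i], v[:,i]] for i in order]   # same pairs as sort_w_c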

We shall plot our eigenvalues in order to get a grasp of the situation:

plt.figure(figsize=(20,10))
plt.plot([elem[0] for elem in sort_w_c],marker='+',ms=20)
plt.show()
X axis: eigenvalue index; y axis: eigenvalue magnitude.

Points represented by the ‘+’ marker are the magnitudes of our computed eigenvalues. The largest values, along with their eigenvectors, are called Principal Components (PC). By choosing a threshold on the total variability we can select a certain number $k$ of components, that is:

$(\lambda_1 + \dots + \lambda_k) / (\lambda_1 + \dots + \lambda_n) \approx 90\%$
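For a quick preview, the cumulative share of variance explained by the first $k$ eigenvalues can be computed directly with NumPy; a small sketch:

# cumulative explained-variance ratio of the sorted eigenvalues
sorted_vals = np.array([elem[0] for elem in sort_w_c])
print(np.cumsum(sorted_vals)/np.sum(sorted_vals))   # k-th entry is (lambda_1+...+lambda_k)/(lambda_1+...+lambda_n)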

Let us define a function which performs this cut automatically:

def choose_pc(sort_w_c):
    pc=[]
    pc.append(sort_w_c[0][0]) # always keep the largest eigenvalue
    sum_all_pc=sum([elem[0] for elem in sort_w_c]) # total variability
    perc=(sum(pc))/sum_all_pc
    print('PC :\t\t{}\n%Variance :\t{}'.format(pc,perc))
    for elem in sort_w_c[1:]:
        if(perc<=0.9): # keep adding eigenvalues until ~90% of the variance is explained
            pc.append(elem[0])
            perc=(sum(pc))/sum_all_pc
            print('PC :\t\t{}\n%Variance :\t{}'.format(pc,perc))

    # return [matrix of the chosen eigenvectors (as columns), explained variance]
    return([np.array([list(y[1]) for x in pc for y in sort_w_c if x==y[0]]).T,perc])
pc=choose_pc(sort_w_c)
print("Variance:\t{}".format(pc[1]))

"""
PC :		[2.998522531538498]
%Variance :	0.5984498943698177
PC :		[2.998522531538498, 1.3203512593445306]
%Variance :	0.8619676980130566
PC :		[2.998522531538498, 1.3203512593445306, 0.6747012535703645]
%Variance :	0.9966256469473931
Variance:	0.9966256469473931
"""

To transform the original data we apply the linear transformation defined by the selected eigenvectors: we multiply the (standardized) dataset by the matrix of Principal Components.
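In matrix form, writing $X$ for the standardized $N \times 5$ data matrix and $W_k$ for the $5 \times k$ matrix whose columns are the chosen eigenvectors (here $k = 3$), the projection computed below is:

$T = X W_k$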

trsf_dataset=np.dot(values,pc[0])
trsf_dataset
"""
array([[-5.79517326e-01,  4.75942809e-01,  5.59270844e-04],
       [ 4.49798278e-01,  8.67372553e-01,  1.32260514e-01],
       [ 6.03057915e-01, -8.59662445e-01,  1.46834214e-01],
       ...,
       [-7.27291336e-01, -5.02569354e-01,  1.06614169e+00],
       [-2.80988899e+00, -4.08650989e+00, -2.44953248e+00],
       [ 5.03270889e-01,  1.56551128e-02,  1.02652537e+00]])
"""

Since the resulting dataset only has 3 Principal Components, we can plot our data:

fig=plt.figure(figsize=(20,10))
ax = fig.add_subplot(projection='3d')
sequence_containing_x_vals = trsf_dataset[:,0]
sequence_containing_y_vals = trsf_dataset[:,1]
sequence_containing_z_vals = trsf_dataset[:,2]
ax.scatter(sequence_containing_x_vals, sequence_containing_y_vals, sequence_containing_z_vals)
plt.show()
PCA with 3 dimensions. Each axis corresponds to an eigenvector; by construction their dot products are zero.

If we limit ourselves to the first 2 PCs, we can draw a scatter plot in 2 dimensions.

trsf_dataset2D=np.dot(values,pc[0][:,:2])
fig=plt.figure(figsize=(20,10))
sequence_containing_x_vals = trsf_dataset2D[:,0]
sequence_containing_y_vals = trsf_dataset2D[:,1]
plt.scatter(sequence_containing_x_vals, sequence_containing_y_vals,s=2)
plt.show()
PCA with only 2 PC. Each axis corresponds to an eigenvector; by construction their dot products are zero.

Finally, we can wrap the transformed data back into a pandas dataframe, keeping the original index:

index_col=[]
sorted_eigvals=[pair[0] for pair in sort_w_c] # eigenvalues in descending order
for elem in w:
    index_col.append(sorted_eigvals.index(elem)) # rank of each original eigenvalue
columns_df_PCA=[col for col in df.columns for index in index_col[:len(pc[0].T)] if list(df.columns).index(col)==index ]
trsf_df=pd.DataFrame(trsf_dataset,index=df.index,columns=columns_df_PCA)
trsf_df

"""this dataframe will have only 3 dimensions"""

We can save the modified dataframe as a Python object with pickle, so that it can be reused in a different analysis.

pkl_file=open('df_pol_PCA.pkl','wb')
pkl.dump(trsf_df,pkl_file)
pkl_file.close()
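Later on, the saved object can be loaded back the same way; a minimal sketch:

# reload the reduced dataframe in a new session
with open('df_pol_PCA.pkl','rb') as f:
    trsf_df = pkl.load(f)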
