# [1,3]: Exploration

The exploratory part of our analyses goes by the name of EDA (Exploratory Data Analysis). It revolves around basic statistics: means, standard deviations, distributions, histograms and so on.
We will mostly use pandas, as it provides most of the needed functionality through DataFrame methods.
First and foremost, the head and tail methods, which take as an optional argument the number of rows to show (5 by default). They are useful to get a grasp of the kind of data we are working with. Unfortunately, missing data elsewhere in the frame won't show up here, but these methods are a good starting point nonetheless.
If we want a sampling of the rows that is less biased towards the beginning or the end of the DataFrame, the sample method also takes an integer and randomly draws that many rows for us.
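A minimal sketch of the three methods, on a hypothetical toy DataFrame (the column names merely echo the dataset used later in this chapter):

```python
import pandas as pd

# Hypothetical toy data, just for illustration.
df = pd.DataFrame({'APSI': [68, 51, 87, 71, 95, 16],
                   'SPS': [0.59, 0.27, 0.95, 0.61, 1.00, -1.00]})

print(df.head(3))                     # first 3 rows
print(df.tail(2))                     # last 2 rows
print(df.sample(2))                   # 2 rows drawn at random
print(df.sample(2, random_state=42))  # pass a seed for reproducible draws
```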
Then there is the describe method, which computes some basic statistics for each feature column: mean, min, max, quartiles and standard deviation for numeric columns, and other metrics (count, number of unique values, top value and its frequency) for categorical columns.

 print(df.describe())
               APSI    APSSI           SPS           SCI      braliSCI  \
count  94950.000000  94950.0  94950.000000  94950.000000  94950.000000
mean      68.557504     -1.0      0.595017      0.930397     93.983044
std       19.517234      0.0      0.334602      0.221900     17.189719
min       16.000000     -1.0     -1.000000     -1.000000     66.000000
25%       51.000000     -1.0      0.274187      0.817900     80.000000
50%       71.000000     -1.0      0.610669      0.958400     93.000000
75%       87.000000     -1.0      0.957627      1.060100    104.000000
max       95.000000     -1.0      1.000000      2.014000    175.000000
               uniq             k
count  94950.000000  94950.000000
mean       1.652185      3.710321
std        1.497587      2.769626
min        1.000000      2.000000
25%        1.000000      2.000000
50%        1.000000      3.000000
75%        2.000000      5.000000
max       21.000000     15.000000
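By default describe summarises only the numeric columns, which is why no categorical metrics appear above. A small sketch on hypothetical mixed data shows how to reach them through the include parameter:

```python
import pandas as pd

# Hypothetical frame mixing a numeric and a categorical column.
df = pd.DataFrame({'k': [2, 3, 5, 2, 15],
                   'method': ['raf', 'sparse', 'raf', 'mafftx', 'raf']})

print(df.describe())                   # numeric columns only (default)
print(df.describe(include='object'))   # categorical: count, unique, top, freq
print(df.describe(include='all'))      # both at once (NaN where not applicable)
```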

The describe method can be applied to a groupby object too!

 by_meth = df.groupby('method')
by_meth.describe().SPS
#we'll limit this view to the SPS column only, for the sake of clarity
            count      mean       std       min       25%       50%       75%  \
method
beagles   18990.0  0.568329  0.347455  0.020000  0.221786  0.581232  0.949239
mafftx    18990.0  0.678755  0.313320 -1.000000  0.352941  0.762712  1.000000
mlocarna  18990.0  0.602044  0.327181  0.000000  0.283750  0.655109  0.929293
raf       18990.0  0.649656  0.317930 -1.000000  0.337838  0.682171  0.989003
sparse    18990.0  0.476302  0.328543  0.002237  0.181818  0.363495  0.809592
          max
method
beagles   1.0
mafftx    1.0
mlocarna  1.0
raf       1.0
sparse    1.0
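When the full describe table is more than we need, groupby objects also accept agg, which computes only the statistics we ask for. A minimal sketch with hypothetical data standing in for the by_meth object above:

```python
import pandas as pd

# Hypothetical data; df.groupby('method') mirrors the by_meth object above.
df = pd.DataFrame({'method': ['raf', 'raf', 'sparse', 'sparse'],
                   'SPS': [0.6, 0.7, 0.4, 0.5]})
by_meth = df.groupby('method')

# Pick exactly the statistics we care about, one column per statistic.
print(by_meth['SPS'].agg(['mean', 'std', 'min', 'max']))
```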

Assuming we have already cleaned our data, we can then run a first, naive check for correlations between variables with the corr method, which by default computes pairwise Pearson correlations between features:

 print(df.corr())
              APSI  APSSI       SPS       SCI  braliSCI      uniq         k
APSI      1.000000    NaN  0.477527 -0.161952 -0.370620  0.181442  0.072330
APSSI          NaN    NaN       NaN       NaN       NaN       NaN       NaN
SPS       0.477527    NaN  1.000000  0.253083  0.118354  0.206748 -0.201991
SCI      -0.161952    NaN  0.253083  1.000000  0.589294  0.055451 -0.163018
braliSCI -0.370620    NaN  0.118354  0.589294  1.000000  0.052343 -0.101329
uniq      0.181442    NaN  0.206748  0.055451  0.052343  1.000000 -0.106866
k         0.072330    NaN -0.201991 -0.163018 -0.101329 -0.106866  1.000000

The correlation matrix may give us hints on which variables vary together. Note that the APSSI column comes out as all NaN: as describe showed above, it is constant (standard deviation zero), so its correlation with anything is undefined.
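Pearson only captures linear relationships; corr also accepts method='spearman' (or 'kendall'), a rank-based measure that catches any monotonic relation. A small sketch on hypothetical data where the difference is visible:

```python
import pandas as pd

# y = x**2 is perfectly monotonic on positive x, but not linear.
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [1, 4, 9, 16, 25]})

print(df.corr())                    # Pearson (default): below 1
print(df.corr(method='spearman'))   # rank-based: exactly 1
```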
To better visualize a correlation matrix, seaborn's heatmap comes to the rescue:

import seaborn as sns
%matplotlib inline
sns.heatmap(df.corr())

or, even better, using the annot parameter (short for annotation), which writes each correlation value inside its cell:

 import seaborn as sns
%matplotlib inline
sns.heatmap(df.corr(), annot=True)
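Since correlations always live in [-1, 1], pinning the colour scale with vmin and vmax (and a diverging colormap) keeps heatmaps comparable across datasets. A sketch on hypothetical data, using a headless backend so it also runs outside a notebook:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; in a notebook, %matplotlib inline suffices
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Hypothetical frame, just to have a correlation matrix to plot.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1], 'c': [1, 3, 2, 4]})

ax = sns.heatmap(df.corr(), annot=True, fmt='.2f',
                 vmin=-1, vmax=1, cmap='coolwarm')
plt.savefig('corr_heatmap.png')
```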


Other things to explore are for example the number of unique elements in certain columns:

 print(df.k.unique()) #prints out an array of the unique elements in the column
print(df.k.nunique()) #the length of the above array ((n)umber of (unique) elements)
##nunique is equivalent to a call to len(df.k.unique())
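A close companion of unique is value_counts, which pairs each distinct element with its number of occurrences, sorted by frequency. A minimal sketch on a hypothetical column:

```python
import pandas as pd

# Hypothetical column standing in for df.k.
k = pd.Series([2, 2, 3, 5, 2, 3])

print(k.unique())        # array of distinct values
print(k.nunique())       # how many distinct values there are
print(k.value_counts())  # each distinct value with its frequency, most common first
```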