[1,3]: Exploration

The exploratory part of our analysis goes by the name of EDA (Exploratory Data Analysis). It covers basic descriptive statistics: means, standard deviations, Gaussians, histograms, and so on.
We will mostly use pandas, as it provides most of the functions we need as DataFrame methods.
First and foremost come the head and tail methods, which take as argument the number of rows to show. They are useful to get a grasp of the kind of data we are working with. Unfortunately, if there is missing data we won't necessarily see it from here, but these methods are a good starting point nonetheless.
If we want a sampling of the rows that is less biased towards the beginning or the end of the DataFrame, the sample method also takes an integer and randomly samples that many rows for us.
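A minimal sketch of this first look (assuming, as in the rest of this section, that our data lives in a DataFrame called df):

 print(df.head(10))    # first 10 rows
 print(df.tail(10))    # last 10 rows
 print(df.sample(10))  # 10 rows drawn at random
 # head and tail won't reveal missing data, so count it explicitly per column:
 print(df.isnull().sum())
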
Then there is the describe method, which computes, for each feature column, some basic statistics: count, mean, std, min, max and the quartiles for numeric columns, and different metrics (such as unique, top and freq) for categorical columns.

 print(df.describe())
                APSI    APSSI           SPS           SCI      braliSCI  \
 count  94950.000000  94950.0  94950.000000  94950.000000  94950.000000
 mean      68.557504     -1.0      0.595017      0.930397     93.983044
 std       19.517234      0.0      0.334602      0.221900     17.189719
 min       16.000000     -1.0     -1.000000     -1.000000     66.000000
 25%       51.000000     -1.0      0.274187      0.817900     80.000000
 50%       71.000000     -1.0      0.610669      0.958400     93.000000
 75%       87.000000     -1.0      0.957627      1.060100    104.000000
 max       95.000000     -1.0      1.000000      2.014000    175.000000
                uniq             k
 count  94950.000000  94950.000000
 mean       1.652185      3.710321
 std        1.497587      2.769626
 min        1.000000      2.000000
 25%        1.000000      2.000000
 50%        1.000000      3.000000
 75%        2.000000      5.000000
 max       21.000000     15.000000
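
Note that by default describe only covers the numeric columns; to also get the categorical metrics (count, unique, top, freq) we can pass the include parameter:

 print(df.describe(include='all'))       # numeric and non-numeric columns together
 print(df.describe(include=['object']))  # only the object (string) columns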

The describe method is applicable to a groupby object too!

 by_meth = df.groupby('method')
 # we'll limit this view to the SPS column only, for the sake of clarity
 print(by_meth.describe().SPS)
             count      mean       std       min       25%       50%       75%  \
 method
 beagles   18990.0  0.568329  0.347455  0.020000  0.221786  0.581232  0.949239
 mafftx    18990.0  0.678755  0.313320 -1.000000  0.352941  0.762712  1.000000
 mlocarna  18990.0  0.602044  0.327181  0.000000  0.283750  0.655109  0.929293
 raf       18990.0  0.649656  0.317930 -1.000000  0.337838  0.682171  0.989003
 sparse    18990.0  0.476302  0.328543  0.002237  0.181818  0.363495  0.809592
           max
 method
 beagles   1.0
 mafftx    1.0
 mlocarna  1.0
 raf       1.0
 sparse    1.0
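
If we only need a couple of statistics, the groupby agg method is a lighter alternative to describe (a sketch reusing the by_meth object from above):

 print(by_meth['SPS'].agg(['mean', 'median', 'std']))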

Assuming we have already cleaned our data, we can then run a naive correlation test between variables with the corr method, which by default computes pairwise Pearson correlations between the features:

 print(df.corr())
               APSI  APSSI       SPS       SCI  braliSCI      uniq         k
 APSI      1.000000    NaN  0.477527 -0.161952 -0.370620  0.181442  0.072330
 APSSI          NaN    NaN       NaN       NaN       NaN       NaN       NaN
 SPS       0.477527    NaN  1.000000  0.253083  0.118354  0.206748 -0.201991
 SCI      -0.161952    NaN  0.253083  1.000000  0.589294  0.055451 -0.163018
 braliSCI -0.370620    NaN  0.118354  0.589294  1.000000  0.052343 -0.101329
 uniq      0.181442    NaN  0.206748  0.055451  0.052343  1.000000 -0.106866
 k         0.072330    NaN -0.201991 -0.163018 -0.101329 -0.106866  1.000000
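
Pearson is just the default: corr also accepts method='spearman' or method='kendall' for rank-based correlations, which are more robust to outliers and to monotonic but non-linear relationships:

 print(df.corr(method='spearman'))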

The correlation matrix may give us hints about which variables vary together. Note how the APSSI column is all NaN: it is constant in our data (std of 0 in the describe output above), so its correlation with anything is undefined.
To better visualize a correlation matrix, seaborn's heatmap comes in handy:

 import seaborn as sns
 %matplotlib inline
 sns.heatmap(df.corr())

Or even better, using the annot (annotation) parameter, which writes each correlation value inside its cell:

 import seaborn as sns
 %matplotlib inline
 sns.heatmap(df.corr(), annot=True)
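
The heatmap accepts further cosmetic parameters: for instance fmt controls the format of the annotations and cmap the color map (an optional variation, shown here just as an example):

 sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')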
 

Other things to explore include, for example, the number of unique elements in certain columns:

 print(df.k.unique())  # prints the array of unique values in the column
 print(df.k.nunique()) # the length of the above array ((n)umber of (unique) elements)
 ## nunique is equivalent to len(df.k.unique()) as long as there are no missing values,
 ## since nunique drops NaN by default while unique keeps it
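
Closely related is value_counts, which tells us how often each unique element occurs, often more informative than the raw list of values:

 print(df.k.value_counts())                # occurrences of each value, most frequent first
 print(df.k.value_counts(normalize=True))  # the same, as relative frequencies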