[0, 5]: Pandas

(The introduction is taken from the Python Handbook for Data Science)
Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
As we saw, NumPy’s ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (attaching labels to data, working with missing data, etc.), especially when dealing with less structured data, which is the most common thing in real data. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a scientist time.
In this chapter, we will focus on the mechanics of using Series, DataFrame, and related structures effectively.
The common alias for Pandas is pd.

import pandas as pd

The Pandas Series Object

A Pandas Series is an indexed one-dimensional array. It can be created from a list, an array, or numpy array as follows:

my_array = [1,3,6,10,15]


data = pd.Series(my_array)
data
0 1
1 3
2 6
3 10
4 15
dtype: int64

As we can see, there is an automatical index, which we can access by the attribute index. The values are accessed by the attribute values.
But a Series is more than just a NumPy array with an index. The index for example can be not just an integer, but any type.

data = pd.Series(my_array, index=['alpha','beta','gamma','delta','epsilon')
data
alpha 1
beta 3
gamma 6
delta 10
epsilon 15
dtype: int64

The way we access the values are similar to dictionaries:

data['delta']
10

The similarities to the standard dictionaries are quite clear. The main thing to remember though, is that Series are ordered. We can construct a Series from a Dictionary:

map_dict={'ENST':'RNA', 'ENSG':'gene', 'ENSP':'protein'}
data=pd.Series(map_dict)
data
ENSG gene
ENSP protein
ENST RNA
dtype: object

The dictionary is intrinsically unordered, so when converting it to a Series, Pandas builds the index from a sorted list of the dictionary keys.
Beware that the last part, the dtype, is not part of the elements you access when cycling through the values of the Series, it is just another attribute.
Pandas Series support slicing just like other arrays:

data['ENSP':]
ENSP protein
ENST RNA
dtype: object

And now for the main course:

The Pandas DataFrame

Like the Series, a DataFrame can be thought as a generalization of a mix of both a NumPy array and a dictionary.
A DataFrame can be constructed from 2 or more Series with the same index:

map_dict={'ENST':'RNA', 'ENSG':'gene', 'ENSP':'protein'}
count_dict={'ENST': 3300, 'ENSG':18435, 'ENSP':12034}
df = pd.DataFrame({'mapping': map_dict, 'counts': count_dict})
df
counts mapping
ENSG 18435 gene
ENSP 12034 protein
ENST 3300 RNA

We can access the index labels with the DataFrame attribute index.
Additionally, the DataFrame has a columns attribute, which holds the labels for all columns:

df.index
Index(['ENSG','ENSP','ENST'], dtype='object')
df.columns
Index(['counts','mapping'], dtype='object')

A DataFrame can be thus thought as a generalized 2-dimensional array with generalized indexing for accessing data.
So how do we access data?
In the most intuitive way: say we need to access the column (or ‘feature’) ‘counts’, we can access it like a dictionary, or in a Pandas way:

df['counts']
ENSG 18435
ENSP 12034
ENST 3300
Name: counts, dtype: int64
df.counts
ENSG 18435
ENSP 12034
ENST 3300
Name: counts, dtype: int64

The only difference is that the dictionary way supports labels with spaces and special characters while the ‘object’, Pandas way does not (it is easily understandable that we can make a call for df['my ingenuous label']
but not for ]df.my ingenuous label)
However, the most common way we are going to construct our DataFrames is not from dictionaries, but from numpy 2-d arrays, because most of the time biological data will come in Tab Separated Values format (.tsv), or something easily convertible to a 2-d array (or a list of lists, which is the same for Pandas).

"""this is a small sample of a list of performances of the two multiple aligners
mafftx and raf. The values represent, in order: method, dataset,
average pairwise identity, structural pairwise score,
structural conservation index, reference structural conservation index,
and how many sequences were in the dataset
You don't need to understand all these numbers"""
biglist= \
[['mafftx','5_8S_rRNA','75','0.775230007077','0.7154','68', '10'],
['mafftx','5_8S_rRNA','76','0.801566951567','0.5849','67', '10'],
['raf', 'yybP-ykoY', '47', '0.25886864813', '0.8660', '68', '7'],
['raf', 'yybP-ykoY', '47', '0.273654916512', '0.9000', '74', '7']]
#The backslash after 'biglist= ' is needed to write on a newline, nothing important
cols = ['method','dataset','APSI','SPS','SCI','refSCI','k']
my_df = pd.DataFrame(biglist, columns=cols)
my_df
method dataset APSI SPS SCI refSCI k
0 mafftx 5_8S_rRNA 75 0.775230007077 0.7154 68 10
1 mafftx 5_8S_rRNA 76 0.801566951567 0.5849 67 10
2 raf yybP-ykoY 47 0.25886864813 0.8660 68 7
3 raf yybP-ykoY 47 0.273654916512 0.9000 74 7
#be careful that the entries are actually all strings
#(we can see that from the initial list, but we'll see
# a better method in a while)
my_df[['APSI','SPS','SCI','refSCI','k']] = my_df[['APSI','SPS','SCI','refSCI','k']].apply(pd.to_numeric)
#this is needed to convert the strings to the best numeric type.

As you can see, we can define our column labels with the parameter columns, and the same can be done with the index (the parameter is of course: index)
Actually, we can define the index and columns even after the DataFrame is constructed:

df.index=['first_index','second_index','another_index', '...']
df.columns=['measure1','measure3','measure22', '...']

DataFrame attributes, methods and data selection

#assume a DataFrame called df
df.T #returns the transpose of the DataFrame
df.dtypes #it's plural
df.shape #like a NumPy array, returns the DataFrame's dimensions
df.iloc #integer-location indexing
df.loc #label-location indexing

Let’s talk about iloc and loc: the simple call to the attribute will return nothing but a pandas.core.indexing._iLocIndexer or pandas.core.indexing._LocIndexer, but that object can be accessed via slicing and it is used to select rows of the DataFrame. The iloc attribute requires an integer, a list of integers or a boolean array which will be interpreted as the positions to select. The loc attribute instead must be used with labels.

my_df
method dataset APSI SPS SCI refSCI k
0 mafftx 5_8S_rRNA 75 0.775230007077 0.7154 68 10
1 mafftx 5_8S_rRNA 76 0.801566951567 0.5849 67 10
2 raf yybP-ykoY 47 0.25886864813 0.8660 68 7
3 raf yybP-ykoY 47 0.273654916512 0.9000 74 7
my_df.index=['first','second','third','fourth']
my_df
method dataset APSI SPS SCI refSCI k
first mafftx 5_8S_rRNA 75 0.775230007077 0.7154 68 10
second mafftx 5_8S_rRNA 76 0.801566951567 0.5849 67 10
third raf yybP-ykoY 47 0.25886864813 0.8660 68 7
fourth raf yybP-ykoY 47 0.273654916512 0.9000 74 7
my_df.iloc[1:3]
method dataset APSI SPS SCI refSCI k
second mafftx 5_8S_rRNA 76 0.801566951567 0.5849 67 10
third raf yybP-ykoY 47 0.25886864813 0.8660 68 7
#1:3 is sliced (3 not included, just like in the range function)
my_df.loc['second':'fourth']
method dataset APSI SPS SCI refSCI k
second mafftx 5_8S_rRNA 76 0.801566951567 0.5849 67 10
third raf yybP-ykoY 47 0.25886864813 0.8660 68 7
fourth raf yybP-ykoY 47 0.273654916512 0.9000 74 7
#we can select a portion of the df straight away
my_df.loc['second':'fourth','SPS':'SCI']
SPS SCI
second 0.801567 0.5849
third 0.258869 0.8660
#we can combine them if you have mixed information
my_df.iloc[1:3].loc[:, 'SPS']
second 0.801567
third 0.258869
Name: SPS, dtype: float64

First trouble: while the iloc behaves as expected, the loc indexing uses a different workaround which includes the end point of the slice. Be sure not to forget this particular behaviours.
Second thing to learn: the notation [indexes] refers to rows. If we want to select columns the notation required is, as can be seen in the last example ...loc[row_indexes, column_indexes]. Just as in standard python slices, the colon “:” can be used alone to refer to all the available range.
Moreover, we can select data by boolean indexing (just like with NumPy)

bool_idx=my_df.SCI > 0.7
bool_idx
first True
second False
third True
fourth True
Name: SCI, dtype: bool
my_df[bool_idx]
method dataset APSI SPS SCI refSCI k
first mafftx 5_8S_rRNA 75 0.775230007077 0.7154 68 10
third raf yybP-ykoY 47 0.25886864813 0.8660 68 7
fourth raf yybP-ykoY 47 0.273654916512 0.9000 74 7
my_df.info()
Index: 4 entries, first to fourth
Data columns (total 7 columns):
method 4 non-null object
dataset 4 non-null object
APSI 4 non-null int64
SPS 4 non-null float64
SCI 4 non-null float64
refSCI 4 non-null int64
k 4 non-null int64
dtypes: float64(2), int64(3), object(2)
memory usage: 416.0+ bytes
my_df.describe()
APSI SPS SCI refSCI k
count 4.000000 4.000000 4.000000 4.000000 4.000000
mean 61.250000 0.527330 0.766575 69.250000 8.500000
std 16.459546 0.301708 0.145272 3.201562 1.732051
min 47.000000 0.258869 0.584900 67.000000 7.000000
25% 47.000000 0.269958 0.682775 67.750000 7.000000
50% 61.000000 0.524442 0.790700 68.000000 8.500000
75% 75.250000 0.781814 0.874500 69.500000 10.000000
max 76.000000 0.801567 0.900000 74.000000 10.000000

These kind of information can be useful when trying to infer missing values, presence of out-of-expected-range values (like -1 in positive features) and such. This will be better covered in the EDA chapter.

Code examples:

1- Numpy, Pandas
https://github.com/noise42/datastructures/blob/master/materials/tv19/misc_data/les4/Lesson4.ipynb
2- Pandas, missing values, cleaning data
https://github.com/noise42/datastructures/blob/master/materials/tv19/misc_data/les5/Lesson5.ipynb
3-Loading data, aggregation
https://github.com/noise42/datastructures/blob/master/materials/tv19/misc_data/les6/Lesson6.ipynb