[1,2]: Data Acquisition & Cleaning

We will get started with an example.
Let’s say we need to process a set of observations (e.g. RNA chemical probing data) into a data structure suitable for clustering (see Elements of Unsupervised Learning). The scipy function linkage, contained in scipy.cluster.hierarchy (see the Docs), asks for “either a 1d compressed distance matrix or a 2d array of observation vectors.” We will go with the second option and let the function calculate the distances itself.
Scipy's linkage automatically clusters your data by merging the two closest objects at each iteration. The key word here is “closest”, because its definition determines the outcome. What defines who is close to whom? A distance. Several distance metrics are available; the default is ‘euclidean’, which is the standard distance in a Cartesian plane. If you remember Pythagoras' theorem, that is all you need to know:
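As a minimal sketch of the second option, the snippet below passes a 2d array of observation vectors straight to linkage. The toy values in X are hypothetical and chosen only for illustration; method="single" and metric="euclidean" are written out explicitly, although they are also the scipy defaults.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: 4 observations (rows) with 3 features (columns).
X = np.array(
    [
        [2.0, 3.0, -1.0],
        [5.0, 3.0, 3.0],
        [2.5, 3.0, -1.0],
        [5.0, 4.0, 3.0],
    ]
)

# linkage computes the pairwise distances itself when given a 2d array.
Z = linkage(X, method="single", metric="euclidean")

# Each row of Z describes one merge:
# [index of cluster A, index of cluster B, distance between them, size of the new cluster]
print(Z)
```

With n observations the result has n - 1 rows, one per merge, and the last row always describes the cluster that contains every observation.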

$\sqrt{\sum_{d=1}^{D} (x_{d}^{(1)} - x_{d}^{(2)})^{2}}$

where $d$ runs over all $D$ dimensions of the data and $x^{(1)}$ stands for the first observation. If you are confused, try writing it out for $D=1$ and $D=2$.
A practical example: consider the two observations $x^{(1)}=[2, 3, -1]$ and $x^{(2)}=[5, 3, 3]$. Each has 3 features (or dimensions).
The Euclidean distance between the two points will be:

$\sqrt{ (x_{\mathbf{1}}^{(1)} - x_{\mathbf{1}}^{(2)})^{2} +(x_{\mathbf{2}}^{(1)} - x_{\mathbf{2}}^{(2)})^{2} +(x_{\mathbf{3}}^{(1)} - x_{\mathbf{3}}^{(2)})^{2}  }$

$ = \sqrt{(2-5)^{2} + (3-3)^{2} + (-1-3)^{2}} = \sqrt{9+0+16} = 5$
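The same arithmetic can be checked in a few lines of plain Python. This is only a sketch for verifying the worked example; the variable names are illustrative.

```python
import math

# The two observations from the worked example, 3 features each.
x1 = [2, 3, -1]
x2 = [5, 3, 3]

# Squared difference for each dimension d, then the square root of their sum.
squared_diffs = [(a - b) ** 2 for a, b in zip(x1, x2)]
distance = math.sqrt(sum(squared_diffs))

print(squared_diffs)  # [9, 0, 16]
print(distance)       # 5.0
```

The list comprehension works unchanged for any number of dimensions, as long as both observations have the same length.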

Usually you would think of $x_2$ and $x_3$ as $y$ and $z$, but since we are going to work with many more than 3 dimensions, we should drop that convention and stick with a more general one.


Dealing with missing values

From here on I’ll be covering a more common task: standardizing a DataFrame and imputing (or cutting) missing values. This has to be done pretty much every time, since it is virtually impossible for a large dataset to have no missing values.
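To make the task concrete, here is a minimal sketch in pandas. It assumes mean imputation and z-score standardization, which are only one possible pair of choices, and the toy DataFrame with its column names is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are observations, columns are features.
df = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, np.nan, 4.0],
        "feature_b": [10.0, np.nan, np.nan, 40.0],
        "feature_c": [0.5, 0.7, 0.9, 1.1],
    }
)

# Option 1 ("cut"): drop every row that contains at least one missing value.
df_cut = df.dropna(axis=0, how="any")

# Option 2 ("impute"): replace each missing value with the mean of its column.
# DataFrame.mean() skips NaN by default.
df_imputed = df.fillna(df.mean())

# Standardize each column to zero mean and unit sample standard deviation.
df_standardized = (df_imputed - df_imputed.mean()) / df_imputed.std()

print(df_cut)
print(df_standardized)
```

Cutting is the safer choice when few rows are affected; imputing keeps more observations but injects values that were never measured, so the right trade-off depends on the dataset.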


Cleaning up sparse matrices

This section assumes that the reader knows how to load a pandas DataFrame.