[1, :]: Data Analysis mindset


If you want a more comprehensive roundup of Python's peculiarities for Machine Learning, have a look at this book; it's valuable: Python Machine Learning (O'Reilly)

When presented with new data, the first thing that must come to mind is the Question, capital Q. What is a possible Question I can answer by manipulating this data? Sometimes it's obvious; most of the time it is not. Exploratory Data Analysis (EDA, from now on) covers this step of the process.
We can divide this phase of your study into 5 main parts:

  1. Question(s)
  2. Acquisition & Clean
  3. Exploration
  4. Predictions and (some) Conclusions
  5. Communication

Laying down your Question(s)
The Question part must flourish from observation and intuition (wisdom, perhaps, but mostly background knowledge of the topic). What do I know about this data? What are the main issues to address when working with it? For example, as a researcher I may be interested in the latest papers in the field. This should help you produce some ideas. Brainstorm with yourself, and with others if possible. Don't stop until you have a prime number (>3) of possible ideas.
Here you will find…
Data Acquisition & Cleaning
Without clean data we can't go anywhere. What counts as clean is actually defined both by the problem being addressed and by the libraries we are going to use. For instance, suppose we have a Tab Separated Values file (.tsv) with 10^4 rows of annotations spread over 20+ columns (features), and our problem only needs two or three of those columns: then clean data is data stripped of all the useless columns. If our library requires the data to be in a numpy array, our definition of clean must also include the fact that, when reading the .tsv, the data is properly split and assigned to a numpy array. The acquisition part is …
Here you will find a collection of libraries and how to use them to get your data on board.
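As a minimal sketch of what this can look like in practice (the file name and column names here are hypothetical, purely for illustration), pandas can read the .tsv, keep only the columns the problem actually needs, and hand the result over as a numpy array:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only:
# the problem needs just two of the 20+ annotation columns.
df = pd.read_csv("annotations.tsv", sep="\t", usecols=["gene_id", "expression"])

# "Clean" here means: no useless columns, no missing values...
df = df.dropna()

# ...and the numbers handed over as a numpy array, because that
# is what the downstream library expects.
expression = df["expression"].to_numpy()
print(expression.shape, expression.dtype)
```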
Exploration
In this part we must indulge our curiosity. Plot everything that comes to mind, from every point of view: distributions, contour plots, means, medians and modes. Use the libraries' built-in functions to help yourself. If there is too much data, sample it. Identify outliers and strange behaviors. Put labels on things and see if any new Question comes to mind.
In this section there will be an excursus on Descriptive Statistics and the libraries that help in the process.
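As a small taste of that, here is a hedged sketch (same hypothetical file and column as above) of the kind of quick first look pandas and matplotlib make possible:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("annotations.tsv", sep="\t")  # hypothetical file

# Built-in descriptive statistics: count, mean, std, quartiles.
print(df.describe())

# If there is too much data, sample it.
sample = df.sample(n=1000, random_state=0) if len(df) > 1000 else df

# Plot the distribution of one (hypothetical) feature
# and eyeball it for outliers and strange behaviors.
sample["expression"].hist(bins=50)
plt.xlabel("expression")
plt.ylabel("count")
plt.show()
```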
Predictions and (some) Conclusions
This is the time to answer your Question(s) (or at least the time to try). Remember the two laws of the good bioinformagician:

IS THAT AGAINST RANDOM DATA?

NO ‘CORRELATION IMPLIES CAUSATION’

And the zeroth law:

UNDERSTAND P-VALUES

Combine data and use available software and Machine Learning (which is cool) to answer the Question you formulated previously.
Now you can try drawing some conclusions about your data.
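To make the first and zeroth laws concrete, here is a minimal sketch of a permutation test on synthetic data: shuffle the group labels many times and ask how often random data produces a difference at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic groups: is the difference in means "real"?
treated = rng.normal(loc=1.0, scale=1.0, size=50)
control = rng.normal(loc=0.0, scale=1.0, size=50)
observed = treated.mean() - control.mean()

# First law: test it against random data, i.e. throw away the
# group labels by shuffling and recompute the statistic.
pooled = np.concatenate([treated, control])
null = np.empty(10_000)
for i in range(null.size):
    rng.shuffle(pooled)
    null[i] = pooled[:50].mean() - pooled[50:].mean()

# Zeroth law: the p-value is simply the fraction of shuffles that
# yield a difference at least as extreme as the observed one.
p_value = np.mean(np.abs(null) >= abs(observed))
print(f"observed = {observed:.3f}, p = {p_value:.4f}")
```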
Contents: first contact with Machine Learning techniques
Communication
The most important part. You can make awesome discoveries, but if you cannot communicate them to the outside (and the concept of outside is quite fuzzy), the discovery is incomplete. The whole plotting and data visualization process is more art than science and, unfortunately for us, a necessary one. If you have spare time and money, have a look at this beautiful book; it is inspiring:

Once I heard (I cannot find the source anymore: if anyone can help me reference it, I would be glad) an enlightening sentence about this: “You always know what you made a plot of, but the reader doesn’t”.
Matplotlib basics are covered here.
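In the spirit of that sentence, a toy sketch: whatever else the plot does, give the reader a title, axis labels and a legend.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")

# You always know what you made a plot of; the reader doesn't.
ax.set_title("A labelled toy plot")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
plt.show()
```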

