[1,4]: Predictions and (some) Conclusions

It’s time to apply some magic. Machine Learning is an expanding field that, starting from very simple principles, can grasp very complex things in very little time. More formally, a Machine Learning technique determines a model’s parameters from a set of data, in such a way that a computer can be said to have ‘learned’ from that set. From that point on, the machine is trained to recognize similar situations and can draw conclusions about them.
Let’s talk about terminology: how do you think about your data?
In machine learning, it is common to think of data as a spreadsheet, a table: columns and rows. Each row is an observation, an entity that corresponds to one object of the study. Each column is called a feature: think of it as a single measurable aspect of each observation (e.g. a measurable aspect of a cell is the expression level of a certain miRNA, a measurable aspect of a country is its banana export volume, etc.; those are features).
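To make this concrete, here is one way such a table might be built in Python, with two features and a label column. The names (inp_data, feature1, feature2, y, class1, class2) are made up for illustration, and are reused by the plotting snippet further down; the values are randomly generated.

 import numpy as np
 import pandas as pd

 rng = np.random.default_rng(0)

 # 40 observations (rows), two features (columns) and a class label per row
 inp_data = pd.DataFrame({
     'feature1': np.concatenate([rng.normal(0, 1, 20), rng.normal(4, 1, 20)]),
     'feature2': np.concatenate([rng.normal(0, 1, 20), rng.normal(4, 1, 20)]),
     'y': ['class1'] * 20 + ['class2'] * 20,
 })
 print(inp_data.head())
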
The aim is to frame the data in a context such that a certain function f of the Input Data produces Outputs in line with what we consider ‘right’. The presence or absence of a definition of ‘right’ creates a first distinction between ML methods: Supervised and Unsupervised Learning.


Supervised Machine Learning needs a set of data (the training set) from which to learn, and an answer for each element in that set. Suppose we have expression data (features) from a certain cell line (observation). If my job is to predict, from expression alone, whether this cell is tumoral or not, then I need to have previously taught the algorithm how to recognize the two cases (= I need a set of observations for which I already know whether they are tumoral or not, so the algorithm can learn to distinguish them). Computers easily understand mathematical models: in this sense, we have to decide which mathematical model best fits our system. The process of ‘learning’ from a dataset usually amounts to finding a curve in the feature space that divides it into areas, each representing the region where one label belongs. When we talk about feature space, think of it as a Cartesian space where each feature has its own axis. This is clearly not drawable with more than three features, but I hope you get the feeling.

If we plot the data, and draw each class with a different colour, here’s the feature space:

 import matplotlib.pyplot as plt

 # map each class label to a point colour
 color_map = {'class1': 'r', 'class2': 'k'}
 plt.scatter(inp_data.feature1, inp_data.feature2,
             color=[color_map[x] for x in inp_data.y])
 plt.xlabel('feature1')
 plt.ylabel('feature2')
 plt.show()
 



Let’s say a new observation lands on your desk. No class label. How would you assign it a class, based on what you have seen so far? Think about it yourself.

The intuition is simple: we can draw a line between the two groups, then find where the new point lies in the feature space and see which area it falls into.
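
As a toy version of that idea, we can hand-pick a line and classify a new point by which side of the line it falls on. The weights and the new observation below are chosen by eye for the toy inp_data above, purely for illustration:

 import numpy as np

 # a hand-drawn boundary: the points where w . x + b = 0
 w = np.array([1.0, 1.0])
 b = -4.0  # chosen by eye so the line passes between the two toy groups

 def which_side(point):
     # negative side of the line -> class1, positive side -> class2
     return 'class1' if np.dot(w, point) + b < 0 else 'class2'

 new_obs = np.array([1.0, 1.5])  # a new, unlabelled observation (made-up values)
 print(which_side(new_obs))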

The issues in real cases are mostly two: dimensionality and boundaries.
Dimensionality because, especially in bioinformatics, there will rarely be only two features (and thus two dimensions). Boundaries because a hand-drawn line only roughly separates the two groups; that causes no harm as long as all subsequent observations are extreme (far from the boundary), but what about points that lie near the decision boundary? We need more than an intuitive line drawn between two distributions of points.
Supervised Machine Learning covers this area: it optimizes the parameters that draw that line (or curve, or whatever), and it naturally lives in spaces with more than two dimensions (even if problems arise when working with too many dimensions; I recommend this paper by Ronan and colleagues).
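
As a minimal sketch of what that optimization looks like in practice, here is scikit-learn’s LogisticRegression (one of many possible models) fitted on the toy inp_data from above; instead of drawing the line by hand, the optimizer finds its parameters for us:

 from sklearn.linear_model import LogisticRegression

 # learn the boundary's parameters from the labelled training data
 model = LogisticRegression()
 model.fit(inp_data[['feature1', 'feature2']].values, inp_data.y)

 # classify a new observation by the region of feature space it falls into
 print(model.predict([[1.0, 1.5]]))
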
Jump to the Supervised Machine Learning section to see a practical example on Cervical Data.


Unsupervised Machine Learning, instead, must be summoned when no labels are available in the input data, and/or when you want to learn a possible internal structure of the data itself (forget about new observations for now: unsupervised learning is all about the data at hand).
Clustering is one of the most used unsupervised machine learning methods in bioinformatics, and it refers to the task of finding groups of observations (samples) that are closer to each other than to the other points (always think in terms of the feature space). The algorithm guesses the classes, which are not available as they were in the Supervised ML case.
A more detailed example here [LINK].
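
Until then, here is a minimal clustering sketch, assuming the toy inp_data from above and scikit-learn’s KMeans (one clustering algorithm among many); note that only the features are passed to the algorithm, never the labels:

 from sklearn.cluster import KMeans

 # only the feature columns are given to the algorithm -- no labels
 features = inp_data[['feature1', 'feature2']].values
 kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
 cluster_ids = kmeans.fit_predict(features)
 print(cluster_ids)  # a cluster index (0 or 1) for each observation
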
The scikit-learn documentation provides a useful starter map to decide which route to try first (higher res), based on your data:

There are many unfamiliar terms in it, but most are algorithm names (in the green boxes), already packaged and ready to use in the Python module (so there won’t actually be much mathematics involved).
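
To give a feel for how little changes from one algorithm to another, scikit-learn estimators share the same fit/predict interface, so swapping models is essentially a one-line change (a sketch, again assuming the toy inp_data from above):

 from sklearn.svm import SVC
 from sklearn.ensemble import RandomForestClassifier

 X = inp_data[['feature1', 'feature2']].values
 y = inp_data.y

 for estimator in (SVC(), RandomForestClassifier()):
     estimator.fit(X, y)                      # same .fit() for every estimator
     print(estimator.predict([[1.0, 1.5]]))   # same .predict() too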