Unsupervised machine learning is the branch that most closely resembles the popular concept of Artificial Intelligence: the idea that a computer can learn to identify complex processes and patterns without a human providing guidance along the way. While a supervised algorithm learns how best to fit its model to the labels it is given, an unsupervised one tries to understand the underlying structure of the data by grouping similar behaviours in sub-spaces of the features, and the labels are self-assigned.
Unsupervised ML is worth considering when you are interested in understanding the underlying structure of the data, or when no prior labeling is available. The simplest picture of a clustering algorithm is one with two dimensions (or features).
Clustering amounts to giving different colors to points that belong to different groups of data, without knowing those groups in advance.
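To make the two-feature picture concrete, here is a minimal sketch of k-means, one common clustering algorithm, written with only the Python standard library. The synthetic blobs, the `kmeans` helper and the farthest-first initialisation are all assumptions chosen for illustration; nothing here comes from the papers discussed in this post.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    # Farthest-first initialisation: a random first center, then repeatedly
    # the point that is farthest from all centers chosen so far.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point takes the index of its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

# Hypothetical toy data: three well-separated 2-D groups of 50 points each.
# The algorithm never sees which group a point was drawn from.
data_rng = random.Random(42)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in true_centers for _ in range(50)]

labels = kmeans(points, k=3)  # the "color" assigned to each point
```

In a real analysis one would more likely reach for an established implementation such as scikit-learn's `KMeans`; the point of the sketch is only that the group indices come out of the data, not out of any labels.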
The main problem is that bioinformatics data do not have that nice representation … ever. The data one works with are usually N-dimensional with N > 10, and there is an interesting paper on what happens to the world when we cross that boundary (see this paper by Koppen).
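One well-known symptom of high dimensionality is that distances "concentrate": the nearest and the farthest neighbour of a point end up almost equally far away, which undermines any method built on distances. The sketch below illustrates this on uniformly random points; the `relative_contrast` helper and the choice of uniform data are assumptions made for this illustration and are not taken from the cited paper.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """Return (max - min) / min over the distances from one random query
    point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = tuple(rng.random() for _ in range(n_dims))
    dists = [
        math.dist(query, tuple(rng.random() for _ in range(n_dims)))
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of features grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

When the contrast approaches zero, "the closest point" carries very little information, and clusters found by a single distance-based method deserve extra scrutiny.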
In many papers, even ones published in high impact factor journals, such as this one by Mertins and colleagues, scientists naively ignore the pitfalls of dimensionality, exposing their analyses to unexpected and unresolvable errors.
This work by Ronan suggests adopting a practice that is becoming increasingly common in data science but still remains in the shadows in bioinformatics: ensemble clustering.
The idea is not to trust a single clustering method or choice of parameters when dealing with high-dimensional data. Instead, the scientist should produce many possible clusterings and then aggregate them with a super-clustering approach. The results are then more robust and less dependent on the model and its parameters.
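The procedure described above can be sketched as follows. This is one simple way to aggregate many clusterings, not necessarily the one used by Ronan: run k-means many times with different numbers of clusters and random starts, record how often each pair of points lands in the same cluster (a co-association matrix), and then link the pairs that co-cluster in more than half of the runs. The helper names, the toy data, the range of `ks` and the 0.5 threshold are all assumptions for illustration.

```python
import math
import random

def kmeans(points, k, rng, n_iter=50):
    """Minimal Lloyd's k-means from a random start; returns a label per point."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def co_association(points, ks=(2, 3, 4, 5), runs_per_k=10, seed=0):
    """Fraction of base clusterings in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    n_runs = 0
    for k in ks:  # vary the parameter ...
        for _ in range(runs_per_k):  # ... and the random initialisation
            labels = kmeans(points, k, rng)
            n_runs += 1
            for i in range(n):
                for j in range(n):
                    if labels[i] == labels[j]:
                        counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]

def consensus_labels(coassoc, threshold=0.5):
    """Link pairs that co-cluster in more than `threshold` of the runs and
    return the connected components as the final (super-)clustering."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already reached from an earlier component
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical toy data: three 2-D groups of 20 points each.
data_rng = random.Random(1)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in blobs for _ in range(20)]

coassoc = co_association(points)
final = consensus_labels(coassoc)
print(len(set(final)), "consensus clusters")
```

The co-association matrix is useful in its own right: pairs with values close to 0 or 1 are stable across methods and parameters, while values near the threshold mark the points whose assignment depends on the modelling choices.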
In the following Jupyter Notebook, I want to show you how to improve the robustness of a clustering by using the techniques described by Ronan, starting from the clustering done by Mertins and colleagues in the work cited above.
Link to the Jupyter Notebook