Real-world data often has missing values.
Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.
Handling missing data is important as many machine learning algorithms do not support data with missing values.
Specifically, we will cover the following situations:
- How to mark invalid or corrupt values as missing
- How to remove rows with missing data
- How to impute missing values
We will work with an artificial table that fits our needs:
|RNA 1||RNA 2||RNA 3||RNA 4||RNA 5||…||RNA M|
This table contains expression data for M RNAs in N samples. The data may appear good at first but, assuming we have a documentation describing this dataset, and assuming that in this same documentation there’s a note specifying that all values equals to -1 correspond to corrupt or missing data. Some rows contain -1s and we have to deal with them since many Machine Learning algorithm cannot deal with missing data.
Beware: the current format would not be recognized as “missing” (unless explicitly specified) and therefore would cause any algorithm to misinterpret the data (since -1 are numbers just as the others, and the ML method has probably not read the data documentation).
First of all we need to mark the missing data as such, and the way to do it is to substitute the -1 with NaNs. NaN is a numeric data type value that stands for Not a Number, and is usually chosen as default for indicating anomalies.
Two words on NaNs: even if is a numeric data type value, it cannot be compared by usual operands. In facts, if you try to compare two NaNs:
import numpy as np np.nan==np.nan False #while there is a numpy method to assert whether something is nan np.isnan(np.nan) True
In brief, a NaN is not equal to another NaN. Take note.
Marking Missing Values
With real data (probably thousands columns/rows) we won’t have a clear vision of how many missing values are hiding in the data matrix, therefore we will rely on indirect information. The pandas DataFrame describe method will come in handy, since it will return a summary of the data distribution of each column (features) along the rows (samples). This is useful if we have many rows but few columns, since the information will be displayed for each column. If for example the missing data is marked with a ‘-1’ then any column containing at least one will show a min value of ‘-1’. This will help us identify how many columns are affected.
If the columns are too many instead we cannot use describe and shall resort directly to some counting.
Let’s mark beforehand the data as NaNs:
#the dataframe is stored into the variable data data= data.replace(-1, np.nan)
Now the matrix will look something like this:
Then we can count:
np.isnan(data).sum() RNA1 2 RNA2 2 RNA3 4 RNA4 6 ... RNAM 5
This gives us an overview of how many missing values we are dealing with.
Next we have basically two options: Eliminate entire samples if they contain missing values, or try to impute missing values.
There is a method for this, and it is Pandas’.
The dropna DataFrame method removes from a DataFrame all the rows which contain at least one NaN
#suppose that the number of samples N is 700, and the number of features M is 13000 data.shape (700, 13000) data_dropped=data.dropna() data_dropped.shape (320, 13000)
This operation can be devastating on certain datasets with many NaNs spread across samples because it can cause the dataframe to shrink at unacceptable levels. Always check how many rows you have before and after the drop and decide if that is ok for your analyses.
Since removing rows can be limiting with some problems, imputing missing values is a valid alternative.
The word ‘Imputing’ refers to using a model to replace missing values. The models may vary on the situation addressed, and sometimes there’s no ‘best’ model to apply. For example, we can replace missing data with:
- a constant value
- a random value
- a mean, median or mode of the feature distribution
- a value estimated by another predictive model
Important: Any imputing performed on the training data will have to be necessarily performed on any new data, otherwise the algorithm will get fooled by different data representations.
So, if you choose to impute with mean column values, you will have to store the same column values for future use, because they are the same values that must be applied to new data.
We have generally two ways of imputing missing values: the Pandas DataFrame fillna method or the SciKit Imputer.
While fillna is more explicit it is just an implementation of replace dedicated to NaNs, Imputer is a flexible class that automatically compute means or other custom imputing values, and provides the possibility to specify the value to impute (other than NaN).
#DataFrame fillna: manual imputing data_fillna = data.fillna(data.mean()) #SciKit Imputer, it defaults to impute with the mean, we will specify it anyway from sklearn.preprocessing import Imputer imputer=Imputer(strategy='mean') data_imputer = imputer.fit_transform(data)
The difference between two methods is that in the second case the Imputer will return a numpy array (and not a DataFrame), so it will not retain index nor column names. You will have to add them manually after the imputing.