DataFrames (from now on DF) have many useful methods (functions that you call with DF.method(arguments)) and attributes (values that you access with DF.attribute), which I won’t be fully covering here (an overview is in the chapter dedicated to pandas and numpy). For the full list of available options, see the pandas.DataFrame documentation.
In the case of clustering [what about supervised learning?], if the matrix representing the data is sparse, you may be able to drop a number of “empty” features, that is, columns without any value. Make no mistake: whether the absence of a value is represented by ‘0’, by ‘None’ or by some other placeholder depends entirely on the specific problem. Here we assume it is 0. Let’s do some pandas magic to remove these columns all at once:
def cleanColumns(df):
    """returns a df with all zero columns stripped out, useful for sparse data"""
    return df.iloc[:, list(map(lambda x: not x, (df == 0).all(axis=0)))]
Let’s unwrangle this code one step at a time:
The idea is that we want to select only those columns in the DF that are not full of zeros. We will use boolean indexing:
The iloc indexer subsets a dataframe using integer positions or boolean masks (see the Pandas section for more). Note that it is used with square brackets, not called like a method.
Integer position indexing example:
With df.iloc[0:4,0:2], we would select the first four rows (positions 0 to 3) and the first two columns (positions 0 and 1); as with any Python slice, the stop position is excluded.
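A minimal sketch of integer position indexing, using a small DataFrame invented for this illustration (the column names a, b, c and the values are assumptions, not data from this chapter):

```python
import pandas as pd

# Hypothetical 5x3 DataFrame, made up for this example.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [10, 20, 30, 40, 50],
    "c": [100, 200, 300, 400, 500],
})

# Rows at positions 0..3 and columns at positions 0..1 (the stop is excluded).
subset = df.iloc[0:4, 0:2]
print(subset.shape)          # (4, 2)
print(list(subset.columns))  # ['a', 'b']
```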
Boolean indexing example:
Assuming a DF with 3 rows and 2 columns, DF.iloc[[True,False,True], [False,True]] would select the elements of first and third row at their second column.
The two indexing methods can be combined between rows and columns.
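The boolean case, and the mix of a slice on the rows with a boolean mask on the columns, can be sketched as follows. The 3x2 DataFrame and its column names x and y are hypothetical, chosen only to match the shape described above:

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Boolean masks on both axes: keep the first and third row, and only the second column.
picked = df.iloc[[True, False, True], [False, True]]
print(picked["y"].tolist())  # [4, 6]

# Combined: an integer slice for the rows, a boolean mask for the columns.
mixed = df.iloc[0:2, [True, False]]
print(mixed["x"].tolist())   # [1, 2]
```

Each mask must have exactly as many entries as the axis it is applied to.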
df.iloc[:, selected_columns ]
We need to select all rows (no restrictions in this problem), and “:” selects all rows (remember that the slice syntax from a to b for an array arr is arr[a:b], that is the same colon). For the selected columns we need a bit more code instead:
df == 0
It returns a DF with the same shape as the original, but with boolean values: True if the element in the same position was equal to 0, False otherwise.
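As a quick check of the element-wise comparison df == 0, here is a sketch on a toy DataFrame (its contents are invented for the example):

```python
import pandas as pd

# Hypothetical sparse-looking data: columns a and c contain only zeros.
df = pd.DataFrame({"a": [0, 0, 0], "b": [1, 0, 2], "c": [0, 0, 0]})

mask = (df == 0)                # boolean DataFrame, same shape as df
print(mask.shape == df.shape)   # True
print(mask["b"].tolist())       # [False, True, False]
```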
(df == 0).all(axis=0)
The all method checks whether all elements are True. It is available on both Series and DataFrames; on a DataFrame it returns a Series of booleans, one per column or per row. The axis parameter can be 0 or 1 and sets the direction of the check. With 0 it moves down the rows (for each column it looks into every row) and with 1 it moves across the columns (for each row it looks into every column).
In this case, it returns a Series with as many elements as there are columns (features or dimensions), each boolean telling whether the corresponding column is full of zeros.
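To make the role of axis concrete, the sketch below runs all in both directions on the same kind of toy data (the DataFrame is an assumption made for the illustration):

```python
import pandas as pd

# Hypothetical data: columns a and c are all zeros, and so is the second row.
df = pd.DataFrame({"a": [0, 0, 0], "b": [1, 0, 2], "c": [0, 0, 0]})
is_zero = (df == 0)

# axis=0: one boolean per column (is the whole column zero?).
all_zero_columns = is_zero.all(axis=0)
print(all_zero_columns.tolist())  # [True, False, True]

# axis=1: one boolean per row (is the whole row zero?).
all_zero_rows = is_zero.all(axis=1)
print(all_zero_rows.tolist())     # [False, True, False]
```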
list(map(lambda x: not x, (df == 0).all(axis=0)))
This is purposely done in a more complicated way (applying the NOT operator “~” to the boolean Series would have achieved the same result). It is just another way of remembering that, even if there probably is a best way to solve a specific problem, sometimes it is better to just solve it in the first one or two ways that come to mind. Efficiency and optimization will come with practice.
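For comparison, here is one possible version based on “~”. The function name clean_columns_tilde and the sample data are hypothetical, and the conversion with to_numpy() is there because iloc does not accept a boolean Series as a mask:

```python
import pandas as pd

def clean_columns_tilde(df):
    """Drop all-zero columns using ~ instead of map and lambda (illustrative variant)."""
    keep = ~(df == 0).all(axis=0)       # boolean Series: True for the columns to keep
    return df.iloc[:, keep.to_numpy()]  # iloc wants a plain boolean array, not a Series

# Hypothetical data: columns a and c contain only zeros.
df = pd.DataFrame({"a": [0, 0, 0], "b": [1, 0, 2], "c": [0, 0, 0], "d": [0, 3, 0]})
print(list(clean_columns_tilde(df).columns))  # ['b', 'd']
```

An equivalent alternative is df.loc[:, keep], since loc accepts a boolean Series aligned on the column labels.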
The map function is straightforward, even if it is sometimes tricky to grasp at first. It is one of the most useful functions in Python: it applies the same function to every element of an iterable, such as a list or a Series.
The function here is defined in-line, using the lambda keyword. The lambda function defined above gives the same result as writing:
def mapping_function(x):
    return not x

map(mapping_function, (df == 0).all(axis=0))
We are just applying the logical NOT to everything in the Series coming out from the previous expression, turning to False the boolean of the columns full of zeros.
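The equivalence between the lambda and the named function can be checked on a plain list of booleans (the list flags is made up for the example). The sketch also shows that, in Python 3, map produces a lazy iterator that has to be turned into a list and can be consumed only once:

```python
flags = [True, False, True]  # hypothetical input

def mapping_function(x):
    return not x

lazy = map(lambda x: not x, flags)
print(isinstance(lazy, list))   # False: map returns an iterator, not a list

with_lambda = list(lazy)
with_def = list(map(mapping_function, flags))
print(with_lambda)              # [False, True, False]
print(with_lambda == with_def)  # True

leftover = list(lazy)
print(leftover)                 # []: the iterator is already exhausted
```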
In Python 3 the map function no longer returns a list but a lazy iterator, so we have to explicitly transform its result into a list if we want to use it for indexing. But there is a Pandas alternative: apply
(df == 0).all(axis=0).apply(lambda x: not x)
The apply method is very similar to map, except that it is a method of DataFrame/Series objects. It has been built specifically for them and works like a charm. Called on a Series, as here, it applies the same function (in this case, our lambda function) to every element, just like map.
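A short sketch comparing the two approaches on the same toy data (the DataFrame is invented for the illustration). Note that apply gives back a Series, while the map version gives a plain list:

```python
import pandas as pd

# Hypothetical data: columns a and c contain only zeros.
df = pd.DataFrame({"a": [0, 0, 0], "b": [1, 0, 2], "c": [0, 0, 0]})
all_zero = (df == 0).all(axis=0)

keep_apply = all_zero.apply(lambda x: not x)     # boolean Series indexed by column name
keep_map = list(map(lambda x: not x, all_zero))  # plain Python list

print(keep_apply.tolist())              # [False, True, False]
print(keep_apply.tolist() == keep_map)  # True
```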
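The boolean list produced is an ideal candidate for boolean indexing on the columns (iloc wants a plain list or array as a mask, so a boolean Series such as the one returned by apply has to be converted first, for example with list()), therefore: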
df.iloc[:, list(map(lambda x: not x, (df == 0).all(axis=0))) ]
is exactly what we need to clean our sparse DataFrame.
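Putting everything together, a usage sketch of cleanColumns on a small DataFrame; the name sparse and its values are assumptions made for the example:

```python
import pandas as pd

def cleanColumns(df):
    """Return df without the columns that contain only zeros."""
    return df.iloc[:, list(map(lambda x: not x, (df == 0).all(axis=0)))]

# Hypothetical sparse data: f1 and f3 never take a non-zero value.
sparse = pd.DataFrame({
    "f1": [0, 0, 0, 0],
    "f2": [0, 7, 0, 1],
    "f3": [0, 0, 0, 0],
    "f4": [2, 0, 0, 0],
})

cleaned = cleanColumns(sparse)
print(list(cleaned.columns))  # ['f2', 'f4']
print(sparse.shape)           # (4, 4): the original DataFrame is left untouched
```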
Question: would it have worked the same if you had replaced df == 0 with df != 0 (and without the lambda function needed to invert Trues and Falses)?