<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mattalhonte</title>
    <description>The latest articles on DEV Community by mattalhonte (@mattalhonte).</description>
    <link>https://dev.to/mattalhonte</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F34373%2F6671fd83-7478-489a-8f21-8fd01122b15f.png</url>
      <title>DEV Community: mattalhonte</title>
      <link>https://dev.to/mattalhonte</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattalhonte"/>
    <language>en</language>
    <item>
      <title>Downcast Numerical Data Types with Pandas</title>
      <dc:creator>mattalhonte</dc:creator>
      <pubDate>Mon, 28 Jan 2019 12:30:00 +0000</pubDate>
      <link>https://dev.to/hackersandslackers/downcast-numerical-data-types-with-pandas-d19</link>
      <guid>https://dev.to/hackersandslackers/downcast-numerical-data-types-with-pandas-d19</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NctWN-Br--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res-5.cloudinary.com/hackers-and-slackers/image/upload/f_auto%2Cq_auto/v1/images/codesnippetdatatypes%25402x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NctWN-Br--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res-5.cloudinary.com/hackers-and-slackers/image/upload/f_auto%2Cq_auto/v1/images/codesnippetdatatypes%25402x.jpg" alt="Downcast Numerical Data Types with Pandas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, I had to find a way to reduce the memory footprint of a Pandas DataFrame in order to actually do operations on it.  Here's a trick that came in handy!&lt;/p&gt;

&lt;p&gt;By default, if you read a DataFrame from a file, it'll cast the numerical columns to the &lt;code&gt;float64&lt;/code&gt; type.  This is in keeping with the philosophy behind Pandas and NumPy - by using strict types (instead of normal Python "duck typing"), you can do things a lot faster.  &lt;code&gt;float64&lt;/code&gt; is the most flexible numerical type - it can handle fractions, and it can represent missing values as &lt;code&gt;NaN&lt;/code&gt;.  That lets us read the data into memory and start messing with it.  The downside is that it consumes a lot of memory.&lt;/p&gt;

&lt;p&gt;Now, let's say we want to save memory by manually downcasting each column into the smallest type that can handle its values.  And let's ALSO say that we want to be really, really lazy and don't want to eyeball a bunch of numbers by hand.  And let's say we wanna do this via Method Chaining, because of all the advantages outlined here: &lt;a href="https://tomaugspurger.github.io/method-chaining"&gt;https://tomaugspurger.github.io/method-chaining&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's introduce our example DataFrame.  We'll convert all the values to floats manually because that's what the default is when we read from a file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({
    "stay_float": [0.5, 3.7, 7.5],
    "to_int": [-5, 7, 5],
    "to_uint": [1, 100, 200]}).astype(float)

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;First, let's introduce the workhorse of this exercise - Pandas's &lt;code&gt;to_numeric&lt;/code&gt; function, and its handy optional argument, &lt;code&gt;downcast&lt;/code&gt;.  This will take a numerical type - &lt;code&gt;float&lt;/code&gt;, &lt;code&gt;integer&lt;/code&gt; (not &lt;code&gt;int&lt;/code&gt;), or &lt;code&gt;unsigned&lt;/code&gt; - and then downcast it to the smallest version available.&lt;/p&gt;
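&lt;p&gt;Here's &lt;code&gt;downcast&lt;/code&gt; on its own, with some toy values made up for illustration:&lt;/p&gt;

```python
import pandas as pd

# Toy Series, just for illustration
floats = pd.Series([0.5, 3.7, 7.5])  # defaults to float64
ints = pd.Series([1, 100, 200])      # defaults to int64

# downcast shrinks each Series to the smallest dtype that fits its values
print(pd.to_numeric(floats, downcast="float").dtype)   # float32
print(pd.to_numeric(ints, downcast="integer").dtype)   # int16 (200 doesn't fit in int8)
print(pd.to_numeric(ints, downcast="unsigned").dtype)  # uint8 (no sign bit needed)
```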

&lt;p&gt;Next, let's make a function that checks to see if a column can be downcast from a float to an integer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def float_to_int(ser):
    try:
        int_ser = ser.astype(int)
        if (ser == int_ser).all():
            return int_ser
        else:
            return ser
    except ValueError:
        return ser
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We're using the try/except pattern here because if we try to make a column with &lt;code&gt;NaN&lt;/code&gt; values into an integer column, it'll throw an error.  If it'd otherwise be a good candidate for turning into an integer, we should figure out a value to impute for those missing values - but that'll be different for every column.  Sometimes it'd make sense to use 0, other times the mean or median of the column, or something else entirely.&lt;/p&gt;

&lt;p&gt;I'd also like to direct your attention to Line 4, which has a very useful Pandas pattern - &lt;code&gt;if (ser == int_ser).all()&lt;/code&gt;.  When you do operations on Pandas columns like Equals or Greater Than, you get a new column where the operation was applied element-by-element.  If you're trying to set up a conditional, the interpreter doesn't know what to do with an array containing &lt;code&gt;[True, False, True]&lt;/code&gt; - you have to boil it down to a single value.  So, if you want to check whether two columns are completely equal, you have to call the &lt;code&gt;.all()&lt;/code&gt; method (which has a useful sibling, &lt;code&gt;.any()&lt;/code&gt;) to get a single boolean that can actually control execution.&lt;/p&gt;
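&lt;p&gt;Here's that boil-down in miniature (toy Series, just for illustration):&lt;/p&gt;

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([1, 2, 4])

mask = a == b      # element-wise: [True, True, False]
print(mask.all())  # False - not every element matched
print(mask.any())  # True - at least one element matched
```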

&lt;p&gt;Next, let's make a function that lets us apply a transformation to multiple columns based on a condition.  The &lt;code&gt;assign&lt;/code&gt; method is pretty awesome, and it'd be fun to not have to leave it (or, if we do, to at least replace it with a function we can pipe as part of a chain of transformations to the DataFrame as a whole).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def multi_assign(df, transform_fn, condition):
    df_to_use = df.copy()

    return (df_to_use
        .assign(
            **{col: transform_fn(df_to_use[col])
               for col in condition(df_to_use)})
           )

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;assign&lt;/code&gt; lets us do multiple assignments, so long as we build a dictionary of column names and target values and then unpack it.  Honestly, it'd be easier to skip the helper function and use this syntax directly, except that I'm not aware of a way to access a filterable list of the DataFrame's columns while still "in" the chain.  I suspect future versions of Pandas will add this, since I've read the developers want to support more Method Chaining.  Personally, I find the reduction in Cognitive Load worth it - lots of little modular, lego-piece transformations chained together.  &lt;/p&gt;
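&lt;p&gt;The dictionary-unpacking trick on its own looks like this (made-up columns, just for illustration):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Build the new columns as a dict, then unpack it into a single .assign() call
new_cols = {col: df[col] * 10 for col in df.columns}
result = df.assign(**new_cols)
print(result)  # a and b are now [10, 20] and [30, 40]
```

Note that `assign` returns a new DataFrame - the original `df` is untouched, which is what makes it chainable.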

&lt;p&gt;It also works as a nice foundation for other little helper functions.  So, here's one to turn as many float columns to integers as we can.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def all_float_to_int(df):
    df_to_use = df.copy()
    transform_fn = float_to_int
    condition = lambda x: list(x
                    .select_dtypes(include=["float"])
                    .columns)    

    return multi_assign(df_to_use, transform_fn, condition)

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;See the pattern in action!  We decide on a transformation function, we decide which columns we want to apply it to (we could have a hundred columns, and who wants to write out all those names?), and then we pass both to the &lt;code&gt;multi_assign&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)).dtypes


stay_float float64
to_int int64
to_uint int64
dtype: object

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Cool!  But we didn't actually decrease the size of our DataFrame - a 64-bit integer takes up as many bytes as a 64-bit float, just like how a hundred pounds of feathers weighs as much as a hundred pounds of bricks.  What we did do is make it easier to downcast those columns later.&lt;/p&gt;
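&lt;p&gt;You can verify the feathers-and-bricks claim directly from the dtypes:&lt;/p&gt;

```python
import numpy as np

# int64 and float64 both use 8 bytes per element
print(np.dtype("int64").itemsize)    # 8
print(np.dtype("float64").itemsize)  # 8
# the downcast targets are where the savings come from
print(np.dtype("int8").itemsize)     # 1
print(np.dtype("float32").itemsize)  # 4
```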

&lt;p&gt;Next, let's make a function that takes a subset of the columns, and tries to downcast it to the smallest version that it can.  We've got fairly small values here, so it should get some work done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def downcast_all(df, target_type, inital_type=None):
    #Gotta specify floats, unsigned, or integer
    #If integer, gotta be 'integer', not 'int'
    #Unsigned should look for Ints
    if inital_type is None:
        inital_type = target_type

    df_to_use = df.copy()

    transform_fn = lambda x: pd.to_numeric(x, 
                                downcast=target_type)

    condition = lambda x: list(x
                    .select_dtypes(include=[inital_type])
                    .columns) 

    return multi_assign(df_to_use, transform_fn, condition)

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Same basic pattern as before!  But now we have two arguments - &lt;code&gt;target_type&lt;/code&gt;, which tells us what type family to downcast to, and &lt;code&gt;inital_type&lt;/code&gt;, which selects the columns to consider.  By default, &lt;code&gt;inital_type&lt;/code&gt; is the same as &lt;code&gt;target_type&lt;/code&gt; - with one exception that we'll grab in a second!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)
     .pipe(downcast_all, "float")
     .pipe(downcast_all, "integer")
).dtypes


stay_float float32
to_int int8
to_uint int16
dtype: object

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alright, now we're getting somewhere!  Wonder if we can do even better, though?  That last column has a conspicuous name!  And it has no values lower than 0 - maybe we could save space if we store it as an unsigned integer!  Let's add a pipe to our chain that'll try to downcast certain integers into unsigneds...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)
     .pipe(downcast_all, "float")
     .pipe(downcast_all, "integer")
     .pipe(downcast_all,  
           target_type = "unsigned", 
           inital_type = "integer")
).dtypes


stay_float float32
to_int int8
to_uint uint8
dtype: object

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What do ya know, we can!&lt;/p&gt;

&lt;p&gt;Let's see how much memory we save by doing this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.info(memory_usage='deep')


&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
stay_float 3 non-null float64
to_int 3 non-null float64
to_uint 3 non-null float64
dtypes: float64(3)
memory usage: 152.0 bytes

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;vs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)
     .pipe(downcast_all, "float")
     .pipe(downcast_all, "integer")
     .pipe(downcast_all,  
           target_type = "unsigned", 
           inital_type = "integer")
).info(memory_usage='deep')


&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
stay_float 3 non-null float32
to_int 3 non-null int8
to_uint 3 non-null uint8
dtypes: float32(1), int8(1), uint8(1)
memory usage: 98.0 bytes

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;152 bytes down to 98 - we cut it by more than a third!&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
    </item>
    <item>
      <title>Code Snippet Corner: Using Pandas' Assign Function on Multiple Columns</title>
      <dc:creator>mattalhonte</dc:creator>
      <pubDate>Mon, 28 Jan 2019 12:30:00 +0000</pubDate>
      <link>https://dev.to/hackersandslackers/code-snippet-corner-using-pandas-assign-function-on-multiple-columns-3pj7</link>
      <guid>https://dev.to/hackersandslackers/code-snippet-corner-using-pandas-assign-function-on-multiple-columns-3pj7</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NctWN-Br--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res-5.cloudinary.com/hackers-and-slackers/image/upload/f_auto%2Cq_auto/v1/images/codesnippetdatatypes%25402x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NctWN-Br--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res-5.cloudinary.com/hackers-and-slackers/image/upload/f_auto%2Cq_auto/v1/images/codesnippetdatatypes%25402x.jpg" alt="Code Snippet Corner: Using Pandas' Assign Function on Multiple Columns"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, I had to find a way to reduce the memory footprint of a Pandas DataFrame in order to actually do operations on it.  Here's a trick that came in handy!&lt;/p&gt;

&lt;p&gt;By default, if you read a DataFrame from a file, it'll cast the numerical columns to the &lt;code&gt;float64&lt;/code&gt; type.  This is in keeping with the philosophy behind Pandas and NumPy - by using strict types (instead of normal Python "duck typing"), you can do things a lot faster.  &lt;code&gt;float64&lt;/code&gt; is the most flexible numerical type - it can handle fractions, and it can represent missing values as &lt;code&gt;NaN&lt;/code&gt;.  That lets us read the data into memory and start messing with it.  The downside is that it consumes a lot of memory.&lt;/p&gt;

&lt;p&gt;Now, let's say we want to save memory by manually downcasting each column into the smallest type that can handle its values.  And let's ALSO say that we want to be really, really lazy and don't want to eyeball a bunch of numbers by hand.  And let's say we wanna do this via Method Chaining, because of all the advantages outlined here: &lt;a href="https://tomaugspurger.github.io/method-chaining"&gt;https://tomaugspurger.github.io/method-chaining&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's introduce our example DataFrame.  We'll convert all the values to floats manually because that's what the default is when we read from a file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({
    "stay_float": [0.5, 3.7, 7.5],
    "to_int": [-5, 7, 5],
    "to_uint": [1, 100, 200]}).astype(float)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;First, let's introduce the workhorse of this exercise - Pandas's &lt;code&gt;to_numeric&lt;/code&gt; function, and its handy optional argument, &lt;code&gt;downcast&lt;/code&gt;.  This will take a numerical type - &lt;code&gt;float&lt;/code&gt;, &lt;code&gt;integer&lt;/code&gt; (not &lt;code&gt;int&lt;/code&gt;), or &lt;code&gt;unsigned&lt;/code&gt; - and then downcast it to the smallest version available.&lt;/p&gt;
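&lt;p&gt;Here's &lt;code&gt;downcast&lt;/code&gt; on its own, with some toy values made up for illustration:&lt;/p&gt;

```python
import pandas as pd

# Toy Series, just for illustration
floats = pd.Series([0.5, 3.7, 7.5])  # defaults to float64
ints = pd.Series([1, 100, 200])      # defaults to int64

# downcast shrinks each Series to the smallest dtype that fits its values
print(pd.to_numeric(floats, downcast="float").dtype)   # float32
print(pd.to_numeric(ints, downcast="integer").dtype)   # int16 (200 doesn't fit in int8)
print(pd.to_numeric(ints, downcast="unsigned").dtype)  # uint8 (no sign bit needed)
```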

&lt;p&gt;Next, let's make a function that checks to see if a column can be downcast from a float to an integer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def float_to_int(ser):
    try:
        int_ser = ser.astype(int)
        if (ser == int_ser).all():
            return int_ser
        else:
            return ser
    except ValueError:
        return ser
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We're using the try/except pattern here because if we try to make a column with &lt;code&gt;NaN&lt;/code&gt; values into an integer column, it'll throw an error.  If it'd otherwise be a good candidate for turning into an integer, we should figure out a value to impute for those missing values - but that'll be different for every column.  Sometimes it'd make sense to use 0, other times the mean or median of the column, or something else entirely.&lt;/p&gt;

&lt;p&gt;I'd also like to direct your attention to Line 4, which has a very useful Pandas pattern - &lt;code&gt;if (ser == int_ser).all()&lt;/code&gt;.  When you do operations on Pandas columns like Equals or Greater Than, you get a new column where the operation was applied element-by-element.  If you're trying to set up a conditional, the interpreter doesn't know what to do with an array containing &lt;code&gt;[True, False, True]&lt;/code&gt; - you have to boil it down to a single value.  So, if you want to check whether two columns are completely equal, you have to call the &lt;code&gt;.all()&lt;/code&gt; method (which has a useful sibling, &lt;code&gt;.any()&lt;/code&gt;) to get a single boolean that can actually control execution.&lt;/p&gt;
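&lt;p&gt;Here's that boil-down in miniature (toy Series, just for illustration):&lt;/p&gt;

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([1, 2, 4])

mask = a == b      # element-wise: [True, True, False]
print(mask.all())  # False - not every element matched
print(mask.any())  # True - at least one element matched
```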

&lt;p&gt;Next, let's make a function that lets us apply a transformation to multiple columns based on a condition.  The &lt;code&gt;assign&lt;/code&gt; method is pretty awesome, and it'd be fun to not have to leave it (or, if we do, to at least replace it with a function we can pipe as part of a chain of transformations to the DataFrame as a whole).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def multi_assign(df, transform_fn, condition):
    df_to_use = df.copy()

    return (df_to_use
        .assign(
            **{col: transform_fn(df_to_use[col])
               for col in condition(df_to_use)})
           )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;assign&lt;/code&gt; lets us do multiple assignments, so long as we build a dictionary of column names and target values and then unpack it.  Honestly, it'd be easier to skip the helper function and use this syntax directly, except that I'm not aware of a way to access a filterable list of the DataFrame's columns while still "in" the chain.  I suspect future versions of Pandas will add this, since I've read the developers want to support more Method Chaining.  Personally, I find the reduction in Cognitive Load worth it - lots of little modular, lego-piece transformations chained together.  &lt;/p&gt;
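&lt;p&gt;The dictionary-unpacking trick on its own looks like this (made-up columns, just for illustration):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Build the new columns as a dict, then unpack it into a single .assign() call
new_cols = {col: df[col] * 10 for col in df.columns}
result = df.assign(**new_cols)
print(result)  # a and b are now [10, 20] and [30, 40]
```

Note that `assign` returns a new DataFrame - the original `df` is untouched, which is what makes it chainable.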

&lt;p&gt;It also works as a nice foundation for other little helper functions.  So, here's one to turn as many float columns to integers as we can.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def all_float_to_int(df):
    df_to_use = df.copy()
    transform_fn = float_to_int
    condition = lambda x: list(x
                    .select_dtypes(include=["float"])
                    .columns)    

    return multi_assign(df_to_use, transform_fn, condition)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;See the pattern in action!  We decide on a transformation function, we decide which columns we want to apply it to (we could have a hundred columns, and who wants to write out all those names?), and then we pass both to the &lt;code&gt;multi_assign&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)).dtypes

stay_float float64
to_int int64
to_uint int64
dtype: object
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Cool!  But we didn't actually decrease the size of our DataFrame - a 64-bit integer takes up as many bytes as a 64-bit float, just like how a hundred pounds of feathers weighs as much as a hundred pounds of bricks.  What we did do is make it easier to downcast those columns later.&lt;/p&gt;
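&lt;p&gt;You can verify the feathers-and-bricks claim directly from the dtypes:&lt;/p&gt;

```python
import numpy as np

# int64 and float64 both use 8 bytes per element
print(np.dtype("int64").itemsize)    # 8
print(np.dtype("float64").itemsize)  # 8
# the downcast targets are where the savings come from
print(np.dtype("int8").itemsize)     # 1
print(np.dtype("float32").itemsize)  # 4
```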

&lt;p&gt;Next, let's make a function that takes a subset of the columns, and tries to downcast it to the smallest version that it can.  We've got fairly small values here, so it should get some work done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def downcast_all(df, target_type, inital_type=None):
    #Gotta specify floats, unsigned, or integer
    #If integer, gotta be 'integer', not 'int'
    #Unsigned should look for Ints
    if inital_type is None:
        inital_type = target_type

    df_to_use = df.copy()

    transform_fn = lambda x: pd.to_numeric(x, 
                                downcast=target_type)

    condition = lambda x: list(x
                    .select_dtypes(include=[inital_type])
                    .columns) 

    return multi_assign(df_to_use, transform_fn, condition)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Same basic pattern as before!  But now we have two arguments - &lt;code&gt;target_type&lt;/code&gt;, which tells us what type family to downcast to, and &lt;code&gt;inital_type&lt;/code&gt;, which selects the columns to consider.  By default, &lt;code&gt;inital_type&lt;/code&gt; is the same as &lt;code&gt;target_type&lt;/code&gt; - with one exception that we'll grab in a second!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)
     .pipe(downcast_all, "float")
     .pipe(downcast_all, "integer")
).dtypes

stay_float float32
to_int int8
to_uint int16
dtype: object
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alright, now we're getting somewhere!  Wonder if we can do even better, though?  That last column has a conspicuous name!  And it has no values lower than 0 - maybe we could save space if we store it as an unsigned integer!  Let's add a pipe to our chain that'll try to downcast certain integers into unsigneds...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)
     .pipe(downcast_all, "float")
     .pipe(downcast_all, "integer")
     .pipe(downcast_all,  
           target_type = "unsigned", 
           inital_type = "integer")
).dtypes

stay_float float32
to_int int8
to_uint uint8
dtype: object
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What do ya know, we can!&lt;/p&gt;

&lt;p&gt;Let's see how much memory we save by doing this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.info(memory_usage='deep')

&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
stay_float 3 non-null float64
to_int 3 non-null float64
to_uint 3 non-null float64
dtypes: float64(3)
memory usage: 152.0 bytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;vs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(df
     .pipe(all_float_to_int)
     .pipe(downcast_all, "float")
     .pipe(downcast_all, "integer")
     .pipe(downcast_all,  
           target_type = "unsigned", 
           inital_type = "integer")
).info(memory_usage='deep')

&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
stay_float 3 non-null float32
to_int 3 non-null int8
to_uint 3 non-null uint8
dtypes: float32(1), int8(1), uint8(1)
memory usage: 98.0 bytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;152 bytes down to 98 - we cut it by more than a third!&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
    </item>
  </channel>
</rss>
