Pandas ENH: Reduce DF/Series to smallest possible dtype

Hi,

I have been working with a dataset consisting mostly of binary features (each column contains only 1, 0 or NaN). I know there is a category dtype, but it is not usable in some cases, e.g. when you want to store the data in HDF and can't use the table format (e.g. because the tables are too wide). I have datasets with about 50k rows and 7k columns, and storing 0/1 as float64 wastes a lot of memory (int64 can't hold NaNs, and it would be just as big anyway). The in-memory size is about 15 GB, while the reduced version is about 4 GB, and the HDF files are of course much smaller too.

So I use this function to reduce a Series to the smallest possible dtype, which shrinks the dataset by up to 8 times (64 bit -> 8 bit):

import numpy as np
def safely_reduce_dtype(ser):  # pandas.Series or numpy.array  
    orig_dtype = "".join([x for x in ser.dtype.name if x.isalpha()]) # float/int
    mx = 1
    for val in ser.values:
        new_itemsize = np.min_scalar_type(val).itemsize
        if mx < new_itemsize:
            mx = new_itemsize
    new_dtype = orig_dtype + str(mx * 8)
    return new_dtype # or converts the pandas.Series by ser.astype(new_dtype)

It's far from perfect and some edge cases probably remain. It could definitely be enhanced; take this as a first proposal.
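One possible refinement, given only as a rough sketch: instead of calling np.min_scalar_type on every value, look at the minimum and maximum once and pick the first integer width that holds both. The helper name `smallest_safe_dtype` is hypothetical. Stopping at float32, and only using it when every value survives the round trip, is an assumption of this sketch and not part of the function above.

```python
import numpy as np


def smallest_safe_dtype(values):
    """Hypothetical helper: choose a smaller dtype from min/max, not per value."""
    arr = np.asarray(values)
    if arr.size == 0:
        return arr.dtype  # nothing to inspect, keep the dtype as-is
    if arr.dtype.kind == "i":
        lo, hi = int(arr.min()), int(arr.max())
        # Try signed integer widths from narrowest to widest.
        for candidate in (np.int8, np.int16, np.int32, np.int64):
            info = np.iinfo(candidate)
            if info.min <= lo and hi <= info.max:
                return np.dtype(candidate)
    if arr.dtype.kind == "f":
        # Assumption: never go below float32, and only when no value changes.
        with np.errstate(over="ignore"):
            as32 = arr.astype(np.float32)
        if np.array_equal(as32, arr, equal_nan=True):
            return np.dtype(np.float32)
    # Unsigned, object, or non-representable data: leave the dtype alone.
    return arr.dtype


print(smallest_safe_dtype(np.array([1, 0, 1, 0], dtype="int32")))  # int8
print(smallest_safe_dtype(np.array([1.0, 0.0, np.nan])))           # float32
print(smallest_safe_dtype(np.array([0.1, 0.2])))                   # float64
```

The last call stays at float64 because 0.1 and 0.2 are not exactly representable in float32, so narrowing would change the stored values.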

I think it could be added as a utility function, something like pd.to_numeric or pd.to_datetime.

What do you think?

Example

>>> import pandas as pd
>>> serie = pd.Series([1,0,1,0], dtype='int32')
>>> safely_reduce_dtype(serie)
dtype('int8')

>>> float_serie = pd.Series([1,0,1,0])
>>> safely_reduce_dtype(float_serie)
dtype('float8')  # from float64

or when returning ser.astype(new_dtype):

>>> import numpy as np
>>> import pandas as pd

>>> rands = np.random.randint(1,100, 10000)
>>> ser_orig = pd.Series(rands)
>>> ser_reduced = safely_reduce_dtype(ser_orig)
>>> print(ser_orig.memory_usage(), ser_reduced.memory_usage())
80080 10080

Comment From: jreback

you mean like this: https://github.com/pydata/pandas/issues/13352 (this is in 0.19.0, rc coming soon)

In [7]: s = pd.Series([1,0,1,0])

In [8]: s
Out[8]: 
0    1
1    0
2    1
3    0
dtype: int64

In [9]: pd.to_numeric?

In [10]: pd.to_numeric(s, downcast='float')
Out[10]: 
0    1.0
1    0.0
2    1.0
3    0.0
dtype: float32

In [11]: pd.to_numeric(s, downcast='integer')
Out[11]: 
0    1
1    0
2    1
3    0
dtype: int8

in general you won't be able to go below float32. float16 is *barely* supported, and `float8` is nonsensical.
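For a whole DataFrame of 0/1/NaN columns like the one described in the original post, the same downcast can be applied column by column. This is a minimal sketch, assuming a pandas version that has the `downcast` argument (0.19.0 or later) and unique column names. The helper name `downcast_frame` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def downcast_frame(df):
    """Hypothetical helper: return a copy with numeric columns downcast."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Columns containing NaN must stay float; float32 is the floor.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({
    "a": [1, 0, 1, 0],             # int64, no missing values
    "b": [1.0, 0.0, np.nan, 1.0],  # float64 because of the NaN
})
small = downcast_frame(df)
print(small.dtypes)
print(df.memory_usage(index=False).sum(), small.memory_usage(index=False).sum())
```

Column `a` ends up as int8 and column `b` as float32, so a binary column with missing values shrinks by half, not by a factor of eight.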

Comment From: hnykda

Ahh. Sorry, didn't know that.

Comment From: jorisvandenbossche

@hnykda No reason you should already have known it, since it's not yet released :-)