Hi,
I am currently working with a dataset consisting mostly of binary features (each column contains only 1, 0 or NaN). I know there is a category dtype, but it is not usable in some cases, e.g. when you want to store the data in HDF and can't use the tables format (e.g. because the tables are too wide). My datasets have about 50k rows and 7k columns, and storing 0/1 as float64 wastes a lot of memory (float64 is needed because int64 can't hold NaN, but it is still terribly big). The in-memory size is about 15 GB, while the reduced version is about 4 GB, and the HDF files are of course much smaller.
So I use this function to reduce a Series to the smallest possible dtype, which shrinks the dataset by up to 8 times (from 64 bits to 8 bits per value):
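import numpy as np

def safely_reduce_dtype(ser):  # pandas.Series or numpy.array
    # Keep only the letters of the dtype name, e.g. "float64" -> "float"
    orig_dtype = "".join([x for x in ser.dtype.name if x.isalpha()])  # float/int
    mx = 1  # largest item size (in bytes) needed so far
    # np.asarray works for both a Series and a plain numpy array
    for val in np.asarray(ser):
        # smallest dtype that can hold this single value
        new_itemsize = np.min_scalar_type(val).itemsize
        if mx < new_itemsize:
            mx = new_itemsize
    new_dtype = orig_dtype + str(mx * 8)
    return new_dtype  # or convert the pandas.Series with ser.astype(new_dtype)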
It's far from perfect, and some edge cases will probably occur. It could definitely be improved; take this as a first proposal.
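One edge case, for instance: np.min_scalar_type picks unsigned types for non-negative values, so a value like 200 fits in one byte as uint8 but not as int8. A minimal sketch of how the integer case could be handled from the Series min/max instead of a per-element loop is below. The name reduce_int_dtype is made up for illustration, and it assumes a plain numpy integer dtype without missing values.

```python
import numpy as np
import pandas as pd

def reduce_int_dtype(ser):
    """Hypothetical helper: smallest signed integer dtype holding every value."""
    if not np.issubdtype(ser.dtype, np.integer):
        raise TypeError("expected a Series with a numpy integer dtype")
    if len(ser) == 0:
        return np.dtype("int8")
    lo, hi = int(ser.min()), int(ser.max())
    # Try the signed integer types from narrowest to widest.
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    return ser.dtype  # nothing narrower fits, keep the original dtype

small = pd.Series([1, 0, 1, 0], dtype="int32")
wide = pd.Series([0, 200])
print(reduce_int_dtype(small))  # int8
print(reduce_int_dtype(wide))   # int16, because 200 does not fit in int8
```

The conversion itself would then be ser.astype(reduce_int_dtype(ser)).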
I think it could be added as a utility function, something like pd.to_numeric or pd.to_datetime.
What do you think?
Example
>>> import pandas as pd
>>> serie = pd.Series([1,0,1,0], dtype='int32')
>>> safely_reduce_dtype(serie)
dtype('int8')
>>> float_serie = pd.Series([1,0,1,0])
>>> safely_reduce_dtype(float_serie)
dtype('float8') # from float64
or when returning ser.astype(new_dtype):
>>> import numpy as np
>>> import pandas as pd
>>> rands = np.random.randint(1,100, 10000)
>>> ser_orig = pd.Series(rands)
>>> ser_reduced = safely_reduce_dtype(ser_orig)
>>> print(ser_orig.memory_usage(), ser_reduced.memory_usage())
80080 10080
Comment From: jreback
you mean like this: https://github.com/pydata/pandas/issues/13352 (this is in 0.19.0, rc coming soon)
In [7]: s = pd.Series([1,0,1,0])
In [8]: s
Out[8]:
0 1
1 0
2 1
3 0
dtype: int64
In [9]: pd.to_numeric?
In [10]: pd.to_numeric(s, downcast='float')
Out[10]:
0 1.0
1 0.0
2 1.0
3 0.0
dtype: float32
In [11]: pd.to_numeric(s, downcast='integer')
Out[11]:
0 1
1 0
2 1
3 0
dtype: int8
In general you won't be able to go below float32. float16 is barely supported, and float8 is nonsensical.
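For the binary-features-with-NaN case from the original post, a rough sketch of applying the same downcast column by column could look like this. The frame and its column names are made up for illustration, and it assumes a pandas version that has the downcast argument of pd.to_numeric.

```python
import numpy as np
import pandas as pd

# Toy frame of 0/1 features with missing values (hypothetical column names).
df = pd.DataFrame({
    "feat_a": [1.0, 0.0, np.nan, 1.0],
    "feat_b": [0.0, 0.0, 1.0, np.nan],
})

# Downcast every float64 column independently; NaN forces a float dtype,
# so float32 is the smallest type to_numeric will pick here.
reduced = df.apply(pd.to_numeric, downcast="float")

print(reduced.dtypes)
# Bytes used by the data only (index excluded), before and after.
print(df.memory_usage(index=False).sum(),
      reduced.memory_usage(index=False).sum())  # 64 32
```

That halves the footprint compared to float64; the NaN positions are preserved.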
Comment From: hnykda
Ahh. Sorry, didn't know that.
Comment From: jorisvandenbossche
@hnykda No reason you should already have known it, since it's not yet released :-)