Problem
Unless we know the column names in advance, there is no way to prevent `read_csv` from loading all floats as `np.float64` when reading a CSV file with mixed data types. This can put a great (and unnecessary) strain on memory.
Possible solution
If we were able to specify a C parser that outputs `np.float16` or `np.float32` instead of `np.float64`, we could save half of the resulting memory footprint or more. In practice this could look like this:

pd.read_csv('data.csv', float_precision='low')

Or better yet, accepting descriptors such as `'half'`, `'single'` or `'double'`.
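For a rough sense of the savings (a minimal illustration with a made-up column size, not taken from any actual file):

```python
import numpy as np

# One column of 10 million floats: float64 needs 80 MB, float32 half of that.
col64 = np.zeros(10_000_000, dtype=np.float64)
col32 = col64.astype(np.float32)
print(col64.nbytes // 1_000_000)  # 80 (MB)
print(col32.nbytes // 1_000_000)  # 40 (MB)
```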
Alternatives
- The best alternative right now is to avoid reading the whole CSV file up front: read only the first couple of lines, extract the necessary column names, and then pass those columns with a lower float dtype to a second `read_csv` call via the `dtype` argument (a sketch of this workaround follows the list).
- All other alternatives require reading the file in its entirety first, e.g. (incredibly slow):

  df = pd.read_csv('data.csv')
  doubles = df.select_dtypes('float64').columns
  df[doubles] = df[doubles].astype('float32')

- `pd.to_numeric` has a downcast option but will try to convert all columns (see the second sketch after this list).
- `pd.convert_dtypes` will always convert to `float64`.
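A minimal sketch of the first workaround, using the same placeholder `'data.csv'` as above and assuming every column inferred as float64 should be downgraded to float32:

```python
import pandas as pd

# Peek at the first rows only, just to learn which columns the parser would
# load as float64 (the guess can be wrong if these rows are not representative).
preview = pd.read_csv('data.csv', nrows=100)
float_cols = preview.select_dtypes(include='float64').columns

# Re-read the full file, forcing those columns to float32 up front so the
# float64 copies never materialise in memory.
df = pd.read_csv('data.csv', dtype={col: 'float32' for col in float_cols})
```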
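And a sketch of the `pd.to_numeric` route, which only helps after everything has already been read in as float64 (again with the placeholder `'data.csv'`):

```python
import pandas as pd

df = pd.read_csv('data.csv')  # full read: float columns arrive as float64

# downcast='float' picks the smallest float dtype that can hold the values,
# but it works on one Series at a time, so it has to be applied column by
# column after the fact.
for col in df.select_dtypes(include='float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')
```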
API breaking implications
Not sure. I see an option to add another `self.parser.double_converter` in `parsers.pyx`, but I don't speak enough C to know what this would imply.
Or am I missing something? What other ways are there to minimise memory footprint?
Comment From: mroeschke
Thanks for the suggestion, but it appears there's not much appetite for this feature from the community and core devs. Closing, but we can reopen if there's renewed interest.