Problem

Unless we know the column names in advance, there is no way to prevent read_csv from loading every float column as np.float64 when reading a CSV file with mixed data types. This can put a great (and unnecessary) strain on memory.

Possible solution

If we were able to tell the C parser to output np.float16 or np.float32 instead of np.float64, we could cut the resulting memory footprint in half or more. In practice this could look like this:

pd.read_csv('data.csv', float_precision='low')

Or better yet, accepting descriptors such as 'half', 'single' or 'double'.
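
As a rough illustration of the claimed savings (the column name and sizes below are made up, not part of the proposal): a float64 value takes 8 bytes, float32 takes 4 and float16 takes 2, so downcasting a single column already halves its footprint:

import numpy as np
import pandas as pd

# One million float values: ~8 MB as float64, ~4 MB as float32.
n = 1_000_000
df64 = pd.DataFrame({'x': np.random.rand(n)})   # float64 by default
df32 = df64.astype({'x': np.float32})

print(df64['x'].memory_usage(index=False))  # 8000000
print(df32['x'].memory_usage(index=False))  # 4000000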

Alternatives

  • The best alternative right now is to avoid reading the CSV file in full: read only the first couple of lines, extract the relevant column names, and then pass those columns with a lower float type to a second read_csv call via the dtype argument (see the sketch after this list).

  • All other alternatives require reading the file in its entirety, e.g. (incredibly slow):

df = pd.read_csv('data.csv')                 # everything is read as float64 first
doubles = df.select_dtypes('float64').columns
df[doubles] = df[doubles].astype('float32')  # downcast only after the fact
  • pd.to_numeric has a downcast option, but applying it across the frame will try to convert every column.

  • DataFrame.convert_dtypes will always convert to float64.
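
A minimal sketch of the first workaround above, assuming a file called data.csv; the 100-row sample size is arbitrary:

import numpy as np
import pandas as pd

# Peek at a small sample to find which columns come back as float64,
# then re-read the full file with those columns forced down to float32.
sample = pd.read_csv('data.csv', nrows=100)
float_cols = sample.select_dtypes(include='float64').columns
df = pd.read_csv('data.csv', dtype={col: np.float32 for col in float_cols})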

API breaking implications

Not sure. I see an option to add another self.parser.double_converter in parsers.pyx, but I don't know enough C to say what that would imply.

Or am I missing something? What other ways are there to minimise memory footprint?

Comment From: mroeschke

Thanks for the suggestion, but it appears there's not much appetite for this feature from the community and core devs. Closing but can reopen if there's renewed interest