Problem
Unless we know the column names (and their dtypes) in advance, there is no way to prevent read_csv from loading every float column as np.float64 when reading a CSV file with mixed data types. This can put a large (and unnecessary) strain on memory.
Possible solution
If we could specify a C parser that outputs np.float16 or np.float32 instead of np.float64, we could cut the resulting memory footprint by half or more. In practice this could look like:
pd.read_csv('data.csv', float_precision='low')
Or better yet, accepting descriptors such as 'half', 'single' or 'double'.
Alternatives
- The best alternative right now is to avoid reading in the CSV file completely: read only the first couple of lines, extract the necessary column names, and then specify these as a lower float type in a second read_csv call using the dtype argument (see the sketch after this list).
- All other alternatives require reading in the file in its entirety, e.g. (incredibly slow):
df = pd.read_csv('data.csv')
doubles = df.select_dtypes('float64').columns
df[doubles] = df[doubles].astype('float32')
- pd.to_numeric has a downcasting option but will try to convert all columns.
- pd.convert_dtypes will always convert to float64.
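A minimal sketch of the two-pass workaround from the first bullet, assuming a hypothetical data.csv in which every column inferred as float64 should be read as float32:

import pandas as pd

# First pass: parse only a few rows to learn which columns pandas infers as float64
preview = pd.read_csv('data.csv', nrows=5)
float_cols = preview.select_dtypes('float64').columns

# Second pass: read the full file, requesting float32 for those columns via the dtype argument
dtypes = {col: 'float32' for col in float_cols}
df = pd.read_csv('data.csv', dtype=dtypes)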
API breaking implications
Not sure. I see an option to add another self.parser.double_converter in parsers.pyx but I don't speak enough C to know what this would imply.
Or am I missing something? What other ways are there to minimise memory footprint?
Comment From: mroeschke
Thanks for the suggestion, but it appears there's not much appetite for this feature from the community and core devs. Closing but can reopen if there's renewed interest