Code Sample
import pandas as pd
import sys

path = 'dataset.gz'
# Semicolon-delimited CSV, gzip-compressed.
df = pd.read_csv(path, sep=';', compression='gzip')
print(sys.getsizeof(df) / 1024 / 1024, " Mb")
Code Output (Problem)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-13-b36dac541c6a> in <module>
2
3 path = 'dataset.gz'
----> 4 df = pd.read_csv(path, sep = ';', compression = 'gzip')
5 print(sys.getsizeof(df) / 1024 / 1024, " Mb")
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas\_libs\parsers.pyx in pandas._libs.parsers._try_int64()
MemoryError: Unable to allocate 1.00 MiB for an array with shape (131072,) and data type int64
Problem description
The pandas read_csv function raises a MemoryError, which shouldn't happen (my machine has 16 GB of RAM).
Meanwhile, another Python (conda, 3.7.4) with pandas 0.25.1 reads the same file without problems; that is currently my workaround.
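As an aside, a common way to bound peak memory when parsing large CSVs is to read them in chunks. Here is a minimal sketch; the 100,000-row chunk size and the row-counting step are arbitrary illustrations, not part of the original report:

import pandas as pd

path = 'dataset.gz'
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk needs to be held in memory at a time.
chunks = pd.read_csv(path, sep=';', compression='gzip', chunksize=100_000)
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)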
Expected Output
From my workaround (Python 3.7.4, conda, pandas 0.25.1):
625.6165533065796 Mb
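Note that sys.getsizeof only approximates a DataFrame's footprint; pandas' own DataFrame.memory_usage(deep=True) gives a per-column accounting that also counts object-dtype string contents, e.g.:

# Sum per-column byte counts (deep=True includes string data).
size_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"{size_mb:.1f} MB")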
Output of pd.show_versions()
Comment From: SidoShiro
After more research, the error comes from the fact that I mistook which Python version I was using: it was 3.8.2, 32-bit. It seems that Windows only gives about 2 GB of address space to 32-bit processes.
The memory error comes from here.
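For anyone hitting the same issue, a quick standard-library check confirms which build is running:

import struct
import sys

print(sys.version)               # full interpreter version string
print(struct.calcsize('P') * 8)  # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)       # True only on a 64-bit build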
Should a warning pop up when pandas manipulates large files on a 32-bit Python build?
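For illustration only, such a check might look something like the sketch below; the helper name, the 500 MB threshold, and where it would hook into read_csv are all hypothetical, not existing pandas code:

import os
import struct
import warnings

_WARN_THRESHOLD_BYTES = 500 * 1024 * 1024  # hypothetical cutoff

def _maybe_warn_32bit(path):
    # Warn when a large file is parsed on a 32-bit interpreter (sketch).
    is_32bit = struct.calcsize('P') * 8 == 32
    if is_32bit and os.path.getsize(path) > _WARN_THRESHOLD_BYTES:
        warnings.warn(
            "Reading a large file on 32-bit Python; the process is limited "
            "to ~2 GB of address space and may raise MemoryError.",
            ResourceWarning,
        )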
Comment From: mroeschke
Thanks for the suggestion, but given the lack of community or core developer interest in this feature, I don't think it's likely to be implemented, so closing. Can reopen if there's renewed interest.