Code Sample

import sys

import pandas as pd

path = 'dataset.gz'
df = pd.read_csv(path, sep=';', compression='gzip')
print(sys.getsizeof(df) / 1024 / 1024, " Mb")

Code Output (Problem)

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-13-b36dac541c6a> in <module>
      2 
      3 path = 'dataset.gz'
----> 4 df = pd.read_csv(path, sep = ';', compression = 'gzip')
      5 print(sys.getsizeof(df) / 1024 / 1024, " Mb")

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    452 
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134 
   1135         # May alter columns / col_dict

~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2035     def read(self, nrows=None):
   2036         try:
-> 2037             data = self._reader.read(nrows)
   2038         except StopIteration:
   2039             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas\_libs\parsers.pyx in pandas._libs.parsers._try_int64()

MemoryError: Unable to allocate 1.00 MiB for an array with shape (131072,) and data type int64

Problem description

The pandas read_csv function raises a MemoryError, which shouldn't happen (my computer has 16 GB of RAM). Meanwhile, I tried another Python (3.7.4, conda) with pandas 0.25.1, which works fine; that's currently my workaround.
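
If switching interpreters weren't an option, a lower-memory load may also be possible by streaming the file in chunks and downcasting integer columns as they arrive. A minimal sketch, reusing 'dataset.gz' and the ';' separator from the sample above (the 100,000-row chunk size is an arbitrary choice, not something from the report):

import pandas as pd

# Read the file in chunks so the parser never buffers the whole frame's
# intermediate int64 arrays at once, then shrink integer columns before
# concatenating the pieces.
pieces = []
for chunk in pd.read_csv('dataset.gz', sep=';', compression='gzip',
                         chunksize=100_000):
    for col in chunk.select_dtypes('integer').columns:
        # Downcast each integer column to the smallest type that fits.
        chunk[col] = pd.to_numeric(chunk[col], downcast='integer')
    pieces.append(chunk)

df = pd.concat(pieces, ignore_index=True)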

Expected Output

From my workaround (Python 3.7.4, conda, pandas 0.25.1)

625.6165533065796  Mb
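
As an aside, pandas can also report the frame's footprint directly, which is usually more telling than sys.getsizeof for frames with object columns. A one-line sketch, assuming df was loaded as in the sample above:

print(df.memory_usage(deep=True).sum() / 1024 / 1024, " Mb")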

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.2.final.0
python-bits      : 32
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : fr_FR.cp1252

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : 0.4.0
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

Comment From: SidoShiro

After more research, the error comes from the fact that I mistook the Python version I was using, which was 3.8.2 32-bit. It seems that Windows only gives 2 GB of memory to 32-bit processes.
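
A quick way to confirm which build is actually running, using only the standard library:

import struct
import sys

print(sys.version)               # build string, e.g. '... [MSC v.1916 32 bit (Intel)]'
print(struct.calcsize('P') * 8)  # pointer width in bits: 32 or 64
print(sys.maxsize > 2**32)       # False on a 32-bit build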

The memory error comes from here.

Should a warning pop up when pandas manipulates large files on a 32-bit Python build?
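
For reference, such a check could be as small as the sketch below; warn_if_32bit, the choice of ResourceWarning, and the 500 MB threshold are illustrative assumptions, not existing pandas behavior:

import sys
import warnings

def warn_if_32bit(file_size_bytes):
    # Hypothetical helper: on a 32-bit build (where Windows caps a process
    # at roughly 2 GB of address space), warn before parsing a large file.
    if sys.maxsize <= 2**32 and file_size_bytes > 500 * 1024 ** 2:
        warnings.warn(
            "Reading a large file on a 32-bit Python build; you may hit "
            "the ~2 GB per-process memory limit.",
            ResourceWarning,
        )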

Comment From: mroeschke

Thanks for the suggestion, but given the lack of community or core developer interest in this feature, I don't think it's likely to be implemented, so closing. Can reopen if there's renewed interest.