Code Sample
import pandas as pd
import sys

path = 'dataset.gz'
# Semicolon-delimited CSV, gzip-compressed.
df = pd.read_csv(path, sep=';', compression='gzip')
print(sys.getsizeof(df) / 1024 / 1024, " Mb")
Code Output (Problem)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-13-b36dac541c6a> in <module>
2
3 path = 'dataset.gz'
----> 4 df = pd.read_csv(path, sep = ';', compression = 'gzip')
5 print(sys.getsizeof(df) / 1024 / 1024, " Mb")
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict
~\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas\_libs\parsers.pyx in pandas._libs.parsers._try_int64()
MemoryError: Unable to allocate 1.00 MiB for an array with shape (131072,) and data type int64
Problem description
The pandas read_csv function raises a MemoryError, which shouldn't happen (my machine has 16 GB of RAM).
Meanwhile, another Python (conda, 3.7.4) with pandas 0.25.1 reads the same file without problems; that is currently my workaround.
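As an aside, a common way to bound peak memory when parsing large CSVs is to read them in chunks. Here is a minimal sketch; the 100,000-row chunk size and the row-counting step are arbitrary illustrations, not part of the original report:

import pandas as pd

path = 'dataset.gz'
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk needs to be held in memory at a time.
chunks = pd.read_csv(path, sep=';', compression='gzip', chunksize=100_000)
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)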
Expected Output
From my workaround (Python 3.7.4, conda, pandas 0.25.1):
625.6165533065796 Mb
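Note that sys.getsizeof only approximates a DataFrame's footprint; pandas' own DataFrame.memory_usage(deep=True) gives a per-column accounting that also counts object-dtype string contents, e.g.:

# Sum per-column byte counts (deep=True includes string data).
size_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"{size_mb:.1f} MB")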
Output of pd.show_versions()
Comment From: SidoShiro
After more research, the error comes from the fact that I mistook which Python version I was using: it was 3.8.2, 32-bit. It seems that Windows only gives about 2 GB of address space to 32-bit processes.
The memory error comes from here.
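For anyone hitting the same issue, a quick standard-library check confirms which build is running:

import struct
import sys

print(sys.version)               # full interpreter version string
print(struct.calcsize('P') * 8)  # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)       # True only on a 64-bit build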
Should a warning pop up when pandas manipulates large files on a 32-bit Python build?
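For illustration only, such a check might look something like the sketch below; the helper name, the 500 MB threshold, and where it would hook into read_csv are all hypothetical, not existing pandas code:

import os
import struct
import warnings

_WARN_THRESHOLD_BYTES = 500 * 1024 * 1024  # hypothetical cutoff

def _maybe_warn_32bit(path):
    # Warn when a large file is parsed on a 32-bit interpreter (sketch).
    is_32bit = struct.calcsize('P') * 8 == 32
    if is_32bit and os.path.getsize(path) > _WARN_THRESHOLD_BYTES:
        warnings.warn(
            "Reading a large file on 32-bit Python; the process is limited "
            "to ~2 GB of address space and may raise MemoryError.",
            ResourceWarning,
        )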
Comment From: mroeschke
Thanks for the suggestion, but given the lack of community or core developer interest in this feature, I don't think it's likely to be implemented, so closing. Can reopen if there's renewed interest.