Code Sample, a copy-pastable example if possible
import pandas as pd
single_df = pd.read_csv(f, sep="\t", header=0, index_col=0)  # f is the path to the ~5 GB TSV file
Problem description
I am trying to read a 5 GB TSV file into RAM. I reserve 100 GB on the computational cluster I am working on, and the process still gets Killed. Is there a workaround for reading big files?
Expected Output
Output of pd.show_versions()
Comment From: jreback
Since no info was provided, it is very hard to say what you are doing. You can try a newer version of pandas or show a reproducible example.
Comment From: jreback
Typically you start by reading a fraction of the file to see if you are doing it correctly. Then, if it's bigger, move to chunking: http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk
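A minimal sketch of the chunked-reading pattern documented at the link above, assuming a tab-separated file at a hypothetical path big_file.tsv; the per-chunk aggregation is a placeholder for whatever reduction the downstream calculation needs:

import pandas as pd

# Read the TSV in chunks rather than all at once; each chunk is a DataFrame
# of up to `chunksize` rows, so peak memory stays bounded.
reader = pd.read_csv("big_file.tsv", sep="\t", header=0, index_col=0, chunksize=100_000)

results = []
for chunk in reader:
    # Do the per-chunk work here (filtering, aggregation, ...) and keep only
    # the reduced result instead of the full chunk.
    results.append(chunk.sum(numeric_only=True))

# Combine the small per-chunk results into one summary.
summary = pd.DataFrame(results).sum()
print(summary)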
Comment From: AngryMaciek
I have been trying to chunk it into pieces of 1000 lines (with chunksize=1000 it worked fine) and then used pd.concat to get one big DataFrame, which I need for some further calculations, but that failed. How do you want me to show you a reproducible example? Should I upload the 5 GB file somewhere?
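For reference, a sketch of the approach described in this comment (the path is a placeholder). Note that pd.concat materialises the entire file in memory again, so chunked reading followed by concatenation does not reduce the peak footprint; passing narrower dtypes or usecols to read_csv is what shrinks the in-memory size if the full frame is genuinely needed.

import pandas as pd

# Chunked read succeeds, but concatenating all chunks rebuilds the full
# 5 GB DataFrame (plus overhead) in RAM at once.
chunks = pd.read_csv("big_file.tsv", sep="\t", header=0, index_col=0, chunksize=1000)
single_df = pd.concat(chunks)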
Comment From: jreback
@AngryMouser
no you can show df.info()
as well as df.head()
.
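A sketch of how those diagnostics could be produced from a fraction of the file, following the earlier suggestion to read only part of it first; the path and nrows value are assumptions, not from the issue:

import pandas as pd

# Read only the first rows to inspect structure without loading the full file.
sample_df = pd.read_csv("big_file.tsv", sep="\t", header=0, index_col=0, nrows=10_000)

# Shows dtypes, column count, and approximate per-column memory usage.
sample_df.info(memory_usage="deep")
print(sample_df.head())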