Code Sample, a copy-pastable example if possible
import pandas as pd
single_df = pd.read_csv(f, sep="\t", header=0, index_col=0)  # f is the path to the ~5 GB TSV file
Problem description
I am trying to read a 5 GB TSV file into RAM. I reserve 100 GB on the computational cluster I am working on, and the process still gets Killed. Is there a workaround for reading big files?
Expected Output
Output of pd.show_versions()
Comment From: jreback
Since no info was provided, it is very hard to say what you are doing. You can try a newer version of pandas or show a reproducible example.
Comment From: jreback
Typically you start by reading a fraction of the file to see if you are doing it correctly. Then, if it's bigger, move to chunking: http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk
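A minimal sketch of the chunked-reading pattern documented at the link above, assuming a tab-separated file at a hypothetical path big_file.tsv; the per-chunk aggregation is a placeholder for whatever reduction the downstream calculation needs:

import pandas as pd

# Read the TSV in chunks rather than all at once; each chunk is a DataFrame
# of up to `chunksize` rows, so peak memory stays bounded.
reader = pd.read_csv("big_file.tsv", sep="\t", header=0, index_col=0, chunksize=100_000)

results = []
for chunk in reader:
    # Do the per-chunk work here (filtering, aggregation, ...) and keep only
    # the reduced result instead of the full chunk.
    results.append(chunk.sum(numeric_only=True))

# Combine the small per-chunk results into one summary.
summary = pd.DataFrame(results).sum()
print(summary)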
Comment From: AngryMaciek
I have been trying to chunk it into pieces of 1000 lines (with chunksize=1000 it worked fine) and then used pd.concat to get one big DataFrame, which I need for some further calculations, but that failed. How do you want me to show you a reproducible example? Should I upload the 5 GB file somewhere?
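For reference, a sketch of the approach described in this comment (the path is a placeholder). Note that pd.concat materialises the entire file in memory again, so chunked reading followed by concatenation does not reduce the peak footprint; passing narrower dtypes or usecols to read_csv is what shrinks the in-memory size if the full frame is genuinely needed.

import pandas as pd

# Chunked read succeeds, but concatenating all chunks rebuilds the full
# 5 GB DataFrame (plus overhead) in RAM at once.
chunks = pd.read_csv("big_file.tsv", sep="\t", header=0, index_col=0, chunksize=1000)
single_df = pd.concat(chunks)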
Comment From: jreback
@AngryMouser
no you can show df.info()
as well as df.head()
.
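A sketch of how those diagnostics could be produced from a fraction of the file, following the earlier suggestion to read only part of it first; the path and nrows value are assumptions, not from the issue:

import pandas as pd

# Read only the first rows to inspect structure without loading the full file.
sample_df = pd.read_csv("big_file.tsv", sep="\t", header=0, index_col=0, nrows=10_000)

# Shows dtypes, column count, and approximate per-column memory usage.
sample_df.info(memory_usage="deep")
print(sample_df.head())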