Would it be useful to have a section in the docs discussing:
- how much space pandas objects take up (roughly how rows * columns translates into size; if I have an x GB csv, how big will it be in pandas/HDF5/etc.)
- how quick some standard operations are likely to be (e.g. read_csv/merge/join/etc. vs. data size)
- how these compare to other platforms (?)
Probably distinct from comparing functionality (although that may also be interesting), e.g. like NumPy does for features against MATLAB here: http://wiki.scipy.org/NumPy_for_Matlab_Users. See also #3980.
Comment From: cpcloud
my 2c:
1. i think it's hard to say in general how big pandas objects are, because they aren't always homogeneous. for a homogeneous frame the footprint isn't much different from df.values.nbytes + df.columns.values.nbytes + df.index.values.nbytes
(ignoring the size of other python objects needed for repring and so forth); see the sketch after this list. comparing a GB-order-of-magnitude csv to how big it will be in pandas space doesn't seem that useful, since most sane folks will not be storing files that big as text. if they are, then they should immediately convert to HDF5; even just an npz file would be an improvement.
2. i think this is interesting
3. certain platforms, e.g., matlab, completely fail at everything that pandas succeeds at. in matlab there's a bastard version of DataFrame called dataset that really is just a matrix with some labels and nothing more, and that i wouldn't recommend to my worst enemy. it's horrible, and a comparison would not be worth the time it would take to replicate even a tiny subset of what pandas does. the only comparable platform i can think of is R. i'm sure other people know of others...
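To put a number on point 1, here is a minimal sketch of that back-of-the-envelope estimate for a homogeneous frame (the shape is arbitrary, chosen to match the example further down the thread):

import numpy as np
import pandas as pd

# a single-dtype (float64) frame: 1M rows x 20 columns
df = pd.DataFrame(np.random.randn(1000000, 20))

# rough footprint: the data block plus the two axis arrays
approx = df.values.nbytes + df.index.values.nbytes + df.columns.values.nbytes
print(approx)  # 168000160: 160MB of float64 data + 8MB row index + tiny column index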
Comment From: hayd
(Not sure why I took out a mention of R; it was there, and it was the main one I had in mind :) ). 1. It's true it varies, but it might be useful to give some examples (along with a lack-of-generality warning) for people to get a vague idea. For some example csv (with n rows and n cols): how long it takes to read in, how much space it takes up in memory, how much space it would take up as a pickle, how much space in HDF5, postgres, etc. Something like the sketch below.
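A minimal sketch of that kind of comparison, assuming PyTables is installed for the HDF5 step (the file names and frame shape here are arbitrary):

import os
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 20))

# on-disk footprint of a few formats
df.to_csv('test.csv')
df.to_pickle('test.pkl')
df.to_hdf('test.h5', key='df')  # needs PyTables

for fn in ('test.csv', 'test.pkl', 'test.h5'):
    print(fn, os.path.getsize(fn), 'bytes')

# wall-clock time to read the csv back in
start = time.time()
pd.read_csv('test.csv', index_col=0)
print('read_csv: %.2fs' % (time.time() - start))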
Comment From: hayd
Things like; http://stackoverflow.com/questions/16628329/hdf5-and-sqlite-concurrency-compression-i-o-performance
Comment From: hayd
related: http://stackoverflow.com/questions/17269703/is-there-a-limit-to-the-amount-of-rows-pandas-read-csv-can-load
Comment From: jreback
https://groups.google.com/forum/m/#!topic/pydata/G6Z-SN9SJnY for a conversation about this
Comment From: jreback
can prob add this to the Enhancing Performance section (or maybe it should be renamed to Performance?)
Comment From: hayd
@jreback I kind of think these should be distinct, but not sure what a good name would be.
Comment From: jreback
ok...sure..maybe a new top-level section (or maybe part of FAQ or something)
Comment From: hayd
related: #696, and perf of read_csv from Wes's blog: http://wesmckinney.com/blog/?p=543
Comment From: hayd
http://stackoverflow.com/questions/18089667/pandas-how-to-estimate-how-much-memory-a-dataframe-will-need
Comment From: jreback
Reproducing my answer here (from the above link):
You have to do this in reverse. (The snippets below assume from pandas import DataFrame and from numpy.random import randn.)
In [4]: df = DataFrame(randn(1000000,20)); df.to_csv('test.csv')
In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug 6 16:55 test.csv
In [6]: df.values.nbytes
Out[6]: 160000000
Technically the in-memory size is about this (which includes the indexes):
In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160
So roughly 160MB in memory vs. a 400MB file, for 1M rows of 20 float64 columns.
DataFrame(randn(1000000,20)).to_hdf('test.h5','df')
!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug 6 16:57 test.h5
MUCH more compact when written as a binary HDF5 file
In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')
In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug 6 16:58 test.h5
The data is not that compressible though, as it's random.
With strings (the same string repeated, so maybe a little bogus), the csv file is about 1/2 the size of the float case!
In [26]: df = DataFrame(np.array(['ABCDEFGH']*20*1000000,dtype=object).reshape(1000000,20))
In [29]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[29]: 168000160
In [30]: df.to_csv('test.csv')
In [31]: !ls -ltr test.csv
-rw-rw-r-- 1 users 186888941 Aug 6 17:29 test.csv
In [32]: df.to_hdf('test.h5','df')
In [33]: !ls -ltr test.h5
-rw-rw-r-- 1 users 49166896 Aug 6 17:29 test.h5
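One caveat on the string example: .nbytes for an object-dtype array counts only the 8-byte pointers, not the strings they point to, so the real in-memory footprint is larger than the 168000160 shown. On pandas versions that have DataFrame.memory_usage (0.15+, so after this discussion), a sketch of a more honest measurement:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(['ABCDEFGH'] * 20 * 1000000, dtype=object).reshape(1000000, 20))

# counts only the object pointers plus the index arrays (~168MB)
print(df.values.nbytes + df.index.values.nbytes + df.columns.values.nbytes)

# deep=True also sizes each element's Python object, so this is much larger
# (elements are sized individually, so a shared string is counted repeatedly)
print(df.memory_usage(deep=True).sum())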
Comment From: jreback
Just put this in for perf comparison of IO methods: https://github.com/pydata/pandas/commit/0d79ff883645f0564b2821d2ef7e00720494f477
so partial progress on this
Comment From: jreback
http://stackoverflow.com/questions/24917910/fast-selection-of-a-timestamp-range-in-hierarchically-indexed-pandas-data-in-pyt
Comment From: MarcoGorelli
closing due to lack of activity, and it's not really clear what's needed anymore at this point