Would it be useful to have a section in the docs discussing:
- how much space pandas objects take up (roughly how rows * columns translates into size; if I have an x GB csv, how big will it be in pandas/HDF5/etc.)
- how quick some standard operations are likely to be (e.g. read_csv/merge/join/etc. vs. data size)
- how these compare to other platforms (?)
Probably distinct from comparing functionality (although that may also be interesting), e.g. like NumPy does for features against MATLAB here: http://wiki.scipy.org/NumPy_for_Matlab_Users. See also #3980.
Comment From: cpcloud
my 2c:
1. i think it's hard to say in general how big pandas objects are, because they aren't always homogeneous. for a homogeneous frame the footprint isn't much different from df.values.nbytes + df.columns.values.nbytes + df.index.values.nbytes
(ignoring the size of other python objects needed for repring and so forth); see the sketch after this list. comparing a GB-order-of-magnitude csv to how big it will be in pandas space doesn't seem that useful, since most sane folks will not be storing files that big as text. if they are, then they should immediately convert to HDF5; even just an npz file would be an improvement.
2. i think this is interesting
3. certain platforms, e.g., matlab, completely fail at everything that pandas succeeds at. in matlab there's a bastard version of DataFrame called dataset that really is just a matrix with some labels and nothing more, and that i wouldn't recommend to my worst enemy. it's horrible, and a comparison would not be worth the time it would take to replicate even a tiny subset of what pandas does. the only comparable platform i can think of is R. i'm sure other people know of others...
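To put a number on point 1, here is a minimal sketch of that back-of-the-envelope estimate for a homogeneous frame (the shape is arbitrary, chosen to match the example further down the thread):

import numpy as np
import pandas as pd

# a single-dtype (float64) frame: 1M rows x 20 columns
df = pd.DataFrame(np.random.randn(1000000, 20))

# rough footprint: the data block plus the two axis arrays
approx = df.values.nbytes + df.index.values.nbytes + df.columns.values.nbytes
print(approx)  # 168000160: 160MB of float64 data + 8MB row index + tiny column index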
Comment From: hayd
(Not sure why I took out a mention of R; it was there, and it was the main one I had in mind :) ). 1. It's true it varies, but it might be useful to give some examples (along with a lack-of-generality warning) for people to get a vague idea. For some example csv (with n rows and n cols): how long it takes to read in, how much space it takes up in memory, how much space it would take up as a pickle, how much space in HDF5, postgres, etc. Something like the sketch below.
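A minimal sketch of that kind of comparison, assuming PyTables is installed for the HDF5 step (the file names and frame shape here are arbitrary):

import os
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 20))

# on-disk footprint of a few formats
df.to_csv('test.csv')
df.to_pickle('test.pkl')
df.to_hdf('test.h5', key='df')  # needs PyTables

for fn in ('test.csv', 'test.pkl', 'test.h5'):
    print(fn, os.path.getsize(fn), 'bytes')

# wall-clock time to read the csv back in
start = time.time()
pd.read_csv('test.csv', index_col=0)
print('read_csv: %.2fs' % (time.time() - start))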
Comment From: hayd
Things like; http://stackoverflow.com/questions/16628329/hdf5-and-sqlite-concurrency-compression-i-o-performance
Comment From: hayd
related: http://stackoverflow.com/questions/17269703/is-there-a-limit-to-the-amount-of-rows-pandas-read-csv-can-load
Comment From: jreback
https://groups.google.com/forum/m/#!topic/pydata/G6Z-SN9SJnY for a conversation about this
Comment From: jreback
can prob add this to the Enhancing Performance section (or maybe it should be renamed to Performance?)
Comment From: hayd
@jreback I kind of think these should be distinct, but not sure what a good name would be.
Comment From: jreback
ok...sure..maybe a new top-level section (or maybe part of FAQ or something)
Comment From: hayd
related: #696, and perf of read_csv from Wes's blog: http://wesmckinney.com/blog/?p=543
Comment From: hayd
http://stackoverflow.com/questions/18089667/pandas-how-to-estimate-how-much-memory-a-dataframe-will-need
Comment From: jreback
Reproducing my answer here (from the above link):
You have to do this in reverse. (The snippets below assume from pandas import DataFrame and from numpy.random import randn.)
In [4]: df = DataFrame(randn(1000000,20)); df.to_csv('test.csv')
In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug 6 16:55 test.csv
In [6]: df.values.nbytes
Out[6]: 160000000
Technically the in-memory size is about this (which includes the indexes):
In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160
So roughly 160MB in memory vs. a 400MB file, for 1M rows of 20 float64 columns.
DataFrame(randn(1000000,20)).to_hdf('test.h5','df')
!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug 6 16:57 test.h5
MUCH more compact when written as a binary HDF5 file
In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')
In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug 6 16:58 test.h5
The data is not that compressible though, as it's random.
With strings (the same string repeated, so maybe a little bogus), the csv file is about 1/2 the size of the float case!
In [26]: df = DataFrame(np.array(['ABCDEFGH']*20*1000000,dtype=object).reshape(1000000,20))
In [29]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[29]: 168000160
In [30]: df.to_csv('test.csv')
In [31]: !ls -ltr test.csv
-rw-rw-r-- 1 users 186888941 Aug 6 17:29 test.csv
In [32]: df.to_hdf('test.h5','df')
In [33]: !ls -ltr test.h5
-rw-rw-r-- 1 users 49166896 Aug 6 17:29 test.h5
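One caveat on the string example: .nbytes for an object-dtype array counts only the 8-byte pointers, not the strings they point to, so the real in-memory footprint is larger than the 168000160 shown. On pandas versions that have DataFrame.memory_usage (0.15+, so after this discussion), a sketch of a more honest measurement:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(['ABCDEFGH'] * 20 * 1000000, dtype=object).reshape(1000000, 20))

# counts only the object pointers plus the index arrays (~168MB)
print(df.values.nbytes + df.index.values.nbytes + df.columns.values.nbytes)

# deep=True also sizes each element's Python object, so this is much larger
# (elements are sized individually, so a shared string is counted repeatedly)
print(df.memory_usage(deep=True).sum())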
Comment From: jreback
Just put this in for perf comparison of IO methods: https://github.com/pydata/pandas/commit/0d79ff883645f0564b2821d2ef7e00720494f477
so partial progress on this
Comment From: jreback
http://stackoverflow.com/questions/24917910/fast-selection-of-a-timestamp-range-in-hierarchically-indexed-pandas-data-in-pyt
Comment From: MarcoGorelli
closing due to lack of activity, and it's not really clear what's needed anymore at this point