I had mentioned this at one of the pydata talks. Just to put a couple of ideas down.
Currently the internal structure of a `NDFrame` is a collection of `Block` objects, managed by a `BlockManager`. Each of these is an n-dim array of a single dtype (generally these are the same dim as the parent object, e.g. a `DataFrame` will have 2-d blocks). Certain types are currently treated slightly differently, e.g. `Categorical` and `Sparse`, in that they are always unconsolidated (meaning you can have multiple blocks of the same dtype).
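For illustration, here is a minimal way to peek at those blocks. This uses private pandas internals (the manager attribute is `._mgr` on recent versions, `._data` on older ones), so it is a sketch, not a stable API:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.arange(3),                 # int64 -> one consolidated block
    "b": np.arange(3.0),               # float64 -> a separate block
    "c": pd.Categorical(list("xyz")),  # Categorical -> always its own block
})

# Private API: `_mgr` on recent pandas, `_data` on older releases.
mgr = df._mgr if hasattr(df, "_mgr") else df._data
for blk in mgr.blocks:
    print(type(blk).__name__, blk.dtype, blk.shape)
```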
I had 2 thoughts:
- allow these to be numpy `mmap` objects
- allow these to be a bcolz `carray` (or possibly the entire structure as a `ctable`); a rough sketch of both ideas follows below
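A hypothetical sketch of the two ideas. None of this is pandas API: whether pandas actually avoids copying the memmap depends on version and consolidation behavior, the bcolz calls are from its documented `carray`/`ctable` interface, and the file/`rootdir` names are made up:

```python
import numpy as np
import pandas as pd

# --- idea 1: numpy memmap-backed values ("values.dat" is a made-up name) ---
shape = (1_000_000, 4)
mm = np.memmap("values.dat", dtype="float64", mode="w+", shape=shape)
mm[:] = np.random.default_rng(0).standard_normal(shape)
mm.flush()  # the data now lives on disk

# Re-open read-only and wrap in a DataFrame. copy=False *asks* pandas not to
# copy, but consolidation or dtype handling may still force one, so this is
# illustrative of the idea rather than a guaranteed out-of-core frame.
ro = np.memmap("values.dat", dtype="float64", mode="r", shape=shape)
df = pd.DataFrame(ro, columns=list("abcd"), copy=False)

# --- idea 2: bcolz carray / ctable (rootdir names are made up) ---
import bcolz

ca = bcolz.carray(np.arange(1_000_000), rootdir="col_a.bcolz", mode="w")
ca.flush()  # chunked, compressed, disk-backed column

ct = bcolz.ctable.fromdataframe(df.head(1000), rootdir="frame.bcolz", mode="w")
ct.flush()
```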
In theory this should allow transparent, seamless serialization/deserialization to disk. Hence you could have a virtual `DataFrame` larger than actual memory.
I am not sure how feasible this is in practice, but it might be worth trying out.
Comment From: jreback
we need to refactor the `BlockManager` first; closing.