I had mentioned this at one of the pydata talks. Just to put a couple of ideas down.
Currently the internal structure of a `NDFrame` is a collection of `Block` objects, managed by a `BlockManager`. Each of these is an n-dim array of a single dtype (generally these are the same dim as the parent object, e.g. a `DataFrame` will have 2-d blocks). Certain types are currently treated slightly differently, e.g. `Categorical` and `Sparse`, in that they are always unconsolidated (meaning you can have multiple blocks of the same dtype).
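For illustration, here is a minimal way to peek at those blocks. This uses private pandas internals (the manager attribute is `._mgr` on recent versions, `._data` on older ones), so it is a sketch, not a stable API:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.arange(3),                 # int64 -> one consolidated block
    "b": np.arange(3.0),               # float64 -> a separate block
    "c": pd.Categorical(list("xyz")),  # Categorical -> always its own block
})

# Private API: `_mgr` on recent pandas, `_data` on older releases.
mgr = df._mgr if hasattr(df, "_mgr") else df._data
for blk in mgr.blocks:
    print(type(blk).__name__, blk.dtype, blk.shape)
```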
I had 2 thoughts:
- allow these to be numpy `mmap` objects
- allow these to be a bcolz `carray` (or possibly the entire structure as a `ctable`); a rough sketch of both ideas follows below
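A hypothetical sketch of the two ideas. None of this is pandas API: whether pandas actually avoids copying the memmap depends on version and consolidation behavior, the bcolz calls are from its documented `carray`/`ctable` interface, and the file/`rootdir` names are made up:

```python
import numpy as np
import pandas as pd

# --- idea 1: numpy memmap-backed values ("values.dat" is a made-up name) ---
shape = (1_000_000, 4)
mm = np.memmap("values.dat", dtype="float64", mode="w+", shape=shape)
mm[:] = np.random.default_rng(0).standard_normal(shape)
mm.flush()  # the data now lives on disk

# Re-open read-only and wrap in a DataFrame. copy=False *asks* pandas not to
# copy, but consolidation or dtype handling may still force one, so this is
# illustrative of the idea rather than a guaranteed out-of-core frame.
ro = np.memmap("values.dat", dtype="float64", mode="r", shape=shape)
df = pd.DataFrame(ro, columns=list("abcd"), copy=False)

# --- idea 2: bcolz carray / ctable (rootdir names are made up) ---
import bcolz

ca = bcolz.carray(np.arange(1_000_000), rootdir="col_a.bcolz", mode="w")
ca.flush()  # chunked, compressed, disk-backed column

ct = bcolz.ctable.fromdataframe(df.head(1000), rootdir="frame.bcolz", mode="w")
ct.flush()
```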
In theory this should allow transparent, seamless serialization/deserialization to disk. Hence you could have a virtual `DataFrame` larger than actual memory.
I am not sure how feasible this is in practice, but it might be worth trying out.
Comment From: jreback
we need to refactor the `BlockManager` first; closing.