Is it possible to get the unique values of a chosen column from an HDF file? It would be a great feature to have this happen on disk.
Comment From: jreback
http://pandas.pydata.org/pandas-docs/stable/io.html#advanced-queries
then .unique()
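For a small file, that approach looks like the following sketch (file and column names are made up). Note the whole column is still read into memory before `.unique()` runs:

```python
import pandas as pd

# Build a small table-format file to query (hypothetical names).
df = pd.DataFrame({"category": ["a", "b", "c"] * 1000,
                   "value": range(3000)})
df.to_hdf("demo.h5", key="data", format="table", data_columns=["category"])

# Pull back only the column of interest, then deduplicate in memory.
col = pd.read_hdf("demo.h5", "data", columns=["category"])
print(sorted(col["category"].unique()))
```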
Comment From: michaelaye
Yeah, I know that, but that's not what I want. I could have a column with 10 million entries but only 100 unique values, and I don't want to spend the memory on that.
Comment From: jreback
You can do it in a loop: chunk-select from the column (`select_column` takes `start`/`stop`), take the unique values, and update a set.
something like this:
```python
result = set()
with pd.get_store(....) as store:
    nrows = store.get_storer(key).nrows
    chunksize = 1000000
    for i in range(nrows // chunksize + 1):
        # column must be an index or data_column of the table
        result |= set(
            store.select_column(key, column,
                                start=i * chunksize,
                                stop=(i + 1) * chunksize).unique().tolist())
```
A `chunksize` argument could be added to `select_column`, if you are interested.
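A runnable version of the recipe above, with made-up file and column names; it uses `pd.HDFStore` directly instead of `pd.get_store`, and `range(0, nrows, chunksize)` so no empty trailing chunk is selected:

```python
import pandas as pd

# Build a small table-format file to scan (hypothetical names).
# The column must be a data_column for select_column to read it.
df = pd.DataFrame({"category": ["x", "y", "z"] * 2000})
df.to_hdf("uniq_demo.h5", key="data", format="table",
          data_columns=["category"])

result = set()
with pd.HDFStore("uniq_demo.h5") as store:
    nrows = store.get_storer("data").nrows
    chunksize = 1000
    # Only one chunk of the column is ever in memory at a time.
    for start in range(0, nrows, chunksize):
        chunk = store.select_column("data", "category",
                                    start=start, stop=start + chunksize)
        result |= set(chunk.unique().tolist())
print(sorted(result))
```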
Comment From: michaelaye
Oh, store supports `with`, nice! :)
So, when using `chunksize` with `select_column`, I would get some kind of reader object, same as for `read_csv`?
Comment From: jreback
`chunksize` needs implementing (but yes, it would return an iterator over an index, I think).
You can do it manually as above (`chunksize` would basically do this).
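A sketch of what such an iterator might look like as a user-side generator; `iter_column` is a hypothetical helper, and the file/column names are made up:

```python
import pandas as pd

def iter_column(store, key, column, chunksize):
    """Yield one column in chunks -- roughly what a chunksize
    argument on select_column might do (hypothetical helper)."""
    nrows = store.get_storer(key).nrows
    for start in range(0, nrows, chunksize):
        yield store.select_column(key, column,
                                  start=start, stop=start + chunksize)

# Usage against a small demo file (made-up names):
pd.DataFrame({"c": list("abba") * 500}).to_hdf(
    "iter_demo.h5", key="k", format="table", data_columns=["c"])

uniques = set()
with pd.HDFStore("iter_demo.h5") as store:
    for chunk in iter_column(store, "k", "c", chunksize=300):
        uniques |= set(chunk.unique())
print(sorted(uniques))
```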
Comment From: michaelaye
What are you doing with `result` / `chunksize` there? A `set` divided by an `int`?
Comment From: michaelaye
Ah, you must mean `nrows / chunksize`. Clever trick with the set, I like it, thanks!
Comment From: jreback
This will use constant memory, which is always nice.
Comment From: jreback
closing as better suited to dask / other formats (and recipe is above)