Is it possible to get the unique values of a chosen column from an HDF file? It would be a great feature to have this happen on disk.
Comment From: jreback
http://pandas.pydata.org/pandas-docs/stable/io.html#advanced-queries
then .unique()
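For a small file, that approach looks like the following sketch (file and column names are made up). Note the whole column is still read into memory before `.unique()` runs:

```python
import pandas as pd

# Build a small table-format file to query (hypothetical names).
df = pd.DataFrame({"category": ["a", "b", "c"] * 1000,
                   "value": range(3000)})
df.to_hdf("demo.h5", key="data", format="table", data_columns=["category"])

# Pull back only the column of interest, then deduplicate in memory.
col = pd.read_hdf("demo.h5", "data", columns=["category"])
print(sorted(col["category"].unique()))
```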
Comment From: michaelaye
Yeah, I know that, but that's not what I want. I could have a column with 10 million entries but only 100 unique values, and I don't want to spend the memory on that.
Comment From: jreback
You can do it in a loop: chunk-select from the column (`select_column` takes `start`/`stop`), take the unique values, and update a set.
something like this:
```python
result = set()
with pd.get_store(....) as store:
    nrows = store.get_storer(key).nrows
    chunksize = 1000000
    for i in range(nrows // chunksize + 1):
        # column must be an index or data_column of the table
        result |= set(
            store.select_column(key, column,
                                start=i * chunksize,
                                stop=(i + 1) * chunksize).unique().tolist())
```
A `chunksize` argument could be added to `select_column`, if you are interested.
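A runnable version of the recipe above, with made-up file and column names; it uses `pd.HDFStore` directly instead of `pd.get_store`, and `range(0, nrows, chunksize)` so no empty trailing chunk is selected:

```python
import pandas as pd

# Build a small table-format file to scan (hypothetical names).
# The column must be a data_column for select_column to read it.
df = pd.DataFrame({"category": ["x", "y", "z"] * 2000})
df.to_hdf("uniq_demo.h5", key="data", format="table",
          data_columns=["category"])

result = set()
with pd.HDFStore("uniq_demo.h5") as store:
    nrows = store.get_storer("data").nrows
    chunksize = 1000
    # Only one chunk of the column is ever in memory at a time.
    for start in range(0, nrows, chunksize):
        chunk = store.select_column("data", "category",
                                    start=start, stop=start + chunksize)
        result |= set(chunk.unique().tolist())
print(sorted(result))
```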
Comment From: michaelaye
Oh, store supports `with`, nice! :)
So, when using `chunksize` with `select_column`, I would get some kind of reader object, same as for `read_csv`?
Comment From: jreback
`chunksize` needs implementing (but yes, it would return an iterator over an index, I think).
You can do it manually as above (`chunksize` would basically do this).
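A sketch of what such an iterator might look like as a user-side generator; `iter_column` is a hypothetical helper, and the file/column names are made up:

```python
import pandas as pd

def iter_column(store, key, column, chunksize):
    """Yield one column in chunks -- roughly what a chunksize
    argument on select_column might do (hypothetical helper)."""
    nrows = store.get_storer(key).nrows
    for start in range(0, nrows, chunksize):
        yield store.select_column(key, column,
                                  start=start, stop=start + chunksize)

# Usage against a small demo file (made-up names):
pd.DataFrame({"c": list("abba") * 500}).to_hdf(
    "iter_demo.h5", key="k", format="table", data_columns=["c"])

uniques = set()
with pd.HDFStore("iter_demo.h5") as store:
    for chunk in iter_column(store, "k", "c", chunksize=300):
        uniques |= set(chunk.unique())
print(sorted(uniques))
```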
Comment From: michaelaye
What are you doing with `result` / `chunksize` there? A `set` divided by an `int`?
Comment From: michaelaye
Ah, you must mean `nrows / chunksize`. Clever trick with the set, I like it, thanks!
Comment From: jreback
This will use constant memory, which is always nice.
Comment From: jreback
closing as better suited to dask / other formats (and recipe is above)