Code Sample, a copy-pastable example if possible
```python
import numpy as np
import pandas as pd

path = 'test.h5'
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)

%timeit pd.HDFStore(path).keys()
```
Problem description
The performance of pd.HDFStore().keys() is incredibly slow for a large store containing many dataframes: the code above takes 10.6 s just to get the list of keys in the store.
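As a sketch of how to time this on a machine without the full 3000-dataframe store (the store size and file name below are illustrative, not from the report):

```python
import time

import numpy as np
import pandas as pd

path = 'test_small.h5'  # hypothetical small store so the snippet runs quickly
with pd.HDFStore(path, mode='w') as store:
    for i in range(50):
        store.put('test' + str(i), pd.DataFrame(np.random.rand(10, 5)))

# time how long it takes just to list the keys
start = time.perf_counter()
with pd.HDFStore(path) as store:
    keys = store.keys()
elapsed = time.perf_counter() - start
print(len(keys), 'keys listed in', round(elapsed, 4), 's')
```

With 3000 stored dataframes instead of 50, this is where the multi-second cost shows up.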
It appears the issue is related to the node walk in PyTables, which requires every single node to be loaded just to check whether it is a group.
/tables/file.py:

```python
def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""
    group = self.get_node(where)  # Does the parent exist?  <-- the hotspot
    self._check_group(group)      # Is it a group?
    return group._f_iter_nodes(classname)
```
```
%lprun -f store._handle.iter_nodes store.keys()

Timer unit: 2.56e-07 s

Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998

Line #   Hits    Time  Per Hit  % Time  Line Contents
==============================================================
  1998                                  def iter_nodes(self, where, classname=None):
  1999                                      """Iterate over children nodes hanging from where.
  2000
  2001                                      Parameters
  2002                                      ----------
  2003                                      where
  2004                                          This argument works as in :meth:`File.get_node`, referencing the
  2005                                          node to be acted upon.
  2006                                      classname
  2007                                          If the name of a class derived from
  2008                                          Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                          that class (or subclasses of it) will be returned.
  2010
  2011                                      Notes
  2012                                      -----
  2013                                      The returned nodes are alphanumerically sorted by their name.
  2014                                      This is an iterator version of :meth:`File.list_nodes`.
  2015
  2016                                      """
  2017
  2018   6001  125237     20.9    75.4     group = self.get_node(where)  # Does the parent exist?
  2019   6001   26549      4.4    16.0     self._check_group(group)  # Is it a group?
  2020
  2021   6001   14216      2.4     8.6     return group._f_iter_nodes(classname)
```
Therefore, if the dataframes are large and the store holds many of them, this can take forever (my real-life code takes about a minute). My version of pandas is older, but I don't think this has been fixed in subsequent versions.
Also not sure whether to raise this in pandas or tables.
Output of pd.show_versions()
Comment From: TomAugspurger
Yeah, I looked into this a while back (there may be an open issue, https://github.com/pandas-dev/pandas/issues/16503 is related but different) but didn't come up with a solution.
> Also not sure whether to raise this in pandas or tables.
I'd say see if pytables exposes an API to get the same info without the same overhead. If not then raise an issue there.
Comment From: jreback
closing as out of scope. this is an issue with PyTables and cannot be directly fixed in pandas.
Comment From: MXS2514
I ran into similar problems recently. Besides PyTables, the slowdown may also come from how rich pandas' stored structures are (many levels, tags, and properties). A stored Series should be faster than a DataFrame, but .keys() and len() on a saved HDF file are still not fast enough.

If you just want a dict-like set {'name1': ndarray1, 'name2': ndarray2, ...} saved on disk for later use with good performance for len() and .keys(), then h5py may be enough (0.005 s vs 15 s for .keys()).

You can use ViTables to see how the data is saved and how complex the structure is.
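To illustrate the h5py route from the comment above — a minimal sketch, assuming h5py is installed (the file name and array shapes are made up for the example):

```python
import h5py
import numpy as np

path = 'arrays.h5'  # hypothetical file name

# write a dict-like set of named arrays, as the comment describes
with h5py.File(path, 'w') as f:
    for i in range(100):
        f.create_dataset('name%d' % i, data=np.random.rand(10))

# listing keys only reads the link names in the root group, so it
# avoids instantiating a node object for every child
with h5py.File(path, 'r') as f:
    keys = list(f.keys())
print(len(keys))
```

This is why the h5py listing stays fast even with thousands of entries: no per-node metadata beyond the name is touched.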
Comment From: ben-daghir
Found a temporary workaround (at least on Python 3.6.6): downgrade to pandas 0.20.3 (I've had better overall performance with this version anyway), then use the root attribute and the built-in dir() method:

```python
store = pandas.HDFStore(file)
keys = dir(store.root)
```

Voilà! Good luck.
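A related sketch on the same idea. This relies on an assumption, not an official API: `_v_children` is a private PyTables attribute that appears to be populated with child names when the root group is loaded, so listing it sidesteps the per-node walk. The file name below is illustrative:

```python
import numpy as np
import pandas as pd

path = 'test_names.h5'  # hypothetical file for illustration
with pd.HDFStore(path, mode='w') as store:
    for i in range(20):
        store.put('test' + str(i), pd.DataFrame(np.random.rand(5, 3)))

with pd.HDFStore(path) as store:
    # private PyTables attribute: dict-like mapping of child names
    # under the root group; iterating it yields the names
    names = sorted(store.root._v_children)
print(names[:3])
```

Since this reaches into PyTables internals, it can break across versions; treat it as a diagnostic trick rather than a supported interface.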
Comment From: TomAugspurger
https://github.com/pandas-dev/pandas/pull/21543 improved the performance of .groups. Is .keys still slow?
> I've had better overall performance with this version anyway

Can you see if we have open issues for those slowdowns?