Code Sample, a copy-pastable example if possible

import pandas as pd, numpy as np
path = 'test.h5'
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)
store = pd.HDFStore(path)
%timeit store.keys()

Problem description

The performance of pd.HDFStore().keys() is incredibly slow for a large store containing many dataframes: the code above takes 10.6 s just to get the list of keys in the store.

It appears the issue is related to the path walk in PyTables, which requires every single node to be loaded just to check whether it is a group.

/tables/file.py

def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""

    group = self.get_node(where)  # Does the parent exist?
    self._check_group(group)  # Is it a group?

    return group._f_iter_nodes(classname)
%lprun -f store._handle.iter_nodes store.keys()
Timer unit: 2.56e-07 s
Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                               def iter_nodes(self, where, classname=None):
  1999                                                   """Iterate over children nodes hanging from where.
  2000                                           
  2001                                                   Parameters
  2002                                                   ----------
  2003                                                   where
  2004                                                       This argument works as in :meth:`File.get_node`, referencing the
  2005                                                       node to be acted upon.
  2006                                                   classname
  2007                                                       If the name of a class derived from
  2008                                                       Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                       that class (or subclasses of it) will be returned.
  2010                                           
  2011                                                   Notes
  2012                                                   -----
  2013                                                   The returned nodes are alphanumerically sorted by their name.
  2014                                                   This is an iterator version of :meth:`File.list_nodes`.
  2015                                           
  2016                                                   """
  2017                                           
  2018      6001       125237     20.9     75.4          group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0          self._check_group(group)  # Is it a group?
  2020                                           
  2021      6001        14216      2.4      8.6          return group._f_iter_nodes(classname)

Therefore, if the dataframes are large and there are many of them in one store, this can take a very long time (my real-life code takes about a minute to do this). My version of pandas is older, but I don't think this has been fixed in subsequent versions.
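As a possible workaround, the top-level node names can be read directly from the root group without instantiating every node. This is only a sketch: it relies on the private `store._handle` attribute and PyTables' `Group._v_children` mapping, neither of which is a stable public API.

```python
import numpy as np
import pandas as pd

path = "demo.h5"

# Build a small store for illustration (the real case has thousands of keys).
with pd.HDFStore(path, mode="w") as store:
    for i in range(5):
        store.put("test" + str(i), pd.DataFrame(np.random.rand(3, 2)))

with pd.HDFStore(path) as store:
    slow_keys = store.keys()  # walks and loads every node
    # Reading the root group's child names avoids loading each node.
    fast_keys = ["/" + name for name in store._handle.root._v_children]

print(sorted(slow_keys) == sorted(fast_keys))
```

Since `_v_children` is an internal detail, this may break across PyTables versions; it is only meant to show that the key names are cheaply available at the root group.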

Also not sure whether to raise this in pandas or tables.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

Comment From: TomAugspurger

Yeah, I looked into this a while back (there may be an open issue, https://github.com/pandas-dev/pandas/issues/16503 is related but different) but didn't come up with a solution.

Also not sure whether to raise this in pandas or tables.

I'd say see if PyTables exposes an API to get the same info without the same overhead. If not, then raise an issue there.

Comment From: jreback

Closing as out of scope. This is an issue with PyTables and cannot be directly fixed in pandas.

Comment From: MXS2514

I ran into similar problems recently. Besides PyTables, this may also be caused by how rich pandas' on-disk structures are (many levels, tags, and properties). A Series should be faster than a DataFrame, but it is still not fast enough when calling .keys() or len() on a saved HDF file.

If you just want a dict-like set of arrays {'name1': ndarray1, 'name2': ndarray2, ...} saved on disk and used later with good performance for len() and .keys(), maybe h5py is enough (0.005 s vs 15 s for .keys()).
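A minimal sketch of that h5py approach (assumes h5py is installed; the file and dataset names are illustrative). Listing keys in h5py only reads group metadata, so it stays fast even with thousands of datasets:

```python
import numpy as np
import h5py

path = "arrays.h5"

# Write a few named arrays, dict-style.
with h5py.File(path, "w") as f:
    for i in range(3):
        f.create_dataset("name" + str(i), data=np.random.rand(4))

# Listing the top-level names does not read any array data.
with h5py.File(path, "r") as f:
    names = sorted(f.keys())

print(names)  # → ['name0', 'name1', 'name2']
```

The trade-off is that h5py stores plain arrays, so you lose the pandas metadata (indexes, dtypes per column) that HDFStore writes alongside each key.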

You can use ViTables to see how the data is saved and how complex the structure is.

Comment From: ben-daghir

Found a temporary workaround (at least for Python 3.6.6): downgrade to pandas 0.20.3 (I've had better overall performance with this version anyway), then use the root attribute and the built-in dir() method:

store = pandas.HDFStore(file)
keys = dir(store.root)

Voilà! Good luck.

Comment From: TomAugspurger

https://github.com/pandas-dev/pandas/pull/21543 improved the performance of .groups. Is .keys still slow?

I've had better overall performance with this version anyway

Can you see if we have open issues for those slowdowns?