Using Python 3.5.2 and Pandas 0.18.1

I find that sampling a large HDF5 file using the nrows parameter takes far more RAM and time than sampling the first n rows by pulling the first n indices using a where statement.

Minimal Code Example:


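import time

import numpy as np
import pandas as pd

store = pd.HDFStore('data_path/hdf_store.h5')
select_indices = np.arange(1000)

# Time selecting the first 1000 rows by index via a where clause
time1 = time.time()
train_date_hdf = store.select('table_name', where=select_indices)
time2 = time.time()
print(time2 - time1)

# Time the same sample requested via nrows
time1 = time.time()
train_date_hdf2 = pd.read_hdf(store, 'table_name', nrows=1000)
time2 = time.time()
print(time2 - time1)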

Output is:

0.10559678077697754
84.43739724159241

In addition to taking much longer, sampling this large table with the nrows parameter spikes RAM usage from 20% to 97% on my machine, which has 8GB of RAM. Selecting the same 1000 rows by index with a where clause uses less than 1% of my RAM.

Would there be a downside to having nrows=n be implemented by pulling the first n indices in a where statement by default?

Comment From: jreback

nrows is not a valid parameter; see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_hdf.html?highlight=read_hdf

So you are reading the entire table.

Comment From: MaxPowerWasTaken

Thanks, my mistake.

Comment From: jreback

Though if you want to open an issue (and, even better, submit a PR) for removing the kwargs, that would be great.

Comment From: MaxPowerWasTaken

Interesting, you're saying there's no legitimate reason for read_hdf to accept kwargs?

If so I can definitely open that issue, and in six days I could look into taking a PR for it if no one else jumps on it.

Comment From: jreback

No, these could all be named parameters. It might need a touch of refactoring, but it should work.

Comment From: MaxPowerWasTaken

Other than nrows, what else would you recommend become a named parameter?

Comment From: jreback

You misunderstand. I wouldn't add any parameters other than those which already exist.

nrows is not needed, as you already have chunksize, start, and stop.
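
For reference, here is a minimal sketch of reading the first 1000 rows with the existing parameters instead of nrows. It assumes PyTables is installed and that the data was written in table format; the file path, key, and column names are made up for illustration, and the example builds its own small store rather than using the one from the original post.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small throwaway table-format store (hypothetical path and key).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example_store.h5")
df = pd.DataFrame({"a": np.arange(10000), "b": np.arange(10000) * 2.0})
df.to_hdf(path, key="table_name", format="table")

# Read only rows [0, 1000) using the start/stop parameters.
first_rows = pd.read_hdf(path, key="table_name", start=0, stop=1000)

# Alternatively, iterate over the table in chunks and take the first chunk.
with pd.HDFStore(path, mode="r") as store:
    first_chunk = next(iter(store.select("table_name", chunksize=1000)))
```

Both approaches read a bounded slice of the table rather than the whole thing, which is what the original nrows=1000 call was meant to do.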

Comment From: MaxPowerWasTaken

Ok thanks, I understand now. Can I ask why you'd prefer they be named parameters rather than keyword parameters? And would you like the same for the other pd.read_ functions?

Comment From: jreback

Named parameters are keyword parameters. All the other read_* functions already do it this way.
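
To illustrate the difference being discussed, here is a rough sketch in plain Python. The two reader functions are hypothetical stand-ins, not the actual pandas implementation: one accepts **kwargs and silently ignores an unrecognized nrows, the other declares its parameters by name and rejects it.

```python
def read_with_kwargs(path, key=None, **kwargs):
    """Hypothetical reader that accepts arbitrary keyword arguments."""
    start = kwargs.get("start")
    stop = kwargs.get("stop")
    # Keys that are never looked up (such as nrows) are silently ignored.
    return {"path": path, "key": key, "start": start, "stop": stop}


def read_with_named(path, key=None, start=None, stop=None):
    """Hypothetical reader whose options are all explicit named parameters."""
    return {"path": path, "key": key, "start": start, "stop": stop}


# The unsupported argument is swallowed, so the "whole table" would be read.
loose = read_with_kwargs("store.h5", "table_name", nrows=1000)

# With named parameters, the same call fails immediately with a TypeError.
try:
    read_with_named("store.h5", "table_name", nrows=1000)
    typo_caught = False
except TypeError:
    typo_caught = True
```

With explicit named parameters, the mistake in the original post would have surfaced as an error right away instead of as an unexpectedly slow, memory-hungry read.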