Using Python 3.5.2 and Pandas 0.18.1
I find that sampling a large HDF5 file using the nrows parameter takes far more RAM and time than sampling the first n rows by pulling the first n indices using a where statement.
Minimal Code Example:
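import time
import numpy as np
import pandas as pd

store = pd.HDFStore('data_path/hdf_store.h5')
select_indices = np.arange(1000)

# Read the first 1000 rows by passing row indices to `where`
time1 = time.time()
train_date_hdf = store.select('table_name', where=select_indices)
time2 = time.time()
print(time2 - time1)

# Attempt to read the first 1000 rows with nrows
time1 = time.time()
train_date_hdf2 = pd.read_hdf(store, 'table_name', nrows=1000)
time2 = time.time()
print(time2 - time1)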
Output is:
0.10559678077697754
84.43739724159241
In addition to taking much longer, the nrows call spikes RAM usage on my 8GB machine from 20% to 97% for this small sample of a large table, whereas selecting the same 1000 rows by index with a where clause uses less than 1% of my RAM.
Would there be a downside to having nrows=n be implemented by pulling the first n indices in a where statement by default?
Comment From: jreback
nrows is not a valid parameter; http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_hdf.html?highlight=read_hdf
so you are reading the entire table
Comment From: MaxPowerWasTaken
Thanks, my mistake.
Comment From: jreback
though if you want to open an issue (and even better, submit a PR) for removing the kwargs, that would be great
Comment From: MaxPowerWasTaken
Interesting, you're saying there's no legitimate reason for read_hdf to accept kwargs?
If so I can definitely open that issue, and in six days I could look into taking a PR for it if no one else jumps on it.
Comment From: jreback
no, these could all be named parameters; it might need a touch of refactoring but should work
Comment From: MaxPowerWasTaken
Other than nrows, what else would you recommend become a named parameter?
Comment From: jreback
you misunderstand; I wouldn't have any other parameters except for those which already exist
nrows is not needed as you already have chunksize, start, and stop
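For reference, a minimal sketch of reading only the first n rows with the existing parameters (in read_hdf the upper bound is named stop). The helper name read_first_rows is hypothetical, and the sketch assumes PyTables is installed and that the data was written in table format.

```python
import pandas as pd


def read_first_rows(path, key, n):
    """Return the first n rows stored under `key`, reading only those rows."""
    # start/stop are row positions and stop is exclusive,
    # so this reads rows 0 .. n-1 rather than the whole table.
    return pd.read_hdf(path, key, start=0, stop=n)
```

With the store from the original example, this would be called as read_first_rows('data_path/hdf_store.h5', 'table_name', 1000).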
Comment From: MaxPowerWasTaken
Ok thanks, I understand now. Can I ask why you'd prefer they be named parameters rather than keyword parameters? And would you like the same for the other pd.read_ functions?
Comment From: jreback
named parameters are keyword parameters; all the other read_* functions already do it this way
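To illustrate the difference being discussed, here is a small sketch using two toy functions. Both are hypothetical stand-ins and not pandas code: one collects extra keywords in **kwargs and ignores the ones it does not recognise, the other declares its keywords explicitly, so an unsupported keyword such as nrows fails immediately.

```python
def read_with_kwargs(path, key, **kwargs):
    """Hypothetical reader that accepts arbitrary keywords."""
    # Only the keywords this function knows about are used;
    # anything else (for example nrows) is silently dropped.
    start = kwargs.get("start")
    stop = kwargs.get("stop")
    return {"path": path, "key": key, "start": start, "stop": stop}


def read_with_named(path, key, start=None, stop=None):
    """Hypothetical reader with an explicit keyword signature."""
    return {"path": path, "key": key, "start": start, "stop": stop}


# The unsupported keyword is accepted without complaint and has no effect.
loose = read_with_kwargs("store.h5", "table_name", nrows=1000)

# The explicit signature rejects the same call with a TypeError.
try:
    read_with_named("store.h5", "table_name", nrows=1000)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Both versions are still called with keyword arguments such as start=0; the only change is that unknown names are no longer swallowed.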