I'm new and am trying to put together a PR for documentation improvements surrounding `HDFStore`. I've been looking at `HDFStore.append()` and have encountered several instances where I am unsure what the best practice is:

- Many parameters are unused or have no effect by nature. Do we keep them and document them with an additional statement that they are unused, or do we remove/un-document them?
- Some parameters do have effects, but their names are misleading, such as `min_itemsize`, which defines the *maximum* size of string columns (see the sketch below), and `axes`, which changes the axis of appending (technically correct, but potentially misleading). Do we keep them as-is and document them as normal, or rename them throughout?
- Some parameters default to `None` and are later overwritten by sub-function defaults (e.g. `nan_rep` and `chunksize`). Do we back-propagate the defaults into all sub-functions so that we can document the correct default values?
- Is there a point in having an `append` parameter when both `append()` and `put()` exist, acting as near-identical wrappers around the same function, `_write_to_group()`?
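To illustrate the `min_itemsize` point, here is a minimal sketch (the file, key, and column names are just placeholders):

```python
import pandas as pd

df = pd.DataFrame({"s": ["short"]})

with pd.HDFStore("demo.h5", mode="w") as store:
    # min_itemsize reserves an on-disk width of 10 characters for column "s";
    # once the table is created, that width acts as a hard cap for later appends.
    store.append("df", df, min_itemsize={"s": 10})
    # Appending a string longer than the reserved width raises a ValueError,
    # so the "minimum" effectively behaves as a maximum.
    too_long = pd.DataFrame({"s": ["this string exceeds ten characters"]})
    store.append("df", too_long)  # ValueError
```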
Any opinions or resources regarding this will be helpful.
Comment From: rhshadrach
Thanks for the report! I think what you're doing is raising the question of whether certain deficiencies should be corrected or merely documented. That's the right approach, but pandas uses QST for questions from users about using pandas. I've reworked this as an API report.
> Many parameters are unused or have no effect by nature.

Can you detail these?
> Some parameters do have effects but their variable names are misleading, such as `min_itemsize`

This goes back to https://github.com/pandas-dev/pandas/commit/94bd4e7c32df81fb8749ac2559ccc97bddd5db9f. I don't understand why the name `min_itemsize` was chosen; I am in favor of changing this to `max_itemsize`.
> `axes` that changes the axis of appending (though technically correct, it can be misleading).

Can you detail why you find this misleading?
> Some parameters have default values of `None` which are later overwritten by sub-function default values (e.g. `nan_rep` and `chunksize`).
For each parameter, what do you believe the default should be?
> Is there a point in having an append parameter

I think this is a convenience for doing e.g. `df.to_hdf(..., mode="a", append=True)`. What would the alternative look like without an `append` parameter?
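Roughly, something like this (the file name is just a placeholder):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

# With the append parameter, a single call appends rows to an existing table.
df1.to_hdf("demo.h5", key="df", mode="w", format="table")
df2.to_hdf("demo.h5", key="df", mode="a", append=True)

# Without it, the user would need to open the store and call append explicitly.
with pd.HDFStore("demo.h5", mode="a") as store:
    store.append("df", df2)
```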
Comment From: rhshadrach
> Is there a point in having an append parameter

Ah, I confused `df.to_hdf` with `HDFStore.append`; I think you're talking about the latter. This is less clear to me; I will need to dig into it more.

I think this also clears up my confusion on unused parameters. I haven't checked, but could the signature include unused parameters to keep a consistent signature across many functions, others of which do use those parameters?
Comment From: JakeTT404
I have put together the following notes regarding the parameters of the `append` function:

- `format` can only be `"table"` for append, as `"fixed"` does not support appending (remove?); see the sketch after this list.
- `axes` must be an iterable containing a single value selecting a DataFrame axis. It can only be `[0]` or `[1]`, which changes the axis used for indexing. From a user perspective, this only affects the axis on which data is appended (document; renaming is possible but probably best avoided).
- `index`, I believe, writes the index as a data column to allow on-disk queries; see `data_columns` (document as such).
- `append` is redundant (remove; refactor, see below).
- `complib` and `complevel` both seem to function as intended (fix docs; tests?).
- `columns` immediately raises a `TypeError` saying to use `data_columns` instead (this was changed ages ago, so just remove); also shown in the sketch after this list.
- `min_itemsize` should be "max" (replace throughout).
- `nan_rep` defaults to `"nan"` downstream (add defaults to all functions using `nan_rep`).
- `chunksize` defaults to `100000` downstream (same as above).
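A minimal sketch of the `format` and `columns` behaviors noted above (file and key names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

with pd.HDFStore("demo.h5", mode="w") as store:
    store.put("df", df, format="fixed")
    try:
        # "fixed" layouts are not appendable.
        store.append("df", df)
    except ValueError as err:
        print(err)  # Can only append to Tables
    try:
        # columns was removed long ago and now raises immediately.
        store.append("df2", df, columns=["a"])
    except TypeError as err:
        print(err)  # suggests using data_columns instead
```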
There are three functions that write data to an HDF5 file, all of which use the same underlying function. Below is a rough diagram of the function calls for these methods (note: only format `"table"`):
```mermaid
flowchart TB
    subgraph s2["HDFStore"]
        n3["append_to_multiple"] --> n1["append"]
        n2["put"] --> n4["_write_to_group"]
        n1 --> n4
    end
    n5["df.to_hdf"] --> n1 & n2
    n4 --> n6["write"]
    subgraph s1["AppendableTable"]
        n6 --> n7["_create_axes"] & n8["create_description"] & n9["write_data"]
    end
```
`append`, `put`, and `df.to_hdf` all act as wrappers for the same function, which is why they all have so many parameters. A good consequence is that they can (mostly) share documentation (need to look into sharing docstrings). The best present course of action would likely be to convert them into pure wrappers (each currently runs its own checks, so move those checks into `_write_to_group`) and then share their docstrings; a rough sketch follows below. In the future, a full refactoring seems best, since `append` and `put` merge far too early; ideally `put` would create the table and then flow directly into `append` for writing. A final thought: why `put` and not `write`?
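As a rough sketch of the "pure wrappers plus shared docstring" idea (all names and signatures here are hypothetical and heavily simplified, not the actual pandas internals):

```python
_SHARED_DOC = """Write `value` to the store under `key`.

Parameters
----------
key : str
    Identifier for the node in the HDF5 file.
value : object
    Data to write.
format : str, default {default!r}
    'fixed' writes a non-appendable layout; 'table' supports appends.
"""

def _write_to_group(key, value, format, append):
    # Single implementation holding all validation and the actual write.
    ...

def put(key, value, format="fixed"):
    return _write_to_group(key, value, format=format, append=False)

def append(key, value, format="table"):
    return _write_to_group(key, value, format=format, append=True)

# Each wrapper fills the shared template with its own default.
put.__doc__ = _SHARED_DOC.format(default="fixed")
append.__doc__ = _SHARED_DOC.format(default="table")
```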
Take these opinions with a grain of salt, as I have little experience in large, community-driven projects :)