xref https://github.com/pydata/pandas/pull/12578
so it's pretty easy to add (and eventually deprecate) certain top-level functions that could/should be methods. This promotes a cleaner syntax and method chaining.
In 0.18.1 I would propose adding these, then deprecating the top-level functions in 0.19.0.
This would eliminate the ability to use these directly with np.ndarrays, but I view this as a positive (note we effectively did this for .rolling/.expanding/.ewm with np.ndarrays as well, though that was deprecated for a while first).
- [ ] pd.crosstab (DataFrame)
- [x] pd.melt (DataFrame) (PR #15521); added to DataFrame
- [ ] pd.get_dummies (both)
- [ ] pd.cut/qcut (only on Series)
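For illustration, the kind of syntax difference this is about, using melt as the example (the method form shown here is the proposed addition, mirroring the top-level signature, not something that exists today):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "a": [10, 20], "b": [30, 40]})

# with only the top-level function, a chain has to be broken up (or .pipe-d)
out = pd.melt(df.rename(columns=str.upper), id_vars="ID")

# with a method, the same steps read left to right
out = df.rename(columns=str.upper).melt(id_vars="ID")
```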
Comment From: jorisvandenbossche
From the above list, pivot_table is already a method, so I think that one can be removed.
Further, I don't find it necessary to deprecate these. They are frequently used functions, so I am not sure deprecation is worth it (which is not to say some of them wouldn't also be valuable as methods).
Comment From: ResidentMario
@jreback I'm interested in taking a stab at this, but since I've never worked with the pandas codebase before I wanted to first ask what your opinion is on the implementation.
The easiest way to port these to methods would be to simply thread a call to the global functions through a method (akin to what is done in Series.from_csv/Series.to_csv). The advantage of this approach is that it's easy to do and avoids code refactoring. The disadvantage is that you would have to add a handful of additional dependencies (e.g. from pandas.tools.tile import (cut, qcut)) to series.py and dataframe.py, which I assume would slow down module initialization. Perhaps these imports could be made in the method body? That seems suboptimal...
As I see it, the best option would be to instead move the core of each of the function bodies into a core module (common.py?), then turn both the old top-level functions and the new class methods into thin wrappers around it. This might allow the method implementations to use a (marginally?) faster code path, since less type-checking would be required. However, it would require a refactor.
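Roughly what I mean, as a self-contained sketch; the helper name (_get_dummies_core) is made up for this example and pd.get_dummies just stands in for the real logic:

```python
import pandas as pd

def _get_dummies_core(values, prefix=None):
    # shared core: assumes `values` is already a Series
    return pd.get_dummies(values, prefix=prefix)  # stand-in for the real logic

def get_dummies(data, prefix=None):
    # top-level wrapper: normalises the input type, then delegates
    if not isinstance(data, pd.Series):
        data = pd.Series(data)
    return _get_dummies_core(data, prefix=prefix)

def series_get_dummies(self, prefix=None):
    # method-style wrapper: `self` is already a Series, so no checks needed
    return _get_dummies_core(self, prefix=prefix)
```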
Comment From: jreback
@ResidentMario conceptually this is very easy (look at something like .unstack()), or this:

```python
def melt(self, ....):
    from pandas.core.reshape import melt
    return melt(self, ....)
```
BUT I didn't put a doc-string. And you cannot import these directly at the top level of core/frame.py. It wouldn't slow down initialization much; the issue is circular dependencies (these modules import DataFrame, so they can't be imported at module level).
So here's a way to do this:
create DataFrame.melt (like above), but use a _shared_docs doc-string (and define it in core/frame.py). Then you can set the doc-string on pandas.core.reshape.melt directly.
I think this is a nice pattern that will work.
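A self-contained sketch of what that could look like (names mimic, but are not, the actual pandas internals):

```python
_shared_docs = {}
_shared_docs["melt"] = (
    "Unpivot a DataFrame from wide format to long format.\n\n"
    "See Also\n--------\n%(other)s\n"
)

def melt(frame, id_vars=None, value_vars=None):
    # stand-in for the existing top-level pandas.core.reshape.melt
    ...

# the top-level function re-uses the shared doc-string, pointing at the method
melt.__doc__ = _shared_docs["melt"] % dict(other="DataFrame.melt")


class DataFrame:  # stand-in for pandas.core.frame.DataFrame
    def melt(self, id_vars=None, value_vars=None):
        # in pandas the import of the top-level function would happen here,
        # inside the method body, to avoid a circular module-level import
        return melt(self, id_vars=id_vars, value_vars=value_vars)

    # the method re-uses the same shared doc-string, pointing back at pd.melt
    melt.__doc__ = _shared_docs["melt"] % dict(other="pandas.melt")
```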
Comment From: ResidentMario
Ok, great, glad I asked!
Comment From: ResidentMario
Line no. 186 in pivot.py in the current source:

DataFrame.pivot_table = pivot_table

git blame pandas/tools/pivot.py -L 186 shows that this is a very old Wes McKinney commit from circa 2011. Technically this approach is correct, but it doesn't generate a reference in the docs. Do we want to refactor this into the new pattern?
Edit: actually it does generate a doc entry. Sphinx is more clever than I thought! So I think that pd.pivot_table can be marked "done" as-is.
Comment From: ResidentMario
Technically speaking, if reusing the original docstring is OK, this pattern could be used for all of the other conversions as well, since the files they are defined in already import the Series and DataFrame names where needed.
However, I think the old functions ought to get links to the new methods in their docstrings (and vice versa), and that the new methods ought to additionally get a versionadded directive in their docstrings. To do that, Jeff's pattern is required.
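Something like this, for example (the directive's version would be whichever release the method actually lands in):

```python
def melt(self, id_vars=None, value_vars=None):
    """
    Unpivot a DataFrame from wide format to long format.

    .. versionadded:: 0.20.0

    See Also
    --------
    pandas.melt : Top-level equivalent of this method.
    """
    ...
```

The top-level pd.melt docstring would get the reverse See Also entry pointing at DataFrame.melt.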
In the case of pivot_table, the alias is already ancient. Nevertheless, refactoring it using Jeff's pattern is arguably a documentation improvement, since DataFrame.pivot_table would then link to pd.pivot_table in the docs (and vice versa).
@jreback what do you think?
Comment From: jreback
@ResidentMario yeah, OK with having a standard way of doing this for doc-strings and such (so we should fix pivot_table).
As for how to do this: ideally establish the pattern with 1 or 2 methods, then the rest can be done in additional PRs. IOW, we can tweak to get the pattern that we want (and maybe make helpers / decorators to assist first).
FYI, I have made a decorator for doing something almost like this: https://github.com/pandas-dev/pandas/pull/15484/commits/e75eacb2f7d165fa4b2148c2c5cb051b0bff816c (see pandas.utils.decorator).
I am going to split the decorator off and push it to master shortly.
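The commit linked above has the actual code; purely as an illustration of the shape such a helper can take (this sketch is not the committed implementation):

```python
import functools

def as_method(func):
    # hypothetical helper: wrap a top-level pandas function as a method that
    # passes `self` through as the first argument and inherits the doc-string
    @functools.wraps(func)
    def method(self, *args, **kwargs):
        return func(self, *args, **kwargs)
    return method

# usage sketch (names illustrative):
#     class DataFrame:
#         melt = as_method(top_level_melt)
```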
Comment From: jbrockmendel
still relevant?
Comment From: jreback
I think it's worth adding get_dummies (there is an existing PR for this) and cut/qcut to Series; separately we can consider deprecating the top-level functions.
Comment From: jbrockmendel
Discussed on today's dev call and the consensus was against, as in the time since the OP we have moved in the opposite direction towards not having e.g. DataFrame.read_csv. @mroeschke also pointed out that qcut can return one object or a tuple depending on arguments, which defeats chaining. Closing.
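For reference, the qcut behaviour mentioned above:

```python
import pandas as pd

s = pd.Series(range(10))

binned = pd.qcut(s, 4)                       # a single Series (category dtype)
binned, edges = pd.qcut(s, 4, retbins=True)  # a (result, bin-edges) tuple

# the tuple form has no Series/DataFrame methods of its own, so it cannot
# sit in the middle of a method chain
```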