Many DataFrame methods (now including __getitem__) accept callables that take the DataFrame as input, e.g., df[lambda x: x.sepal_length > 3].

However, this is annoyingly verbose. I recently suggested (https://github.com/pydata/pandas/issues/13040) enabling argument-free lambdas like df[lambda: sepal_length > 3], but this isn't a viable solution (too much magic!) because it's impossible to implement with Python's standard scoping rules.

pandas-ply and dplython provide an alternative approach, based on a magic X operator, e.g.,

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

pandas-ply also introduces (injects onto pandas.DataFrame) two new DataFrame methods, ply_select and ply_where, that accept these symbolic expressions built from X. dplython takes a different approach, introducing its own dplyr-like API for chaining expressions instead of using method chaining. The pandas-ply approach is much closer to what makes sense for pandas proper, given that we already support method chaining.

I think we should consider introducing an object like X into pandas proper and supporting its use on all pandas methods that accept callables that take the DataFrame as input.

I don't think we need to port ply_select and ply_where, because support for expressions in DataFrame.assign and indexing is a good substitute.

So my proposed syntax (after from pandas import X) looks like the following:

(flights
 .groupby(['year', 'month', 'day'])
 .assign(
     arr = X.arr_delay.mean(),
     dep = X.dep_delay.mean())
 [(X.arr > 30) & (X.dep > 30)])

Indexing is a little uglier than using the ply_where method, but otherwise this is a nice improvement.

Best of all, we don't need to do any special tricks to introduce new scopes -- we simply define X.__getattr__ to look up attributes as columns in the DataFrame context. I expect we could even reuse the expression engines from pandas-ply or dplython directly, perhaps with a few modifications.
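For concreteness, here is a rough, minimal sketch of how such a deferred-expression object might be built. The Expr class and its handful of operators are illustrative assumptions, not pandas-ply's or dplython's actual code:

import operator

import pandas as pd


class Expr:
    """A deferred expression: a function of a DataFrame, built up lazily."""

    def __init__(self, func=lambda df: df):
        self._func = func

    def _evaluate(self, df):
        return self._func(df)

    def __getattr__(self, name):
        # X.arr_delay -> an expression that pulls the attribute/column off df
        return Expr(lambda df: getattr(self._evaluate(df), name))

    def __call__(self, *args, **kwargs):
        # X.arr_delay.mean() -> the method call itself is also deferred
        return Expr(lambda df: self._evaluate(df)(*args, **kwargs))

    def _binop(self, other, op):
        return Expr(lambda df: op(
            self._evaluate(df),
            other._evaluate(df) if isinstance(other, Expr) else other))

    def __gt__(self, other): return self._binop(other, operator.gt)
    def __add__(self, other): return self._binop(other, operator.add)
    def __and__(self, other): return self._binop(other, operator.and_)


X = Expr()

# Methods that accept DataFrame-callables would just need to recognize an
# Expr and evaluate it against themselves, e.g.:
df = pd.DataFrame({'arr': [10, 45], 'dep': [50, 40]})
mask = (X.arr > 30) & (X.dep > 30)
print(df[mask._evaluate(df)])  # keeps only the second row

A real implementation would need the full set of operators and careful handling of reductions, but the core idea is just operator overloading plus a closure over the eventual DataFrame.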

In my mind, this would mostly obviate the need for pandas-ply, though the alternate API provided by dplython would still be independently useful. In an ideal world, our X implementation in pandas would be something that could be reused by dplython.

cc @joshuahhh @dodger487

Comment From: datnamer

There is also this: https://github.com/dodger487/dplython

Comment From: shoyer

@datnamer Thanks -- I had a feeling I was missing something! I updated my post to include discussion of dplython as well.

Comment From: joshuahhh

I have mixed thoughts.

On the one hand, I agree that having to put lambda x: everywhere is awkward and verbose; probably awkward and verbose enough to discourage using the syntax.

But the X solution isn't perfect. The main problem is that if you have a function f and call f(X), everything breaks. (Unless f has a particularly simple implementation which doesn't look at its argument too closely.) This is why I added sym_call, but sym_call looks crappy, and you get cryptic error messages if you forget to use it. The introduction of .pipe on pandas dataframes/series made X.pipe(f) a nice option, but the "forget to use it" problem is still real.
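A rough illustration of that failure mode (normalize is a made-up function here, and the X/pipe behavior assumed is pandas-ply's; the pandas calls are shown commented out):

def normalize(s):
    # An ordinary function: it inspects its argument (the `if` needs a real
    # boolean), so it cannot operate on a symbolic expression.
    if s.isnull().any():
        s = s.fillna(s.mean())
    return (s - s.mean()) / s.std()

# Breaks: normalize runs immediately and receives the symbolic X object,
# so the `if` test cannot be resolved.
#   flights.assign(z=normalize(X.arr_delay))

# Works: the call is deferred, so normalize sees a real Series later.
#   flights.assign(z=X.arr_delay.pipe(normalize))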

I like using X a lot, since I understand it well and have built it into my habits. But Python isn't powerful enough to make it work in a totally predictable way, and I don't know if half-solutions like this belong in pandas.

(Thanks for asking!)

Comment From: dodger487

To add onto @joshuahhh's comment about calling X as an input to a function: in dplython we use a decorator (DelayFunction) that causes the function to check its arguments for any X arguments and, if any are present, delays the call until the correct time, when the args can be supplied. I'd echo that it isn't ideal -- "you get cryptic error messages if you forget to use it." I've toyed with the idea of applying this to all functions in a module upon import, but that seems like it could have some difficulties.
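A rough sketch of that delayed-call pattern (illustrative only, not dplython's actual DelayFunction; it reuses the toy Expr class sketched earlier in the thread):

def delay_function(func):
    """If any argument is a symbolic Expr, return a new deferred Expr that
    calls func later, once a real DataFrame is available; otherwise call
    func immediately."""
    def wrapper(*args, **kwargs):
        symbolic = any(isinstance(a, Expr)
                       for a in list(args) + list(kwargs.values()))
        if not symbolic:
            return func(*args, **kwargs)

        def evaluate_later(df):
            real_args = [a._evaluate(df) if isinstance(a, Expr) else a
                         for a in args]
            real_kwargs = {k: v._evaluate(df) if isinstance(v, Expr) else v
                           for k, v in kwargs.items()}
            return func(*real_args, **real_kwargs)

        return Expr(evaluate_later)
    return wrapper

# Usage sketch:
#   @delay_function
#   def normalize(s):
#       return (s - s.mean()) / s.std()
#
#   flights.assign(z=normalize(X.arr_delay))   # now deferred correctly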

On the other hand, I've found that I don't often need to apply functions to X arguments because there are so many methods on Series. Also, if I'm writing a function that will be applied to an X, it's not too bad to use the DelayFunction decorator.

Overall, I agree there are some difficulties but I'm optimistic about X being a useful solution to include in pandas.

Thanks for including me on the thread!

Comment From: shoyer

@dodger487 @joshuahhh thanks for sharing your thoughts! I think pandas supports method chaining well enough that the inability to use arbitrary functions is OK. X.pipe(np.log) feels a little unnatural but is not so terrible. (Note that there are tentative plans, possibly as part of the pandas 1.0 rewrite, to port commonly used numpy ufuncs such as np.log to methods on Series/DataFrame.)

It occurs to me that dask.delayed contains yet another implementation of deferred evaluation that might be a useful reference.
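For reference, the dask.delayed style of deferred evaluation looks roughly like this (nothing runs until .compute()):

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

total = add(inc(1), inc(2))  # builds a task graph; nothing runs yet
print(total.compute())       # 5 -- the graph is evaluated here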

Comment From: dpavlic

Very intriguing. I've tested out dplython and pandas-ply based on this issue and they both look very interesting. It looks like both use X for their own functions, but it can't be used elsewhere; i.e.:

df[(X.arr > 30) & (X.dep > 30)]

doesn't actually work with either implementation as it is. Your proposal sounds like it would allow its use there, and elsewhere; for example, I'm assuming instead of (please forgive the obviously highly contrived example):

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).assign(c=lambda x: x.a + x.b)

I could instead do:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).assign(c=X.a + X.b)

? While pandas is never going to have some of the sheer convenience of the R syntax for these types of things, that brings it a lot closer from what I can see.

Comment From: shoyer

Despite these limitations, I still think pandas.X would be a clear win. The core functionality seems useful enough on its own to merit inclusion in core pandas, even though add-ons like DelayFunction are probably best left to third party libraries.

@jreback @jorisvandenbossche @TomAugspurger any opinions?

Comment From: jreback

I think exposing pd.X is pretty reasonable, assuming it's well documented with nice use cases. It's opt-in, so +1.

Comment From: lpenguin

Guys, I didn't see this issue. I think I've done something very similar to the X magic, see #18077. A proof-of-concept implementation is at https://github.com/lpenguin/pandas-query. Just use from pandas_query import _ as X and you will get similar functionality. I didn't implement separate ply_select and ply_where functions, though; I hacked DataFrame.__getitem__, DataFrame.__setitem__ (column assignment), and DataFrame.assign.

Comment From: jbrockmendel

Discussed on today's dev call and the consensus is we don't want to add to the API. Closing.