Pandas Shorter syntax for selecting data and expression evaluation (proposal)

Summary:
When we select data from a pandas DataFrame that has a long name, the code becomes bulky. For example:

subset = really_long_name_dataframe[
    really_long_name_dataframe['ints'].between(3, 6)
    & (really_long_name_dataframe['mul10'] != 40)
]

I'd like to propose a way to write this expression in a more compact form. We could use a reserved name as a stand-in for the DataFrame:

subset = really_long_name_dataframe[
    _['ints'].between(3, 6)  # _ is a substitution for really_long_name_dataframe
    & (_['mul10'] != 40)
]

Another case where this would be useful is applying operations to columns:

cubes = (
    really_long_name_dataframe['ints'] * really_long_name_dataframe['squares'] 
)

# Could be written as
cubes = (
    really_long_name_dataframe(_['ints'] * _['squares'])  # via __call__ magic function
)

This can come in very handy when we chain operations on a DataFrame:

(
    some_dataframe
    .groupby('squares')
    .count()
    .assign(sqrt=_.index.map(np.sqrt).astype(int))  # .assign() function
    .set_index(_.sqrt.map(str) + ' - ' + _.ints.map(str))  # .set_index() function
    [_['ints'].between(1, 20)]  # Selecting data, .__getitem__() function
    (_['sqrt'].map(np.log10) * _['ints'])  # Evaluating expressions, .__call__() function
)

I wrote a proof-of-concept module which makes pandas capable of such syntax (via monkey patching): https://github.com/lpenguin/pandas-query

Comment From: TomAugspurger

Thanks for the examples (and proof of concept!).

You may already be aware of this, but for indexing, __getitem__, loc, and iloc all accept a callable, so you can do

subset = really_long_name_dataframe[lambda df: df['ints'].between(3, 6) ...]

Likewise with assign.

Python's lambda isn't the shortest, but it may be an improvement. And you can refactor the lambdas out to standalone functions if they're reused.
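
To make the callable approach concrete, here is a minimal sketch using only existing pandas behaviour. The sample data and the helper name ints_in_range are assumptions made up for illustration; the column names follow the examples earlier in the thread.

```python
import pandas as pd

# Hypothetical sample data; column names follow the examples in this thread.
really_long_name_dataframe = pd.DataFrame({
    "ints": [1, 2, 3, 4, 5, 6, 7],
    "mul10": [10, 20, 30, 40, 50, 60, 70],
})

# A callable passed to [] receives the DataFrame and returns a boolean mask,
# so the long name only has to be written once.
subset = really_long_name_dataframe[
    lambda df: df["ints"].between(3, 6) & (df["mul10"] != 40)
]

# .assign accepts callables in the same way.
with_squares = really_long_name_dataframe.assign(
    squares=lambda df: df["ints"] ** 2
)


# A reused predicate can be pulled out into a standalone function
# (ints_in_range is an illustrative name, not part of pandas).
def ints_in_range(df):
    return df["ints"].between(3, 6)


in_range = really_long_name_dataframe[ints_in_range]
```

Because the callable is evaluated against whatever object it is applied to, the same pattern should also work in the middle of a method chain, where the intermediate result has no name at all.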

IIRC, libraries using _ is frowned upon, since that's what the interpreter typically uses for the last returned value, and interactive use is important to pandas (a different identifier could be used of course).

Making dataframes callable would be a big change with (I'm guessing) a lot of unintended negative consequences.

Comment From: chris-b1

This is similar to the magic X used by dplython and pandas_ply, see also #13133

Comment From: gfyoung

Let's also not forget the .query method which allows you to use SQL-like syntax.
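
As a rough sketch of what that looks like for the examples in this issue, .query covers the row selection and the related .eval method covers the column expression. The sample data below is hypothetical; only the column names come from the original post.

```python
import pandas as pd

# Hypothetical sample data; column names follow the examples in this thread.
really_long_name_dataframe = pd.DataFrame({
    "ints": [1, 2, 3, 4, 5, 6, 7],
    "squares": [1, 4, 9, 16, 25, 36, 49],
    "mul10": [10, 20, 30, 40, 50, 60, 70],
})

# Row selection: column names are referenced directly inside the string.
subset = really_long_name_dataframe.query(
    "ints >= 3 and ints <= 6 and mul10 != 40"
)

# Column arithmetic: .eval evaluates the expression and returns a Series.
cubes = really_long_name_dataframe.eval("ints * squares")
```

The trade-off is that the expression lives in a string, so it is limited to what the expression parser supports and is not checked by editors or linters the way ordinary Python code is.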

Comment From: jreback

Yeah, if you want to address the issues discussed in the X issue #13133, please feel free.