Summary:
When we select data from a pandas DataFrame with a long name, the code becomes bulky. For example:
subset = really_long_name_dataframe[
really_long_name_dataframe['ints'].between(3, 6)
& (really_long_name_dataframe['mul10'] != 40)
]
I'd like to propose a way to write this expression in a more compact form. We could use some reserved name as a stand-in for the DataFrame:
subset = really_long_name_dataframe[
_['ints'].between(3, 6) # _ is a substitution for really_long_name_dataframe
& (_['mul10'] != 40)
]
This could also be useful when we apply operations to columns:
cubes = (
really_long_name_dataframe['ints'] * really_long_name_dataframe['squares']
)
# Could be written as
cubes = (
really_long_name_dataframe(_['ints'] * _['squares']) # via __call__ magic function
)
This can be very handy when we apply operations to the DataFrame as a chain:
(
some_dataframe
.groupby('squares')
.count()
.assign(sqrt=_.index.map(np.sqrt).astype(int)) # .assign() function
.set_index(_.sqrt.map(str) + ' - ' + _.ints.map(str)) # .set_index() function
[_['ints'].between(1, 20)] # Selecting data, .__getitem__() function
(_['sqrt'].map(np.log10) * _['ints']) # Evaluating expressions, .__call__() function
)
I wrote a proof-of-concept module which makes pandas capable of this syntax (via monkey patching): https://github.com/lpenguin/pandas-query
Comment From: TomAugspurger
Thanks for the examples (and proof of concept!).
You're maybe aware, but for indexing, all of __getitem__, loc, and iloc take a callable, so you can do
subset = really_long_name_dataframe[lambda df: df['ints'].between(3, 6) ...]
Likewise with assign.
Python's lambda isn't the shortest, but it may be an improvement. And you can refactor the lambdas out to standalone functions if they're reused.
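To make the callable approach concrete, here is a minimal sketch using the column names from the original post. The sample data and the helper name `in_range_not_40` are made up for illustration; only the callable support in `[]`, `.loc`, and `.assign` is pandas behavior.

```python
import pandas as pd

# Illustrative data; the values are assumptions, not from the original post.
really_long_name_dataframe = pd.DataFrame(
    {"ints": [1, 2, 3, 4, 5, 6, 7], "mul10": [10, 20, 30, 40, 50, 60, 70]}
)


def in_range_not_40(df):
    # Standalone, reusable version of the selection lambda (hypothetical name).
    return df["ints"].between(3, 6) & (df["mul10"] != 40)


# __getitem__ calls the function with the DataFrame and uses the boolean result.
subset = really_long_name_dataframe[in_range_not_40]

# .loc accepts a callable in the same way.
subset_loc = really_long_name_dataframe.loc[lambda df: df["ints"].between(3, 6)]

# .assign evaluates a callable against the DataFrame it is called on.
with_squares = really_long_name_dataframe.assign(squares=lambda df: df["ints"] ** 2)
```

Because the callable receives whatever DataFrame it is applied to, the same pattern works in the middle of a method chain, where the intermediate result has no name at all.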
IIRC, libraries using _ is frowned upon, since that's what the interpreter typically uses for the last returned value, and interactive use is important to pandas (a different identifier could be used of course).
Making dataframes callable would be a big change with (I'm guessing) a lot of unintended negative consequences.
Comment From: chris-b1
This is similar to the magic X used by dplython and pandas_ply, see also #13133
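As a rough illustration of the deferred-placeholder idea, here is a hypothetical sketch. It is not how dplython or pandas_ply actually implement X; the `Placeholder` class and `_evaluate` helper are invented for this example. The placeholder records an expression and only evaluates it once it is handed a concrete DataFrame. Since the object is callable, it can be passed straight to the existing callable-based indexing.

```python
import pandas as pd


class Placeholder:
    """Hypothetical deferred expression: wraps a function of a DataFrame."""

    def __init__(self, func=lambda df: df):
        self._func = func

    def __call__(self, df):
        # Evaluate the recorded expression against a concrete DataFrame.
        return self._func(df)

    def __getitem__(self, key):
        return Placeholder(lambda df: self._func(df)[key])

    def __mul__(self, other):
        return Placeholder(lambda df: self._func(df) * _evaluate(other, df))

    def __ne__(self, other):
        return Placeholder(lambda df: self._func(df) != _evaluate(other, df))

    def __and__(self, other):
        return Placeholder(lambda df: self._func(df) & _evaluate(other, df))


def _evaluate(value, df):
    # Placeholders are resolved against df; plain values pass through unchanged.
    return value(df) if isinstance(value, Placeholder) else value


X = Placeholder()

# Illustrative data; the values are assumptions.
df = pd.DataFrame({"ints": [1, 2, 3, 4], "squares": [1, 4, 9, 16]})

# Build the expression first, then apply it to the DataFrame.
cubes = (X["ints"] * X["squares"])(df)

# A placeholder is callable, so df[...] evaluates it like any other callable key.
subset = df[(X["ints"] != 2) & (X["squares"] != 16)]
```

Only a handful of operators are covered here; a real implementation would need to forward attribute access, method calls, and the remaining operators as well.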
Comment From: gfyoung
Let's also not forget the .query method which allows you to use SQL-like syntax.
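For reference, the selection from the original post might look like this with .query. This is a small sketch; the sample data is made up for illustration.

```python
import pandas as pd

# Illustrative data; the values are assumptions, not from the original post.
really_long_name_dataframe = pd.DataFrame(
    {"ints": [1, 2, 3, 4, 5, 6, 7], "mul10": [10, 20, 30, 40, 50, 60, 70]}
)

# Column names are referenced directly inside the query string,
# so the DataFrame name appears only once.
subset = really_long_name_dataframe.query(
    "ints >= 3 and ints <= 6 and mul10 != 40"
)
```

The expression is passed as a string, so it covers the selection case but not arbitrary method calls on columns.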
Comment From: jreback
Yeah, if you want to address the issues discussed in the X issue #13133, please feel free.