Feature Type
-
[X] Adding new functionality to pandas
-
[ ] Changing existing functionality in pandas
-
[ ] Removing existing functionality in pandas
Problem Description
I wish I could have a nicer mypy/flake8/pylint friendly way of doing super simple data filtering.
Right now if I want to do a simple filter, I'd normally do it like so:
val = "foo"
df = (
df_raw.set_index("A")
.sort_index()
.unstack()
.groupby("B")
.mean()
)
df = df[df["C"] == val]
df = df.join(df2).reset_index()
Which frustrates my compulsive brain, because I'd love for this to be all chainable. If I use query
instead of loc, that's, fine, by my linters will complain that val
is not used. I can add comments to make the linters ignore this, but that feels unpythonic.
An alternative method would be to
val = "foo"
df = (
df_raw.set_index("A")
.sort_index()
.unstack()
.groupby("B")
.mean()
.loc[lambda x: x[x["C"] == val]]
.join(df2)
.reset_index()
)
However once more I feel like the syntax here could be shortened given how common a task this is.
Feature Description
Add a new method to dataframes, which allows for the simplest filtering, with loc lambdas and query still the preferred methods for complex filters:
val = "foo"
df = (
df_raw.set_index("A")
.sort_index()
.unstack()
.groupby("B")
.mean()
.select("C", val)
.join(df2)
.reset_index()
)
In terms of implementation, I'm happy to open a PR for this and write some tests. For a simple as possible proposal, something akin to the pseudocode below to add to the DataFrame class.
def select(self, col, val):
if isinstance(val, (tuple, list)):
# simplified, but essentially if we can pass to isin, do so
return self[self[col].isin(val)]
else:
return self[self[col].eq(val)]
````
### Alternative Solutions
For similar methods, `df.xs` serves a similar purpose to what this is requesting, however it operates only on the index. Maybe we could follow its implementation or extend it, instead of adding a new method here?
### Additional Context
For other discussion on this topic, please see: https://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining
**Comment From: samukweku**
Kindly provide sample reproducible data with Expected output dataframe, so that your question can be better understood
**Comment From: Samreay**
Sure thing. I've just copied the code above into a notebook and run it with a testing df, so unsure if this adds much information. Happy to elaborate on any point. But to restate, a single, chainable filter which doesn't need lambda functions or code-as-a-string would be great.
Something like so:
<img width="611" alt="image" src="https://user-images.githubusercontent.com/6748505/200543048-58db4054-181d-49ef-96fe-aae501b82699.png">
```python
import pandas as pd
df = pd.util.testing.makeMixedDataFrame()
def select(self, col, val):
if isinstance(val, (list, tuple, pd.Series, dict)):
return self[self[col].isin(val)]
else:
return self[self[col].eq(val)]
pd.DataFrame.select = select
df.select("C", "foo2") # gives just the rows where column "C" is equal to "foo2"
df.select("C", ["foo1", "foo2"]) # gives the rows where column "C" is either "foo1" or "foo2"
Comment From: MarcoGorelli
If I use query instead of loc, that's, fine, by my linters will complain that val is not used
you can use pipe
Comment From: Samreay
Indeed you can, and in terms of all the different methods, I wrote this up (outside of this ticket):
# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
# Patch a custom method in
def select(df, col, val):
return df[df[col] == val]
pd.DataFrame.select = select
# There are a few available methods:
df = (
# You can use query
df_raw.query("DEST_AIR == 'LAX'")
# A custom patched method
.select("MONTH", 1)
# The same method via pipe
.pipe(select, "ORG_AIR", "SFO")
# Or pass a function to loc
.loc[lambda x: x["CANCELLED"] == 0]
)
pipe
with a custom function of course works just fine, but I still feel like this sort of basic filtering would be used by many pandas users and warrants potential inclusion into the provided method set.
In terms of which solution is easiest to read and more pythonic, I feel like having that provided method is the best choice.
Comment From: giordafrancis
Looks good I admit but overall I don't think it's necessary. I think you will soon run into complex syntax depending on the series type and accessor needed and will potentially end up with something similar to a query method.
query, pipe and trusty lambda in loc resolve it for me.
Comment From: Samreay
Agreed in that this wasn't meant to be (and shouldn't be) another kitchen sink alternative for query or a Callable loc. Instead, just meant to be a small utility method similar to the existing df.xs
and df.filter
methods. And by delegating to either isin
or eq
under the hood, the complexities of data type comparison should already be handled.
Comment From: MarcoGorelli
thanks for the suggestion
closing for now as the bar for the (already huge) API is very high, and the existing methods already allow you to do this