Is your feature request related to a problem?
I'd like to be able to easily select only ordered categorical columns, or only unordered categorical columns, from a dataframe.
Example
Here's an example dataset:
import pandas as pd
import numpy.random as npr
n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs),
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)
Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.
My best attempt at selecting ordered categorical columns is:
categories = people.select_dtypes("category")
categories[[col for col in categories.columns if categories[col].cat.ordered]]
This solution feels overly complicated for such a simple task.
Describe the solution you'd like
There are a few options for what nicer code might look like.
If ordered and unordered categoricals had different dtypes (as in R with factor vs. ordered), then I could just write people.select_dtypes("ordered"). Unfortunately, this would be a breaking change for all existing code that makes assumptions about the dtype of ordered categoricals.
If dataframe-level .cat.* methods existed, I could write something like:
is_ordered = people.cat.ordered # should return [False, pd.NA, True]
people.loc[:, is_ordered & pd.notnull(is_ordered)]
A variation on this might be to have more specialized equivalents of .api.types.is_categorical_dtype(), perhaps .api.types.is_ordered_categorical_dtype() and .api.types.is_unordered_categorical_dtype().
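To make the second option concrete, here is a minimal sketch of what a dataframe-level ordered flag could return, written as a plain function against the existing API. The name ordered_flags is hypothetical and not part of pandas; it relies only on pd.CategoricalDtype and its ordered attribute, and it uses a small hand-written frame instead of the random one above.

```python
import pandas as pd


def ordered_flags(df: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for a dataframe-level ``.cat.ordered``.

    Returns True/False for categorical columns and pd.NA for all others.
    """
    flags = [
        dtype.ordered if isinstance(dtype, pd.CategoricalDtype) else pd.NA
        for dtype in df.dtypes
    ]
    # The nullable "boolean" dtype lets pd.NA sit next to True/False.
    return pd.Series(flags, index=df.columns, dtype="boolean")


# Small deterministic version of the example dataset.
people = pd.DataFrame({"eye_color": ["blue", "brown", "blue"], "age": [25, 37, 52]})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], ["blue", "brown"])

is_ordered = ordered_flags(people)

# Treat "not categorical" (pd.NA) as False before using the flags as a mask.
ordered_cols = people.loc[:, is_ordered.fillna(False).to_numpy(dtype=bool)]
```

Keeping pd.NA for non-categorical columns is a design choice in this sketch: it distinguishes "unordered" from "not a categorical at all", at the cost of an explicit fillna step before masking.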
API breaking implications
The first option mentioned above is an API-breaking change; the other two options are not.
Additional context
I asked the internet for better solutions; no response so far.
Comment From: samukweku
The specialised functions seem like a better route to take.
Comment From: ShaopengLin
I am new to pandas. Is there a significant performance overhead for apply on a DataFrame? If not, then the specialized version can stay consistent with the input of is_categorical_dtype(). A similar boolean array can then be obtained with df.apply(pd.api.types.is_ordered_categorical_dtype), though it will not contain pd.NA to signal non-categorical columns.
To retrieve the columns we can then simply do this:
people.loc[:, people.apply(pd.api.types.is_ordered_categorical_dtype)]
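Since is_ordered_categorical_dtype does not exist in pandas today, the line above can only be tried with a user-defined stand-in. The sketch below assumes two hypothetical helpers with the proposed names, defined locally and not under pd.api.types, that accept either a Series or a dtype. It also shows a variant that iterates over people.dtypes, which inspects only the dtypes and so does not depend on how apply handles the column data.

```python
import pandas as pd


def is_ordered_categorical_dtype(arr_or_dtype) -> bool:
    # Hypothetical helper, not part of pandas: accepts a Series or a dtype.
    dtype = arr_or_dtype.dtype if isinstance(arr_or_dtype, pd.Series) else arr_or_dtype
    return isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)


def is_unordered_categorical_dtype(arr_or_dtype) -> bool:
    # Hypothetical helper, not part of pandas: accepts a Series or a dtype.
    dtype = arr_or_dtype.dtype if isinstance(arr_or_dtype, pd.Series) else arr_or_dtype
    return isinstance(dtype, pd.CategoricalDtype) and not dtype.ordered


# Small deterministic version of the example dataset.
people = pd.DataFrame({"eye_color": ["blue", "brown", "blue"], "age": [25, 37, 52]})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], ["blue", "brown"])

# apply calls the helper once per column and returns a boolean Series.
ordered_cols = people.loc[:, people.apply(is_ordered_categorical_dtype)]
unordered_cols = people.loc[:, people.apply(is_unordered_categorical_dtype)]

# Same selection without apply, looking only at the dtypes.
ordered_via_dtypes = people.loc[
    :, [is_ordered_categorical_dtype(dtype) for dtype in people.dtypes]
]
```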