Feature Type
- [X] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
Suppose I have a function that runs as such:
```python
import pandas as pd

def foo(df: pd.DataFrame) -> pd.DataFrame:
    x = df["x"]
    y = df["y"]
    df["xy"] = x * y
    return df
```
I need my DataFrame to have `x` and `y` columns, and I'm returning a DataFrame with an `xy` column. Currently there is no way of hinting columns on a DataFrame, which leads to guesswork from new joiners on a project when the columns are not specified in the documentation. Linters and other tools that integrate with pandas could also benefit from this feature, as they could provide better code completion and inference from the DataFrame structure.
I've implemented this feature for my personal projects, as described in my Stack Overflow question here, and Spark has an even better system, described here.
Feature Description
A simple way of implementing this would be to add a `__class_getitem__` method to the current DataFrame class, as described in PEP 560, which would look something like this:
```python
from typing import List

# Reuse typing's generic-alias machinery so DataFrame[...] is subscriptable
Columns = type(List[str])

# Inside pandas, where DataFrame is defined (NDFrame and OpsMixin are its existing bases)
class DataFrame(NDFrame, OpsMixin):
    __class_getitem__ = classmethod(Columns)
```
This could be used as such:
```python
def foo(df: pd.DataFrame[["x", "y"]]) -> pd.DataFrame[["x", "y", "xy"]]:
    df["xy"] = df["x"] * df["y"]
    return df
```
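Note that today this subscript syntax would be decorative: static type checkers would not interpret the column list, and at runtime the annotation only needs to evaluate without error. A minimal standalone sketch of that behavior (hypothetical stand-in class, not pandas code):

```python
class DataFrame:
    """Stand-in for pd.DataFrame, used only to demonstrate the syntax."""

    def __class_getitem__(cls, columns):
        # Accept the subscript and return the class unchanged; the column
        # list carries no runtime or type-checker meaning in this sketch.
        return cls

def foo(df: DataFrame[["x", "y"]]) -> DataFrame[["x", "y", "xy"]]:
    ...

print(foo.__annotations__)  # both annotations evaluate to plain DataFrame
```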
Alternative Solutions
Alternatively, one could include dtypes alongside the column names, like this:
```python
from typing import Dict

from pandas._typing import Dtype

Columns = type(Dict[str, Dtype])

class DataFrame(NDFrame, OpsMixin):
    __class_getitem__ = classmethod(Columns)
```
and use it as such:
```python
def foo(
    df: pd.DataFrame['x': pd.Float64Dtype, 'y': pd.Float64Dtype]
) -> pd.DataFrame['x': pd.Float64Dtype, 'y': pd.Float64Dtype, 'xy': pd.Float64Dtype]:
    x = df["x"]
    y = df["y"]
    df["xy"] = x * y
    return df
```
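For what it's worth, Python already parses each `'x': dtype` pair in a subscript as a `slice` object, so a `__class_getitem__` implementation could normalize them into a schema mapping. A rough sketch of that mechanic (the stand-in class and its behavior are illustrative only, not proposed pandas code):

```python
class DataFrame:
    """Stand-in class demonstrating how the slice syntax arrives at runtime."""

    def __class_getitem__(cls, items):
        # A single pair arrives as one slice; multiple pairs arrive as a tuple.
        if not isinstance(items, tuple):
            items = (items,)
        # slice.start holds the column name, slice.stop holds the dtype.
        schema = {item.start: item.stop for item in items}
        print(schema)
        return cls

DataFrame["x": float, "y": float]  # prints {'x': <class 'float'>, 'y': <class 'float'>}
```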
Additional Context
No response
Comment From: MarcoGorelli
thanks for the suggestion - pinging @Dr-Irv @bashtage @twoertwein for thoughts on this one
Comment From: twoertwein
As far as I understand, this approach "only" helps with documentation and cannot be enforced by static type checkers? It would be interesting to understand the motivation for why Spark made that move. Assuming it cannot be understood/enforced by type checkers, it sounds much simpler to me to use the docstring.
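For illustration, the docstring route could look something like this, following the numpydoc conventions pandas itself uses (a sketch, not a prescribed pattern):

```python
import pandas as pd

def foo(df: pd.DataFrame) -> pd.DataFrame:
    """Add an ``xy`` column as the product of ``x`` and ``y``.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain numeric columns ``x`` and ``y``.

    Returns
    -------
    pd.DataFrame
        The input frame with an additional ``xy`` column.
    """
    df["xy"] = df["x"] * df["y"]
    return df
```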
Comment From: JoaoAreias
@twoertwein If I'm not mistaken, Spark needs the schema to run the computations and will run the computation on a few top rows to infer it. Their documentation mentions that this can be expensive and avoided by using type hints:

> However, this is potentially expensive. If there are several expensive operations such as a shuffle in the upstream of the execution plan, pandas API on Spark will end up with executing the Spark job twice, once for schema inference, and once for processing actual data with the schema.
Comment From: twoertwein
So these aren't strictly type annotations but more metadata to speed up runtime behavior. I don't know whether pandas could be faster with such annotations (I assume not).
If you have suggestions for type annotations that are understood by mypy/pyright, please open an issue at pandas-stubs. Reflecting the name and type of columns through type annotations would be great, but I'm honestly not sure whether that is feasible.
Comment From: Dr-Irv
Not sure if we should support this. I think that pandera (https://github.com/unionai-oss/pandera) provides this kind of capability. @JoaoAreias can you evaluate whether pandera would do the job for you?
One issue with the proposal here is that it needs to work for all kinds of column names: not just `str` but any `Hashable` (ints, tuples, enums, multilevel indexes, etc.).
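For reference, pandera lets you declare expected columns as a model and validate them at runtime; the example from this issue could look roughly like this (a sketch against pandera's `DataFrameModel` API, not tested here):

```python
import pandera as pa
from pandera.typing import DataFrame, Series

class InputSchema(pa.DataFrameModel):
    x: Series[float]
    y: Series[float]

class OutputSchema(InputSchema):
    xy: Series[float]

# check_types validates the input and output frames against the schemas
@pa.check_types
def foo(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    df["xy"] = df["x"] * df["y"]
    return df
```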
Comment From: JoaoAreias
Hi @Dr-Irv, sorry for the delayed response. I didn't know about Pandera before this conversation, so I wanted to give it a read first, and life got in the way of me doing that. It seems like Pandera solves my problems, thank you!