Feature Type

  • [X] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Suppose I have a function that runs as such:

def foo(df: pd.DataFrame) -> pd.DataFrame:
    x = df["x"]
    y = df["y"]
    df["xy"] = x * y

    return df

I need my DataFrame to have x and y columns, and I'm returning a DataFrame with an xy column. Currently there is no way to hint columns on a DataFrame, which leaves new joiners on a project guessing whenever the columns are not specified in the documentation. Linters and other tools built on the pandas API could also benefit from this feature, as they could provide better code completion and inference from the DataFrame structure.

I've implemented this feature for my personal projects, as described in my Stack Overflow question here, and Spark has an even better system, described here.

Feature Description

A simple way of implementing this would be to add a __class_getitem__ method to the current DataFrame class, as described in PEP 560, which would look something like this:

from typing import List

from pandas.core.arraylike import OpsMixin
from pandas.core.generic import NDFrame

# type(List[str]) resolves to the class typing uses for subscripted
# aliases, so DataFrame[...] would return a plain annotation object.
Columns = type(List[str])

class DataFrame(NDFrame, OpsMixin):
    __class_getitem__ = classmethod(Columns)

This could be used as such:

def foo(df: pd.DataFrame[["x", "y"]]) -> pd.DataFrame[["x", "y", "xy"]]:
    df["xy"] = df["x"] * df["y"]
    return df 
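
As a quick, self-contained sketch of what the subscription would produce at runtime: the same pattern can be reproduced with a stand-in class (here called Frame) and the public types.GenericAlias from Python 3.9+ instead of the private typing alias class above.

import types

class Frame:
    # Stand-in for the patched DataFrame; subscription builds a plain
    # alias object and performs no validation of any kind.
    __class_getitem__ = classmethod(types.GenericAlias)

alias = Frame[["x", "y"]]
print(alias.__origin__)  # <class '__main__.Frame'>
print(alias.__args__)    # (['x', 'y'],)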

Alternative Solutions

Alternatively, one could hint both the column names and their dtypes, like this:

from typing import Dict

from pandas._typing import Dtype
from pandas.core.arraylike import OpsMixin
from pandas.core.generic import NDFrame

# Same trick as above, but the alias can now carry a dtype per column.
Columns = type(Dict[str, Dtype])

class DataFrame(NDFrame, OpsMixin):
    __class_getitem__ = classmethod(Columns)

and use it as such:

def foo(
    df: pd.DataFrame["x": pd.Float64Dtype, "y": pd.Float64Dtype],
) -> pd.DataFrame["x": pd.Float64Dtype, "y": pd.Float64Dtype, "xy": pd.Float64Dtype]:
    x = df["x"]
    y = df["y"]
    df["xy"] = x * y

    return df
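
Under the same assumptions, the schema could even be recovered at runtime, since each "name": dtype pair arrives at __class_getitem__ as a slice object. The helper schema_from_alias below is hypothetical, not pandas API; Frame is again a stand-in class.

import types

class Frame:
    # Stand-in for the patched DataFrame from the sketch above.
    __class_getitem__ = classmethod(types.GenericAlias)

def schema_from_alias(alias) -> dict:
    # Each "name": dtype entry is a slice(name, dtype); collect them.
    return {s.start: s.stop for s in alias.__args__}

print(schema_from_alias(Frame["x": float, "y": float]))
# {'x': <class 'float'>, 'y': <class 'float'>}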

Comment From: MarcoGorelli

thanks for the suggestion - pinging @Dr-Irv @bashtage @twoertwein for thoughts on this one

Comment From: twoertwein

As far as I understand, this approach "only" helps with documentation and cannot be enforced by static type checkers? It would be interesting to understand why Spark made that move. Assuming it cannot be understood/enforced by type checkers, it sounds much simpler to me to use the docstring.
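
For the function from the problem description, the docstring version might read like this (a sketch in numpydoc style, the convention pandas itself follows):

import pandas as pd

def foo(df: pd.DataFrame) -> pd.DataFrame:
    """Add an "xy" product column.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain numeric columns "x" and "y".

    Returns
    -------
    pd.DataFrame
        The input frame with an added "xy" column equal to x * y.
    """
    df["xy"] = df["x"] * df["y"]
    return df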

Comment From: JoaoAreias

@twoertwein If I'm not mistaken, Spark needs the schema to run the computations and will run the computation on some top rows to infer the schema. They mention in their documentation that this can be expensive, and avoided when using type hints:

However, this is potentially expensive. If there are several expensive operations such as a shuffle in the upstream of the execution plan, pandas API on Spark will end up with executing the Spark job twice, once for schema inference, and once for processing actual data with the schema.
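
A rough sketch of that Spark-side pattern, adapted from the pandas API on Spark documentation (the exact hint syntax varies between Spark versions, so treat this as illustrative only):

import pyspark.pandas as ps

def pandas_plus(pdf) -> ps.DataFrame[int]:
    # The return annotation supplies the output schema, so Spark can
    # skip the sampling pass it would otherwise run to infer it.
    return pdf + 1

psdf = ps.DataFrame({"a": [1, 2, 3]})
psdf.pandas_on_spark.apply_batch(pandas_plus)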

Comment From: twoertwein

So these aren't strictly type annotations but more like metadata to speed up runtime behavior. I don't know whether pandas could be faster having such annotations (I assume not).

If you have suggestions for type annotations that are understood by mypy/pyright, please open an issue at pandas-stubs. Reflecting the name and type of columns through type annotations would be great - but I'm honestly not sure whether that is feasible.
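
One thing checkers do understand today is typing.Annotated, which passes through as a plain DataFrame while carrying column names as inert metadata. This is a pattern sketch, not something pandas or pandas-stubs provide, and nothing enforces the listed columns:

from typing import Annotated

import pandas as pd

# mypy/pyright treat Annotated[pd.DataFrame, ...] as pd.DataFrame; the
# column tuple is metadata only and is not validated anywhere.
def foo(
    df: Annotated[pd.DataFrame, ("x", "y")]
) -> Annotated[pd.DataFrame, ("x", "y", "xy")]:
    df["xy"] = df["x"] * df["y"]
    return df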

Comment From: Dr-Irv

Not sure if we should support this. I think that pandera https://github.com/unionai-oss/pandera provides this kind of capability. @JoaoAreias can you evaluate whether pandera would do the job for you?

One issue with the proposal here is that it needs to work for all kinds of column names - not just str but any Hashable (ints, tuples, enums, multilevel indexes, etc.)
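
For reference, the pandera version of the example from the problem description might look something like this (a sketch against the current pandera API; the base class was called SchemaModel before pandera 0.14):

import pandera as pa
from pandera.typing import DataFrame, Series

class XYSchema(pa.DataFrameModel):
    x: Series[float]
    y: Series[float]

class XYOutSchema(XYSchema):
    xy: Series[float]

@pa.check_types  # validates the argument and return value at runtime
def foo(df: DataFrame[XYSchema]) -> DataFrame[XYOutSchema]:
    df["xy"] = df["x"] * df["y"]
    return df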

Comment From: JoaoAreias

Hi @Dr-Irv, sorry for the delayed response. I didn't know Pandera before this conversation, so I wanted to give it a read first, and life got in the way of my doing that. It seems like Pandera solves my problem. Thank you!