Splitting discussion off from #51280 PR #52153
The checking and propagation of flags in __finalize__ means a small-but-everywhere performance hit for all users that we should deprecate.
Flags only has allows_duplicate_labels, and duplicate labels can instead be disallowed by a 3rd-party validation library.
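To make the "validate outside of pandas" idea concrete, here is a minimal sketch of an explicit duplicate-label check. The helper name assert_unique_labels is hypothetical (it is not a pandas or pandera API); it relies only on Index.is_unique and Index.duplicated.

```python
import pandas as pd


def assert_unique_labels(obj):
    """Hypothetical helper: raise ValueError if any axis has duplicate labels."""
    if not obj.index.is_unique:
        dupes = obj.index[obj.index.duplicated()].unique().tolist()
        raise ValueError(f"duplicate index labels: {dupes}")
    # Only DataFrames have a columns axis to check.
    if isinstance(obj, pd.DataFrame) and not obj.columns.is_unique:
        dupes = obj.columns[obj.columns.duplicated()].unique().tolist()
        raise ValueError(f"duplicate column labels: {dupes}")
    return obj


# The index label "x" appears twice, so the check fails.
df = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "x"])
try:
    assert_unique_labels(df)
    message = None
except ValueError as exc:
    message = str(exc)
```

Unlike flags, a check like this runs only where it is called explicitly, so it adds no cost to other operations.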
Comment From: mroeschke
Looks like pandera provides a utility for checking for duplicate labels: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.components.Column.html#pandera.api.pandas.components.Column
Comment From: jorisvandenbossche
For context, .flags / set_flags was a new feature added in pandas 1.2 as a general mechanism, but at the time specifically for the "optionally disallow duplicate labels" option (https://pandas.pydata.org/docs/whatsnew/v1.2.0.html#optionally-disallow-duplicate-labels). See https://github.com/pandas-dev/pandas/issues/27108 / https://github.com/pandas-dev/pandas/pull/28394 (cc @TomAugspurger)
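For readers who have not used the feature, a short sketch of how it behaves, assuming a pandas version (1.2 or later) in which flags are still available. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])

# set_flags returns a new object; the original keeps the default (True).
strict = df.set_flags(allows_duplicate_labels=False)
default_flag = df.flags.allows_duplicate_labels
strict_flag = strict.flags.allows_duplicate_labels

# The flag is carried over to results derived from the flagged object.
propagated_flag = strict.head(1).flags.allows_duplicate_labels

# An operation that would introduce duplicate labels raises instead.
try:
    strict.rename(index={"a": "b"})
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True
```

The propagation in the third step is the part under discussion here: the flag has to be checked and copied whenever a result is finalized.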
Comment From: TomAugspurger
> The checking and propagation of flags in __finalize__ means a small-but-everywhere performance hit for all users that we should deprecate.
Is that specific to the flags mechanism, or is it something to do with calling __finalize__ in the first place? I'd be fine with a dedicated boolean to propagate the duplicate labels information.
Comment From: jbrockmendel
> Is [the performance penalty of flags] specific to the flags mechanism, or is it something to do with calling __finalize__ in the first place?
I think of it as being __finalize__ holistically.