Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

IMO it would be an API improvement for pandas if creating dataframes/series/arrays using dtype=str (and dtype="str") returned a dataframe/series/array of dtype StringDtype instead of dtype object. The reason is that, IMO, in 99.9% of cases where users instantiate using dtype=str, they would have preferred to use dtype="string" and thereby have the guarantee that the array actually contains only strings (and NAs).
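To illustrate the guarantee mentioned above, here is a minimal sketch. It assumes a pandas version in which dtype="string" resolves to StringDtype; the variable names are only for illustration, and the exact exception type raised on invalid assignment has varied between releases, so both are caught.

```python
import pandas as pd

# object dtype holds arbitrary Python objects, so nothing enforces
# "strings only": assigning an int is silently accepted.
obj = pd.Series(["a", "b"], dtype=object)
obj.iloc[0] = 1

# A StringDtype array refuses non-string values on assignment.
strs = pd.array(["a", "b"], dtype="string")
try:
    strs[0] = 1
    rejected = False
except (TypeError, ValueError):
    # The exception type depends on the pandas version.
    rejected = True

print(type(obj.iloc[0]).__name__, rejected)
```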

This would be similar to how instantiating with dtype=int currently gives dtype np.int64 and dtype=float gives np.float64.
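A short sketch of how the builtin types resolve today, for comparison. The outputs are deliberately not hard-coded here: dtype=int can depend on the platform and NumPy version, and the result for dtype=str was object when this issue was written but may differ in later pandas releases.

```python
import pandas as pd

as_int = pd.Series([1, 2], dtype=int)        # a NumPy integer dtype (np.int64 on most platforms)
as_float = pd.Series([1, 2], dtype=float)    # np.float64
as_str = pd.Series(["a", "b"], dtype=str)    # object at the time of this issue
as_string = pd.Series(["a", "b"], dtype="string")  # StringDtype

for name, ser in [
    ("int", as_int),
    ("float", as_float),
    ("str", as_str),
    ('"string"', as_string),
]:
    print(f"dtype={name} -> {ser.dtype!r}")
```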

The above proposal would be backwards incompatible, and it is too late to introduce deprecations in pandas 1.x now. However, could it become a breaking change as part of the jump to version 2.0 of pandas, similar to the backwards-incompatible changes already listed in #44823?

Feature Description

Basically it would just change the dtype resolution function to return a StringDtype instead of the current behavior, so it should be reasonably simple to implement.
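As a rough sketch of the idea only: resolve_dtype below is a hypothetical helper written for illustration, not the actual pandas internals. It special-cases str and "str" and defers everything else to the public pandas.api.types.pandas_dtype function.

```python
import pandas as pd
from pandas.api.types import pandas_dtype


def resolve_dtype(dtype):
    """Hypothetical resolver: map str / "str" to StringDtype.

    This is an illustration of the proposal, not pandas' real
    dtype resolution code.
    """
    if dtype is str or (isinstance(dtype, str) and dtype == "str"):
        return pd.StringDtype()
    # Everything else keeps its existing resolution.
    return pandas_dtype(dtype)


print(resolve_dtype(str))
print(resolve_dtype(float))
```

A real change would also have to cover the I/O paths and follow-up operations, which this sketch does not touch.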

Alternative Solutions

The alternative would be to keep the current behavior in pandas 2.0.

Additional Context

No response

Comment From: phofl

I think this needs a more thorough investigation.

How would the behavior of follow up operations change?

Would you also change the behavior of I/O operations? I don't think that we can do this without a deprecation cycle.

Comment From: mroeschke

I support dtype=str eventually mapping to StringDtype, but personally I think it would be better through a deprecation than a 2.0 breaking change.

Comment From: topper-123

Thanks for the reply. Yes, I hadn't considered IO; that makes it more challenging than I had thought when I wrote up the issue...

I could support a deprecation cycle, though if it lasts the entire pandas 2.x cycle, it may be better to deprecate later in the cycle, e.g. pandas 2.3 or similar, IMO.

Unless there is a wish to do something now, I'll let this lie, and I (or someone else) can pick this up later, after pandas 2.0 has been released.

Comment From: phofl

We want to release 3.0 significantly faster than 2.0, so it would be ok to introduce in 2.0 I think. But we want to finish enforcing deprecations first.

Comment From: topper-123

Closing as superseded by #52429, where the discussion is more current.