Currently the keyword use_nullable_dtypes (only implemented for read_parquet for now) uses nullable dtypes for a column even when that column has no nulls.
Should this behavior change?
xref discussion https://github.com/pandas-dev/pandas/pull/40687#issuecomment-833447019 https://github.com/pandas-dev/pandas/issues/42588#issuecomment-894981214 https://github.com/pandas-dev/pandas/issues/42588#issuecomment-882773411
(Note that always using nullable dtypes, the current behavior, is a choice made on our end in https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L215-L227 and not by the pyarrow engine.)
cc @pandas-dev/pandas-core
Comment From: bashtage
I would vote that it should change to only use when necessary, and to use standard dtypes when there are no nulls. Users can always upcast to nullable should they want to use nulls.
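
The upcast mentioned above might look roughly like the following. This is a minimal sketch, not code from the thread: the column values are made up for illustration, and `astype("Int64")` is used as one way to convert a NumPy-backed integer column to pandas' nullable integer dtype.

```python
import pandas as pd

# Hypothetical data: an integer column with no missing values,
# stored with the standard NumPy-backed int64 dtype.
s = pd.Series([1, 2, 3])
standard_dtype = str(s.dtype)

# Upcast to the nullable extension dtype.
nullable = s.astype("Int64")
nullable_dtype = str(nullable.dtype)

# The nullable column can now hold a missing value and stays integer-typed.
nullable[0] = pd.NA
dtype_after_null = str(nullable.dtype)
has_na = bool(nullable.isna().iloc[0])
```

With a standard int64 column, the user would need this explicit conversion step before assigning a missing value if they want to keep an integer dtype.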
Comment From: phofl
I think we can close this. We implemented this for other functions the same way as for read_parquet, which makes the most sense imo.
Comment From: MarcoGorelli
Agreed - personally I'm not keen on value-dependent behaviour such as "only use when necessary, and to use standard dtypes when there are no nulls", so I'd vote for keeping as-is (use_nullable_dtypes uses nullable types, which is clear and simple).
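
To illustrate the difference between the two policies, here is a small sketch using plain pandas. It does not call read_parquet; `value_dependent_dtype` is a hypothetical helper written only for this comparison, and the input lists are made-up data.

```python
import pandas as pd


def value_dependent_dtype(values):
    """Hypothetical helper: keep the nullable dtype only if nulls are present."""
    s = pd.Series(values, dtype="Int64")
    return s if s.isna().any() else s.astype("int64")


without_nulls = [1, 2, 3]
with_nulls = [1, None, 3]

# Value-dependent policy: the resulting dtype depends on the data.
dep_a = str(value_dependent_dtype(without_nulls).dtype)
dep_b = str(value_dependent_dtype(with_nulls).dtype)

# Always-nullable policy: the resulting dtype is the same for both inputs.
always_a = str(pd.Series(without_nulls, dtype="Int64").dtype)
always_b = str(pd.Series(with_nulls, dtype="Int64").dtype)
```

Under the value-dependent policy the two inputs end up with different dtypes (int64 and Int64), so downstream code cannot rely on a fixed dtype for the same column across files; under the always-nullable policy both are Int64.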
Comment From: lithomas1
We'll have to tell fastparquet then. I think they still do it the other way.
Comment From: phofl
Could you open an issue there?