Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'c': ['a', 'b', 'c']}).astype(dtype={'a': 'int64', 'c': 'string'}).set_index('a')
[PYFLYBY] import pandas as pd
In [2]: with pd.option_context("mode.dtype_backend", "pyarrow"):
   ...:     df2 = df.convert_dtypes(infer_objects=False)
   ...:
In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 1 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c       3 non-null      string
dtypes: string(1)
memory usage: 48.0 bytes
In [4]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 1 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c       3 non-null      string[pyarrow]
dtypes: string[pyarrow](1)
memory usage: 39.0 bytes
In [5]: df.index.dtype
Out[5]: dtype('int64')
In [6]: df2.index.dtype
Out[6]: dtype('int64')
In [7]: df2.index._data
Out[7]: array([1, 2, 3])
In [8]: df2.index._data_cls
Out[8]: (numpy.ndarray, pandas.core.arrays.base.ExtensionArray)
Issue Description
Calling convert_dtypes with mode.dtype_backend='pyarrow' doesn't convert the storage of the index.
Expected Behavior
I would expect the index to be stored in the extension array type as well, as it is when the index is reset before the conversion and set again afterwards:
In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'c': ['a', 'b', 'c']}).astype(dtype={'a': 'int64', 'c': 'string'}).set_index('a')
[PYFLYBY] import pandas as pd
In [2]: with pd.option_context("mode.dtype_backend", "pyarrow"):
   ...:     df3 = df.reset_index().convert_dtypes(infer_objects=False).set_index('a')
   ...:
In [3]: df3.index
Out[3]: Index([1, 2, 3], dtype='int64[pyarrow]', name='a')
In [4]: df3.index._data_cls
Out[4]: (numpy.ndarray, pandas.core.arrays.base.ExtensionArray)
I've made a few bug reports today about expected behavior with mode.dtype_backend='pyarrow', since we're hoping to switch to that flag by default when Pandas 2.0 ships, but let me know if these are unwelcome or a distraction.
Comment From: phofl
Hi, your reports are very welcome. This helps us a lot!
The current behavior is consistent with the pandas backend, which also gives us "int64" for the index instead of "Int64". Maybe we could add another option to convert the index as well?
cc @mroeschke
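For reference, a minimal sketch of the pandas-backend behavior described above, assuming a pandas version where convert_dtypes defaults to the NumPy-nullable dtypes. The frame and the variable names are made up for illustration; the column is converted to a nullable dtype while the index keeps its NumPy dtype.

```python
import pandas as pd

# Illustrative frame: one integer column plus an integer index.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, dtype="int64").set_index("a")

# convert_dtypes only looks at the columns of the frame.
converted = df.convert_dtypes()

column_dtype = str(converted["b"].dtype)  # nullable extension dtype
index_dtype = str(converted.index.dtype)  # index is left untouched

print(column_dtype, index_dtype)
```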
Comment From: mroeschke
Yeah, the docs state that:
"Convert columns to best possible dtypes using dtypes supporting pd.NA."
Maybe Index should have a convert_dtypes method instead. We don't really have methods that operate on the values and the index at once.
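Until something like that is decided, the index can be converted from user code. The following is only a sketch: convert_dtypes_with_index is a hypothetical helper, not a pandas API, and it assumes a single-level index. It routes the index through Series.convert_dtypes so the same inference rules apply, and forwards any keyword arguments to both calls.

```python
import pandas as pd


def convert_dtypes_with_index(df: pd.DataFrame, **kwargs) -> pd.DataFrame:
    """Hypothetical helper: run convert_dtypes on the columns and on a flat index."""
    if isinstance(df.index, pd.MultiIndex):
        raise NotImplementedError("this sketch only handles a single-level index")
    # Columns are converted by the existing DataFrame method (returns a new frame).
    out = df.convert_dtypes(**kwargs)
    # Reuse Series.convert_dtypes for the index values, then rebuild the index.
    index_values = df.index.to_series().convert_dtypes(**kwargs)
    out.index = pd.Index(index_values.array, name=df.index.name)
    return out


df = (
    pd.DataFrame({"a": [1, 2, 3], "c": ["x", "y", "z"]})
    .astype({"a": "int64"})
    .set_index("a")
)
result = convert_dtypes_with_index(df)
print(result.index.dtype)
```

The original frame is not modified, since convert_dtypes returns a new object and only that object's index is replaced.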
Comment From: Rylie-W
Take
Comment From: phofl
Please wait until we have decided what to do here.