Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa

# Create a sample DataFrame
data = {
    'id': [1, 2, 3],
    'date1': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
    'date2': pd.to_datetime(['2023-02-01', '2023-02-02', '2023-02-03']),
    'value': [10, 20, 30]
}

df = pd.DataFrame(data)

# Convert datetime columns to 'timestamp[ns][pyarrow]'
df = df.astype({
    'date1': 'timestamp[ns][pyarrow]',
    'date2': 'timestamp[ns][pyarrow]',
    'value': 'int64[pyarrow]'
})

print('\n\nAll types:\n' + str(df.dtypes))
print('\n\nIt works for other types `int64`:\n' + str(df.select_dtypes(include=['int64']).dtypes))
print('\n\nIt works for other types `pd.ArrowDtype(pa.int64())`:\n' + str(df.select_dtypes(include=[pd.ArrowDtype(pa.int64())]).dtypes))
print('\n\nBut it does not work for `timestamp[ns][pyarrow]`:\n' + str(df.select_dtypes(include=['timestamp[ns][pyarrow]']).dtypes))
print('\n\nThe type should not be interpreted as `datetime64[ns]`:\n' + str(df.select_dtypes(include=['datetime64[ns]']).dtypes))
print("\n\nHere is a proper workaround with `pd.ArrowDtype(pa.timestamp('ns'))`:\n" + str(df.select_dtypes(include=[pd.ArrowDtype(pa.timestamp('ns'))]).dtypes))

All types:
id                        int64
date1    timestamp[ns][pyarrow]
date2    timestamp[ns][pyarrow]
value            int64[pyarrow]
dtype: object

It works for other types `int64`:
id                int64
value    int64[pyarrow]
dtype: object

It works for other types `pd.ArrowDtype(pa.int64())`:
id                int64
value    int64[pyarrow]
dtype: object

But it does not work for `timestamp[ns][pyarrow]`:
Series([], dtype: object)

The type should not be interpreted as `datetime64[ns]`:
date1    timestamp[ns][pyarrow]
date2    timestamp[ns][pyarrow]
dtype: object

Here is a proper workaround with `pd.ArrowDtype(pa.timestamp('ns'))`:
date1    timestamp[ns][pyarrow]
date2    timestamp[ns][pyarrow]
dtype: object

Issue Description

select_dtypes cannot select columns of type timestamp[ns][pyarrow] when the dtype is given as a string, apparently because the string representation is not converted to the matching dtype object. It does work when the dtype object is passed explicitly, e.g. pd.ArrowDtype(pa.timestamp('ns')).

Expected Behavior

  • select_dtypes should select columns of type timestamp[ns][pyarrow] when the string timestamp[ns][pyarrow] is provided.

  • select_dtypes should select columns of type timestamp[ns][pyarrow] when the object pd.ArrowDtype(pa.timestamp('ns')) is provided.

  • select_dtypes should not select columns of type timestamp[ns][pyarrow] when datetime64[ns] is provided.

Installed Versions

INSTALLED VERSIONS
------------------
commit                 : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                 : 3.10.12.final.0
python-bits            : 64
OS                     : Linux
OS-release             : 5.15.153.1-microsoft-standard-WSL2
Version                : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine                : x86_64
processor              : x86_64
byteorder              : little
LC_ALL                 : None
LANG                   : C.UTF-8
LOCALE                 : en_US.UTF-8

pandas                 : 2.2.2
numpy                  : 1.26.3
pytz                   : 2023.4
dateutil               : 2.8.2
setuptools             : 69.0.3
pip                    : 24.0
Cython                 : None
pytest                 : 7.4.4
hypothesis             : None
sphinx                 : None
blosc                  : None
feather                : None
xlsxwriter             : None
lxml.etree             : None
html5lib               : None
pymysql                : None
psycopg2               : None
jinja2                 : 3.1.3
IPython                : 8.20.0
pandas_datareader      : None
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : 4.12.3
bottleneck             : None
dataframe-api-compat   : None
fastparquet            : None
fsspec                 : 2023.12.2
gcsfs                  : 2023.12.2post1
matplotlib             : 3.8.2
numba                  : 0.58.1
numexpr                : None
odfpy                  : None
openpyxl               : 3.1.2
pandas_gbq             : None
pyarrow                : 14.0.2
pyreadstat             : None
python-calamine        : None
pyxlsb                 : None
s3fs                   : 2023.12.2
scipy                  : 1.12.0
sqlalchemy             : 2.0.29
tables                 : None
tabulate               : 0.9.0
xarray                 : None
xlrd                   : None
zstandard              : None
tzdata                 : 2023.4
qtpy                   : None
pyqt5                  : None

Comment From: dbalabka

Related to the following line: https://github.com/pandas-dev/pandas/blob/2cdb97e2f806d83965c7dee8fb5fcf164a340379/pandas/core/frame.py#L4895

The following call should not return the timestamp[ns][pyarrow] columns:

df.select_dtypes(include=['datetime64[ns]'])

Here is a proper workaround: avoid the string representation of the type and pass the dtype object explicitly:

df.select_dtypes(include=[pd.ArrowDtype(pa.timestamp('ns'))])

Comment From: dbalabka

It looks related to #52577.

// cc @phofl

Comment From: yuanx749

It seems it does not work with other pyarrow string representations either:

print(df.select_dtypes(include=['int64[pyarrow]']).dtypes)
# Series([], dtype: object)

Comment From: jacek-pliszka

null has a similar problem:

In [10]: df = pd.DataFrame({"a": [None], "b": [[1]]}).convert_dtypes(dtype_backend="pyarrow")

In [11]: df.dtypes
Out[11]:
a    null[pyarrow]
b           object
dtype: object

In [12]: df.select_dtypes(include=['object'])
Out[12]:
      a    b
0  None  [1]