Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [ ] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import *
import sys
import datetime as dt
import pyarrow as pa
import io

data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
print(ser)
print()
ser = pd.Series(data, dtype='datetime64[ns]')
print(ser)

Issue Description

The above outputs:

0    2020-01-01 01:01:01.000001
1           1999-01-01 00:00:00
dtype: timestamp[ns][pyarrow]

0   2020-01-01 01:01:01.000001
1   1999-01-01 00:00:00.000000
dtype: datetime64[ns]

Note how, in the pyarrow case, the two rows are formatted differently: the row with zero microseconds is printed without fractional seconds.

This causes issues when parsing with to_datetime and specifying format=:

In [1]: data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

In [2]: ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
   ...: string_data = ser.to_csv(index=False, header=None).splitlines()
   ...: pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')
   ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [2], line 3
      1 ser = pd.Series(data, dtype='timestamp[ns][pyarrow]')
      2 string_data = ser.to_csv(index=False, header=None).splitlines()
----> 3 pd.to_datetime(string_data, format='%Y-%m-%d %H:%M:%S.%f')

File ~/pandas-dev/pandas/core/tools/datetimes.py:1110, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1108         result = _convert_and_box_cache(argc, cache_array)
   1109     else:
-> 1110         result = convert_listlike(argc, format)
   1111 else:
   1112     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:436, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    433         return res
    435 utc = tz == "utc"
--> 436 result, tz_parsed = objects_to_datetime64ns(
    437     arg,
    438     dayfirst=dayfirst,
    439     yearfirst=yearfirst,
    440     utc=utc,
    441     errors=errors,
    442     require_iso8601=require_iso8601,
    443     allow_object=True,
    444     format=format,
    445     exact=exact,
    446 )
    448 if tz_parsed is not None:
    449     # We can take a shortcut since the datetime64 numpy array
    450     # is in UTC
    451     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2144, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
   2142 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2143 try:
-> 2144     result, tz_parsed = tslib.array_to_datetime(
   2145         data.ravel("K"),
   2146         errors=errors,
   2147         utc=utc,
   2148         dayfirst=dayfirst,
   2149         yearfirst=yearfirst,
   2150         require_iso8601=require_iso8601,
   2151         format=format,
   2152         exact=exact,
   2153     )
   2154     result = result.reshape(data.shape, order=order)
   2155 except OverflowError as err:
   2156     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:442, in pandas._libs.tslib.array_to_datetime()
    440 @cython.wraparound(False)
    441 @cython.boundscheck(False)
--> 442 cpdef array_to_datetime(
    443     ndarray[object] values,
    444     str errors='raise',

File ~/pandas-dev/pandas/_libs/tslib.pyx:622, in pandas._libs.tslib.array_to_datetime()
    620     continue
    621 elif is_raise:
--> 622     raise ValueError(
    623         f"time data \"{val}\" at position {i} doesn't "
    624         f"match format \"{format}\""

ValueError: time data "1999-01-01 00:00:00" at position 1 doesn't match format "%Y-%m-%d %H:%M:%S.%f"
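Until the string output is uniform, one possible way to read such mixed strings back with a fixed format is to pad the missing fractional part before parsing. The following is only a minimal standard-library sketch: `pad_fraction` is a hypothetical helper (not a pandas function), and it assumes naive timestamps with no UTC offset.

```python
import datetime as dt

FMT = "%Y-%m-%d %H:%M:%S.%f"  # same format string as in the report


def pad_fraction(text: str) -> str:
    """Append '.000000' when the string has no fractional seconds.

    Hypothetical helper for illustration; assumes no timezone suffix.
    """
    return text if "." in text else text + ".000000"


# Strings as produced by to_csv in the pyarrow case above
string_data = ["2020-01-01 01:01:01.000001", "1999-01-01 00:00:00"]

# After padding, every string matches FMT, so strptime no longer raises
parsed = [dt.datetime.strptime(pad_fraction(s), FMT) for s in string_data]
print(parsed)
```

This works around the symptom only; the inconsistent string output itself is unchanged.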

Expected Behavior

If it printed

0    2020-01-01 01:01:01.000001
1    1999-01-01 00:00:00.000000
dtype: timestamp[ns][pyarrow]

, just like it does for datetime64[ns], then the issue would be solved.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 8020bf1b25ef50ae22f8c799df6982804a2bd543
python           : 3.8.13.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.10.102.1-microsoft-standard-WSL2
Version          : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8
pandas           : 2.0.0.dev0+697.g8020bf1b25.dirty
numpy            : 1.23.4
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 59.8.0
pip              : 22.3.1
Cython           : 0.29.32
pytest           : 7.2.0
hypothesis       : 6.56.4
sphinx           : 4.5.0
blosc            : None
feather          : None
xlsxwriter       : 3.0.3
lxml.etree       : 4.9.1
html5lib         : 1.1
pymysql          : 1.0.2
psycopg2         : 2.9.3
jinja2           : 3.0.3
IPython          : 8.6.0
pandas_datareader: 0.10.0
bs4              : 4.11.1
bottleneck       : 1.3.5
brotli           :
fastparquet      : 0.8.3
fsspec           : 2021.11.0
gcsfs            : 2021.11.0
matplotlib       : 3.6.2
numba            : 0.56.3
numexpr          : 2.8.3
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : 0.17.9
pyarrow          : 9.0.0
pyreadstat       : 1.2.0
pyxlsb           : 1.0.10
s3fs             : 2021.11.0
scipy            : 1.9.3
snappy           :
sqlalchemy       : 1.4.43
tables           : 3.7.0
tabulate         : 0.9.0
xarray           : 2022.11.0
xlrd             : 2.0.1
zstandard        : 0.19.0
tzdata           : None
qtpy             : None
pyqt5            : None

Comment From: MarcoGorelli

This only became apparent once the long-standing issue of to_datetime not respecting exact for ISO8601 formats was fixed in https://github.com/pandas-dev/pandas/pull/49333

Comment From: MarcoGorelli

Much simpler reproducer of the underlying issue:

In [1]: str(Timestamp(2020, 1, 1, 1, 1, 1, 1))
Out[1]: '2020-01-01 01:01:01.000001'

In [2]: str(Timestamp(2020, 1, 1, 1))
Out[2]: '2020-01-01 01:00:00'

Same thing happens in the standard library:

In [5]: str(dt.datetime(2000, 1, 1, 1, 1, 1))
Out[5]: '2000-01-01 01:01:01'

In [6]: str(dt.datetime(2000, 1, 1, 1, 1, 1, 123))
Out[6]: '2000-01-01 01:01:01.000123'
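The standard-library behaviour can be checked directly: `str()` on a `datetime` drops the fractional part when `microsecond` is 0, whereas `strftime` with `%f` always writes six digits. A minimal sketch, with the `FMT` constant and variable names chosen only for illustration:

```python
import datetime as dt

FMT = "%Y-%m-%d %H:%M:%S.%f"  # illustrative constant, same format as in the report

values = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

# str() omits the fractional seconds when microsecond == 0
as_str = [str(v) for v in values]

# strftime with %f always writes six fractional digits
as_strftime = [v.strftime(FMT) for v in values]

# isoformat can also be forced to a fixed width via timespec
as_iso = [v.isoformat(sep=" ", timespec="microseconds") for v in values]

print(as_str)
print(as_strftime)
print(as_iso)
```

Either of the last two forms gives every row the same width, which is the output the report asks for.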

Could it be that

https://github.com/pandas-dev/pandas/blob/1c51e6004c09e6ccbbb8841dcb327ac0b5c9c80d/pandas/io/formats/format.py#L1406-L1408

should always print with %Y-%m-%d %H:%M:%S.%f? Or maybe this only needs doing when saving to csv? Will look into this more.

Comment From: MarcoGorelli

Looks like date_format in to_csv doesn't work with the arrow type either:

In [1]: data = [dt.datetime(2020, 1, 1, 1, 1, 1, 1), dt.datetime(1999, 1, 1)]

In [2]: ser1 = pd.Series(data, dtype='timestamp[ns][pyarrow]')
   ...: ser2 = pd.Series(data, dtype='datetime64[ns]')
   ...: 

In [3]: ser1.to_csv(date_format='%d-%m-%Y')
Out[3]: ',0\n0,2020-01-01 01:01:01.000001\n1,1999-01-01 00:00:00\n'

In [4]: ser2.to_csv(date_format='%d-%m-%Y')
Out[4]: ',0\n0,01-01-2020\n1,01-01-1999\n'

Comment From: MarcoGorelli

I think this might be the issue

https://github.com/pandas-dev/pandas/blob/2517199dc9d6174d967683eeb6ad7fe68a76df19/pandas/core/construction.py#L467

Here arr.dtype.kind is 'M', and I'm not sure why the arrow array is excluded.

Would

-    if isinstance(arr, np.ndarray):
+    if isinstance(arr, np.ndarray) or is_extension_array_dtype(arr.dtype):

work? Will try

Comment From: mroeschke

ensure_wrapped_if_datetimelike is called in a lot of places, so I'm not sure if it will always be valid to convert an ArrowExtensionArray (with a datelike type) to a DatetimeArray/TimedeltaArray

cc @jbrockmendel

Comment From: jbrockmendel

haven't looked closely at the rest of the thread, but ensure_wrapped_if_datetimelike should only wrap np.ndarrays

Comment From: MarcoGorelli

ok thanks, I'll see where else this could be handled