Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

When reading Excel files, I often get date columns that mix date strings with already-parsed datetime objects, like this:

import pandas as pd
from datetime import datetime

s = pd.Series(["01/02/01", datetime(2001, 2, 2)])

When trying to parse those columns using

pd.to_datetime(s, format="%d/%m/%y")

the exception ValueError("time data '2001-02-02 00:00:00' does not match format '%d/%m/%y' (match)") is raised.

The cause is that the already-parsed datetime is first converted to a string in ISO format, and pandas then tries to parse that string with the given format.
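The mismatch can be reproduced with the standard library alone. This is only a sketch of the behaviour described above, not the pandas internals: it assumes the datetime is stringified with str() and then parsed with the user-supplied format.

```python
from datetime import datetime

value = datetime(2001, 2, 2)

# Stringifying a datetime yields an ISO-style string, not "02/02/01".
as_text = str(value)

# Parsing that string with the day/month/two-digit-year format fails.
try:
    datetime.strptime(as_text, "%d/%m/%y")
    matched = True
except ValueError:
    matched = False

print(as_text, matched)
```

The string produced from the datetime no longer has the shape the format expects, which is why the error message quotes '2001-02-02 00:00:00'.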

Using pd.to_datetime without a format leads to wrong results since the format is ambiguous.

Feature Description

pd.to_datetime should either get an option to handle datetime.datetime objects differently or do so by default.

Alternative Solutions

The only alternative solution I can currently think of is a raw loop that checks the type of each element individually.
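A minimal sketch of that workaround, using only the standard library. The helper name parse_mixed is hypothetical and not part of pandas; it assumes every non-datetime element is a string in the given format.

```python
from datetime import datetime


def parse_mixed(values, fmt):
    """Hypothetical helper: parse strings with fmt, pass datetimes through."""
    parsed = []
    for value in values:
        if isinstance(value, datetime):
            # Already a datetime: keep it as-is instead of re-parsing it.
            parsed.append(value)
        else:
            # Assumed to be a string in the expected format.
            parsed.append(datetime.strptime(value, fmt))
    return parsed


result = parse_mixed(["01/02/01", datetime(2001, 2, 2)], "%d/%m/%y")
print(result)
```

The resulting list contains only datetime objects, so it can be passed to pd.to_datetime without a format. The cost is a Python-level loop, which is slow for large columns.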

Additional Context

No response

Comment From: MarcoGorelli

Hi @sv1990

Thanks for your report. It might make sense to skip datetime objects; investigations and pull requests would be welcome.

Otherwise dayfirst should work here:

>>> pd.to_datetime(s, dayfirst=True)
0   2001-02-01
1   2001-02-02
dtype: datetime64[ns]

Comment From: aaossa

Seems like the exception comes from convert_listlike:

https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L1070-L1079

and then from _to_datetime_with_format:

https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L431-L436

which uses _array_strptime_with_fallback as fallback:

https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L539-L542

which is itself a wrapper around array_strptime:

https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L474-L475

From there, pandas/_libs/tslibs/strptime.pyx seems to be the right place either to skip datetime objects (I have not tested this yet) or to construct an appropriate object directly. I'll give it a try.

Comment From: aaossa

Ok, this change seems to work:

diff --git a/pandas/_libs/tslibs/strptime.pyx b/pandas/_libs/tslibs/strptime.pyx
index 6287c2fbc5..20452bf75e 100644
--- a/pandas/_libs/tslibs/strptime.pyx
+++ b/pandas/_libs/tslibs/strptime.pyx
@@ -2,6 +2,7 @@
 """
 from cpython.datetime cimport (
     date,
+    datetime,
     tzinfo,
 )

@@ -129,6 +130,19 @@ def array_strptime(ndarray[object] values, str fmt, bint exact=True, errors='rai
             if val in nat_strings:
                 iresult[i] = NPY_NAT
                 continue
+        elif isinstance(val, datetime):
+            dts.year = val.year
+            dts.month = val.month
+            dts.day = val.day
+            dts.hour = val.hour
+            dts.min = val.minute
+            dts.sec = val.second
+            dts.us = val.microsecond
+            dts.ps = 0  # Not enough precision in datetime objects (https://github.com/python/cpython/issues/59648)
+
+            iresult[i] = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
+            result_timezone[i] = val.tzname()
+            continue
         else:
             if checknull_with_nat_and_na(val):
                 iresult[i] = NPY_NAT

I just replicated the processing applied in the same function a couple of lines below:

https://github.com/pandas-dev/pandas/blob/953921f2782062a67aa3886837fac563892d7ec6/pandas/_libs/tslibs/strptime.pyx#L324-L342

I'll prepare a PR with proper tests, if that's OK, so we can move the discussion about the implementation there.

Comment From: MarcoGorelli

Thanks for looking into this - if it's the same, can it be factored out into a function?