Feature Type
- [ ] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
When reading Excel files, I often get date columns that mix strings and already-parsed datetime objects, like this:
import pandas as pd
from datetime import datetime
s = pd.Series(["01/02/01", datetime(2001, 2, 2)])
When trying to parse those columns using
pd.to_datetime(s, format="%d/%m/%y")
the Exception ValueError("time data '2001-02-02 00:00:00' does not match format '%d/%m/%y' (match)")
is raised.
The cause is that the already-parsed datetime
is first converted to an ISO-format string, and pandas then tries to parse that string with the given format.
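The failure mode described above can be reproduced with the standard library alone. This is only a sketch: it assumes that `str()` is a fair stand-in for the stringification step, which I have not verified against the pandas internals.

```python
from datetime import datetime

fmt = "%d/%m/%y"
already_parsed = datetime(2001, 2, 2)

# Assumption: str() mimics how the datetime object ends up as text
# before the format is applied.
as_text = str(already_parsed)

try:
    datetime.strptime(as_text, fmt)
    reparse_failed = False
except ValueError as exc:
    # The ISO-style text cannot match a day/month/two-digit-year format.
    reparse_failed = True
    print(exc)
```

The text produced from the datetime object is `2001-02-02 00:00:00`, the same string that shows up in the pandas error message.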
Using pd.to_datetime
without a format leads to wrong results, since a string like "01/02/01" is ambiguous between day-first and month-first.
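To make the ambiguity concrete, here is a small standard-library sketch (not pandas code) that parses the same string under two plausible formats:

```python
from datetime import datetime

text = "01/02/01"

# The same text yields two different dates depending on the assumed order.
day_first = datetime.strptime(text, "%d/%m/%y")    # 1 February 2001
month_first = datetime.strptime(text, "%m/%d/%y")  # 2 January 2001

print(day_first.date(), month_first.date())
```

Without an explicit format or a hint such as dayfirst, a parser has to guess which of the two readings is intended.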
Feature Description
pd.to_datetime
should either get an option to handle datetime.datetime
objects differently or do so by default.
Alternative Solutions
The only alternative solution I can currently think of is a raw loop that checks the type of each element individually.
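A minimal sketch of that loop, using only the standard library. The helper name `parse_mixed` is made up for illustration and is not part of pandas:

```python
from datetime import datetime


def parse_mixed(values, fmt):
    """Parse strings with `fmt`, passing datetime objects through untouched.

    Hypothetical helper for illustration only.
    """
    parsed = []
    for value in values:
        if isinstance(value, datetime):
            # Already a datetime: keep it as-is instead of re-parsing.
            parsed.append(value)
        else:
            parsed.append(datetime.strptime(value, fmt))
    return parsed


result = parse_mixed(["01/02/01", datetime(2001, 2, 2)], "%d/%m/%y")
print(result)
```

The resulting list could then be handed to pd.to_datetime without a format, since every element is already a datetime object. The downside is that the per-element Python loop is likely to be slow for large columns.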
Additional Context
No response
Comment From: MarcoGorelli
Hi @sv1990
Thanks for your report - it might make sense to skip datetime
objects; investigations / pull requests would be welcome.
Otherwise dayfirst
should work here:
>>> pd.to_datetime(s, dayfirst=True)
0   2001-02-01
1   2001-02-02
dtype: datetime64[ns]
Comment From: aaossa
Seems like the exception comes from convert_listlike:
https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L1070-L1079
and then from _to_datetime_with_format:
https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L431-L436
which uses _array_strptime_with_fallback as fallback:
https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L539-L542
which is a wrapper around array_strptime:
https://github.com/pandas-dev/pandas/blob/bbb1cdf13a1e9240b43d691aa0ec3ca1b37afee4/pandas/core/tools/datetimes.py#L474-L475
From there, pandas/_libs/tslibs/strptime.pyx
seems to be the correct place to either skip datetime
objects (this has not been tested yet) or directly construct an appropriate object from them. I'll give it a try.
Comment From: aaossa
Ok, this change seems to work:
diff --git a/pandas/_libs/tslibs/strptime.pyx b/pandas/_libs/tslibs/strptime.pyx
index 6287c2fbc5..20452bf75e 100644
--- a/pandas/_libs/tslibs/strptime.pyx
+++ b/pandas/_libs/tslibs/strptime.pyx
@@ -2,6 +2,7 @@
"""
from cpython.datetime cimport (
date,
+ datetime,
tzinfo,
)
@@ -129,6 +130,19 @@ def array_strptime(ndarray[object] values, str fmt, bint exact=True, errors='rai
if val in nat_strings:
iresult[i] = NPY_NAT
continue
+ elif isinstance(val, datetime):
+ dts.year = val.year
+ dts.month = val.month
+ dts.day = val.day
+ dts.hour = val.hour
+ dts.min = val.minute
+ dts.sec = val.second
+ dts.us = val.microsecond
+ dts.ps = 0 # Not enough precision in datetime objects (https://github.com/python/cpython/issues/59648)
+
+ iresult[i] = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
+ result_timezone[i] = val.tzname()
+ continue
else:
if checknull_with_nat_and_na(val):
iresult[i] = NPY_NAT
I just replicated the processing applied in the same function a couple of lines below:
https://github.com/pandas-dev/pandas/blob/953921f2782062a67aa3886837fac563892d7ec6/pandas/_libs/tslibs/strptime.pyx#L324-L342
I'll prepare a PR and proper tests if that's ok, so we can move the discussion about the implementation there.
Comment From: MarcoGorelli
Thanks for looking into this - if it's the same, can it be factored out into a function?
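As a rough pure-Python analogue of that refactoring (not the actual Cython code), both branches could delegate to one shared helper. All names below are hypothetical, and time zones are ignored to keep the sketch short:

```python
from datetime import datetime

_EPOCH = datetime(1970, 1, 1)


def fields_to_epoch_ns(year, month, day, hour=0, minute=0, second=0, microsecond=0):
    """Hypothetical shared helper: naive date/time fields -> ns since the Unix epoch."""
    delta = datetime(year, month, day, hour, minute, second, microsecond) - _EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def ns_from_datetime(val):
    # Branch for values that are already datetime objects.
    return fields_to_epoch_ns(
        val.year, val.month, val.day,
        val.hour, val.minute, val.second, val.microsecond,
    )


def ns_from_string(text, fmt):
    # Branch for strings: parse first, then reuse the same helper.
    parsed = datetime.strptime(text, fmt)
    return fields_to_epoch_ns(
        parsed.year, parsed.month, parsed.day,
        parsed.hour, parsed.minute, parsed.second, parsed.microsecond,
    )


print(ns_from_datetime(datetime(2001, 2, 2)))
print(ns_from_string("02/02/01", "%d/%m/%y"))
```

With the conversion in one place, the datetime branch and the string branch cannot drift apart, which is presumably the point of factoring it out.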