Code Sample, a copy-pastable example if possible
Without errors='coerce'
```>>> import pandas as pd
d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] } df = pd.DataFrame(data=d) d {'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']} df ts 0 2019-02-02 08:07:13+00 1 2019-02-02 08:03:22.54+00 df.dtypes ts object dtype: object
df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True) df ts 0 2019-02-02 08:07:13+00:00 1 2019-02-02 08:03:22.540000+00:00 df.dtypes ts datetime64[ns, UTC] dtype: object
With errors='coerce'
```>>> import pandas as pd
>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
ts
0 2019-02-02 08:07:13+00
1 2019-02-02 08:03:22.54+00
>>> df.dtypes
ts object
dtype: object
>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
ts
0 2019-02-02 08:07:13
1 NaT
Problem description
The functionality of to_datetime() with errors='coerce' is different than without. If I understand some of the other issues raised on this topic correctly, the functionality is different in some cases by design. In this case, the dates are very similiar, although different format.
Expected Output
```>>> import pandas as pd
d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] } df = pd.DataFrame(data=d) d {'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']} df ts 0 2019-02-02 08:07:13+00 1 2019-02-02 08:03:22.54+00 df.dtypes ts object dtype: object
df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce') df ts 0 2019-02-02 08:07:13+00:00 1 2019-02-02 08:03:22.540000+00:00 df.dtypes ts datetime64[ns, UTC] dtype: object
#### Output of ``pd.show_versions()``
<details>
[paste the output of ``pd.show_versions()`` here below this line]
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
>>>
</details>
**Comment From: mroeschke**
Agreed this looks strange. Investigation and PR's welcome!
**Comment From: mroeschke**
So in `to_datetime`, we have a lot of sub functions that also attempts error handling while the the logic of when to return a result is inconsistent as well. I have a feeling that pulling out the error handling logic at more of a top level would help make this keyword return a more consistent result.
**Comment From: saurav2608**
I will have a look at this.
**Comment From: ngutzmann**
Perhaps this is a bit of useful information. In pandas 0.23.4 the functionality looks to be correct:
import pandas as pd pd.version '0.23.4' d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] } df = pd.DataFrame(data=d) df ts 0 2019-02-02 08:07:13+00 1 2019-02-02 08:03:22.54+00 df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce') df ts 0 2019-02-02 08:07:13.000 1 2019-02-02 08:03:22.540 df.dtypes ts datetime64[ns] dtype: object
**Comment From: saurav2608**
I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if '_infer_datetime_format_' is **True**. And reading the code here(https://github.com/pandas-dev/pandas/blob/4a20d5b9e526435a53b7cb1dc0a819299d31f040/pandas/core/tools/datetimes.py#L296) makes me think that this behavior is by design.
I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.
In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00', ...: '01-02-2011 00:00:00', ...: '2011-01-03T00:00:00']))
In [13]: s
Out[13]:
0 01/01/2011 00:00:00
1 01-02-2011 00:00:00
2 2011-01-03T00:00:00
dtype: object
In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')
Out[19]:
0 2011-01-01
1 NaT
2 NaT
dtype: datetime64[ns]
In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')
Out[20]:
0 2011-01-01
1 2011-01-02
2 2011-01-03
dtype: datetime64[ns]
**Comment From: saurav2608**
> I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if '_infer_datetime_format_' is **True**. And reading the code here(
>
> [pandas/pandas/core/tools/datetimes.py](https://github.com/pandas-dev/pandas/blob/4a20d5b9e526435a53b7cb1dc0a819299d31f040/pandas/core/tools/datetimes.py#L296)
>
> Line 296 in [4a20d5b](/pandas-dev/pandas/commit/4a20d5b9e526435a53b7cb1dc0a819299d31f040)
>
> if result is None:
> ) makes me think that this behavior is by design.
> I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.
>
> ```
> In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00',
> ...: '01-02-2011 00:00:00',
> ...: '2011-01-03T00:00:00']))
>
> In [13]: s
> Out[13]:
> 0 01/01/2011 00:00:00
> 1 01-02-2011 00:00:00
> 2 2011-01-03T00:00:00
> dtype: object
> In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')
> Out[19]:
> 0 2011-01-01
> 1 NaT
> 2 NaT
> dtype: datetime64[ns]
>
> In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')
> Out[20]:
> 0 2011-01-01
> 1 2011-01-02
> 2 2011-01-03
> dtype: datetime64[ns]
> ```
@mroeschke : any thoughts on this.
**Comment From: mroeschke**
So the core issue is that our format guessing function is unable to guess %z, UTC offsets.
https://github.com/pandas-dev/pandas/blob/2909b830fa21c6bc2e9797aae25b13f9a060653a/pandas/_libs/tslibs/parsing.pyx#L635
I agree with your suggestion though; `errors='raise'` should always raise whenever there's an error.
EDIT: Actually this would fail independently of timezones. Since `infer_datetime_format` guesses a format based on the first non-null time string, and the 2nd string is the example already is in a different format than the first.
**Comment From: nicolasdaviaud**
I have a similar issue on 0.24.1, it also seems to be related to different time zones in the same query:
```python
dates = ['2016-05-19T10:27:05', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))
returns
['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']
['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']
but adding a Z
to the first date changes the behaviour on the second
dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))
returns
[datetime.datetime(2016, 5, 19, 10, 27, 5, tzinfo=tzutc()) datetime.datetime(2016, 5, 20, 11, 28, 6) 'NaT']
['2016-05-19T10:27:05.000000000' 'NaT' 'NaT']
Output of pd.show_versions()
Comment From: ericman93
same issue in 1.0.4
args = {
'cache': True,
'format': None,
'infer_datetime_format': True
}
ser = pd.Series(['03-12-2016', '29-07-2017', '07-05-2016'])
pd.to_datetime(ser, errors='raise', **args)
outputs:
0 2016-03-12
1 2017-07-29
2 2016-07-05
dtype: datetime64[ns]
pd.to_datetime(ser, errors='coerce', **args)
outputs:
0 2016-03-12
1 NaT
2 2016-07-05
dtype: datetime64[ns]
Comment From: evgeniikozlov
Still reproduced in 1.2.4
>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="coerce")
DatetimeIndex(['2021-05-01', 'NaT'], dtype='datetime64[ns]', freq=None)
>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)
>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=False, errors="coerce")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)
The output is dependent on combination of infer_datetime_format and errors arguments.
Output of pd.show_versions()
Comment From: jreback
@evgeniikozlov hence the open issue label
PRs from the community are how these things get fixed
Comment From: MarcoGorelli
As of PDEP4, infer_datetime_format
is obsolete
I've tried all the examples here, and they're all now behaving as expected
For example:
In [26]: dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
...: pd.to_datetime(dates)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[26], line 2
1 dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
----> 2 pd.to_datetime(dates)
File ~/pandas-dev/pandas/core/tools/datetimes.py:1098, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1096 result = _convert_and_box_cache(argc, cache_array)
1097 else:
-> 1098 result = convert_listlike(argc, format)
1099 else:
1100 result = convert_listlike(np.array([arg]), format)[0]
File ~/pandas-dev/pandas/core/tools/datetimes.py:452, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
441 if format is not None and not require_iso8601:
442 return _to_datetime_with_format(
443 arg,
444 orig_arg,
(...)
449 errors,
450 )
--> 452 result, tz_parsed = objects_to_datetime64ns(
453 arg,
454 dayfirst=dayfirst,
455 yearfirst=yearfirst,
456 utc=utc,
457 errors=errors,
458 require_iso8601=require_iso8601,
459 allow_object=True,
460 format=format,
461 exact=exact,
462 )
464 if tz_parsed is not None:
465 # We can take a shortcut since the datetime64 numpy array
466 # is in UTC
467 dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))
File ~/pandas-dev/pandas/core/arrays/datetimes.py:2162, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
2160 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
2161 try:
-> 2162 result, tz_parsed = tslib.array_to_datetime(
2163 data.ravel("K"),
2164 errors=errors,
2165 utc=utc,
2166 dayfirst=dayfirst,
2167 yearfirst=yearfirst,
2168 require_iso8601=require_iso8601,
2169 format=format,
2170 exact=exact,
2171 )
2172 result = result.reshape(data.shape, order=order)
2173 except OverflowError as err:
2174 # Exception is raised when a part of date is greater than 32 bit signed int
File ~/pandas-dev/pandas/_libs/tslib.pyx:453, in pandas._libs.tslib.array_to_datetime()
451 @cython.wraparound(False)
452 @cython.boundscheck(False)
--> 453 cpdef array_to_datetime(
454 ndarray[object] values,
455 str errors="raise",
File ~/pandas-dev/pandas/_libs/tslib.pyx:614, in pandas._libs.tslib.array_to_datetime()
612 continue
613 elif is_raise:
--> 614 raise ValueError(
615 f"time data \"{val}\" at position {i} doesn't "
616 f"match format \"{format}\""
ValueError: time data "20/05/2016 11:28:06" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S%z"
It's correct to raise here, as the second element doesn't match the format inferred from the first element
In the 01-Sep-2021
example, you'll need to pass %b
in the format, else it'll be inferred to be %B
and will (correctly) error, as Sep
doesn't match that directive:
In [28]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], format='%d-%b-%Y %H:%M:%S')
Out[28]: DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)
Closing for now then, but thanks for the report, and please do let me know if I've misunderstood anything heree
Comment From: evgeniikozlov
@MarcoGorelli thanks for the explanation about "May" parsing, but it is still unclear why it works if errors="raise"
(my second example). It is confusing a lot why errors
option has any affect on parsing in this case.
Comment From: MarcoGorelli
That was a bug, and was fixed as part of the PDEP4 change
On the main branch, you'd get an error:
In [3]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
<ipython-input-3-08582d0c0651>:1: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
---------------------------------------------------------------------------
ValueError: time data '01-Sep-2021 00:00:00' does not match format '%d-%B-%Y %H:%M:%S' (match)
This behaviour will be available in pandas 2.0.0, which'll hopefully come out around February