Pandas to_datetime() with errors=coerce and without return different values

Code Sample, a copy-pastable example if possible

Without errors='coerce'

```>>> import pandas as pd

d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] } df = pd.DataFrame(data=d) d {'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']} df ts 0 2019-02-02 08:07:13+00 1 2019-02-02 08:03:22.54+00 df.dtypes ts object dtype: object

df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True) df ts 0 2019-02-02 08:07:13+00:00 1 2019-02-02 08:03:22.540000+00:00 df.dtypes ts datetime64[ns, UTC] dtype: object


With errors='coerce' 

```>>> import pandas as pd
>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df.dtypes
ts    object
dtype: object
>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
                   ts
0 2019-02-02 08:07:13
1                 NaT

Problem description

The functionality of to_datetime() with errors='coerce' is different than without. If I understand some of the other issues raised on this topic correctly, the functionality is different in some cases by design. In this case, the dates are very similiar, although different format.

Expected Output

```>>> import pandas as pd

d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] } df = pd.DataFrame(data=d) d {'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']} df ts 0 2019-02-02 08:07:13+00 1 2019-02-02 08:03:22.54+00 df.dtypes ts object dtype: object

df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce') df ts 0 2019-02-02 08:07:13+00:00 1 2019-02-02 08:03:22.540000+00:00 df.dtypes ts datetime64[ns, UTC] dtype: object


#### Output of ``pd.show_versions()``

<details>

[paste the output of ``pd.show_versions()`` here below this line]

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
>>>

</details>


**Comment From: mroeschke**

Agreed this looks strange. Investigation and PR's welcome!

**Comment From: mroeschke**

So in `to_datetime`, we have a lot of sub functions that also attempts error handling while the the logic of when to return a result is inconsistent as well. I have a feeling that pulling out the error handling logic at more of a top level would help make this keyword return a more consistent result.

**Comment From: saurav2608**

I will have a look at this. 

**Comment From: ngutzmann**

Perhaps this is a bit of useful information. In pandas 0.23.4 the functionality looks to be correct:

import pandas as pd pd.version '0.23.4' d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] } df = pd.DataFrame(data=d) df ts 0 2019-02-02 08:07:13+00 1 2019-02-02 08:03:22.54+00 df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce') df ts 0 2019-02-02 08:07:13.000 1 2019-02-02 08:03:22.540 df.dtypes ts datetime64[ns] dtype: object



**Comment From: saurav2608**

I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if '_infer_datetime_format_' is **True**. And reading the code here(https://github.com/pandas-dev/pandas/blob/4a20d5b9e526435a53b7cb1dc0a819299d31f040/pandas/core/tools/datetimes.py#L296) makes me think that this behavior is by design. 

I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.

In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00', ...: '01-02-2011 00:00:00', ...: '2011-01-03T00:00:00']))

In [13]: s
Out[13]: 0 01/01/2011 00:00:00 1 01-02-2011 00:00:00 2 2011-01-03T00:00:00 dtype: object In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')
Out[19]: 0 2011-01-01 1 NaT 2 NaT dtype: datetime64[ns]

In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')
Out[20]: 0 2011-01-01 1 2011-01-02 2 2011-01-03 dtype: datetime64[ns]



**Comment From: saurav2608**

> I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if '_infer_datetime_format_' is **True**. And reading the code here(
> 
> [pandas/pandas/core/tools/datetimes.py](https://github.com/pandas-dev/pandas/blob/4a20d5b9e526435a53b7cb1dc0a819299d31f040/pandas/core/tools/datetimes.py#L296)
> 
> Line 296 in [4a20d5b](/pandas-dev/pandas/commit/4a20d5b9e526435a53b7cb1dc0a819299d31f040)
> 
>  if result is None: 
> ) makes me think that this behavior is by design.
> I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.
> 
> ```
> In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00', 
>     ...:                                 '01-02-2011 00:00:00', 
>     ...:                                 '2011-01-03T00:00:00']))                                                                                                                 
> 
> In [13]: s                                                                                                                                                                        
> Out[13]: 
> 0    01/01/2011 00:00:00
> 1    01-02-2011 00:00:00
> 2    2011-01-03T00:00:00
> dtype: object
> In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')                                                                                                           
> Out[19]: 
> 0   2011-01-01
> 1          NaT
> 2          NaT
> dtype: datetime64[ns]
> 
> In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')                                                                                                            
> Out[20]: 
> 0   2011-01-01
> 1   2011-01-02
> 2   2011-01-03
> dtype: datetime64[ns]
> ```

@mroeschke : any thoughts on this. 

**Comment From: mroeschke**

So the core issue is that our format guessing function is unable to guess %z, UTC offsets.

https://github.com/pandas-dev/pandas/blob/2909b830fa21c6bc2e9797aae25b13f9a060653a/pandas/_libs/tslibs/parsing.pyx#L635

I agree with your suggestion though; `errors='raise'` should always raise whenever there's an error. 

EDIT: Actually this would fail independently of timezones. Since `infer_datetime_format` guesses a format based on the first non-null time string, and the 2nd string is the example already is in a different format than the first.

**Comment From: nicolasdaviaud**

I have a similar issue on 0.24.1, it also seems to be related to different time zones in the same query:
```python
dates = ['2016-05-19T10:27:05', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))

returns

['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']
['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']

but adding a Z to the first date changes the behaviour on the second

dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))

returns

[datetime.datetime(2016, 5, 19, 10, 27, 5, tzinfo=tzutc()) datetime.datetime(2016, 5, 20, 11, 28, 6) 'NaT']
['2016-05-19T10:27:05.000000000' 'NaT' 'NaT']

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 18.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: fr_FR.UTF-8 LOCALE: fr_FR.UTF-8 pandas: 0.24.1 pytest: None pip: 18.1 setuptools: 40.6.3 Cython: None numpy: 1.16.1 scipy: 1.2.1 pyarrow: None xarray: None IPython: 7.2.0 sphinx: None patsy: 0.5.1 dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 3.0.2 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None

Comment From: ericman93

same issue in 1.0.4

args = {
    'cache': True,
    'format': None,
    'infer_datetime_format': True
}
ser = pd.Series(['03-12-2016', '29-07-2017', '07-05-2016'])

pd.to_datetime(ser, errors='raise', **args)

outputs:
0   2016-03-12
1   2017-07-29
2   2016-07-05
dtype: datetime64[ns]

pd.to_datetime(ser, errors='coerce', **args)

outputs:
0   2016-03-12
1          NaT
2   2016-07-05
dtype: datetime64[ns]

Comment From: evgeniikozlov

Still reproduced in 1.2.4

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="coerce")
DatetimeIndex(['2021-05-01', 'NaT'], dtype='datetime64[ns]', freq=None)

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=False, errors="coerce")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

The output is dependent on combination of infer_datetime_format and errors arguments.

Output of `pd.show_versions()`

pandas.show_versions() INSTALLED VERSIONS ------------------ commit : 2cb96529396d93b46abab7bbc73a208e708c642e python : 3.7.10.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19041 machine : AMD64 processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None pandas : 1.2.4 numpy : 1.20.2 pytz : 2021.1 dateutil : 2.8.1 pip : 21.1.2 setuptools : 52.0.0.post20210125 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : 3.3.4 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : 1.6.2 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None

Comment From: jreback

@evgeniikozlov hence the open issue label

PRs from the community are how these things get fixed

Comment From: MarcoGorelli

As of PDEP4, infer_datetime_format is obsolete

I've tried all the examples here, and they're all now behaving as expected

For example:

In [26]: dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
    ...: pd.to_datetime(dates)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 2
      1 dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
----> 2 pd.to_datetime(dates)

File ~/pandas-dev/pandas/core/tools/datetimes.py:1098, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1096         result = _convert_and_box_cache(argc, cache_array)
   1097     else:
-> 1098         result = convert_listlike(argc, format)
   1099 else:
   1100     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:452, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
    441 if format is not None and not require_iso8601:
    442     return _to_datetime_with_format(
    443         arg,
    444         orig_arg,
   (...)
    449         errors,
    450     )
--> 452 result, tz_parsed = objects_to_datetime64ns(
    453     arg,
    454     dayfirst=dayfirst,
    455     yearfirst=yearfirst,
    456     utc=utc,
    457     errors=errors,
    458     require_iso8601=require_iso8601,
    459     allow_object=True,
    460     format=format,
    461     exact=exact,
    462 )
    464 if tz_parsed is not None:
    465     # We can take a shortcut since the datetime64 numpy array
    466     # is in UTC
    467     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2162, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
   2160 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2161 try:
-> 2162     result, tz_parsed = tslib.array_to_datetime(
   2163         data.ravel("K"),
   2164         errors=errors,
   2165         utc=utc,
   2166         dayfirst=dayfirst,
   2167         yearfirst=yearfirst,
   2168         require_iso8601=require_iso8601,
   2169         format=format,
   2170         exact=exact,
   2171     )
   2172     result = result.reshape(data.shape, order=order)
   2173 except OverflowError as err:
   2174     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:453, in pandas._libs.tslib.array_to_datetime()
    451 @cython.wraparound(False)
    452 @cython.boundscheck(False)
--> 453 cpdef array_to_datetime(
    454     ndarray[object] values,
    455     str errors="raise",

File ~/pandas-dev/pandas/_libs/tslib.pyx:614, in pandas._libs.tslib.array_to_datetime()
    612     continue
    613 elif is_raise:
--> 614     raise ValueError(
    615         f"time data \"{val}\" at position {i} doesn't "
    616         f"match format \"{format}\""

ValueError: time data "20/05/2016 11:28:06" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S%z"

It's correct to raise here, as the second element doesn't match the format inferred from the first element

In the 01-Sep-2021 example, you'll need to pass %b in the format, else it'll be inferred to be %B and will (correctly) error, as Sep doesn't match that directive:

In [28]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], format='%d-%b-%Y %H:%M:%S')
Out[28]: DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

Closing for now then, but thanks for the report, and please do let me know if I've misunderstood anything heree

Comment From: evgeniikozlov

@MarcoGorelli thanks for the explanation about "May" parsing, but it is still unclear why it works if errors="raise" (my second example). It is confusing a lot why errorsoption has any affect on parsing in this case.

Comment From: MarcoGorelli

That was a bug, and was fixed as part of the PDEP4 change

On the main branch, you'd get an error:

In [3]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
<ipython-input-3-08582d0c0651>:1: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
---------------------------------------------------------------------------
ValueError: time data '01-Sep-2021 00:00:00' does not match format '%d-%B-%Y %H:%M:%S' (match)

This behaviour will be available in pandas 2.0.0, which'll hopefully come out around February

Pandas to_datetime() with errors=coerce and without return different values

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of pd.show_versions()

Output of `pd.show_versions()`

Output of `pd.show_versions()`