Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [15]: to_datetime(['2020-01-01 00:00:00+01:00', '2020-01-01 00:00:00+02:00', None], format='%Y-%m-%d %H:%M:%S%z')
Out[15]: Index([2020-01-01 00:00:00+01:00, 2020-01-01 00:00:00+02:00, None], dtype='object')

In [16]: to_datetime(['2020-01-01 00:00:00+01:00', '2020-01-01 00:00:00+02:00', None], format='%Y-%d-%m %H:%M:%S%z')
Out[16]: Index([2020-01-01 00:00:00+01:00, 2020-01-01 00:00:00+02:00, NaT], dtype='object')

Issue Description

In the first (ISO) case, None stays None. But in the second (non-ISO) case, None becomes NaT.

Expected Behavior

@jbrockmendel any thoughts on what should be correct?

I think I'd prefer the latter (just change to NaT)

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.8.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: MarcoGorelli

@mroeschke do you have thoughts on which is correct?

Comment From: jbrockmendel

I think I'd expect both to be NaT and both to be DatetimeIndex

Comment From: MarcoGorelli

I don't think DatetimeIndex is possible with mixed offsets, is it? (note how the strings have different offsets)
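To make the mixed-offset point concrete, here is a minimal sketch using only the standard library rather than pandas itself. It parses the two strings from the example and shows that they carry different UTC offsets, so a container with a single timezone could only hold them after converting to a common zone (in pandas that is what `utc=True` is for). The variable names are illustrative only.

```python
from datetime import datetime, timezone

fmt = "%Y-%m-%d %H:%M:%S%z"

# The two strings from the reproducible example.
a = datetime.strptime("2020-01-01 00:00:00+01:00", fmt)
b = datetime.strptime("2020-01-01 00:00:00+02:00", fmt)

# Two distinct offsets, so no single fixed offset describes both values.
offsets = {a.utcoffset(), b.utcoffset()}

# Converting to one common zone (UTC here) makes them uniform again.
as_utc = [d.astimezone(timezone.utc) for d in (a, b)]

print(len(offsets))                        # number of distinct offsets
print([d.isoformat() for d in as_utc])     # both now expressed in UTC
```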

Comment From: jbrockmendel

Didn't notice that, you're right

Comment From: mroeschke

Should the errors keyword be applicable here?

Since there's an implicit to_datetime(..., format=..., errors="raise"), shouldn't this raise, since format can't be applied to None? Not sure if we have testing for None being special.

Comment From: MarcoGorelli

yeah it's tested here

https://github.com/pandas-dev/pandas/blob/8da8743729118139910b9eeb27e27a1ceceb2c4f/pandas/tests/tools/test_to_datetime.py#L1343-L1363

I just think expected is wrong on two counts:

  • the first "NaT" should be NaT
  • the None should be NaT

Comment From: mroeschke

I guess in that particular example format is not provided, though.

I haven't looked at mixed-argument to_datetime testing in a while, but this leads me to 2 questions:

  1. If format is provided, should all arguments be strings (or datetimes that were recently added?) or can other arguments be parsed with non-formatting paths?
  2. In theory, should None.strftime(format) -> NaT parsing be supported? Since NaT.strptime/strftime are designed to raise errors, my initial thought is no, but I could be convinced otherwise

Comment From: mroeschke

I just think expected is wrong on two counts:

Agreed with your conclusions

Comment From: MarcoGorelli

Not sure it should depend on whether or not format was passed - what I'm leaning towards is:

  • skip over any NaT-y (None, NaT, nan, 'nan', etc.) -> convert all to NaT
  • skip (or rather, preserve) pydatetime objects
  • if format is provided, parse according to that (respecting exact). Else, guess/infer

Note that with PDEP4, it'll always behave as if format was passed (unless pandas can't guess the format)
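The guidelines above can be sketched in plain Python. This is an illustration of the proposed rules only, not the pandas implementation: `NAT`, `NULL_STRINGS`, `is_null_like`, and `parse_all` are hypothetical names, the set of null-like strings is an assumption, and neither the `exact` option nor format guessing is modelled.

```python
import math
from datetime import datetime


class _NaTType:
    """Hypothetical stand-in for pandas' NaT, used only in this sketch."""

    def __repr__(self):
        return "NaT"


NAT = _NaTType()

# Assumed set of strings treated as missing; pandas' real list may differ.
NULL_STRINGS = {"", "nan", "nat", "none", "null"}


def is_null_like(value):
    """Return True for None, float NaN, the NAT sentinel, or a null-like string."""
    if value is None or value is NAT:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in NULL_STRINGS


def parse_all(values, fmt):
    """Apply the three proposed rules to each element in turn."""
    out = []
    for value in values:
        if is_null_like(value):
            out.append(NAT)            # rule 1: every null-like value becomes NaT
        elif isinstance(value, datetime):
            out.append(value)          # rule 2: datetime objects pass through
        else:
            out.append(datetime.strptime(value, fmt))  # rule 3: parse with format
    return out


data = ["2020-01-01 00:00:00+01:00", "2020-01-01 00:00:00+02:00", None]
iso_result = parse_all(data, "%Y-%m-%d %H:%M:%S%z")
non_iso_result = parse_all(data, "%Y-%d-%m %H:%M:%S%z")
print(iso_result[2], non_iso_result[2])
```

Under these rules the missing element is handled before any format is applied, so the ISO and non-ISO formats from the report give the same result for None.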

Comment From: mroeschke

Ah okay I like those guidelines then.

Comment From: jbrockmendel

looks like the example you used is the same as from #40111, so a PR that addresses this may close that

Comment From: jorisvandenbossche

Resolved by https://github.com/pandas-dev/pandas/pull/50242