Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> import pandas as pd
>>> pd.__version__
'1.5.3'
>>> s = ['2023-01-02', 'Mon Mar 20 11:00:00 UTC 2023', '', 'NaN', 'foo', float('NaN')]
>>> pd.to_datetime(s, errors='coerce')
Index([2023-01-02 00:00:00, 2023-03-20 11:00:00+00:00, 'NaT', 'NaT', NaT, nan], dtype='object')
>>> [type(x) for x in pd.to_datetime(s, errors='coerce')]
[datetime.datetime,
datetime.datetime,
str,
str,
pandas._libs.tslibs.nattype.NaTType,
float]
Issue Description
According to the docs, errors='coerce' should convert entries to either a proper datetime, or NaT. Instead, we get a mix of:
* datetime,
* float('NaN'),
* pandas._libs.tslibs.nattype.NaTType,
* 'NaT' as string.
This makes it very difficult to have some consistent handling of the errors.
For example, instead of:
s = pd.Series(...)
res = pd.to_datetime(s, errors='coerce')
s.loc[res.isna()]
to inspect the problematic input, we have to handle the string 'NaT' as well as .isna(). The float NaN is also unexpected in that context.
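As a stopgap, the three failure markers seen above (NaT, float nan, and the string 'NaT') can be folded into one mask. This is only a sketch: failed_mask is a hypothetical helper name, not a pandas API, and the sample list is built by hand to mirror the object-dtype output from the example rather than produced by to_datetime.

```python
import datetime as dt

import pandas as pd


def failed_mask(result):
    """Return a list of bools, True where an entry is not a usable datetime.

    Hypothetical helper for illustration; not part of pandas.
    """
    ser = pd.Series(list(result), dtype=object)
    # isna() catches both pd.NaT and float nan in an object Series.
    is_missing = ser.isna()
    # The literal string 'NaT' is not a missing value to pandas,
    # so it has to be tested for explicitly.
    is_nat_string = ser.map(lambda x: isinstance(x, str) and x == "NaT").astype(bool)
    return (is_missing | is_nat_string).tolist()


# Hand-built stand-in for the mixed result shown in the example.
sample = [
    dt.datetime(2023, 1, 2),
    dt.datetime(2023, 3, 20, 11, tzinfo=dt.timezone.utc),
    "NaT",
    "NaT",
    pd.NaT,
    float("nan"),
]
mask = failed_mask(sample)  # [False, False, True, True, True, True]
```

With the original input in a Series s, s.loc[failed_mask(res)] would then select the problematic rows, assuming res has the same length and order as s.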
Expected Behavior
The result should only contain NaT (as type pandas._libs.tslibs.nattype.NaTType) for entries that failed to convert. All others should be valid datetime values. There should be no string 'NaT' nor float nan in the results.
A simple .isna() should indicate all the entries that were not successfully parsed and converted.
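The expectation above could be written down as a small check, for instance in a regression test. This is a sketch under assumptions: is_clean_coerce_result is a hypothetical name, and the check only looks at element types, not at whether the parsed values are correct.

```python
import datetime as dt

import pandas as pd


def is_clean_coerce_result(result):
    """True if every element is a datetime (pd.Timestamp included) or pd.NaT.

    Hypothetical helper for illustration; not part of pandas.
    """
    return all(x is pd.NaT or isinstance(x, dt.datetime) for x in result)


# A result that meets the expectation: only Timestamps and NaT.
clean = pd.to_datetime(["2023-01-02", None], errors="coerce")

# Hand-built stand-in for the problematic mixed result.
mixed = [dt.datetime(2023, 1, 2), "NaT", float("nan")]

clean_ok = is_clean_coerce_result(clean)
mixed_ok = is_clean_coerce_result(mixed)
```

Here clean_ok is True and mixed_ok is False, because the string 'NaT' and the float nan are neither datetimes nor pd.NaT.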
Installed Versions
/mnt/miniconda3/envs/test/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.281-212.502.amzn2.x86_64
Version : #1 SMP Thu May 26 09:52:17 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : 7.1.2
hypothesis : 6.29.3
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.6.2
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2022.11.0
scipy : 1.10.0
snappy : None
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : None
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : 0.19.0
tzdata : None
Comment From: MarcoGorelli
thanks for the report
here's the result with the 2.0.0 release candidate:
In [78]: s = ['2023-01-02', 'Mon Mar 20 11:00:00 UTC 2023', '', 'NaN', 'foo', float('NaN')]
In [79]: pd.to_datetime(s, errors='coerce')
Out[79]: DatetimeIndex(['2023-01-02', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
does this look right?
EDIT: the mixed path still shows mixed types in the result, which doesn't look right
In [80]: pd.to_datetime(s, errors='coerce', format='mixed')
Out[80]: Index([2023-01-02 00:00:00, 2023-03-20 11:00:00+00:00, 'NaT', 'NaT', NaT, nan], dtype='object')
Thanks for the report!
Comment From: pdemarti
Out[79] looks lovely.
And yes, the output with format='mixed' doesn't look right.