- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
>>> import pandas as pd
>>> from datetime import datetime
>>> from functools import partial
>>> pd.to_numeric(datetime(2021, 8, 22), errors="coerce")
nan
>>> pd.to_numeric(pd.Series(datetime(2021, 8, 22)), errors="coerce")
0 1629590400000000000
dtype: int64
>>> pd.Series([datetime(2021, 8, 22)]).apply(partial(pd.to_numeric), errors="coerce")
0 NaN
dtype: float64
>>>
>>> pd.to_numeric(pd.NaT, errors="coerce")
nan
>>> pd.to_numeric(pd.Series(pd.NaT), errors="coerce")
0 -9223372036854775808
dtype: int64
>>> pd.Series([pd.NaT]).apply(partial(pd.to_numeric), errors="coerce")
0 NaN
dtype: float64
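For context on the integer above: `datetime64[ns]` values are stored as nanoseconds since the Unix epoch, and pandas treats a naive `datetime` as UTC, so the value can be reproduced with the standard library alone (a sketch, independent of pandas):

```python
from datetime import datetime, timezone

# datetime64[ns] stores nanoseconds since 1970-01-01 00:00:00 UTC.
# Treating the naive datetime as UTC reproduces the integer returned
# by pd.to_numeric on the datetime64[ns] Series above.
dt = datetime(2021, 8, 22, tzinfo=timezone.utc)
epoch_ns = int(dt.timestamp()) * 1_000_000_000
print(epoch_ns)  # 1629590400000000000
```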
Problem description
When using `pd.to_numeric` to convert a `pd.Series` with dtype `datetime64[ns]`, it returns different values from converting the series value by value.
Expected Output
Converting a `pd.Series` as a whole should give the same result as converting it value by value.
I am not sure what the correct output should be, but IMO the output should be consistent in these two scenarios.
What I suggest:
- For non-null values, return the same value. Maybe the integer?
- For `pd.NaT`, always return `np.NaN`.
Output of pd.show_versions()
I am using the latest version of master as of today.
Comment From: DAKSHA2001
Assign me this issue. I will solve it. I have to make an open source contribution as part of a Microsoft Research intern role, so please assign this to me.
Comment From: ShreyasPatel031
take
Comment From: Navaneethan2503
This PR will close this issue #43289, thanks for this issue @hec10r. Closes #43280.
Comment From: hec10r
take
Comment From: hec10r
After some investigation, it looks like this behavior is explained by https://github.com/numpy/numpy/issues/19782, so I am not sure if this is a pandas issue or needs to be fixed in numpy. Before creating a PR here I would like to have input from the maintainers.
Comment From: mroeschke
The key point to note is that `datetime` objects in pandas get converted to `datetime64[ns]` (this has been the convention for a while) and not to object dtype unless `dtype=object` is specified. So that being said:
Incorrect, should be 1629590400000000000
>>> pd.to_numeric(datetime(2021, 8, 22), errors="coerce")
nan
Correct
>>> pd.to_numeric(pd.Series(datetime(2021, 8, 22)), errors="coerce")
0 1629590400000000000
dtype: int64
Incorrect, should be 1629590400000000000
>>> pd.Series([datetime(2021, 8, 22)]).apply(partial(pd.to_numeric), errors="coerce")
0 NaN
dtype: float64
Correct
>>> pd.to_numeric(pd.NaT, errors="coerce")
nan
Incorrect, probably related to https://github.com/pandas-dev/pandas/issues/16674
>>> pd.to_numeric(pd.Series(pd.NaT), errors="coerce")
0 -9223372036854775808
dtype: int64
Correct
>>> pd.Series([pd.NaT]).apply(partial(pd.to_numeric), errors="coerce")
0   NaN
dtype: float64
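The conversion convention described above (datetime objects inferred as `datetime64[ns]` unless `dtype=object` is given) can be checked directly; a minimal sketch:

```python
import pandas as pd
from datetime import datetime

# Without an explicit dtype, pandas infers datetime64[ns] for datetime objects.
inferred = pd.Series([datetime(2021, 8, 22)])
print(inferred.dtype)  # datetime64[ns]

# With dtype=object, the original datetime.datetime objects are kept as-is.
kept = pd.Series([datetime(2021, 8, 22)], dtype=object)
print(kept.dtype)  # object
```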
Comment From: hec10r
Hi Matthew, thanks for answering.
That's the other approach I considered, but since there was no documentation of the `pd.to_numeric` behavior for date-like objects and I didn't find it very intuitive, I thought that changing the whole behavior to something more intuitive would be good.
The use case I am struggling with is this one:
>>> import pandas as pd
>>> from datetime import datetime
>>> pd.to_numeric(pd.Series([datetime(2021,8,22)]), errors="raise") # Same for errors="ignore"/"coerce"
0 1629590400000000000
dtype: int64
>>> pd.to_numeric(pd.Series(["apple", 1, datetime(2021,8,22)]), errors="coerce")
0 NaN
1 1.0
2 NaN
dtype: float64
>>> pd.to_numeric(pd.Series(["apple", 1, datetime(2021,8,22)]), errors="ignore")
0 apple
1 1
2 2021-08-22 00:00:00
dtype: object
Is this desired/expected? Should we return 1629590400000000000 in all three cases?
Comment From: mroeschke
Sorry, yes, the documentation could use improvement to address how datetime-like objects (e.g. `np.datetime64`, `datetime.datetime`, etc.) are treated.
I would expect 1629590400000000000 in all 3 cases, though this case is a bit tricky since the entire `Series` has `dtype=object` but certain elements in that `Series` can be converted to numbers. It aligns with what you mentioned in your OP:
> Converting a pd.Series as a whole should be the same as converting it value by value.
Comment From: hec10r
IMO, the best approach would be to have a function `pd._scalar_to_numeric` and then call it inside `pd.to_numeric`, something like:
from functools import partial

def to_numeric(arg, errors="raise", downcast=None):
    if is_scalar(arg):
        return pd._scalar_to_numeric(arg, errors=errors, downcast=downcast)
    elif is_series_or_index(arg):
        return arg.apply(pd._scalar_to_numeric, errors=errors, downcast=downcast)
    elif is_list_tuple_or_np_array(arg):
        # Maybe keeping the dtype as well for `np.array`?
        return np.array(list(map(partial(pd._scalar_to_numeric, errors=errors, downcast=downcast), arg)))
What do you think?
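To make the dispatch idea above concrete, here is a standard-library-only sketch of what the scalar branch could do. `scalar_to_numeric` is a hypothetical name, `None` stands in for `pd.NaT`, and naive datetimes are assumed to be UTC:

```python
import math
from datetime import datetime, timezone

def scalar_to_numeric(value, errors="raise"):
    """Sketch of the scalar branch: datetimes become epoch nanoseconds,
    missing values become NaN, and unconvertible values follow `errors`."""
    if value is None:  # stand-in for a missing value such as pd.NaT
        return math.nan
    if isinstance(value, datetime):
        if value.tzinfo is None:  # assume naive datetimes are UTC
            value = value.replace(tzinfo=timezone.utc)
        return int(value.timestamp()) * 1_000_000_000
    try:
        return float(value)
    except (TypeError, ValueError):
        if errors == "raise":
            raise
        if errors == "ignore":
            return value
        return math.nan  # errors == "coerce"

print(scalar_to_numeric(datetime(2021, 8, 22)))     # 1629590400000000000
print(scalar_to_numeric(None, errors="coerce"))     # nan
print(scalar_to_numeric("apple", errors="coerce"))  # nan
```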
Comment From: mroeschke
Probably the best approach would be to modify this line here to handle datetime-like scalars (use one of the existing functions in `pandas.core.dtypes.common`):
https://github.com/pandas-dev/pandas/blob/5f648bf1706dd75a9ca0d29f26eadfbb595fe52b/pandas/core/tools/numeric.py#L155
Comment From: hec10r
By doing only that, the problem will persist for lists, tuples, and np.arrays.
Comment From: mroeschke
Best to address scalars vs array-likes in separate PRs. I imagine addressing array-likes may need to use one of the datetime inference functions for arrays
Comment From: hec10r
I don't see a way of fixing array-likes without using the same logic as for scalars. Inferring the type of the array isn't enough given all the possible cases. Having a function that handles scalars and then mapping this function element by element for iterables seems to be the easiest solution.
Comment From: mroeschke
> mapping this function element by element for iterables seems to be the easiest solution
This will kill performance for the existing cases, so that implementation is probably a non-starter.
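For contrast, the vectorized path that pandas relies on reinterprets the whole `datetime64[ns]` buffer in one operation rather than calling a Python function per element. A NumPy-level sketch, which also shows where the `-9223372036854775808` sentinel for `NaT` comes from:

```python
import numpy as np

# One bulk reinterpretation of the datetime64[ns] buffer as int64,
# instead of mapping a Python-level function over each element.
arr = np.array(["2021-08-22", "NaT"], dtype="datetime64[ns]")
as_int = arr.view("int64")
print(as_int)  # the date as epoch nanoseconds, NaT as int64 min
```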
Comment From: hec10r
Understood. Then let's split the problem:
1. PR to fix behavior for scalar date-like objects, including `pd.NaT` (will do this before the weekend).
2. PR to fix behavior for iterables: infer types for lists/tuples and use the current approach for `np.array` when the type is number-like. Approach for mixed types TBD (open to discussion and implementation).
Comment From: jack5github
This is still an ongoing issue as of the 13th of February, 2025. It could theoretically be fixed by using the nullable `Int64` dtype, which supports NA values.
from datetime import datetime
import pandas as pd
print(pd.to_numeric(pd.Series([datetime(2025, 2, 13), pd.NaT])))
"""
0 1739404800000000000
1 -9223372036854775808
dtype: int64
"""