Code Sample, a copy-pastable example if possible
>>> s = pd.Series(['9/22/1677', '4/11/2262'], dtype="U")
>>> s
0 9/22/1677
1 4/11/2262
dtype: object
>>> s.astype('datetime64')
0 1677-09-22
1 2262-04-11
dtype: datetime64[ns]
>>> s.astype('datetime64[s]')
0 1677-09-22
1 2262-04-11
dtype: datetime64[ns]
>>> s.astype('datetime64[m]')
0 1677-09-22
1 2262-04-11
dtype: datetime64[ns]
>>> s.astype('datetime64[h]')
0 1677-09-22
1 2262-04-11
dtype: datetime64[ns]
>>> s.astype('datetime64[D]')
0 2262-04-11 # <<<< wait, what?
1 2262-04-11
dtype: datetime64[ns]
Problem description
It appears that astype('datetime64[D]') breaks on the earliest representable full day: it wraps around to the latest representable date.
While I know this is super edge case-y and highly unlikely to be worth a lot of time, the fact that it wraps around is indicative of a potential for integer overflow somewhere that may or may not bite us in the future.
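For context, the two dates in the sample are the first and last full days that fit in a signed 64-bit count of nanoseconds since 1970-01-01. The sketch below derives them with the standard library only. It assumes the smallest int64 value is reserved (numpy uses it for NaT), and the helper name ns_to_datetime is made up for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Bounds of a signed 64-bit nanosecond counter; the very smallest int64
# value is left out because numpy reserves it for NaT.
LOWEST_NS = -2**63 + 1
HIGHEST_NS = 2**63 - 1


def ns_to_datetime(ns):
    """Hypothetical helper: nanoseconds since the epoch, floored to microseconds."""
    return EPOCH + timedelta(microseconds=ns // 1000)


earliest = ns_to_datetime(LOWEST_NS)
latest = ns_to_datetime(HIGHEST_NS)

# The earliest instant falls partway through 1677-09-21, so the first day
# that is representable in full is the following one.
first_full_day = (earliest + timedelta(days=1)).date()

print(earliest)        # 1677-09-21 00:12:43.145224
print(latest)          # 2262-04-11 23:47:16.854775
print(first_full_day)  # 1677-09-22
```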
Expected Output
.astype('datetime64[D]') should give 1677-09-22 instead of 2262-04-11.
Output of pd.show_versions()
Comment From: MarPur
Any objections if I take a stab at this?
Comment From: MarPur
This behaviour comes from numpy. Since pandas first casts to datetime64[ns] before casting to datetime64[D], the calls to numpy's API look something like this:
>>> np.datetime64('1677-09-22').astype('datetime64[ns]').astype('datetime64[D]')
numpy.datetime64('2262-04-11')
One way to fix this would be to raise the minimum datetime value by one day, from 1677-09-21 00:12:43.145225 to 1677-09-22 00:12:43.145225:
>>> np.datetime64('1677-09-22 00:12:43.145224190').astype('datetime64[D]')
numpy.datetime64('2262-04-11')
>>> np.datetime64('1677-09-22 00:12:43.145224191').astype('datetime64[D]')
numpy.datetime64('1677-09-22')
I have been working on a PR to do this; I just need to fix the tests.
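The exact cutoff seen above (…145224190 wraps, …145224191 does not) is consistent with an int64 overflow inside a floor division of negative values. The sketch below is an assumption, not numpy's confirmed code: it emulates in pure Python the common C idiom of computing floor(ns / d) for negative ns as (ns - (d - 1)) / d with truncating division, letting the intermediate subtraction wrap at 64 bits. The function names are hypothetical.

```python
from datetime import date, timedelta

NS_PER_DAY = 86_400 * 10**9


def wrap_int64(x):
    """Reduce a Python int to the value a signed 64-bit integer would hold."""
    return (x + 2**63) % 2**64 - 2**63


def naive_ns_to_days(ns):
    """Assumed emulation of a C-style floor division that can overflow."""
    if ns >= 0:
        return ns // NS_PER_DAY
    # For negatives: shift down by (d - 1), then divide truncating toward zero.
    # The shift is where a 64-bit intermediate can wrap around.
    shifted = wrap_int64(ns - (NS_PER_DAY - 1))
    quotient = abs(shifted) // NS_PER_DAY
    return quotient if shifted >= 0 else -quotient


def to_date(days):
    return date(1970, 1, 1) + timedelta(days=days)


# 1677-09-22 00:00:00 expressed as nanoseconds since the epoch
midnight_ns = (date(1677, 9, 22) - date(1970, 1, 1)).days * NS_PER_DAY
# 00:12:43.145224191 after that midnight, the first value seen to convert correctly
threshold_ns = midnight_ns + 763_145_224_191

print(to_date(naive_ns_to_days(midnight_ns)))       # 2262-04-11
print(to_date(naive_ns_to_days(threshold_ns - 1)))  # 2262-04-11
print(to_date(naive_ns_to_days(threshold_ns)))      # 1677-09-22
```

Under this assumption the shifted value for threshold_ns lands exactly on the smallest int64, which would explain why one nanosecond earlier wraps around to a large positive number and comes out as 2262-04-11.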
Comment From: jbrockmendel
We now raise when trying to cast to non-supported resos ("D") and use astype_overflowsafe to prevent silent overflows when converting between resos. Closing.
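As a rough illustration of what an overflow-safe conversion between resolutions means, here is a minimal sketch. It is not the pandas astype_overflowsafe implementation, whose signature and internals are not shown in this thread; convert_checked and seconds_since_epoch are hypothetical names. The idea is to do the arithmetic in arbitrary-precision integers and raise instead of wrapping when the result does not fit in int64.

```python
from datetime import date

INT64_MAX = 2**63 - 1
INT64_MIN = -2**63  # numpy reserves this value for NaT, so it is excluded below
NS_PER_UNIT = {"ns": 1, "us": 10**3, "ms": 10**6, "s": 10**9}


def convert_checked(value, from_unit, to_unit):
    """Hypothetical helper: convert an integer timestamp between units, raising on overflow."""
    exact_ns = value * NS_PER_UNIT[from_unit]  # Python ints never wrap
    result = exact_ns // NS_PER_UNIT[to_unit]  # floor division, also for negatives
    if not INT64_MIN < result <= INT64_MAX:
        raise OverflowError(
            f"{value} {from_unit} does not fit in int64 when expressed in {to_unit}"
        )
    return result


def seconds_since_epoch(day):
    """Midnight of the given date as whole seconds since 1970-01-01."""
    return (day - date(1970, 1, 1)).days * 86_400


# Midnight on 1677-09-22 fits in nanoseconds.
in_range = convert_checked(seconds_since_epoch(date(1677, 9, 22)), "s", "ns")

# Midnight on 1677-09-21 is just below the nanosecond range and is rejected.
try:
    convert_checked(seconds_since_epoch(date(1677, 9, 21)), "s", "ns")
    overflowed = False
except OverflowError:
    overflowed = True
```

Raising here trades the silent wrap-around from the original report for an explicit error the caller can handle.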