- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample

```python
import datetime

import numpy as np
import pandas as pd

# This works fine; the returned dtype is object, all good:
pd.DataFrame([datetime.datetime(3015, 1, 1, 0, 0, 0)])

# But the same timestamp as a numpy datetime64 fails:
pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
# OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3015-01-01 00:00:00
```
Problem description
pd.DataFrame tries to infer dtypes, and this works for datetime.datetime objects: values within the datetime64[ns] limits are converted, while values outside them keep object dtype. For numpy datetime64 objects as input, the same flexibility is not in place, and pd.DataFrame() instead errors hard during the automatic conversion.
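For illustration, here is a minimal sketch of the fallback described above (the 2015 example is not from the report; the output comments reflect the pandas version this issue was filed against):

```python
import datetime

import pandas as pd

# In-bounds values are converted to datetime64[ns]:
print(pd.DataFrame([datetime.datetime(2015, 1, 1)]).dtypes)
# 0    datetime64[ns]

# Out-of-bounds values keep object dtype instead of raising:
print(pd.DataFrame([datetime.datetime(3015, 1, 1)]).dtypes)
# 0    object
```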
Expected Output
dtype: object when pd.DataFrame is called on the numpy timestamps above
Output of pd.show_versions()
Comment From: attack68
might be related to #39537
Comment From: jbrockmendel
pandas doesn't support `datetime64[ms]`, only ns, so when we get `datetime64[ms]` we try to cast to `datetime64[ns]`, which is why this is raising.

The short-term solution is not to pass `"datetime64[ms]"`; pandas will then infer the dt64ns dtype where possible and keep object otherwise.

Longer-term we will hopefully get non-nano support.
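As an illustration of that short-term workaround (a sketch, not from the comment itself; converting to a list of Python datetimes is one way to avoid passing the non-nanosecond dtype):

```python
import numpy as np
import pandas as pd

arr = np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]')

# Passing the non-ns array directly forces a cast to datetime64[ns] and raises.
# Handing over the values as Python datetime objects instead lets inference
# fall back to object dtype for out-of-bounds timestamps:
df = pd.DataFrame(arr.tolist())
print(df.dtypes)
# 0    object
```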
Comment From: pganssle
@jbrockmendel What's strange is that a single `np.datetime64[ms]` (or `[us]`) will work here:
```python
import numpy as np
import pandas as pd

dt = np.datetime64('1000-01-01T00:00:00', 'us')

df = pd.DataFrame([[dt]])
print(df)
#                             0
# 0  1000-01-01T00:00:00.000000

df_multi = pd.DataFrame([[dt], [dt]])
print(df_multi)
#                             0
# 0  1000-01-01T00:00:00.000000
# 1  1000-01-01T00:00:00.000000

df_array = pd.DataFrame([np.array(dt)])
# pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1000-01-01 00:00:00
```
Not sure how the code is organized, but I think you could probably fix this relatively easily by treating numpy arrays with non-ns dtypes as if they were lists of non-ns datetimes. This might come at a performance penalty, but I imagine you can mitigate that by either:

- Trying to do it the way it's being done now, but with a `try`/`except` where, if the constructor fails with `OutOfBoundsDatetime` and the input is a numpy array with non-nanosecond `datetime64`, you convert it to a list (or treat it as a list) and try again (see the sketch after this list).
- Checking whether it's a `datetime64` with non-nanosecond units; if so, checking whether any of the contents are outside of the nanosecond bounds, and if so converting to a list.

I think the first one has the smallest performance impact on the currently-working code.
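A minimal sketch of the first option's try/except shape (not pandas' actual internals; `coerce_datetimes` is a hypothetical stand-in for the constructor's conversion step, and the raising behavior reflects the pandas version the thread discusses):

```python
import numpy as np
import pandas as pd
from pandas.errors import OutOfBoundsDatetime

def coerce_datetimes(values: np.ndarray):
    # Try it the way it's being done now: cast everything to datetime64[ns].
    try:
        return pd.to_datetime(values)
    except OutOfBoundsDatetime:
        # The cast failed and the input is a non-nanosecond datetime64 array:
        # treat it as a list of datetime objects and try again, which keeps
        # object dtype for out-of-bounds values instead of raising.
        if values.dtype.kind == 'M' and values.dtype != np.dtype('datetime64[ns]'):
            return np.array(values.tolist(), dtype=object)
        raise

coerce_datetimes(np.array(['3015-01-01'], dtype='datetime64[ms]'))
```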
Comment From: jbrockmendel
> I think you could probably fix this relatively easily by considering numpy arrays with non-ns dtypes as if they were a list of non-ns datetimes

That's probably viable. I'm optimistic about getting non-nano support in before too long, but if someone wanted to make a PR to do this in the interim, I wouldn't object.
Comment From: MarcoGorelli
this one can be closed as of #49058, right?

```python
In [1]: pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
Out[1]:
           0
0 3015-01-01
```
closing then, but LMK if I've misunderstood and will reopen
Comment From: berland
Yes, I have tested with the pandas development version and can confirm that I no longer see the issue. Thanks!