- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample

```python
import datetime

import numpy as np
import pandas as pd

# This works fine; the returned dtype is object, all good:
pd.DataFrame([datetime.datetime(3015, 1, 1, 0, 0, 0)])

# But the same timestamp as a numpy datetime64 fails:
pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
# OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3015-01-01 00:00:00
```
Problem description
pd.DataFrame tries to infer dtypes, and this works for datetime.datetime objects: if they are within the datetime64[ns] limits they are converted, and if they are outside, the dtype is kept as object. For numpy datetime64 input the same flexibility is not in place, and pd.DataFrame() instead raises OutOfBoundsDatetime during the automatic conversion.
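To make the described inference concrete, here is a small illustration of both cases for datetime.datetime input (the 2015 example and the dtype comments are added for illustration; they are not from the report):

```python
import datetime

import pandas as pd

# Within the datetime64[ns] range (which ends at pd.Timestamp.max,
# 2262-04-11): the column is converted to datetime64[ns].
print(pd.DataFrame([datetime.datetime(2015, 1, 1)]).dtypes)  # datetime64[ns]

# Outside that range: pandas keeps the column as object dtype.
print(pd.DataFrame([datetime.datetime(3015, 1, 1)]).dtypes)  # object
```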
Expected Output
dtype: object when pd.DataFrame is called on the numpy timestamps above
Output of pd.show_versions()
Comment From: attack68
might be related to #39537
Comment From: jbrockmendel
pandas doesn't support datetime64[ms], only ns, so when we get datetime64[ms] we try to cast to datetime64[ns], which is why this is raising.

The short-term solution is to not pass "datetime64[ms]"; pandas will then infer the dt64ns dtype where possible and keep object otherwise.

Longer-term we will hopefully get non-nano support.
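A minimal sketch of that interim workaround, assuming it is acceptable to round-trip through Python objects (.tolist() on a datetime64[ms] array yields plain datetime.datetime instances):

```python
import numpy as np
import pandas as pd

arr = np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]')

# Hand pandas plain datetime.datetime objects and let it infer the dtype.
df = pd.DataFrame(arr.tolist())
print(df.dtypes)  # object, since the year 3015 is outside the ns range
```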
Comment From: pganssle
@jbrockmendel What's strange is that a single np.datetime64[ms] (or [us]) will work here:
```python
import numpy as np
import pandas as pd

dt = np.datetime64('1000-01-01T00:00:00', 'us')

df = pd.DataFrame([[dt]])
print(df)
#                            0
# 0 1000-01-01T00:00:00.000000

df_multi = pd.DataFrame([[dt], [dt]])
print(df_multi)
#                            0
# 0 1000-01-01T00:00:00.000000
# 1 1000-01-01T00:00:00.000000

df_array = pd.DataFrame([np.array(dt)])
# pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1000-01-01 00:00:00
```
Not sure how the code is organized, but I think you could probably fix this relatively easily by considering numpy arrays with non-ns dtypes as if they were a list of non-ns datetimes. This might come at a performance penalty, but I imagine you can mitigate that by either:

- Trying to do it the way it's being done now, but with a try/catch: if the constructor fails with OutOfBoundsDatetime and the input is a numpy array with non-nanosecond datetime64, convert it to a list (or treat it as a list) and try again (sketched below).
- Checking whether it's a datetime64 with non-nanosecond units, checking if any of the contents are outside of the nanosecond bounds, and if so converting to a list.

I think the first one has the smallest performance impact on the currently-working code.
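A rough sketch of that first option as a standalone wrapper, rather than a patch to the actual constructor path (construct_with_fallback is a hypothetical name, and the dtype checks are assumptions about where the fix would hook in):

```python
import numpy as np
import pandas as pd
from pandas.errors import OutOfBoundsDatetime

def construct_with_fallback(data):
    # Try the existing fast path first, and only fall back for
    # non-nanosecond datetime64 arrays that overflow the ns range.
    try:
        return pd.DataFrame(data)
    except OutOfBoundsDatetime:
        if (isinstance(data, np.ndarray)
                and data.dtype.kind == 'M'
                and data.dtype != np.dtype('datetime64[ns]')):
            # Treat the array as a list: datetime64[ms]/[us] values become
            # datetime.datetime objects, and pandas keeps object dtype for
            # the out-of-bounds ones instead of raising.
            return pd.DataFrame(data.tolist())
        raise

df = construct_with_fallback(
    np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
print(df.dtypes)  # object
```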
Comment From: jbrockmendel
> I think you could probably fix this relatively easily by considering numpy arrays with non-ns dtypes as if they were a list of non-ns datetimes
That's probably viable. I'm optimistic about getting non-nano support in before too long, but if someone wanted to make a PR to do this in the interim, I wouldn't object.
Comment From: MarcoGorelli
This one can be closed as of #49058, right?

```
In [1]: pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
Out[1]:
           0
0 3015-01-01
```

Closing then, but LMK if I've misunderstood and will reopen
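For what it's worth, with the non-nano support that resolved this, my expectation (not shown in the thread) is that the millisecond resolution is now preserved rather than the column falling back to object:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
print(df.dtypes)  # expected: datetime64[ms] on pandas 2.0+
```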
Comment From: berland
Yes, I have tested with the pandas development version and can confirm that I no longer see the issue. Thanks!