• [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import datetime
import numpy as np
import pandas as pd
# This works fine:
pd.DataFrame([datetime.datetime(3015,1,1,0,0,0)])
# returned dtype is object, all good.

# But the same timestamp as a numpy object fails:
pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
# OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3015-01-01 00:00:00

Problem description

pd.DataFrame tries to infer dtypes, and this works well for datetime.datetime objects: if they are within the datetime64[ns] limits they are converted, and if they are outside, the dtype is kept as object. For numpy datetime64 objects as input, the same flexibility is not in place, and pd.DataFrame() raises hard during auto-conversion instead.
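
A minimal illustration of that fallback for datetime.datetime input (the year 3015 value is outside the datetime64[ns] bounds, the year 2015 value is not):

import datetime
import pandas as pd

# In-bounds values are converted to datetime64[ns]:
pd.DataFrame([datetime.datetime(2015, 1, 1)]).dtypes  # 0    datetime64[ns]
# Out-of-bounds values fall back to object dtype instead of raising:
pd.DataFrame([datetime.datetime(3015, 1, 1)]).dtypes  # 0    object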

Expected Output

dtype: object when pd.DataFrame is called on the numpy timestamp array above, matching the fallback behaviour for out-of-bounds datetime.datetime objects

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 7d32926db8f7541c356066dcadabf854487738de
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-62-generic
Version          : #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.2.2
numpy            : 1.19.5
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 44.0.0
Cython           : None
pytest           : 6.2.2
hypothesis       : None
sphinx           : 3.4.3
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.2
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : 3.0.6
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : 1.3.22
tables           : None
tabulate         : None
xarray           : 0.16.2
xlrd             : 1.2.0
xlwt             : None
numba            : None

Comment From: attack68

might be related to #39537

Comment From: jbrockmendel

pandas doesn't support datetime64[ms], only ns, so when we get datetime64[ms] we try to cast to datetime64[ns], which is why this raises.

The short-term solution is not to pass "datetime64[ms]"; pandas will then infer the dt64ns dtype where possible and keep object otherwise.
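
A minimal sketch of that workaround (converting the non-ns array to Python datetime.datetime objects up front, so pandas can fall back to object dtype; behaviour as of pandas 1.2.x):

import numpy as np
import pandas as pd

arr = np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]')
# .astype(object) yields datetime.datetime objects, which pandas keeps
# as object dtype when they are outside the datetime64[ns] bounds:
df = pd.DataFrame(arr.astype(object))
df.dtypes  # 0    object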

Longer-term we will hopefully get non-nano support.

Comment From: pganssle

@jbrockmendel What's strange is that a single np.datetime64[ms] (or [us]) will work here:

import numpy as np
import pandas as pd

dt = np.datetime64('1000-01-01T00:00:00', 'us')

df = pd.DataFrame([[dt]])
print(df)
                            # 0
# 0  1000-01-01T00:00:00.000000


df_multi = pd.DataFrame([[dt], [dt]])
print(df_multi)
                            # 0
# 0  1000-01-01T00:00:00.000000
# 1  1000-01-01T00:00:00.000000

df_array = pd.DataFrame([np.array(dt)])
# pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1000-01-01 00:00:00

Not sure how the code is organized, but I think you could probably fix this relatively easily by treating numpy arrays with non-ns dtypes as if they were a list of non-ns datetimes. This might come at a performance penalty, but I imagine you can mitigate that by one of the following:

  1. Do it the way it's being done now, but wrap it in a try/except: if the constructor fails with OutOfBoundsDatetime and the input is a numpy array with a non-nanosecond datetime64 dtype, convert it to a list (or treat it as a list) and try again.
  2. Check whether it's a datetime64 with non-nanosecond units, check whether any of the contents are outside the nanosecond bounds, and if so convert to a list.

I think the first one has the smallest performance impact on the currently-working code; a sketch follows below.
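
A minimal sketch of option 1, purely illustrative: construct_block below is a hypothetical stand-in for whatever internal constructor pandas actually uses, not a real API:

import numpy as np
from pandas.errors import OutOfBoundsDatetime

def construct_with_fallback(values):
    try:
        # Existing fast path (construct_block is a hypothetical stand-in).
        return construct_block(values)
    except OutOfBoundsDatetime:
        if isinstance(values, np.ndarray) and values.dtype.kind == 'M':
            # Treat the non-nanosecond datetime64 array as a list of
            # datetime.datetime objects and retry; the constructor can
            # then fall back to object dtype instead of raising.
            return construct_block(values.astype(object))
        raise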

Comment From: jbrockmendel

I think you could probably fix this relatively easily by considering numpy arrays with non-ns dtypes as if they were a list of non-ns datetimes

That's probably viable. I'm optimistic about getting non-nano support in before too long, but if someone wanted to make a PR to do this in the interim, I wouldn't object.

Comment From: MarcoGorelli

this one can be closed as of #49058, right?

In [1]: pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]'))
Out[1]: 
           0
0 3015-01-01
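
As a follow-up check (behaviour observed on pandas 2.x after non-nano support landed; the ms resolution is now preserved rather than coerced):

In [2]: pd.DataFrame(np.array(['3015-01-01T00:00:00.000'], dtype='datetime64[ms]')).dtypes
Out[2]: 
0    datetime64[ms]
dtype: object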

Closing then, but LMK if I've misunderstood and I'll reopen.

Comment From: berland

Yes, I have tested with the pandas development version and can confirm that I see no issue. Thanks!