import numpy
import pandas as pd
# We create a list of strings.
time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Then we create a time array, rounded to 10ms (actually floored,
# not rounded), everything seems to be fine here.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_time
# Then we create a Pandas DataFrame and assign the time array as a
# column to it. The datetime64 content is destroyed.
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df = df.assign(wrong_time=rounded_time)
df
# The output I get:
#    one  two              wrong_time
# a  1.0  1.0 1974-10-01 18:11:07.585
# b  2.0  2.0 1974-10-01 18:11:07.625
# c  3.0  3.0 1974-10-01 18:11:07.645
# d  NaN  4.0 1974-10-01 18:11:07.665
# (the timestamps are wrong)
Problem description
I have a list of timestamps with millisecond accuracy, encoded as strings. I round them down to 10 ms resolution, and that goes well. The bug appears when I add the rounded timestamps to a DataFrame as a new column: the datetime64 values get completely corrupted. My suspicion is that numpy.datetime64 is converted to some other dtype inside DataFrame.assign(); it should maintain the same type.
Output of pd.show_versions()
Comment From: gfyoung
@strazdinsg : Thanks for reporting this! For starters, I should point out that datetime64[10ms] is an unusual frequency from our point of view. We largely operate in datetime64[ns]. Thus, I suspect something went funny when we encountered that frequency.
In fact, if you pass in dtype="datetime64[ns]", everything looks OK (though not rounded as you had hoped). What I would suggest (as a workaround) is to recast your timestamps to datetime64[ns] before assigning the column.
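For illustration, a minimal sketch of that workaround (same data as in the report above; the right_time column name is just illustrative):

import numpy
import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Floor to 10 ms in numpy, then recast to the nanosecond resolution
# pandas expects before assigning the column.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_ns = rounded_time.astype("datetime64[ns]")
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d).assign(right_time=rounded_ns)
# df['right_time'] now holds the floored timestamps, stored as datetime64[ns].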
Given the unusual nature of the frequency, I'm hesitant to classify this as a bug, though I think we should be able to handle such an input better than silently "breaking" like this.
Comment From: jreback
there is an open issue about this already
Comment From: jreback
https://github.com/pandas-dev/pandas/issues/7996
much more idiomatic to simply do:
In [16]: # We create a list of strings.
...: time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
...: '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
...:
In [17]: pd.to_datetime(time_str_arr).floor('10ms')
Out[17]:
DatetimeIndex(['2017-06-30 13:51:15.850000', '2017-06-30 13:51:16.250000',
               '2017-06-30 13:51:16.450000', '2017-06-30 13:51:16.650000'],
              dtype='datetime64[ns]', freq=None)
numpy in general is not very friendly to any kind of non-standard datetimes. That said, pandas should actually convert these non-ns dtypes.
Comment From: strazdinsg
Thanks for an elegant solution @jreback !
Comment From: abhimanyu3-zz
Can we convert this to a general timestamp? I am not able to do that, and when I apply groupby on this timestamp it does not work. I am not even able to sort my columns, and I have millions of observations like 2018-08-26T14:05:31.000Z. The reason I am trying to sort is that I want to build a time series. I just want my time to be 14:05:31, i.e. a regular pandas datetime on which I can perform groupby or sort as needed.
I will really appreciate your help @jreback
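For reference, a minimal sketch of one way to handle such strings (the s series here is just illustrative data, not from the original question):

import pandas as pd

s = pd.Series(['2018-08-26T14:05:31.000Z', '2018-08-26T14:05:33.000Z'])
# Parse into timezone-aware nanosecond timestamps; sorting and groupby
# then work as usual.
ts = pd.to_datetime(s, utc=True)
# Extract just the time-of-day component (e.g. 14:05:31).
times = ts.dt.time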
Comment From: jeybrahms
This issue is not limited to exotic types, e.g. it applies to datetime64[D], and it has nothing to do with formatting. The issue appears to be that the resolution changes to ns when a numpy array gets stored in a DataFrame or Series. Here is an example:
make [D] array
In [194]: x = np.array(['2018-01-03'], dtype='datetime64[D]')
put it into pd.Series and extract the values
In [195]: y = pd.Series(x).values
note the dtype has changed to [ns]
In [196]: y.dtype
Out[196]: dtype('<M8[ns]')
It turns out the x and y arrays are not interchangeable. This creates subtle bugs in Cython code.
In [197]: x.view('int') == y.view('int')
Out[197]: array([False])
Comment From: TomAugspurger
@jeybrahms pandas only supports nanosecond-precision datetimes currently.
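For illustration, a minimal sketch showing that casting to datetime64[ns] up front keeps the numpy array and the Series round-trip consistent:

import numpy as np
import pandas as pd

x = np.array(['2018-01-03'], dtype='datetime64[D]').astype('datetime64[ns]')
y = pd.Series(x).values
# Both arrays are now datetime64[ns] and compare equal element-wise.
assert x.dtype == y.dtype
assert (x == y).all()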