import numpy
import pandas as pd
# We create a list of strings.
time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Then we create a time array, rounded to 10ms (actually floored,
# not rounded), everything seems to be fine here.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_time
# Then we create a Pandas DataFrame and assign the time array as a
# column to it. The datetime64 content is destroyed.
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df = df.assign(wrong_time=rounded_time)
df
# The output I get:
#    one  two              wrong_time
# a  1.0  1.0 1974-10-01 18:11:07.585
# b  2.0  2.0 1974-10-01 18:11:07.625
# c  3.0  3.0 1974-10-01 18:11:07.645
# d  NaN  4.0 1974-10-01 18:11:07.665
# (the timestamps are wrong)
Problem description
I have a list of timestamps with millisecond accuracy, encoded as strings. I round them down to 10 ms resolution, and that goes well. The bug appears when I add the rounded timestamps to a DataFrame as a new column: the datetime64 values get completely corrupted. My suspicion is that numpy.datetime64 is converted to some other dtype inside DataFrame.assign(); it should maintain the same type.
Output of pd.show_versions()
Comment From: gfyoung
@strazdinsg : Thanks for reporting this! For starters, I should point out that datetime64[10ms] is an unusual frequency from our point of view. We largely operate in datetime64[ns]. Thus, I suspect something went funny when we encountered that frequency.
In fact, if you pass in dtype="datetime64[ns]", everything looks OK (though not rounded as you had hoped). What I would suggest (as a workaround) is to recast your timestamps to datetime64[ns] before assigning the column.
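For illustration, a minimal sketch of that workaround (same data as in the report above; the right_time column name is just illustrative):

import numpy
import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Floor to 10 ms in numpy, then recast to the nanosecond resolution
# pandas expects before assigning the column.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_ns = rounded_time.astype("datetime64[ns]")
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d).assign(right_time=rounded_ns)
# df['right_time'] now holds the floored timestamps, stored as datetime64[ns].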
Given the unusual nature of the frequency, I'm hesitant to classify this as a bug, though I think we should be able to handle such an input better than silently "breaking" like this.
Comment From: jreback
there is an open issue about this already
Comment From: jreback
https://github.com/pandas-dev/pandas/issues/7996
much more idiomatic to simply do:
In [16]: # We create a list of strings.
...: time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
...: '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
...:
In [17]: pd.to_datetime(time_str_arr).floor('10ms')
Out[17]:
DatetimeIndex(['2017-06-30 13:51:15.850000', '2017-06-30 13:51:16.250000',
               '2017-06-30 13:51:16.450000', '2017-06-30 13:51:16.650000'],
              dtype='datetime64[ns]', freq=None)
numpy in general is not very friendly to any kind of non-standard datetimes. That said, pandas should actually convert these non-ns dtypes.
Comment From: strazdinsg
Thanks for an elegant solution @jreback !
Comment From: abhimanyu3-zz
Can we convert this to a general timestamp? I am not able to do that, and when I apply groupby on this timestamp it does not work. I am not even able to sort my columns, and I have millions of observations like 2018-08-26T14:05:31.000Z. The reason I am trying to sort is that I want to build a time series. I just want my time to be 14:05:31, i.e. a regular pandas datetime on which I can perform groupby or sort as needed.
I will really appreciate your help @jreback
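For reference, a minimal sketch of one way to handle such strings (the s series here is just illustrative data, not from the original question):

import pandas as pd

s = pd.Series(['2018-08-26T14:05:31.000Z', '2018-08-26T14:05:33.000Z'])
# Parse into timezone-aware nanosecond timestamps; sorting and groupby
# then work as usual.
ts = pd.to_datetime(s, utc=True)
# Extract just the time-of-day component (e.g. 14:05:31).
times = ts.dt.time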
Comment From: jeybrahms
This issue is not limited to exotic types, e.g. it applies to datetime64[D], and it has nothing to do with formatting. The issue appears to be that the resolution changes to ns when a numpy array gets stored in a DataFrame or Series. Here is an example:
make [D] array
In [194]: x = np.array(['2018-01-03'], dtype='datetime64[D]')
put it into pd.Series and extract the values
In [195]: y = pd.Series(x).values
note the dtype has changed to [ns]
In [196]: y.dtype
Out[196]: dtype('<M8[ns]')
It turns out the x and y arrays are not interchangeable. This creates subtle bugs in Cython code.
In [197]: x.view('int') == y.view('int')
Out[197]: array([False])
Comment From: TomAugspurger
@jeybrahms pandas only supports nanosecond-precision datetimes currently.
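For illustration, a minimal sketch showing that casting to datetime64[ns] up front keeps the numpy array and the Series round-trip consistent:

import numpy as np
import pandas as pd

x = np.array(['2018-01-03'], dtype='datetime64[D]').astype('datetime64[ns]')
y = pd.Series(x).values
# Both arrays are now datetime64[ns] and compare equal element-wise.
assert x.dtype == y.dtype
assert (x == y).all()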