Python `datetime.datetime` objects get converted inside pandas DataFrames. This is not always desired.
Code Sample
```python
from datetime import datetime

import numpy as np
import pandas as pd

rd = lambda: datetime(2017, 1, np.random.randint(1, 32))
datelist = [rd() for i in range(10)]
print(type(datelist[0]))  # datetime.datetime

df = pd.DataFrame({"date": datelist, "y": np.random.rand(10)})
print(type(df["date"][0]))  # pandas._libs.tslib.Timestamp

dates = [d.to_pydatetime() for d in df["date"]]
print(type(dates[0]))  # datetime.datetime
```
Creating a DataFrame from a list of `datetime` objects converts them to `pandas._libs.tslib.Timestamp`, which is undesired. In this simple case it is at least possible to get a `datetime` object back with a list comprehension, `dates = [d.to_pydatetime() for d in df["date"]]`, but I consider that an unnecessary step, and it makes the use of a DataFrame somewhat pointless.
Background:
The real problem with this behaviour is shown in the following example. Any instance of a class derived from `datetime` is converted as well, and hence loses all its properties.
```python
from datetime import datetime

import numpy as np
import pandas as pd

class someobject(datetime):
    prop = "property"

    def __init__(self, *args, **kwargs):
        datetime.__init__(*args, **kwargs)

rd = lambda: someobject(2017, 1, np.random.randint(1, 32))
datelist = [rd() for i in range(10)]
print(type(datelist[0]))  # <class '__main__.someobject'>
print(datelist[0].prop)   # property

df = pd.DataFrame({"date": datelist,
                   "y": np.random.rand(10)})
print(type(df["date"][0]))  # <class 'pandas._libs.tslib.Timestamp'>
print(df["date"][0].prop)   # AttributeError: 'Timestamp' object has no attribute 'prop'
```
How can I prevent the DataFrame from converting an object that I put into it into something else?
As a workaround, how can `someobject` in the above be changed so that it does not get converted automatically?
[Here, python 2.7, pandas 0.20.1 are used]
Comment From: TomAugspurger
You can use `object` dtype for storing arbitrary python objects.
```python
In [51]: df = pd.DataFrame({'date': pd.Series(datelist, dtype='object'), 'y': np.random.rand(10)})

In [52]: df.date[0]
Out[52]: someobject(2017, 1, 2, 0, 0)
```
YMMV on how well this works and which operations in pandas will cast those to datetime64[ns].
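
For anyone who wants to run the suggestion end to end, here is a minimal sketch. The class name `SomeObject` and the variables `inferred` and `preserved` are made up for illustration; the only pandas features used are the `dtype="object"` argument of `pd.Series` and the `DataFrame` constructor.

```python
from datetime import datetime

import numpy as np
import pandas as pd


class SomeObject(datetime):
    """Hypothetical datetime subclass carrying an extra attribute."""
    prop = "property"


datelist = [SomeObject(2017, 1, day) for day in range(1, 11)]

# Default construction: pandas infers a datetime64 column,
# so elements come back as pandas Timestamp objects.
inferred = pd.DataFrame({"date": datelist, "y": np.random.rand(10)})

# Explicit object dtype: the original Python objects are stored as they are.
preserved = pd.DataFrame(
    {"date": pd.Series(datelist, dtype="object"), "y": np.random.rand(10)}
)

print(type(inferred["date"].iloc[0]))
print(type(preserved["date"].iloc[0]), preserved["date"].iloc[0].prop)
```

The trade-off is the one mentioned above: an `object` column does not get the vectorized datetime operations, and some pandas operations may cast it back.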
Comment From: jorisvandenbossche
> In this simple case it is at least possible to get back a datetime object by list comprehension, `dates = [d.to_pydatetime() for d in df["date"]]`, which I would consider as an unnecessary step and makes the use of a DataFrame somehow obsolete.

Also, `df['date'].dt.to_pydatetime()` will do this for you without a list comprehension.
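
A small sketch of that accessor, assuming a datetime64 column named `date`. As far as I know, some pandas releases emit a deprecation warning here and may return a different container (an array or an object-dtype Series), so the result is wrapped in `list()` to stay independent of that detail.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"date": [datetime(2017, 1, day) for day in range(1, 6)]})

# Elements of the column are pandas Timestamp objects.
print(type(df["date"].iloc[0]))

# Convert the whole column to plain datetime.datetime objects in one call.
# list() works whether the accessor returns an array or a Series.
dates = list(df["date"].dt.to_pydatetime())
print(type(dates[0]))
```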
Comment From: hrushikesh-dhumal
`df['date'].dt.to_pydatetime()` does not work when the result is assigned to a DataFrame column; the values revert to `pandas._libs.tslibs.timestamps.Timestamp`.
This also happens when `groupby` is used on a datetime column.
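
The assignment behaviour can be reproduced with the sketch below. It is only an illustration: a plain list of `datetime` objects stands in for the converted values, and the column names `reverted` and `kept` are made up. Wrapping the values in an `object`-dtype Series before assigning is one way to stop pandas from inferring a datetime64 column again.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"date": [datetime(2017, 1, day) for day in range(1, 6)]})

# Plain Python datetime objects, converted element by element.
py_dates = [ts.to_pydatetime() for ts in df["date"]]

# Direct assignment: pandas infers a datetime64 column again.
df["reverted"] = py_dates

# Assigning an object-dtype Series keeps the Python objects.
df["kept"] = pd.Series(py_dates, index=df.index, dtype="object")

print(df.dtypes)
print(type(df["reverted"].iloc[0]), type(df["kept"].iloc[0]))
```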
Comment From: li-dennis
Are there any good solutions to this? Pandas/numpy timestamps and timedeltas seem to be extremely nonperformant (i.e. 2 orders of magnitude slower), so simple arithmetic involving dates is slowing down code substantially. [edit] corrected SO link: https://stackoverflow.com/a/29192601/805763 [/edit]
I'd prefer to use pydatetime/pytimedelta given the perf issues, but if that's not possible, it seems like the alternative is to use longs?
Comment From: TomAugspurger
> Are there any good solutions to this? Pandas/numpy timestamps and timedeltas seem to be extremely nonperformant

Can you post an example with timing?
Comment From: li-dennis
> > Are there any good solutions to this? Pandas/numpy timestamps and timedeltas seem to be extremely nonperformant
>
> Can you post an example with timing?

Sure. Apologies for the lack of an example. Here's an excerpt from some code I was profiling with line profiler on python 3.7, where this came up:
When using `pd.Timedelta` and `pd.Timestamp`:

```
Time Per Hit   % Time  Line Contents
==============================================================
        19.3     46.6  slot_sub_duration = end_slot_dt - start_dt
```

When using `datetime.datetime` and `datetime.timedelta`:

```
Time Per Hit   % Time  Line Contents
==============================================================
         0.7      6.7  slot_sub_duration = end_slot_dt - start_dt
```
Sorry if the formatting broke, but the time went from 19.3 us to 0.7 us per hit.
The SO post I attached earlier was the wrong link. Here's the corrected one, and my numbers seem to agree with theirs: https://stackoverflow.com/a/29192601/805763
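
A standalone way to time the same kind of scalar subtraction is sketched below with the standard-library `timeit` module. The variable names and the iteration count `n` are arbitrary choices, not taken from the profiled code, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit
from datetime import datetime

import pandas as pd

py_start, py_end = datetime(2017, 1, 1), datetime(2017, 1, 2)
pd_start, pd_end = pd.Timestamp(py_start), pd.Timestamp(py_end)

n = 100_000  # number of subtractions per measurement (arbitrary)

# Total wall-clock time for n scalar subtractions of each kind.
py_total = timeit.timeit(lambda: py_end - py_start, number=n)
pd_total = timeit.timeit(lambda: pd_end - pd_start, number=n)

print(f"datetime.datetime: {py_total / n * 1e6:.3f} us per subtraction")
print(f"pd.Timestamp:      {pd_total / n * 1e6:.3f} us per subtraction")
```

This only measures scalar arithmetic in a Python loop; subtracting whole datetime64 columns at once is a different workload.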
Comment From: kkawabat
> You can use `object` dtype for storing arbitrary python objects.
>
> ```python
> In [51]: df = pd.DataFrame({'date': pd.Series(datelist, dtype='object'), 'y': np.random.rand(10)})
>
> In [52]: df.date[0]
> Out[52]: someobject(2017, 1, 2, 0, 0)
> ```
>
> YMMV on how well this works and which operations in pandas will cast those to datetime64[ns].

@TomAugspurger I tried to use your suggestion above; however, the elements of the datetime column still show up as Timestamps. Here is a minimal reproducible example.
```python
import pandas as pd

a = [['2019-04-21 21:17:00+00:00'], ['2019-04-21 21:17:00+00:00'], ['2019-04-21 21:17:00+00:00'], ['2019-04-21 21:17:00+00:00']]
df = pd.DataFrame(a, columns=['start_time'])
df['start_time'] = pd.to_datetime(df['start_time'])
print(f"dtype of column: {df['start_time'].dtype}")
print(f"type of element: {type(df['start_time'][0])}")
```

```
dtype of column: datetime64[ns, UTC]
type of element:
```
I am using pandas 1.1.5 on python 3.7
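
One possible reading of the example above: `pd.to_datetime` produces a datetime64 column, so no `object` dtype is involved and the elements are Timestamps. The sketch below shows two things under that assumption. Casting the column with `astype(object)` changes the column dtype but still leaves Timestamp elements, whereas converting each element and storing the result with an explicit `object` dtype yields plain `datetime` objects. The names `as_obj` and `start_time_py` are made up for illustration.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"start_time": ["2019-04-21 21:17:00+00:00"] * 4})
df["start_time"] = pd.to_datetime(df["start_time"])  # tz-aware datetime64 column

# Casting to object changes the column dtype, but the elements stay Timestamps.
as_obj = df["start_time"].astype(object)
print(as_obj.dtype, type(as_obj.iloc[0]))

# Converting element-wise and storing with an explicit object dtype
# gives plain datetime.datetime objects that keep their UTC offset.
py_times = [ts.to_pydatetime() for ts in df["start_time"]]
df["start_time_py"] = pd.Series(py_times, index=df.index, dtype="object")
print(df["start_time_py"].dtype, type(df["start_time_py"].iloc[0]))
```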