Python `datetime.datetime` objects get converted inside pandas DataFrames. This is not always desired.

Code Sample

```python
from datetime import datetime
import numpy as np
import pandas as pd

rd = lambda : datetime(2017,1,np.random.randint(1,32))
datelist = [rd() for i in range(10)]
print(type(datelist[0])) # <class 'datetime.datetime'>

df = pd.DataFrame({"date" : datelist,
                   "y" : np.random.rand(10)})

print(type(df["date"][0])) # <class 'pandas._libs.tslib.Timestamp'>

dates = [d.to_pydatetime() for d in df["date"]]
print(type(dates[0])) # <class 'datetime.datetime'>
```


Creating a DataFrame from a list of `datetime` objects converts them to `pandas._libs.tslib.Timestamp`. This is undesired. In this simple case it is at least possible to get back a datetime object by list comprehension, `dates = [d.to_pydatetime() for d in df["date"]]`, which I consider an unnecessary step and which makes the use of a DataFrame somewhat obsolete.

Background:

The real problem with this behaviour is shown below. Any instance of a class derived from `datetime` is also converted, hence losing all its properties.

```python
from datetime import datetime
import numpy as np
import pandas as pd

class someobject(datetime):
    # datetime is immutable and is built in __new__, so no __init__ is needed
    prop = "property"

rd = lambda : someobject(2017,1,np.random.randint(1,32))
datelist = [rd() for i in range(10)]
print(type(datelist[0])) # <class '__main__.someobject'>
print(datelist[0].prop)  # property

df = pd.DataFrame({"date" : datelist,
                   "y" : np.random.rand(10)})

print(type(df["date"][0])) # <class 'pandas._libs.tslib.Timestamp'>
print(df["date"][0].prop)  # Error AttributeError: 'Timestamp' object has no attribute 'prop'
```

How can I prevent the DataFrame from converting an object that I put into it into something else?
As a workaround, how can I change `someobject` above so that it does not get converted automatically?

[Python 2.7 and pandas 0.20.1 are used here.]

Comment From: TomAugspurger

You can use object dtype for storing arbitrary python objects.


```python
In [51]: df = pd.DataFrame({'date': pd.Series(datelist, dtype='object'), 'y': np.random.rand(10)})

In [52]: df.date[0]
Out[52]: someobject(2017, 1, 2, 0, 0)
```

YMMV on how well this works and which operations in pandas will cast those to datetime64[ns].
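
For comparison, here is a minimal sketch of the difference between letting pandas infer the dtype and forcing `dtype=object`. The class name `TaggedDatetime` is a hypothetical stand-in for `someobject` above, and the exact inferred dtype may vary between pandas versions.

```python
from datetime import datetime

import pandas as pd


class TaggedDatetime(datetime):
    """Hypothetical stand-in for `someobject` from the original post."""
    prop = "property"


datelist = [TaggedDatetime(2017, 1, day) for day in range(1, 4)]

# Let pandas infer the dtype: the values are treated as datetimes.
inferred = pd.Series(datelist)

# Force object dtype: the original Python objects are stored as they are.
as_object = pd.Series(datelist, dtype=object)

print(inferred.dtype, type(inferred.iloc[0]).__name__)
print(as_object.dtype, type(as_object.iloc[0]).__name__, as_object.iloc[0].prop)
```

Only the object-dtype Series is guaranteed to hand back the subclass instance, so `prop` is only safe to access there.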

Comment From: jorisvandenbossche

> In this simple case it is at least possible to get back a datetime object by list comprehension, `dates = [d.to_pydatetime() for d in df["date"]]`, which I consider an unnecessary step and which makes the use of a DataFrame somewhat obsolete.

Also, `df['date'].dt.to_pydatetime()` will do this for you without a list comprehension.

Comment From: hrushikesh-dhumal

`df['date'].dt.to_pydatetime()` does not work when the result is assigned to a DataFrame column: the values revert to `pandas._libs.tslibs.timestamps.Timestamp`.

This also happens when `groupby` is used on a datetime column.
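
A minimal sketch of one way to keep plain `datetime` objects in an assigned column, assuming that wrapping them in an object-dtype `Series` (as suggested earlier in this thread) is acceptable. The column name `date_obj` is made up for illustration, and later pandas operations may still cast the values back.

```python
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": [datetime(2017, 1, day) for day in range(1, 6)],
    "y": np.random.rand(5),
})

# Convert each Timestamp back to a plain datetime (the list comprehension
# from the original post), then wrap the result in an object-dtype Series
# so that no datetime inference is applied when the column is assigned.
py_dates = [ts.to_pydatetime() for ts in df["date"]]
df["date_obj"] = pd.Series(py_dates, index=df.index, dtype=object)

print(df["date_obj"].dtype)
print(type(df["date_obj"].iloc[0]))
```

The explicit `index=df.index` keeps the new Series aligned with the existing rows when the DataFrame does not use the default integer index.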

Comment From: li-dennis

Are there any good solutions to this? Pandas/numpy timestamps and timedeltas seem to be extremely nonperformant (i.e. 2 orders of magnitude slower), so simple arithmetic involving dates is slowing down code substantially. [edit] corrected SO link: https://stackoverflow.com/a/29192601/805763 [/edit]

I'd prefer to use pydatetime/pytimedelta given the perf issues, but if that's not possible, it seems like the alternative is to use longs?

Comment From: TomAugspurger

> Are there any good solutions to this? Pandas/numpy timestamps and timedeltas seem to be extremely nonperformant

Can you post an example with timing?

Comment From: li-dennis

> > Are there any good solutions to this? Pandas/numpy timestamps and timedeltas seem to be extremely nonperformant
>
> Can you post an example with timing?

Sure. Apologies for the lack of an example. Here's an excerpt from some code I was profiling with line profiler on Python 3.7, where this came up:

When using `pd.Timedelta` and `pd.Timestamp`:

```
Time  Per Hit   % Time  Line Contents
==============================================================
         19.3     46.6  slot_sub_duration = end_slot_dt - start_dt
```

When using `datetime.datetime` and `datetime.timedelta`:

```
Time  Per Hit   % Time  Line Contents
==============================================================
          0.7      6.7  slot_sub_duration = end_slot_dt - start_dt
```

Sorry if the formatting broke, but the time went from 19.3 us to 0.7 us per hit.

The SO link I attached earlier was incorrect. Here's the corrected one, and my numbers seem to agree with theirs: https://stackoverflow.com/a/29192601/805763
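
For anyone who wants to reproduce the comparison without line profiler, a rough `timeit` sketch follows. The two timestamps are made-up values standing in for `start_dt` and `end_slot_dt`, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit
from datetime import datetime

import pandas as pd

# Made-up values standing in for the variables in the profiled line.
start_dt = datetime(2019, 4, 21, 21, 17)
end_slot_dt = datetime(2019, 4, 21, 22, 17)
start_ts = pd.Timestamp(start_dt)
end_slot_ts = pd.Timestamp(end_slot_dt)

n = 100_000

# Time the same subtraction with plain datetime objects and with Timestamps.
t_py = timeit.timeit(lambda: end_slot_dt - start_dt, number=n)
t_ts = timeit.timeit(lambda: end_slot_ts - start_ts, number=n)

print(f"datetime:  {t_py / n * 1e6:.2f} us per subtraction")
print(f"Timestamp: {t_ts / n * 1e6:.2f} us per subtraction")
```

This only measures scalar arithmetic, which is the case in the profile above. Vectorised operations on a whole datetime64 column are a different workload and are not covered by this sketch.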

Comment From: kkawabat

> You can use object dtype for storing arbitrary python objects.
>
> ```python
> In [51]: df = pd.DataFrame({'date': pd.Series(datelist, dtype='object'), 'y': np.random.rand(10)})
>
> In [52]: df.date[0]
> Out[52]: someobject(2017, 1, 2, 0, 0)
> ```
>
> YMMV on how well this works and which operations in pandas will cast those to datetime64[ns].

@TomAugspurger I tried to use your suggestion above; however, the elements of the datetime column still show up as timestamps. Here is a minimal reproducible example.

```python
import pandas as pd

a = [['2019-04-21 21:17:00+00:00'],
     ['2019-04-21 21:17:00+00:00'],
     ['2019-04-21 21:17:00+00:00'],
     ['2019-04-21 21:17:00+00:00']]
df = pd.DataFrame(a, columns=['start_time'])
df['start_time'] = pd.to_datetime(df['start_time'])
print(f"dtype of column: {df['start_time'].dtype}")
print(f"type of element: {type(df['start_time'][0])}")
```

```
dtype of column: datetime64[ns, UTC]
type of element: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
```

I am using pandas 1.1.5 on Python 3.7.