A small, complete example of the issue
import pandas as pd
import numpy as np
df = pd.DataFrame({'tstamp': pd.date_range('2016-11-15', periods=4*365*60, freq='T'),
                   'data': np.random.rand(4*365*60)})
len(df)
## 87600
%%timeit
__ = df['tstamp'].dt.date
## 10 loops, best of 3: 128 ms per loop
%%timeit
__ = df['tstamp'].dt.time
## 10 loops, best of 3: 132 ms per loop
%%timeit
__ = df['tstamp'].dt.dayofyear
## 100 loops, best of 3: 3.04 ms per loop
%%timeit
__ = df['tstamp'].dt.day
## 100 loops, best of 3: 2.83 ms per loop
As clearly demonstrated, accessing date and time takes a very long time to compute. I do not know what is causing the bottleneck, but any speed-up would definitely be appreciated.
Also, accessing date and time requires more than double the memory that the DataFrame itself requires. I don't have a memory profiler working, but I can attest that my computer, with 30 GB of available RAM (after OS use), can load a massive CSV that consumes 10.2 GB in memory as a DataFrame. However, trying to access date from that DataFrame raises MemoryError. It basically fills up the remaining 19.8 GB of RAM trying to compute the date from the timestamp column. The DataFrame in question has 113,587,339 rows, with 5 columns of numeric data, one column of strings, and a column with a datetime stamp similar to the example above.
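For a rough sense of where the memory goes without a profiler, here is a back-of-the-envelope check of per-element sizes (a sketch; the exact numbers are CPython- and platform-dependent):
import sys
import datetime
import numpy as np
# A datetime64[ns] value is stored as a packed 8-byte integer inside the array,
# whereas .dt.date materialises one Python object per row, plus an 8-byte pointer
# per row in the resulting object-dtype array.
np.dtype('datetime64[ns]').itemsize
## 8
sys.getsizeof(datetime.date(2016, 11, 15))
## 32 (typically, on CPython 3)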
Output of pd.show_versions()
Comment From: jorisvandenbossche
The date and time attributes return Python datetime.date and datetime.time objects, while the other attributes return integers. This is certainly a reason why it is both slower and takes more memory.
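A quick way to see this difference is to look at the dtypes of the results (a small sketch, reusing the df from the original post):
# date/time give object-dtype Series holding Python objects,
# while dayofyear stays a NumPy integer dtype.
df['tstamp'].dt.date.dtype
## dtype('O')
df['tstamp'].dt.dayofyear.dtype
## dtype('int64')   (an integer dtype; the exact width can vary by pandas version)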
Example of memory usage (in MBs)
In [144]: df.memory_usage(deep=True, index=False) / (1024**2)
Out[144]:
data 0.668335
tstamp 0.668335
dtype: float64
In [145]: df['tstamp'].dt.date.memory_usage(deep=True, index=False) / (1024**2)
Out[145]: 3.3416748046875
In [146]: df['tstamp'].dt.time.memory_usage(deep=True, index=False) / (1024**2)
Out[146]: 4.010009765625
In [147]: df['tstamp'].dt.dayofyear.memory_usage(deep=True, index=False) / (1024**2)
Out[147]: 0.6683349609375
So you can see that the datetime.date/time objects take a lot more memory than the NumPy integers or floats (up to almost 6 times as much memory). This easily explains the MemoryError you see, since you only have 5 columns.
Speed-wise, the creation of the datetime.date/time objects is probably a reason for the slowdown as well, although I am not sure all of the slowdown can be attributed to this. You are welcome to look into it.
Anyway, if speed and memory are a concern, you had better avoid date and time anyway; you also cannot perform vectorized operations on those columns anymore. Depending on your application, all of the information about the date and time is also available through the other attributes (day, month, hour, etc.).
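For illustration, a small sketch of what stays vectorized and what falls back to per-element Python calls (rebuilding a frame shaped like the one at the top of the thread):
import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame({'tstamp': pd.date_range('2016-11-15', periods=87600, freq='T'),
                   'data': np.random.rand(87600)})
# Integer accessors keep the data in NumPy land, so comparisons stay vectorized:
mask_fast = df['tstamp'].dt.dayofyear == 320
# An object column of datetime.date is compared element by element in Python:
mask_slow = df['tstamp'].dt.date == datetime.date(2016, 11, 15)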
Comment From: jreback
Further to @jorisvandenbossche's comments: one of the strengths of pandas is that data can be represented by a specific implementation with fast-path access for most operations, while retaining compatibility with Python objects when needed. This is true for the datetime, timedelta, period, and category dtypes (other dtypes map the interface directly to the implementation).
What this means is that a user must be aware of the performance/memory tradeoffs when asking for python objects.
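To make that concrete, a small sketch showing the specific implementation behind datetime64[ns] (an int64 count of nanoseconds since the epoch, with the formatting layered on top), using the df from the example at the top:
df['tstamp'].values.dtype
## dtype('<M8[ns]')
# Viewing the same buffer as int64 exposes the underlying 8-byte integers:
df['tstamp'].values.view('int64')[:3]
## array([1479168000000000000, 1479168060000000000, 1479168120000000000])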
Comment From: dragonator4
@jorisvandenbossche Then how is it that storing the date and time together as a datetime consumes just as much memory as storing only the day of year? If a datetime can be represented as ints and floats and still be displayed as YYYY-MM-DD HH:MM:SS, then similarly dates, if not also times, could be stored as ints and floats and displayed correctly.
@jreback How much effort would it be to build pandas equivalents of the Python date and time types?
Comment From: jreback
@dragonator4 you are missing the point. This already exists as Timestamp (which is actually a subclass of datetime). It is only instantiated when needed. If you are actually using date and time directly as Python objects, then you are doing it wrong.
Timestamp is a performant replacement that is as feature-full as datetime.datetime and as performant as np.datetime64.
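A quick illustration of that point (a minimal sketch):
import datetime
import pandas as pd
ts = pd.Timestamp('2016-11-15 08:30')
isinstance(ts, datetime.datetime)
## True   (Timestamp subclasses datetime.datetime, so the stdlib API works)
ts.date(), ts.time()
## (datetime.date(2016, 11, 15), datetime.time(8, 30))
ts.value
## 1479198600000000000   (backed by a single int64: nanoseconds since the epoch)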
Comment From: dragonator4
@jreback I have timestamped sensor data, and I need to group the data by day and sensor ID and fit models on each group. Since the data spans more than a year, I need something robust to form the groups on. Date is perfect, but it comes with the performance and memory costs we discussed. As a workaround, I can form groups on year and day of year, but doing so requires an extra step of recovering the date from the year and day of year. Can you recommend a better way?
Ideally, I would like to do this:
df['date'] = df['tstamp'].dt.date
res = df.groupby(['sens_id', 'date']).apply(modeler)
Now, I would do this (for lack of something better):
df['year'], df['doy'] = df['tstamp'].dt.year, df['tstamp'].dt.dayofyear
res = df.groupby(['sens_id', 'year', 'doy']).apply(modeler)
Since the gap in performance is so huge, the latter is actually faster than the former, despite the extra computations.
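For reference, the extra recovery step mentioned above can be done with a %Y%j parse (a sketch, assuming the year/doy columns created in the snippet above):
# Rebuild the calendar date from year + day-of-year after the groupby:
recovered = pd.to_datetime((df['year'] * 1000 + df['doy']).astype(str), format='%Y%j')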
Comment From: TomAugspurger
@dragonator4 looks like you could round your tstamp:
df['date'] = df['tstamp'].dt.round('d')
and then group by that and the sensor ID.
It'd be quite a bit of work to support date and time, and in many cases (like this) a suitably rounded datetime does the job.
EDIT: mmm, you need to round down though. Pretty sure there's a way to do this. Yeah:
df['date'] = df['tstamp'].dt.floor('d')
Comment From: dragonator4
@TomAugspurger Thanks for that trick. I'll perhaps use floor; round may round to the next day if the time is after noon..?
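A quick check of that behaviour (a minimal sketch):
import pandas as pd
s = pd.Series(pd.to_datetime(['2016-11-15 13:30', '2016-11-15 09:00']))
s.dt.round('d')
## 0   2016-11-16
## 1   2016-11-15
## dtype: datetime64[ns]   (afternoon values round up to the next day)
s.dt.floor('d')
## 0   2016-11-15
## 1   2016-11-15
## dtype: datetime64[ns]   (floor always rounds down)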
Comment From: jorisvandenbossche
There is also normalize to get only the date part (not sure what would be better performance-wise, normalize or floor; you can test):
In [14]: s = pd.Series(pd.date_range('2012-01-01', freq='8h', periods=10))
In [15]: s
Out[15]:
0 2012-01-01 00:00:00
1 2012-01-01 08:00:00
2 2012-01-01 16:00:00
3 2012-01-02 00:00:00
4 2012-01-02 08:00:00
5 2012-01-02 16:00:00
6 2012-01-03 00:00:00
7 2012-01-03 08:00:00
8 2012-01-03 16:00:00
9 2012-01-04 00:00:00
dtype: datetime64[ns]
In [17]: s.dt.normalize()
Out[17]:
0 2012-01-01
1 2012-01-01
2 2012-01-01
3 2012-01-02
4 2012-01-02
5 2012-01-02
6 2012-01-03
7 2012-01-03
8 2012-01-03
9 2012-01-04
dtype: datetime64[ns]
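If you want to check the performance question yourself, a quick IPython sketch (timings will depend on your machine and pandas version, so none are shown here):
s = pd.Series(pd.date_range('2012-01-01', freq='T', periods=1000000))
%timeit s.dt.normalize()
%timeit s.dt.floor('d')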
Comment From: jreback
I seriously doubt the grouping part is actually slowing you down. For sure it's the .apply which IS executed in Python space.
In any event you can simply do
df.groupby(['sens_id', pd.Grouper(key='tstamp', freq='D')]).apply(modeler)
for proper semantics, or, as @TomAugspurger and @jorisvandenbossche suggest, this is de facto equivalent to rounding.
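For completeness, a minimal runnable sketch of that suggestion, with a placeholder modeler (the sens_id/tstamp column names follow the thread's example):
import numpy as np
import pandas as pd
df = pd.DataFrame({'sens_id': np.repeat([1, 2], 48),
                   'tstamp': list(pd.date_range('2016-11-15', periods=48, freq='H')) * 2,
                   'data': np.random.rand(96)})
def modeler(g):
    # stand-in for the real per-group model fit
    return g['data'].mean()
# Group by sensor and calendar day directly on the raw timestamp column:
res = df.groupby(['sens_id', pd.Grouper(key='tstamp', freq='D')]).apply(modeler)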