A small, complete example of the issue
import pandas as pd
import numpy as np
df = pd.DataFrame({'tstamp': pd.date_range('2016-11-15', periods=4*365*60, freq='T'),
                   'data': np.random.rand(4*365*60)})
len(df)
## 87600
%%timeit
__ = df['tstamp'].dt.date
## 10 loops, best of 3: 128 ms per loop
%%timeit
__ = df['tstamp'].dt.time
## 10 loops, best of 3: 132 ms per loop
%%timeit
__ = df['tstamp'].dt.dayofyear
## 100 loops, best of 3: 3.04 ms per loop
%%timeit
__ = df['tstamp'].dt.day
## 100 loops, best of 3: 2.83 ms per loop
As clearly demonstrated, accessing date and time takes a very long time to compute. I do not know what is causing the bottleneck, but any speed-up would definitely be appreciated.
Also, accessing date and time requires more than double the memory that the DataFrame itself requires. I don't have a memory profiler working, but I can attest that my computer, with 30 GB of available RAM (after OS use), can load a massive CSV that consumes 10.2 GB in memory as a DataFrame. However, trying to access date from that DataFrame raises MemoryError. It basically fills up the remaining 19.8 GB of RAM trying to compute the date from the timestamp column. The DataFrame in question has 113,587,339 rows, with 5 columns of numeric data, one column of strings, and a column with a datetime stamp similar to the example above.
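For a rough sense of where the memory goes without a profiler, here is a back-of-the-envelope check of per-element sizes (a sketch; the exact numbers are CPython- and platform-dependent):
import sys
import datetime
import numpy as np
# A datetime64[ns] value is stored as a packed 8-byte integer inside the array,
# whereas .dt.date materialises one Python object per row, plus an 8-byte pointer
# per row in the resulting object-dtype array.
np.dtype('datetime64[ns]').itemsize
## 8
sys.getsizeof(datetime.date(2016, 11, 15))
## 32 (typically, on CPython 3)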
Output of pd.show_versions()
Comment From: jorisvandenbossche
The date and time attributes return Python datetime.date and datetime.time objects, while the other attributes return integers. This is certainly a reason why it is both slower and takes more memory.
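A quick way to see this difference is to look at the dtypes of the results (a small sketch, reusing the df from the original post):
# date/time give object-dtype Series holding Python objects,
# while dayofyear stays a NumPy integer dtype.
df['tstamp'].dt.date.dtype
## dtype('O')
df['tstamp'].dt.dayofyear.dtype
## dtype('int64')   (an integer dtype; the exact width can vary by pandas version)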
Example of memory usage (in MBs)
In [144]: df.memory_usage(deep=True, index=False) / (1024**2)
Out[144]:
data 0.668335
tstamp 0.668335
dtype: float64
In [145]: df['tstamp'].dt.date.memory_usage(deep=True, index=False) / (1024**2)
Out[145]: 3.3416748046875
In [146]: df['tstamp'].dt.time.memory_usage(deep=True, index=False) / (1024**2)
Out[146]: 4.010009765625
In [147]: df['tstamp'].dt.dayofyear.memory_usage(deep=True, index=False) / (1024**2)
Out[147]: 0.6683349609375
So you can see that the datetime.date/time objects take a lot more memory than the NumPy integers or floats (up to almost 6 times as much memory). This easily explains the MemoryError you see, since you only have 5 columns.
Speed-wise, the creation of the datetime.date/time objects is probably a reason for the slowdown as well, although I am not sure all of the slowdown can be attributed to this. You are welcome to look into it.
Anyway, if speed and memory are a concern, you had better avoid date and time anyway; you also cannot perform vectorized operations on those columns anymore. Depending on your application, all of the information about the date and time is also available through the other attributes (day, month, hour, etc.).
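For illustration, a small sketch of what stays vectorized and what falls back to per-element Python calls (rebuilding a frame shaped like the one at the top of the thread):
import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame({'tstamp': pd.date_range('2016-11-15', periods=87600, freq='T'),
                   'data': np.random.rand(87600)})
# Integer accessors keep the data in NumPy land, so comparisons stay vectorized:
mask_fast = df['tstamp'].dt.dayofyear == 320
# An object column of datetime.date is compared element by element in Python:
mask_slow = df['tstamp'].dt.date == datetime.date(2016, 11, 15)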
Comment From: jreback
Further to @jorisvandenbossche's comments: one of the strengths of pandas is that data can be represented by a specific implementation with fast-path access for most operations, while retaining compatibility with Python objects when needed. This is true for the datetime, timedelta, period, and category dtypes (other dtypes map the interface directly to the implementation).
What this means is that a user must be aware of the performance/memory tradeoffs when asking for python objects.
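To make that concrete, a small sketch showing the specific implementation behind datetime64[ns] (an int64 count of nanoseconds since the epoch, with the formatting layered on top), using the df from the example at the top:
df['tstamp'].values.dtype
## dtype('<M8[ns]')
# Viewing the same buffer as int64 exposes the underlying 8-byte integers:
df['tstamp'].values.view('int64')[:3]
## array([1479168000000000000, 1479168060000000000, 1479168120000000000])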
Comment From: dragonator4
@jorisvandenbossche Then how is it that storing the date and time together as a datetime consumes just as much memory as storing only the day of year? If a datetime can be represented as ints and floats and still be displayed as YYYY-MM-DD HH:MM:SS, then similarly dates, if not also times, could be stored as ints and floats and displayed correctly.
@jreback How much effort would it be to build pandas equivalents of the Python date and time types?
Comment From: jreback
@dragonator4 you are missing the point. This already exists as Timestamp (which is actually a subclass of datetime). It is only instantiated when needed. If you are actually using date and time directly as Python objects, then you are doing it wrong.
Timestamp is a performant replacement that is as feature-full as datetime.datetime and as performant as np.datetime64.
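A quick illustration of that point (a minimal sketch):
import datetime
import pandas as pd
ts = pd.Timestamp('2016-11-15 08:30')
isinstance(ts, datetime.datetime)
## True   (Timestamp subclasses datetime.datetime, so the stdlib API works)
ts.date(), ts.time()
## (datetime.date(2016, 11, 15), datetime.time(8, 30))
ts.value
## 1479198600000000000   (backed by a single int64: nanoseconds since the epoch)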
Comment From: dragonator4
@jreback I have timestamped sensor data, and I need to group the data by day and sensor ID and fit models on each group. Since the data spans more than a year, I need something robust to form the groups on. Date is perfect, but it comes with the performance and memory costs we discussed. As a workaround, I can form groups on year and day of year, but doing so requires an extra step of recovering the date from the year and day of year. Can you recommend a better way?
Ideally, I would like to do this:
df['date'] = df['tstamp'].dt.date
res = df.groupby(['sens_id', 'date']).apply(modeler)
Now, I would do this (for lack of something better):
df['year'], df['doy'] = df['tstamp'].dt.year, df['tstamp'].dt.dayofyear
res = df.groupby(['sens_id', 'year', 'doy']).apply(modeler)
Since the gap in performance is so huge, the latter is actually faster than the former, despite the extra computations.
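For reference, the extra recovery step mentioned above can be done with a %Y%j parse (a sketch, assuming the year/doy columns created in the snippet above):
# Rebuild the calendar date from year + day-of-year after the groupby:
recovered = pd.to_datetime((df['year'] * 1000 + df['doy']).astype(str), format='%Y%j')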
Comment From: TomAugspurger
@dragonator4 looks like you could round your tstamp:
df['date'] = df['tstamp'].dt.round('d')
and then group by that and the sensor ID.
It'd be quite a bit of work to support date and time, and in many cases (like this) a suitably rounded datetime does the job.
EDIT: mmm, you need to round down though. Pretty sure there's a way to do this. Yeah:
df['date'] = df['tstamp'].dt.floor('d')
Comment From: dragonator4
@TomAugspurger Thanks for that trick. I'll perhaps use floor; round may round to the next day if the time is after noon..?
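A quick check of that behaviour (a minimal sketch):
import pandas as pd
s = pd.Series(pd.to_datetime(['2016-11-15 13:30', '2016-11-15 09:00']))
s.dt.round('d')
## 0   2016-11-16
## 1   2016-11-15
## dtype: datetime64[ns]   (afternoon values round up to the next day)
s.dt.floor('d')
## 0   2016-11-15
## 1   2016-11-15
## dtype: datetime64[ns]   (floor always rounds down)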
Comment From: jorisvandenbossche
There is also normalize to get only the date part (not sure what would be better performance-wise, normalize or floor; you can test):
In [14]: s = pd.Series(pd.date_range('2012-01-01', freq='8h', periods=10))
In [15]: s
Out[15]:
0 2012-01-01 00:00:00
1 2012-01-01 08:00:00
2 2012-01-01 16:00:00
3 2012-01-02 00:00:00
4 2012-01-02 08:00:00
5 2012-01-02 16:00:00
6 2012-01-03 00:00:00
7 2012-01-03 08:00:00
8 2012-01-03 16:00:00
9 2012-01-04 00:00:00
dtype: datetime64[ns]
In [17]: s.dt.normalize()
Out[17]:
0 2012-01-01
1 2012-01-01
2 2012-01-01
3 2012-01-02
4 2012-01-02
5 2012-01-02
6 2012-01-03
7 2012-01-03
8 2012-01-03
9 2012-01-04
dtype: datetime64[ns]
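If you want to check the performance question yourself, a quick IPython sketch (timings will depend on your machine and pandas version, so none are shown here):
s = pd.Series(pd.date_range('2012-01-01', freq='T', periods=1000000))
%timeit s.dt.normalize()
%timeit s.dt.floor('d')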
Comment From: jreback
I seriously doubt the grouping part is actually slowing you down. For sure it's the .apply which IS executed in Python space.
In any event you can simply do
df.groupby(['sens_id', pd.Grouper(key='tstamp', freq='D')]).apply(modeler)
for proper semantics, or, as @TomAugspurger and @jorisvandenbossche suggest, this is de facto equivalent to rounding.
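For completeness, a minimal runnable sketch of that suggestion, with a placeholder modeler (the sens_id/tstamp column names follow the thread's example):
import numpy as np
import pandas as pd
df = pd.DataFrame({'sens_id': np.repeat([1, 2], 48),
                   'tstamp': list(pd.date_range('2016-11-15', periods=48, freq='H')) * 2,
                   'data': np.random.rand(96)})
def modeler(g):
    # stand-in for the real per-group model fit
    return g['data'].mean()
# Group by sensor and calendar day directly on the raw timestamp column:
res = df.groupby(['sens_id', pd.Grouper(key='tstamp', freq='D')]).apply(modeler)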