Pandas Incorrect resampling from hourly to monthly values

EDIT: Modern reproducible example:

In [1]: import pandas as pd

In [2]: dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')

In [4]: ts = pd.Series(range(len(dates)), index=dates)

In [5]: def start(x):
   ...:     return x.index[0]
   ...:
   ...: def end(x):
   ...:     return x.index[-1]
   ...:

In [6]: ts.resample('1M', closed='right', label='right').apply(start)
Out[6]:
2000-03-31   2000-03-20
2000-04-30   2000-04-01
2000-05-31   2000-05-01
Freq: M, dtype: datetime64[ns]

In [7]: ts.resample('1M', closed='right', label='right').apply(end)
Out[7]:
2000-03-31   2000-03-31 23:00:00
2000-04-30   2000-04-30 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

In [8]: ts.resample('1M', closed='left', label='right').apply(start)
Out[8]:
2000-03-31   2000-03-20
2000-04-30   2000-03-31
2000-05-31   2000-04-30
Freq: M, dtype: datetime64[ns]

In [9]: ts.resample('1M', closed='left', label='right').apply(end)
Out[9]:
2000-03-31   2000-03-30 23:00:00
2000-04-30   2000-04-29 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

Expected behavior: https://github.com/pandas-dev/pandas/issues/2665#issuecomment-29155578

There seems to be something wrong when resampling hourly values to monthly values. The 'closed=' argument does not do what it should. I am using pandas 0.10.0.

import pandas as pd
from pandas import TimeSeries
import numpy as np

# Timeseries with hourly values
dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = TimeSeries(np.arange(len(dates)), index=dates)

# I want monthly values equal to the summed values of the hours in that month
# With "closed=left" I want the hourly value (1 value) at midnight to be 
# included in the month after that very midnight
data_monthly_1 = ts.resample('1M', how='sum', closed='left', label='right') 
sum_april1 = data_monthly_1.ix[1]

# If we check this, the summed values per month do not match : 
data_hourly_april2 = ts[ts.index.month == 4]
sum_april2 = sum(data_hourly_april2)

# Instead, "closed=left" above resulted in an inclusion of the entire last day of the
# previous month (25 values !) (and exclusion of the last day of the month)
# This is also strange as I am nowhere concerned with using 'days'.
data_hourly_april3 = ts[   ((ts.index.month == 3) & (ts.index.day == 31)) | 
                           ((ts.index.month == 4) & (ts.index.day < 30))      ]
sum_april3 = sum(data_hourly_april3)

# PS: "closed=right" results in the answer I would expect for "closed = left":
data_monthly_4 = ts.resample('1M', how='sum', closed='right', label='right') 
sum_april4 = data_monthly_4.ix[1]

When resampling these hourly to daily values, there was no problem.

Comment From: changhiskhan

If you leave out the closed/label arguments, pandas will choose the right one for you:

In [27]: ts.resample('M', how='sum')
Out[27]: 
2000-03-31     41328
2000-04-30    466200
2000-05-31    564852
Freq: M

edit: I thought it about it some more and am not certain this is a bug. If you're thinking about it terms of monthly periods, you should leave out the closed argument and let pandas choose the right one. That's the easiest way.

If you want explicit control, because this is timestamp format, the monthend datetime is midnight of the last day of the month. This is so it'll play nice with daily frequencies. It's possible to hack it so IF the input data is intraday frequency then we adjust the bin edges, but it is often the case that you might have intraday data without an explicit frequency. So the result would be inconsistent behavior.

Comment From: changhiskhan

I'm moving this to 0.10.2 in case more discussion is needed/desired

Comment From: filmackay

Isn't the above example results wrong - the dates indicated are actually midnight of the start of the relevant date. Is not the correct range for March 2000:

2000-03-01 0:00 (left) 2000-04-01 0:00 (right)

I would have thought mathematically, saying that a monthly period finishes on 2000-03-31 is incorrect, since this represents the midnight between 2000-03-30 and 2000-03-31, not the midnight after 2000-03-31 (which is in fact 2000-04-01 0:00).

Comment From: wesm

Well that's look at this here:

def start(x):
    return x.index[0]

def end(x):
    return x.index[-1]

In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00

In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-30 23:00:00
2000-04-30 2000-03-31 00:00:00 2000-04-29 23:00:00
2000-05-31 2000-04-30 00:00:00 2000-05-20 00:00:00

So I think the closed='left' behavior is what you want. I think there's a tweak in the code to account for the convention of using the last day of the month at midnight (even though the month isn't "technically" over) as the date.

Closing this issue ...

Comment From: kdebrab

This issue keeps on coming back... in my eyes this is not yet ready to be closed...

@changhiskhan, leaving out the closed argument may be ok when you are dealing with (intraday) source data that has start time stamp (or left) labels, but it is not giving correct results when your (intraday) source data has end time stamp (or right) labels.

@wesm, the closed='left' example you provide is not the desired behavior. On the contrary, it confirms the problem raised by @Cd48 and commented by @filmackay: the entire last day (24 values) is aggregated into the next month!

Instead, I believe the desired behavior should be:

In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-04-01 00:00:00
2000-04-30 2000-04-01 01:00:00 2000-05-01 00:00:00
2000-05-31 2000-05-01 01:00:00 2000-05-20 00:00:00

In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00

I did some research into the code... Actually, the behavior for source data with start time stamp labels previously had a very similar problem (see http://stackoverflow.com/questions/11018120/bug-in-resampling-with-pandas-0-8). It was handled by introducing the following code (#1471):

    if self.freq != 'D' and is_superperiod(self.freq, 'D'):
        day_nanos = _delta_to_nanoseconds(timedelta(1))
        if self.closed == 'right':
            bin_edges = bin_edges + day_nanos - 1
        else:
            bin_edges = bin_edges + day_nanos

of which the else close was later removed (#1726).

Shouldn't the correct code be:

    if self.freq != 'D' and is_superperiod(self.freq, 'D'):
        day_nanos = _delta_to_nanoseconds(timedelta(1))
        if self.closed == 'right':
            bin_edges = bin_edges + day_nanos
        else:
            bin_edges = bin_edges + day_nanos - 1

That indeed results in the desired behavior shown above and also solves #5440. Though that raises again issue #1726... I think #1726 should be solved separately, maybe by adjusting the range edge first?

Comment From: jreback

why don't u do a PR with the new code and some tests to validate?

Comment From: kdebrab

It also involves changing the default behavior of closed.

Changing the default to closed='left' for all frequencies (as of now the default is already left for intraday frequencies), but keeping the default label as it is now (i.e. 'left' for intraday, 'right' otherwise) would result in the same default behaviour for most users.

Comment From: kdebrab

I'm afraid a PR is too much for me...

I'm not accustomed with the PR process yet (and I also lack the time). Moreover, a solution for #1726 (and probably other issues) still needs to be found.

As a side note: the resample code looks to me quite complicated with already many hacks. Instead of adding another hack, maybe a profound (and time-consuming) code refactoring is desirable?

Comment From: jreback

ok will reopen this not sure if this will be addressed in the near future so would appreciate a pr from you at some point as you appear pretty familiar

Comment From: MarcoGorelli

This looks correct to me

If we have

dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)

and we do ts.resample('1M', closed='right', label='left'), then we have the bins: - '2000-02-29': ('2000-02-29', '2000-03-31'] - '2000-03-31': ('2000-03-31', '2000-04-30'] - '2000-04-30': ('2000-04-30', '2000-05-31']

If we do

dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)

bins = [
    ('2000-02-29', '2000-03-31'),
    ('2000-03-31', '2000-04-30'),
    ('2000-04-30', '2000-05-31'),
]

# left, right
result = []
for left, right in bins:
    df = ts[(ts.index.normalize()>left)&(ts.index.normalize()<=right)].copy()
    result.append(df.sum())
print(result)
print(ts.resample('1M', label='left', closed='right').sum())

then result and ts.resample('1M', label='left', closed='right').sum() show the same numbers

I'm marking as 'closing candidate' for now then, but I'll double-check later

Comment From: MarcoGorelli

closing then, but please do let me know if I've misunderstood