EDIT: Modern reproducible example:
In [1]: import pandas as pd
In [2]: dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
In [4]: ts = pd.Series(range(len(dates)), index=dates)
In [5]: def start(x):
...: return x.index[0]
...:
...: def end(x):
...: return x.index[-1]
...:
In [6]: ts.resample('1M', closed='right', label='right').apply(start)
Out[6]:
2000-03-31 2000-03-20
2000-04-30 2000-04-01
2000-05-31 2000-05-01
Freq: M, dtype: datetime64[ns]
In [7]: ts.resample('1M', closed='right', label='right').apply(end)
Out[7]:
2000-03-31 2000-03-31 23:00:00
2000-04-30 2000-04-30 23:00:00
2000-05-31 2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]
In [8]: ts.resample('1M', closed='left', label='right').apply(start)
Out[8]:
2000-03-31 2000-03-20
2000-04-30 2000-03-31
2000-05-31 2000-04-30
Freq: M, dtype: datetime64[ns]
In [9]: ts.resample('1M', closed='left', label='right').apply(end)
Out[9]:
2000-03-31 2000-03-30 23:00:00
2000-04-30 2000-04-29 23:00:00
2000-05-31 2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]
Expected behavior: https://github.com/pandas-dev/pandas/issues/2665#issuecomment-29155578
There seems to be something wrong when resampling hourly values to monthly values. The 'closed=' argument does not do what it should. I am using pandas 0.10.0.
import pandas as pd
from pandas import TimeSeries
import numpy as np
# Timeseries with hourly values
dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = TimeSeries(np.arange(len(dates)), index=dates)
# I want monthly values equal to the summed values of the hours in that month
# With "closed=left" I want the hourly value (1 value) at midnight to be
# included in the month after that very midnight
data_monthly_1 = ts.resample('1M', how='sum', closed='left', label='right')
sum_april1 = data_monthly_1.ix[1]
# If we check this, the summed values per month do not match :
data_hourly_april2 = ts[ts.index.month == 4]
sum_april2 = sum(data_hourly_april2)
# Instead, "closed=left" above resulted in an inclusion of the entire last day of the
# previous month (25 values !) (and exclusion of the last day of the month)
# This is also strange as I am nowhere concerned with using 'days'.
data_hourly_april3 = ts[ ((ts.index.month == 3) & (ts.index.day == 31)) |
((ts.index.month == 4) & (ts.index.day < 30)) ]
sum_april3 = sum(data_hourly_april3)
# PS: "closed=right" results in the answer I would expect for "closed = left":
data_monthly_4 = ts.resample('1M', how='sum', closed='right', label='right')
sum_april4 = data_monthly_4.ix[1]
When resampling these hourly to daily values, there was no problem.
Comment From: changhiskhan
If you leave out the closed/label arguments, pandas will choose the right one for you:
In [27]: ts.resample('M', how='sum')
Out[27]:
2000-03-31 41328
2000-04-30 466200
2000-05-31 564852
Freq: M
edit: I thought it about it some more and am not certain this is a bug. If you're thinking about it terms of monthly periods, you should leave out the closed
argument and let pandas choose the right one. That's the easiest way.
If you want explicit control, because this is timestamp format, the monthend datetime is midnight of the last day of the month. This is so it'll play nice with daily frequencies. It's possible to hack it so IF the input data is intraday frequency then we adjust the bin edges, but it is often the case that you might have intraday data without an explicit frequency. So the result would be inconsistent behavior.
Comment From: changhiskhan
I'm moving this to 0.10.2 in case more discussion is needed/desired
Comment From: filmackay
Isn't the above example results wrong - the dates indicated are actually midnight of the start of the relevant date. Is not the correct range for March 2000:
2000-03-01 0:00 (left) 2000-04-01 0:00 (right)
I would have thought mathematically, saying that a monthly period finishes on 2000-03-31 is incorrect, since this represents the midnight between 2000-03-30 and 2000-03-31, not the midnight after 2000-03-31 (which is in fact 2000-04-01 0:00).
?
Comment From: wesm
Well that's look at this here:
def start(x):
return x.index[0]
def end(x):
return x.index[-1]
In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]:
start end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00
In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]:
start end
2000-03-31 2000-03-20 00:00:00 2000-03-30 23:00:00
2000-04-30 2000-03-31 00:00:00 2000-04-29 23:00:00
2000-05-31 2000-04-30 00:00:00 2000-05-20 00:00:00
So I think the closed='left'
behavior is what you want. I think there's a tweak in the code to account for the convention of using the last day of the month at midnight (even though the month isn't "technically" over) as the date.
Closing this issue ...
Comment From: kdebrab
This issue keeps on coming back... in my eyes this is not yet ready to be closed...
@changhiskhan, leaving out the closed
argument may be ok when you are dealing with (intraday) source data that has start time stamp (or left) labels, but it is not giving correct results when your (intraday) source data has end time stamp (or right) labels.
@wesm, the closed='left'
example you provide is not the desired behavior. On the contrary, it confirms the problem raised by @Cd48 and commented by @filmackay: the entire last day (24 values) is aggregated into the next month!
Instead, I believe the desired behavior should be:
In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]:
start end
2000-03-31 2000-03-20 00:00:00 2000-04-01 00:00:00
2000-04-30 2000-04-01 01:00:00 2000-05-01 00:00:00
2000-05-31 2000-05-01 01:00:00 2000-05-20 00:00:00
In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]:
start end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00
I did some research into the code... Actually, the behavior for source data with start time stamp labels previously had a very similar problem (see http://stackoverflow.com/questions/11018120/bug-in-resampling-with-pandas-0-8). It was handled by introducing the following code (#1471):
if self.freq != 'D' and is_superperiod(self.freq, 'D'):
day_nanos = _delta_to_nanoseconds(timedelta(1))
if self.closed == 'right':
bin_edges = bin_edges + day_nanos - 1
else:
bin_edges = bin_edges + day_nanos
of which the else
close was later removed (#1726).
Shouldn't the correct code be:
if self.freq != 'D' and is_superperiod(self.freq, 'D'):
day_nanos = _delta_to_nanoseconds(timedelta(1))
if self.closed == 'right':
bin_edges = bin_edges + day_nanos
else:
bin_edges = bin_edges + day_nanos - 1
That indeed results in the desired behavior shown above and also solves #5440. Though that raises again issue #1726... I think #1726 should be solved separately, maybe by adjusting the range edge first
?
Comment From: jreback
why don't u do a PR with the new code and some tests to validate?
Comment From: kdebrab
It also involves changing the default behavior of closed
.
Changing the default to closed='left'
for all frequencies (as of now the default is already left for intraday frequencies), but keeping the default label
as it is now (i.e. 'left'
for intraday, 'right'
otherwise) would result in the same default behaviour for most users.
Comment From: kdebrab
I'm afraid a PR is too much for me...
I'm not accustomed with the PR process yet (and I also lack the time). Moreover, a solution for #1726 (and probably other issues) still needs to be found.
As a side note: the resample code looks to me quite complicated with already many hacks. Instead of adding another hack, maybe a profound (and time-consuming) code refactoring is desirable?
Comment From: jreback
ok will reopen this not sure if this will be addressed in the near future so would appreciate a pr from you at some point as you appear pretty familiar
Comment From: MarcoGorelli
This looks correct to me
If we have
dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)
and we do ts.resample('1M', closed='right', label='left')
, then we have the bins:
- '2000-02-29': ('2000-02-29', '2000-03-31']
- '2000-03-31': ('2000-03-31', '2000-04-30']
- '2000-04-30': ('2000-04-30', '2000-05-31']
If we do
dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)
bins = [
('2000-02-29', '2000-03-31'),
('2000-03-31', '2000-04-30'),
('2000-04-30', '2000-05-31'),
]
# left, right
result = []
for left, right in bins:
df = ts[(ts.index.normalize()>left)&(ts.index.normalize()<=right)].copy()
result.append(df.sum())
print(result)
print(ts.resample('1M', label='left', closed='right').sum())
then result
and ts.resample('1M', label='left', closed='right').sum()
show the same numbers
I'm marking as 'closing candidate' for now then, but I'll double-check later
Comment From: MarcoGorelli
closing then, but please do let me know if I've misunderstood