Hi all
Suppose I have a timeseries indexed by a PeriodIndex at day resolution, and I want to numerically integrate it over a range whose start- and end-point are known to a higher time resolution (say, to the exact second).
As the first step towards this, it'd help to take a slice of my timeseries in which the values at the start and end are computed "pro rata" according to the proportion of that calendar day Period which is included in the range.
I was looking for a way to do this with `reindex`, but it's not really a reindexing, more a slicing kind of operation with special-casing of the endpoints. Is there a way to do this in pandas? It seems like it could be a useful addition if not; I imagine it could be useful for various accounting applications too.
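To make the idea concrete, here's a rough sketch of the kind of pro-rata endpoint handling I mean, for a daily PeriodIndex, with the endpoint fractions computed by hand (the series and timestamps are made up for illustration, and it assumes the range spans at least two periods):

```python
import pandas as pd

# A daily series of rates (units per day):
s = pd.Series([10.0, 20.0, 30.0],
              index=pd.period_range("2014-01-01", "2014-01-03", freq="D"))

# Integration range known to sub-period (here, sub-day) resolution:
start = pd.Timestamp("2014-01-01 12:00:00")
end = pd.Timestamp("2014-01-03 06:00:00")

# Fraction of each day's period that falls inside [start, end):
day_seconds = 24 * 3600.0
fracs = pd.Series(1.0, index=s.index)
fracs.iloc[0] = (s.index[0].end_time - start).total_seconds() / day_seconds
fracs.iloc[-1] = (end - s.index[-1].start_time).total_seconds() / day_seconds

# Pro-rata sum over the range: half of day 1, all of day 2, a quarter of day 3.
integral = (s * fracs).sum()
```

The first day contributes with weight 0.5 and the last with weight 0.25, so the pro-rata sum is 32.5 rather than the naive 60.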
(Even better would be a way to numerically integrate a timeseries directly while specifying endpoints at higher time resolution than the index. Unfortunately this doesn't seem possible with the sample-based integrators in `scipy.integrate`.)
Cheers!
Comment From: mjwillson
I figured a useful way to achieve this would be something like the following, which also helps out in cases where periods in a period index are of different lengths and you want to take that into account (e.g. calendar months):
```python
import pandas as p


def period_lengths(start_datetime, end_datetime, **period_range_kwargs):
    """Like period_range, but instead of returning a PeriodIndex,
    returns a Series indexed by the PeriodIndex, whose values are the
    length (in seconds) of each period which overlaps the requested
    timestamp range. At the endpoints of the range, this may not be
    the full length of the period.
    """
    start_datetime = p.to_datetime(start_datetime)
    end_datetime = p.to_datetime(end_datetime)
    # this will include both endpoints:
    index = p.period_range(start_datetime, end_datetime, **period_range_kwargs)
    # Maybe there's a nicer way to do this which avoids the ugly .99999 ?
    # Note, periods in a PeriodIndex can be of different lengths, e.g.
    # calendar months.  (Built as a list so the endpoint values can be
    # assigned below.)
    lengths = [(x.end_time - x.start_time).total_seconds() for x in index]
    # need to special-case the first and last value:
    if len(lengths) == 1:
        # start period == end period, needs special handling
        lengths[0] = (end_datetime - start_datetime).total_seconds()
    else:
        lengths[0] = (index[0].end_time - start_datetime).total_seconds()
        lengths[-1] = (end_datetime - index[-1].start_time).total_seconds()
    return p.Series(lengths, index=index)
```
Can put a pull request together if you think this would be worth having in pandas.
Comment From: jreback
can you show a usage of this?
Comment From: mjwillson
In general it feels like most reductions over series with irregular period lengths (like calendar month) really ought to be taking into account the varying period length somehow. So that's one use case.
The other use case which is made possible once you start taking period lengths into account, is the ability to handle the length of endpoint periods in a tidy pro-rata way when you're working over a time range which doesn't start and end on an exact period boundary.
For example, if you want to compute a weighted mean of some period-indexed time series over a time range:
[EDIT: use `numpy.average`]

```python
weights = period_lengths(start_time, end_time, freq=values.index.freq)
weighted_mean = np.average(values[weights.index], weights=weights)
```
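As a self-contained illustration of that weighted mean (with made-up monthly values, computing the per-period overlap lengths inline rather than via the helper above):

```python
import numpy as np
import pandas as pd

values = pd.Series([100.0, 200.0, 300.0],
                   index=pd.period_range("2014-01", "2014-03", freq="M"))

start_time = pd.Timestamp("2014-01-15")
end_time = pd.Timestamp("2014-03-16")

# Overlap (in seconds) of each month with [start_time, end_time]:
weights = np.array([
    (min(per.end_time, end_time) - max(per.start_time, start_time)).total_seconds()
    for per in values.index
])

# Mean weighted by how much of each month falls inside the range:
weighted_mean = np.average(values.values, weights=weights)
```

Here January contributes 17 days, February its full 28, and March 15, so the mean is pulled towards February's value rather than treating the three months equally.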
Comment From: jreback
can you show the example (with the output) for using your procedure (and then w/o for comparison).
Comment From: mjwillson
OK -- not sure exactly what you're looking for there, but here's an example of usage:
```
In [321]: period_lengths('2014-01-14', '2014-05-14', freq='M')
Out[321]:
2014-01    1555199.999999
2014-02    2419199.999999
2014-03    2678399.999999
2014-04    2591999.999999
2014-05    1123200.000000
Freq: M, dtype: float64
```
Note how the different lengths of February and March show up, and that, since the start and end dates aren't aligned to month boundaries, the lengths at the endpoints aren't a full month.
(The .9999 thing is a bit silly, but it's just inheriting that from `Period.end_time`)
It's just a thought anyway, no worries if you don't think it belongs.
For some other use cases, it might make more sense to return a series of the proportion of the period overlapping the range, e.g.
```
In [324]: period_overlap_proportions('2014-01-14', '2014-05-14', freq='M')
Out[324]:
2014-01    0.580645
2014-02    1.000000
2014-03    1.000000
2014-04    1.000000
2014-05    0.433333
Freq: M, dtype: float64
```
This would then be useful if you wanted to sum up a series (like monthly revenue figures) pro rata over some range which isn't aligned with the month boundaries.
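For instance, a pro-rata revenue sum could look like this (made-up figures, with the overlap proportions computed inline for self-containedness):

```python
import pandas as pd

revenue = pd.Series([3100.0, 2800.0, 3100.0],
                    index=pd.period_range("2014-01", "2014-03", freq="M"))

start = pd.Timestamp("2014-01-14")
end = pd.Timestamp("2014-03-14")

# Proportion of each month that overlaps [start, end]:
props = []
for per in revenue.index:
    overlap = (min(per.end_time, end) - max(per.start_time, start)).total_seconds()
    full = (per.end_time - per.start_time).total_seconds()
    props.append(overlap / full)

# Pro-rata total: 18/31 of January, all of February, 13/31 of March.
pro_rata_total = (revenue * props).sum()
```

With these figures that comes to 1800 + 2800 + 1300 = 5900, rather than the naive 9000 you'd get by summing all three months in full.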
Comment From: mjwillson
A lot of the use cases could also be handled better by numerical integration routines for series of samples, which allow endpoints to lie between sampling points. `scipy.integrate` doesn't seem to allow this at the moment.
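One workaround sketch (plain numpy, not a scipy feature; the sample data here is made up) is to interpolate extra samples at the exact endpoints and then apply the trapezoidal rule over the augmented sample set:

```python
import numpy as np

# Sample times (in seconds) and values on a regular grid:
t = np.array([0.0, 60.0, 120.0, 180.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

# Desired integration range falls between sampling points:
a, b = 30.0, 150.0

# Keep the interior samples and add linearly interpolated samples
# at the exact endpoints:
inside = (t > a) & (t < b)
tt = np.concatenate([[a], t[inside], [b]])
yy = np.concatenate([[np.interp(a, t, y)], y[inside], [np.interp(b, t, y)]])

# Trapezoidal rule over the augmented samples:
integral = np.sum((yy[1:] + yy[:-1]) / 2.0 * np.diff(tt))
```

The same trick generalises to timestamped samples by converting timestamps to seconds first; the linear interpolation at the endpoints is exactly the piecewise-linear assumption the trapezoidal rule already makes.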
Comment From: jreback
you might find this interesting: http://pandas.pydata.org/pandas-docs/stable/cookbook.html#miscellaneous (see the numerical integration example)
Comment From: jreback
closing. this might involve a transformation of the series to weight by durations. but would need a concrete use case here.