Pandas More flexible binning/aggregation of time series data

from mailing list

I have TimeSeries that has observations at the second level:


mport pandas as pd
import numpy as np
rng = pd.date_range('2012-01-01', '2012-01-10', freq='T')
s = pd.TimeSeries(np.random.randn(12961)*100, index=rng)


I now want to resample the data to daily at 17:59:59, so that I have the sum of positions at 17:59:59 for all days in range.

If I didn't care about the time I could just do:

s1 = s.resample('D',how='sum')

Does anyone have a suggestion?```

Comment From: wesm

user also posted:

Hello All,

Thank you for your replies, after a day of fiddling around with pandas and moving up the ranks from newbie to conversant, I've figured out how do what I want to do. First of all my first example was wrong, I was not trying to figure out the sum of my positions at a given Frequency at a specific time, rather I was trying to determine my actual position at a given frequency for a given time. In the end, the code that I needed to write looked like this:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import pytz
positions = [10,100,30,150,170,12,90,100,30]
timestamps = pd.DatetimeIndex(['2012-01-01 09:10:00','2012-01-01 14:10:00','2012-01-01 17:15:00','2012-01-02 09:11:00','2012-01-02 12:10:00','2012-01-02 17:59:00','2012-01-02 18:10:00','2012-01-03 05:10:00','2012-01-03 18:10:00'])
symbols = ['SDF','SDF','SDF','HS','SDF','HS','SDF','HS','SDF']
df = pd.DataFrame(zip(timestamps,positions,symbols),columns=['timestamp','position','symbol'],index=timestamps)
start = df.first_valid_index().replace(hour=17,minute=59,second=59)
end = df.last_valid_index().replace(hour=17,minute=59, second=59)
selection = pd.date_range(start, end, freq='D')
# positions_at_close = df['position'].asof(selection)
df2 = pd.DataFrame()
g = df.groupby('symbol')
for key, data in g:
    df2 = df2.append(pd.concat([data['position'].asof(selection),data['symbol'].asof(selection)], axis=1, keys=['position','symbol']))

Which took the DataFrame df:

In [574]: df
Out[574]: 
                               timestamp  position symbol
2012-01-01 09:10:00  2012-01-01 09:10:00        10    SDF
2012-01-01 14:10:00  2012-01-01 14:10:00       100    SDF
2012-01-01 17:15:00  2012-01-01 17:15:00        30    SDF
2012-01-02 09:11:00  2012-01-02 09:11:00       150     HS
2012-01-02 12:10:00  2012-01-02 12:10:00       170    SDF
2012-01-02 17:59:00  2012-01-02 17:59:00        12     HS
2012-01-02 18:10:00  2012-01-02 18:10:00        90    SDF
2012-01-03 05:10:00  2012-01-03 05:10:00       100     HS
2012-01-03 18:10:00  2012-01-03 18:10:00        30    SDF

To df2:
In [575]: df2
Out[575]: 
                     position symbol
2012-01-01 17:59:59       NaN    NaN
2012-01-02 17:59:59        12     HS
2012-01-03 17:59:59       100     HS
2012-01-01 17:59:59        30    SDF
2012-01-02 17:59:59       170    SDF
2012-01-03 17:59:59        90    SDF

There is probably a more elegant way to do this, but I this works.

Comment From: aabdullah-bos

Hello Wes,

Thanks for posting this. I just wanted to make one small change, to the code above. The selection needs to made inside the loop over the groups, so that you don't get the "as of" date for dates when data shouldn't exist.

for key, data in g: start_rng = data.first_valid_index().replace(hour=17,minute=59,second=59) end_rng = data.last_valid_index().replace(hour=17,minute=59,second=59) selection = pd.date_range(start_rng,end_rng, freq='D') df2 = df2.append(pd.concat([data['position'].asof(selection),data['symbol'].asof(selection)], axis=1, keys=['position','symbol']))

Comment From: hayd

This seems like something that we should use the base argument of resample for.

Comment From: MarcoGorelli

closing as there's been no activity in over a decade, and it's not that clear what's being asked for