Let's consider the following DataFrame:
date_range = pd.date_range(dt(2010,1,1), dt(2010,1,31), freq='1D')
df = pd.DataFrame(data = np.random.rand(len(date_range),2), index = date_range)
If I group the data points into 1-week periods and inspect the group definitions, I get:
In: [1]:df.groupby(pd.TimeGrouper('W')).groups
Out:[1]:
{Timestamp('2010-01-03 00:00:00', freq='W-SUN'): 3,
Timestamp('2010-01-10 00:00:00', freq='W-SUN'): 10,
Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 17,
Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 24,
Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 31}
I retrieve the keys of that dictionary:
In: [2]: list(df.groupby(pd.TimeGrouper('W')).groups.keys())
Out:[2]:
[Timestamp('2010-01-03 00:00:00', freq='W-SUN'),
Timestamp('2010-01-10 00:00:00', freq='W-SUN'),
Timestamp('2010-01-31 00:00:00', freq='W-SUN'),
Timestamp('2010-01-17 00:00:00', freq='W-SUN'),
Timestamp('2010-01-24 00:00:00', freq='W-SUN')]
However, I am left with odd objects such as Timestamp('2010-01-24 00:00:00', freq='W-SUN'), which are labeled Timestamp but carry a freq like a Period. I think this is not correct. Shouldn't those objects be Periods?
Comment From: jorisvandenbossche
You are starting with a DatetimeIndex (so index of Timestamps), so also after grouping this will give you Timestamps. If you want periods, you can start with Periods (e.g. using period_range instead of date_range).
Or you can convert those Timestamps to Periods afterwards, if you want Periods:
In [82]: df.groupby(pd.TimeGrouper('W')).mean()
Out[82]:
0 1
2010-01-03 0.639762 0.376022
2010-01-10 0.663659 0.543635
2010-01-17 0.286419 0.395908
2010-01-24 0.420077 0.470722
2010-01-31 0.440236 0.473810
In [83]: df.groupby(pd.TimeGrouper('W')).mean().to_period()
Out[83]:
0 1
2009-12-28/2010-01-03 0.639762 0.376022
2010-01-04/2010-01-10 0.663659 0.543635
2010-01-11/2010-01-17 0.286419 0.395908
2010-01-18/2010-01-24 0.420077 0.470722
2010-01-25/2010-01-31 0.440236 0.473810
Note that in this case df.resample('W') instead of df.groupby(pd.TimeGrouper('W')) is actually nicer IMO.
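For readers on a recent pandas release, where (as far as I know) pd.TimeGrouper is no longer available, the same comparison can be sketched with pd.Grouper. This is a minimal sketch and not the code from the thread: the string dates, the variable names and the explicit 'W-SUN' argument are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the original post: one row per day in January 2010
idx = pd.date_range("2010-01-01", "2010-01-31", freq="D")
df = pd.DataFrame(np.random.rand(len(idx), 2), index=idx)

# Weekly means, once via groupby + Grouper and once via resample
by_grouper = df.groupby(pd.Grouper(freq="W")).mean()
by_resample = df.resample("W").mean()

# Both results are labeled with the Sunday that closes each week (Timestamps)
same_labels = by_grouper.index.equals(by_resample.index)

# Converting the labels to weekly Periods afterwards
as_periods = by_resample.to_period("W-SUN")
print(as_periods.index)
```

Both routes label each bin with a Timestamp; only the final to_period call turns the labels into Periods.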
Comment From: jimbasquiat
OK thanks! If I do df.groupby(pd.TimeGrouper('W')).mean().to_period(), the resulting index is:
PeriodIndex(['2009-12-28/2010-01-03', '2010-01-04/2010-01-10',
'2010-01-11/2010-01-17', '2010-01-18/2010-01-24',
'2010-01-25/2010-01-31'],
dtype='period[W-SUN]', freq='W-SUN')
I'm even more confused now. What does dtype='period[W-SUN]' stand for?
Also just as a "cosmetic" thing: I am only after the list of periods. I can do it with the solution you suggested: df.groupby(pd.TimeGrouper('W')).mean().to_period().index
But would there be no other way passing by the .groups attribute? This would be more readable and consistent in terms of coding.
Comment From: jorisvandenbossche
> What does dtype='period[W-SUN]'
See http://pandas.pydata.org/pandas-docs/stable/timeseries.html#anchored-offsets, this means 'weekly, but anchored at the sunday' (because your original timestamps that represented the full week were sundays). I have to admit that the documentation can certainly be improved here.
> I am only after the list of periods. ... But would there be no other way passing by the .groups attribute?
Apparently, when grouping on a period index, the .groups attribute does not work (this may be a bug). But starting from the timestamp keys, you could do:
In [26]: [p.to_period() for p in df.groupby(pd.TimeGrouper('W')).groups.keys()]
Out[26]:
[Period('2010-01-04/2010-01-10', 'W-SUN'),
Period('2010-01-25/2010-01-31', 'W-SUN'),
Period('2010-01-11/2010-01-17', 'W-SUN'),
Period('2010-01-18/2010-01-24', 'W-SUN'),
Period('2009-12-28/2010-01-03', 'W-SUN')]
Although you probably need to think about why you need those, and if there is not a better method to do the same.
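One possible alternative, if only the list of weekly periods is needed, is to convert the daily index to periods first and skip the Timestamp keys entirely. The following is a sketch under the assumption of a recent pandas version; the names week_of_day, weeks and weekly_mean are illustrative and do not come from the thread.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2010-01-01", "2010-01-31", freq="D")
df = pd.DataFrame(np.random.rand(len(idx), 2), index=idx)

# Map every daily timestamp to the Sunday-anchored week that contains it
week_of_day = df.index.to_period("W-SUN")

# The distinct weeks, as a plain list of Period objects
weeks = list(week_of_day.unique())

# The same PeriodIndex can be used as the grouping key,
# so the result is labeled by Period rather than by Timestamp
weekly_mean = df.groupby(week_of_day).mean()
print(weeks)
```

Whether this is preferable to converting the keys afterwards depends on what the periods are needed for.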
Comment From: jimbasquiat
> this means 'weekly, but anchored at the sunday'
This is not my question. My question is: "Why is it stated as a dtype for the timestamps?"
This was my question to begin with. If you look at my first post, you can see that the Timestamps are displayed as follows: Timestamp('2010-01-03 00:00:00', freq='W-SUN'). Why would a Timestamp have a period as a dtype?
Comment From: jreback
@jimbasquiat this is as expected.
In [9]: df.groupby(pd.TimeGrouper('W')).mean().to_period().index
Out[9]: PeriodIndex(['2009-12-28/2010-01-03', '2010-01-04/2010-01-10', '2010-01-11/2010-01-17', '2010-01-18/2010-01-24', '2010-01-25/2010-01-31'], dtype='period[W-SUN]', freq='W-SUN')
PeriodIndex has a pandas extension dtype, which indicates the frequency of the periods contained in it.
See the whatsnew here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#period-changes.
Comment From: jreback
Also note #12871; period resampling is in flux and being worked on. Contributions are welcome.
Comment From: jimbasquiat
I am not sure why I have to repeat my question for the 3rd time... I am not talking about a PeriodIndex but about a Timestamp object. Do I have to copy-paste my initial post here?
Comment From: jreback
@jimbasquiat that's correct as well. It's simply a frequency attached to a Timestamp. Not sure what the problem is.
Comment From: jimbasquiat
@jreback Is there any documentation somewhere about such a Timestamp object? To me it makes little sense. The dtype for a Timestamp should be datetime64 or one of those "<M8..." types, no? Isn't a Timestamp with a frequency the definition of a Period?
Comment From: jorisvandenbossche
Not sure if there is much information on the freq attribute for the individual Timestamp objects, but the frequency of the Timestamps is mainly of importance if you have a regular index of them:
In [8]: pd.date_range("2016-01-01", periods=3, freq='D')
Out[8]: DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03'], dtype='datetime64[ns]', freq='D')
In [9]: pd.date_range("2016-01-01", periods=3, freq='D')[0]
Out[9]: Timestamp('2016-01-01 00:00:00', freq='D')
So accessing a single element here just kind of 'inherits' the frequency of the DatetimeIndex.
The Timestamp is a scalar and has no dtype attribute. But the dtype of an index or series of those values is certainly 'datetime64[ns]'.
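The distinction between the dtype and the freq can be checked on the indexes themselves. This is a minimal sketch with illustrative variable names; on recent pandas versions the scalar Timestamp may no longer display a freq at all, so the sketch only inspects the two kinds of index.

```python
import pandas as pd

idx = pd.date_range("2016-01-01", periods=3, freq="D")

# DatetimeIndex: the dtype is a datetime64 dtype, the frequency is kept separately
dt_dtype = str(idx.dtype)
dt_freq = idx.freqstr

# PeriodIndex: the frequency is part of the extension dtype itself
pidx = idx.to_period()
p_dtype = str(pidx.dtype)

# A single element of the DatetimeIndex is a Timestamp scalar
first = idx[0]
print(dt_dtype, dt_freq, p_dtype, type(first).__name__)
```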
Comment From: jreback
So a Timestamp with a freq is de-facto a Period.
In [2]: t = Timestamp('2010-01-03 00:00:00', freq='W-SUN')
In [3]: t
Out[3]: Timestamp('2010-01-03 00:00:00', freq='W-SUN')
In [4]: t.to_period()
Out[4]: Period('2009-12-28/2010-01-03', 'W-SUN')
In [5]: t.to_period().to_timestamp(how='start')
Out[5]: Timestamp('2009-12-28 00:00:00')
In [6]: t.to_period().to_timestamp(how='end')
Out[6]: Timestamp('2010-01-03 00:00:00')
However, I suspect that this is an implementation detail. IOW, resampling with Periods is still pretty buggy, and these are probably just easier to deal with. Further, in prior implementations of pandas (well, actually currently as well, though a PR is in the works), Periods are much less performant than Timestamps, so it may be that a freq-attached Timestamp is convenient.
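On pandas versions where a Timestamp can no longer be constructed with a freq, a similar round trip can be sketched by passing the frequency to to_period explicitly. This is an assumption-laden illustration rather than the session above: the variable names are made up, and the end of the period is normalized to midnight because recent versions report the last nanosecond of the period rather than the start of its last day.

```python
import pandas as pd

# A plain Timestamp (a Sunday), with no frequency attached
t = pd.Timestamp("2010-01-03")

# The frequency is supplied at conversion time instead
week = t.to_period("W-SUN")

# Boundaries of the week that the Period represents
week_start = week.start_time
week_end_day = week.end_time.normalize()

# A Period is a span: any instant inside the week falls within its bounds
contains_midweek = week.start_time <= pd.Timestamp("2009-12-30") <= week.end_time
print(week, week_start, week_end_day, contains_midweek)
```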
@jimbasquiat I will create an issue specifically to address this. If you want, you could do some legwork to see if it is possible to instead return Periods where we now do Timestamp with a freq.
Comment From: jorisvandenbossche
> to see if it is possible to instead return Periods where we now do Timestamp with a freq
Is that something we actually want? -> OK see your other issue, will comment there
Comment From: jreback
@jorisvandenbossche I created #15146 for that very discussion.
My answer is: if there are actually no differences, then sure.
Comment From: jorisvandenbossche
@jreback yes, updated my comment above that I saw that and commented over there.
Comment From: jimbasquiat
Thank you guys for picking it up. In my earliest example with the groupby I think returning a Period would have made more sense. In the example you are giving with the Date Range probably not. I am not sure I can be of help on that topic as I am very new to Pandas (started learning it about 3 weeks ago). But thanks!