For PA, freq is redundant with dtype. For DTA/TDA, it means something very different from what it means for PA. The only thing it is really used for is _time_shift (and along with it, add/sub of integers, which is deprecated in DTI/TDI anyway)
Without freq going into the constructors, DTA/TDA/PA would all just need dtype, so some simplification/unification could be done.
Frequency validation in __init__ could be saved for DTI/TDI and omitted from DTA/TDA.
If we ever come to our senses and allow 2D EAs (so we can simplify a ton of internals), freq stops making sense for 2D, so would be unused in those cases anyway.
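To illustrate the opening point about PA versus DTA/TDA, here is a minimal sketch using only public pandas attributes. The variable names are illustrative, and it assumes a reasonably recent pandas version. For a PeriodArray the freq is carried by the dtype, while for a DatetimeIndex two objects with the same dtype can have different freq values.

```python
import pandas as pd

# PeriodArray: freq is part of the dtype, so the attribute carries no extra information.
pa = pd.period_range("2016-01", periods=3, freq="M").array
period_freq_matches_dtype = pa.freq == pa.dtype.freq

# DatetimeIndex: freq describes the spacing of the values, not the dtype.
dti = pd.date_range("2016-01-01", periods=4, freq="D")
irregular = dti[[0, 2, 3]]  # drop one element, so the values are no longer evenly spaced

same_dtype = dti.dtype == irregular.dtype
print(period_freq_matches_dtype, same_dtype, dti.freq, irregular.freq)
```

Here the two datetime objects share a dtype, but only the evenly spaced one keeps a freq.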
Comment From: TomAugspurger
I'd probably support this, but would need to think things through a bit more :) The different meaning of freq between period and timedelta / datetime is quite confusing.
Additionally:
- There's a (perhaps surprising) performance penalty to specifying freq when creating the new array. It's hard to say in the abstract whether downstream operations would benefit from it.
- It's hard to know when operations should invalidate / re-infer the freq (https://github.com/pandas-dev/pandas/issues/24555)
Thinking more though, those apply equally to the index classes, and I suppose that freq would not be deprecated there (nor would we want to?).
Comment From: jorisvandenbossche
The idea would be that freq is then a cached attribute on DatetimeIndex/TimedeltaIndex, just like the other cached attributes it has (is_unique, ...)?
Some use cases of freq in timeseries are resampling operations and plotting. For both of these, it would be useful to cache the freq to avoid recomputing it (and also to have a way to never compute it at all, if a regular DatetimeIndex is created).
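The difference between a freq that is known at construction and one that has to be recomputed can be sketched as follows. This is only an illustration with made-up variable names, assuming current public pandas behavior: date_range stores the freq it was given, while an index built from explicit values stores none and has to scan its values to infer one.

```python
import pandas as pd

# Built from a range: the freq is known up front, so nothing has to be inferred.
regular = pd.date_range("2016-01-01", periods=5, freq="D")

# Built from explicit values: no freq is stored on the index.
from_values = pd.DatetimeIndex(["2016-01-01", "2016-01-02", "2016-01-03"])

stored = regular.freq                  # set at construction
missing = from_values.freq             # nothing stored
inferred = pd.infer_freq(from_values)  # computed by inspecting the values
print(stored, missing, inferred)
```

Caching would matter for the second case, where each call that needs the freq would otherwise repeat the inference.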
Comment From: TomAugspurger
In theory, I suppose it could be cached on DatetimeArray/TimedeltaArray as well? I believe that __setitem__ is the only method that mutates inplace, and we already invalidate the freq there.
Comment From: jbrockmendel
Thinking about this again following 1.0 deprecation policy discussion. I'm considering this in conjunction with removing Timestamp.freq (#15146). The difficult part of the implementation for both DTA and Timestamp is that without self.freq, what do we do with the is_month_start etc. properties that depend on self.freq?
One option would be to make them into methods to which you can pass a freq. But it isn't obvious what warning to issue for that deprecation. We want to tell users "do X now and it will be compatible with this version and future versions" but AFAICT there is no way to do that.
Comment From: mroeschke
I did something vaguely similar with weekday_name. We deprecated the weekday_name attribute and implemented day_name as a function that takes a locale at the same time.
Similarly we can deprecate freq, is_month_start, is_*_[start|end], and introduce an is_start(freq) method.
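A rough sketch of what this suggestion could look like, assuming the behavior can be approximated with existing offset objects. Timestamp.day_name() is the real method mentioned above. The is_start function below is hypothetical: it is a free function written for illustration, not a pandas API and not the signature proposed in this thread. It shows why the answer depends on the freq that is passed in.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def is_start(ts, freq):
    """Hypothetical helper (not part of pandas): is ts a start date for freq?"""
    return to_offset(freq).is_on_offset(ts)


# Existing precedent: a method that takes an optional locale argument.
day = pd.Timestamp("2016-01-01").day_name()

saturday = pd.Timestamp("2016-10-01")  # first calendar day of the month
monday = pd.Timestamp("2016-10-03")    # first business day of the month

calendar_start = is_start(saturday, "MS")          # calendar month start
business_start_sat = is_start(saturday, "BMS")     # business month start
business_start_mon = is_start(monday, "BMS")
print(day, calendar_start, business_start_sat, business_start_mon)
```

The same timestamp is a month start under a calendar-month freq but not under a business-month freq, which is the dependence on self.freq that the properties currently hide.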
Comment From: jorisvandenbossche
@jbrockmendel your proposal here is to remove it from Datetime/TimedeltaArray, but keep it for the Index classes, right?
So dealing with freq, caching it, etc, would be dealt with on the Index-level?
Comment From: jbrockmendel
So dealing with freq, caching it, etc, would be dealt with on the Index-level ?
Correct.
Comment From: jorisvandenbossche
And what do you see as the advantage of dealing with this on the Index level instead of the array level?
The places where freq gets used (eg resample, plotting?) also support datetimes in columns, not only in the index. For those, it would be beneficial if it was handled on the array-level?
Comment From: jbrockmendel
And what do you see as the advantage of dealing with this on the Index level instead of the array level?
__setitem__ on a DTA/TDA invalidates the freq, but I haven't found a good way to propagate that to views. As a result, we can get into situations where the freq attr is inaccurate.
Comment From: jorisvandenbossche
setitem on a DTA/TDA invalidates the freq, but i haven't found a good way to propagate that to views.
Can you describe that issue in a bit more detail? (or give an example) How is setitem related to views?
Comment From: jbrockmendel
Can you describe that issue in a bit more detail? (or give an example) How is setitem related to views?
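import pandas as pd

dti = pd.date_range("2016-01-01", periods=4)
dta = dti._data  # underlying DatetimeArray (private attribute)
view = dta[:]  # shares memory with dta

dta[0] = pd.NaT  # __setitem__ clears freq on dta only
assert dta.freq is None
assert view[0] is pd.NaT  # the write is visible through the view
assert view.freq is None  # <-- fails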
Comment From: jorisvandenbossche
Thanks for that example! That clarifies ;)
That's an interesting problem. So Index circumvents this by being "immutable". I am wondering if that doesn't point to a general issue with EAs and views.
Do we have other EAs with attributes / metadata?
IntervalArray has the closed attribute, but that is something that doesn't depend on the actual values (meaning, you can't "invalidate" closed by setting a value in the IntervalArray).
In GeoPandas, we just added a crs attribute (holding metadata information about the coordinate reference system), but there, when setting values in place (__setitem__), we assume the user is responsible for setting a value that matches the CRS (so setting a value also shouldn't invalidate the crs attribute). Although there might be some esoteric cases where you can run into problems with this.
We have been thinking about having a faster hasna metadata for the masked array EAs. One idea is to have this check the mask being None, if we can make the mask optional. Another idea would be to actually "cache" this and invalidate when the array gets mutated. But that would run into the same problem as the freq of DTA.
Comment From: jorisvandenbossche
To put it another way: 1) would it technically be possible to implement something similar to "cache_readonly" for EAs (that can handle those problems) and 2) if possible, would it be useful?
Comment From: jbrockmendel
To put it another way: 1) would it technically be possible to implement something similar to "cache_readonly" for EAs (that can handle those problems) and 2) if possible, would it be useful?
So something that, when cleared/invalidated, would also clear the cache on any other EAs that are a view on the same data?
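One way such a view-aware cache could work is sketched below. This is a toy design, not anything that exists in pandas: ToyArray, _Version, and the hasna property are all hypothetical names. Every view of one buffer shares a version counter, a write through any of them bumps it, and each array treats its cached values as stale once the counter has moved.

```python
import numpy as np


class _Version:
    """Mutable counter shared by every view of one underlying buffer."""

    def __init__(self):
        self.value = 0


class ToyArray:
    """Hypothetical array wrapper with a cache that is invalidated across views."""

    def __init__(self, data, version=None):
        self._data = data
        self._version = version if version is not None else _Version()
        self._cache = {}  # name -> (version at compute time, cached value)

    def __getitem__(self, key):
        result = self._data[key]
        if isinstance(key, slice):
            # A numpy slice is a view, so the new wrapper shares the counter.
            return ToyArray(result, self._version)
        return result

    def __setitem__(self, key, value):
        self._data[key] = value
        self._version.value += 1  # marks the caches of all views as stale

    @property
    def hasna(self):
        cached = self._cache.get("hasna")
        if cached is not None and cached[0] == self._version.value:
            return cached[1]
        value = bool(np.isnan(self._data).any())
        self._cache["hasna"] = (self._version.value, value)
        return value


arr = ToyArray(np.array([1.0, 2.0, 3.0]))
view = arr[:]
before = view.hasna  # computed and cached on the view
arr[0] = np.nan      # write through the parent
after = view.hasna   # recomputed because the shared counter changed
print(before, after)
```

The invalidation is deliberately conservative: a write through one view discards the caches of all views, including views that do not overlap the written element.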
Comment From: jbrockmendel
I've been experimenting on this, am finding that it is a lot of churn. Having second thoughts about whether this is worth it.
A DTA/TDA inside a Series/DataFrame has its freq set to None, so there isn't a real risk there. The only real risk is if a user is dealing with DTA/TDA directly, which is very rare.
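The claim about Series/DataFrame storage can be checked with a short sketch like the one below. It assumes a pandas version in which the array stored inside a Series has its freq dropped, as described above. The variable names are illustrative.

```python
import pandas as pd

dti = pd.date_range("2016-01-01", periods=4, freq="D")
ser = pd.Series(dti)

index_freq = dti.freq         # the index keeps its freq
column_freq = ser.array.freq  # the array stored in the Series does not
print(index_freq, column_freq)
```

Under that assumption, a stale freq could only be observed by code that holds a DatetimeArray or TimedeltaArray directly.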