Pandas DISC: Remove DTA/TDA/PA freq

For PA, freq is redundant with dtype. For DTA/TDA, it means something very different from what it means for PA. The only thing it is really used for is _time_shift (and, along with it, add/sub of integers, which is deprecated in DTI/TDI anyway).
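To make the two meanings concrete, here is a minimal sketch using only public pandas API. The variable names are illustrative, and the behaviour shown is what I'd expect on recent pandas versions rather than a guarantee.

```python
import pandas as pd

# PeriodIndex / PeriodArray: the freq is part of the dtype itself.
pi = pd.period_range("2016-01", periods=3, freq="M")
pa = pi.array  # the underlying PeriodArray
same_freq = pa.freq == pa.dtype.freq

# DatetimeIndex: the freq describes the spacing of the values, not the dtype.
dti = pd.date_range("2016-01-01", periods=4, freq="D")
irregular = dti[[0, 2, 3]]  # same dtype, but no longer evenly spaced

print(pa.dtype, same_freq)
print(dti.dtype == irregular.dtype, dti.freq, irregular.freq)
```

For the period case the freq can always be recovered from the dtype, while for the datetime case it is a property of the particular values and can be lost by an ordinary indexing operation.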

Without freq going into the constructors, DTA/TDA/PA would all just need dtype, so some simplification/unification could be done.

Frequency validation in __init__ could be saved for DTI/TDI and omitted from DTA/TDA.

If we ever come to our senses and allow 2D EAs (so we can simplify a ton of internals), freq stops making sense for 2D, so would be unused in those cases anyway.

Comment From: TomAugspurger

I'd probably support this, but would need to think things through a bit more :) The different meaning of freq between period and timedelta / datetime is quite confusing.

Additionally:

- There's a (perhaps surprising) performance penalty to specifying freq when creating the new array. It's hard to say in the abstract whether downstream operations would benefit from it.
- It's hard to know when operations should invalidate / re-infer the freq (https://github.com/pandas-dev/pandas/issues/24555).

Thinking more though, those apply equally to the index classes, and I suppose that freq would not be deprecated there (nor would we want to?).

Comment From: jorisvandenbossche

The idea would be that freq is then a cached attribute on DatetimeIndex/TimedeltaIndex, just like the other cached attributes it has (is_unique, ...)?

Some use cases of freq in time series are resampling operations and plotting. For both of these, it would be useful to cache the freq to avoid recomputing it (and also to have a way to never compute it at all, if a regular DatetimeIndex is created).
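As a rough illustration of the difference between a freq known at construction and one that has to be computed from the values, here is a small sketch with public pandas API (variable names are just for illustration):

```python
import pandas as pd

# A regular index built by date_range carries its freq from construction,
# so nothing has to be computed from the values afterwards.
regular = pd.date_range("2016-01-01", periods=5, freq="D")

# An index built from raw values has no freq attached ...
from_values = pd.DatetimeIndex(["2016-01-01", "2016-01-02", "2016-01-03"])

# ... so a consumer such as resampling or plotting would have to infer it
# by inspecting the values.
inferred = pd.infer_freq(from_values)

print(regular.freq, from_values.freq, inferred)
```

Caching the result of that inference is what avoids repeating the scan over the values on every use.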

Comment From: TomAugspurger

In theory, I suppose it could be cached on DatetimeArray/TimedeltaArray as well? I believe that __setitem__ is the only method that mutates inplace, and we already invalidate the freq there.

Comment From: jbrockmendel

Thinking about this again following the 1.0 deprecation policy discussion. I'm considering this in conjunction with removing Timestamp.freq (#15146). The difficult part of the implementation for both DTA and Timestamp is this: without self.freq, what do we do with is_month_start and the other properties that depend on self.freq?

One option would be to make them into methods to which you can pass a freq. But it isn't obvious what warning to issue for that deprecation. We want to tell users "do X now and it will be compatible with this version and future versions", but AFAICT there is no way to do that.

Comment From: mroeschke

I did something vaguely similar with weekday_name. We deprecated the weekday_name attribute and implemented day_name as a function that takes a locale at the same time.

Similarly, we can deprecate freq, is_month_start, is_*_[start|end], and introduce an is_start(freq) method.
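A rough sketch of what that could look like. Note that is_start here is a hypothetical free function, not existing pandas API; it leans on to_offset and the offsets' is_on_offset method, and the freq aliases "MS" / "BMS" are only one possible way to spell the argument.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def is_start(ts, freq):
    """Hypothetical helper: is ``ts`` an anchor date of the given 'start' freq?"""
    return to_offset(freq).is_on_offset(ts)


sunday = pd.Timestamp("2016-05-01")  # first calendar day of the month, a Sunday
monday = pd.Timestamp("2016-05-02")  # first business day of the month

# day_name() is the existing method that replaced the weekday_name attribute.
print(sunday.day_name())

# The answer to "is this a month start?" depends on which freq you mean.
print(is_start(sunday, "MS"), is_start(sunday, "BMS"))
print(is_start(monday, "MS"), is_start(monday, "BMS"))
```

The point of passing freq explicitly is that the calendar-month and business-month answers differ for the same timestamp, which is exactly the information the stored freq currently supplies.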

Comment From: jorisvandenbossche

@jbrockmendel your proposal here is to remove it from Datetime/TimedeltaArray, but keep it for the Index classes, right?

So dealing with freq, caching it, etc., would be handled at the Index level?

Comment From: jbrockmendel

> So dealing with freq, caching it, etc, would be dealt with on the Index-level ?

Correct.

Comment From: jorisvandenbossche

And what do you see as the advantage of dealing with this on the Index level instead of the array level?

The places where freq gets used (eg resample, plotting?) also support datetimes in columns, not only in the index. For those, it would be beneficial if it was handled on the array-level?

Comment From: jbrockmendel

> And what do you see as the advantage of dealing with this on the Index level instead of the array level?

__setitem__ on a DTA/TDA invalidates the freq, but I haven't found a good way to propagate that to views. As a result, we can get into situations where the freq attr is inaccurate.

Comment From: jorisvandenbossche

> __setitem__ on a DTA/TDA invalidates the freq, but i haven't found a good way to propagate that to views.

Can you describe that issue in a bit more detail? (or give an example) How is setitem related to views?

Comment From: jbrockmendel

> Can you describe that issue in a bit more detail? (or give an example) How is setitem related to views?

import pandas as pd

dti = pd.date_range("2016-01-01", periods=4)
dta = dti._data  # the DatetimeArray backing the index

view = dta[:]
dta[0] = pd.NaT

assert dta.freq is None
assert view[0] is pd.NaT
assert view.freq is None   # <-- fails

Comment From: jorisvandenbossche

Thanks for that example! That clarifies ;)

That's an interesting problem. So Index circumvents this by being "immutable". I am wondering if that doesn't point to a general issue with EAs and views.

Do we have other EAs with attributes / metadata? IntervalArray has the closed attribute, but that is something that doesn't depend on the actual values (meaning, you can't "invalidate" closed by setting a value in the IntervalArray). In GeoPandas, we just added a crs attribute (holding metadata information about the coordinate reference system), but there, when setting values in place (__setitem__), we assume the user is responsible to set a value that matches the CRS (so setting a value also shouldn't invalidate the crs attribute). Although there might be some esoteric cases where you can run into problems with this.

We have been thinking about having a faster hasna metadata for the masked array EAs. One idea is to have this check whether the mask is None, if we can make the mask optional. Another idea would be to actually "cache" this and invalidate it when the array gets mutated. But that would run into the same problem as the freq of DTA.

Comment From: jorisvandenbossche

To put it another way: 1) would it technically be possible to implement something similar to "cache_readonly" for EAs (that can handle those problems) and 2) if possible, would it be useful?

Comment From: jbrockmendel

> To put it another way: 1) would it technically be possible to implement something similar to "cache_readonly" for EAs (that can handle those problems) and 2) if possible, would it be useful?

So something that, when cleared/invalidated, would also clear the cache on any other EAs that are a view on the same data?
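For the sake of discussion, one way such a shared invalidation could work is a version counter stored next to the data. This is a pure-Python sketch, not pandas code: SharedBuffer, CachedView and step are hypothetical names, with step standing in for freq and None standing in for NaT.

```python
class SharedBuffer:
    """Underlying data plus a version counter bumped on every write."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 0


class CachedView:
    """Array-like whose cached 'step' is only trusted for one buffer version."""

    def __init__(self, buffer):
        self._buffer = buffer
        self._step = None
        self._step_version = -1  # buffer version at which _step was computed

    def view(self):
        # A new object that shares the same buffer (and thus the same counter).
        return CachedView(self._buffer)

    def __setitem__(self, i, value):
        self._buffer.values[i] = value
        self._buffer.version += 1  # invalidates the caches of *all* views

    @property
    def step(self):
        if self._step_version != self._buffer.version:
            self._step = self._infer_step()
            self._step_version = self._buffer.version
        return self._step

    def _infer_step(self):
        vals = self._buffer.values
        if len(vals) < 2 or any(v is None for v in vals):
            return None
        steps = {b - a for a, b in zip(vals, vals[1:])}
        return steps.pop() if len(steps) == 1 else None


arr = CachedView(SharedBuffer([0, 10, 20, 30]))
view = arr.view()
before = (arr.step, view.step)  # both caches are populated here
arr[0] = None                   # write through one object only
after = (arr.step, view.step)   # both see the invalidation
print(before, after)
```

The cost of this design is an extra indirection and a version check on every cached read, and it only works if every write goes through something that bumps the counter.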

Comment From: jbrockmendel

I've been experimenting with this and am finding that it is a lot of churn. I'm having second thoughts about whether this is worth it.

A DTA/TDA inside a Series/DataFrame has its freq set to None, so there isn't a real risk there. The only real risk is if a user is dealing with DTA/TDA directly, which is very rare.
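A quick way to check this, assuming a reasonably recent pandas version (variable names are illustrative):

```python
import pandas as pd

dti = pd.date_range("2016-01-01", periods=4, freq="D")
ser = pd.Series(dti)

# The index keeps its freq; the array stored inside the Series does not.
index_freq = dti.freq
series_array_freq = ser.array.freq

print(index_freq, series_array_freq)
```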