In my attempt to enable DataFrame reductions for ExtensionArrays (and at the same time cleaning up DataFrame._reduce
a bit) at https://github.com/pandas-dev/pandas/pull/32867, I am running into the following inconsistency.
Consider a Timedelta array, and how it behaves for the any
and all
logical functions (on current master, bit similar for released pandas):
>>> td = pd.timedelta_range(0, periods=3)
# Index
>>> td.any()
...
TypeError: cannot perform any with this index type: TimedeltaIndex
# Series
>>> pd.Series(td).any()
...
TypeError: cannot perform any with type timedelta64[ns]
# DataFrame
>>> pd.DataFrame({'a': td}).any()
a True
dtype: bool
(and exactly the same pattern for datetime64[ns])
For DataFrame, it works because there we call pandas.core.nanops.nanany()
, and our internal version of nan-aware any
/all
work fine with timedelta64/datetime64 dtypes.
The reason it fails for Index and Series is because we simply up front say: this operation is not possible with this dtype (we never get to call nanany()
in those cases). But in the DataFrame reduction, we simply try it on df.values
, and it works.
This is clearly an inconsistency that we should solve. And especially for my PR https://github.com/pandas-dev/pandas/pull/32867 this gives problems, as I want to rely on the array (column-wise) ops for DataFrame reductions, but thus those give a differen result leading to failing tests.
The question is thus: do we think that the any()
and all()
operations make sense for datetime-like data?
(and if yes, we can simply enable them for the array/index/series case)
- For timedelta, I think you can say there is a concept of "zero" and "non-zero" (which is what any/all are testing in the end, for non-boolean, numeric data)
- But for datetimes this might make less sense? The fact that 1970-01-01 00:00:00 would be "False" is just an implementation detail of the numbering starting at that point in time.
As reference, this operation fails in numpy for both timedelta and datetime:
>>> np.array([0, 1], dtype="datetime64[ns]").any()
...
TypeError: No loop matching the specified signature and casting was found for ufunc logical_or
>>> np.array([0, 1], dtype="timedelta64[ns]").any()
...
TypeError: No loop matching the specified signature and casting was found for ufunc logical_or
But for timedeltas, that might rather be an issue of "never got implemented" rather than "deliberately not implemented" ? (there was a similar conversation about "mean" for timedelta) cc @shoyer
Comment From: jbrockmendel
xref #19759
Comment From: jbrockmendel
ATM i lean towards deprecating the nanops behavior and not supporting any/all for either dt64 or td64.
If numpy were to implement it for td64 id be OK with matching that.
We should also look at consistency with tda.astype(bool)
, which we apparently do support (though looks like Index gives object-dtype whereas DTA/TDA give bool-dtype)
Comment From: simonjayhawkins
@jbrockmendel did #38723 close this?
Comment From: jbrockmendel
no, that ensured consistency between dt64 vs dt64tz, but didn't pin down the desired behavior
Comment From: simonjayhawkins
Thanks @jbrockmendel for confirming. I didn't check in detail. Will remove the milestone.
Comment From: jbrockmendel
i propose we deprecate the any/all behavior for dt64/dt64tz, keep it for td64