Problem description
pandas has the Interval type, which is extremely useful, but the standard interval arithmetic operations are missing from its implementation. I would be happy to work on this enhancement.
One should be able to do the following with pandas.Interval.
The example uses numeric intervals, but the same operations are also valid for time series intervals.
# proposed operations and suggested behaviour:
import pandas as pd
i0 = pd.Interval(0, 3, closed='right')
i1 = pd.Interval(2, 4, closed='right')
i2 = pd.Interval(5, 8, closed='right')
# 1. intersection
i0.intersection(i1)
# should return: pd.Interval(2, 3, closed='right')
i0.intersection(i2)
# should return: np.nan (or, perhaps a more appropriate null-interval representation)
# 2. union
i0.union(i1)
# should return: pd.IntervalIndex([pd.Interval(0, 4, closed='right')])
i0.union(i2)
# should return: pd.IntervalIndex([pd.Interval(0, 3, closed='right'), pd.Interval(5, 8, closed='right')])
# 3. overlaps
i0.overlaps(i1)
# should return: True
i0.overlaps(i2)
# should return: False
# 4. difference
i0.difference(i1)
# should return: pd.Interval(0, 2, closed='right')
i0.difference(i2)
# should return: pd.Interval(0, 3, closed='right')
Comment From: jschendel
xref #19480
This is reasonable, though some care is needed if these operations are to work between intervals with mixed `closed`. My initial inclination was to simply not allow mixed `closed` operations, but they seem generally well defined after thinking about it.
For example, the following seems reasonable:
In [2]: i0 = pd.Interval(0, 2, closed='both')
In [3]: i1 = pd.Interval(1, 3, closed='neither')
In [4]: i0.intersection(i1)
Out[4]: Interval(1, 2, closed='right')
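To make the mixed `closed` rule concrete, here is a minimal sketch of how such an intersection could be computed. `interval_intersection` and `closed_from_flags` are hypothetical helpers, not pandas API, and returning `None` for an empty result is only an assumption (the null representation is still an open question in this issue). It relies on the existing `Interval.overlaps`, `closed_left` and `closed_right`.

```python
import pandas as pd


def closed_from_flags(closed_left: bool, closed_right: bool) -> str:
    """Translate two endpoint flags into the string accepted by `closed=`."""
    if closed_left and closed_right:
        return "both"
    if closed_left:
        return "left"
    if closed_right:
        return "right"
    return "neither"


def interval_intersection(a: pd.Interval, b: pd.Interval):
    """Hypothetical helper (not a pandas API): intersection of two intervals.

    Returns None when the intervals share no point (an assumption; the
    issue leaves the null representation open).
    """
    if not a.overlaps(b):
        return None
    # Left endpoint: the larger one wins; on a tie it is closed only if both are.
    if a.left != b.left:
        lo = a if a.left > b.left else b
        left, closed_left = lo.left, lo.closed_left
    else:
        left, closed_left = a.left, a.closed_left and b.closed_left
    # Right endpoint: the smaller one wins, with the same tie rule.
    if a.right != b.right:
        hi = a if a.right < b.right else b
        right, closed_right = hi.right, hi.closed_right
    else:
        right, closed_right = a.right, a.closed_right and b.closed_right
    return pd.Interval(left, right, closed=closed_from_flags(closed_left, closed_right))


i0 = pd.Interval(0, 2, closed="both")
i1 = pd.Interval(1, 3, closed="neither")
result = interval_intersection(i0, i1)
print(result)
```

With the two intervals from the session above, `result` compares equal to `pd.Interval(1, 2, closed='right')`.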
The only thing that immediately comes to mind as problematic is `union`. Mixed `closed` should be fine if the intervals are overlapping. For example, using the intervals defined above, the following seems reasonable:
In [5]: i0.union(i1)
Out[5]: Interval(0, 3, closed='left')
Non-overlapping intervals with the same `closed` should also be fine, though I'd like to note that my preference would be to return an `IntervalArray` (newly implemented, slated for 0.24.0) instead of an `IntervalIndex`:
In [6]: i2 = pd.Interval(8, 10, closed='both')
In [7]: i0.union(i2)
Out[7]:
IntervalArray([[0, 2], [8, 10]],
closed='both',
dtype='interval[int64]')
The problematic case for `union` is when you have mixed `closed` intervals that are non-overlapping. We can't use `IntervalArray` (or `IntervalIndex`) in that case, as they require all intervals to be closed on the same side. The options would be to either raise an error, or return a numpy object dtype array:
In [8]: i3 = pd.Interval(8, 10, closed='neither')
In [9]: i0.union(i3)
Out[9]: array([Interval(0, 2, closed='both'), Interval(8, 10, closed='neither')], dtype=object)
I'm not sure an object dtype array actually provides any utility here, and it seems a bit unnatural, so I'd lean towards raising. I could be convinced otherwise if anyone has a practical use for it, though.
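The three `union` cases above could be sketched roughly as follows. `interval_union` is a hypothetical helper, not pandas API, and it takes the "raise" option for the mixed `closed`, non-overlapping case. As a simplification, intervals that merely touch (e.g. `[0, 1)` and `[1, 2]`) are treated as non-overlapping, because the sketch relies on `Interval.overlaps`.

```python
import pandas as pd

# Lookup from (closed_left, closed_right) flags to the `closed=` string.
FLAGS_TO_CLOSED = {
    (True, True): "both",
    (True, False): "left",
    (False, True): "right",
    (False, False): "neither",
}


def interval_union(a: pd.Interval, b: pd.Interval):
    """Hypothetical helper (not a pandas API) covering the three union cases."""
    if a.overlaps(b):
        # Overlapping: a single Interval, even with mixed `closed`.
        # The outermost endpoint wins; on a tie it is closed if either is.
        if a.left != b.left:
            lo = a if a.left < b.left else b
            left, closed_left = lo.left, lo.closed_left
        else:
            left, closed_left = a.left, a.closed_left or b.closed_left
        if a.right != b.right:
            hi = a if a.right > b.right else b
            right, closed_right = hi.right, hi.closed_right
        else:
            right, closed_right = a.right, a.closed_right or b.closed_right
        return pd.Interval(left, right, closed=FLAGS_TO_CLOSED[(closed_left, closed_right)])
    if a.closed != b.closed:
        # Non-overlapping with mixed `closed`: no IntervalArray can hold both.
        raise ValueError("union of non-overlapping intervals with mixed closed")
    # Non-overlapping with the same `closed`: an IntervalArray of both pieces.
    first, second = sorted([a, b], key=lambda iv: iv.left)
    return pd.arrays.IntervalArray([first, second])


i0 = pd.Interval(0, 2, closed="both")
merged = interval_union(i0, pd.Interval(1, 3, closed="neither"))
disjoint = interval_union(i0, pd.Interval(8, 10, closed="both"))
```

Here `merged` is a single `Interval` closed on the left, `disjoint` is an `IntervalArray` with `closed='both'`, and `interval_union(i0, pd.Interval(8, 10, closed='neither'))` raises `ValueError`.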
Comment From: jschendel
Actually, I think `difference` could also be similarly problematic, even when both intervals are closed on the same side. Specifically, I'd expect problematic behavior to occur with nested intervals:
In [2]: i0 = pd.Interval(0, 3, closed='both')
In [3]: i1 = pd.Interval(1, 2, closed='both')
In [4]: i0.difference(i1)
Out[4]: array([Interval(0, 1, closed='left'), Interval(2, 3, closed='right')], dtype=object)
But with mixed `closed` you could actually get a valid `IntervalArray`:
In [5]: i2 = pd.Interval(1, 2, closed='neither')
In [6]: i0.difference(i2)
Out[6]:
IntervalArray([[0, 1], [2, 3]],
closed='both',
dtype='interval[int64]')
So I'm not really sure if `difference` should be supported, as it can become a bit confusing from a user perspective to keep track of what is valid and what isn't (i.e. what returns an `IntervalArray` vs. what raises/returns an object dtype array).
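The way `closed` flips at the cut points can be seen in a small sketch. `interval_difference` is a hypothetical helper, not pandas API, and it sidesteps the container question by returning a plain Python list of the remaining pieces (zero, one or two `Interval` objects).

```python
import pandas as pd

# Lookup from (closed_left, closed_right) flags to the `closed=` string.
FLAGS_TO_CLOSED = {
    (True, True): "both",
    (True, False): "left",
    (False, True): "right",
    (False, False): "neither",
}


def _make(left, right, closed_left, closed_right):
    return pd.Interval(left, right, closed=FLAGS_TO_CLOSED[(closed_left, closed_right)])


def interval_difference(a: pd.Interval, b: pd.Interval) -> list:
    """Hypothetical helper (not a pandas API): pieces of `a` not covered by `b`."""
    if not a.overlaps(b):
        return [a]
    pieces = []
    # Piece of `a` left of `b`: it ends at b.left and includes that
    # point only when `b` itself is open there.
    if a.left < b.left or (a.left == b.left and a.closed_left and not b.closed_left):
        pieces.append(_make(a.left, b.left, a.closed_left, not b.closed_left))
    # Piece of `a` right of `b`: it starts at b.right, with the mirrored rule.
    if a.right > b.right or (a.right == b.right and a.closed_right and not b.closed_right):
        pieces.append(_make(b.right, a.right, not b.closed_right, a.closed_right))
    return pieces


i0 = pd.Interval(0, 3, closed="both")
nested_closed = interval_difference(i0, pd.Interval(1, 2, closed="both"))
nested_open = interval_difference(i0, pd.Interval(1, 2, closed="neither"))
```

For the nested examples above, `nested_closed` holds one left-closed and one right-closed piece (so no `IntervalArray` could store it), while `nested_open` holds two pieces that are both `closed='both'`.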
Comment From: haleemur
@jschendel you are correct in the above observations. The complexity in my proposal derives from trying to return multiple intervals in either `IntervalIndex` or `IntervalArray` with mixed boundary types.
I looked into the behaviour of PostgreSQL's range types, and perhaps we can model these operations similarly. PostgreSQL avoids the mixed boundary type problem by only returning a single value (a boolean or a single contiguous interval) as the result.
I think the following range operators from PostgreSQL, which take two intervals as arguments, should be interesting for pandas users:
Operator | Description | Example | Result |
---|---|---|---|
&& | overlap (have points in common) | int8range(3,7) && int8range(4,12) | t |
<< | strictly left of | int8range(1,10) << int8range(100,110) | t |
>> | strictly right of | int8range(50,60) >> int8range(20,30) | t |
&< | does not extend to the right of | int8range(1,20) &< int8range(18,20) | t |
&> | does not extend to the left of | int8range(7,20) &> int8range(5,10) | t |
-\|- | is adjacent to | numrange(1.1,2.2) -\|- numrange(2.2,3.3) | t |
+ | union | numrange(5,15) + numrange(10,20) | [5,20) |
* | intersection | int8range(5,15) * int8range(10,20) | [10,15) |
- | difference | int8range(5,15) - int8range(10,20) | [5,10) |
and this function:
Function | Return Type | Description | Example | Result |
---|---|---|---|---|
range_merge(anyrange, anyrange) | anyrange | the smallest range which includes both of the given ranges | range_merge('[1,2)'::int4range, '[3,4)'::int4range) | [1,4) |
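As a rough illustration of how a few of these could map onto `pd.Interval`, here is a sketch of `<<`, `-|-` and `range_merge`. The function names are hypothetical, not pandas API. One caveat: the sketch treats endpoints as continuous, whereas PostgreSQL's integer ranges are discrete, so `[1,2]` and `[3,4]` count as adjacent in PostgreSQL but not here.

```python
import pandas as pd

# Lookup from (closed_left, closed_right) flags to the `closed=` string.
FLAGS_TO_CLOSED = {
    (True, True): "both",
    (True, False): "left",
    (False, True): "right",
    (False, False): "neither",
}


def strictly_left_of(a: pd.Interval, b: pd.Interval) -> bool:
    """Sketch of `a << b`: every point of `a` lies before every point of `b`."""
    if a.right != b.left:
        return a.right < b.left
    # Endpoints touch: still strictly left unless both include the shared point.
    return not (a.closed_right and b.closed_left)


def is_adjacent(a: pd.Interval, b: pd.Interval) -> bool:
    """Sketch of `a -|- b`: the intervals touch with no gap and no shared point."""

    def touches(x, y):
        # Exactly one of the two intervals owns the shared endpoint.
        return x.right == y.left and x.closed_right != y.closed_left

    return touches(a, b) or touches(b, a)


def range_merge(a: pd.Interval, b: pd.Interval) -> pd.Interval:
    """Sketch of range_merge: the smallest interval containing both inputs."""
    if a.left != b.left:
        lo = a if a.left < b.left else b
        left, closed_left = lo.left, lo.closed_left
    else:
        left, closed_left = a.left, a.closed_left or b.closed_left
    if a.right != b.right:
        hi = a if a.right > b.right else b
        right, closed_right = hi.right, hi.closed_right
    else:
        right, closed_right = a.right, a.closed_right or b.closed_right
    return pd.Interval(left, right, closed=FLAGS_TO_CLOSED[(closed_left, closed_right)])


# PostgreSQL ranges default to '[)' bounds, i.e. closed='left' in pandas terms.
merged = range_merge(pd.Interval(1, 2, closed="left"), pd.Interval(3, 4, closed="left"))
```

Since `range_merge` always returns a single interval, it avoids the container problem entirely, at the cost of covering the gap between the inputs.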
Examples of how PostgreSQL handles non-trivial range operations:
intersection:
hal=> select int8range(4,8) * int8range(10,20);
?column?
----------
empty
(1 row)
difference:
hal=> select int8range(4,8) - int8range(5,7);
ERROR: result of range difference would not be contiguous
union:
hal=> select int8range(4,8) + int8range(10,20);
ERROR: result of range union would not be contiguous
I think we could implement behaviour similar to casting functions, where the `difference` & `union` functions would have an `errors` parameter, with the following options: raise|coerce|first|last|greatest|smallest
- raise: raises an error
- coerce: sets the value to np.nan
- first: returns the first interval in the result array
- last: returns the last interval in the result array
- greatest: returns the biggest interval in the result array
- smallest: returns the smallest interval in the result array
`greatest` & `smallest` could be useful options for the difference operation.
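The options listed above could be sketched as a small resolver applied to the pieces a `difference` or `union` produces. `resolve_pieces` is a hypothetical name, not pandas API. Reading "biggest" and "smallest" as longest and shortest by `Interval.length`, and mapping an empty result to `np.nan`, are both assumptions.

```python
import numpy as np
import pandas as pd


def resolve_pieces(pieces, errors="raise"):
    """Hypothetical helper: reduce a list of result intervals to a single value.

    `pieces` is a list of pd.Interval ordered from left to right.
    """
    if not pieces:
        # Assumption: an empty result is represented by NaN.
        return np.nan
    if len(pieces) == 1:
        # Contiguous result: nothing to resolve.
        return pieces[0]
    if errors == "raise":
        raise ValueError("result would not be contiguous")
    if errors == "coerce":
        return np.nan
    if errors == "first":
        return pieces[0]
    if errors == "last":
        return pieces[-1]
    if errors == "greatest":
        # Assumption: "biggest" means the longest interval.
        return max(pieces, key=lambda iv: iv.length)
    if errors == "smallest":
        return min(pieces, key=lambda iv: iv.length)
    raise ValueError(f"unknown errors option: {errors!r}")


pieces = [pd.Interval(0, 1, closed="left"), pd.Interval(2, 5, closed="right")]
longest = resolve_pieces(pieces, errors="greatest")
first = resolve_pieces(pieces, errors="first")
```

With these two pieces, `longest` is the interval from 2 to 5 and `first` is the interval from 0 to 1; `errors='raise'` mirrors the PostgreSQL "would not be contiguous" error.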
Comment From: randolf-scholz
Why do `IntervalArrays` have to have the same closedness for all elements? I think this really limits their usefulness, presumably just to save a pair of bits.