Pandas ENH: Interval type should support intersection, union & overlaps & difference

Problem description

pandas has the Interval type, which is extremely useful, but the standard interval arithmetic operations are missing from its implementation. I would be happy to work on this enhancement.

One should be able to do the following with `pandas.Interval`

The example uses numeric intervals, but the same operations are also valid for time series intervals.

# Proposed operations and suggested behaviour:

import pandas as pd
i0 = pd.Interval(0, 3, closed='right')
i1 = pd.Interval(2, 4, closed='right')
i2 = pd.Interval(5, 8, closed='right')

# 1. intersection
i0.intersection(i1)
# should return: pd.Interval(2, 3, closed='right')
i0.intersection(i2)
# should return: np.nan (or, perhaps a more appropriate null-interval representation)

# 2. union
i0.union(i1)
# should return: pd.IntervalIndex([pd.Interval(0, 4, closed='right')])
i0.union(i2)
# should return: pd.IntervalIndex([pd.Interval(0, 3, closed='right'), pd.Interval(5, 8, closed='right')])

# 3. overlaps
i0.overlaps(i1)
# should return: True
i0.overlaps(i2)
# should return: False

# 4. difference
i0.difference(i1)
# should return: pd.Interval(0, 2, closed='right')
i0.difference(i2)
# should return: pd.Interval(0, 3, closed='right')
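The `overlaps` check proposed above can be sketched with only the public attributes of `pd.Interval` (`left`, `right`, `closed_left`, `closed_right`). The helper name `overlaps` below is a free function written for illustration, not an existing pandas method, and it assumes both intervals are non-empty.

```python
import pandas as pd


def overlaps(a: pd.Interval, b: pd.Interval) -> bool:
    """Return True if the two intervals share at least one point."""
    # One interval ends strictly before the other begins: no common point.
    if a.left > b.right or b.left > a.right:
        return False
    # The intervals touch at a single point; that point is shared only
    # if both intervals include it.
    if a.right == b.left:
        return a.closed_right and b.closed_left
    if b.right == a.left:
        return b.closed_right and a.closed_left
    return True


i0 = pd.Interval(0, 3, closed="right")
i1 = pd.Interval(2, 4, closed="right")
i2 = pd.Interval(5, 8, closed="right")
print(overlaps(i0, i1))  # True
print(overlaps(i0, i2))  # False
```

Two right-closed intervals such as `(0, 1]` and `(1, 2]` touch at 1 but do not overlap under this definition, because the second interval excludes 1.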

Comment From: jschendel

xref #19480

This is reasonable, though some care is needed if these operations are to work between intervals with mixed closed. My initial inclination was to simply not allow mixed closed operations, but they seem generally well defined after thinking about it.

For example, the following seems reasonable:

In [2]: i0 = pd.Interval(0, 2, closed='both')

In [3]: i1 = pd.Interval(1, 3, closed='neither')

In [4]: i0.intersection(i1)
Out[4]: Interval(1, 2, closed='right')
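A minimal sketch of how a mixed-closed intersection could be computed. This is an illustration under assumptions, not the pandas implementation: `intersection` is a hypothetical free function, and `None` stands in for whatever empty or null representation would eventually be chosen.

```python
from typing import Optional

import pandas as pd

_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def intersection(a: pd.Interval, b: pd.Interval) -> Optional[pd.Interval]:
    """Intersect two intervals; None stands in for an empty result."""
    # Left bound: the larger left endpoint wins. On a tie the bound is
    # closed only if both intervals are closed there.
    if a.left > b.left:
        left, closed_left = a.left, a.closed_left
    elif b.left > a.left:
        left, closed_left = b.left, b.closed_left
    else:
        left, closed_left = a.left, a.closed_left and b.closed_left
    # Right bound: the smaller right endpoint wins, with the same tie rule.
    if a.right < b.right:
        right, closed_right = a.right, a.closed_right
    elif b.right < a.right:
        right, closed_right = b.right, b.closed_right
    else:
        right, closed_right = a.right, a.closed_right and b.closed_right
    # Empty when the bounds cross, or meet at a point that is excluded.
    if left > right or (left == right and not (closed_left and closed_right)):
        return None
    return pd.Interval(left, right, closed=_CLOSED[(closed_left, closed_right)])


i0 = pd.Interval(0, 2, closed="both")
i1 = pd.Interval(1, 3, closed="neither")
print(intersection(i0, i1) == pd.Interval(1, 2, closed="right"))  # True
```

Each endpoint of the result inherits its closedness from whichever input supplies that endpoint, which is why `[0, 2]` and `(1, 3)` give `(1, 2]`.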

The only thing that immediately comes to mind as problematic is union. Mixed closed should be fine if the intervals are overlapping. For example, using the intervals defined above, the following seems reasonable:

In [5]: i0.union(i1)
Out[5]: Interval(0, 3, closed='left')

Non-overlapping intervals with the same closed should also be fine, though I'd like to note that my preference would be to return an IntervalArray (newly implemented, slated for 0.24.0) instead of an IntervalIndex:

In [6]: i2 = pd.Interval(8, 10, closed='both')

In [7]: i0.union(i2)
Out[7]:
IntervalArray([[0, 2], [8, 10]],
              closed='both',
              dtype='interval[int64]')

The problematic case for union is when you have mixed closed intervals that are non-overlapping. We can't use IntervalArray (or IntervalIndex) in that case, as they require all intervals to be closed on the same side. The options would be to either raise an error, or return a numpy object dtype array:

In [8]: i3 = pd.Interval(8, 10, closed='neither')

In [9]: i0.union(i3)
Out[9]: array([Interval(0, 2, closed='both'), Interval(8, 10, closed='neither')], dtype=object)

I'm not sure if an object dtype array actually provides any utility here, and it seems a bit unnatural, so I'd lean towards raising. Could be convinced otherwise if anyone has a practical use for it though.
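The "raise" option for union could look roughly like the sketch below. It returns a single `pd.Interval` whenever the two inputs overlap or touch without a gap, and raises otherwise. The function name `union` and the error message are illustrative assumptions, not pandas API.

```python
import pandas as pd

_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def union(a: pd.Interval, b: pd.Interval) -> pd.Interval:
    """Union of two intervals, raising if the result is not contiguous."""
    # Order the inputs so that `lo` starts first.
    lo, hi = (a, b) if a.left <= b.left else (b, a)
    # A gap exists if `hi` starts beyond the end of `lo`, or exactly at
    # the end of `lo` while neither interval contains that point.
    if hi.left > lo.right or (
        hi.left == lo.right and not (lo.closed_right or hi.closed_left)
    ):
        raise ValueError("result of union would not be contiguous")
    # An endpoint of the union is closed if any input supplying it is closed.
    closed_left = bool(lo.closed_left or (hi.left == lo.left and hi.closed_left))
    if lo.right > hi.right:
        right, closed_right = lo.right, lo.closed_right
    elif hi.right > lo.right:
        right, closed_right = hi.right, hi.closed_right
    else:
        right, closed_right = lo.right, lo.closed_right or hi.closed_right
    return pd.Interval(lo.left, right, closed=_CLOSED[(closed_left, closed_right)])


i0 = pd.Interval(0, 2, closed="both")
i1 = pd.Interval(1, 3, closed="neither")
print(union(i0, i1) == pd.Interval(0, 3, closed="left"))  # True
```

With this design the return type is always a scalar `Interval`, so the question of `IntervalArray` versus object dtype array never arises; the cost is that `union(i0, i3)` for disjoint inputs raises `ValueError`.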

Comment From: jschendel

Actually, I think difference could also be similarly problematic, even when both intervals are closed on the same side. Specifically, I'd expect problematic behavior to occur with nested intervals:

In [2]: i0 = pd.Interval(0, 3, closed='both')

In [3]: i1 = pd.Interval(1, 2, closed='both')

In [4]: i0.difference(i1)
Out[4]: array([Interval(0, 1, closed='left'), Interval(2, 3, closed='right')], dtype=object)

But with mixed closed you could actually get a valid IntervalArray:

In [5]: i2 = pd.Interval(1, 2, closed='neither')

In [6]: i0.difference(i2)
Out[6]:
IntervalArray([[0, 1], [2, 3]],
              closed='both',
              dtype='interval[int64]')

So not really sure if difference should be supported, as it can become a bit confusing from a user perspective to keep track of what is valid and what isn't (i.e. what returns an IntervalArray vs. what raises/returns an object dtype array).
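To make the closedness bookkeeping concrete, here is a sketch of a difference that returns a plain Python list of zero, one, or two intervals. The list return type and the names `difference` and `_make` are assumptions for illustration only; the sketch sidesteps the `IntervalArray` versus object dtype question rather than answering it.

```python
from typing import List

import pandas as pd

_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def _make(left, closed_left, right, closed_right) -> List[pd.Interval]:
    """Return [interval], or [] if the bounds describe an empty set."""
    if left > right or (left == right and not (closed_left and closed_right)):
        return []
    return [pd.Interval(left, right, closed=_CLOSED[(closed_left, closed_right)])]


def difference(a: pd.Interval, b: pd.Interval) -> List[pd.Interval]:
    """Points of `a` that are not in `b`, as zero, one, or two intervals."""
    pieces = []
    # Part of `a` to the left of `b`: cut off at b.left, keeping that
    # point only when `b` excludes it.
    if a.right < b.left:
        right, closed_right = a.right, a.closed_right
    elif a.right > b.left:
        right, closed_right = b.left, not b.closed_left
    else:
        right, closed_right = a.right, a.closed_right and not b.closed_left
    pieces += _make(a.left, a.closed_left, right, closed_right)
    # Part of `a` to the right of `b`: starts at b.right, keeping that
    # point only when `b` excludes it.
    if a.left > b.right:
        left, closed_left = a.left, a.closed_left
    elif a.left < b.right:
        left, closed_left = b.right, not b.closed_right
    else:
        left, closed_left = a.left, a.closed_left and not b.closed_right
    pieces += _make(left, closed_left, a.right, a.closed_right)
    return pieces


i0 = pd.Interval(0, 3, closed="both")
i1 = pd.Interval(1, 2, closed="both")
print([iv.closed for iv in difference(i0, i1)])  # ['left', 'right']
```

The nested case produces two pieces whose closedness differs from each other and from both inputs, which is exactly what prevents them from being stored in a single `IntervalArray`.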

Comment From: haleemur

@jschendel you are correct in the above observations. The complexity in my proposal derives from trying to return multiple intervals in either IntervalIndex or IntervalArray with mixed boundary types.

I looked into the behaviour of PostgreSQL's range types, and perhaps we can model these operations similarly. PostgreSQL avoids the mixed boundary type problem by only returning a single value (a boolean or a single contiguous interval) as the result.

I think the following PostgreSQL range operators, each taking two intervals as arguments, should be interesting for pandas users:

| Operator | Description | Example | Result |
| --- | --- | --- | --- |
| `&&` | overlap (have points in common) | `int8range(3,7) && int8range(4,12)` | `t` |
| `<<` | strictly left of | `int8range(1,10) << int8range(100,110)` | `t` |
| `>>` | strictly right of | `int8range(50,60) >> int8range(20,30)` | `t` |
| `&<` | does not extend to the right of | `int8range(1,20) &< int8range(18,20)` | `t` |
| `&>` | does not extend to the left of | `int8range(7,20) &> int8range(5,10)` | `t` |
| `-\|-` | is adjacent to | `numrange(1.1,2.2) -\|- numrange(2.2,3.3)` | `t` |
| `+` | union | `numrange(5,15) + numrange(10,20)` | `[5,20)` |
| `*` | intersection | `int8range(5,15) * int8range(10,20)` | `[10,15)` |
| `-` | difference | `int8range(5,15) - int8range(10,20)` | `[5,10)` |

and this function:

| Function | Return Type | Description | Example | Result |
| --- | --- | --- | --- | --- |
| `range_merge(anyrange, anyrange)` | anyrange | the smallest range which includes both of the given ranges | `range_merge('[1,2)'::int4range, '[3,4)'::int4range)` | `[1,4)` |
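A rough Python analogue of three of these PostgreSQL operations, written against `pd.Interval`. The function names `strictly_left_of`, `is_adjacent`, and `range_merge` are hypothetical and chosen to mirror the PostgreSQL names. The sketch treats intervals as continuous, so it does not reproduce PostgreSQL's canonicalisation of discrete ranges.

```python
import pandas as pd

_CLOSED = {(True, True): "both", (True, False): "left",
           (False, True): "right", (False, False): "neither"}


def strictly_left_of(a: pd.Interval, b: pd.Interval) -> bool:
    """Analogue of `<<`: every point of `a` lies below every point of `b`."""
    if a.right == b.left:
        # They touch; `a` is strictly left unless both contain the point.
        return not (a.closed_right and b.closed_left)
    return a.right < b.left


def is_adjacent(a: pd.Interval, b: pd.Interval) -> bool:
    """Analogue of `-|-`: the intervals touch with no gap and no overlap."""
    # Exactly one of the two intervals must contain the shared endpoint.
    return bool(
        (a.right == b.left and a.closed_right != b.closed_left)
        or (b.right == a.left and b.closed_right != a.closed_left)
    )


def range_merge(a: pd.Interval, b: pd.Interval) -> pd.Interval:
    """Analogue of `range_merge`: the smallest interval containing both."""
    if a.left == b.left:
        left, closed_left = a.left, a.closed_left or b.closed_left
    else:
        first = a if a.left < b.left else b
        left, closed_left = first.left, first.closed_left
    if a.right == b.right:
        right, closed_right = a.right, a.closed_right or b.closed_right
    else:
        last = a if a.right > b.right else b
        right, closed_right = last.right, last.closed_right
    return pd.Interval(left, right, closed=_CLOSED[(closed_left, closed_right)])


a = pd.Interval(1, 2, closed="left")
b = pd.Interval(3, 4, closed="left")
print(range_merge(a, b) == pd.Interval(1, 4, closed="left"))  # True
```

Unlike union, `range_merge` always yields a single interval, because it fills any gap between the inputs.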

Examples of how PostgreSQL handles non-trivial range operations:

intersection:

hal=> select int8range(4,8) * int8range(10,20);
 ?column?
----------
 empty
(1 row)

difference:

hal=> select int8range(4,8) - int8range(5,7);
ERROR:  result of range difference would not be contiguous

union:

hal=> select int8range(4,8) + int8range(10,20);
ERROR:  result of range union would not be contiguous

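When the result would not be a single contiguous interval, pandas could offer the following options:

raise: raises an error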
coerce: sets the value to np.nan
first: returns the first interval in the result array
last: returns the last interval in the result array
greatest: returns the biggest interval in the result array
smallest: returns the smallest interval in the result array

greatest & smallest could be useful options for the difference operation.
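One way these options could be wired up is a small reducer applied to the list of result pieces. Everything here is an assumption made for illustration: the function name `resolve`, the keyword `how`, the use of `None` in place of `np.nan`, and the reading of "greatest" and "smallest" as largest and smallest by `Interval.length`.

```python
from typing import List, Optional

import pandas as pd


def resolve(pieces: List[pd.Interval], how: str = "raise") -> Optional[pd.Interval]:
    """Reduce a list of result intervals to a single value per `how`."""
    if not pieces:
        return None  # empty result; stand-in for np.nan or a null interval
    if len(pieces) == 1:
        return pieces[0]  # already a single contiguous interval
    if how == "raise":
        raise ValueError("result is not a single contiguous interval")
    if how == "coerce":
        return None
    if how == "first":
        return pieces[0]
    if how == "last":
        return pieces[-1]
    if how == "greatest":
        return max(pieces, key=lambda iv: iv.length)
    if how == "smallest":
        return min(pieces, key=lambda iv: iv.length)
    raise ValueError(f"unknown option: {how!r}")


pieces = [pd.Interval(0, 1, closed="left"), pd.Interval(2, 5, closed="right")]
print(resolve(pieces, how="greatest") == pieces[1])  # True
```

Keeping the policy in a separate step means the interval arithmetic itself stays unambiguous, and only the final reduction depends on the caller's choice.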

Comment From: randolf-scholz

Why do IntervalArrays have to have the same closedness for all elements? I think this really limits their usefulness, just to presumably save a pair of bits.
