xref #9595 for an overview the the indexing API, this is focused on the state of the code.

Needs Eyeballs

  • Tests are scattered:
  • tests/indexing/
  • tests/(frame|series)/indexing/
  • tests/indexes/*/indexing

    • Moreover, many of the indexing tests are leftover from ix, and it isn't clear whether they are testing anything meaningful anymore. Organizing these tests and getting a handle on exactly what we are testing is going to be a marathon (cc @MomIsBestFriend)
    • Many of our tests use the indices fixture or tm.makeFooIndex, which leaves many cases un-covered, off the top of my head:
      • above/below _libs.index._SIZE_CUTOFF
      • monotonic increasing/decreasing/non-monotonic
      • has NAs or not
      • near implementation bounds
      • name attributes
      • readonly or not
      • view or not
  • Ditto for indexing Benchmarks

  • GH Issue Tracker

    • The "Indexing" label is used pretty loosely. A thorough pass through the tracker to get a handle on what issues are about __setitem__, __getitem__, loc, iloc, at, and iat would be worthwhile.

_libs

  • In _libs.index we define _bin_search as an alternative to ndarray.searchsorted. AFAICT this is more performant than ndarray.searchsorted for object dtype, but not for other dtypes. If this is correct, then we should override the non-object IndexEngine subclasses to use searchsorted (we do this for DatetimeEngine)
  • The IndexEngine methods get_indexer, get_pad_indexer, get_backfill_indexer don't need to be in cython, can go back on the Index classes.
  • Having them in cython makes it harder to tell that the PeriodEngine versions are never called
  • We could avoid some casting if we had HashTable classes for itemsizes smaller than 64 bits
    • I expect this would also let us avoid some casting in core.algorithms.

Potential Optimizations

Some of these increase code complexity, so it isn't obvious whether they are worthwhile: - Separate Loc/iLoc/At/iAt classes for 1D vs 2D could allow some optimizations - iloc input validations could be removed, letting numpy's exception messages surface instead - cached check (or just separate subclass?) for if a object-dtype Index contains any tuples. If we can rule out tuples, a bunch of loc and __getitem__ code can be simplified.

Potential Refactors

  • DatetimeIndex, TimedeltaIndex, and PeriodIndex get_loc could be refactored to move most of the action into _maybe_cast_indexer. I think that the resulting _maybe_cast_indexer methods could then be shared with other DTA/TDA/PA methods, in particular the casting done in comparison methods.

Comment From: jbrockmendel

I think this has been taken as far as it can for the time being. Closing.