xref #9595 for an overview of the indexing API; this issue is focused on the state of the code.
Needs Eyeballs
- Tests are scattered:
- tests/indexing/
- tests/(frame|series)/indexing/
- tests/indexes/*/indexing
- Moreover, many of the indexing tests are leftover from ix, and it isn't clear whether they are testing anything meaningful anymore. Organizing these tests and getting a handle on exactly what we are testing is going to be a marathon (cc @MomIsBestFriend)
- Many of our tests use the indices fixture or tm.makeFooIndex, which leaves many cases uncovered. Off the top of my head:
- above/below _libs.index._SIZE_CUTOFF
- monotonic increasing/decreasing/non-monotonic
- has NAs or not
- near implementation bounds
- name attributes
- readonly or not
- view or not
- Ditto for indexing benchmarks
- GH Issue Tracker
- The "Indexing" label is used pretty loosely. A thorough pass through the tracker to get a handle on which issues are about __setitem__, __getitem__, loc, iloc, at, and iat would be worthwhile.
- _libs
- In _libs.index we define _bin_search as an alternative to ndarray.searchsorted. AFAICT this is more performant than ndarray.searchsorted for object dtype, but not for other dtypes. If this is correct, then we should override the non-object IndexEngine subclasses to use searchsorted (we do this for DatetimeEngine)
- The IndexEngine methods get_indexer, get_pad_indexer, get_backfill_indexer don't need to be in cython, can go back on the Index classes.
- Having them in cython makes it harder to tell that the PeriodEngine versions are never called
- We could avoid some casting if we had HashTable classes for itemsizes smaller than 64 bits
- I expect this would also let us avoid some casting in core.algorithms.
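To make the searchsorted point concrete, here is a minimal sketch using only public NumPy API. It is not the actual IndexEngine code; the helper name get_loc_sorted is hypothetical, and it assumes the values are monotonic increasing and unique.

```python
import numpy as np


def get_loc_sorted(values: np.ndarray, key) -> int:
    """Hypothetical helper: locate `key` in a sorted, unique array.

    Assumes `values` is monotonic increasing; raises KeyError if absent.
    """
    # side="left" gives the first position at which `key` could be inserted
    # while keeping the array sorted.
    pos = int(values.searchsorted(key, side="left"))
    # The key is present only if that position is in bounds and matches.
    if pos >= len(values) or values[pos] != key:
        raise KeyError(key)
    return pos


values = np.array([1, 3, 5, 7, 9], dtype=np.int64)
loc = get_loc_sorted(values, 7)
```

Whether this actually beats _bin_search for a given dtype would need to be confirmed with benchmarks; the sketch only shows the shape of the lookup.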
Potential Optimizations
Some of these increase code complexity, so it isn't obvious whether they are worthwhile:
- Separate Loc/iLoc/At/iAt classes for 1D vs 2D could allow some optimizations
- iloc input validations could be removed, letting numpy's exception messages surface instead
- A cached check (or just a separate subclass?) for whether an object-dtype Index contains any tuples. If we can rule out tuples, a bunch of loc and __getitem__ code can be simplified.
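As a rough sketch of what the tuple check might look like from the outside, the following uses only public pandas API. The function name contains_tuples is hypothetical, and caching is left out; a real version would presumably live on the Index and be cached.

```python
import pandas as pd


def contains_tuples(index: pd.Index) -> bool:
    """Hypothetical helper: True if an object-dtype Index holds any tuple."""
    # Only object-dtype indexes can hold tuple labels.
    if index.dtype != object:
        return False
    return any(isinstance(label, tuple) for label in index)


# tupleize_cols=False keeps the tuples as labels instead of building a MultiIndex.
with_tuples = pd.Index([("a", 1), ("b", 2)], tupleize_cols=False)
without_tuples = pd.Index(["a", "b"], dtype=object)
numeric = pd.Index([1, 2, 3])
```

The scan is O(n), which is why the check would need to be cached (or encoded in a subclass) to be useful in the loc and __getitem__ paths.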
Potential Refactors
- DatetimeIndex, TimedeltaIndex, and PeriodIndex get_loc could be refactored to move most of the action into _maybe_cast_indexer. I think that the resulting _maybe_cast_indexer methods could then be shared with other DTA/TDA/PA methods, in particular the casting done in comparison methods.
Comment From: jbrockmendel
I think this has been taken as far as it can for the time being. Closing.