As the next step of the separation-of-concerns plan (#6744) I'd like to propose adding a method (or several, actually) to the Index class that would encapsulate the details of foo.loc[l1,l2,...] lookup.
Implementation Idea
Roughly, the idea is to make loc's __getitem__ as simple as
    def __getitem__(self, indexer):
        axes = self.obj.axes
        return self.obj.iloc[axes[0].lookup_labels_nd(indexer, axes[1:], typ='loc')]
Not quite, but hopefully you get the point. The default lookup_labels_nd
implementation would then look something like this:
    def lookup_labels_nd(self, indexer, other_axes, typ=None):
        if not isinstance(indexer, tuple):
            return self.lookup_labels(indexer, typ=typ)
        else:
            # ndim mismatch error handling is omitted intentionally
            # indexer[0] belongs to this axis, the rest to other_axes
            return (self.lookup_labels(indexer[0], typ=typ),) + \
                tuple(ax.lookup_labels(ix, typ=typ)
                      for ax, ix in zip(other_axes, indexer[1:]))
The result should be an object that could be fed to an underlying BlockManager to perform the requested operation. To support adding new rows with __setitem__, we only need to agree that lookup_labels_nd will never return negative indices unless they reference newly appended items along that axis.
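To make the shape of the result concrete, here is a minimal pure-Python sketch of the idea. ToyIndex is a hypothetical stand-in, not pandas code, and the use of -1 for a label that is about to be appended is only one possible reading of the contract above.

```python
class ToyIndex:
    """Hypothetical stand-in for Index: maps labels to integer positions."""

    def __init__(self, labels):
        self.labels = list(labels)
        self._pos = {label: i for i, label in enumerate(self.labels)}

    def lookup_labels(self, indexer, typ=None):
        if isinstance(indexer, slice):
            # Label slices include both endpoints, as with .loc
            # (the slice step is ignored in this sketch).
            start = 0 if indexer.start is None else self._pos[indexer.start]
            stop = (len(self.labels) if indexer.stop is None
                    else self._pos[indexer.stop] + 1)
            return slice(start, stop)
        if isinstance(indexer, list):
            return [self._pos[label] for label in indexer]
        if typ == 'setitem' and indexer not in self._pos:
            # Assumption: a negative position marks a row to be appended.
            return -1
        return self._pos[indexer]  # KeyError for unknown labels

    def lookup_labels_nd(self, indexer, other_axes, typ=None):
        if not isinstance(indexer, tuple):
            return self.lookup_labels(indexer, typ=typ)
        return (self.lookup_labels(indexer[0], typ=typ),) + tuple(
            ax.lookup_labels(ix, typ=typ)
            for ax, ix in zip(other_axes, indexer[1:]))


rows = ToyIndex(['a', 'b', 'c'])
cols = ToyIndex(['x', 'y'])

# Labels in, positions out: rows 'a'..'b' inclusive, column 'y'.
positions = rows.lookup_labels_nd((slice('a', 'b'), 'y'), [cols], typ='loc')

# The positional result can be consumed without knowing about labels.
table = [[1, 2], [3, 4], [5, 6]]
row_pos, col_pos = positions
selected = [row[col_pos] for row in table[row_pos]]

# A label missing from the axis, looked up for assignment.
appended = rows.lookup_labels_nd('d', [cols], typ='setitem')
```

Here the positional consumer (a nested list) plays the role that iloc or the BlockManager would play in the real thing.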
This would allow hiding Index-subclass-specific lookup peculiarities in their respective overrides of lookup_labels_nd and lookup_labels (proposals for better names are welcome), e.g.:
- looking up str in DatetimeIndex/PeriodIndex
- looking up int in FloatIndex
- looking up per-level slices in MultiIndex
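As a rough illustration of the first item, a subclass override could keep string handling local to the date-like index. ToyIndex and ToyDateIndex below are hypothetical names, and the prefix match is a crude stand-in for real partial date-string parsing.

```python
import datetime as dt


class ToyIndex:
    """Hypothetical base class: exact-match label lookup only."""

    def __init__(self, labels):
        self.labels = list(labels)

    def lookup_labels(self, indexer, typ=None):
        try:
            return self.labels.index(indexer)
        except ValueError:
            raise KeyError(indexer) from None


class ToyDateIndex(ToyIndex):
    """Hypothetical subclass: also accepts ISO date-string prefixes."""

    def lookup_labels(self, indexer, typ=None):
        if isinstance(indexer, str):
            # '2014-07' selects every date whose ISO form starts with it.
            hits = [i for i, d in enumerate(self.labels)
                    if d.isoformat().startswith(indexer)]
            if not hits:
                raise KeyError(indexer)
            return hits
        # Anything else falls through to the generic exact-match lookup.
        return super().lookup_labels(indexer, typ=typ)


idx = ToyDateIndex([dt.date(2014, 6, 30),
                    dt.date(2014, 7, 1),
                    dt.date(2014, 7, 2)])
july = idx.lookup_labels('2014-07')
exact = idx.lookup_labels(dt.date(2014, 6, 30))
```

The generic caller never needs an isinstance check or a try/except to discover that strings are acceptable here; the special case lives only in the subclass.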
Benefits
- no more confusing errors due to a try .. except block carpet-catching a logic error, because corner cases will be handled precisely where they are needed and nowhere else
- no more relying on isinstance checks and exceptions to search for alternative lookup scenarios, meaning better performance
- the API will provide a contract that is simple to grasp, test, benchmark and, eventually, cythonize (as a side effect of this point I'd like to try putting up a wiki page with indexing API reference)
Comment From: immerrr
I was trying to mock up a proof of concept (while at ep2014) and decided to start with simple things, at and iat, but found something weird:
In [19]: s = pd.Series([1,2,3], index=list('abc'))
In [20]: s.at['a']
Out[20]: 1
In [21]: s.at['b']
Out[21]: 2
In [22]: s.at['c']
Out[22]: 3
In [23]: s.at[0]
Out[23]: 1
I couldn't find any reference to this feature in the pandas docs, and I have a strong feeling it shouldn't be there. @jreback, what do you think, do I have to maintain it?
Comment From: jreback
I think that is a bug, at should be a restricted version of loc (i.e. no slicing or lists), and certainly no fallback. Prob just not checking that the label type is a valid type for the index. pls do a PR to test/fix this if you would. thanks
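For illustration only, the restriction described here could look roughly like the following pure-Python sketch. strict_at is a hypothetical helper, not the pandas implementation: it accepts a single label, rejects slices and lists, and never falls back to positional access.

```python
def strict_at(labels, values, key):
    """Scalar label lookup: no slicing, no lists, no positional fallback."""
    if isinstance(key, (slice, list, tuple)):
        raise TypeError("at accepts a single label, got %r" % (key,))
    if key not in labels:
        # An integer is not silently treated as a position.
        raise KeyError(key)
    return values[labels.index(key)]


labels = ['a', 'b', 'c']
values = [1, 2, 3]

by_label = strict_at(labels, values, 'a')

try:
    strict_at(labels, values, 0)
    fell_back = True
except KeyError:
    fell_back = False
```

With string labels, the integer key 0 raises KeyError instead of returning the first element.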
Comment From: jreback
https://github.com/pydata/pandas/issues/7814
Comment From: shoyer
Just stumbled across this -- it gets a big +1 from me!
Comment From: jreback
cc @toobaz
closing as stale, but certainly would take a PR (or more) to simplify internal indexing :>
Comment From: toobaz
Just for the record, since I was cc'ed: I'm -1 on this, for two reasons.
The first is conceptual: I don't like Index to have a method that works on multiple indexes. Despite the mess in core/indexing.py, which would certainly benefit from some decoupling, the general idea that indexing must be done at the object level, rather than at the index level, makes sense to me.
The second is practical: there are calls to .loc which cannot be expressed in terms of calls to .iloc, such as when you do partial indexing on a MultiIndex.
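A toy sketch of the practical objection, using hypothetical names and plain tuples in place of a real MultiIndex: selecting on the first level alone yields both positions and a relabelled result (the matched level is dropped), and a purely positional indexer only carries the former.

```python
labels = [('a', 1), ('a', 2), ('b', 1)]  # toy two-level index
values = [10, 20, 30]


def partial_lookup(labels, first):
    """Return (positions, remaining labels) for a first-level key."""
    positions = [i for i, lab in enumerate(labels) if lab[0] == first]
    if not positions:
        raise KeyError(first)
    # The matched level is dropped from the labels of the result.
    remaining = [labels[i][1] for i in positions]
    return positions, remaining


positions, remaining = partial_lookup(labels, 'a')

# Positions are enough to fetch the values ...
selected = [values[i] for i in positions]
# ... but the new labels in `remaining` have to travel some other way.
```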
By the way, the crucial step to allow real decoupling is to enable BlockManager.set_value for mixed dtypes.