Pandas Selection by labels with non-transitive equality

Code Sample, a copy-pastable example if possible

import pandas as pd


class KV(object):

  def __init__(self, key, val):
    self.key = key
    self.val = val

  def __str__(self):
    return 'KV({!r}:{!r})'.format(self.key, self.val)

  def __hash__(self):
    return hash(self.key)

  def __eq__(self, other):
    if isinstance(other, basestring):
      return self.key == other
    elif isinstance(other, KV):
      return self.key == other.key and self.val == other.val
    return NotImplemented

  def __ne__(self, other):
    equals = self == other
    if equals is NotImplemented:
      return NotImplemented
    return not equals

  def __lt__(self, other):
    if isinstance(other, basestring):
      return self.key < other
    return NotImplemented

  def __le__(self, other):
    if isinstance(other, basestring):
      return self.key <= other
    return NotImplemented

  def __gt__(self, other):
    if isinstance(other, basestring):
      return self.key > other
    return NotImplemented

  def __ge__(self, other):
    if isinstance(other, basestring):
      return self.key >= other
    return NotImplemented


# labels with transitive equality
idx = pd.Index(['a', 'a'])

assert idx.get_loc('a') == slice(0, 2) # string selects both values
assert (idx == 'a').tolist() == [True, True]  # string selects both values

# labels with non-transitive equality
a1 = KV('a', 1)
a2 = KV('a', 2)

assert a1 == 'a' # KV and string match on key
assert a2 == 'a' # KV and string match on key
assert a1 != a2 # KV and KV match on key, but not on value

idx = pd.Index([a1, a2])

assert idx.get_loc('a') == 0 # string selects first KV
assert (idx == 'a').tolist() == [True, True]  # string selects both KVs

Problem description

Selection by labels with non-transitive equality has inconsistent behavior with duplicate labels. I would expect duplicate labels to be selected in all cases.

Expected Output

assert idx.get_loc('a') == slice(0, 2) # string selects both KVs

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-96-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 36.2.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.4.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: gfyoung

Admittedly, this is an unusual use case. That being said, your point about selecting duplicate labels despite the non-transitivity makes sense. A PR is welcome to patch this!

Comment From: adshieh

I narrowed the issue down to the ObjectEngine in the Index: it does not check for duplicate labels because the values are considered unique, since a1 != a2.

If I create an Index with non-unique values, then the selection of duplicate labels works as expected.

assert pd.Index([a1, a2, a2]).get_loc('a') == slice(0, 3) # string selects all KVs
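
The effect described above can be reproduced without pandas, using Python's built-in set and dict as stand-ins for a hash-based engine. This is only a sketch under assumptions: the KV class is a trimmed Python 3 adaptation of the one in the report (str replaces basestring, ordering methods omitted), looks_unique is a hypothetical helper, and nothing here is the actual ObjectEngine code.

```python
class KV:
    """Trimmed Python 3 adaptation of the KV class from the report."""

    def __init__(self, key, val):
        self.key = key
        self.val = val

    def __hash__(self):
        # Hash depends on the key only, so KV('a', 1) hashes like 'a'.
        return hash(self.key)

    def __eq__(self, other):
        if isinstance(other, str):
            return self.key == other
        if isinstance(other, KV):
            return self.key == other.key and self.val == other.val
        return NotImplemented


def looks_unique(labels):
    # Hypothetical helper: uniqueness as a hash-based container sees it,
    # i.e. based on label-to-label equality only.
    return len(set(labels)) == len(labels)


a1 = KV('a', 1)
a2 = KV('a', 2)

# a1 != a2, so the pair is judged unique even though both equal 'a'.
unique_pair = looks_unique([a1, a2])

# a2 appears twice, so a duplicate is detected.
unique_triple = looks_unique([a1, a2, a2])

# A hash-table lookup stops at the first entry that compares equal.
table = {a1: 0, a2: 1}
first_hit = table['a']

# A linear scan compares the key against every label and finds both.
all_hits = [i for i, label in enumerate([a1, a2]) if label == 'a']
```

With these definitions, the hash-table lookup yields a single position while the scan yields both, which matches the split between idx.get_loc('a') and idx == 'a' in the report. The scan touches every label, which is the cost of matching on every lookup.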

Comment From: gfyoung

Interesting...can you share here the location of the code you're talking about (i.e. the specific logic that you found was causing the issue) ?

Comment From: adshieh

This line checks for uniqueness to select duplicate labels, or proceeds to the hash table.

Comment From: gfyoung

Awesome. I understand why the code does what it does, and although I agree that for "correctness" it should check for all possible matches to the key, I'm wary of the performance impact if we were to call _get_loc_duplicates every time just to account for a special case like yours.

@chris-b1 @jreback : your thoughts?

Out of curiosity, what was the use-case behind this example?

Comment From: adshieh

I don't see an easy fix either, since there's no way for the Index to know that its values will be compared in a way that makes them non-unique short of attaching a flag to override the uniqueness check. My use case is storing the nodes of a computation graph in the columns of a DataFrame to associate them with the arrays of values they compute. I want the nodes to compare to each other based on metadata about the computations they perform, but I also want to be able to tag the nodes with strings for easier access in the DataFrame.

Comment From: gfyoung

I see. So you're really trying to index based on an attribute of these nodes rather than the nodes themselves. That's a little different from what current indexing does and would therefore be an enhancement (potentially) on what we currently do.

Actually, it occurs to me that maybe this wouldn't be so complicated. Perhaps we could just pass in a lambda function to get_loc i.e.:

idx.get_loc(lambda x: x == 'a')

Comment From: adshieh

Support for callables would eliminate the need to make the nodes compare to strings at all.

idx.get_loc(lambda x: x.key == 'a')

However, callables would be overloaded on DataFrames, where the argument refers to the DataFrame rather than the index elements.
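
A callable-based lookup along these lines can be sketched as a standalone helper. Note that get_loc_by and Node are hypothetical names introduced for illustration; this is not an existing pandas API, and the return shapes (an integer, a slice, or a list of positions) are only loosely modelled on the get_loc results shown in the report.

```python
from collections import namedtuple

# Hypothetical label type: compares by (key, val) and never compares to strings.
Node = namedtuple('Node', ['key', 'val'])


def get_loc_by(labels, predicate):
    """Return the position(s) of all labels for which predicate(label) is true.

    Hypothetical helper, not part of pandas. Works on any iterable of labels.
    """
    hits = [i for i, label in enumerate(labels) if predicate(label)]
    if not hits:
        raise KeyError('no label satisfies the predicate')
    if len(hits) == 1:
        return hits[0]
    # A contiguous run of matches is reported as a slice.
    if hits[-1] - hits[0] + 1 == len(hits):
        return slice(hits[0], hits[-1] + 1)
    return hits


labels = [Node('a', 1), Node('a', 2), Node('b', 3)]

both_a = get_loc_by(labels, lambda x: x.key == 'a')
only_b = get_loc_by(labels, lambda x: x.key == 'b')
```

Because the predicate inspects the key attribute directly, the labels keep ordinary, transitive equality, and the string tag is used only at lookup time.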

Comment From: gfyoung

> However, callables would be overloaded on DataFrames, where the argument refers to the DataFrame rather than the index elements.

I'm not sure I follow you here.

In any case, the implementation of the callable would be the responsibility of the user.

Comment From: adshieh

It wouldn't be possible to select columns of a DataFrame with a callable through df.__getitem__ or df.loc in the same way as an Index through idx.get_loc, since x in df[lambda x: x.key == 'a'] would refer to df rather than an element of df.columns.

Comment From: gfyoung

True, but you can pre-select with df.columns.
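
Pre-selecting through df.columns might look like the sketch below. It assumes a hypothetical Node label class with default identity-based equality and made-up data. The boolean mask is built from the column labels and passed to df.loc, and the last line shows the same mask produced by a callable that receives the DataFrame.

```python
import pandas as pd


class Node:
    """Hypothetical column label; uses default identity-based equality and hashing."""

    def __init__(self, key, val):
        self.key = key
        self.val = val

    def __repr__(self):
        return 'Node({!r}, {!r})'.format(self.key, self.val)


n1, n2, n3 = Node('a', 1), Node('a', 2), Node('b', 3)
df = pd.DataFrame([[10, 20, 30]], columns=[n1, n2, n3])

# Build a boolean mask from the column labels, then select by mask.
mask = [col.key == 'a' for col in df.columns]
selected = df.loc[:, mask]

# Equivalent selection with a callable; its argument is the DataFrame itself.
same = df.loc[:, lambda d: [col.key == 'a' for col in d.columns]]
```

In both forms the string tag never has to compare equal to a label, so the labels do not need a custom __eq__ at all.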

Comment From: jreback

This is not supported; you are violating hashing-equality constraints. Using custom things which are not strings in an Index is non-performant, opaque and non-idiomatic.

We already support callables in the cond generation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-callable
