Code Sample, a copy-pastable example if possible
import pandas as pd

class KV(object):
    def __init__(self, key, val):
        self.key = key
        self.val = val

    def __str__(self):
        return 'KV({!r}:{!r})'.format(self.key, self.val)

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        if isinstance(other, basestring):
            return self.key == other
        elif isinstance(other, KV):
            return self.key == other.key and self.val == other.val
        return NotImplemented

    def __ne__(self, other):
        equals = self == other
        if equals is NotImplemented:
            return NotImplemented
        return not equals

    def __lt__(self, other):
        if isinstance(other, basestring):
            return self.key < other
        return NotImplemented

    def __le__(self, other):
        if isinstance(other, basestring):
            return self.key <= other
        return NotImplemented

    def __gt__(self, other):
        if isinstance(other, basestring):
            return self.key > other
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, basestring):
            return self.key >= other
        return NotImplemented

# labels with transitive equality
idx = pd.Index(['a', 'a'])
assert idx.get_loc('a') == slice(0, 2)  # string selects both values
assert (idx == 'a').tolist() == [True, True]  # string selects both values

# labels with non-transitive equality
a1 = KV('a', 1)
a2 = KV('a', 2)
assert a1 == 'a'  # KV and string match on key
assert a2 == 'a'  # KV and string match on key
assert a1 != a2  # KV and KV match on key, but not on value
idx = pd.Index([a1, a2])
assert idx.get_loc('a') == 0  # string selects first KV
assert (idx == 'a').tolist() == [True, True]  # string selects both KVs
Problem description
Selection by labels with non-transitive equality has inconsistent behavior with duplicate labels. I would expect duplicate labels to be selected in all cases.
Expected Output
assert idx.get_loc('a') == slice(0, 2) # string selects both KVs
Output of pd.show_versions()
Comment From: gfyoung
Admittedly, this is an unusual use case. That being said, your point about selecting duplicate labels despite the non-transitivity makes sense. A PR is welcome to patch this!
Comment From: adshieh
I narrowed the issue down to the ObjectEngine in the Index, which does not check for duplicate labels because the values count as unique (a1 != a2).
If I create an Index with non-unique values, then the selection of duplicate labels works as expected.
assert pd.Index([a1, a2, a2]).get_loc('a') == slice(0, 3) # string selects all KVs
Comment From: gfyoung
Interesting... can you share the location of the code you're talking about (i.e. the specific logic that you found was causing the issue)?
Comment From: adshieh
This line checks for uniqueness: if the values are not unique, it selects the duplicate labels; otherwise it proceeds to the hash table.
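To make the branch described above concrete, here is a minimal pure-Python sketch. It is only a model of the behavior, not the pandas engine code: toy_get_loc is a hypothetical name, the KV class is a trimmed-down copy of the one in the report, and it uses str where the report uses basestring (an assumption made so it runs on Python 3).

```python
class KV:
    """Trimmed-down KV label from the report (Python 3: str, not basestring)."""

    def __init__(self, key, val):
        self.key = key
        self.val = val

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        if isinstance(other, str):
            return self.key == other
        if isinstance(other, KV):
            return self.key == other.key and self.val == other.val
        return NotImplemented


def toy_get_loc(values, label):
    """Hypothetical model of the uniqueness branch; not pandas code."""
    # Build a hash table mapping each distinct value to its first position.
    mapping = {}
    for pos, value in enumerate(values):
        mapping.setdefault(value, pos)

    # Uniqueness is judged among the stored values only;
    # the lookup label plays no part in this check.
    is_unique = len(mapping) == len(values)
    if is_unique:
        # Hash-table path: yields a single position.
        return mapping[label]
    # Duplicate path: compare every stored value against the label.
    return [pos for pos, value in enumerate(values) if value == label]


a1, a2 = KV('a', 1), KV('a', 2)
unique_result = toy_get_loc([a1, a2], 'a')         # values are pairwise distinct
duplicate_result = toy_get_loc([a1, a2, a2], 'a')  # a2 appears twice
print(unique_result, duplicate_result)
```

On CPython this prints 0 [0, 1, 2], which mirrors the two results in the report: a single position when the values look unique, and every match once a real duplicate is present.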
Comment From: gfyoung
Awesome. I understand why the code does what it does, and although I agree that for "correctness" it should check for all possible matches to the key, I'm wary of the performance impact if we were to call _get_loc_duplicates every time just to account for a special case such as yours.
@chris-b1 @jreback : your thoughts?
Out of curiosity, what was the use-case behind this example?
Comment From: adshieh
I don't see an easy fix either, since there's no way for the Index to know that its values will be compared in a way that makes them non-unique short of attaching a flag to override the uniqueness check. My use case is storing the nodes of a computation graph in the columns of a DataFrame to associate them with the arrays of values they compute. I want the nodes to compare to each other based on metadata about the computations they perform, but I also want to be able to tag the nodes with strings for easier access in the DataFrame.
Comment From: gfyoung
I see. So you're really trying to index based on an attribute of these nodes rather than the nodes themselves. That's a little different from what current indexing does and would therefore be an enhancement (potentially) on what we currently do.
Actually, it occurs to me that maybe this wouldn't be so complicated. Perhaps we could just pass in a lambda function to get_loc, i.e.:
idx.get_loc(lambda x: x == 'a')
Comment From: adshieh
Support for callables would eliminate the need to make the nodes compare to strings at all.
idx.get_loc(lambda x: x.key == 'a')
However, callables would be overloaded on DataFrames, where the argument refers to the DataFrame rather than the index elements.
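The callable-based selection discussed above is only a proposal in this thread, so the following sketch does not rely on get_loc accepting a callable. It approximates the idea in plain Python by applying a predicate to each label. Node and select_positions are hypothetical names used only for illustration.

```python
class Node:
    """Hypothetical stand-in for a computation-graph node tagged with a string key."""

    def __init__(self, key, val):
        self.key = key
        self.val = val


def select_positions(labels, predicate):
    """Return the positions of all labels for which predicate(label) is true."""
    return [pos for pos, label in enumerate(labels) if predicate(label)]


labels = [Node('a', 1), Node('a', 2), Node('b', 3)]

# Positions of every label tagged 'a'; no string comparison on Node is needed.
positions = select_positions(labels, lambda x: x.key == 'a')

# The same predicate expressed as a boolean mask, one entry per label.
mask = [label.key == 'a' for label in labels]
print(positions, mask)
```

Because the predicate reads the key attribute directly, the nodes no longer need to compare equal to strings, which sidesteps the non-transitive equality altogether.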
Comment From: gfyoung
"However, callables would be overloaded on DataFrames, where the argument refers to the DataFrame rather than the index elements."
I'm not sure I follow you here.
In any case, the implementation of the callable would be the responsibility of the user.
Comment From: adshieh
It wouldn't be possible to select columns of a DataFrame with a callable through df.__getitem__ or df.loc in the same way as an Index through idx.get_loc, since x in df[lambda x: x.key == 'a'] would refer to df rather than an element of df.columns.
Comment From: gfyoung
True, but you can pre-select with df.columns.
Comment From: jreback
This is not supported; you are violating hashing-equality constraints. Using custom objects that are not strings in an Index is non-performant, opaque, and non-idiomatic.
We already support callables in the cond generation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-callable
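For readers wondering what the hashing-equality concern looks like in practice, here is a small sketch using only a plain dict, with no pandas involved. The KV class is a trimmed-down copy of the one in the report, using str instead of basestring as an assumption for Python 3. It illustrates the general problem rather than reproducing pandas internals.

```python
class KV:
    """Trimmed-down KV label from the report (Python 3: str, not basestring)."""

    def __init__(self, key, val):
        self.key = key
        self.val = val

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        if isinstance(other, str):
            return self.key == other
        if isinstance(other, KV):
            return self.key == other.key and self.val == other.val
        return NotImplemented


a1, a2 = KV('a', 1), KV('a', 2)

# Equality is not transitive: both labels match 'a', yet they do not match each other.
assert a1 == 'a' and a2 == 'a' and a1 != a2

# a1, a2 and 'a' all share one hash, so a dict stores a1 and a2 as separate keys
# but can hand back only one of them when looked up with 'a'.
first = {a1: 'a1', a2: 'a2'}['a']
second = {a2: 'a2', a1: 'a1'}['a']
print(first, second)
```

On CPython the lookup returns whichever matching key was inserted first, so the two dicts give different answers for the same key. A hash table has no way to return "all matches" for labels like these, which is why a hash-based lookup reports a single position.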