## Pandas version checks

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
## Reproducible Example

### example A

```python
import pandas as pd  # 2.2.3

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns.drop_duplicates())
# Traceback (most recent call last):
#   File "/home/cameron/.vim-excerpt", line 5, in <module>
#     print(df.columns.drop_duplicates())
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3117, in drop_duplicates
#     if self.is_unique:
#        ^^^^^^^^^^^^^^
#   File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 2346, in is_unique
#     return self._engine.is_unique
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "index.pyx", line 266, in pandas._libs.index.IndexEngine.is_unique.__get__
#   File "index.pyx", line 271, in pandas._libs.index.IndexEngine._do_unique_check
#   File "index.pyx", line 333, in pandas._libs.index.IndexEngine._ensure_mapping_populated
#   File "pandas/_libs/hashtable_class_helper.pxi", line 7115, in pandas._libs.hashtable.PyObjectHashTable.map_locations
# TypeError: unhashable type: 'list'
```
### example B

```python
import pandas as pd  # 2.2.3

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
# hasattr triggers a side effect after which `df.columns.drop_duplicates()` works.
hasattr(df, 'hello_world')
print(df.columns.drop_duplicates())
# Index(['a', ['b', 'c']], dtype='object')
```
## Issue Description

`pandas.Index.drop_duplicates()` inconsistently raises `TypeError: unhashable type: 'list'` when the index values include a list. The error does not appear to prevent the underlying uniqueness computation from completing. In addition to the reproducible example above, the cause can be demonstrated directly on the `Index` object: if we call `.drop_duplicates()` when the `Index` contains unhashable values, we observe a `TypeError`.
```python
import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
idx.drop_duplicates()  # TypeError: unhashable type: 'list'
```
But if we simply swallow the error and call `.drop_duplicates()` a second time, it works and removes the duplicated entries, including the unhashable ones:
```python
import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
try:
    idx.drop_duplicates()  # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx.drop_duplicates())  # Index(['a', ['b', 'c']], dtype='object')
```
Here we can see that the underlying `Index` engine populates its hashtable mapping even though the original call to `drop_duplicates` fails. We know this population is successful because the second attempt at `.drop_duplicates()` works:
```python
import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx._engine.mapping)  # None
try:
    idx.drop_duplicates()  # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx._engine.mapping)    # <pandas._libs.hashtable.PyObjectHashTable>
print(idx.drop_duplicates())  # Index(['a', ['b', 'c']], dtype='object')
```
Finally, it appears that a mere attribute check on a `pandas.DataFrame` causes the `PyObjectHashTable` to be constructed for the column index. This is likely due to the shared code path between `__getattr__` and `__getitem__`:
```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns._engine.mapping)  # None
hasattr(df, 'hello_world')
print(df.columns._engine.mapping)    # <pandas._libs.hashtable.PyObjectHashTable>
print(df.columns.drop_duplicates())  # Index(['a', ['b', 'c']], dtype='object')
```
## Expected Behavior

I expect `Index.drop_duplicates()` to behave the same regardless of whether an attribute has been checked first. The following two snippets should produce equivalent results (whether that is to raise an error or to produce a result):
```python
import pandas as pd  # 2.2.3

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns.drop_duplicates())  # Currently produces → TypeError
```

```python
import pandas as pd  # 2.2.3

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
hasattr(df, 'hello_world')
print(df.columns.drop_duplicates())  # Currently produces → Index(['a', ['b', 'c']], dtype='object')
```
## Installed Versions
Comment From: rhshadrach
Thanks for the report! This should raise consistently. Further investigations and PRs to fix are welcome!
Edit for visibility: As @MarcoGorelli points out below, this should even raise on index construction!
The source of the issue appears to be here:
https://github.com/pandas-dev/pandas/blob/03056560682bccd2a9e1ea1aa4c8d4914bd0fe89/pandas/_libs/index.pyx#L347-L348
We create an unpopulated hash table, and then fail on the `map_locations` line. However, this only happens when `self.is_mapping_populated` is False, which uses:
https://github.com/pandas-dev/pandas/blob/03056560682bccd2a9e1ea1aa4c8d4914bd0fe89/pandas/_libs/index.pyx#L335-L337
Still, I do not see how this could then return correct results. It is certainly inefficient.
```python
import pandas as pd

tuples = [(k,) for k in range(20000)] + [(0,)]
idx = pd.Index(tuples)
%timeit idx.drop_duplicates()
# 445 μs ± 3.66 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

lists = [[k] for k in range(20000)] + [[0]]
idx = pd.Index(lists)
try:
    idx.drop_duplicates()
except TypeError:
    pass
%timeit idx.drop_duplicates()
# 3.03 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
My guess is that the hash table is somehow degenerating into saying everything is a collision, and therefore doing an O(n) lookup (where n is the size of the hash table).
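That degenerate-collision behavior can be reproduced in plain Python (a sketch, not pandas internals): a key type whose `__hash__` sends every value to the same bucket turns set membership into an O(n) equality scan, which matches the slowdown measured above.

```python
import timeit

class AlwaysCollides:
    # every instance hashes to the same bucket, so set lookups
    # degrade from O(1) to an O(n) equality scan
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return self.x == other.x

good = {i for i in range(2000)}                  # normal int hashing
bad = {AlwaysCollides(i) for i in range(2000)}   # all-collision hashing
t_good = timeit.timeit(lambda: 1999 in good, number=100)
t_bad = timeit.timeit(lambda: AlwaysCollides(1999) in bad, number=100)
print(t_bad > t_good)  # collisions make membership tests dramatically slower
```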
Comment From: johnmwu
Took a look. Think I see what's happening.
The root cause is that values are put into hash tables in two different ways: one where `hash(val)` is called (and raises) and one where it isn't.

First path. Raises due to calling `hash()`:
https://github.com/pandas-dev/pandas/blob/19ea997815d4dadf490d7052a0a3c289be898588/pandas/_libs/hashtable_class_helper.pxi.in#L1387-L1392
Second path. Does not raise:
https://github.com/pandas-dev/pandas/blob/19ea997815d4dadf490d7052a0a3c289be898588/pandas/_libs/hashtable_func_helper.pxi.in#L169-L172
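The two-path distinction can be illustrated outside of pandas with a minimal sketch (hypothetical helper names, not pandas internals): a set-based dedup calls `hash()` on every value and raises for lists, while an equality-only scan never hashes and quietly succeeds.

```python
values = ['a', ['b', 'c'], ['b', 'c']]

def dedup_hashing(vals):
    # set insertion calls hash() on each value, like the first path
    seen = set()
    out = []
    for v in vals:
        if v not in seen:   # raises TypeError for unhashable values
            seen.add(v)
            out.append(v)
    return out

def dedup_no_hash(vals):
    # equality-only scan; never calls hash(), like the second path
    out = []
    for v in vals:
        if v not in out:    # list membership uses == comparisons only
            out.append(v)
    return out

try:
    dedup_hashing(values)
except TypeError as e:
    print(e)                   # unhashable type: 'list'
print(dedup_no_hash(values))   # ['a', ['b', 'c']]
```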
The reason there are two code paths is:

- In the first call to `drop_duplicates()`, computation of `self.is_unique` initializes `idx._engine.mapping`, which needs to call `PyObjectHashTable.map_locations` (the snippet linked for the first path): https://github.com/pandas-dev/pandas/blob/19ea997815d4dadf490d7052a0a3c289be898588/pandas/core/indexes/base.py#L2847-L2850
- The second time around, `self.is_unique` is False. We then call `super().drop_duplicates`, which, if you trace through, goes down the second code path. Here we actually rebuild an entirely new hash table, identical (from what I can tell) to `idx._engine.mapping`, in the `duplicated` function of `hashtable_func_helper.pxi`. The natural question, of course, is why this is done.
As for what to do, I'm going on the premise given by @rhshadrach that we should raise both times. Intuitively, there is no need for strong support of unhashable values.

I'll suggest a few options, though I'm not yet familiar with the code, so I'm not sure which is best (if any):

- The minimal-code option is adding `hash(values[i])` in the second path. This achieves the desired behavior.
- In `drop_duplicates`, if computation of `self.is_unique` failed the first time, should it be False on the second call? I would think not. In that case, the second call would go down exactly the same code path as the first.
- Reuse the already computed hash table when calling `drop_duplicates`. Not a direct fix, but it would remove the second code path.
Thoughts on how to proceed?
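To make the first option concrete, here is a hedged sketch (plain Python, not the Cython source) of a `duplicated`-style scan with a `hash(values[i])` probe added, so that unhashable values raise in this path as well.

```python
def duplicated_with_hash_check(values):
    # sketch: mark each value as a duplicate of an earlier one,
    # with an explicit hash() probe so unhashable values raise
    seen = set()
    out = []
    for v in values:
        hash(v)              # proposed addition: raise TypeError for unhashable values
        out.append(v in seen)
        seen.add(v)
    return out

print(duplicated_with_hash_check(['a', 'b', 'a']))  # [False, False, True]
try:
    duplicated_with_hash_check(['a', ['b', 'c']])
except TypeError as e:
    print(e)                 # unhashable type: 'list' -- now raised consistently
```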
Comment From: MarcoGorelli
Should this not raise even earlier? As in, should `columns=['a', ['b', 'c'], ['b', 'c']]` even be allowed in the `DataFrame` constructor? (I think not?)