Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10, size=(9999, 2)), columns=['key', 'value'])
# df.value contains integers 2 and 4, does not contain strings '2', '4', or 'fart' or integer 11
# example A: this returns a non-empty df
print(len(df[df.value.isin(['2'])]))
# example B: this returns a non-empty df
print(len(df[df.value.isin(['2', 4])]))
# example C: this returns a non-empty df
print(len(df[df.value.isin(['fart', 4])]))
# example D: this returns a non-empty df
print(len(df[df.value.isin(['fart', 4, '2'])]))
# example E: this returns a non-empty df
print(len(df[df.value.isin(['2', '4'])]))
# example F: this returns a non-empty df
print(len(df[df.value.isin(['2', '4', '11'])]))
# example G: this DOES NOT return a non-empty df
print(len(df[df.value.isin(['fart', '2', '4'])]))
Problem description
In the examples above, examples A-F behave as expected; only example G behaves unexpectedly.
When using df[df.mycol.isin(alist)] with an integer column mycol, pandas appears to convert strings in alist to integers and check whether they occur in mycol. Only in the last case (example G) does it fail to do so. This seems inconsistent to me because it is able to ignore 'fart' in previous examples (examples C and D). It is also able to correctly cast '2' to an int and return matching rows (examples A, B, D). It is also able to do both of these things simultaneously (example D). It is also able to correctly cast two matching elements (example E). Only when all elements need to be cast, and one of them is uncastable, does the unexpected behavior occur (example G). In example F, all need to be cast, but they are all castable, so it works fine.
Basically, as long as they are all castable to the correct type, or there is at least one element of the correct type, it works.
To put it another way, if df[df.mycol.isin(alist)] returns something, then df[df.mycol.isin(alist + [newel])] should return at least that same thing. This seems reasonable to me, and I think it is how it was intended to work.
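The property stated above can be written as a small check. This is only a sketch: the helper name isin_is_monotone is made up for illustration, and the demonstration uses plain integers so the result does not depend on how a given pandas version treats strings.

```python
import pandas as pd

def isin_is_monotone(series, alist, newel):
    """Hypothetical helper: True if every row matched by `alist`
    is still matched after `newel` is appended to the list."""
    before = series.isin(alist)
    after = series.isin(alist + [newel])
    # "before implies after", evaluated row by row
    return bool((~before | after).all())

s = pd.Series([1, 2, 3, 4])
print(isin_is_monotone(s, [2], 4))
```

Calling it with alist=['2', '4'] and newel='fart' corresponds to going from example E to example G, which is the case this report says breaks the property on pandas 0.23.4.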
For my own edification, I would like to understand where the code that does the casting / comparisons is implemented. Is it in C or Python?
Expected Output
The last line should ignore the string 'fart', correctly cast the string '2' to int 2 and the string '4' to int 4, as it does in previous examples, and return the matching rows in df.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.16
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comment From: WillAyd
Thanks for the report, though I believe the expectation is wrong. Why should A, E, F or G return anything?
Comment From: BasilBeirouti
@WillAyd A, E and F do though. Why should G be different? Either all of them should return something or none of them should.
I think it is counterintuitive for a .isin(mylist) to return some rows, and .isin(mylist + [newel]) to return no rows. Adding more elements to the list that pandas is matching on should never result in fewer elements being returned.
Do you know if there is a C subroutine that performs the comparisons, or if it's in Python?
Comment From: WillAyd
"Either all of them should return something or none of them should."
That's my point. None of these should return anything.
The implementation of this for a Series is linked below if you want to take a look
https://github.com/pandas-dev/pandas/blob/83eb2428ceb6257042173582f3f436c2c887aa69/pandas/core/series.py#L3943
Comment From: BasilBeirouti
Thanks @WillAyd. I followed that function into the isin() implementation in core/algorithms.py:
https://github.com/pandas-dev/pandas/blob/83eb2428ceb6257042173582f3f436c2c887aa69/pandas/core/algorithms.py#L418
That function sees that the series comps (which corresponds to mycol in my example) is of type int, so it casts values (same as alist in my example) to int. This raises ValueError when an element cannot be cast, as with 'fart' in example G. The ValueError is handled by casting both values and comps to Python objects and then attempting the comparison one last time.
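The control flow described above can be sketched roughly as follows. This is not the pandas source: the function name isin_sketch and the set-based fallback are assumptions made for illustration, and only the try-the-cast-then-fall-back-to-objects shape comes from the description.

```python
import numpy as np

def isin_sketch(comps, values):
    """Illustrative only: mimic 'cast values to int64, else compare as objects'."""
    comps = np.asarray(comps)
    values = np.array(list(values), dtype=object)
    try:
        # All-or-nothing cast: a single uncastable element raises.
        values = values.astype("int64")
    except (TypeError, ValueError, OverflowError):
        # Fallback: compare plain Python objects, so '2' != 2.
        value_set = set(values.tolist())
        return np.array([c in value_set for c in comps.astype(object).tolist()],
                        dtype=bool)
    return np.isin(comps, values)

comps = [2, 4, 7]
print(isin_sketch(comps, ['2']).tolist())             # [True, False, False]
print(isin_sketch(comps, ['fart', 4, '2']).tolist())  # [False, True, False]
print(isin_sketch(comps, ['fart', '2', '4']).tolist())  # [False, False, False]
```

Under this sketch, any list containing 'fart' takes the object path, which is why only the genuine integer 4 matches in the second call and nothing matches in the third.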
At first I thought that example D was returning matches for '2' and 4, but actually, it is only returning matching rows for 4. This actually makes sense to me.
So really what it comes down to is that .astype('int64') is all or nothing: if even one element cannot be cast, then no elements are cast and ValueError is raised. This of course makes sense.
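The all-or-nothing behavior is easy to see with NumPy on its own. A minimal demonstration, using object arrays built by hand rather than whatever pandas constructs internally:

```python
import numpy as np

castable = np.array(['2', '4'], dtype=object)
print(castable.astype('int64').tolist())  # [2, 4]

mixed = np.array(['fart', '2', '4'], dtype=object)
try:
    mixed.astype('int64')
    cast_ok = True
except ValueError:
    # 'fart' cannot be converted, so the whole cast fails;
    # '2' and '4' are not partially converted.
    cast_ok = False
print(cast_ok)  # False
```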
Iterating through each element and casting it is probably not what we want. I can also see why ValueError is handled and not returned to the user. So really, I'm not sure if anything should be changed here, although @WillAyd may prefer to make the comparisons stricter and produce fewer matches. Another option might be to just include a warning message to alert the user that attempting to cast the elements raised a ValueError, and that elements were being compared as objects instead.
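For anyone who does want per-element casting, it can be done on the caller's side before calling isin. A minimal sketch, where coerce_to_ints is a hypothetical helper and not part of pandas:

```python
def coerce_to_ints(alist):
    """Hypothetical helper: keep the elements int() accepts, drop the rest."""
    out = []
    for el in alist:
        try:
            out.append(int(el))
        except (TypeError, ValueError):
            # Uncastable element such as 'fart': skip it.
            continue
    return out

cleaned = coerce_to_ints(['fart', '2', '4'])
print(cleaned)  # [2, 4]
```

With the df from the original report, df[df.value.isin(cleaned)] then compares integers against integers, so the result no longer depends on any implicit casting inside isin. Note that int() also accepts floats and booleans, which may or may not be wanted.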
It was a fun rabbit hole to go down, thanks again Will!
Comment From: WillAyd
@BasilBeirouti thanks a lot for investigating! Looking at the line you've called out I get the impression this is an unintended consequence of that exception handling
cc @jreback in case he knows of something I don't and/or objects to making this comparison stricter
Comment From: jreback
This takes a pretty tricky path of code, and sometimes we call out to numpy, which has some weird inference rules.
It's possible this might be a bug. It's also somewhat performance sensitive.
Comment From: ma-ji
This is a consistency issue, and pandas should have a consistent strategy. On my end the problem is: int isin str works, but str isin int does not work.
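One way to sidestep the asymmetry is to put both sides into the same type explicitly before comparing. A small sketch that converts an integer Series to strings (the Series s here is made-up data, not from the report):

```python
import pandas as pd

s = pd.Series([2, 4, 7])
# Compare strings against strings; no implicit casting is involved.
mask = s.astype(str).isin(['fart', '2', '4'])
print(mask.tolist())  # [True, True, False]
```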
Comment From: jbrockmendel
As described in https://github.com/pandas-dev/pandas/issues/24918#issuecomment-457778641, the expected behavior is not to cast strings to ints, so the behavior we see in the OP is correct. Closing.