Some comparisons between different classes of string (e.g. `string[pyarrow]` and `str`) raise. Resolving this is straightforward except for what class should be returned. I would expect it should always be the left obj, e.g. `string[pyarrow] == str` should return `string[pyarrow]` whereas `str == string[pyarrow]` should return `str`. Is this the consensus?
We currently run into issues with how Python handles subclasses with comparison dunders.
```python
import numpy as np
import pandas as pd

lhs = pd.array(["x", pd.NA, "y"], dtype="string[pyarrow]")
rhs = pd.array(["x", pd.NA, "y"], dtype=pd.StringDtype("pyarrow", np.nan))

print(lhs.__eq__(rhs))
# <ArrowExtensionArray>
# [True, <NA>, True]
# Length: 3, dtype: bool[pyarrow]

print(lhs == rhs)
# [ True False True]
```
The two results above differ because `ArrowStringArrayNumpySemantics` is a proper subclass of `ArrowStringArray` and therefore Python first calls `rhs.__eq__(lhs)`.
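This is standard Python behavior, independent of pandas. A minimal sketch of the reflected-operand rule (the class names here are illustrative, not pandas code):

```python
class Base:
    def __eq__(self, other):
        return "Base.__eq__"

class Sub(Base):  # a proper subclass that overrides __eq__
    def __eq__(self, other):
        return "Sub.__eq__"

# Even with Base on the left, Python tries the subclass's method first,
# and since it does not return NotImplemented, its result wins.
print(Base() == Sub())  # Sub.__eq__
```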
We can avoid this by special-casing it in `ArrowStringArrayNumpySemantics`, but I wanted to open up an issue for discussion before proceeding.
cc @WillAyd @jorisvandenbossche
Comment From: WillAyd
> I would expect it should always be the left obj, e.g. `string[pyarrow] == str` should return `string[pyarrow]` whereas `str == string[pyarrow]` should return `str`
You mean the return types should be `bool`, right? I'm assuming so but let me know if I misunderstand.
This is another motivating case for PDEP-13 (https://github.com/pandas-dev/pandas/pull/58455). I think without that, they should probably just return the bool extension type to at least preserve the possibility of null values.
Comment From: rhshadrach
Ah, indeed, thanks! I meant the Boolean dtype determined by the left side. So `string[pyarrow] == str` would return `bool[pyarrow]`. Is this in conflict with your last sentence above? I'm not sure what the bool extension type means.
Comment From: WillAyd
`pd.BooleanDtype` is the bool extension type.
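For reference, a minimal example of that dtype:

```python
import pandas as pd

# "boolean" is pd.BooleanDtype, the nullable bool extension dtype;
# unlike numpy's bool it can hold pd.NA
arr = pd.array([True, pd.NA, False], dtype="boolean")
print(arr.dtype)  # boolean
```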
Comment From: jorisvandenbossche
Just to be explicit, the two possible return values we currently have for the example above (in case of consistent dtypes for left and right operand):
```python
>>> lhs == rhs.astype(lhs.dtype)
<ArrowExtensionArray>
[True, <NA>, True]
Length: 3, dtype: bool[pyarrow]

>>> lhs.astype(rhs.dtype) == rhs
array([ True, False, True])
```

These are `bool[pyarrow]` and `np.dtype('bool')`, respectively.
I personally agree that this should be `pd.BooleanDtype` instead of ArrowDtype for the first one (that was changed a while ago, and I thought we had an issue about it, but I can't directly find it). But let's focus here on which of the current dtypes should be returned in case of mixed operands, i.e. essentially the question of whether to always prioritize the left operand or whether there should be a kind of hierarchy.
While in general letting the left operand take priority sounds fine and is something that typically happens in Python (with Python also automatically first calling `__eq__` of the lhs before trying the rhs), in the context of array objects and data types it might make more sense to have a form of hierarchy / priority between groups of dtypes?
For example also for the nullable numeric dtypes vs numpy dtypes, we always give preference to the nullable dtype in that case:
```python
>>> ser1 = pd.Series([1, 2, 3], dtype="int64")  # numpy
>>> ser2 = pd.Series([1, 2, 3], dtype="Int64")  # nullable

# numpy gives numpy bool
>>> (ser1 == ser1).dtype
dtype('bool')

# but once left OR right is nullable, the result is nullable
>>> (ser1 == ser2).dtype
BooleanDtype
>>> (ser2 == ser1).dtype
BooleanDtype
```
Comment From: jorisvandenbossche
Making an overview of the result of `==` for all possible string dtype combinations (row labels are the LHS dtype, column labels the RHS dtype):
| | object | str (pyarrow) | str (python) | string (pyarrow) | string (python) | string[pyarrow] |
|---|---|---|---|---|---|---|
| object | bool | bool | bool | bool[pyarrow] | boolean | bool[pyarrow] |
| str (pyarrow) | bool | bool | \<error> | bool | \<error> | bool |
| str (python) | bool | bool | bool | bool | bool | bool |
| string (pyarrow) | bool[pyarrow] | bool | \<error> | bool[pyarrow] | \<error> | bool[pyarrow] |
| string (python) | boolean | boolean | bool | boolean | boolean | boolean |
| string[pyarrow] | bool[pyarrow] | bool | \<error> | bool[pyarrow] | \<error> | bool[pyarrow] |
```python
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa

dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string()),
]

results = []
for dt1, dt2 in itertools.product(dtypes, dtypes):
    ser1 = pd.Series(["a", None, "b"], dtype=dt1)
    ser2 = pd.Series(["a", None, "b"], dtype=dt2)
    try:
        res = ser1 == ser2
        results.append((dt1, dt2, res.dtype))
    except Exception:
        results.append((dt1, dt2, "<error>"))

df_results = pd.DataFrame(results, columns=["left", "right", "result"])
print(df_results.pivot(index="left", columns="right", values="result").to_markdown())
```
Some quick observations:
- There are a few cases that error right now, especially involving the python vs pyarrow storage of the future default `str` dtype, so something we should fix.
- For object dtype "strings", all other actual string dtypes take precedence (i.e. they return the "native" bool dtype for their family of strings), even if they are the RHS operand (i.e. this is the first row of the table).
- Should we do the same for the new default `str` dtype? I.e. should essentially the second and third rows be the same as the first row?
Comment From: WillAyd
That's quite the table there... thanks for putting that together. I still think that all of the extension types' equality ops should return the boolean extension type. The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow and the "str" types) should continue along with that API.

In general I find this a really non-value-added activity for end users of the library, so long term I still push for the logical type system. But in the interim I think we should not rock the boat too much and just stick with what was in place already for the string extension type.
Comment From: jorisvandenbossche
> (about returning the boolean extension dtype) ... The "string(python)" type has done this for years, so I think the new iterations (backed by pyarrow and the "str" types) should continue along with that API.
For "string(pyarrow)", fully agreed (and as mentioned above, this dtype also did that in the past, this was only changed in pandas 2.0).
But to be clear, for the "str" variants, that is IMO out of the question on the short term. That was for me the whole point of writing a PDEP about it and the reason we introduced the `"str"` / `pd.StringDtype(na_value=np.nan)` variants in addition to the existing StringDtype using pd.NA, i.e. so it could have different behaviour regarding default dtypes and missing values (and so not use the opt-in "nullable" dtypes, for now).
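To make the two families concrete, a small sketch (the storage argument is shown explicitly; both families also accept `"python"` storage):

```python
import numpy as np
import pandas as pd

# NaN-variant: the default "str" dtype in pandas 3.x
nan_variant = pd.StringDtype("pyarrow", na_value=np.nan)
# NA-variant: the opt-in "string" dtype
na_variant = pd.StringDtype("pyarrow", na_value=pd.NA)

print(nan_variant.na_value)  # nan
print(na_variant.na_value)   # <NA>
```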
(of course we could also have added a NaN variant of the boolean extension dtype, and then it could have used that. But we didn't do that, and I think that is now too late for 3.x?)
And I also fully agree we should improve that messy table with logical dtypes.
But that said, we still need to make a decision on the short term for 2.3 / 3.x about which resulting dtype to use in case of mixed dtype operands:
- Give priority to the LHS:
  - `ser[str] == ser[string]` -> numpy bool dtype
  - `ser[string] == ser[str]` -> boolean extension dtype
- Define some hierarchy (e.g. object < NaN-variant "str" < NA-variant "string" < ArrowDtype):
  - `ser[str] == ser[string]` -> boolean extension dtype
  - `ser[string] == ser[str]` -> boolean extension dtype
(and for the hierarchy option, also have "python < pyarrow" for the storage within the NaN or NA variant. This is less relevant for `==` equality, but for example for `+` resulting in new strings, the exact string dtype and its storage is also relevant)
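A hypothetical sketch of how the second option could resolve the result dtype; the names, ranks, and helper below are illustrative only, not a pandas API:

```python
# Hypothetical ranking of dtype families (higher wins); illustrative only
_PRIORITY = {"object": 0, "str": 1, "string": 2, "arrow": 3}
# NaN-variants produce numpy bool, NA-variants produce nullable bools
_RESULT = {"object": "bool", "str": "bool",
           "string": "boolean", "arrow": "bool[pyarrow]"}

def result_dtype(left: str, right: str) -> str:
    # the operand higher in the hierarchy "controls" the operation,
    # so the result no longer depends on operand order
    winner = max(left, right, key=_PRIORITY.__getitem__)
    return _RESULT[winner]

print(result_dtype("str", "string"))  # boolean
print(result_dtype("string", "str"))  # boolean
```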
I personally lean to the second option, I think.
(side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)
Comment From: WillAyd
Out of those two options I definitely prefer the hierarchy; I think giving the lhs more importance than the rhs is really hard to keep track of, especially in a potentially long pipeline of operations. I think your proposed solution works well too, although something does feel weird about mixing between NA markers for the str / string types...but that may be the least of all evils.
> (side note: also when having logical dtypes, we will have to answer this question if we have multiple variants of the same logical dtype)
This may be worth further discussing in the PDEP, but I think it is just an implementation detail (?) and not something that should worry the end user too much
Comment From: rhshadrach
My main opposition to "always return Boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series. Likewise, if I have two NaN Series (`pd.StringDtype("python", na_value=np.nan)` and `pd.StringDtype("pyarrow", na_value=np.nan)`), I would also be surprised to get back a Series with `pd.NA`-semantics.
I think this puts me in the hierarchy camp. My proposed hierarchy is:

`object < (pyarrow, NaN) < (python, NaN) < (pyarrow, NA) < (python, NA)`

My reasoning is that if you have NA-backed data, you've done something to opt into it as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that as it isn't the default. So since you've opted in, we should give it preference.
Of course, hopefully all of this is a pathological edge-case that users don't encounter.
Comment From: WillAyd
> My reasoning is that if you have NA-backed data, you've done something to opt into it as it isn't the default. Likewise, if you have PyArrow installed and you have python-backed strings, you've done something to get that as it isn't the default. So since you've opted in, we should give it preference.
>
> Of course, hopefully all of this is a pathological edge-case that users don't encounter.
This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?
> Of course, hopefully all of this is a pathological edge-case that users don't encounter.
Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet), it will be pretty easy to mix all these up.
> My main opposition to "always return Boolean extension dtype", in addition to what @jorisvandenbossche mentioned, is that if I have two pyarrow-backed Series with different NA-semantics, I would be very surprised to get back a NumPy-backed Series.
This is definitely valid with the current implementation of the extension types, although keep in mind with the proposed logical types that the storage becomes an implementation detail. We could set up whatever rules we wanted to manage this, although in the least pathological cases I would think pyarrow would be the main storage
Comment From: WillAyd
> object < (pyarrow, NaN) < (python, NaN) < (pyarrow, NA) < (python, NA)
Thinking through this one some more, I'm also not sure that the python string implementations should ever take priority over pyarrow. That's a huge performance degradation
Comment From: rhshadrach
> > So since you've opted in, we should give it preference.
>
> This makes sense but then shouldn't the pyarrow implementations get a higher priority than the python ones in your hierarchy?
If you have PyArrow installed, then `pd.Series(list("xyz"))` gives you PyArrow-backed strings. So to end up with Python storage, you need to have opted into it.
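A minimal illustration, assuming pandas 3.x defaults with pyarrow installed:

```python
import pandas as pd

# With pyarrow installed, plain construction infers pyarrow storage;
# python storage only appears when explicitly requested.
ser = pd.Series(list("xyz"))
print(ser.dtype)          # str
print(ser.dtype.storage)  # pyarrow
```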
> Unfortunately I think this is going to be a very common occurrence. Especially when considering I/O formats that preserve metadata (namely parquet), it will be pretty easy to mix all these up.
I expect the common occurrence will be object dtype against one of the other 4, but not e.g. NA-pyarrow against NaN-pyarrow.
Comment From: WillAyd
That's probably true for data that you create in your process, but when you bring I/O into the mix things can easily get mixed up. For instance, if you load a parquet file that someone saved with `dtype="string"` but you use the default type system, you are going to mix up NA/NaN sentinels, even with PyArrow installed.

I don't know if our parquet I/O keeps the storage backend as part of the metadata, but if it does that would also make it easy to mix up types (ex: a user without PyArrow installed saves a file that gets loaded by someone with PyArrow).
Comment From: rhshadrach
> I don't know if our parquet I/O keeps the storage backend as part of the metadata
Current parquet behavior is to always infer `str`.
```python
import pandas as pd

pd.set_option("infer_string", True)

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="object")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="string")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str

df = pd.DataFrame({"a": pd.array(list("xyz"), dtype="str")})
df.to_parquet("test.parquet")
print(pd.read_parquet("test.parquet")["a"].dtype)
# str
```
Comment From: WillAyd
Ah, that looks like a bug. `dtype="string"` should be roundtripping (it does without `infer_string`).
Comment From: WillAyd
I vaguely recall there being some work done upstream in Arrow to better differentiate these. Not sure if the version of that matters but Joris would know best
Comment From: jorisvandenbossche
For the hierarchy, my first reaction was also to find it odd to do "pyarrow < python" (since pyarrow is always preferred over python if installed). But your explanation certainly makes sense (you only get that situation when explicitly opting in to "python", so in that case preserve it).
Now, the question is how often users will accidentally have data using the "python" storage? Our constructors and most IO methods (for formats that don't store pandas-specific metadata) will be fine, I suppose.
One exception I can think of is pickle. A `to_pickle()`/`read_pickle()` roundtrip right now will preserve the storage (so pickling in an env without pyarrow and then reading it in an env with pyarrow can give you a non-default storage for the string dtype).
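A sketch of that roundtrip (the file name is illustrative):

```python
import pandas as pd

# pickle restores the exact dtype object, so "python" storage survives
# into an environment where pyarrow would otherwise be the default
ser = pd.Series(["a", "b"], dtype=pd.StringDtype("python"))
ser.to_pickle("strings.pkl")
print(pd.read_pickle("strings.pkl").dtype.storage)  # python
```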
For Parquet:
> Current parquet behavior is to always infer `str`.

> Ah, that looks like a bug. `dtype="string"` should be roundtripping (it does without `infer_string`) .. I vaguely recall there being some work done upstream in Arrow to better differentiate these. Not sure if the version of that matters but Joris would know best
That's indeed a known limitation right now. Normally pyarrow's `to_pandas()` tries to preserve the pandas extension dtype stored in the metadata, but for getting the default string dtype to work we currently overrule that mechanism by simply asking for "str" for all string columns through the `types_mapper` keyword:

https://github.com/pandas-dev/pandas/blob/1be26374dd7ef43bc709c4bc6db2daf7bfd606c8/pandas/io/_util.py#L79-L80

(and when specifying that keyword, it currently overrides any other mechanism to infer the dtype from the metadata)
I am fixing this on the pyarrow side in https://github.com/apache/arrow/pull/44195 (essentially moving the logic of choosing the "str" dtype by default for pandas >= 3 to pyarrow, so that we then don't have to override it anymore). So starting with pyarrow 19.0, we should be able to restore the functionality of preserving the `"string"` dtype.
For both "string" and "str" dtype, it will preserve that dtype (i.e. whether it uses NaN vs NA), but not the storage. That is because we don't include that information in the __str__
repr, only in the __repr__
repr (for "string", as for "str" we don't include it in either), and pyarrow uses str(..)
for storing information about the dtype in the metadata.
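To illustrate, using the NA-variant dtype (the outputs shown reflect the behaviour described above):

```python
import pandas as pd

dtype = pd.StringDtype("pyarrow")  # NA-variant "string" dtype
print(str(dtype))   # string          <- what ends up in the metadata
print(repr(dtype))  # string[pyarrow] <- storage only appears here
```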
(we currently don't really control the pyarrow->pandas conversion, which makes changes to the default behaviour for this conversion tricky and difficult, see xref https://github.com/pandas-dev/pandas/issues/59780 for moving that logic from pyarrow to pandas)
Comment From: WillAyd
Maybe we can just disallow a user from trying to force python storage when pyarrow is installed? That would help simplify the rules further
Comment From: rhshadrach
Maybe we can just disallow a user from trying to force python storage when pyarrow is installed?
What happens with pickles that contain Python storage?
Comment From: jbrockmendel
Haven't read the whole thread, just the OP. In general `==` should be commutative. We use `__pandas_priority__` to determine which object defers to which.
Comment From: WillAyd
> What happens with pickles that contain Python storage?
Do you think this would happen from someone starting in an environment without pyarrow, exporting an object via pickle, then loading it into another environment with pyarrow? If so, I wonder if we shouldn't restore that state using the pyarrow backend.
Comment From: rhshadrach
Talked about this in the dev meeting today; the main idea floated there was raising in certain cases:

- raise anytime storage or NA-semantics differ; or
- raise anytime NA-semantics differ, but not storage.

I think the main thing captured is that we do not want to be changing what NA-semantics the user is dealing with. I think that when presented with both NA and NaN we should resist the temptation to guess.
For storage, there was a desire to have `pyarrow` take priority over `python` if we aren't going to raise.
However we choose to proceed, the consensus in the dev meeting was that it should apply to all binary operations.
Comment From: WillAyd
I definitely don't think we should be raising on storage - one of the main values of PDEP-14 was to offer an abstraction over what storage is being used.
> I think the main thing captured is that we do not want to be changing what NA-semantics the user is dealing with. I think that when presented with both NA and NaN we should resist the temptation to guess.
This is a good point when thinking about the output of an operation. The input is pretty clear though: we are comparing values that are missing (at least with strings; floats have some ambiguity). Given our long term strategy for types has been to move away from using np.nan as a missing value, I think we should prefer pd.NA in this case.
Comment From: jorisvandenbossche
> the main idea floated there was raising on certain cases
I would find it strange to start raising an error for those certain cases, while right now we almost never raise an error, but rather return False if values are not really comparable (for example, if you ask whether floats are equal to datetimes, we don't error but return all-False).

I know that in this case it is not really about the values being comparable or not (the string values themselves are perfectly comparable), but about implicit conversion to different NA semantics. So that's a different situation.

But still, we are talking here in the context of the string dtype. What do we do for all the other dtypes where we also support those different NA semantics? Would you also want to start raising for that? (e.g. when comparing numpy floats with nullable ints or floats?)
Comment From: WillAyd
That's a good point. FWIW our status quo seems to prefer using pd.NA as the missing value sentinel when mixing NA-semantics today:
```python
>>> ser1 = pd.Series([1, np.nan, 2])
>>> ser2 = pd.Series([1, pd.NA, 2], dtype=pd.Float64Dtype())
>>> ser1 < ser2
0    False
1     <NA>
2    False
dtype: boolean
```
Although that comparison is likely not very common, I think sticking with that status quo is the best path forward for now
Comment From: jorisvandenbossche
We talked about this once more at the dev meeting this week, and the idea would be to use the proposal of a hierarchy mentioned earlier in this thread (https://github.com/pandas-dev/pandas/issues/60639#issuecomment-2569878403), but with giving pyarrow priority over python (although python is the "opt-in" storage):

`object < (python, NaN) < (pyarrow, NaN) < (python, NA) < (pyarrow, NA)`
When you have different dtypes for the two operands of the comparison, this hierarchy determines which of the two operands "controls" the execution, and thus determines the result dtype (for NaN semantics, the result is always the numpy bool dtype; for NA semantics, it is one of the nullable bool dtypes).
For the comparison with object dtype, we just need to be careful to not cast non-strings to the string dtype for evaluating the comparison.
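A minimal illustration of that pitfall, in plain Python:

```python
# A non-string in the object column must compare unequal to its string
# form; casting to string first would silently change the answer.
print(1 == "1")       # False -> the desired elementwise result
print(str(1) == "1")  # True  -> what a naive cast-to-string would give
```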