Code Sample, a copy-pastable example if possible

```python
import numpy as np
import pandas as pd
import pandas.util.testing as tm  # pandas._testing in newer versions

df1 = pd.DataFrame([['foo', 'bar', 'baz'], [None, None, None]])
df2 = pd.DataFrame([['foo', 'bar', 'baz'], [np.nan, np.nan, np.nan]])
tm.assert_frame_equal(df1, df2)
```
Problem description

No `AssertionError` gets raised in this case, even though I would not expect `None` and `np.nan` to be considered equal values (I found this while testing #18450).
Note that if you simply compared a DataFrame containing only `np.nan` with another containing only `None`, you would get an `AssertionError` that the dtypes are different (`float64` vs `object`), but because the missing values are mixed into `object` dtypes here, I think that differentiation gets lost.
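For context (this illustration is not from the original report): `None` and `np.nan` are distinct Python objects of different types, which is why one might expect a strict comparison to tell them apart:

```python
import numpy as np

# None is the NoneType singleton; np.nan is an IEEE-754 float.
print(type(None).__name__)    # NoneType
print(type(np.nan).__name__)  # float
print(None is np.nan)         # False
print(np.nan == np.nan)       # False: NaN never compares equal, even to itself
```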
Expected Output

```
AssertionError: DataFrame.iloc[1, :] are different
```
Output of pd.show_versions()
Comment From: jorisvandenbossche
Generally, None and np.nan are considered equivalent in pandas; both are seen as missing values. For other dtypes, None will be converted to NaN on construction, so you don't have this issue in such a case:
```python
In [19]: pd.Series([1, 2, None])
Out[19]:
0    1.0
1    2.0
2    NaN
dtype: float64
```
and, for example, the `equals` method also sees both frames in your example as equal:

```python
In [18]: df1.equals(df2)
Out[18]: True
```
That said, to be able to test that we are actually returning NaN and not None in the issue you referenced, I agree it is annoying that you don't have a way to discriminate between the two in `assert_frame_equal`.

As a workaround for the PR for now, you can get the values of that column and assert that those are equal (e.g. `np.testing.assert_array_equal(df1[0].values, df2[0].values)` differentiates between None and np.nan).
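A minimal sketch of that workaround on plain object ndarrays (standing in for the `.values` of the example columns):

```python
import numpy as np

left = np.array(['foo', None], dtype=object)
right = np.array(['foo', np.nan], dtype=object)

try:
    np.testing.assert_array_equal(left, right)
    print("arrays considered equal")
except AssertionError:
    # Reached: in an object array, None and np.nan compare unequal.
    print("arrays differ")
```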
Comment From: jreback
I would add a kwarg to `assert_*`, maybe `check_null='strict'` or something. We almost always want `'strict'`, meaning not only that `isnull(a) == isnull(b)`, but also that `np.isnan(a) & np.isnan(b)`.

I don't think working around this is a good idea; let's actually fix the testing.
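A pure-Python sketch of the "strict" semantics described here (`nulls_strictly_equal` is a hypothetical name for illustration, not a pandas function, and it only considers None and float NaN):

```python
import numpy as np

def nulls_strictly_equal(a, b):
    """Strict null check: values must be missing in the same way,
    not merely both missing."""
    def isnull(x):
        return x is None or (isinstance(x, float) and np.isnan(x))

    if isnull(a) != isnull(b):
        return False   # one is missing, the other is not
    if not isnull(a):
        return True    # neither is missing; defer to normal comparison
    # both missing: require the same kind of missing value
    return (a is None) == (b is None)
```

Under this check, `nulls_strictly_equal(np.nan, None)` is False even though both values are "missing".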
Comment From: WillAyd
@jreback do you know why `assert_frame_equal` doesn't use the `NDFrame.equals` method? There's a comment preceding that method that mentions it should.

The reason I'm asking is that while we could add a kwarg to the `assert_frame_equal` method to control the null-checking behavior and make it strict by default, as @jorisvandenbossche pointed out, that would be inconsistent with how the `equals` method of `NDFrame` works. I would think we would either want to make that change in both comparison methods, or attempt to refactor `assert_frame_equal` to match `NDFrame`, as the comment suggests.
FWIW, if we wanted to modify the behavior of `NDFrame.equals`, we'd probably want to add `strict_nan=True` to the following line:

https://github.com/pandas-dev/pandas/blob/aec33479b923a05ead1ca35335f00aba87e4145e/pandas/core/internals.py#L1532

The `array_equivalent` method might also need some tweaks, as providing that argument is raising a `ValueError` for me, but nonetheless it seems like the intention to handle this is already in place there.
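To illustrate what `strict_nan` is meant to do at the array level, here is a standalone sketch over 1-D object sequences (this is not the pandas `array_equivalent` implementation, and it only recognizes None and float NaN as missing):

```python
import numpy as np

def array_equivalent_sketch(left, right, strict_nan=False):
    """With strict_nan=False, any missing value matches any other missing
    value; with strict_nan=True, np.nan only matches np.nan and None
    only matches None."""
    if len(left) != len(right):
        return False
    for l, r in zip(left, right):
        l_nan = isinstance(l, float) and np.isnan(l)
        r_nan = isinstance(r, float) and np.isnan(r)
        l_missing = l is None or l_nan
        r_missing = r is None or r_nan
        if l_missing or r_missing:
            if not (l_missing and r_missing):
                return False   # missing vs non-missing never matches
            if strict_nan and (l is None) != (r is None):
                return False   # None vs NaN: different kinds of missing
        elif l != r:
            return False
    return True
```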
Comment From: jreback
`assert_frame_equal` is NOT the same; it gives excruciating detail on where the error is (this is the entire point of it). `.equals` is just a user-facing shortcut. Under the hood both use `array_equivalent` for the comparisons. I don't see a problem with adding things to the test methods; these are meant to be very fine-grained.

So, for example, `.equals` on the example (which compares None/np.nan) should fail. These are not exactly equal. We should not add things to `.equals` (as this opens a big can of worms).
Comment From: jreback
From above, this is actually wrong and we should strictly check this.

```python
In [18]: df1.equals(df2)
Out[18]: True
```
Comment From: TomAugspurger
I think this also affects `pd.NA`.

```python
In [45]: a = pd.Series([1, np.nan], dtype=object)

In [46]: b = pd.Series([1, pd.NA], dtype=object)

In [47]: pd.testing.assert_series_equal(a, b)
```

A keyword to control this would be great. Based on the recent experience with `check_freq`, we should perhaps not make things strict by default. But we could do non-strict plus a warning by default, with the option to override in the function or via a global option.
```python
def assert_series_equal(..., check_na=None):
    if check_na is None:
        check_na = pandas.config.get("testing.check_na")
        if check_na is None:
            warnings.warn("default to false, changing to strict in the future")
            check_na = False
```
Comment From: jbrockmendel
In `_libs.testing.assert_almost_equal` we have

```python
if checknull(a) and checknull(b):
    # TODO: Should require same-dtype NA?
    # nan / None comparison
    return True
```
Changing that to

```python
if checknull(a):
    # nan / None comparison
    if is_matching_na(a, b):
        return True
    raise AssertionError(f"{a} != {b}")
elif checknull(b):
    raise AssertionError(f"{a} != {b}")
```

breaks 431 tests. 224 of those are in tests/strings/, 76 in tests/arithmetic/.
`is_matching_na` has a `nan_matches_none` keyword we could use to be less strict specifically in the None vs np.nan case, while still catching e.g. `np.timedelta64("nat")` vs `pd.NaT` (relevant in the tests/arithmetic/ cases). Passing that gets us down to 308 failing tests.

I definitely think we should become stricter here. As Tom mentioned, and as when we made `check_freq` stricter, doing this probably needs a deprecation cycle.
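A pure-Python sketch of the `is_matching_na` semantics described above (the real helper lives in pandas' `_libs` and also covers `pd.NaT`, `pd.NA`, and dtype-specific NaTs; this sketch handles only None and float NaN):

```python
import numpy as np

def is_matching_na_sketch(a, b, nan_matches_none=False):
    """Missing values only match missing values of the same kind, unless
    nan_matches_none relaxes the None-vs-np.nan distinction."""
    a_nan = isinstance(a, float) and np.isnan(a)
    b_nan = isinstance(b, float) and np.isnan(b)
    if a is None:
        return b is None or (nan_matches_none and b_nan)
    if a_nan:
        return b_nan or (nan_matches_none and b is None)
    return False  # a is not a missing value this sketch recognizes
```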