Code Sample, a copy-pastable example if possible

```python
import numpy as np
import pandas as pd
import pandas.util.testing as tm  # pandas._testing in newer versions

df1 = pd.DataFrame([['foo', 'bar', 'baz'], [None, None, None]])
df2 = pd.DataFrame([['foo', 'bar', 'baz'], [np.nan, np.nan, np.nan]])
tm.assert_frame_equal(df1, df2)
```
Problem description

No `AssertionError` gets raised in this case, even though I would not expect `None` and `np.nan` to be considered equal values (I found this while testing #18450).
Note that if you simply compared a DataFrame containing only `np.nan` with another containing only `None`, you would get an `AssertionError` that the dtypes are different (`float64` vs `object`), but because the missing values are mixed into `object` dtypes here, I think that differentiation gets lost.
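For context (this illustration is not from the original report): `None` and `np.nan` are distinct Python objects of different types, which is why one might expect a strict comparison to tell them apart:

```python
import numpy as np

# None is the NoneType singleton; np.nan is an IEEE-754 float.
print(type(None).__name__)    # NoneType
print(type(np.nan).__name__)  # float
print(None is np.nan)         # False
print(np.nan == np.nan)       # False: NaN never compares equal, even to itself
```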
Expected Output

```
AssertionError: DataFrame.iloc[1, :] are different
```
Output of pd.show_versions()
Comment From: jorisvandenbossche
Generally, None and np.nan are considered equivalent in pandas; both are seen as missing values. For other dtypes, None will be converted to NaN on construction, so you don't have this issue in such a case:
```python
In [19]: pd.Series([1, 2, None])
Out[19]:
0    1.0
1    2.0
2    NaN
dtype: float64
```
and, for example, the `equals` method also sees both frames in your example as equal:

```python
In [18]: df1.equals(df2)
Out[18]: True
```
That said, to be able to test that we are actually returning NaN and not None in the issue you referenced, I agree it is annoying that you don't have a way to discriminate between the two in `assert_frame_equal`.

As a workaround for the PR for now, you can get the values of that column and assert that those are equal (e.g. `np.testing.assert_array_equal(df1[0].values, df2[0].values)` differentiates between None and np.nan).
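A minimal sketch of that workaround on plain object ndarrays (standing in for the `.values` of the example columns):

```python
import numpy as np

left = np.array(['foo', None], dtype=object)
right = np.array(['foo', np.nan], dtype=object)

try:
    np.testing.assert_array_equal(left, right)
    print("arrays considered equal")
except AssertionError:
    # Reached: in an object array, None and np.nan compare unequal.
    print("arrays differ")
```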
Comment From: jreback
I would add a kwarg to `assert_*`, maybe `check_null='strict'` or something. We almost always want `'strict'`, meaning not only that `isnull(a) == isnull(b)`, but also that `np.isnan(a) & np.isnan(b)`.

I don't think working around this is a good idea; let's actually fix the testing.
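A pure-Python sketch of the "strict" semantics described here (`nulls_strictly_equal` is a hypothetical name for illustration, not a pandas function, and it only considers None and float NaN):

```python
import numpy as np

def nulls_strictly_equal(a, b):
    """Strict null check: values must be missing in the same way,
    not merely both missing."""
    def isnull(x):
        return x is None or (isinstance(x, float) and np.isnan(x))

    if isnull(a) != isnull(b):
        return False   # one is missing, the other is not
    if not isnull(a):
        return True    # neither is missing; defer to normal comparison
    # both missing: require the same kind of missing value
    return (a is None) == (b is None)
```

Under this check, `nulls_strictly_equal(np.nan, None)` is False even though both values are "missing".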
Comment From: WillAyd
@jreback do you know why `assert_frame_equal` doesn't use the `NDFrame.equals` method? There's a comment preceding that method that mentions it should.

The reason I'm asking is that while we could add a kwarg to the `assert_frame_equal` method to control the null-checking behavior and make it strict by default, as @jorisvandenbossche pointed out, that would be inconsistent with how the `equals` method of `NDFrame` works. I would think we would either want to make that change in both comparison methods, or attempt to refactor `assert_frame_equal` to match `NDFrame`, as the comment suggests.
FWIW, if we wanted to modify the behavior of `NDFrame.equals`, we'd probably want to add `strict_nan=True` to the following line:

https://github.com/pandas-dev/pandas/blob/aec33479b923a05ead1ca35335f00aba87e4145e/pandas/core/internals.py#L1532

The `array_equivalent` method might also need some tweaks, as providing that argument is raising a `ValueError` for me, but nonetheless it seems like the intention to handle this is already in place there.
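To illustrate what `strict_nan` is meant to do at the array level, here is a standalone sketch over 1-D object sequences (this is not the pandas `array_equivalent` implementation, and it only recognizes None and float NaN as missing):

```python
import numpy as np

def array_equivalent_sketch(left, right, strict_nan=False):
    """With strict_nan=False, any missing value matches any other missing
    value; with strict_nan=True, np.nan only matches np.nan and None
    only matches None."""
    if len(left) != len(right):
        return False
    for l, r in zip(left, right):
        l_nan = isinstance(l, float) and np.isnan(l)
        r_nan = isinstance(r, float) and np.isnan(r)
        l_missing = l is None or l_nan
        r_missing = r is None or r_nan
        if l_missing or r_missing:
            if not (l_missing and r_missing):
                return False   # missing vs non-missing never matches
            if strict_nan and (l is None) != (r is None):
                return False   # None vs NaN: different kinds of missing
        elif l != r:
            return False
    return True
```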
Comment From: jreback
`assert_frame_equal` is NOT the same; it gives excruciating detail on where the error is (this is the entire point of it). `.equals` is just a user-facing shortcut. Under the hood both use `array_equivalent` for the comparisons. I don't see a problem with adding things to the test methods; these are meant to be very fine-grained.

So, for example, `.equals` on the example (which compares None/np.nan) should fail. These are not exactly equal. We should not add things to `.equals` (as this opens a big can of worms).
Comment From: jreback
From above, this is actually wrong and we should strictly check this.

```python
In [18]: df1.equals(df2)
Out[18]: True
```
Comment From: TomAugspurger
I think this also affects `pd.NA`.

```python
In [45]: a = pd.Series([1, np.nan], dtype=object)

In [46]: b = pd.Series([1, pd.NA], dtype=object)

In [47]: pd.testing.assert_series_equal(a, b)
```

A keyword to control this would be great. Based on the recent experience with `check_freq`, we should perhaps not make things strict by default. But we could do non-strict plus a warning by default, with the option to override in the function or via a global option.
```python
def assert_series_equal(..., check_na=None):
    if check_na is None:
        check_na = pandas.config.get("testing.check_na")
        if check_na is None:
            warnings.warn("default to false, changing to strict in the future")
            check_na = False
```
Comment From: jbrockmendel
In `_libs.testing.assert_almost_equal` we have

```python
if checknull(a) and checknull(b):
    # TODO: Should require same-dtype NA?
    # nan / None comparison
    return True
```
Changing that to

```python
if checknull(a):
    # nan / None comparison
    if is_matching_na(a, b):
        return True
    raise AssertionError(f"{a} != {b}")
elif checknull(b):
    raise AssertionError(f"{a} != {b}")
```

breaks 431 tests. 224 of those are in tests/strings/, 76 in tests/arithmetic/.
`is_matching_na` has a `nan_matches_none` keyword we could use to be less strict specifically in the None vs np.nan case, while still catching e.g. `np.timedelta64("nat")` vs `pd.NaT` (relevant in the tests/arithmetic/ cases). Passing that gets us down to 308 failing tests.

I definitely think we should become stricter here. As Tom mentioned, and as when we made `check_freq` stricter, doing this probably needs a deprecation cycle.
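A pure-Python sketch of the `is_matching_na` semantics described above (the real helper lives in pandas' `_libs` and also covers `pd.NaT`, `pd.NA`, and dtype-specific NaTs; this sketch handles only None and float NaN):

```python
import numpy as np

def is_matching_na_sketch(a, b, nan_matches_none=False):
    """Missing values only match missing values of the same kind, unless
    nan_matches_none relaxes the None-vs-np.nan distinction."""
    a_nan = isinstance(a, float) and np.isnan(a)
    b_nan = isinstance(b, float) and np.isnan(b)
    if a is None:
        return b is None or (nan_matches_none and b_nan)
    if a_nan:
        return b_nan or (nan_matches_none and b is None)
    return False  # a is not a missing value this sketch recognizes
```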