Pandas check_names=False parameter for pandas.util.testing.assert_frame_equal applies to index.names but not columns.names

edit by @TomAugspurger

The check_names docstring for pandas.util.testing.assert_frame_equal is unclear:

check_names : bool, default True
    Whether to check the Index names attribute.

This should clarify that both the index and columns names attribute are checked.

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.util.testing import assert_frame_equal

df1 = pd.DataFrame({'A':[1.0]})
df2 = pd.DataFrame({'B':[1.0]})

assert_frame_equal(df1, df2, check_names=False)

""" will return:
In [7]: assert_frame_equal(df1, df2, check_names=False)
Traceback (most recent call last):

  File "<ipython-input-7-d273edeeb6af>", line 1, in <module>
    assert_frame_equal(df1, df2, check_names=False)

  File "<snipped>/lib/python3.5/site-packages/pandas/util/testing.py", line 1372, in assert_frame_equal
    obj='{obj}.columns'.format(obj=obj))

  File "<snipped>/lib/python3.5/site-packages/pandas/util/testing.py", line 927, in assert_index_equal
    obj=obj, lobj=left, robj=right)

  File "pandas/_libs/testing.pyx", line 59, in pandas._libs.testing.assert_almost_equal

  File "pandas/_libs/testing.pyx", line 173, in pandas._libs.testing.assert_almost_equal

  File "<snipped>/lib/python3.5/site-packages/pandas/util/testing.py", line 1093, in raise_assert_detail
    raise AssertionError(msg)

AssertionError: DataFrame.columns are different

DataFrame.columns values are different (100.0 %)
[left]:  Index(['A'], dtype='object')
[right]: Index(['B'], dtype='object')
"""

Problem description

When the parameter check_names=False is set for assert_frame_equal, the index and columns names should be ignored in the comparison, but an assertion error is still raised if the index or columns names are different. This is the same behaviour as when check_names=True (the default) is set, and the opposite of what I believe is intended.

Expected Output

The expected output for the case above should be nothing - a valid assertion.

Output of `pd.show_versions()`

In [8]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.13.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: TomAugspurger

From the docstring:

check_names : bool, default True
    Whether to check the Index names attribute.

So it checks if left.index.names == right.index.names. Same for .columns, so everything is correct I think.

Do you have a usecase for ignoring the actual column labels themselves?

Comment From: willjbrown88

Ah OK. Since assert_frame_equal can raise both "AssertionError: DataFrame.index are different" and "AssertionError: DataFrame.columns are different", I had assumed that either the index or the columns could differ when check_names=False was set.

To me, check_names implies that both index names and column names are checked (as they are), and that with check_names=False either may differ. In practice only the index name can differ. Perhaps the docstring should clarify that column names are also checked, but cannot differ regardless of this parameter setting.

I can't speak for it being a common use case but yes - I test if various data processing functions can handle df populated by reading from different format source files. The df get assigned column names using whatever columns names are provided by the file or are assigned by user input. My tests only know that the column order for the parts of the df I'm interested in checking should be the same in every case. So the data values, index values and column order of the df should match, but column names and index names don't have to. I could simply assign a temporary set of column names internally of course.

Comment From: TomAugspurger

Well, the docstring could be improved. Both df.index.names and df.columns.names are checked. That's still different from your original issue though, which was about the values.

I could simply assign a temporary set of column names internally of course.

I'd recommend doing that. I don't think changing assert_frame_equal to ignore index / column labels is generally useful enough to warrant a parameter.
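
The workaround suggested above could look something like the following. This is only a minimal sketch: the helper name assert_frame_equal_ignoring_labels is hypothetical and not part of pandas, and it assumes the public pandas.testing import path (available in recent pandas releases) rather than the pandas.util.testing path used earlier in this thread.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_frame_equal_ignoring_labels(left, right, **kwargs):
    """Compare two frames column-by-position, ignoring column labels.

    Hypothetical helper for illustration; not a pandas API.
    """
    # The frames must have the same number of columns to be compared by position.
    assert left.shape[1] == right.shape[1], "frames have different numbers of columns"

    # Work on copies so the caller's frames keep their original labels.
    left = left.copy()
    right = right.copy()

    # Assign the same temporary positional labels to both frames.
    placeholder = pd.RangeIndex(left.shape[1])
    left.columns = placeholder
    right.columns = placeholder

    assert_frame_equal(left, right, **kwargs)


df1 = pd.DataFrame({'A': [1.0]})
df2 = pd.DataFrame({'B': [1.0]})

# The labels 'A' and 'B' differ, but the values and column order match.
assert_frame_equal_ignoring_labels(df1, df2, check_names=False)
```

Passing check_names=False through still matters here if the index names of the two frames may differ.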

Comment From: willjbrown88

I've been a little unclear here still I think.

I don't think changing assert_frame_equal to ignore index / column labels is generally useful enough to warrant a parameter

This is exactly what check_names is for and is doing, but only for df.index.names. check_names=False allows for the case of left.index.names != right.index.names to pass assert_frame_equal. My issue was that I assumed check_names=False would also pass left.columns.names != right.columns.names, as I was intending to use it.

If the latter isn't a common use, then I agree it's not worth changes. In that case, my vote would be to simply rename check_names to something like check_index_names, as that is exactly what it does, and all that bool setting applies to.

Comment From: TomAugspurger

@willjbrown88

In [3]: pd.util.testing.assert_frame_equal(
   ...:     pd.DataFrame(columns=pd.Index(['a', 'b'], name='c1')),
   ...:     pd.DataFrame(columns=pd.Index(['a', 'b'], name='c2'))
   ...: )
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-653d160e6f54> in <module>()
      1 pd.util.testing.assert_frame_equal(
      2     pd.DataFrame(columns=pd.Index(['a', 'b'], name='c1')),
----> 3     pd.DataFrame(columns=pd.Index(['a', 'b'], name='c2'))
      4 )

~/sandbox/pandas-ip/pandas/pandas/util/testing.py in assert_frame_equal(left, right, check_dtype, check_index_type, check_column_type, check_frame_type, check_less_precise, check_names, by_blocks, check_exact, check_datetimelike_compat, check_categorical, check_like, obj)
   1285                        check_exact=check_exact,
   1286                        check_categorical=check_categorical,
-> 1287                        obj='{obj}.columns'.format(obj=obj))
   1288
   1289     # compare by blocks

~/sandbox/pandas-ip/pandas/pandas/util/testing.py in assert_index_equal(left, right, exact, check_names, check_less_precise, check_exact, check_categorical, obj)
    844     # metadata comparison
    845     if check_names:
--> 846         assert_attr_equal('names', left, right, obj=obj)
    847     if isinstance(left, pd.PeriodIndex) or isinstance(right, pd.PeriodIndex):
    848         assert_attr_equal('freq', left, right, obj=obj)

~/sandbox/pandas-ip/pandas/pandas/util/testing.py in assert_attr_equal(attr, left, right, obj)
    921     else:
    922         msg = 'Attribute "{attr}" are different'.format(attr=attr)
--> 923         raise_assert_detail(obj, msg, left_attr, right_attr)
    924
    925

~/sandbox/pandas-ip/pandas/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
   1006         msg += "\n[diff]: {diff}".format(diff=diff)
   1007
-> 1008     raise AssertionError(msg)
   1009
   1010

AssertionError: DataFrame.columns are different

Attribute "names" are different
[left]:  ['c1']
[right]: ['c2']

In [4]: pd.util.testing.assert_frame_equal(
   ...:     pd.DataFrame(columns=pd.Index(['a', 'b'], name='c1')),
   ...:     pd.DataFrame(columns=pd.Index(['a', 'b'], name='c2')),
   ...:     check_names=False
   ...: )

Comment From: cchummer

Ran into exactly this behavior/issue this evening and was stumped for a bit. I would indeed appreciate the docs being cleared up to state that differing columns will throw an exception and cannot be ignored via check_names. In any interpretation of English I have ever come across, a flag determining "whether to check that the names attribute for both the index and column attributes of the DataFrame is identical" should, when set to False, not check for, or raise an exception on, differing column names. For example, the output of my attempted comparison of two dataframes with differing columns: DataFrame.columns values are different (66.66667 %)

Comment From: willjbrown88

I think the confusion stems from the inconsistent naming conventions used for column name attributes:

For the example:

import pandas as pd
In [0]: df = pd.DataFrame(columns=pd.Index(['a', 'b'], name='c'), index=pd.Index([0], name='d'))
In [1]: df
Out[1]: 
c    a    b
d          
0  NaN  NaN

The index has df.index.name: d and df.index.values: 0. df.index.name is not checked when check_names=False.
The set of columns has df.columns.name: c and df.columns.values: a, b. df.columns.name is not checked when check_names=False.
The column series themselves have df.a.name: a and df.b.name: b and e.g. df.a.values: NaN. These "names" are not treated as names but as values under df.columns.values, which are still checked when check_names=False.
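
To make the three cases above concrete, here is a small sketch. The make_frame helper and the variable names are made up for illustration, and the pandas.testing import path is an assumption (the thread itself uses pandas.util.testing). It compares a frame against one that differs only in index.name and columns.name, and against one that differs only in the column labels.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def make_frame(labels, columns_name, index_name):
    """Build a one-row frame with named index and named columns (illustrative helper)."""
    return pd.DataFrame(
        [[1.0, 2.0]],
        columns=pd.Index(labels, name=columns_name),
        index=pd.Index([0], name=index_name),
    )


left = make_frame(['a', 'b'], 'c', 'd')

# Same labels and values; only index.name and columns.name differ.
renamed_axes = make_frame(['a', 'b'], 'c2', 'd2')

# Same axis names and values; only the column labels differ.
relabelled = make_frame(['x', 'y'], 'c', 'd')

# Differences in index.name / columns.name are ignored with check_names=False.
assert_frame_equal(left, renamed_axes, check_names=False)

# Differences in the column labels (df.columns.values) are still reported.
try:
    assert_frame_equal(left, relabelled, check_names=False)
except AssertionError as err:
    labels_error = str(err)
else:
    labels_error = None
```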

While the check_names parameter description was updated, I don't think the update makes the check_names=False behaviour any clearer where column "names" are concerned.

Also just spotted that check_names=False behaviour doesn't have an explicit test case, so there is no assurance this behaviour will be maintained.
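
A test along these lines might pin the behaviour down. This is a sketch only, not taken from the pandas test suite: the class and method names are invented, it uses the standard-library unittest module for self-containment, and it assumes the pandas.testing import path. The frames are the ones from the c1/c2 example earlier in the thread.

```python
import unittest

import pandas as pd
from pandas.testing import assert_frame_equal


class CheckNamesFalseTest(unittest.TestCase):
    """Hypothetical regression tests for check_names on assert_frame_equal."""

    def setUp(self):
        # Two empty frames whose columns differ only in their names attribute.
        self.left = pd.DataFrame(columns=pd.Index(['a', 'b'], name='c1'))
        self.right = pd.DataFrame(columns=pd.Index(['a', 'b'], name='c2'))

    def test_differing_columns_name_raises_by_default(self):
        # check_names defaults to True, so the differing names are reported.
        with self.assertRaises(AssertionError):
            assert_frame_equal(self.left, self.right)

    def test_differing_columns_name_ignored_when_disabled(self):
        # With check_names=False the differing columns.name is not compared.
        assert_frame_equal(self.left, self.right, check_names=False)

    def test_differing_column_labels_still_raise(self):
        # The labels themselves are values, so check_names=False does not skip them.
        relabelled = pd.DataFrame(columns=pd.Index(['x', 'y'], name='c1'))
        with self.assertRaises(AssertionError):
            assert_frame_equal(self.left, relabelled, check_names=False)
```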
