edit by @TomAugspurger
The check_names docstring for pandas.util.testing.assert_frame_equal is unclear:
check_names : bool, default True
Whether to check the Index names attribute.
The docstring should clarify that the names attribute of both the index and the columns is checked.
Code Sample, a copy-pastable example if possible
import pandas as pd
from pandas.util.testing import assert_frame_equal
df1 = pd.DataFrame({'A':[1.0]})
df2 = pd.DataFrame({'B':[1.0]})
assert_frame_equal(df1, df2, check_names=False)
""" will return:
In [7]: assert_frame_equal(df1, df2, check_names=False)
Traceback (most recent call last):
File "<ipython-input-7-d273edeeb6af>", line 1, in <module>
assert_frame_equal(df1, df2, check_names=False)
File "<snipped>/lib/python3.5/site-packages/pandas/util/testing.py", line 1372, in assert_frame_equal
obj='{obj}.columns'.format(obj=obj))
File "<snipped>/lib/python3.5/site-packages/pandas/util/testing.py", line 927, in assert_index_equal
obj=obj, lobj=left, robj=right)
File "pandas/_libs/testing.pyx", line 59, in pandas._libs.testing.assert_almost_equal
File "pandas/_libs/testing.pyx", line 173, in pandas._libs.testing.assert_almost_equal
File "<snipped>/lib/python3.5/site-packages/pandas/util/testing.py", line 1093, in raise_assert_detail
raise AssertionError(msg)
AssertionError: DataFrame.columns are different
DataFrame.columns values are different (100.0 %)
[left]: Index(['A'], dtype='object')
[right]: Index(['B'], dtype='object')
"""
Problem description
When the parameter check_names=False is set for assert_frame_equal, the index and columns names should be ignored in the comparison, but an assertion error is still raised if the index or columns names are different. This is the same behaviour as when check_names=True (the default) is set, and the opposite of what I believe is intended.
Expected Output
The expected output for the case above should be nothing - a valid assertion.
Output of pd.show_versions()
Comment From: TomAugspurger
From the docstring:
check_names : bool, default True
Whether to check the Index names attribute.
So it checks if left.index.names == right.index.names. Same for .columns, so everything is correct I think.
Do you have a use case for ignoring the actual column labels themselves?
Comment From: willjbrown88
Ah ok, I guess I'd assumed that, as both:
AssertionError: DataFrame.index are different
and
AssertionError: DataFrame.columns are different
are possible errors from assert_frame_equal, either the index or the columns could differ if check_names=False was set.
To me, check_names implies that both index names and column names are checked, as they are, but also that either may differ when it is set to False. In practice only the index name can differ. Perhaps the docstring should clarify that column names are also checked, but cannot differ regardless of this parameter setting.
I can't speak for it being a common use case, but yes: I test whether various data processing functions can handle DataFrames populated by reading source files in different formats. The DataFrames get assigned column names using whatever column names are provided by the file or by user input. My tests only know that the column order for the parts of the DataFrame I'm interested in checking should be the same in every case. So the data values, index values and column order should match, but column names and index names don't have to. I could simply assign a temporary set of column names internally of course.
Comment From: TomAugspurger
Well, the docstring could be improved. Both df.index.names and df.columns.names are checked. That's still different from your original issue though, which was about the values.
I could simply assign a temporary set of column names internally of course.
I'd recommend doing that. I don't think changing assert_frame_equal to ignore index / column labels is generally useful enough to warrant a parameter.
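The workaround of assigning temporary column names could be sketched roughly as below. This is only an illustration: the helper name assert_frame_equal_ignoring_labels is hypothetical and not part of pandas, and the import uses pandas.testing (the public location in newer pandas releases) instead of the pandas.util.testing path used elsewhere in this thread.

```python
import pandas as pd
from pandas.testing import assert_frame_equal  # assumption: newer pandas; the thread uses pandas.util.testing


def assert_frame_equal_ignoring_labels(left, right, **kwargs):
    """Hypothetical helper: compare data values, index values and column order only."""
    assert left.shape[1] == right.shape[1], "column counts differ"
    # Work on copies so the callers' frames keep their original labels.
    left, right = left.copy(), right.copy()
    # Replace the column labels on both sides with positional placeholders 0..n-1.
    placeholder = pd.RangeIndex(left.shape[1])
    left.columns = placeholder
    right.columns = placeholder
    # check_names=False additionally ignores differing index.names.
    kwargs.setdefault("check_names", False)
    assert_frame_equal(left, right, **kwargs)


df1 = pd.DataFrame({"A": [1.0]})
df2 = pd.DataFrame({"B": [1.0]})
assert_frame_equal_ignoring_labels(df1, df2)  # passes: only the labels differ
```

Because the relabelling happens on copies, the frames under test are left untouched, and differing data values still raise an AssertionError.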
Comment From: willjbrown88
I think I've still been a little unclear here.
I don't think changing assert_frame_equal to ignore index / column labels is generally useful enough to warrant a parameter
This is exactly what check_names is for and is doing, but only for df.index.names. check_names=False allows for the case of left.index.names != right.index.names to pass assert_frame_equal. My issue was that I assumed check_names=False would also pass left.columns.names != right.columns.names, as I was intending to use it.
If the latter isn't a common use, then I agree it's not worth changes. In that case, my vote would be to simply rename check_names to something like check_index_names, as that is exactly what it does, and all that bool setting applies to.
Comment From: TomAugspurger
@willjbrown88
In [3]: pd.util.testing.assert_frame_equal(
...: pd.DataFrame(columns=pd.Index(['a', 'b'], name='c1')),
...: pd.DataFrame(columns=pd.Index(['a', 'b'], name='c2'))
...: )
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-3-653d160e6f54> in <module>()
1 pd.util.testing.assert_frame_equal(
2 pd.DataFrame(columns=pd.Index(['a', 'b'], name='c1')),
----> 3 pd.DataFrame(columns=pd.Index(['a', 'b'], name='c2'))
4 )
~/sandbox/pandas-ip/pandas/pandas/util/testing.py in assert_frame_equal(left, right, check_dtype, check_index_type, check_column_type, check_frame_type, check_less_precise, check_names, by_blocks, check_exact, check_datetimelike_compat, check_categorical, check_like, obj)
1285 check_exact=check_exact,
1286 check_categorical=check_categorical,
-> 1287 obj='{obj}.columns'.format(obj=obj))
1288
1289 # compare by blocks
~/sandbox/pandas-ip/pandas/pandas/util/testing.py in assert_index_equal(left, right, exact, check_names, check_less_precise, check_exact, check_categorical, obj)
844 # metadata comparison
845 if check_names:
--> 846 assert_attr_equal('names', left, right, obj=obj)
847 if isinstance(left, pd.PeriodIndex) or isinstance(right, pd.PeriodIndex):
848 assert_attr_equal('freq', left, right, obj=obj)
~/sandbox/pandas-ip/pandas/pandas/util/testing.py in assert_attr_equal(attr, left, right, obj)
921 else:
922 msg = 'Attribute "{attr}" are different'.format(attr=attr)
--> 923 raise_assert_detail(obj, msg, left_attr, right_attr)
924
925
~/sandbox/pandas-ip/pandas/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
1006 msg += "\n[diff]: {diff}".format(diff=diff)
1007
-> 1008 raise AssertionError(msg)
1009
1010
AssertionError: DataFrame.columns are different
Attribute "names" are different
[left]: ['c1']
[right]: ['c2']
In [4]: pd.util.testing.assert_frame_equal(
...: pd.DataFrame(columns=pd.Index(['a', 'b'], name='c1')),
...: pd.DataFrame(columns=pd.Index(['a', 'b'], name='c2')),
...: check_names=False
...: )
Comment From: cchummer
Ran into exactly this behavior/issue this evening and was stumped for a bit. I would appreciate the docs being cleared up to state that differing columns will throw an exception and cannot be ignored via check_names. In any interpretation of English I have ever come across, a flag determining "whether to check that the names attribute for both the index and column attributes of the DataFrame is identical" should, when set to False, not check or raise an exception or error upon differing column names. For example, the output of my attempted comparison of two dataframes with differing columns: `DataFrame.columns values are different (66.66667 %)`
Comment From: willjbrown88
I think the confusion stems from the inconsistent naming conventions used for column name attributes:
For the example:
import pandas as pd
In [0]: df = pd.DataFrame(columns=pd.Index(['a', 'b'], name='c'), index=pd.Index([0], name='d'))
In [1]: df
Out[1]:
c    a    b
d
0  NaN  NaN
- The index has df.index.name: d and df.index.values: 0. df.index.name is not checked when check_names=False.
- The set of columns has df.columns.name: c and df.columns.values: a, b. df.columns.name is not checked when check_names=False.
- The column series themselves have df.a.name: a and df.b.name: b and e.g. df.a.values: NaN. These "names" are not considered names but values under df.columns.values, which are still checked when check_names=False.
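The three kinds of "name" listed above can be inspected directly. The snippet below is a small sketch that rebuilds the example frame; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    columns=pd.Index(["a", "b"], name="c"),
    index=pd.Index([0], name="d"),
)

index_name = df.index.name               # 'd': names attribute of the row index
columns_name = df.columns.name           # 'c': names attribute of the columns Index
column_labels = list(df.columns.values)  # ['a', 'b']: the column labels (values)
series_name = df["a"].name               # 'a': a column Series is named after its label

print(index_name, columns_name, column_labels, series_name)
```

Only the first two are "names" in the sense of check_names; the column labels are values of df.columns and are compared regardless.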
While the check_names parameter description was updated, I don't think the update made the check_names=False behaviour any clearer where column "names" are concerned.
Also just spotted that the check_names=False behaviour doesn't have an explicit test case, so there is no assurance this behaviour will be maintained.