Code to replicate problem
Please see this link to download the referenced files.
import pandas as pd

df = pd.read_pickle('bug_df.pickle')
df2 = pd.read_pickle('bug_s2c.pickle')
try:
    assert(df.equals(df2.loc[df.index]))
except AssertionError:
    print("This fails - but why?")
    pass
assert(set(df2.index.values) == set(df.index.values)) # Indexes have the same values, but maybe in different order
df_list = df.values.tolist()
df2_list = df2.loc[df.index].values.tolist() # Ensure index is in the same order...
vc = [x == y for (x, y) in zip(df_list, df2_list)] # vc is a list of bools, one per row
try:
    assert(all(vc))
except AssertionError:
    print("This fails, so must be different values. But which one...?")
    pass
idx = vc.index(False)
print(df_list[idx])
print(df2_list[idx])
print("Hmmm... lists look the same...")
print(len(df2_list[idx]) == len(df_list[idx])) # Check length equality
df2_el = df2_list[idx]
df_el = df_list[idx]
el_comp = [x == y for (x,y) in zip(df2_el, df_el)]
print(el_comp) # First element is different
el_diff = el_comp.index(False)
print("Clearly its the first element that is different, but the first element in"
      " both lists are NaNs - pandas is meant to treat them as equal.")
print("Are the first elements even NaNs...?")
print(pd.isna(df_el[el_diff]))
print(pd.isna(df2_el[el_diff]))
print("... so yes, they are apprently are. How is this possible...?")
This produces the output:
This fails - but why?
This fails, so must be different values. But which one...?
[nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[nan, nan, nan, nan, nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Hmmm... lists look the same...
True
[False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]
Clearly its the first element that is different, but the first element in both lists are NaNs - pandas is meant to treat them as equal.
Are the first elements even NaNs...?
True
True
... so yes, they are apprently are. How is this possible...?
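A note on the diagnostic above: plain `==` follows IEEE 754, where NaN never equals NaN, so the row-by-row list comparison reports a mismatch for any row containing NaN, whether or not the frames actually differ. Below is a minimal sketch of a NaN-aware element comparison. The helper `same_value` and the two sample rows are hypothetical stand-ins, not values taken from the pickled files.

```python
import math


def same_value(x, y):
    """Equality that also treats two float NaNs as equal (hypothetical helper)."""
    both_nan = (
        isinstance(x, float)
        and isinstance(y, float)
        and math.isnan(x)
        and math.isnan(y)
    )
    return both_nan or x == y


# Made-up rows shaped like the printed output: a NaN followed by numbers.
row_a = [float("nan"), 0.0, 1.5]
row_b = [float("nan"), 0.0, 1.5]

# Plain == marks the NaN position as different.
plain = [x == y for x, y in zip(row_a, row_b)]
# The NaN-aware comparison marks it as the same.
nan_aware = [same_value(x, y) for x, y in zip(row_a, row_b)]

print(plain)      # [False, True, True]
print(nan_aware)  # [True, True, True]
```

With a comparison like this, a `False` entry points at a genuine value difference rather than at a NaN.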
Problem description
With the DataFrame.equals() method, pandas does not seem to treat NaNs in the same location as equal, contrary to what the documentation states.
The problem occurs with Pandas v0.22.
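For reference, here is a small sketch of the documented behaviour on made-up data. The frames `left`, `right`, and `shuffled` are illustrative assumptions, not the pickled files. NaNs in the same location compare equal under `equals`, while a different row order alone makes it return False until the rows are realigned.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with NaNs in matching positions.
left = pd.DataFrame({"a": [np.nan, 0.0], "b": [1.0, np.nan]}, index=["x", "y"])
right = left.copy()

same_layout = left.equals(right)
print(same_layout)  # True: NaNs in the same location count as equal

# Same labels and values, but rows in a different order.
shuffled = right.loc[["y", "x"]]
different_order = left.equals(shuffled)
print(different_order)  # False: the index order differs

# Realigning the rows to left's index restores equality.
realigned = left.equals(shuffled.loc[left.index])
print(realigned)  # True
```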
Expected Output
The first assert statement should not fail. Equivalently, "This fails - but why?" should not be printed.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-112-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.0.3
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.8.0
bs4: 4.5.1
html5lib: 1.0b10
sqlalchemy: 1.1.3
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comment From: jreback
These aren't even close to being equal: they have different dtypes as well as some different values. You need a simple example.
In [19]: tm.assert_frame_equal(pd.read_pickle('bug_df.pickle'), pd.read_pickle('bug_s2c.pickle'))
DataFrame.index values are different (100.0 %)
[left]: Index(['S', '1 SheepCattle', '2 DairyCattle', '3 OtherAnimals', '4 Crops',
'5 OtherAg', '6 FishHuntTrap', '7 ForestryLogs', '8 AgSrv', '9 Coal',
'10 Oil', '11 GAS', '12 LNG', '13 IronOre', '14 NonFeOres',
'15 NonMetMins', '16 MiningSrv', '17 MeatProds', '18 DairyProds',
'19 OtherFood', '20 Beverages', '21 TCF', '22 WoodProds',
'23 PulpPaper', '24 Printing', '25 RefineProd', '26 Chemicals',
'27 PlasticRub', '28 NonMetalMin', '29 CementLime', '30 IronSteel',
'31 Aluminium', '32 OtherNonFeMt', '33 MetalProds', '34 MVPOtherTran',
'35 OtherEquip', '36 OtherMan', '37 ElecCoal', '38 ElecGas',
'39 ElecHydro', '40 ElecOther', '41 ElecNuclear', '42 ElecSupply',
'43 GasSupply', '44 WaterDrains', '45 ResidCons', '46 NonResidCons',
'47 ConsSrv', '48 WholeTrade', '49 RetailTrade', '50 AccomFood',
'51 RoadFreight', '52 RoadPass', '53 RailFreight', '54 RailPass',
'55 Pipeline', '56 WaterTrans', '57 AirTrans', '58 Commun',
'59 Banking', '60 Finance', '61 Insurance', '62 DwellingLow',
'63 DwellingHigh', '64 Rental', '65 RealEstate', '66 OthBusServ',
'67 PubAdminReg', '68 Defence', '69 Education', '70 HealthSrv',
'71 ResidCare', '72 Culture', '73 Gambling', '74 Repairs',
'75 OtherSrv', '76 PrivTranServ'],
dtype='object')
[right]: Index(['1 SheepCattle', '10 Oil', '11 GAS', '12 LNG', '13 IronOre',
'14 NonFeOres', '15 NonMetMins', '16 MiningSrv', '17 MeatProds',
'18 DairyProds', '19 OtherFood', '2 DairyCattle', '20 Beverages',
'21 TCF', '22 WoodProds', '23 PulpPaper', '24 Printing',
'25 RefineProd', '26 Chemicals', '27 PlasticRub', '28 NonMetalMin',
'29 CementLime', '3 OtherAnimals', '30 IronSteel', '31 Aluminium',
'32 OtherNonFeMt', '33 MetalProds', '34 MVPOtherTran', '35 OtherEquip',
'36 OtherMan', '37 ElecCoal', '38 ElecGas', '39 ElecHydro', '4 Crops',
'40 ElecOther', '41 ElecNuclear', '42 ElecSupply', '43 GasSupply',
'44 WaterDrains', '45 ResidCons', '46 NonResidCons', '47 ConsSrv',
'48 WholeTrade', '49 RetailTrade', '5 OtherAg', '50 AccomFood',
'51 RoadFreight', '52 RoadPass', '53 RailFreight', '54 RailPass',
'55 Pipeline', '56 WaterTrans', '57 AirTrans', '58 Commun',
'59 Banking', '6 FishHuntTrap', '60 Finance', '61 Insurance',
'62 DwellingLow', '63 DwellingHigh', '64 Rental', '65 RealEstate',
'66 OthBusServ', '67 PubAdminReg', '68 Defence', '69 Education',
'7 ForestryLogs', '70 HealthSrv', '71 ResidCare', '72 Culture',
'73 Gambling', '74 Repairs', '75 OtherSrv', '76 PrivTranServ',
'8 AgSrv', '9 Coal', 'S'],
dtype='object')
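The output above reports only the index mismatch. The dtype difference mentioned in the comment can be checked separately once the rows are aligned. Here is a rough sketch on invented data. The two frames are assumptions standing in for the pickles, and only two of the index labels are reused for illustration.

```python
import pandas as pd

# Hypothetical stand-ins: same labels, different row order, different dtype.
left = pd.DataFrame({"v": [1.0, 2.0]}, index=["S", "1 SheepCattle"])
right = pd.DataFrame({"v": [2, 1]}, index=["1 SheepCattle", "S"])

# Align the rows first so that only values and dtypes can still differ.
aligned = right.loc[left.index]

# Compare the per-column dtypes as plain Python lists.
dtypes_match = left.dtypes.tolist() == aligned.dtypes.tolist()
print(dtypes_match)  # False: float64 vs int64

# equals requires matching dtypes, so this is False despite equal values.
equal_before_cast = left.equals(aligned)
print(equal_before_cast)  # False

# Casting to left's dtypes removes the remaining difference.
equal_after_cast = left.equals(aligned.astype(left.dtypes.to_dict()))
print(equal_after_cast)  # True
```

Checking index order and dtypes before calling `equals` narrows a failure down to the values themselves.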
Comment From: ghcn
This is not a bug. nan == nan is False.
Comment From: charlie0389
Read the documentation (v0.22):
"Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal."
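Both statements hold, at different levels: element-wise `==` follows IEEE 754 and reports NaN as unequal to itself, while `equals` is documented to treat NaNs in the same location as equal. A short sketch contrasting the two on a made-up Series (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

nan = float("nan")
scalar_eq = nan == nan
print(scalar_eq)  # False: NaN is never equal to itself under ==

s = pd.Series([np.nan, 0.0])

# Element-wise == keeps the IEEE 754 behaviour at the NaN position.
elementwise = (s == s).tolist()
print(elementwise)  # [False, True]

# equals compares whole objects and treats same-location NaNs as equal.
whole_object = s.equals(s.copy())
print(whole_object)  # True
```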