I am not getting the expected output of the command a.combine_first(b)
. All values are NaN. As you will see in my code, I want my index and columns to be the union of the indices and columns and the values to be of a and only of b if it does not exist in a or is NaN in a. Hence, I do not believe DataFrame.merge is appropriate for this task, but I may be wrong.
Here is the code to reproduce:
from __future__ import print_function
import pandas as pd
import dateutil.parser
import numpy as np
date = dateutil.parser.parse('2016-05-03').date()
a_values = [[100.0, 0.0, 0, 0, 100], [97.29, 2.70, np.nan, 0, 100], [97.33, 2.66, 0, 0, 100]]
a_index = pd.MultiIndex.from_tuples([('an', date), ('rc', date), ('avg/total', '')], names=['ls', 'date'])
a_columns = ['complete', 'no', 'over', 'partial', 'total']
a = pd.DataFrame(a_values, index=a_index, columns=a_columns)
b_values = np.arange(5 * 4).reshape(5, 4)
b_index = pd.MultiIndex.from_product([['ad', 'aj', 'an', 'rc', 'ra'], pd.Index([date])], names=['ls', 'date'])
b_columns = ['no', 'partial', 'complete', 'over']
b = pd.DataFrame(b_values, index=b_index, columns=b_columns)
c_values = [[2, 0, 3, 1, np.nan], [6, 4, 7, 5, np.nan], [100.0, 0.0, 0, 0, 100], [97.33, 2.66, 0, 0, 100], [14, 12, 15, 13, np.nan], [97.29, 2.70,\
14, 0, 100]]
c_index = pd.MultiIndex.from_tuples([('ad', date), ('aj', date), ('an', date), ('avg/total', ''), ('ra', date), ('rc', date)], names=['ls', 'date'\
])
c_columns = ['complete', 'no', 'over', 'partial', 'total']
c = pd.DataFrame(c_values, index=c_index, columns=c_columns)
print('=========================')
print('DataFrame a:')
print(a)
print()
print('DataFrame b:')
print(b)
print()
print('=========================')
print()
print("a.index[0] == b.index[2]? {0}".format(a.index[0] == b.index[2]))
print("a.index[1] == b.index[3]? {0}".format(a.index[1] == b.index[3]))
print()
print('=========================')
print()
print("a.combine_first(b):")
print(a.combine_first(b))
print()
print("expected:")
print(c)
And here is the expected output, along with some intermediate steps:
=========================
DataFrame a:
complete no over partial total
ls date
an 2016-05-03 100.00 0.00 0 0 100
rc 2016-05-03 97.29 2.70 NaN 0 100
avg/total 97.33 2.66 0 0 100
DataFrame b:
no partial complete over
ls date
ad 2016-05-03 0 1 2 3
aj 2016-05-03 4 5 6 7
an 2016-05-03 8 9 10 11
rc 2016-05-03 12 13 14 15
ra 2016-05-03 16 17 18 19
=========================
a.index[0] == b.index[2]? True
a.index[1] == b.index[3]? True
=========================
a.combine_first(b):
complete no over partial total
ls date
ad 2016-05-03 NaN NaN NaN NaN NaN
aj 2016-05-03 NaN NaN NaN NaN NaN
an 2016-05-03 NaN NaN NaN NaN NaN
avg/total NaT NaN NaN NaN NaN NaN
ra 2016-05-03 NaN NaN NaN NaN NaN
rc 2016-05-03 NaN NaN NaN NaN NaN
expected:
complete no over partial total
ls date
ad 2016-05-03 2.00 0.00 3 1 NaN
aj 2016-05-03 6.00 4.00 7 5 NaN
an 2016-05-03 100.00 0.00 0 0 100
avg/total NaT 97.33 2.66 0 0 100
ra 2016-05-03 14.00 12.00 15 13 NaN
rc 2016-05-03 97.29 2.70 14 0 100
output of pd.show_versions()
In [270]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.11.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.2
nose: 1.3.3
Cython: 0.19.2
numpy: 1.8.0
scipy: 0.13.0
statsmodels: 0.5.0
IPython: 0.12.1
sphinx: None
patsy: 0.2.1
dateutil: 1.5
pytz: 2013.7
bottleneck: None
tables: 3.0.0
numexpr: 2.2.2
matplotlib: 1.4.2
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.7.2
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: None
Comment From: jreback
this is not a copy-pastable example, which means copying code not output.
this might be an error, but you should be using merge anyhow.
Comment From: bcm939
Apologies. I've updated the issue, hopefully adequately.
Comment From: jreback
So this doesn't align.
In [34]: a.align(b)
Out[34]:
( complete no over partial total
ls date
ad 2016-05-03 NaN NaN NaN NaN NaN
aj 2016-05-03 NaN NaN NaN NaN NaN
an 2016-05-03 NaN NaN NaN NaN NaN
avg/total NaT NaN NaN NaN NaN NaN
ra 2016-05-03 NaN NaN NaN NaN NaN
rc 2016-05-03 NaN NaN NaN NaN NaN,
complete no over partial total
ls date
ad 2016-05-03 NaN NaN NaN NaN NaN
aj 2016-05-03 NaN NaN NaN NaN NaN
an 2016-05-03 NaN NaN NaN NaN NaN
avg/total NaT NaN NaN NaN NaN NaN
ra 2016-05-03 NaN NaN NaN NaN NaN
rc 2016-05-03 NaN NaN NaN NaN NaN)
The problem is you are using datetime.date
, which is barely a supported type in pandas. It is non-performant and just makes a mess of things. just don't use it. This works correctly when Timestamps
are used.
In [14]: date
Out[14]: datetime.date(2016, 5, 3)
In [15]: date = Timestamp(date)
In [16]: date
Out[16]: Timestamp('2016-05-03 00:00:00')
In [17]: a_index = pd.MultiIndex.from_tuples([('an', date), ('rc', date), ('avg/total', '')], names=['ls', 'date'])
In [18]: b_index = pd.MultiIndex.from_product([['ad', 'aj', 'an', 'rc', 'ra'], pd.Index([date])], names=['ls', 'date'])
In [19]: a = pd.DataFrame(a_values, index=a_index, columns=a_columns)
In [20]: b = pd.DataFrame(b_values, index=b_index, columns=b_columns)
In [21]: a.align(b)
Out[21]:
( complete no over partial total
ls date
ad 2016-05-03 NaN NaN NaN NaN NaN
aj 2016-05-03 NaN NaN NaN NaN NaN
an 2016-05-03 100.00 0.00 0.0 0.0 100.0
avg/total NaT 97.33 2.66 0.0 0.0 100.0
ra 2016-05-03 NaN NaN NaN NaN NaN
rc 2016-05-03 97.29 2.70 NaN 0.0 100.0,
complete no over partial total
ls date
ad 2016-05-03 2.0 0.0 3.0 1.0 NaN
aj 2016-05-03 6.0 4.0 7.0 5.0 NaN
an 2016-05-03 10.0 8.0 11.0 9.0 NaN
avg/total NaT NaN NaN NaN NaN NaN
ra 2016-05-03 18.0 16.0 19.0 17.0 NaN
rc 2016-05-03 14.0 12.0 15.0 13.0 NaN)