I have two multi-index DataFrames, a_df and b_df. When I try to concat them, the operation fails, but it works if I take the head() of each first to a small size.
More specifically, note the following:
> pd.concat([a, b], join='outer', axis=1, verify_integrity=True).dropna()
Empty DataFrame
Columns: [budget, default_bid]
Index: []
versus the following:
> pd.concat([a.head(100), b.head(100)], join='outer', axis=1, verify_integrity=True).dropna()
budget default_bid
flight_uid created_at
0F0092Ntoi 2015-03-28 306115.26 6.00
0F0099v8iI 2015-03-28 10984.27 41.00
0F01MfxxSI 2015-03-28 3000.00 0.90
0F01SZnTfs 2015-03-28 3000.00 1.60
0F01ddgkRa 2015-03-28 1414.71 2.52
0F01ee81fL 2015-03-28 0.00 6.00
0F01f2HpD3 2015-03-28 425.00 40.00
0F01f4aW8n 2015-03-28 1575.76 1.25
0F01o6T23a 2015-03-28 0.00 9.00
0F02BZeTr6 2015-03-28 50893.42 1.50
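One way to see why the full outer concat yields nothing after dropna() is to compare the dtype of the 'created_at' level on each side and check how much the two indexes actually overlap. A minimal diagnostic sketch, assuming a and b here are the same objects passed to concat above:
print(a.index.get_level_values('created_at').dtype)
print(b.index.get_level_values('created_at').dtype)
# an empty intersection means every row is NaN on one side or the other
# after the outer align, so dropna() leaves nothing
print(len(a.index.intersection(b.index)))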
Here are the first few entries for a_df and b_df:
> a_df.head(10)
flight_uid created_at
0F0092Ntoi 2015-03-28 306115.26
0F0099v8iI 2015-03-28 10984.27
0F01MfxxSI 2015-03-28 3000.00
0F01SZnTfs 2015-03-28 3000.00
0F01ddgkRa 2015-03-28 1414.71
0F01ee81fL 2015-03-28 0.00
0F01f2HpD3 2015-03-28 425.00
0F01f4aW8n 2015-03-28 1575.76
0F01o6T23a 2015-03-28 0.00
0F02BZeTr6 2015-03-28 50893.42
Name: budget, dtype: float64
> b_df.head(10)
flight_uid created_at
0F0092Ntoi 2015-03-28 6.00
0F0099v8iI 2015-03-28 41.00
0F01MfxxSI 2015-03-28 0.90
0F01SZnTfs 2015-03-28 1.60
0F01ddgkRa 2015-03-28 2.52
0F01ee81fL 2015-03-28 6.00
0F01f2HpD3 2015-03-28 40.00
0F01f4aW8n 2015-03-28 1.25
0F01o6T23a 2015-03-28 9.00
0F02BZeTr6 2015-03-28 1.50
Name: default_bid, dtype: float64
Here are the types for both:
> a_df.reset_index().dtypes
flight_uid object
created_at object
budget float64
dtype: object
More specifically:
> a_df.reset_index()['created_at'][0]
datetime.date(2015, 3, 28)
This is my configuration:
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Darwin
OS-release: 14.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.0
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
Happy to post the data anywhere if that helps.
In case it helps, it seems to work if I pre-convert the field 'created_at' (in the index) to pd.DatetimeIndex(), i.e., doing the following with both a_df and b_df:
a_df = a_df.reset_index()
a_df['created_at'] = pd.DatetimeIndex(a_df['created_at'])  # datetime.date -> datetime64[ns]
a_df = a_df.set_index(['flight_uid', 'created_at'])
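Applying the same conversion to b_df before retrying the concat; a sketch of the complete sequence (merged is just an illustrative name):
b_df = b_df.reset_index()
b_df['created_at'] = pd.DatetimeIndex(b_df['created_at'])
b_df = b_df.set_index(['flight_uid', 'created_at'])

# with both 'created_at' levels on datetime64[ns], the outer concat aligns
merged = pd.concat([a_df, b_df], join='outer', axis=1, verify_integrity=True)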
Comment From: jreback
The problem is that you are using datetime.date, which is not a really supported type. Use Timestamps and proper datetime64[ns] dtypes for datetimes. It will be orders of magnitude faster / better.
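A sketch of that suggestion, rebuilding the 'created_at' level as datetime64[ns] Timestamps without resetting the index (the same would be done for b_df):
import pandas as pd

# convert the object-dtype 'created_at' level (datetime.date) to Timestamps
a_df.index = pd.MultiIndex.from_arrays(
    [a_df.index.get_level_values('flight_uid'),
     pd.to_datetime(a_df.index.get_level_values('created_at'))],
    names=['flight_uid', 'created_at'])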
Comment From: jreback
When you say "operation fails", what do you mean? This should actually work the way you have it; can you show a traceback?
Comment From: amelio-vazquez-reina
Thank you @jreback. By failing I mean that it returns a DataFrame full of NaN. Sorry if that wasn't clear; at the top of my post I tried to show that if you do a dropna() you get an empty DataFrame.
Comment From: jorisvandenbossche
@amelio-vazquez-reina Can you post a reproducible example? (so some runnable code that makes up some data and reproduces the problem)
Closing for now, but feel free to re-open when you can update this.
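For reference, a self-contained sketch of the kind of runnable example being asked for, assuming the behaviour hinges on datetime.date values in the index (the data is made up from the snippets above, and a case this small may not actually trigger the problem, since head(100) already worked):
import datetime
import pandas as pd

# two series sharing a (flight_uid, created_at) MultiIndex whose second
# level holds datetime.date objects rather than datetime64[ns] Timestamps
idx = pd.MultiIndex.from_tuples(
    [('0F0092Ntoi', datetime.date(2015, 3, 28)),
     ('0F0099v8iI', datetime.date(2015, 3, 28))],
    names=['flight_uid', 'created_at'])

a = pd.Series([306115.26, 10984.27], index=idx, name='budget')
b = pd.Series([6.00, 41.00], index=idx, name='default_bid')

print(pd.concat([a, b], join='outer', axis=1, verify_integrity=True).dropna())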