Create MultiIndex DataFrame

import pandas as pd

df = pd.DataFrame(
[
    [5777, 100, 5385, 200, 5419, 4887, 100, 200],
    [4849, 0, 4539, 0, 3381, 0, 0, ],
    [4971, 0, 3824, 0, 4645, 3424, 0, 0, ],
    [4827, 200, 3459, 300, 4552, 3153, 100, 200, ],
    [5207, 0, 3670, 0, 4876, 3358, 0, 0, ],
],
index=pd.to_datetime(['2010-01-01',
                      '2010-01-02',
                      '2010-01-03',
                      '2010-01-04',
                      '2010-01-05']),
columns=pd.MultiIndex.from_tuples(
    [('Portfolio A', 'GBP', 'amount'), ('Portfolio A', 'GBP', 'injection'),
     ('Portfolio B', 'EUR', 'amount'), ('Portfolio B', 'EUR', 'injection'),
     ('Portfolio A', 'USD', 'amount'), ('Portfolio A', 'USD', 'injection'),
     ('Portfolio B', 'JPY', 'amount'), ('Portfolio B', 'JPY', 'injection')])
).sortlevel(axis=1)

Add a column at Level 2

amount = df.loc[:, pd.IndexSlice[:, :, 'amount']]
inject = df.loc[:, pd.IndexSlice[:, :, 'injection']]
dav = amount - amount.shift() - inject.shift().values
dav.columns.set_levels(['daily_added_value'], level=2, inplace=True)
df = pd.concat([df, dav], axis=1).sortlevel(axis=1)
print(df.T) #transposed to fit only

I can repeat the add a column at Level 2 code above multiple times, however if I try the following I get an error (possibly related to NAN's):

dav1 = df.loc[:, pd.IndexSlice[:, :, 'daily_added_value']]
amount = df.loc[:, pd.IndexSlice[:, :, 'amount']]
dr = dav1 * amount
dr.columns.set_levels(['daily_return'], level=2, inplace=True)
df = pd.concat([df, dr], axis=1).sortlevel(axis=1)

Expected Output - ( daily_returns is added to level 2 ( daily_added_value * amount )

Note I have put zeros as expected values but the desired output is daily_added_value * amount) I am also adding other measures at level 2 with division and shift() in the formulae. These fail with same error.

                               2010-01-01  2010-01-02  2010-01-03  \
Portfolio A GBP amount                 5777.0      4849.0      4971.0   
                daily_added_value         NaN     -1028.0       122.0   
                injection               100.0         0.0         0.0   
            USD amount                 5419.0      3381.0      4645.0   
                daily_added_value         NaN     -6925.0      1264.0
                 daily_returns             0     0      0  
                injection              4887.0         0.0      3424.0   
Portfolio B EUR amount                 5385.0      4539.0      3824.0   
                daily_added_value         NaN     -1046.0      -715.0   
                injection               200.0         0.0         0.0   
            JPY amount                  100.0         0.0         0.0   
                daily_added_value         NaN      -300.0         NaN   
                daily_returns             0     0      0  
                injection               200.0         NaN         0.0   

                                   2010-01-04  2010-01-05  
Portfolio A GBP amount                 4827.0      5207.0  
                daily_added_value      -144.0       180.0  
                injection               200.0         0.0  
            USD amount                 4552.0      4876.0  
                daily_added_value     -3517.0     -2829.0  
                daily_returns             0     0      0  
                injection              3153.0      3358.0  
Portfolio B EUR amount                 3459.0      3670.0  
                daily_added_value      -365.0       -89.0  
                injection               300.0         0.0  
            JPY amount                  100.0         0.0  
                daily_added_value       100.0      -300.0  
                daily_returns             0     0      0  
                injection               200.0         0.0

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-56-7a5afb5f45f1> in <module>()
      2 amount = df.loc[:, pd.IndexSlice[:, :, 'amount']]
      3 dr = dav1 * amount
----> 4 dr.columns.set_levels(['daily_return'], level=2, inplace=True)
      5 df = pd.concat([df, dr], axis=1).sortlevel(axis=1)

C:\Users\xxxxxxxx\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\multi.py in set_levels(self, levels, level, inplace, verify_integrity)
    227         idx._reset_identity()
    228         idx._set_levels(levels, level=level, validate=True,
--> 229                         verify_integrity=verify_integrity)
    230         if not inplace:
    231             return idx

C:\Users\xxxxxxxx\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\multi.py in _set_levels(self, levels, level, copy, validate, verify_integrity)
    164 
    165         if verify_integrity:
--> 166             self._verify_integrity()
    167 
    168     def set_levels(self, levels, level=None, inplace=False,

C:\Users\xxxxxxxx\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\multi.py in _verify_integrity(self)
    127                                  " level  (%d). NOTE: this index is in an"
    128                                  " inconsistent state" % (i, label.max(),
--> 129                                                           len(level)))
    130 
    131     def _get_levels(self):

ValueError: On level 2, label max (1) >= length of level  (1). NOTE: this index is in an inconsistent state

How do you make the index consistent?

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: 0.2.1

This post is also raised as a question and also related to this question

Comment From: jorisvandenbossche

See also the discussion on gitter

Comment From: toasteez

This works (needed values to align.

dav = df.loc[:, pd.IndexSlice[:, :, 'daily_added_value']]
amount = df.loc[:, pd.IndexSlice[:, :, 'amount']]
dr = (dav.shift(-1).values / amount) * 100

Comment From: toasteez

Is there a better way to align?

Comment From: toasteez

I am reopening as there does not seem to be consistent behavior for a cumprod formula? 100*((1+dr / 100).cumprod()-1)

Comment From: toasteez

I tried the set_levels with verify_integrity = False but this yieds:

The failing command can be changed to drc.columns.set_levels(['daily_return_cumulative'], level=2, inplace=True, verify_integrity = False) however this yields an algos.cp35-win_amd64.pyd unhandled exception

Pandas MultiIndex - ValueError: On level 2, label max (1) >= length of level  (1). NOTE: this index is in an inconsistent state

Comment From: toasteez

Code for checking:

import pandas as pd
import numpy as np

df = pd.DataFrame(
[
    [5777, 100, 5385, 200, 5419, 4887, 100, 200],
    [4849, 0, 4539, 0, 3381, 0, 0, ],
    [4971, 0, 3824, 0, 4645, 3424, 0, 0, ],
    [4827, 200, 3459, 300, 4552, 3153, 100, 200, ],
    [5207, 0, 3670, 0, 4876, 3358, 0, 0, ],
],
index=pd.to_datetime(['2010-01-01',
                      '2010-01-02',
                      '2010-01-03',
                      '2010-01-04',
                      '2010-01-05']),
columns=pd.MultiIndex.from_tuples(
    [('Portfolio A', 'GBP', 'amount'), ('Portfolio A', 'GBP', 'injection'),
     ('Portfolio B', 'EUR', 'amount'), ('Portfolio B', 'EUR', 'injection'),
     ('Portfolio A', 'USD', 'amount'), ('Portfolio A', 'USD', 'injection'),
     ('Portfolio B', 'JPY', 'amount'), ('Portfolio B', 'JPY', 'injection')])
).sortlevel(axis=1)

amount = df.loc[:, pd.IndexSlice[:, :, 'amount']]
inject = df.loc[:, pd.IndexSlice[:, :, 'injection']]
dav = amount - amount.shift() - inject.shift().values
dav.columns.set_levels(['daily_added_value'], level=2, inplace=True)
df = pd.concat([df, dav], axis=1).sortlevel(axis=1)

dav = df.loc[:, pd.IndexSlice[:, :, 'daily_added_value']]
amount = df.loc[:, pd.IndexSlice[:, :, 'amount']]
dr = (dav.values / amount.shift()) * 100
dr.columns.set_levels(['daily_return'], level=2, inplace=True)
df = pd.concat([df, dr], axis=1).sortlevel(axis=1)

## TODO Fail

dr = df.loc[:, pd.IndexSlice[:, :, 'daily_return']]
drc = (100 * (1 + dr / 100).cumprod()-1)
drc = drc.replace([np.inf, -np.inf], np.nan)
print(drc.head()) # DataFrame looks OK???

# Fails here when verify_integrity is True (levels.max() recursion error and when False algo exception
drc.columns.set_levels(['dr_cumulative'], level=2, inplace=True, verify_integrity = False)
df = pd.concat([df, drc], axis=1).sort_index(axis=1)`

Comment From: jorisvandenbossche

@toasteez I opened a separate issue for the inplace=True case: https://github.com/pydata/pandas/issues/13754

Comment From: toasteez

Great :+1: The work around I am using is:

def map_level(df, dct, level=2):
    index = df.index
    index.set_levels([[dct.get(item, item) for item in names] if i==level else names
                                for i, names in enumerate(index.levels)], inplace=True)
dct = {'daily_return':'daily_cumulative_return'}
map_level(drc.T, dct, level=2)

Comment From: bkandel

@toasteez If I understand your question correctly, the issue here is that set_levels is not doing what you think it's doing. A MultiIndex has a levels component and a labels component. The levels are the various names for the different values that the MultiIndex can take on, and the labels associates each column with a given level. When you say set_levels, what you're doing is changing the possible names that a column can have, but importantly, you're not changing the labels. So concretely, if you have the following MultiIndex:

In [1]: multi = MultiIndex.from_product([['A', 'B'], ['a', 'b']])
In [2]: multi
Out[2]: 
MultiIndex(levels=[[u'A', u'B'], [u'a', u'b']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

and then change the levels at the second layer to 'a', you will get an error because the labels say that you should look at the second entry in the list to find the correct column header, but there is only one entry in the list.

That's what's happening in your example. The columns for dr look like:

In [3]: dr.columns
Out[3]: 
MultiIndex(levels=[[u'Portfolio A', u'Portfolio B'], [u'EUR', u'GBP', u'JPY', u'USD'], [u'amount', u'daily_added_value']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 3, 3, 0, 0, 2, 2], [0, 1, 0, 1, 0, 1, 0, 1]])

and if you set the last level to be a list of length 1, your labels value will still be saying that you should look at the second entry in the list, which doesn't exist. That's why you get the error, which states that On level 2, label max (1) >= length of level (1), because that's exactly what's happening here.

If you want to reset the level from scratch, you would need to do droplevel and reconstruct it or something similar (something like this: http://stackoverflow.com/questions/14744068/prepend-a-level-to-a-pandas-multiindex).

Comment From: bkandel

@jorisvandenbossche I think this can be closed unless @toasteez objects -- I don't see any further action here.

Comment From: xoelop

I think this is still not working.

In this case, selecting top-level columns from a multiindex of a df:

data = sales.set_index(keys=[sales.index, 'descr', 'PERIODO']) data = data.unstack(['descr', 'PERIODO'])[cols]

and then making res.columns.levels[0]

the result is bigger than cols. It's the same columns that the original DataFrame (sales) had

Comment From: sorenwacker

I still get this error when trying to set level values of a multiindex.