Code Sample
import pandas as pd
import io
text = """
A B C
1 1 1
1 2 2
2 1 3
2 2 4
"""
df = pd.read_csv(io.StringIO(text), delimiter = ' ')
df.set_index(['A','B'], inplace = True)
# Note the output of get_level_values is the list of all values [1, 1, 2, 2]
print(df.index.get_level_values('A').tolist())
df.index.set_levels(df.index.get_level_values('A').map(lambda x: x * 2), level='A', inplace=True)
print(df.index.get_level_values('A').tolist())
# outputs [2, 2, 2, 2]
Expected Output
[2, 2, 4, 4]
# If I re-order the input as:
text = """
A B C
1 1 1
2 1 2
1 2 3
2 2 4
"""
# it works as I expected giving [2, 4, 2, 4]
Is this a bug or expected behaviour?
output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.5.1.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None
pandas: 0.18.1 nose: 1.3.7 pip: 8.1.1 setuptools: 20.3 Cython: 0.23.4 numpy: 1.10.4 scipy: 0.17.0 statsmodels: 0.6.1 xarray: None IPython: 4.1.2 sphinx: 1.3.1 patsy: 0.4.0 dateutil: 2.5.1 pytz: 2016.2 blosc: None bottleneck: 1.0.0 tables: 3.2.2 numexpr: 2.5 matplotlib: 1.5.1 openpyxl: 2.3.2 xlrd: 0.9.4 xlwt: 1.0.0 xlsxwriter: 0.8.4 lxml: 3.6.0 bs4: 4.4.1 html5lib: None httplib2: None apiclient: None sqlalchemy: 1.0.12 pymysql: 0.7.5.None psycopg2: None jinja2: 2.8 boto: 2.39.0 pandas_datareader: None
Comment From: jreback
duplicate of #13754 and PR in #14236
Comment From: danio
@jreback I don't think from the description of #13754 and the code changes in #14236 that it is an exact duplicate of those. I will try testing it with the suggested change and see if that changes the behaviour.
Comment From: danio
@jreback I have confirmed it is not a duplicate. #13754 is about set_levels still changing the index even when verification fails. This issue is not about that.
I have realised that set_levels and get_level_values should not necessarily be symmetrical. What I think is missing from pandas is a get_levels function that would be symmetrical to set_levels (the levels property is not equivalent, as you cannot look up a level by its label).
Failing that, I have found that the workaround for this is to replace
df.index.get_level_values(level)
with
df.index.levels[df.index.names.index(level)]
in my code, i.e.
level_index = df.index.names.index('A')
df.index.set_levels(df.index.levels[level_index].map(lambda x: x * 2), level='A', inplace=True)
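For readers on a newer pandas, here is a minimal sketch of the same workaround. It assumes a release where set_levels no longer accepts inplace, so the returned index is assigned back; the variable names unique_a, full_a and doubled are only illustrative. It also prints both views of level 'A' to show why the two calls are not interchangeable: levels holds each distinct value once, while get_level_values holds one value per row.

```python
import pandas as pd

# Same data as the first example in the report, built directly.
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2], "C": [1, 2, 3, 4]})
df = df.set_index(["A", "B"])

level_index = df.index.names.index("A")
unique_a = df.index.levels[level_index]   # distinct values of level A
full_a = df.index.get_level_values("A")   # one value per row

print(unique_a.tolist())  # [1, 2]
print(full_a.tolist())    # [1, 1, 2, 2]

# set_levels expects one entry per distinct value, so map over `levels`,
# not over get_level_values. The result is a new index; assign it back.
df.index = df.index.set_levels(unique_a.map(lambda x: x * 2), level="A")

doubled = df.index.get_level_values("A").tolist()
print(doubled)  # [2, 2, 4, 4]
```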
Comment From: bkandel
@danio I agree that this issue is not a duplicate of https://github.com/pydata/pandas/issues/13754, but I think that it's quite close to this description: https://github.com/pydata/pandas/issues/13741#issuecomment-248009216 Basically the issue is that there's no way to directly set the equivalent of the column names in a multiindex. I.e. where you would be used to saying df.columns = ['A', 'B', 'C'] or df.index = ['A', 'B', 'C'], you can't say df.set_columns(['A', 'B', 'C'], level=0).
Comment From: danio
@bkandel, yes, the example there is quite convoluted but I think this is caused by the same underlying issue as #13741. I can't see any options to change the duplicate setting of this issue, probably I don't have the permissions?
It feels to me that MultiIndex.set_levels is exposing too much of the underlying representation of the index, and it should be replaced by a new function with an interface more like get_level_values. That's probably not a discussion for the issue tracker though.
Comment From: bkandel-picwell
@danio Yes, I agree that exposing set_levels and set_labels but not set_level_values (like what get_level_values does) makes it harder for users to see the functionality they want. I think that should be raised as a separate issue. I'll spend a bit of time to make sure exactly what functionality is needed and then file an issue.
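As a rough illustration of what such a function might look like, here is a sketch of a set_level_values helper. It is hypothetical and not part of pandas: the name, signature and approach (rebuilding the index with MultiIndex.from_arrays) are assumptions made only for this example. It takes one value per row, mirroring what get_level_values returns.

```python
import pandas as pd


def set_level_values(index, values, level):
    """Hypothetical helper (not a pandas API): return a new MultiIndex in
    which the named level is replaced by `values`, given one value per row."""
    arrays = [
        values if name == level else index.get_level_values(name)
        for name in index.names
    ]
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2], "C": [1, 2, 3, 4]})
df = df.set_index(["A", "B"])

# Transform the per-row values, then write them back per row.
new_a = df.index.get_level_values("A").map(lambda x: x * 2)
df.index = set_level_values(df.index, new_a, level="A")

result = df.index.get_level_values("A").tolist()
print(result)  # [2, 2, 4, 4]
```

Because the helper works on per-row values rather than on the distinct level values, the row order of the input does not matter, which is the behaviour the original report expected.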