I'm trying to trim a very big `DataFrame` with a `MultiIndex` using `head`, and I noticed that the new `DataFrame` was still ridiculously big because it kept all the data from the entire original `MultiIndex`. This doesn't happen if the `DataFrame` has a regular `Index`.
```python
In [1]: from pandas import DataFrame

In [2]: df = DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': ['q', 'w', 'e', 'r', 't', 'y'], 'c': ['a', 's', 'd', 'f', 'g', 'h']}).set_index(['a', 'b'])

In [3]: df.head(3)
Out[3]:
     c
a b
1 q  a
2 w  s
3 e  d

In [4]: df.head(3).index
Out[4]:
MultiIndex(levels=[[1, 2, 3, 4, 5, 6], [u'e', u'q', u'r', u't', u'w', u'y']],
           labels=[[0, 1, 2], [1, 4, 0]],
           names=[u'a', u'b'])
```
#### Expected Output
```python
MultiIndex(levels=[[1, 2, 3], [u'e', u'q', u'w']],
           labels=[[0, 1, 2], [1, 2, 0]],
           names=[u'a', u'b'])
```
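For reference, a minimal sketch (not part of the original report) of one way to obtain the trimmed levels shown above on pandas 0.18.1: rebuild the `MultiIndex` from the tuples that remain after `head`.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': ['q', 'w', 'e', 'r', 't', 'y'],
                   'c': ['a', 's', 'd', 'f', 'g', 'h']}).set_index(['a', 'b'])

small = df.head(3)
# Rebuilding the index from the tuples that remain drops the unused level values.
small.index = pd.MultiIndex.from_tuples(list(small.index), names=small.index.names)
print(small.index.levels)  # only the values still referenced by the three rows
```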
#### Output of `pd.show_versions()`

```
## INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 25.1.1
Cython: None
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
```
Comment From: jreback
Duplicate of this: https://github.com/pandas-dev/pandas/issues/11724

Indexing a MultiIndex does not delete unused levels on purpose: it is a bit expensive to reconstruct them, and it is not clear when to do that. The levels themselves are an implementation detail (checking them is peering into the MultiIndex internals).
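A minimal workaround sketch for later readers, assuming a pandas version (0.20.0 or newer) that provides `MultiIndex.remove_unused_levels()`; on older versions such as the 0.18.1 reported above, rebuilding the index via `MultiIndex.from_tuples` as sketched earlier applies instead.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': ['q', 'w', 'e', 'r', 't', 'y'],
                   'c': ['a', 's', 'd', 'f', 'g', 'h']}).set_index(['a', 'b'])

small = df.head(3)
# Explicitly drop level values that no remaining row refers to.
small.index = small.index.remove_unused_levels()
print(small.index.levels)
```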