Code Sample, a copy-pastable example if possible


import pandas as pd
import numpy as np

# create a dataframe with multiindex columns
arrays = [['A','A','B','B'],['a','b','a','b']]
tuples = list(zip(*arrays))
columnIndex = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
A = pd.DataFrame(data=np.random.randn(4,4),columns=columnIndex)



# save it to csv
A.to_csv('test.csv')
print(A.columns)

# try to do a round trip...
B = pd.DataFrame.from_csv('test.csv')
print(B.columns)

Output:

 A.columns = MultiIndex(levels=[['A', 'B'], ['a', 'b']],
       labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
       names=['first', 'second'])

 B.columns = Index(['A', 'A.1', 'B', 'B.1'], dtype='object')

Problem description

I would expect .to_csv() and .from_csv() to make a round trip, however the column MultiIndex is not being read back correctly by .from_csv(). It is possible to use read_csv(), but it requires extra parameters. I think it would be sufficient to add a parameter to .from_csv() similar to index_col=sequence in read_csv().

Expected Output

A.columns == B.columns

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: gfyoung

@sjev : Try specifying header=[0, 1]. AFAIK, the reason you're seeing what you're getting now is that from_csv doesn't know which rows correspond to the header.
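For reference, a minimal sketch of the suggested fix. It uses read_csv rather than from_csv (only read_csv accepts a list of header rows), with header=[0, 1] for the two column levels and index_col=0 for the row index:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the report: a two-level column MultiIndex
columns = pd.MultiIndex.from_tuples(
    [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")],
    names=["first", "second"],
)
A = pd.DataFrame(np.random.randn(4, 4), columns=columns)
A.to_csv("test.csv")

# Tell read_csv that the first two rows form the column MultiIndex
# and that the first column is the row index
B = pd.read_csv("test.csv", header=[0, 1], index_col=0)

assert A.columns.equals(B.columns)  # the round trip now succeeds
```

With header=[0, 1], read_csv also recognizes the extra blank line that to_csv writes between the header rows and the data as the index-name row, so no spurious NaN row appears.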

Comment From: jreback

CSV is not a format which can store meta-data; it is fundamentally a lossy format. By definition it cannot faithfully round-trip. read_csv is pretty good at inference, so it can usually 'guess' correctly (e.g. on dtypes).

Comment From: gfyoung

CSV is not a format which can store meta-data; it is fundamentally a lossy format.

What meta-data are you referring to? He's just talking about a MultiIndex here.

Comment From: jreback

CSV can NOT have meta-data: you have no idea how many rows are in the header, or even whether it IS a MultiIndex.

Comment From: gfyoung

But there is no meta-data here... I think it's just an issue of not specifying the header rows. What am I missing here?

Comment From: chris-b1

the header rows are metadata

Comment From: gfyoung

the header rows are metadata

But header rows can be used as columns. That's not metadata per se.

Comment From: jreback

The very fact that you have to specify the number of header rows is the issue.

You simply don't know, and that's why this is not an easily tractable problem.

Comment From: sjev

I'll try to clarify why I've raised this issue in the first place. According to the documentation:

It is preferable to use the more powerful pandas.read_csv() for most general purposes, but from_csv makes for an easy roundtrip to and from a file (the exact counterpart of to_csv), especially with a DataFrame of time series data

So a user would expect to use it as a convenience function, with minimal configuration. Of course the CSV format does not support metadata, so working with a MultiIndex becomes a bit tricky. To overcome this issue for the rows, there is the index_col parameter, which can be a sequence in the MultiIndex case:

 index_col : int or sequence, default 0
     Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table

However, when a MultiIndex is used on the columns, no such option is present (AFAIK), so the user is forced to use read_csv instead and provide a sequence of header rows.
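For comparison, this is what the index_col counterpart looks like on the row side (a sketch using read_csv with made-up column names, since from_csv's index_col behaves the same way):

```python
from io import StringIO

import pandas as pd

# A CSV whose first two columns together form a row MultiIndex
# (column names 'first'/'second'/'x'/'y' are illustrative only)
data = StringIO(
    "first,second,x,y\n"
    "A,a,1,2\n"
    "A,b,3,4\n"
    "B,a,5,6\n"
)

# A sequence for index_col builds a MultiIndex from those columns
df = pd.read_csv(data, index_col=[0, 1])

assert isinstance(df.index, pd.MultiIndex)
```

The request in this issue is essentially the symmetric option for the header rows.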

Comment From: gfyoung

So a user would expect to use it as a convenience function, with minimal configuration.

Yes, you just needed to configure one parameter. That seems pretty minimal to me.

We can't read minds, unfortunately, so there will be cases where just calling read_csv without any other arguments will not work; MultiIndex columns are one of them. It does work out of the box for the vast majority of cases, though, where there is a single row of column names.

Comment From: sjev

One parameter is perfectly fine. The issue for me was that I wasn't able to figure that out from the documentation; there is no mention that header can accept a sequence. Updating the docs would be enough 👍

Comment From: gfyoung

The issue for me was that I wasn't able to figure that out from the documentation; there is no mention that header can accept a sequence.

This is the current documentation:

header : int or list of ints, default ‘infer’

I'm not sure I follow you here.

Comment From: sjev

You are probably referring to the read_csv docs. Those are indeed very well documented.

I am talking about from_csv, whose docstring only contains

header : int, default 0 Row to use as header (skip prior rows)

... Apparently everything is already in place; only the docs of DataFrame.from_csv are lagging behind read_csv. I'd be happy to update the docs and make a pull request.

Comment From: gfyoung

Ah, true. I think we both overlooked this. Generally, we encourage people to use read_csv, as per the docs.

By all means, feel free to update the documentation as a PR, though we should consider just deprecating the function (@jreback thoughts?) in the future.

Comment From: farleylai

To make things more interesting, what about an additional column with only a one-level header? What are the right parameters for to_csv() and from_csv()/read_csv() so as to recover the original DataFrame A without the unnecessary 'Unnamed: XXX'?

>>> arrays = [['A','A','B','B'],['a','b','a','b']]
>>> tuples = list(zip(*arrays))
>>> columnIndex = pd.MultiIndex.from_tuples(tuples)
>>> A = pd.DataFrame(data=np.random.randn(4,4),columns=columnIndex)
>>> A
          A                   B          
          a         b         a         b
0 -1.325581  0.734176 -0.503851  0.593437
1 -0.480105  0.179591  0.326949 -0.669441
2 -1.784733  0.516683 -0.785407 -0.794819
3 -0.235099  1.292330 -0.089105 -1.825709

>>> A['product'] = ['p1','p2','p3','p4']
>>> A
          A                   B           product
          a         b         a         b        
0 -1.325581  0.734176 -0.503851  0.593437      p1
1 -0.480105  0.179591  0.326949 -0.669441      p2
2 -1.784733  0.516683 -0.785407 -0.794819      p3
3 -0.235099  1.292330 -0.089105 -1.825709      p4

Now save and reread A:

>>> A.to_csv('A.csv')
>>> AA = pd.DataFrame.from_csv('A.csv', header=[0,1])
>>> AA
          A                   B                      product
          a         b         a         b Unnamed: 5_level_1
0 -1.325581  0.734176 -0.503851  0.593437                 p1
1 -0.480105  0.179591  0.326949 -0.669441                 p2
2 -1.784733  0.516683 -0.785407 -0.794819                 p3
3 -0.235099  1.292330 -0.089105 -1.825709                 p4

Any idea how to get rid of the 'Unnamed' entries?
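One workaround, a sketch rather than an official API: the blank cells in the second header row come back from read_csv as 'Unnamed: ...' labels, so you can rewrite those labels back to empty strings after reading:

```python
import numpy as np
import pandas as pd

# Rebuild the mixed frame: two-level columns plus a flat 'product' column,
# which pandas stores under the key ('product', '')
columns = pd.MultiIndex.from_tuples(
    [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")]
)
A = pd.DataFrame(np.random.randn(4, 4), columns=columns)
A["product"] = ["p1", "p2", "p3", "p4"]
A.to_csv("A.csv")

AA = pd.read_csv("A.csv", header=[0, 1], index_col=0)

# Blank second-level header cells were filled in as 'Unnamed: ...';
# map them back to empty strings to match the original columns
AA.columns = pd.MultiIndex.from_tuples(
    [(a, "" if b.startswith("Unnamed:") else b) for a, b in AA.columns]
)

assert A.columns.equals(AA.columns)
```

This relies on the 'Unnamed:' prefix that read_csv uses for empty header cells, so it is a heuristic cleanup, not a guaranteed round trip.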