Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
# create a dataframe with multiindex columns
arrays = [['A','A','B','B'],['a','b','a','b']]
tuples = list(zip(*arrays))
columnIndex = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
A = pd.DataFrame(data=np.random.randn(4,4),columns=columnIndex)
# save it to csv
A.to_csv('test.csv')
print(A.columns)
# try to do a round trip...
B = pd.DataFrame.from_csv('test.csv')
print(B.columns)
Output:
A.colums = MultiIndex(levels=[['A', 'B'], ['a', 'b']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['first', 'second'])
B.columns = Index(['A', 'A.1', 'B', 'B.1'], dtype='object')
Problem description
I would expect .to_csv()
and .from_csv()
to result in a round trip, however the column multiindex is not being read correctly by .from_csv().
It is possible to use csv_read()
, but it requires more extra parameters.
I think it would be sufficient to add a parameter to .from_csv
similar to index_col=sequence in csv_read()
Expected Output
A.columns == B.columns
Output of pd.show_versions()
Comment From: gfyoung
@sjev : Try specifying the header to be header=[0, 1]
. AFAIK the reason you're seeing what you're getting now is because it doesn't know which rows correspond to the header.
Comment From: jreback
csv is not a format which can store meta-data, it is fundamentally a lossy format. by-definition it cannot faithfully round-trip; read_csv
is pretty good at inference so it can usually 'guess' correctly (e.g. on dtypes).
Comment From: gfyoung
csv is not a format which can store meta-data, it is fundamentally a lossy format.
What meta-data are you referring to? He's just talking about a MultiIndex
here.
Comment From: jreback
can NOT have meta data you have no idea how many rows are in the header or even if it IS a MI
Comment From: gfyoung
But there is no meta-data here...I think it's just an issue of not specifying the header rows. What am I missing here?
Comment From: chris-b1
the header rows are metadata
Comment From: gfyoung
the header rows are metadata
But header rows can be used as columns. That's not metadata per se
Comment From: jreback
the very fact that you have to specify the number of rows of header is the issue
you simply don't know and that's why this is not an easily tractable problem
Comment From: sjev
I'll try to clarify why I've raised this issue in the first place. According to the documentation:
It is preferable to use the more powerful pandas.read_csv() for most general purposes, but from_csv makes for an easy roundtrip to and from a file (the exact counterpart of to_csv), especially with a DataFrame of time series data
So a user would expect to use it as a convenience function, with minimal configuration.
Of course the CSV format does not support metadata, so working with multiindex becomes a bit tricky. To overcome this issue for the rows, there is a parameter index_col
, which can be a sequence in case of multiindex:
index_col : int or sequence, default 0
Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table
However, when multiindex is used on columns, no such option is present (AFAIK), so the user is forced to use csv_read
instead just providing a sequence of header rows.
Comment From: gfyoung
So a user would expect to use it as a convenience function, with minimal configuration.
Yes, you just needed to configure one parameter. That seems pretty minimalist to me I would think.
We can't read minds unfortunately, so there will be cases where just doing read_csv
without any other modifications will not work. MultiIndex
columns are one of them. For the vast majority cases where there is one row for the column names, that is the case though.
Comment From: sjev
One parameter is perfectly fine. The issue for me was that I wasn't able to figure that out from documentation, there is no mention that header can accept a sequence. Updating docs would be enough 👍
Comment From: gfyoung
The issue for me was that I wasn't able to figure that out from documentation, there is no mention that header can accept a sequence.
This is the current documentation:
header : int or list of ints, default ‘infer’
I'm not sure I follow you here.
Comment From: sjev
You are probably referring to read_csv
docs. These are indeed very well documented.
I am talking about from_csv
, which only contains
header : int, default 0 Row to use as header (skip prior rows)
... Apparently everything is in place, only the docs of DataFrame.from_csv
are lagging behind csv_read
.
I'd be happy to update the docs and make a pull request.
Comment From: gfyoung
Ah, true. I think we both overlooked this. Generally, we encourage people to use read_csv
, as per the docs.
By all means, feel free to update the documentation as a PR, though we should consider just deprecating the function (@jreback thoughts?) in the future.
Comment From: farleylai
To make things more interesting, what about an additional column with only one level header? What is the right parameters to_csv() and from_csv()/read_csv() so as to recover the orginial DataFrame A without unnecessary 'Unnamed:XXX'?
>>> arrays = [['A','A','B','B'],['a','b','a','b']]
>>> tuples = list(zip(*arrays))
>>> columnIndex = pd.MultiIndex.from_tuples(tuples)
>>> A = pd.DataFrame(data=np.random.randn(4,4),columns=columnIndex)
>>> A
A B
a b a b
0 -1.325581 0.734176 -0.503851 0.593437
1 -0.480105 0.179591 0.326949 -0.669441
2 -1.784733 0.516683 -0.785407 -0.794819
3 -0.235099 1.292330 -0.089105 -1.825709
>>> A['product'] = ['p1','p2','p3','p4']
>>> A
A B product
a b a b
0 -1.325581 0.734176 -0.503851 0.593437 p1
1 -0.480105 0.179591 0.326949 -0.669441 p2
2 -1.784733 0.516683 -0.785407 -0.794819 p3
3 -0.235099 1.292330 -0.089105 -1.825709 p4
Now save and reread A:
>>> A.to_csv('A.csv')
>>> AA = pd.DataFrame.from_csv('A.csv', header=[0,1])
>>> AA
A B product
a b a b Unnamed: 5_level_1
0 -1.325581 0.734176 -0.503851 0.593437 p1
1 -0.480105 0.179591 0.326949 -0.669441 p2
2 -1.784733 0.516683 -0.785407 -0.794819 p3
3 -0.235099 1.292330 -0.089105 -1.825709 p4
Any idea to get rid of the Unnamed things?