Let's assume there are two Excel files with almost identical content:
file_1.xlsx:
file_2.xlsx:
Then reading the first file in pandas:
import pandas as pd

df1 = pd.read_excel('file_1.xlsx',
                    header=0, index_col=None,
                    converters={'A0': str, 'B0': str})
print(df1)
produces the expected result:
A0 B0 C0 D0 E0
0 0001 0004 0.1 1 a
1 0002 0005 0.2 2 b
2 0003 0006 0.3 3 c
However, trying the same with the second file:
df2 = pd.read_excel('file_2.xlsx',
                    header=[0, 1], index_col=None,
                    converters={('A0', 'A1'): str,
                                ('A0', 'B1'): str})
print(df2)
yields a different and, compared with the previous example, unexpected result:
A0 A0 C0 E0
A1 B1 C1 D1 E1
1 0004 0.1 1 a
2 0005 0.2 2 b
3 0006 0.3 3 c
Since it is not possible to use has_index_names=False, as it has been deprecated since 0.16.2, there seems to be no way to control how pandas imports this first column (i.e. no way to convert values before the original formatting is lost).
And there is no way to tell pandas NOT to assign the first column to the index, as it ignores index_col=None when header is a list.
So the question is: what would be a sensible way to regain control over how the first column is imported when the header is a multi-index?
- revive or de-depreciate (would that be appreciate?) has_index_names;
- make index_col play a role in parsing the header?
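Until one of these is available, one possible workaround is to read the sheet with header=None so that nothing is promoted to the index, and then build the column MultiIndex by hand. The sketch below is only an illustration: promote_header_rows is a hypothetical helper, not part of pandas, and the raw frame is made-up data standing in for what pd.read_excel('file_2.xlsx', header=None, dtype=str) might return (this assumes a pandas version whose read_excel accepts dtype). The header labels are guessed from the converters keys above.

```python
import pandas as pd


def promote_header_rows(raw, n_header_rows=2):
    """Use the first n_header_rows rows of raw as MultiIndex columns.

    Hypothetical helper, not part of pandas.
    """
    header = raw.iloc[:n_header_rows]
    # reset_index returns a new frame, so assigning columns is safe
    body = raw.iloc[n_header_rows:].reset_index(drop=True)
    body.columns = pd.MultiIndex.from_arrays(header.to_numpy().tolist())
    return body


# Made-up stand-in for:
#   raw = pd.read_excel('file_2.xlsx', header=None, dtype=str)
# The header labels are assumptions; cells left blank by merged
# headers may need to be filled in before this step.
raw = pd.DataFrame([
    ["A0", "A0", "C0", "C0", "E0"],
    ["A1", "B1", "C1", "D1", "E1"],
    ["0001", "0004", "0.1", "1", "a"],
    ["0002", "0005", "0.2", "2", "b"],
    ["0003", "0006", "0.3", "3", "c"],
])

df2 = promote_header_rows(raw)

# Everything is still a string here, so convert numeric columns explicitly.
df2[("C0", "C1")] = df2[("C0", "C1")].astype(float)
print(df2)
```

Because every cell arrives as a string, the leading zeros in the first two columns survive, at the cost of converting the numeric columns by hand afterwards.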
Comment From: jreback
cc @chris-b1
Comment From: chris-b1
Thanks for the report and detailed example - this is a duplicate of #11733. I'm entirely in favor of supporting this, though it is tricky to handle all the various formats and not break back-compat.
One idea I'm not sure I've explored - right now the default for index_col is None, so passing index_col=None can't give any information. We could possibly change the default to 'infer' (since that is what is really happening) - so that passing index_col=None is meaningful. PRs / additional ideas welcome!
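To make the 'infer' idea concrete, here is a rough sketch of how a sentinel default could separate "the caller said nothing" from an explicit index_col=None. resolve_index_col is a hypothetical function written for this comment, not pandas' actual parsing code, and the inference rule is simplified to the behaviour described in the report (first column becomes the index when header is a list).

```python
def resolve_index_col(index_col="infer", header=0):
    """Decide which column, if any, becomes the index.

    Hypothetical sketch of the proposal, not pandas' implementation.
    """
    multi_header = isinstance(header, (list, tuple)) and len(header) > 1
    if isinstance(index_col, str) and index_col == "infer":
        # Nothing was passed: fall back to the inference described in
        # the report (first column is taken as the index for a
        # multi-row header, no index column otherwise).
        return 0 if multi_header else None
    # Anything explicit, including None, is honoured as given.
    return index_col


print(resolve_index_col(header=[0, 1]))                   # inferred
print(resolve_index_col(index_col=None, header=[0, 1]))   # explicit None
```

With this shape, existing calls that omit index_col keep their current behaviour, while index_col=None becomes a real request for "no index column".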