While going over some tests, I noticed the following here:
df = DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
               index=['one', 'two'],
               columns=MultiIndex.from_tuples(
                   [('a', 'q'), ('a', 'r'), ('a', 's'),
                    ('b', 't'), ('c', 'u'), ('c', 'v')]))
# to_csv
data = """,a,a,a,b,c,c
,q,r,s,t,u,v
,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12"""
result = self.read_csv(StringIO(data), header=[0, 1], index_col=0)
tm.assert_frame_equal(df, result)
Is this really expected behaviour, that pandas.read_csv should silently discard an empty row following the header if multiple rows are passed as headers?
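For reference, here is a stripped-down, self-contained sketch of the behaviour in question (assuming a recent pandas still discards the separator-only row; the data is made up):

```python
from io import StringIO

import pandas as pd

data = """,a,a
,q,r
,,
one,1,2
two,3,4"""
result = pd.read_csv(StringIO(data), header=[0, 1], index_col=0)
# The separator-only third line does not show up as an all-NaN data row.
print(result.shape)
```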
Comment From: WillAyd
Well, there is a parameter skip_blank_lines, which is True by default, but setting it to False didn't change your example, which seems off.
Investigation and PRs certainly welcome!
Comment From: dahlbaek
I think skip_blank_lines relates to truly blank lines, not to lines that contain only separator characters. Consider
from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n\n\n1,2")
pd.read_csv(filepath_or_buffer)
as opposed to
from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n,\n,\n1,2")
pd.read_csv(filepath_or_buffer)
or
from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n\n\n1,2")
pd.read_csv(filepath_or_buffer, skip_blank_lines=False)
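To make the contrast concrete, here is a minimal sketch combining the three snippets above (assuming a recent pandas):

```python
from io import StringIO

import pandas as pd

truly_blank = "a,b\n\n\n1,2"
separators_only = "a,b\n,\n,\n1,2"

# Truly blank lines are skipped by default ...
default = pd.read_csv(StringIO(truly_blank))
# ... but kept as all-NaN rows when skip_blank_lines=False.
kept = pd.read_csv(StringIO(truly_blank), skip_blank_lines=False)
# Lines containing only separators are never "blank": they parse as
# all-NaN rows regardless of skip_blank_lines.
nan_rows = pd.read_csv(StringIO(separators_only))

print(default.shape, kept.shape, nan_rows.shape)  # (1, 2) (3, 2) (3, 2)
```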
Comment From: dahlbaek
I guess this is what git-blame is for! Related: #5298
As far as I can tell, this had to do with supporting the output format of pandas.DataFrame.to_csv. I dusted off an old version of pandas and ran the following:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=['one', 'two'],
                  columns=pd.MultiIndex.from_tuples(
                      [('a', 'q'), ('a', 'r'), ('a', 's'),
                       ('b', 't'), ('c', 'u'), ('c', 'v')]))
print(df.to_csv(), "\npandas version:", pd.__version__)
Output:
,a,a,a,b,c,c
,q,r,s,t,u,v
,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12
pandas version: 0.14.0
When running the same code on master, I get
,a,a,a,b,c,c
,q,r,s,t,u,v
one,1,2,3,4,5,6
two,7,8,9,10,11,12
pandas version: 0.24.0.dev0+332.g1f6ddc4
Looks like this 'feature' is no longer needed, right?
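A quick self-contained check of this on a recent build (smaller frame, but the same shape of example): with an unnamed index, to_csv no longer emits the separator-only third line.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['one', 'two'],
                  columns=pd.MultiIndex.from_tuples([('a', 'q'), ('a', 'r')]))
csv = df.to_csv()
print(csv)
# No separator-only line (",,") appears between header and data.
print(",," in csv.splitlines())  # False
```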
Comment From: WillAyd
Thanks for the investigation. I would agree with you that the "feature" in question is not necessary. PRs welcome!
Comment From: dahlbaek
Great, I'll give it a go then!
Comment From: dahlbaek
This is not as clear-cut as I had hoped. Modifying the above example to
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=pd.Index(['one', 'two'], name='index_name'),
                  columns=pd.MultiIndex.from_tuples(
                      [('a', 'q'), ('a', 'r'), ('a', 's'),
                       ('b', 't'), ('c', 'u'), ('c', 'v')]))
print(df.to_csv())
yields the output
,a,a,a,b,c,c
,q,r,s,t,u,v
index_name,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12
Thus, the current behaviour seems to be that DataFrame.to_csv will change the output format according to the presence of the name/names attribute of DataFrame.index. However, pandas.read_csv then cannot know how to correctly parse csv files. Consider
import numpy as np
import pandas as pd
from io import StringIO

df_nan = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()),
                                          index_col=[0, 1]),
    sep="\n",
)
which outputs
dataframe:
0 1
first foo NaN NaN
second bar 1.0 2.0
csv representation:
,,0,1
first,foo,,
second,bar,1.0,2.0
roundtrip dataframe:
0 1
Unnamed: 1
first foo NaN NaN
second bar 1.0 2.0
Surprisingly, the output changes if a MultiIndex is supplied for the header:
df_nan = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
    columns=pd.MultiIndex.from_tuples([
        ('a', 'baz'),
        ('b', 'baz'),
    ]),
)
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()),
                                          index_col=[0, 1],
                                          header=[0, 1]),
    sep="\n",
)
outputs
dataframe:
a b
baz baz
first foo NaN NaN
second bar 1.0 2.0
csv representation:
,,a,b
,,baz,baz
first,foo,,
second,bar,1.0,2.0
roundtrip dataframe:
a b
baz baz
first foo
second bar 1.0 2.0
Note that this is still not the parsing I would expect; here pandas is interpreting the third line as a names attribute for the index MultiIndex.
To me, the default settings are surprising. As far as I can tell, the problem is that the csv format is not sufficiently rich to naturally encode the name/names attributes of multiindices. In view of that, the behaviour I would expect is:
When outputting to csv with a multiindex header, pandas strips name/names attributes, as if the function
def strip_level_names(df):
    """Return a copy of df with all index/column level names removed."""
    df_copy = df.copy()
    index = df_copy.index
    header = df_copy.columns
    # Check MultiIndex first: MultiIndex is itself a subclass of Index.
    if isinstance(index, pd.MultiIndex):
        index = pd.MultiIndex.from_tuples(list(index))
    elif isinstance(index, pd.Index):
        index = pd.Index(list(index))
    if isinstance(header, pd.MultiIndex):
        header = pd.MultiIndex.from_tuples(list(header))
    elif isinstance(header, pd.Index):
        header = pd.Index(list(header))
    return pd.DataFrame(data=df_copy.values, index=index, columns=header)
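A quick usage sketch of that idea (the helper is repeated here, slightly condensed, so the snippet runs on its own; this is a proposal, not an existing pandas API):

```python
import pandas as pd

def strip_level_names(df):
    # Rebuild index and columns from their labels, dropping level names.
    index = df.index
    header = df.columns
    if isinstance(index, pd.MultiIndex):
        index = pd.MultiIndex.from_tuples(list(index))
    else:
        index = pd.Index(list(index))
    if isinstance(header, pd.MultiIndex):
        header = pd.MultiIndex.from_tuples(list(header))
    else:
        header = pd.Index(list(header))
    return pd.DataFrame(data=df.values, index=index, columns=header)

df = pd.DataFrame(
    [[1, 2]],
    index=pd.MultiIndex.from_tuples([('first', 'foo')],
                                    names=['lvl_0', 'lvl_1']),
    columns=pd.Index(['a', 'b'], name='hdr'),
)
stripped = strip_level_names(df)
print(list(stripped.index.names), stripped.columns.name)  # [None, None] None
```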
had been called before to_csv. Setting a keyword parameter level_names=True changes the behaviour such that the output contains header and index name/names, by always including an extra row for the index name/names (similar to current functionality, but explicit and tunable instead of implicit). Similarly, read_csv would support a level_names keyword, in order to determine how to parse csv files.
This way the behaviour would be completely predictable, with the level_names keyword allowing the user to opt in to pandas-specific formatting that includes index level name/names if necessary.
Comment From: dahlbaek
So, here's the behaviour I would expect:
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
print(df.to_csv())
Output (no change needed):
,,0,1
first,foo,,
second,bar,1.0,2.0
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["lvl_0", "lvl_1"],
    ),
    columns=["a", "b"],
)
print(df.to_csv())
Output (no change needed):
lvl_0,lvl_1,a,b
first,foo,,
second,bar,1.0,2.0
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ],
    ),
)
print(df.to_csv())
Output (no change needed):
,,a,b
,,A,A
first,foo,,
second,bar,1.0,2.0
print(df.to_csv(level_names=True))
Output (change needed):
,,a,b
,,A,A
,,,
first,foo,,
second,bar,1.0,2.0
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["idx_lvl_0", "idx_lvl_1"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ],
        names=["hdr_lvl_0", "hdr_lvl_1"],
    ),
)
print(df.to_csv())
Output (change needed):
,,a,b
,,A,A
first,foo,,
second,bar,1.0,2.0
print(df.to_csv(level_names=True))
Output (change needed):
hdr_lvl_0,,a,b
hdr_lvl_1,,A,A
idx_lvl_0,idx_lvl_1,,
first,foo,,
second,bar,1.0,2.0
In the other direction, I would expect pd.read_csv with appropriate values for header, index_col and level_names to correctly reconstruct the original dataframes from the csv output (possibly omitting level names depending on the situation, but never failing to reconstruct the data itself).
Comment From: jreback
@dahlbaek I am not sure why you are trying to make csv a high-fidelity format; it's not. You have to do the best you can with the options and data in the file. We 'write' this format, why should we not read it?
cc @gfyoung
Comment From: gfyoung
Yeah... missing data situations are likely to introduce corner cases here and there. @dahlbaek: welcome to investigate and see, but be careful not to introduce too many special cases.
Comment From: dahlbaek
@jreback I'm not sure what you mean by csv not being a high-fidelity format. As far as I know, it is the main workhorse format for import/export used by postgres, for instance; see COPY and Populating a Database. This happens to be a use case relevant to my interests, as I sometimes want to move data between pandas and postgres. What formats do you consider high-fidelity?
@jreback @gfyoung Of course it is entirely up to you which csv formats pandas should accept/support, and pandas is great whichever one(s) you pick.
However:
I do believe that the csv family is a main workhorse for a lot of data analysis today, not least because so many tools can read and write csv files (like… well… pandas). I would wager that to many users of pandas, the csv format is the main language that pandas uses to communicate with the outside world. As such, I do think there is good reason to support the more 'standard' dialects, and a great place to start seems to be the Frictionless Data Tabular Specification. Which, of course, pandas already supports to a high degree!
Quoting the Readme.md, pandas…
has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
I should think that robust and predictable handling of the most classical and widespread open data format is a prerequisite for achieving that goal!
Just to make my point of view completely clear: I am not proposing that pandas should not be able to encode level names in csv files. I am proposing that header/index level names should be encoded in a consistent and predictable manner, or not at all. This is possible. The pandas-specific format (with index level names in the row following the header row(s)) is consistent and predictable (and pandas does use this output format sometimes). Similarly, the more 'standard' format where level names of the header/index are omitted is consistent and predictable (and pandas does use this output format sometimes). But mixing the two formats without providing a keyword argument (or a hint in the docs) for the user inevitably leads to unexpected behaviour.
Perhaps a middle ground could just be a section in the read_csv/to_csv docstrings describing how one might ensure consistent csv input/output (without any source code changes), coupled with tests that enforce that precise consistent behaviour?
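Such a consistency test could look roughly like this (a sketch; it only asserts behaviour that I believe already holds today for an unnamed MultiIndex, namely that the data, the index tuples and the column labels survive a roundtrip):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.MultiIndex.from_tuples([('first', 'foo'), ('second', 'bar')]),
    columns=['x', 'y'],
)
rt = pd.read_csv(StringIO(df.to_csv()), index_col=[0, 1])

# Level names are deliberately not asserted; only the content is.
assert rt.values.tolist() == df.values.tolist()
assert list(rt.columns) == ['x', 'y']
assert list(rt.index) == list(df.index)
print("roundtrip consistent")
```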
Comment From: Michael-E-Rose
The problem is not only when you want multiple rows as header, but also when you want no header at all.
My file:
aa
aa adopt
aa ambiguity
aa bank
aa banks
aa baseball
aa becomes
My code:
terms = pd.read_csv(TERM_FOLDER/'terms.csv', header=None, na_filter=False)
terms.head()
My output:
0
0 aa
1 aa adopt
2 aa ambiguity
3 aa bank
4 aa banks
Interestingly, with terms = pd.read_csv(TERM_FOLDER/'terms.csv', header=None, na_filter=False, skip_blank_lines=False) I get a pandas.errors.EmptyDataError: No columns to parse from file. Is this related or must I file a new issue?
Comment From: gfyoung
@Michael-E-Rose: since the parameters are relatively close, I would keep it here for now.