While going over some tests, I noticed the following here:
df = DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
               index=['one', 'two'],
               columns=MultiIndex.from_tuples(
                   [('a', 'q'), ('a', 'r'), ('a', 's'),
                    ('b', 't'), ('c', 'u'), ('c', 'v')]))
# to_csv
data = """,a,a,a,b,c,c
,q,r,s,t,u,v
,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12"""
result = self.read_csv(StringIO(data), header=[0, 1], index_col=0)
tm.assert_frame_equal(df, result)
Is this really expected behaviour, that pandas.read_csv should silently discard an empty row following the header if multiple rows are passed as headers?
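For reference, here is a stripped-down, self-contained sketch of the behaviour in question (assuming a recent pandas still discards the separator-only row; the data is made up):

```python
from io import StringIO

import pandas as pd

data = """,a,a
,q,r
,,
one,1,2
two,3,4"""
result = pd.read_csv(StringIO(data), header=[0, 1], index_col=0)
# The separator-only third line does not show up as an all-NaN data row.
print(result.shape)
```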
Comment From: WillAyd
Well, there is a parameter skip_blank_lines, which is True by default, but setting it to False didn't change your example, which seems off.
Investigation and PRs certainly welcome!
Comment From: dahlbaek
I think skip_blank_lines relates to truly blank lines, not to lines that contain only separator characters. Consider
from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n\n\n1,2")
pd.read_csv(filepath_or_buffer)
as opposed to
from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n,\n,\n1,2")
pd.read_csv(filepath_or_buffer)
or
from io import StringIO
import pandas as pd
filepath_or_buffer = StringIO("a,b\n\n\n1,2")
pd.read_csv(filepath_or_buffer, skip_blank_lines=False)
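To make the contrast concrete, here is a minimal sketch combining the three snippets above (assuming a recent pandas):

```python
from io import StringIO

import pandas as pd

truly_blank = "a,b\n\n\n1,2"
separators_only = "a,b\n,\n,\n1,2"

# Truly blank lines are skipped by default ...
default = pd.read_csv(StringIO(truly_blank))
# ... but kept as all-NaN rows when skip_blank_lines=False.
kept = pd.read_csv(StringIO(truly_blank), skip_blank_lines=False)
# Lines containing only separators are never "blank": they parse as
# all-NaN rows regardless of skip_blank_lines.
nan_rows = pd.read_csv(StringIO(separators_only))

print(default.shape, kept.shape, nan_rows.shape)  # (1, 2) (3, 2) (3, 2)
```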
Comment From: dahlbaek
I guess this is what git-blame is for! Related: #5298
As far as I can tell, this had to do with supporting the output format of pandas.DataFrame.to_csv. I dusted off an old version of pandas and ran the following:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=['one', 'two'],
                  columns=pd.MultiIndex.from_tuples(
                      [('a', 'q'), ('a', 'r'), ('a', 's'),
                       ('b', 't'), ('c', 'u'), ('c', 'v')]))
print(df.to_csv(), "\npandas version:", pd.__version__)
Output:
,a,a,a,b,c,c
,q,r,s,t,u,v
,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12
pandas version: 0.14.0
When running the same code on master, I get
,a,a,a,b,c,c
,q,r,s,t,u,v
one,1,2,3,4,5,6
two,7,8,9,10,11,12
pandas version: 0.24.0.dev0+332.g1f6ddc4
Looks like this 'feature' is no longer needed, right?
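A quick self-contained check of this on a recent build (smaller frame, but the same shape of example): with an unnamed index, to_csv no longer emits the separator-only third line.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['one', 'two'],
                  columns=pd.MultiIndex.from_tuples([('a', 'q'), ('a', 'r')]))
csv = df.to_csv()
print(csv)
# No separator-only line (",,") appears between header and data.
print(",," in csv.splitlines())  # False
```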
Comment From: WillAyd
Thanks for the investigation. I would agree with you that the "feature" in question is not necessary. PRs welcome!
Comment From: dahlbaek
Great, I'll give it a go then!
Comment From: dahlbaek
This is not as clear-cut as I had hoped. Modifying the above example to
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=pd.Index(['one', 'two'], name='index_name'),
                  columns=pd.MultiIndex.from_tuples(
                      [('a', 'q'), ('a', 'r'), ('a', 's'),
                       ('b', 't'), ('c', 'u'), ('c', 'v')]))
print(df.to_csv())
yields the output
,a,a,a,b,c,c
,q,r,s,t,u,v
index_name,,,,,,
one,1,2,3,4,5,6
two,7,8,9,10,11,12
Thus, the current behaviour seems to be that DataFrame.to_csv will change the output format according to the presence of the name/names attribute of DataFrame.index. However, pandas.read_csv then cannot know how to correctly parse csv files. Consider
import numpy as np
import pandas as pd
from io import StringIO

df_nan = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()),
                                          index_col=[0, 1]),
    sep="\n",
)
which outputs
dataframe:
0 1
first foo NaN NaN
second bar 1.0 2.0
csv representation:
,,0,1
first,foo,,
second,bar,1.0,2.0
roundtrip dataframe:
0 1
Unnamed: 1
first foo NaN NaN
second bar 1.0 2.0
Surprisingly, the output changes if a MultiIndex is supplied for the header:
df_nan = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
    columns=pd.MultiIndex.from_tuples([
        ('a', 'baz'),
        ('b', 'baz'),
    ]),
)
print(
    "dataframe:", df_nan,
    "\ncsv representation:", df_nan.to_csv(),
    "\nroundtrip dataframe:", pd.read_csv(StringIO(df_nan.to_csv()),
                                          index_col=[0, 1],
                                          header=[0, 1]),
    sep="\n",
)
outputs
dataframe:
a b
baz baz
first foo NaN NaN
second bar 1.0 2.0
csv representation:
,,a,b
,,baz,baz
first,foo,,
second,bar,1.0,2.0
roundtrip dataframe:
a b
baz baz
first foo
second bar 1.0 2.0
Note that this is still not the parsing I would expect; here pandas is interpreting the third line as a names attribute for the index MultiIndex.
To me, the default settings are surprising. As far as I can tell, the problem is that the csv format is not sufficiently rich to naturally encode the name/names attributes of multiindices. In view of that, the behaviour I would expect is:
When outputting to csv with a multiindex header, pandas strips name/names attributes, as if the function
def strip_level_names(df):
    """Return a copy of df with all index/column level names removed."""
    df_copy = df.copy()
    index = df_copy.index
    header = df_copy.columns
    # Check MultiIndex first: MultiIndex is itself a subclass of Index.
    if isinstance(index, pd.MultiIndex):
        index = pd.MultiIndex.from_tuples(list(index))
    elif isinstance(index, pd.Index):
        index = pd.Index(list(index))
    if isinstance(header, pd.MultiIndex):
        header = pd.MultiIndex.from_tuples(list(header))
    elif isinstance(header, pd.Index):
        header = pd.Index(list(header))
    return pd.DataFrame(data=df_copy.values, index=index, columns=header)
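A quick usage sketch of that idea (the helper is repeated here, slightly condensed, so the snippet runs on its own; this is a proposal, not an existing pandas API):

```python
import pandas as pd

def strip_level_names(df):
    # Rebuild index and columns from their labels, dropping level names.
    index = df.index
    header = df.columns
    if isinstance(index, pd.MultiIndex):
        index = pd.MultiIndex.from_tuples(list(index))
    else:
        index = pd.Index(list(index))
    if isinstance(header, pd.MultiIndex):
        header = pd.MultiIndex.from_tuples(list(header))
    else:
        header = pd.Index(list(header))
    return pd.DataFrame(data=df.values, index=index, columns=header)

df = pd.DataFrame(
    [[1, 2]],
    index=pd.MultiIndex.from_tuples([('first', 'foo')],
                                    names=['lvl_0', 'lvl_1']),
    columns=pd.Index(['a', 'b'], name='hdr'),
)
stripped = strip_level_names(df)
print(list(stripped.index.names), stripped.columns.name)  # [None, None] None
```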
had been called before to_csv. Setting a keyword parameter level_names=True changes the behaviour such that the output contains header and index name/names, by always including an extra row for the index name/names (similar to current functionality, but explicit and tunable instead of implicit). Similarly, read_csv would support a level_names keyword, in order to determine how to parse csv files.
This way the behaviour would be completely predictable, with the level_names keyword allowing the user to opt in to pandas-specific formatting that includes index level name/names if necessary.
Comment From: dahlbaek
So, here's the behaviour I would expect:
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples([
        ('first', 'foo'),
        ('second', 'bar'),
    ]),
)
print(df.to_csv())
Output (no change needed):
,,0,1
first,foo,,
second,bar,1.0,2.0
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["lvl_0", "lvl_1"],
    ),
    columns=["a", "b"],
)
print(df.to_csv())
Output (no change needed):
lvl_0,lvl_1,a,b
first,foo,,
second,bar,1.0,2.0
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ],
    ),
)
print(df.to_csv())
Output (no change needed):
,,a,b
,,A,A
first,foo,,
second,bar,1.0,2.0
print(df.to_csv(level_names=True))
Output (change needed):
,,a,b
,,A,A
,,,
first,foo,,
second,bar,1.0,2.0
Input:
df = pd.DataFrame(
    data=[[np.NaN, np.NaN], [1, 2]],
    index=pd.MultiIndex.from_tuples(
        [
            ('first', 'foo'),
            ('second', 'bar'),
        ],
        names=["idx_lvl_0", "idx_lvl_1"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [
            ("a", "A"),
            ("b", "A"),
        ],
        names=["hdr_lvl_0", "hdr_lvl_1"],
    ),
)
print(df.to_csv())
Output (change needed):
,,a,b
,,A,A
first,foo,,
second,bar,1.0,2.0
print(df.to_csv(level_names=True))
Output (change needed):
hdr_lvl_0,,a,b
hdr_lvl_1,,A,A
idx_lvl_0,idx_lvl_1,,
first,foo,,
second,bar,1.0,2.0
In the other direction, I would expect pd.read_csv with appropriate values for header, index_col and level_names to correctly reconstruct the original dataframes from the csv output (possibly omitting level names depending on the situation, but never failing to reconstruct the data itself).
Comment From: jreback
@dahlbaek I am not sure why you are trying to make csv a high-fidelity format; it's not. You have to do the best you can with the options and data in the file. We 'write' this format, why should we not read it?
cc @gfyoung
Comment From: gfyoung
Yeah... missing data situations are likely to introduce corner cases here and there. @dahlbaek: welcome to investigate and see, but be careful not to introduce too many special cases.
Comment From: dahlbaek
@jreback I'm not sure what you mean by csv not being a high-fidelity format. As far as I know, it is the main workhorse format for import/export used by postgres, for instance; see COPY and Populating a Database. This happens to be a use case relevant to my interests, as I sometimes want to move data between pandas and postgres. What formats do you consider high-fidelity?
@jreback @gfyoung Of course it is entirely up to you which csv formats pandas should accept/support, and pandas is great whichever one(s) you pick.
However:
I do believe that the csv family is a main workhorse for a lot of data analysis today, not least because so many tools can read and write csv files (like… well… pandas). I would wager that to many users of pandas, the csv format is the main language that pandas uses to communicate with the outside world. As such, I do think there is good reason to support the more 'standard' dialects, and a great place to start seems to be the Frictionless Data Tabular Specification. Which, of course, pandas already supports to a high degree!
Quoting the Readme.md, pandas…
has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
I should think that robust and predictable handling of the most classical and widespread open data format is a prerequisite for achieving that goal!
Just to make my point of view completely clear: I am not proposing that pandas should not be able to encode level names in csv files. I am proposing that header/index level names should be encoded in a consistent and predictable manner, or not at all. This is possible. The pandas-specific format (with index level names in the row following the header row(s)) is consistent and predictable (and pandas does use this output format sometimes). Similarly, the more 'standard' format where level names of the header/index are omitted is consistent and predictable (and pandas does use this output format sometimes). But mixing the two formats without providing a keyword argument (or a hint in the docs) for the user inevitably leads to unexpected behaviour.
Perhaps a middle ground could just be a section in the read_csv/to_csv docstrings describing how one might ensure consistent csv input/output (without any source code changes), coupled with tests that enforce that precise consistent behaviour?
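Such a consistency test could look roughly like this (a sketch; it only asserts behaviour that I believe already holds today for an unnamed MultiIndex, namely that the data, the index tuples and the column labels survive a roundtrip):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.MultiIndex.from_tuples([('first', 'foo'), ('second', 'bar')]),
    columns=['x', 'y'],
)
rt = pd.read_csv(StringIO(df.to_csv()), index_col=[0, 1])

# Level names are deliberately not asserted; only the content is.
assert rt.values.tolist() == df.values.tolist()
assert list(rt.columns) == ['x', 'y']
assert list(rt.index) == list(df.index)
print("roundtrip consistent")
```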
Comment From: Michael-E-Rose
The problem is not only when you want multiple rows as header, but also when you want no header at all.
My file:
aa
aa adopt
aa ambiguity
aa bank
aa banks
aa baseball
aa becomes
My code:
terms = pd.read_csv(TERM_FOLDER/'terms.csv', header=None, na_filter=False)
terms.head()
My output:
0
0 aa
1 aa adopt
2 aa ambiguity
3 aa bank
4 aa banks
Interestingly, with terms = pd.read_csv(TERM_FOLDER/'terms.csv', header=None, na_filter=False, skip_blank_lines=False) I get a pandas.errors.EmptyDataError: No columns to parse from file. Is this related or must I file a new issue?
Comment From: gfyoung
@Michael-E-Rose: since the parameters are relatively close, I would keep it here for now.