If I have a DataFrame whose cells contain lists of strings (or unicode strings), those lists are written incorrectly when I use to_csv() with the encoding parameter set. The problem does not occur if the encoding is not set.

Here is an example (using pandas version 0.16.2):

df = pd.DataFrame.from_records(
    [('Mary S.',['Detroit, MI','New York, NY']),
     ('John U.',[u'Atlanta, GA',u'Paris, France'])],
    columns=['name','residences'])
df.to_csv('ascii.csv')
df.to_csv('utf8.csv',encoding='utf-8')

The ASCII-encoded CSV file is fine (contents of 'ascii.csv' below):

,name,residences
0,Mary S.,"['Detroit, MI', 'New York, NY']"
1,John U.,"[u'Atlanta, GA', u'Paris, France']"

But the UTF-8-encoded CSV file fails to quote the strings within the lists (contents of 'utf8.csv' below):

,name,residences
0,Mary S.,"[Detroit, MI, New York, NY]"
1,John U.,"[Atlanta, GA, Paris, France]"

This results in the data being impossible to recover. For example, if I load this file using read_csv(), the relevant cells are treated as strings, and cannot be accurately recast as lists.
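To illustrate why recovery is impossible, here is a small sketch (my own, not part of the original report) using the standard library's ast.literal_eval: the correctly quoted cell parses back into a list, while the unquoted cell is ambiguous and fails to parse.

```python
import ast

# A cell from the correctly written 'ascii.csv': the inner strings are quoted,
# so the cell is a valid Python literal and parses back into a list.
good = "['Detroit, MI', 'New York, NY']"
assert ast.literal_eval(good) == ['Detroit, MI', 'New York, NY']

# A cell from the broken 'utf8.csv': the inner quotes are gone, so there is no
# way to tell which commas separate list items and which are part of a string.
bad = "[Detroit, MI, New York, NY]"
try:
    ast.literal_eval(bad)
    recovered = True
except (ValueError, SyntaxError):
    recovered = False
assert not recovered
```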

The behavior is the same with encoding='utf-16', but I didn't check any other encodings.

Comment From: rtkaleta

Hi,

This is still an issue in Pandas v0.18.1:

>>> import pandas as pd
>>> pd.__version__
u'0.18.1'
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')

Result:

>>> cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"

An even weirder quirk: even when encoding='ascii' - i.e. we explicitly set the encoding to its apparent default - the result is also broken:

>>> import pandas as pd
>>> data = [{'names': ['foo', 'bar']}, {'names': ['baz', 'qux']}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='ascii')

Result:

>>> cat temp.csv
,names
0,"[foo, bar]"
1,"[baz, qux]"

Note this seems to affect only columns containing lists of strings. If a column contains a list of, e.g., dictionaries, the data is written to CSV correctly:

>>> import pandas as pd
>>> data = [{'names': [{'foo': 1}, {'bar': 2}]}, {'names': [{'baz': 3}, {'qux': 4}]}]
>>> df = pd.DataFrame(data)
>>> df.to_csv(path_or_buf='temp.csv', encoding='utf-8')

Result:

>>> cat temp.csv
,names
0,"[{u'foo': 1}, {u'bar': 2}]"
1,"[{u'baz': 3}, {u'qux': 4}]"

Comment From: jreback

I suppose. Embedded lists of non-scalars are not first-class citizens in pandas at all, nor are they generally losslessly convertible to/from CSV. JSON is a better format for this. A community-supported PR would be OK.
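One way to take that advice while still producing a CSV file (a sketch of my own, not from the thread) is to serialize the list column to JSON strings before writing and decode them after reading, which round-trips the lists losslessly regardless of encoding:

```python
import io
import json

import pandas as pd

df = pd.DataFrame.from_records(
    [('Mary S.', ['Detroit, MI', 'New York, NY']),
     ('John U.', ['Atlanta, GA', 'Paris, France'])],
    columns=['name', 'residences'])

# Encode the list column as JSON text so each CSV cell is unambiguous.
out = df.assign(residences=df['residences'].apply(json.dumps))
buf = io.StringIO()
out.to_csv(buf, index=False)

# Decode on the way back in; the lists survive the round trip intact.
buf.seek(0)
restored = pd.read_csv(buf)
restored['residences'] = restored['residences'].apply(json.loads)
assert restored.loc[0, 'residences'] == ['Detroit, MI', 'New York, NY']
```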

Comment From: TomAugspurger

I think this has been fixed, but not by #17821. Would be nice to ensure we have a regression test in place.

Comment From: rtkaleta

@TomAugspurger Thanks for picking this up, and sorry it took me so long to respond. Looks like this is now fixed for writing string arrays using the ascii encoding, but still broken for utf-8-encoded values. See https://github.com/pandas-dev/pandas/pull/18013.

Comment From: rtkaleta

I'll have a stab at a fix. It stems from the fact that pandas' own UnicodeWriter calls pprint_thing without quote_strings=True, so:

>>> from pandas.io.formats.printing import pprint_thing
>>> pprint_thing([u'foo', u'bar'])
u'[foo, bar]'

instead of the more intuitive:

>>> pprint_thing([u'foo', u'bar'], quote_strings=True)
u"[u'foo', u'bar']"

Why do we have our own UnicodeWriter here instead of unicodecsv.writer?

Comment From: rtkaleta

@jreback This should not have been closed; it got closed because I mentioned it in https://github.com/pandas-dev/pandas/pull/18013. Please reopen and I'll have a stab at the fix, thanks.

Comment From: Rajjae

Hello, this is still an issue in pandas v0.23.4. It is fixed when using the ascii encoding, but still broken when using the utf-8 encoding.

Comment From: aausch

Ping: any chance this is getting fixed?

Comment From: TomAugspurger

@jschendel did #25864 fix this?

Comment From: jschendel

I've only been able to reproduce this issue on Python 2; the output has looked fine to me on Python 3, even using some fairly old versions (e.g. 0.20.x). So this issue might not be relevant anymore, in the sense that we no longer support Python 2, if it is indeed a Python 2-specific issue.
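For reference, a small reproduction script (my own sketch, assuming a Python 3 environment with a recent pandas) that checks the behaviour described above: on Python 3 the inner quotes survive, so the cell holds the full list repr rather than the broken "[foo, bar]" form.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'names': [['foo', 'bar'], ['baz', 'qux']]})

# Write with an explicit encoding, the scenario this issue is about.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'temp.csv')
    df.to_csv(path, encoding='utf-8')
    with open(path, encoding='utf-8') as f:
        contents = f.read()

# The inner strings keep their quotes, so the data is recoverable.
assert "\"['foo', 'bar']\"" in contents
```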

Comment From: TomAugspurger

I also can't reproduce on python 3. If anyone can, let us know and we'll reopen.