Input data
Code Sample, a copy-pastable example if possible
import pandas as pd

df = pd.read_csv('foo1.csv', sep='\t', encoding='utf-16le', engine='python')
df.to_csv("out.csv", mode='wb', sep='\t', encoding='utf-16le', engine='python')
Expected Output
out.csv should be exactly the same as foo1.csv; instead it becomes "corrupted". For example, trying to open it with Excel shows a message saying the file is corrupted and cannot be opened. Inspecting the output with a hex editor shows that extra bytes have been added to the header of out.csv, as well as other additional bytes.
output of pd.show_versions()
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: None
pip: 8.1.1
setuptools: 20.10.1
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.9999999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
Comment From: jreback
You are writing out with explicit encodings; I don't think Excel likes these things.
Comment From: brupelo
@jreback : Did you try/reproduce the provided example? You should see the output file having some extra bytes at the beginning looking like this.
In any case, what would you suggest, then? I just want to work with my final processed csv files in Excel right away. Right now I was just trying to merge and drop_duplicates rows from a bunch of csv (utf-16-le) files; that's pretty much it.
Regards.
Comment From: jreback
You didn't answer the question:
why are you encoding to Excel?
Comment From: brupelo
@jreback I haven't answered your question because you didn't ask it in the first place :-)
In any case, I'm using this
df.to_csv("out.csv", mode='wb', sep='\t', encoding='utf-16le', engine='python')
expecting an output Excel can read without problems (as it already does with the source csv files).
If I try df.to_csv("out.csv", mode='wb', sep='\t')
I get an error like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
And if I try df.to_csv("out.csv", mode='wb', sep='\t', encoding='utf8')
the output ends up garbled.
In any case, any suggestion will be well received. I think my use case is pretty basic, actually.
Thanks
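The merge-and-deduplicate use case described above can be sketched roughly as follows. This is a hedged illustration, not the reporter's actual code: the file names and column names are invented, the sample files are generated in place so the snippet is self-contained, and encoding='utf-16' (rather than 'utf-16le') is used so the BOM is handled on read, which is the fix suggested later in this thread.

```python
import pandas as pd

# Build two small sample files (stand-ins for the original UTF-16 exports).
sample = pd.DataFrame({"Ad group": ["a", "b"], "Clicks": [1, 2]})
sample.to_csv("foo1.csv", sep="\t", index=False, encoding="utf-16")
sample.to_csv("foo2.csv", sep="\t", index=False, encoding="utf-16")

# Read each file with encoding='utf-16' so the BOM is consumed on decode,
# then concatenate and drop duplicate rows before writing the result back out.
frames = [pd.read_csv(p, sep="\t", encoding="utf-16") for p in ("foo1.csv", "foo2.csv")]
merged = pd.concat(frames, ignore_index=True).drop_duplicates()
merged.to_csv("out.csv", sep="\t", index=False, encoding="utf-16")
```

Since the two sample files are identical, the concatenated frame has four rows and drop_duplicates leaves two.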
Comment From: jorisvandenbossche
@brupelo The problem probably lies in the reading of the file (and it is just writing what it read):
In [64]: df = pd.read_csv('Scipy/foo1.csv', sep='\t', encoding='utf-16')
In [65]: df.columns[0]
Out[65]: u'Ad group'
In [66]: df = pd.read_csv('Scipy/foo1.csv', sep='\t', encoding='utf-16-le')
In [67]: df.columns[0]
Out[67]: u'\ufeffAd group'
So using encoding='utf-16-le' for some reason leads to an incorrect result, while just 'utf-16' works (although Notepad++ tells me this is indeed UCS-2 Little Endian, I found that UTF-16 is sometimes wrongly recognized as UCS-2. Further, Wikipedia says: "The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text").
Anyway, can you try with just encoding='utf-16'?
Comment From: jorisvandenbossche
Looking at the file foo1.csv, it has a BOM, so following the above quote from Wikipedia you should not specify the endianness (and encoding='utf-16' should indeed be the correct solution).
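The BOM behaviour described above can be checked with the standard codecs alone (the header text here is made up): decoding with 'utf-16' detects and consumes a leading BOM, while 'utf-16-le' leaves it in the text as a literal U+FEFF character, which is exactly the stray character seen in the column name earlier in this thread.

```python
# A header row prefixed with a BOM, encoded as little-endian UTF-16 bytes.
raw = "\ufeffAd group\tClicks\n".encode("utf-16-le")

decoded_auto = raw.decode("utf-16")     # BOM detected and stripped
decoded_le = raw.decode("utf-16-le")    # BOM kept as a literal character

assert decoded_auto == "Ad group\tClicks\n"
assert decoded_le == "\ufeffAd group\tClicks\n"
```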
Comment From: jorisvandenbossche
@brupelo I am closing this issue since there hasn't been any activity for a while. But that does not mean further questions or comments are not welcome anymore.
Comment From: chrisgdorn
Could the issue be that Excel interprets a csv file with 'ID' in the 'A1' position as a SYLK file? (MS help file)
I was experiencing the same problem and after changing the index name:
df.index.names = ['LocID']
I no longer had any problems
Comment From: FloBay
I agree with @chrisgdorn. It is an Excel issue, not a pandas issue.
When you export a data frame with tabs like this:
df.to_csv(path, sep='\t')
or df.to_csv(path, sep='\t', index=False)
(depending on your structure)
and the name of the first column in the exported txt file is "ID", then Excel has this SYLK file issue.
However, when you rename your index (or first column, respectively) to anything else, Excel has no issue anymore. A look at the wiki can confirm this: SYLK
Thus, I recommend using "Identifier", or even just "Id" or "id", instead of "ID", and you are all good to view your data in Excel.
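The rename workaround can be sketched like this (the frame and file name are invented for illustration): rename the index before exporting so the literal string "ID" never lands in cell A1 of the file Excel opens.

```python
import pandas as pd

# A tiny frame whose index would otherwise be written out under the name "ID".
df = pd.DataFrame({"value": [10, 20]})
df.index.name = "ID"

# Rename the index before export so the first cell of the file is not "ID",
# which Excel would mistake for a SYLK header.
df = df.rename_axis("Id")
df.to_csv("out_id.csv", sep="\t")

with open("out_id.csv") as f:
    header = f.readline()
assert header.startswith("Id\t")
```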