Input data
Code Sample, a copy-pastable example if possible
import pandas as pd

df = pd.read_csv('foo1.csv', sep='\t', encoding='utf-16le', engine='python')
df.to_csv("out.csv", mode='wb', sep='\t', encoding='utf-16le', engine='python')
Expected Output
out.csv should be exactly the same as foo1.csv; instead it becomes "corrupted". For example, trying to open it with Excel shows a message saying the file is corrupted and cannot be opened. Inspecting the output with a hex editor shows that extra bytes have been added to the header of out.csv, as well as other additional bytes.
output of pd.show_versions()
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: None
pip: 8.1.1
setuptools: 20.10.1
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.9999999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
Comment From: jreback
You are writing out with explicit encodings; I don't think Excel likes these things.
Comment From: brupelo
@jreback : Did you try/reproduce the provided example? You should see the output file having some extra bytes at the beginning looking like this.
In any case, what would you suggest, then? I just want to work with my final processed csv files in Excel right away. Right now I was just trying to merge and drop_duplicates rows from a bunch of csv (utf-16-le) files; that's pretty much it.
Regards.
Comment From: jreback
You didn't answer the question:
why are you encoding to Excel?
Comment From: brupelo
@jreback I haven't answered your question because you didn't ask it in the first place :-)
In any case, I'm using this
df.to_csv("out.csv", mode='wb', sep='\t', encoding='utf-16le', engine='python')
expecting an output Excel can read without problems (as it already does with the source csv files).
If I try df.to_csv("out.csv", mode='wb', sep='\t')
I get an error like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
And if I try df.to_csv("out.csv", mode='wb', sep='\t', encoding='utf8')
the output ends up garbled.
In any case, any suggestion will be well received. I think my use case is pretty basic, actually.
Thanks
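The merge-and-deduplicate use case described above can be sketched roughly as follows. This is a hedged illustration, not the reporter's actual code: the file names and column names are invented, the sample files are generated in place so the snippet is self-contained, and encoding='utf-16' (rather than 'utf-16le') is used so the BOM is handled on read, which is the fix suggested later in this thread.

```python
import pandas as pd

# Build two small sample files (stand-ins for the original UTF-16 exports).
sample = pd.DataFrame({"Ad group": ["a", "b"], "Clicks": [1, 2]})
sample.to_csv("foo1.csv", sep="\t", index=False, encoding="utf-16")
sample.to_csv("foo2.csv", sep="\t", index=False, encoding="utf-16")

# Read each file with encoding='utf-16' so the BOM is consumed on decode,
# then concatenate and drop duplicate rows before writing the result back out.
frames = [pd.read_csv(p, sep="\t", encoding="utf-16") for p in ("foo1.csv", "foo2.csv")]
merged = pd.concat(frames, ignore_index=True).drop_duplicates()
merged.to_csv("out.csv", sep="\t", index=False, encoding="utf-16")
```

Since the two sample files are identical, the concatenated frame has four rows and drop_duplicates leaves two.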
Comment From: jorisvandenbossche
@brupelo The problem probably lies in the reading of the file (and it is just writing what it read):
In [64]: df = pd.read_csv('Scipy/foo1.csv', sep='\t', encoding='utf-16')
In [65]: df.columns[0]
Out[65]: u'Ad group'
In [66]: df = pd.read_csv('Scipy/foo1.csv', sep='\t', encoding='utf-16-le')
In [67]: df.columns[0]
Out[67]: u'\ufeffAd group'
So using encoding='utf-16-le' for some reason leads to an incorrect result, while just 'utf-16' works (although Notepad++ tells me this is indeed UCS-2 Little Endian, I found that UTF-16 is sometimes wrongly recognized as UCS-2. Further, Wikipedia says: "The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text").
Anyway, can you try with just encoding='utf-16'?
Comment From: jorisvandenbossche
Looking at the file foo1.csv, it has a BOM, so following the above quote from Wikipedia you should not specify the endianness (and encoding='utf-16' should indeed be the correct solution).
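The BOM behaviour described above can be checked with the standard codecs alone (the header text here is made up): decoding with 'utf-16' detects and consumes a leading BOM, while 'utf-16-le' leaves it in the text as a literal U+FEFF character, which is exactly the stray character seen in the column name earlier in this thread.

```python
# A header row prefixed with a BOM, encoded as little-endian UTF-16 bytes.
raw = "\ufeffAd group\tClicks\n".encode("utf-16-le")

decoded_auto = raw.decode("utf-16")     # BOM detected and stripped
decoded_le = raw.decode("utf-16-le")    # BOM kept as a literal character

assert decoded_auto == "Ad group\tClicks\n"
assert decoded_le == "\ufeffAd group\tClicks\n"
```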
Comment From: jorisvandenbossche
@brupelo I am closing this issue since there hasn't been any activity for a while. But that does not mean further questions or comments are not welcome anymore.
Comment From: chrisgdorn
Could the issue be that Excel interprets a csv file with 'ID' in the 'A1' position as a SYLK file? (MS help file)
I was experiencing the same problem and after changing the index name:
df.index.names = ['LocID']
I no longer had any problems
Comment From: FloBay
I agree with @chrisgdorn. It is an Excel issue, not a pandas issue.
When you export a data frame with tabs like this:
df.to_csv(path, sep='\t')
or df.to_csv(path, sep='\t', index=False)
(depending on your structure)
and the name of the first column in the exported txt file is "ID", then Excel has this SYLK file issue.
However, when you rename your index (or first column, respectively) to anything else, Excel has no issue anymore. A look at the wiki can confirm this: SYLK
Thus, I recommend using "Identifier", or even just "Id" or "id", instead of "ID", and you are all good to view your data in Excel.
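The rename workaround can be sketched like this (the frame and file name are invented for illustration): rename the index before exporting so the literal string "ID" never lands in cell A1 of the file Excel opens.

```python
import pandas as pd

# A tiny frame whose index would otherwise be written out under the name "ID".
df = pd.DataFrame({"value": [10, 20]})
df.index.name = "ID"

# Rename the index before export so the first cell of the file is not "ID",
# which Excel would mistake for a SYLK header.
df = df.rename_axis("Id")
df.to_csv("out_id.csv", sep="\t")

with open("out_id.csv") as f:
    header = f.readline()
assert header.startswith("Id\t")
```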