pandas read_csv and Unicode characters in filename (Python 2.7, pandas 0.15.2)

The code:

import pandas
df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv")

does not work, and gives the output:

IOError: File C:/???Q309.ppt~Metadata.tsv does not exist

It seems similar in nature to this issue: https://github.com/pydata/pandas/issues/9315. However, #9315 was reportedly fixed in 14.2 with 3.3.5. I am using 15.1 and 2.7.7.

Here is the output of pd.show_versions():

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.15.2
nose: 1.3.3
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

Thanks, Justin

Comment From: jmatejka

For what it's worth, the following is a workaround which seems to be doing the trick:

import pandas as pd
f = open(u"C:/成功例Q309~Metadata.tsv")
df = pd.read_csv(f)
f.close()

Comment From: jreback

see this issue here: https://github.com/pydata/pandas/issues/6770

The fix is already in 0.15.2 (e.g. it will decode with the system encoding), so I think you may need to set it.

Comment From: jmatejka

My mistake, I am using 0.15.2 (not 15.1).

But I'm still not clear: what are you suggesting that I "set"? The system encoding? Is this something that I would need to do before loading the file?

Thanks, Justin

Comment From: jreback

I think the system encoding might be set to something odd.

You can try setting it to utf-8 and see if it works.

Comment From: jmatejka

The filesystem encoding and the default encoding are 'mbcs' and 'cp1252', respectively:

sys.getfilesystemencoding()
Out[12]: 'mbcs'

sys.getdefaultencoding()
Out[13]: 'cp1252'

These options all fail in a similar way though:

df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv", encoding='utf-8')
df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv", encoding='mbcs')
df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv", encoding='cp1252')

Should I be setting the encoding in a different way?

Comment From: jreback

Those options control the encoding of the file's contents, not the filename. Try decoding the filename before passing it:

the_filename.decode('utf-8')

Then pass the decoded filename to read_csv.
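To illustrate the distinction between content encoding and filename encoding, here is a minimal sketch, assuming Python 3 and a recent pandas rather than the 2.7 / 0.15.2 stack in this thread. The temporary directory, the ASCII-only file name, and the column names are made up for the demo. It writes cp1252-encoded bytes and shows that `encoding=` governs how those bytes are decoded, independently of the path.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

import pandas as pd

# Hypothetical demo file with an ASCII-only name, so only the
# *contents* carry non-ASCII data.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "metadata.tsv")

# Write the contents as cp1252 bytes ("caf\u00e9" is "cafe" with an acute e).
with io.open(path, "wb") as f:
    f.write(u"label\tcount\ncaf\u00e9\t3\n".encode("cp1252"))

# encoding= tells read_csv how to decode the bytes inside the file.
# It has no effect on how the path itself is interpreted.
df = pd.read_csv(path, sep="\t", encoding="cp1252")
```

Because the path here is plain ASCII, any failure with a non-ASCII path would have to come from how the filename is handled, not from this parameter.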

Comment From: jmatejka

Using filename.decode('utf-8') gives this error:


UnicodeEncodeError: 'charmap' codec can't encode characters in position 3-5: character maps to <undefined>
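One plausible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: in Python 2, calling .decode() on a value that is already a unicode object first encodes it implicitly with the default encoding (reported as 'cp1252' earlier in this thread), and cp1252 has no mapping for the three CJK characters at positions 3-5 of the path. The sketch below makes that encode step explicit so it can be run on any Python version. The `errors="replace"` variant also produces the kind of `???` seen in the original IOError message, although whether pandas took that exact code path is not established here.

```python
# -*- coding: utf-8 -*-

path = u"C:/成功例Q309~Metadata.tsv"

# A strict encode to cp1252 fails, because the code page cannot
# represent the CJK characters in the path.
try:
    path.encode("cp1252")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# error.start and error.end delimit the offending characters
# (end is exclusive).
bad_span = (error.start, error.end)

# With the "replace" error handler, each unencodable character is
# substituted with "?" instead of raising.
lossy = path.encode("cp1252", "replace")
```

If this reading is right, .decode() is only appropriate when the filename is a byte string. A path that is already unicode should be passed through unchanged.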

Comment From: TomAugspurger

I think this is an issue with the filesystem / encoding. Let me know if it's still a problem, and if Python's builtin open(filename) works, but pandas read_csv does not.

Comment From: Masterxilo

This still happens for me. The worst part is that if I use the workaround using open, read_csv does not parse the utf-8 in the file correctly anymore. Any help?

Comment From: Rajasivaranjan

Try using the open command as below. It worked for me.

df = pd.read_csv(open(filename, 'r'))
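If the file's contents stop decoding correctly once a handle is passed, one thing to try is opening the handle in text mode with an explicit encoding, so the decoding no longer depends on the platform default. This is a minimal sketch, assuming Python 3 and a recent pandas. It has not been checked against Python 2.7 with pandas 0.15.2, and the temporary directory, file contents, and column names are made up for the demo.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

import pandas as pd

# Hypothetical demo file: non-ASCII name and UTF-8 encoded contents.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, u"成功例Q309~Metadata.tsv")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"name\tvalue\n成功\t1\n")

# Open the handle ourselves with an explicit encoding, then hand it to
# read_csv. The with-block closes the file even if parsing raises.
with io.open(path, "r", encoding="utf-8", newline="") as f:
    df = pd.read_csv(f, sep="\t")
```

The point of the design is to keep the two concerns separate: `io.open` resolves the unicode path and decodes the bytes, and read_csv only parses the text it is given.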