Code Sample, a copy-pastable example if possible
import pandas
cyrillic_filename = "./файл_1.csv"
# 'c' engine fails:
df = pandas.read_csv(cyrillic_filename, engine="c", encoding="cp1251")
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-18-9cb08141730c> in <module>()
2
3 cyrillic_filename = "./файл_1.csv"
----> 4 df = pandas.read_csv(cyrillic_filename , engine="c", encoding="cp1251")
d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
653 skip_blank_lines=skip_blank_lines)
654
--> 655 return _read(filepath_or_buffer, kwds)
656
657 parser_f.__name__ = name
d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
403
404 # Create the parser.
--> 405 parser = TextFileReader(filepath_or_buffer, **kwds)
406
407 if chunksize or iterator:
d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
762 self.options['has_index_names'] = kwds['has_index_names']
763
--> 764 self._make_engine(self.engine)
765
766 def close(self):
d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
983 def _make_engine(self, engine='c'):
984 if engine == 'c':
--> 985 self._engine = CParserWrapper(self.f, **self.options)
986 else:
987 if engine == 'python':
d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
1603 kwds['allow_leading_cols'] = self.index_col is not False
1604
-> 1605 self._reader = parsers.TextReader(src, **kwds)
1606
1607 # XXX
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8895)()
OSError: Initializing from file failed
# 'python' engine works:
df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251")
df.size
>>172440
# 'c' engine works if the filename contains only ASCII characters
latin_filename = "./file_1.csv"
df = pandas.read_csv(latin_filename, engine="c", encoding="cp1251")
df.size
>>172440
Problem description
The 'c' engine should be able to read files whose names contain non-ASCII characters (here, Cyrillic), just as the 'python' engine does.
Expected Output
The file content is read into a DataFrame.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.2
scipy: 0.19.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.0.0
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Comment From: gfyoung
@c-fos : Thanks for reporting this! A couple of questions:
- If you change the engine to Python, does it make a difference?
- Can you open the file simply by calling open(filename)?
I'm trying to figure out if this is just a general Python issue of not handling Cyrillic characters well or if this is a pandas-specific issue.
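The second question can be checked without pandas at all. The following is a minimal sketch, assuming a writable temporary directory; the helper name can_open_with_plain_python is made up for illustration and is not part of pandas or the standard library.

```python
import os
import tempfile


def can_open_with_plain_python(filename: str, text: str = "a,b\n1,2\n") -> bool:
    """Create a file with the given name in a temp dir and read it back with open().

    Hypothetical helper for this thread only. It returns True when plain
    Python can create and reopen a file whose name contains the given characters.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, filename)
        # Write and read with an explicit content encoding so the result
        # does not depend on the locale default.
        with open(path, "w", encoding="cp1251") as fh:
            fh.write(text)
        with open(path, encoding="cp1251") as fh:
            return fh.read() == text


print(can_open_with_plain_python("файл_1.csv"))
```

If this prints True while read_csv with engine="c" still fails on the same name, the problem is more likely in the parser than in Python's own file handling.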
Comment From: c-fos
cyrillic_filename = "./файл_1.csv"
# 'python' engine is working:
df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251")
df.size
>>172440
# a simple open is working
fd = open(cyrillic_filename)
fd
>><_io.TextIOWrapper name='./файл_1.csv' mode='r' encoding='cp1251'>
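Note that the encoding='cp1251' shown above is the default encoding for the file's contents, which is a separate setting from the encoding Python uses for file names. A small sketch to print both on a given machine (the variable names are only illustrative):

```python
import locale
import sys

# Default used by open() for file *contents* when no encoding is passed.
content_encoding = locale.getpreferredencoding(False)

# Encoding Python uses to convert file *names* between str and bytes.
filename_encoding = sys.getfilesystemencoding()

print("default encoding for file contents:", content_encoding)
print("encoding used for file names:", filename_encoding)
```

On a Russian-locale Windows install with Python 3.6 or later, these two values would be expected to differ (cp1251 for contents, utf-8 for names), though that is an assumption about the reporter's machine rather than something shown in this thread.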
Comment From: gfyoung
Alright, this is indeed a pandas-specific issue with the C engine. More than welcome to track this down and submit a patch for it!
Comment From: gfyoung
The error is raised here:
https://github.com/pandas-dev/pandas/blob/def3bce010eb0eaea2580ad6b6f44c0318314296/pandas/_libs/parsers.pyx#L709-L718
The culprit function I believe is here:
https://github.com/pandas-dev/pandas/blob/def3bce010eb0eaea2580ad6b6f44c0318314296/pandas/_libs/src/parser/io.c#L24-L49
What worries me is that it might have to do with the open function, in which case we might have hit a dead end (and perhaps it would no longer be a pandas issue).
Comment From: jreback
Duplicate of https://github.com/pandas-dev/pandas/issues/15086. There was a PR to fix this (#15092), but it was erased somehow. The cause is a change in the default file encoding on Python 3.6 on Windows; there is a PEP reference in that issue. To solve it, we just treat the filename as bytes and decode it as UTF-8. A patch would be welcome.
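To make the encoding mismatch concrete, here is a rough sketch of what happens when a file name encoded as UTF-8 is interpreted by code that expects the ANSI code page (cp1251 on the reporter's machine). This illustrates the mechanism described above under that assumption; it is not a trace of the actual C parser code.

```python
filename = "файл_1.csv"

# Bytes Python produces when the filesystem encoding is UTF-8.
utf8_bytes = filename.encode("utf-8")

# Bytes a cp1251-based C runtime would expect for the same name.
ansi_bytes = filename.encode("cp1251")

# What the name looks like if the UTF-8 bytes are read as cp1251:
# a different string, so a lookup by that name would not find the file.
misread = utf8_bytes.decode("cp1251")

print(utf8_bytes)
print(ansi_bytes)
print(misread)
```

A pure-ASCII name such as file_1.csv encodes to the same bytes in both encodings, which is consistent with the Latin file name working in the original report.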
Comment From: fanguoguo
File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
Comment From: fingoldo
I cannot believe this. It's the year 2019 now, pandas v0.23.4, and this issue with Cyrillic paths and the C engine is STILL NOT FIXED, even after so many issue reports and questions on Stack Overflow. Open source community seems to be no better than Microsoft in this regard, where known bugs are not getting fixed for years.
Comment From: gfyoung
Open source community seems to be no better than Microsoft in this regard, where known bugs are not getting fixed for years.
@fingoldo : Sorry about this! We do get a lot of issues every day, and unlike Microsoft, we have far fewer code maintainers to work on and address all of the issues that we receive.
That being said, if you would like to tackle the issue, that would be great! Part of the issue that we have right now is that it's hard for us to test and validate any fixes, so a community contribution would be most welcome for something like this.
xref https://github.com/pandas-dev/pandas/issues/15086#issuecomment-445799283
Comment From: fingoldo
@gfyoung sorry for the harsh words, and thank you for your kind reply. I personally don't know how to fix that issue, but please, if one of the devs who actually created the relevant modules sees this thread, roll out the fix; it's really embarrassing to still have that error when engine is set to C...
Comment From: gfyoung
it's really embarrassing to still have that error when engine is set to C
@fingoldo : Yeah, it's awkward no doubt, but I hope you understand that there are many more of these types of "embarrassing errors" than there are man-hours (devs and contributors combined) to correct them, especially if people have a hard time reproducing them.
Lucky for you, I currently have in my possession a Windows machine, and I was able to patch this issue pretty quickly in a PR:
https://github.com/pandas-dev/pandas/pull/24758
Comment From: fingoldo
Amazing, thank you so much!!!