Pandas BUG: read_csv does not parse csv files with windows line terminator correctly

I am using pandas 0.11 in Python 2.7.3.

When I read CSV data with a Windows line terminator (\r\n), pandas creates extra rows in the DataFrame, i.e. it does not recognise the line terminator.

test1.csv

Id,Description,Field1,Field2^M
1,"test data 1 ^M\n",2,3^M
2,"test data 2 ^M\n",3,4^M

In [80]: in_csv = 'test1.csv'

In [81]: cols = ['Id', 'Description', 'Field1', 'Field2']

In [82]: reader = pd.read_csv(in_csv, usecols=cols, dtype=object)

In [83]: print reader
   Id       Description Field1 Field2
0  \r               NaN    NaN    NaN
1   1  test data 1 \r\n      2      3
2  \r               NaN    NaN    NaN
3   2  test data 2 \r\n      3      4
4  \r               NaN    NaN    NaN

In [86]: reader.to_csv(out_csv, index=False, encoding='utf-8')

Id,Description,Field1,Field2
^M,,,
1,test data 1 ^M\n,2,3
^M,,,
2,test data 2 ^M\n,3,4
^M,,,

I cannot use...

lineterminator : string (length 1), default None

reader = pd.read_csv(in_csv, usecols=cols, dtype=object, lineterminator='\r\n')

ValueError: Only length-1 line terminators supported
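
For anyone trying to reproduce this without the original file, here is a minimal sketch, assuming a reasonably recent pandas and made-up in-memory data (not the reporter's file). It shows the two behaviours reported in this thread: CRLF input parses with the default settings, while an explicit two-character terminator is rejected.

```python
import io

import pandas as pd

# Hypothetical in-memory sample with Windows (CRLF) line endings.
crlf_text = "Id,Field1,Field2\r\n1,2,3\r\n2,3,4\r\n"

# With no lineterminator argument, the parser treats CRLF as a line ending.
frame = pd.read_csv(io.StringIO(crlf_text))
print(frame.shape)

# Passing the two-character terminator explicitly is rejected.
error = None
try:
    pd.read_csv(io.StringIO(crlf_text), lineterminator="\r\n")
except ValueError as exc:
    error = exc
print(repr(error))
```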

Comment From: cpcloud

Related: #3453

Comment From: wesm

Working fine for me:

In [3]: read_csv(StringIO('Id,Description,Field1,Field2\r\n1,"test data 1 \r\n",2,3\r\n2,"test data 2 \r\n",3,4\r\n'))
Out[3]: 
   Id       Description  Field1  Field2
0   1  test data 1 \r\n       2       3
1   2  test data 2 \r\n       3       4

Could you show us the exact binary contents of test1.csv? i.e.

open('test1.csv', 'rb').read()
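
If the raw dump is long, a small helper can summarise which line endings it contains. This is only a sketch: `count_line_endings` is a hypothetical name, the sample bytes are made up, and line breaks inside quoted fields are counted along with real row endings.

```python
import re
from collections import Counter

# Longest pattern first so "\r\r\n" is not counted as "\r" plus "\r\n".
_LINE_ENDING = re.compile(rb"\r\r\n|\r\n|\r|\n")


def count_line_endings(data: bytes) -> Counter:
    """Count each kind of line ending found in raw file bytes."""
    return Counter(_LINE_ENDING.findall(data))


# Hypothetical sample; in practice pass open('test1.csv', 'rb').read().
sample = b"Id,Field1\r\r\n1,2\r\n2,3\n"
counts = count_line_endings(sample)
print(counts)
```

A non-zero count for `b"\r\r\n"` would point to a file written in text mode on Windows rather than to a parsing problem.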

Comment From: wesm

not going to make it into 0.11.1. Will investigate pending failing test case

Comment From: jreback

pls reopen if you can create a test case

Comment From: cpcloud

@bmacauley FWIW this SO post is very helpful:

http://stackoverflow.com/questions/3191528/csv-in-python-adding-extra-carriage-return

Your data source was probably written with mode='w' (the default). In Python 2 on Windows, that changes \r\n to \r\r\n, so you need to write your data with mode='wb'.

if you're in python 3 you need to open with mode='w', newline=''.

This is a pretty sad state of affairs, but it isn't a pandas problem; it's a problem with the csv module and Python newline handling across versions.
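
A minimal Python 3 sketch of the effect, using only the standard library. To make it behave the same on every platform, it passes newline='\r\n' to stand in for the Windows text-mode default; that substitution, the helper name `write_rows`, and the sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile

rows = [["Id", "Description"], ["1", "test data 1"]]


def write_rows(path, newline):
    """Write rows with csv.writer, then return the raw bytes on disk."""
    with open(path, "w", newline=newline) as handle:
        csv.writer(handle).writerows(rows)
    with open(path, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    # newline="\r\n" translates each "\n" the writer emits, doubling the "\r".
    translated = write_rows(os.path.join(tmp, "translated.csv"), "\r\n")
    # newline="" leaves the writer's own "\r\n" untouched.
    untouched = write_rows(os.path.join(tmp, "untouched.csv"), "")

print(translated)
print(untouched)
```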

Comment From: lo-co

I am not sure if the previous comment by @cpcloud was particularly helpful, given that you did not specify how the file was written. I too have run into this problem recently when reading a file that was written not in Python but by an external program. I could go back and alter the file-writing code, but I still have to be able to process the data written to older files that retain the EOL constant.

I found that if you specify the lineterminator='\n', then you should get the results you are looking for. Given the easy solution, I believe that this is likely not something that requires attention in pandas. Hope this helps anyone who is running into this problem.
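
For older files that cannot be regenerated, another option, not suggested elsewhere in this thread and offered only as a sketch, is to repair doubled carriage returns in the raw bytes before parsing. It assumes the file really contains \r\r\n and that no quoted field legitimately contains that sequence. `normalize_crlf` is a hypothetical helper name and the sample bytes are made up.

```python
import io

import pandas as pd


def normalize_crlf(data: bytes) -> bytes:
    """Collapse the doubled carriage return (CR CR LF) into a plain CRLF."""
    return data.replace(b"\r\r\n", b"\r\n")


# Hypothetical bytes, as if obtained from open(path, 'rb').read().
raw = b"Id,Field1\r\r\n1,2\r\r\n2,3\r\r\n"
frame = pd.read_csv(io.BytesIO(normalize_crlf(raw)))
print(frame)
```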

Cheers, Matt

Comment From: szeitlin

I'm having this problem too. Looking at census data. Filetype is 'ASCII English text, with CRLF line terminators'

tried the suggestions above, didn't work.

from following Nick Coghlan's post here I was able to view something: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

filecontent = open('Gaz_counties_national.txt', encoding='ASCII', errors="surrogateescape").read()

file looks like this after that:

"USPS\tGEOID\tANSICODE\tNAME\tPOP10\tHU10\tALAND\tAWATER\tALAND_SQMI\tAWATER_SQMI\tINTPTLAT\tINTPTLONG \nAL\t01001\t00161526\tAutauga County\t54571\t22135\t1539582278\t25775735\t 594.436\t 9.952\t 32.536382\t -86.644490 \nAL\t01003\t00161527\tBaldwin County\t182265\t104061\t4117521611\t1133190229\t 1589.784\t 437.527\t 30.659218\t -87.746067 \nAL\t01005\t00161528\tBarbour County\t27457\t11829\t2291818968\t50864716\t 884.876\t 19.639\t 31.870670\t -85.405456 \nAL\t01007\t00161529\tBibb County\t22915\t8981\t1612480789\t9289057\t 622.582\t 3.587\t 33.015893\t -87.127148 \nAL\t01009\t00161530\tBlount County\t57322\t23887\t1669961855\t15157440\t 644.776\t 5.852\t 33.977448\t -86.567246 \nAL\t01011\t00161531\tBullock County\t10914\t4493\t1613056905\t6056528\t 622.805\t 2.338\t 32.101759\t -85.717261 \nAL\t01013\t00161532\tButler County\t20947\t9964\t2011976894\t2726814\t 776.829\t 1.053\t 31.751667\t -86.681969 \nAL\t01015\t00161533\tCalhoun County\t118572\t53289\t1569189995\t16624267\t 605.868\t 6.419\t 33.771706\t -85.822513 \nAL\t01017\t00161534\tChambers County\t34215\t17004\t1545009282\t17048142\t 596.531\t 6.582\t 32.917943\t -85.391812 \nAL\t01019\t00161535\tCherokee County\t25989\t16267\t1434075952\t119858898\t 553.700\t 46.278\t 34.069515\t -85.654242 \nAL\t01021\t00161536\tChilton County\t43643\t19278\t1794483586\t20586383\t 692.854\t 7.948\t 32.854059\t -86.726627

If I then try to read that filecontent object with pandas:

pd.read_csv(filecontent, sep='\t', lineterminator = '\r\n')

I get

UnicodeEncodeError: 'utf-8' codec can't encode character: surrogates not allowed

tried following suggestions in the comments above, got 'UnsupportedOperation: not readable'

can't put '\r\n' in as the lineterminator argument to pandas.read_csv, because 'ValueError: Only length-1 line terminators supported' ?

I'm in python 3.4, pandas version 0.15.1. Would love to know what symbol should go here to get it to work, or if updating to a newer version would help?
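
One thing that may be going on here, offered as a guess rather than a diagnosis: read_csv treats a plain string as a path, not as file content, and open() in Python 3 text mode has already translated \r\n to \n, which is why only \n shows up in the dump above. A minimal sketch, assuming a shortened, made-up excerpt of that data, wraps the text in io.StringIO and leaves lineterminator out:

```python
import io

import pandas as pd

# Shortened, hypothetical excerpt in the shape of the dump shown earlier.
filecontent = (
    "USPS\tGEOID\tNAME\tPOP10\n"
    "AL\t01001\tAutauga County\t54571\n"
    "AL\t01003\tBaldwin County\t182265\n"
)

# dtype=str keeps leading zeros in codes such as GEOID.
frame = pd.read_csv(io.StringIO(filecontent), sep="\t", dtype=str)
print(frame)
```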

Comment From: cbare

I'm seeing the same error as @szeitlin on Python 3.5, Win2012. Same cause:

pd.read_csv(path, lineterminator = os.linesep)

where os.linesep == '\r\n'

Comment From: cpcloud

@szeitlin Can you post a link to the file?

Comment From: cbare

I don't think the actual .csv file matters. It's just the handling of the lineterminator parameter. Here's a repro:

PS C:\Users\Administrator> notepad foo.csv

...create simple .csv file in notepad... I am totally clueless about Windows.

PS C:\Users\Administrator> ipython
Python 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.0.3 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import os

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.17.1'

In [4]: m = pd.read_csv("foo.csv")

In [5]: m
Out[5]:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

In [6]: m = pd.read_csv("foo.csv", lineterminator=os.linesep)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-63603000fef9> in <module>()
----> 1 m = pd.read_csv("foo.csv", lineterminator=os.linesep)

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    496                     skip_blank_lines=skip_blank_lines)
    497
--> 498         return _read(filepath_or_buffer, kwds)
    499
    500     parser_f.__name__ = name

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    273
    274     # Create the parser.
--> 275     parser = TextFileReader(filepath_or_buffer, **kwds)
    276
    277     if (nrows is not None) and (chunksize is not None):

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    588             self.options['has_index_names'] = kwds['has_index_names']
    589
--> 590         self._make_engine(self.engine)
    591
    592     def _get_options_with_defaults(self, engine):

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
    729     def _make_engine(self, engine='c'):
    730         if engine == 'c':
--> 731             self._engine = CParserWrapper(self.f, **self.options)
    732         else:
    733             if engine == 'python':

C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1101         kwds['allow_leading_cols'] = self.index_col is not False
   1102
-> 1103         self._reader = _parser.TextReader(src, **kwds)
   1104
   1105         # XXX

pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3407)()

ValueError: Only length-1 line terminators supported

Looking at foo.csv in the hexedit tool, it does indeed have \r\n line endings, as you'd expect, but I don't think it ever looks at the file.

For what it's worth, the csv module from the standard library has some confusing opinions about what the lineterminator parameter should mean and how it interacts with line endings in text-mode files. Maybe I'm just misunderstanding how this is supposed to work.
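
Since the error depends only on the length of the argument, one defensive pattern is to pass lineterminator only when it is a single character. This is a sketch of that idea, not pandas' own behaviour; `csv_reader_kwargs` is a hypothetical helper name.

```python
import os


def csv_reader_kwargs(terminator: str) -> dict:
    """Return a lineterminator keyword only when it is a single character."""
    if len(terminator) == 1:
        return {"lineterminator": terminator}
    # Multi-character terminators such as "\r\n" are left to the parser default.
    return {}


print(csv_reader_kwargs("\r\n"))
print(csv_reader_kwargs("\n"))
print(csv_reader_kwargs(os.linesep))
```

The result could then be splatted into the call, for example pd.read_csv("foo.csv", **csv_reader_kwargs(os.linesep)), so that on Windows the argument is simply omitted.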

Comment From: szeitlin

sorry, I'm swamped with work right now, but if I can find the file and if I ever managed to get it working, I'll post both. I think I may have ended up solving the problem in a somewhat unintuitive way, but it was a while ago...

Comment From: yasink18

Hello,

change engine = 'python-fwf' it will work

Comment From: DavidEnergie

@yasink18 wrote: "change engine = 'python-fwf' it will work"

Thanks for the pointer. This solved the same CR/LF issue for me. However, with this engine (and equally with 'read_fwf(file.txt)'), long lines break into new columns after a certain length (for me, around 10 geocoordinate strings of the form x.x,y.y,z.z). I didn't manage to resolve this, so I switched back to engine='c' and used lineterminator = '\n' instead of lineterminator = '\r\n', as the latter switches the engine to 'python'. Leaving out the '\r' doesn't seem to have any negative effect in my example.

The solution to the issue that @szeitlin and I had was right in front of my eyes, as @lo-co indicated above.

pd.read_csv(path_to_file_txt, sep='\t', lineterminator = '\n', engine='c', keep_default_na=False)

This original issue is around 8 years old. But I still wanted to add this, as the previous comments here helped me track it down.

Comment From: OrangeDog

\r\n isn't even just the Windows line ending; it is the RFC 4180 standard CSV line ending.

It should at least be possible to set lineterminator='\r\n', and it should really be the default too.

Comment From: rossbush

Greetings. We are trying to use pandas to read CSV files with \r\n line endings in AWS Lambda functions running on Linux, and we are encountering the error: [ERROR] ValueError: Only length-1 line terminators supported. Currently we fall back to array.split("\r\n"), but using pandas for all CSV operations would be ideal. Is there any word on whether this will be addressed in upcoming releases, or is there an official workaround?
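
As a stdlib-only fallback that is somewhat safer than splitting on "\r\n", the csv module accepts CRLF input and keeps line breaks that sit inside quoted fields, which a plain split would cut in two. A minimal sketch with made-up sample text:

```python
import csv
import io

# Hypothetical CRLF sample with a line break inside a quoted field.
text = 'Id,Description,Field1\r\n1,"test data 1 \r\n",2\r\n'

# newline="" passes line endings through untranslated, as the csv docs recommend.
rows = list(csv.reader(io.StringIO(text, newline="")))
print(rows)
```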

Comment From: rtrad89

Oddly, I face this when reading a CSV, but not when saving it! This prevented me from using the same arguments for all CSV operations.