I am using pandas 0.11.in python 2.7.3
When I read csv data with a windows line terminator (\r\n), pandas creates extra rows in the dataframe ie it does not recognise the line terminator..i
test1.csv
Id,Description,Field1,Field2^M 1,"test data 1 ^M\n",2,3^M 2,"test data 2 ^M\n",3,4^M
In [80]: in_csv = 'test1.csv'
In [81]: cols = ['Id', 'Description', 'Field1', 'Field2']
In [82]: reader = pd.read_csv(in_csv, usecols=cols, dtype=object)
In [83]: print reader Id Description Field1 Field2 0 \r NaN NaN NaN 1 1 test data 1 \r\n 2 3 2 \r NaN NaN NaN 3 2 test data 2 \r\n 3 4 4 \r NaN NaN NaN
In [86]: reader.to_csv(out_csv, index=False, encoding='utf-8')
Id,Description,Field1,Field2 ^M,,, 1,test data 1 ^M\n,2,3 ^M,,, 2,test data 2 ^M\n,3,4 ^M,,,
I cannot use...
lineterminator : string (length 1), default None
reader = pd.read_csv(in_csv, usecols=cols, dtype=object, lineterminator='\r\n')
ValueError: Only length-1 line terminators supported
Comment From: cpcloud
Related: #3453
Comment From: wesm
Working fine for me:
In [3]: read_csv(StringIO('Id,Description,Field1,Field2\r\n1,"test data 1 \r\n",2,3\r\n2,"test data 2 \r\n",3,4\r\n'))
Out[3]:
Id Description Field1 Field2
0 1 test data 1 \r\n 2 3
1 2 test data 2 \r\n 3 4
Could you show us the exact binary contents of test1.csv? i.e.
open('test1.csv', 'rb').read()
Comment From: wesm
not going to make it into 0.11.1. Will investigate pending failing test case
Comment From: jreback
pls reopen if you can create a test case
Comment From: cpcloud
@bmacauley FWIW this SO post is very helpful:
http://stackoverflow.com/questions/3191528/csv-in-python-adding-extra-carriage-return
your data source was probably written with mode='w'
(the default) and if you're in python 2 that will change \r\n
to \r\r\n
so you need write your data with mode='wb'
.
if you're in python 3 you need to open with mode='w', newline=''
.
this is a pretty sad state of affairs, but this isn't a pandas problem it's a problem with the csv
module and python newline handling across versions
Comment From: lo-co
I am not sure if the previous comment by @cpcloud was particularly helpful given that you did not specify how the file was written. I to have run into this problem recently when reading in a file that was not written in python but rather by an external program. I could go back and alter the file writing code, but I still have to be able to process the data that is written to older files that retain the EOL constant.
I found that if you specify the lineterminator='\n'
, then you should get the results you are looking for. Given the easy solution, I believe that this is likely not something that requires attention in pandas. Hope this helps anyone who is running into this problem.
Cheers, Matt
Comment From: szeitlin
I'm having this problem too. Looking at census data. Filetype is 'ASCII English text, with CRLF line terminators'
tried the suggestions above, didn't work.
from following Nick Coghlan's post here I was able to view something: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
filecontent = open('Gaz_counties_national.txt', encoding='ASCII', errors="surrogateescape").read()
file looks like this after that:
"USPS\tGEOID\tANSICODE\tNAME\tPOP10\tHU10\tALAND\tAWATER\tALAND_SQMI\tAWATER_SQMI\tINTPTLAT\tINTPTLONG \nAL\t01001\t00161526\tAutauga County\t54571\t22135\t1539582278\t25775735\t 594.436\t 9.952\t 32.536382\t -86.644490 \nAL\t01003\t00161527\tBaldwin County\t182265\t104061\t4117521611\t1133190229\t 1589.784\t 437.527\t 30.659218\t -87.746067 \nAL\t01005\t00161528\tBarbour County\t27457\t11829\t2291818968\t50864716\t 884.876\t 19.639\t 31.870670\t -85.405456 \nAL\t01007\t00161529\tBibb County\t22915\t8981\t1612480789\t9289057\t 622.582\t 3.587\t 33.015893\t -87.127148 \nAL\t01009\t00161530\tBlount County\t57322\t23887\t1669961855\t15157440\t 644.776\t 5.852\t 33.977448\t -86.567246 \nAL\t01011\t00161531\tBullock County\t10914\t4493\t1613056905\t6056528\t 622.805\t 2.338\t 32.101759\t -85.717261 \nAL\t01013\t00161532\tButler County\t20947\t9964\t2011976894\t2726814\t 776.829\t 1.053\t 31.751667\t -86.681969 \nAL\t01015\t00161533\tCalhoun County\t118572\t53289\t1569189995\t16624267\t 605.868\t 6.419\t 33.771706\t -85.822513 \nAL\t01017\t00161534\tChambers County\t34215\t17004\t1545009282\t17048142\t 596.531\t 6.582\t 32.917943\t -85.391812 \nAL\t01019\t00161535\tCherokee County\t25989\t16267\t1434075952\t119858898\t 553.700\t 46.278\t 34.069515\t -85.654242 \nAL\t01021\t00161536\tChilton County\t43643\t19278\t1794483586\t20586383\t 692.854\t 7.948\t 32.854059\t -86.726627
If I then try to read that filecontent object with pandas:
pd.read_csv(filecontent, sep='\t', lineterminator = '\r\n')
I get
UnicodeEncodeError: 'utf-8' codec can't encode character: surrogates not allowed
tried following suggestions in the comments above, got 'UnsupportedOperation: not readable'
can't put '\r\n' in as the lineterminator argument to pandas.read_csv, because 'ValueError: Only length-1 line terminators supported' ?
I'm in python 3.4, pandas version 0.15.1. Would love to know what symbol should go here to get it to work, or if updating to a newer version would help?
Comment From: cbare
I'm seeing the same error as @szeitlin on Python 3.5, Win2012. Same cause:
pd.read_csv(path, lineterminator = os.linesep)
where os.linesep == '\r\n'
Comment From: cpcloud
@szeitlin Can you post a link to the file?
Comment From: cbare
I don't think the actual .csv file matters. It's just the handling of the lineterminator parameter. Here's a repro:
PS C:\Users\Administrator> notepad foo.csv
...create simple .csv file in notepad... I am totally clueless about Windows.
PS C:\Users\Administrator> ipython
Python 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 4.0.3 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import os
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.17.1'
In [4]: m = pd.read_csv("foo.csv")
In [5]: m
Out[5]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: m = pd.read_csv("foo.csv", lineterminator=os.linesep)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-63603000fef9> in <module>()
----> 1 m = pd.read_csv("foo.csv", lineterminator=os.linesep)
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequo
te, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfo
oter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace
, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_de
fault_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, n
rows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blan
k_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
273
274 # Create the parser.
--> 275 parser = TextFileReader(filepath_or_buffer, **kwds)
276
277 if (nrows is not None) and (chunksize is not None):
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
588 self.options['has_index_names'] = kwds['has_index_names']
589
--> 590 self._make_engine(self.engine)
591
592 def _get_options_with_defaults(self, engine):
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
729 def _make_engine(self, engine='c'):
730 if engine == 'c':
--> 731 self._engine = CParserWrapper(self.f, **self.options)
732 else:
733 if engine == 'python':
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
1101 kwds['allow_leading_cols'] = self.index_col is not False
1102
-> 1103 self._reader = _parser.TextReader(src, **kwds)
1104
1105 # XXX
pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:3407)()
ValueError: Only length-1 line terminators supported
Looking at foo.csv in the hexedit tool, it does indeed have \r\n line endings, as you'd expect, but I don't think it ever looks at the file.
For what it's worth, the csv module from the standard libraries has some confusing opinions about what the lineterminator parameter should mean and how it interacts with line endings in text mode files. Maybe, I'm just misunderstanding how this is supposed to work.
Comment From: szeitlin
sorry, I'm swamped with work right now, but if I can find the file and if I ever managed to get it working, I'll post both. I think I may have ended up solving the problem in a somewhat unintuitive way, but it was a while ago...
Comment From: yasink18
Hello,
change engine = 'python-fwf' it will work
Comment From: DavidEnergie
Hello,
change engine = 'python-fwf' it will work
Thanks for the pointer. This solved the same CR/LF issue for me, though with this engine (equally with 'read_fwf(file.txt)')
, for long lines, they break into new columns after a certain length (for me its around 10 geocoordinate strings (x.x,y.y,z.z)).
I didn't manage to resolve this, but switched back to engine='c' and using lineterminator = '\n' instead of lineterminator = '\r\n', as the later switches the engine to 'python'. Leaving out the '\r\' doesn't seem to have any negative effect in my example.
The solution to the issue of @szeitlin and I was right in front of my eyes, as @lo-co indicated above.
pd.read_csv(path_to_file_txt, sep='\t', lineterminator = '\n', engine='c', keep_default_na=False)
This original issue is around 8 years old. But I still wanted to add this, as the previous comments here helped me track it down.
Comment From: OrangeDog
\r\n
isn't even just the windows line ending, it is the RFC-4180 standard CSV line ending.
It should at least be possible to set lineterminator='\r\n'
, and it should really be the default too.
Comment From: rossbush
Greetings, while trying to use Pandas to read files csv with \r\n in AWS Lambda's running on linux os. We are encountering the [ERROR] ValueError: Only length-1 line terminators supported error. Currently a fallback to array.split("\r\n") is being utilized but making use of pandas for all csv operations would be ideal. Any ideas if this will be addressed in upcoming releases or the nature of an official workaround?
Comment From: rtrad89
Oddly, I face this when reading a CSV, but not when saving it! This prevented me from unifying the arguments of the parameters for all CSV operations.