I had been successfully parsing a CSV file with 0.14.1; I upgraded yesterday and now the following code breaks:
df = pd.io.parsers.read_csv(fname, skiprows=range(1, 9))
here is the file:
https://www.dropbox.com/s/grhi6e9vihjf92t/testtown2.csv?dl=0
Versions:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.14-1-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.15.2
nose: None
Cython: 0.20.1
numpy: 1.9.2
scipy: None
statsmodels: None
IPython: 2.1.0
sphinx: None
patsy: None
dateutil: 2.4.1
pytz: 2014.10
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: None
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.8
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: None
rpy2: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
Comment From: jreback
Straight out of the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#ignoring-line-comments-and-empty-lines
There was an API change in 0.15.0; you may have to adjust skiprows
and/or set skip_blank_lines=False
to resolve the ambiguity.
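For anyone hitting the same ambiguity, here is a minimal sketch (not from the thread; the data string is invented for illustration) of how the 0.15 blank-line default interacts with row counting, and what skip_blank_lines=False restores:

```python
from io import StringIO
import pandas as pd

data = "a,b\n\n1,2\n\n3,4\n"

# Default since 0.15: blank lines are dropped before parsing,
# so only the two data rows remain.
default = pd.read_csv(StringIO(data))

# skip_blank_lines=False keeps blank lines as all-NaN rows, so row
# positions (and therefore skiprows offsets) count every physical line.
kept = pd.read_csv(StringIO(data), skip_blank_lines=False)

print(len(default), len(kept))  # 2 4
```

This is why a skiprows list tuned against 0.14.1 can point at the wrong physical lines after upgrading.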
Comment From: bastianlb
The problem is that I don't have control over the format of the lines I want to skip, so "commenting" them out is not an option. skip_blank_lines=False
does not seem to work (the lines aren't blank). I know which lines I want to ignore in advance; I just can't predict their format.
Thanks for the help.
Comment From: bastianlb
I suppose I could use a regex to indicate which rows should be omitted, but it seems that only single-character comment markers are currently supported.
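Since read_csv only takes a single comment character, one workaround is to pre-filter the text with a regex before handing it to pandas. A sketch (the all-caps metadata pattern is an assumption based on the sample file posted below, not something pandas provides):

```python
import re
from io import StringIO
import pandas as pd

raw = (
    "NAME,flow,depth\n"
    "UNITS,cfs,ft\n"   # metadata row to drop
    "DATA,,\n"         # metadata row to drop
    "1,1.45,0.47\n"
    "2,1.21,0.51\n"
)

# Assumed pattern: metadata rows begin with an all-caps tag in the
# first field. The first line is kept regardless, as the real header.
meta = re.compile(r"^[A-Z]+,")
lines = raw.splitlines(keepends=True)
clean = lines[:1] + [ln for ln in lines[1:] if not meta.match(ln)]

df = pd.read_csv(StringIO("".join(clean)))
```

This avoids skiprows entirely, so it is also insensitive to how many metadata lines a given file contains.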
Comment From: jreback
Can you post lines 1-12 or so?
Comment From: bastianlb
NAME,testtown sewer flow,testtown sewer depth,testtown sewer velocity
CONTEXT,sewer,,
TYPE,flow,depth,velocity
START, 1/1/11 0:00,,
END, 12/30/13 23:00,,
TZ, US/Eastern,,
INC,1,,
UNITS,cfs,ft,ft/s
DATA,,,
1,1.451139854,0.470901008,1.720064044
2,1.20869145,0.514026212,1.629014523
3,12,0.419756899,1.517084435
4,12,0.418530849,1.40467042
5,12,0.317392778,1.358932374
6,12,0.26695894,1.32262311
7,12,0.428965151,1.381718785
8,12,0.289426375,1.449016799
9,1.173475721,0.429696648,1.614747635
The file is also available for download in the original post, if that helps.
Comment From: jreback
Works for me in 0.15.2:
In [2]: pd.read_csv(StringIO(data),skiprows=range(1,9))
Out[2]:
NAME testtown sewer flow testtown sewer depth testtown sewer velocity
0 1 1.451140 0.470901 1.720064
1 2 1.208691 0.514026 1.629015
2 3 12.000000 0.419757 1.517084
3 4 12.000000 0.418531 1.404670
4 5 12.000000 0.317393 1.358932
5 6 12.000000 0.266959 1.322623
6 7 12.000000 0.428965 1.381719
7 8 12.000000 0.289426 1.449017
8 9 1.173476 0.429697 1.614748
Comment From: bastianlb
OK, I realized the source of the problem is that this file came from a Windows machine. I can read it as follows:
df = pd.io.parsers.read_csv(fname, skiprows=range(1, 9), lineterminator="\r")
However, this code now breaks on files of the same format with "\n" terminators. Could it be that this flexibility was lost in 0.15?
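One way to sidestep the terminator difference is to normalize newlines before parsing. A sketch using Python 3's universal-newline text mode (read_any_terminator is a made-up helper, not a pandas API):

```python
import os
import tempfile
from io import StringIO
import pandas as pd

def read_any_terminator(path, **kwargs):
    # Text mode with the default newline=None translates "\r", "\n",
    # and "\r\n" all to "\n" (universal newlines), so read_csv sees one
    # consistent terminator regardless of which OS produced the file.
    with open(path, newline=None) as f:
        return pd.read_csv(StringIO(f.read()), **kwargs)

# Demo: a file terminated with bare "\r" (newline="" disables
# translation on write, so the raw "\r" bytes are preserved).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("a,b\rskip me,0\r1,2\r3,4\r")
    path = tmp.name

df = read_any_terminator(path, skiprows=[1])
os.remove(path)
```

With the terminators normalized, the same skiprows list works for "\r", "\n", and "\r\n" files alike.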
Comment From: bastianlb
Interestingly enough, read_csv is able to parse both kinds of line terminators as long as I don't specify a skiprows argument, and in 0.14.1 this worked with skiprows.
Comment From: gfyoung
Not really seeing much of an issue anymore?
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = '1\ra\r2'
>>> read_csv(StringIO(data), skiprows=1, engine='c')
a
0 2
>>> read_csv(StringIO(data), skiprows=1, engine='python')
...
_csv.Error: new-line character seen in unquoted field -
do you need to open the file in universal-newline mode?
The Python engine breakage is expected, because the Python engine does not accept custom line terminators.
Comment From: gfyoung
@jreback : I think this issue can be closed based on what I said above.
Comment From: jorisvandenbossche
@bastianlb If you still have problems with this, feel free to reopen.