I'm getting a CParserError on a large (25M rows) tab-delimited file. I've narrowed the problem to the lines in the example attached:
test.txt
When reading this file I get weird behaviour when displaying some of the lines.
➜ cat -vet test.txt
125403116848943104^IThey stay away from you for a reason stupid SMH$
125402641516859394^I"Women Would Sell Their Soul Jus To Buy Some Attention" #Fact$
125402461803515904^I@FFHIPHOP wats gd boii$
125402355914129408^I#TripleKrown$
125402323110461441^IWhat's ya purpose ??$
125402099134631936^IWhen one person says something about you you can fight it but when multiple people say the same thing then you lose out$
125402077856927744^IBlammer$
125386739329150976^IBitch you ignorant haaaaaaaaa$
In [109]: a = pandas.read_csv(root_dir + 'test.txt', sep='\t', header=None)
In [110]: a
Out[110]:
0 1
0 125403116848943104 They stay away from you for a reason stupid SMH
1 125402641516859394 Women Would Sell Their Soul Jus To Buy Some At...
2 125402461803515904 @FFHIPHOP wats gd boii
3 125402355914129408 #TripleKrown
4 125402323110461441 What's ya purpose ??
5 125402099134631936 When one person says something about you you c...
6 125402077856927744 Blammer
7 125386739329150976 Bitch you ignorant haaaaaaaaa
In [111]: a.iloc[1, 1]
Out[111]: 'Women Would Sell Their Soul Jus To Buy Some Attention #Fact'
In [112]: a.iloc[2, 1]
Out[112]: '@FFHIPHOP wats gd boii'
In [113]: a.iloc[3, 1]
Out[113]: '#TripleKrown'
Output of pd.show_versions()
In [114]: pandas.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.6.1
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None
Comment From: Dr-Irv
@ohadle What is the odd behavior? Is it because of the "..." on the long strings? The long strings are there. You have to set the maximum display column width in order to see them. If you add the line:
pandas.set_option('display.max_colwidth', 200)
to your code before looking at the DataFrame a, you will see that the complete strings have been read in.
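A minimal sketch of that suggestion, assuming Python 3 and a reasonably recent pandas. The two rows are copied from the test.txt excerpt above; `option_context` is used here only so the wider setting does not leak into the rest of a session, and `pandas.set_option` as written above has the same effect globally.

```python
import io
import pandas as pd

# Two rows taken from the test.txt excerpt in this issue (tab-delimited).
sample = (
    "125402355914129408\t#TripleKrown\n"
    "125402099134631936\tWhen one person says something about you you can "
    "fight it but when multiple people say the same thing then you lose out\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# With the default display.max_colwidth the long string is cut off with "...".
truncated = repr(df)
print(truncated)

# Widen the column display for this block only; the stored data is unchanged.
with pd.option_context("display.max_colwidth", 200):
    full = repr(df)
print(full)
```

Only the printed representation differs between the two calls; `df.iloc[1, 1]` returns the complete string in both cases.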
Comment From: ohadle
Well, the odd behavior I noticed here is the indentation some of the lines get in column 1. Some are displayed left-aligned but some have an offset, which doesn't appear when reading the element itself (as in [112]).
I suspected this was somehow related to the CParserError I get around those lines when parsing the larger file.
Comment From: TomAugspurger
@ohadle I think @Dr-Irv is correct, that's a display / formatting thing. From what I can tell, pandas successfully read the subset in test.txt.
Ideally, you'll be able to narrow down your file to the subset of lines that still reproduces the ParserError.
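One possible way to start narrowing down a large file is sketched below, assuming Python 3. It does not call pandas at all: it uses the standard-library `csv` module, which follows similar quoting rules, to flag rows whose field count is unexpected. The helper name `find_suspect_rows`, the expected count of 2, and the sample with an extra field are all made up for illustration, and an unexpected field count is only one possible cause of a parser error.

```python
import csv
import io

def find_suspect_rows(lines, expected_fields=2, delimiter="\t"):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    reader = csv.reader(lines, delimiter=delimiter)
    suspects = []
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the physical line on which this row ended.
            suspects.append((reader.line_num, len(row)))
    return suspects

# Hypothetical sample: the second line carries an extra tab-separated field.
sample = io.StringIO(
    "125402077856927744\tBlammer\n"
    "125402355914129408\t#TripleKrown\textra\n"
    "125402323110461441\tWhat's ya purpose ??\n"
)
print(find_suspect_rows(sample))
```

For a real file, the same function can be given a handle from `open(path, newline="")`. The flagged lines, plus a few neighbours, can then be copied to a small file and passed to `read_csv` to check whether the error still reproduces.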
Comment From: Dr-Irv
@ohadle The ones that look left-aligned in the full display of the table are not left-aligned. They are truncated on the right when printing. The data was correctly read for the text file you provided.
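A quick check along these lines, assuming Python 3 and a recent pandas, with three short rows copied from the test.txt excerpt: printing each value with `repr()` shows whether it carries any whitespace of its own, independent of how the table display pads the column.

```python
import io
import pandas as pd

# Three short rows from the test.txt excerpt in this issue (tab-delimited).
sample = (
    "125402461803515904\t@FFHIPHOP wats gd boii\n"
    "125402355914129408\t#TripleKrown\n"
    "125402077856927744\tBlammer\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# repr() shows the raw value, including any leading or trailing spaces.
for value in df[1]:
    print(repr(value))

# True when stripping whitespace changes none of the values.
no_padding = bool((df[1] == df[1].str.strip()).all())
print(no_padding)
```

If `no_padding` is true, any offset seen in the table view comes from the display padding rather than from the parsed data.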
Comment From: ohadle
OK, thank you @Dr-Irv and @TomAugspurger !