I'm getting a CParserError on a large (25M rows) tab-delimited file. I've narrowed the problem to the lines in the attached example, test.txt. When I read this file, some of the lines are displayed oddly.

➜  cat -vet test.txt
125403116848943104^IThey stay away from you for a reason stupid SMH$
125402641516859394^I"Women Would Sell Their Soul Jus To Buy Some Attention" #Fact$
125402461803515904^I@FFHIPHOP wats gd boii$
125402355914129408^I#TripleKrown$
125402323110461441^IWhat's ya purpose ??$
125402099134631936^IWhen one person says something about you you can fight it but when multiple people say the same thing then you lose out$
125402077856927744^IBlammer$
125386739329150976^IBitch you ignorant haaaaaaaaa$
In [109]: a = pandas.read_csv(root_dir + 'test.txt', sep='\t', header=None)

In [110]: a
Out[110]:
                    0                                                  1
0  125403116848943104    They stay away from you for a reason stupid SMH
1  125402641516859394  Women Would Sell Their Soul Jus To Buy Some At...
2  125402461803515904                             @FFHIPHOP wats gd boii
3  125402355914129408                                       #TripleKrown
4  125402323110461441                               What's ya purpose ??
5  125402099134631936  When one person says something about you you c...
6  125402077856927744                                            Blammer
7  125386739329150976                      Bitch you ignorant haaaaaaaaa

In [111]: a.iloc[1, 1]
Out[111]: 'Women Would Sell Their Soul Jus To Buy Some Attention #Fact'

In [112]: a.iloc[2, 1]
Out[112]: '@FFHIPHOP wats gd boii'

In [113]: a.iloc[3, 1]
Out[113]: '#TripleKrown'

Output of pd.show_versions()

In [114]: pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.6.1
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

Comment From: Dr-Irv

@ohadle What is the odd behavior? Is it the "..." on the long strings? The long strings were read in full; they are only shortened when displayed. To see them, you have to raise the maximum display column width. If you add the line:

pandas.set_option('display.max_colwidth', 200)

to your code before looking at the DataFrame a, you will see the complete strings have been read in.
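
As a minimal sketch of that suggestion: the two rows below are taken from test.txt above, while the width values and the use of `option_context` (a temporary setting, instead of the global `set_option`) are choices made for this illustration only. It compares the default rendering with a wider one:

```python
import pandas as pd

long_text = ("When one person says something about you you can fight it "
             "but when multiple people say the same thing then you lose out")

df = pd.DataFrame({0: [125402099134631936, 125402077856927744],
                   1: [long_text, "Blammer"]})

# Default display settings: long strings are cut off and end in "..."
truncated = repr(df)

# Temporarily raise the per-column and overall display widths
# (200 and 250 are arbitrary values picked for this sketch)
with pd.option_context("display.max_colwidth", 200, "display.width", 250):
    full = repr(df)

print(truncated)
print(full)
```

The stored value is the same in both cases; only the printed representation changes.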

Comment From: ohadle

Well, the odd behavior I noticed here is the indentation that some of the lines get in column 1. Some are displayed left-aligned but others have an offset, which doesn't appear when reading the element itself (as in [112]). I suspected this was somehow related to the CParserError I get around those lines when parsing the larger file.

Comment From: TomAugspurger

@ohadle I think @Dr-Irv is correct, that's a display / formatting thing. From what I can tell, pandas successfully read the subset in test.txt.

Ideally, you'll be able to narrow down your file to the subset of lines that still reproduces the ParserError.
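
One rough way to look for candidate lines is sketched below. It uses only the standard-library `csv` module, and `find_suspect_rows` is a hypothetical helper written for this illustration, not a pandas function. It assumes every record should have exactly two tab-separated fields. The `csv` reader is not the same tokenizer as pandas' C parser, so it can only point at suspicious lines; it cannot confirm the cause of the error.

```python
import csv

def find_suspect_rows(lines, expected_fields=2, delimiter="\t"):
    """Return (line_number, row) pairs whose field count is unexpected.

    Hypothetical helper for illustration; `lines` is any iterable of
    text lines, such as an open file.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    suspects = []
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the physical line on which this record ended
            suspects.append((reader.line_num, row))
    return suspects

sample = [
    "125402355914129408\t#TripleKrown\n",
    # made-up malformed line with a third field, for demonstration only
    "125402323110461441\tWhat's ya purpose ??\textra field\n",
    "125402077856927744\tBlammer\n",
]
suspects = find_suspect_rows(sample)
print(suspects)
```

For a real file, an open handle can be passed instead of the list, for example `open(path, newline="")`, and the reported line numbers can then be used to cut a small reproducing subset.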

Comment From: Dr-Irv

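@ohadle The rows that look left-aligned in the full display of the table are not actually left-aligned. The column is right-aligned when printed, and those values are long enough to be truncated on the right, so they fill the whole column width. The shorter values are padded on the left, which is the offset you see. The data was read correctly from the text file you provided.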
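
A small sketch of this point, using two rows from test.txt above under default display options: the shorter string is padded on the left in the printed table, but the stored value carries no leading whitespace.

```python
import pandas as pd

df = pd.DataFrame({0: [125403116848943104, 125402461803515904],
                   1: ["They stay away from you for a reason stupid SMH",
                       "@FFHIPHOP wats gd boii"]})

# Printed table: the shorter string is padded on the left to line up
# with the right edge of the column
rendered = repr(df)
print(rendered)

# Stored value: no leading whitespace, the padding exists only in the display
value = df.iloc[1, 1]
print(repr(value))
```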

Comment From: ohadle

OK, thank you @Dr-Irv and @TomAugspurger !