Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [29]:
import pandas as pd
pd.__version__
Out [29]:
'1.4.3'

In [30]:
len(open("bad.csv").readlines())
Out [30]:
3

In [31]:
df1 = pd.read_csv("bad.csv", on_bad_lines='warn', engine='python')
Skipping line 3: ',' expected after '"'


In [32]:
df2 = pd.read_csv("bad.csv", on_bad_lines=print, engine='python')

In [33]:
len(df1), len(df2)
Out [33]:
(1, 1)
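The session above can be condensed into a standalone script. This is a sketch using the simplified two-column version of the file that appears later in the thread; the `record` callable is illustrative and mimics `print` by returning `None`:

```python
import io

import pandas as pd

# Simplified contents of bad.csv (header, one good row, one row with an
# unescaped quote inside a quoted field).
DATA = (
    "country,name\n"
    "united states,heritage equine equipment llc\n"
    'chile,"contacto \\" corporación colina"\n'
)

bad_lines = []  # records every line the callable is handed


def record(line):
    bad_lines.append(line)
    return None  # returning None skips the line, just like print does


# 'warn' emits a ParserWarning and skips line 3.
df1 = pd.read_csv(io.StringIO(DATA), on_bad_lines="warn", engine="python")

# One would expect the callable to be invoked for the same bad line.
df2 = pd.read_csv(io.StringIO(DATA), on_bad_lines=record, engine="python")

print(len(df1), len(df2), bad_lines)
```

In both cases the bad line is dropped and each frame ends up with a single row; the question in this report is whether the callable is ever invoked for it.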

Issue Description

The data file above has a header plus two data rows. Line 2 is valid, line 3 is bad.

For df1, I'm setting on_bad_lines='warn', and I see a warning for line 3.

For df2, I'm passing on_bad_lines=print, and I don't see any output: the bad line is silently skipped.

❯ cat bad.csv
country,founded,id,industry,linkedin_url,locality,name,region,size,website
united states,"",heritage-equine-equipment-llc,farming,linkedin.com/company/heritage-equine-equipment-llc,"",heritage equine equipment llc,"",1-10,heritageequineequip.com
chile,"",contacto-corporación-colina,hospital & health care,linkedin.com/company/contacto-corporación-colina,colina,"contacto \" corporación colina",santiago metropolitan,11-50,corporacioncolina.cl

Expected Behavior

I would expect the bad line to be printed in the second case.

Installed Versions

pd.show_versions()

INSTALLED VERSIONS
------------------
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-49-generic
Version : #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.6.0
pip : 22.0.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Comment From: phofl

Hi, thanks for your report, can reproduce this too.

could you try simplifying the csv file? It’s hard to see what’s going on in there right now

Comment From: mroeschke

This may be working as expected if I am looking at your csv file correctly.

As the docs state:

Specifies what to do upon encountering a bad line (a line with too many fields).

And I think each line has the same number of elements?

Comment From: indigoviolet

Pandas BUG: on_bad_lines=callable does not invoke callable for all bad lines

  1. Which lines are considered bad should not be different between 'warn' and print.

  2. I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

Comment From: indigoviolet

Hi, thanks for your report, can reproduce this too.

could you try simplifying the csv file? It’s hard to see what’s going on in there right now

Here's a simplified version:

❯ cat bad2.csv
country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"

Setting escapechar='\\' allows the chile line to be read, but the bug as reported (different behavior between 'warn' and print) is still valid.
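For reference, a minimal sketch of the escapechar workaround mentioned above, using the simplified file:

```python
import io

import pandas as pd

# Simplified bad2.csv contents from this thread.
DATA = (
    "country,name\n"
    "united states,heritage equine equipment llc\n"
    'chile,"contacto \\" corporación colina"\n'
)

# With escapechar set, \" inside the quoted field is read as a literal quote,
# so the third line parses cleanly and no line is treated as bad.
df = pd.read_csv(io.StringIO(DATA), escapechar="\\", engine="python")
print(df)
```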

Comment From: baodangpro

I agree that this is a bug. I have no confidence passing a callback to on_bad_lines, since bad lines caused by escapechar issues will be skipped silently.

Comment From: kostyafarber

print returns None, so by the description in the docs the bad line will be ignored.

If the function returns None the bad line will be ignored
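The return-value semantics are easy to check with a line that has too many fields, which does reach the callable. A sketch (the lambdas are illustrative):

```python
import io

import pandas as pd

DATA = "a,b\n1,2\n3,4,5\n"  # the last line has too many fields

# Returning None drops the bad line, exactly like passing print.
df_skip = pd.read_csv(io.StringIO(DATA), engine="python",
                      on_bad_lines=lambda line: None)

# Returning a list of the expected width keeps a repaired version of the line.
df_keep = pd.read_csv(io.StringIO(DATA), engine="python",
                      on_bad_lines=lambda line: line[:2])

print(len(df_skip), len(df_keep))
```

So a callable returning `None` behaves like `print` for the lines it actually receives; the issue here is that some bad lines never reach it at all.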

Comment From: kostyafarber

So I think I've identified the issue:

https://github.com/pandas-dev/pandas/blob/eb2351205c9d63ffc82753df59881b9138349869/pandas/io/parsers/python_parser.py#L776-L786

This block catches the CSV error: in the example, the line chile,"contacto \" corporación colina", read without escapechar='\\' defined, causes this error to be caught:

'\',\' expected after \'"\''

This block does not deal with user-defined callables, only the pandas-defined handlers, and just returns None, effectively "skipping" the line.

The callable gets triggered in this snippet if the number of fields in a row is not as expected:

https://github.com/pandas-dev/pandas/blob/eb2351205c9d63ffc82753df59881b9138349869/pandas/io/parsers/python_parser.py#L988-L1008

For example if I append

japan, bum ba boom, doon

to the example csv

country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"
japan, bum ba boom, doon

The callable will trigger for this line, since max_len > col_len, but not for chile,"contacto \" corporación colina".
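That asymmetry can be shown in a runnable sketch (escapechar is set here so the chile line parses; the `recorder` callable is illustrative):

```python
import io

import pandas as pd

DATA = (
    "country,name\n"
    "united states,heritage equine equipment llc\n"
    'chile,"contacto \\" corporación colina"\n'
    "japan, bum ba boom, doon\n"
)

seen = []  # bad lines handed to the callable


def recorder(line):
    seen.append(line)
    return None  # skip the bad line


# With escapechar set, the chile line parses cleanly, so the only bad line is
# the japan one: three fields where two are expected. That path does invoke
# the callable.
df = pd.read_csv(io.StringIO(DATA), engine="python", escapechar="\\",
                 on_bad_lines=recorder)
print(len(df), seen)
```

The callable receives the japan line split into its three raw fields, and returning `None` drops it from the result.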

Not sure what the best way to go about a fix is. Perhaps we could call the user-defined callable when we catch the error, and let it process and return the line?

Not sure what the implications would be downstream if the line wasn't processed correctly though and if we would want this to propagate down the parser.

Happy to work on this and open a PR if we have an idea on the approach for the fix.

Comment From: kostyafarber

Any ideas?

Comment From: Dr-Irv

Given how it's documented, I think this statement is correct:

I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

In other words, the callable should be called if the line was detected as bad. That is what should get fixed.

@mroeschke do you agree that if the line is bad in any way, the callable should be called, as opposed to just getting called if the number of fields is wrong?

Comment From: mroeschke

I think when implementing this feature, I tailored the callable to match how a "bad line" was documented at the time:

Specifies what to do upon encountering a bad line (a line with too many fields).

And relied on how too many fields was defined internally.

Not sure why "too many fields" was the definition of a bad line at the time, but I would be open to expanding the definition of what "bad" means in regards to a line.

Comment From: kostyafarber

@mroeschke that makes sense.

What would be an appropriate expansion of the meaning of "bad line" to improve the expected behaviour here?

Currently error and warn are implemented to handle any csv.Error. Does csv.Error only catch "a line with too many fields"?

If not, then the documentation may be slightly misleading: error and warn deal with many different cases, not just "a line with too many fields", while the callable only deals with "a line with too many fields".

A simple fix would be to change the documentation to better reflect what a bad line means and let the user process the line at the csv.Error stage.

Or we keep it as is and just change the documentation saying that the meaning of bad lines differ between user callables and error, warn etc.

Any thoughts?

Comment From: mroeschke

For now I would opt to improve the documentation, then. "bad line" is indeed a very nondescript term, and I would be wary of the code changes required to expand its definition without more discussion.

Comment From: kostyafarber

Okay, I can look at making documentation changes to on_bad_lines. Does this need a separate issue opened and a PR against that, since I suppose we can leave this one open for discussion on whether code changes are appropriate down the line?

If we open another issue we can discuss how we want to describe "bad lines" to reflect what's happening there.

Otherwise, I propose something along the lines of telling the user that user-defined callables act on "too many fields", whereas warn and error are triggered by any CSV parsing error. What do you think?

Comment From: mroeschke

You can cross reference this issue when making a PR and we can leave this issue open to further discuss a broader scope for "bad line"

Comment From: paul-theorem

I realize I'm commenting on a closed thread. I do think the clarifying documentation is good, but this probably needs an additional issue / feature request.

The meaning of on_bad_lines is obviously overloaded. It would be nice to get a callback for any skipped line so they can be counted. With a large dataset of millions of rows, the failure of 1 or 2 lines may be a minor annoyance, while the failure of 100k lines may be a serious defect. Without a general way to know the difference, we cannot programmatically differentiate.
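A counting callback along those lines is straightforward to write today, with the caveat from this thread that it only sees field-count mismatches, not lines the parser rejects with a csv.Error. A sketch (the class name and 1% threshold are illustrative):

```python
import io

import pandas as pd


class BadLineCounter:
    """Skips bad lines while counting how many were dropped."""

    def __init__(self):
        self.count = 0

    def __call__(self, line):
        self.count += 1
        return None  # drop the line


DATA = "a,b\n1,2\n3,4,5\n6,7\n8,9,10,11\n"  # two rows with too many fields

counter = BadLineCounter()
df = pd.read_csv(io.StringIO(DATA), engine="python", on_bad_lines=counter)

# A handful of drops may be tolerable; a large fraction indicates a defect.
total = len(df) + counter.count
if counter.count > 0.01 * total:
    print(f"warning: dropped {counter.count} of {total} rows")
```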

It is further confusing that the errors reported by the exception/warning vary depending on the value of engine. For the same file, with the default (unspecified) engine, I see:

Skipping line 339779: expected 150 fields, saw 159

This clearly looks like "too many fields", and would trigger the callback. But the error is different with engine='python':

Skipping line 339779: ',' expected after '"'

This does not trigger the callback, so the line is silently skipped.

Comment From: jrhamilton

Also getting silent skip on callable functions when using on_bad_lines.

First I tried writing bad lines to a file, but got blank files. Then I tried on_bad_lines=print like @indigoviolet, and got silent skips.

I also get the same error as @paul-theorem when removing on_bad_lines:

pandas.errors.ParserError: ',' expected after '"'

However, I get the same error whether or not engine is set to python.

Comment From: Dr-Irv

@jrhamilton If you have a reproducible example, it would be best to open up a new issue with that example so that the behavior you describe is investigated. We won't do anything on closed issues