Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
In [29]: import pandas as pd; pd.__version__
Out[29]: '1.4.3'

In [30]: len(open("bad.csv").readlines())
Out[30]: 3

In [31]: df1 = pd.read_csv("bad.csv", on_bad_lines='warn', engine='python')
Skipping line 3: ',' expected after '"'

In [32]: df2 = pd.read_csv("bad.csv", on_bad_lines=print, engine='python')

In [33]: len(df1), len(df2)
Out[33]: (1, 1)
```
Issue Description
The above data file has two rows plus a header. Row 2 is valid; row 3 is bad.
For `df1`, I'm setting `on_bad_lines='warn'`, and I see a warning for line 3.
For `df2`, I'm passing `on_bad_lines=print`, and I don't see any prints; the bad line is silently skipped.
```
❯ cat bad.csv
country,founded,id,industry,linkedin_url,locality,name,region,size,website
united states,"",heritage-equine-equipment-llc,farming,linkedin.com/company/heritage-equine-equipment-llc,"",heritage equine equipment llc,"",1-10,heritageequineequip.com
chile,"",contacto-corporación-colina,hospital & health care,linkedin.com/company/contacto-corporación-colina,colina,"contacto \" corporación colina",santiago metropolitan,11-50,corporacioncolina.cl
```
Expected Behavior
I would expect the bad line to be printed in the second case.
Installed Versions
Comment From: phofl
Hi, thanks for your report; I can reproduce this too.
Could you try simplifying the CSV file? It's hard to see what's going on in there right now.
Comment From: mroeschke
This may be working as expected if I am looking at your csv file correctly.
As the docs state:

> Specifies what to do upon encountering a bad line (a line with too many fields).

And I think each line has the same number of fields?
Comment From: indigoviolet
- Which lines are considered bad should not differ between `'warn'` and `print`.
- I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.
Comment From: indigoviolet
> Hi, thanks for your report; I can reproduce this too.
> Could you try simplifying the CSV file? It's hard to see what's going on in there right now.
Here's a simplified version:

```
❯ cat bad2.csv
country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"
```
Setting `escapechar='\\'` will allow reading the second line, but the bug (different behavior between `warn` and `print`) as reported is still valid.
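For reference, a minimal runnable sketch of the `escapechar` workaround (assuming pandas >= 1.4; the filename and variable names are just illustrative):

```python
import pandas as pd

# Recreate the simplified bad2.csv from above.
with open("bad2.csv", "w", encoding="utf-8") as f:
    f.write('country,name\n'
            'united states,heritage equine equipment llc\n'
            'chile,"contacto \\" corporación colina"\n')

# With escapechar the embedded \" is unescaped and both data rows parse.
df = pd.read_csv("bad2.csv", escapechar="\\", engine="python")
print(len(df))            # both rows survive
print(df.loc[1, "name"])  # the previously unparseable field
```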
Comment From: baodangpro
I agree that this is a bug. I can't confidently pass a callback for `on_bad_lines`, since some bad lines (e.g. due to a missing `escapechar`) will be skipped silently.
Comment From: kostyafarber
`print` returns `None`, so by the description in the docs the bad line will be ignored:

> If the function returns `None`, the bad line will be ignored.
Comment From: kostyafarber
So I think I've identified the issue:
https://github.com/pandas-dev/pandas/blob/eb2351205c9d63ffc82753df59881b9138349869/pandas/io/parsers/python_parser.py#L776-L786
This block catches the csv error (i.e., in the example, the line `chile,"contacto \" corporación colina"` without `escapechar='\\'` defined raises `',' expected after '"'`). It does not deal with user-defined callables, only the pandas-defined ones, and just returns `None`, effectively "skipping" the line.
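That csv error can be seen without pandas at all. A stdlib-only sketch (the python engine wraps the `csv` module): with `strict=True` the problem line raises exactly the message quoted above, while `escapechar` makes it parse.

```python
import csv
import io

line = 'chile,"contacto \\" corporación colina"\n'

# strict=True reproduces the parser's complaint: after the quote closed
# by \" the next character is a space, not the delimiter.
try:
    list(csv.reader(io.StringIO(line), strict=True))
except csv.Error as exc:
    print(exc)  # ',' expected after '"'

# With escapechar the backslash-escaped quote is taken literally.
[row] = csv.reader(io.StringIO(line), escapechar="\\")
print(row)
```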
The callable gets triggered in this snippet if the number of fields in a row is not as expected:
https://github.com/pandas-dev/pandas/blob/eb2351205c9d63ffc82753df59881b9138349869/pandas/io/parsers/python_parser.py#L988-L1008
For example, if I append `japan, bum ba boom, doon` to the example csv:

```
country,name
united states,heritage equine equipment llc
chile,"contacto \" corporación colina"
japan, bum ba boom, doon
```

the callable will trigger for this line, since `max_len > col_len`, but not for `chile,"contacto \" corporación colina"`.
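A sketch of that asymmetry (filename and callable name are illustrative; behaviour as observed with the python engine in the pandas version reported in this thread):

```python
import pandas as pd

with open("bad3.csv", "w", encoding="utf-8") as f:
    f.write('country,name\n'
            'united states,heritage equine equipment llc\n'
            'chile,"contacto \\" corporación colina"\n'
            'japan, bum ba boom, doon\n')

seen = []

def record_bad_line(fields):
    # Receives the bad line as a list of strings; returning None
    # tells pandas to drop the line.
    seen.append(fields)
    return None

df = pd.read_csv("bad3.csv", engine="python", on_bad_lines=record_bad_line)
# The field-count mismatch (3 fields vs 2 columns) reaches the callable;
# on the version reported here, the csv.Error line never does.
print(seen)
print(len(df))
```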
Not sure what the best way to go about a fix is. Perhaps we could call the user-defined callable when we catch the error and let it process the line and return it?
Not sure what the implications would be downstream if the line wasn't processed correctly, though, and whether we would want this to propagate down the parser.
Happy to work on this and open a PR if we have an idea on the approach for the fix.
Comment From: kostyafarber
Any ideas?
Comment From: Dr-Irv
Given how it's documented, I think this statement is correct:

> I would expect all skipped lines to be denoted bad, and for the callable to be able to handle all of them.

In other words, the callable should be called if the line was detected as bad. That is what should get fixed.
@mroeschke do you agree that if the line is bad in any way, the callable should be called, as opposed to just getting called if the number of fields is wrong?
Comment From: mroeschke
I think when implementing this feature at the time, I tailored the callable to apply to how a "bad line" was documented at the time:

> Specifies what to do upon encountering a bad line (a line with too many fields).

and relied on how "too many fields" was defined internally.
Not sure why "too many fields" was the definition of a bad line at the time, but I would be open to expanding the definition of what "bad" means in regards to a line.
Comment From: kostyafarber
@mroeschke that makes sense.
What would the appropriate expansion of the meaning of "bad line" be to improve the expected behaviour here?
Currently `error` and `warn` are implemented to handle any `csv.Error`. Does `csv.Error` only cover "a line with too many fields"?
If not, then the documentation may be slightly misleading, as `error` and `warn` deal with many different cases, not just "a line with too many fields", while a callable deals only with "a line with too many fields".
A simple fix would be to change the documentation to better reflect what a bad line means and let the user process the line at the `csv.Error` stage.
Or we keep it as is and just change the documentation to say that the meaning of "bad lines" differs between user callables and `error`, `warn`, etc.
Any thoughts?
Comment From: mroeschke
For now I would opt to improve the documentation, then. "bad line" is indeed a very nondescript term, and I would be wary of the code changes required to expand its definition without more discussion.
Comment From: kostyafarber
Okay, I can look at making documentation changes to `on_bad_lines`. Does this need a separate issue opened and a PR against that? I suppose we can leave this one open for discussion on whether code changes are appropriate down the line.
If we open another issue we can discuss how we want to describe "bad lines" to reflect what's happening there.
Otherwise, I propose something along the lines of telling the user that user-defined callables act on "too many fields", whereas `warn` and `error` are triggered by any CSV parsing error. What do you think?
Comment From: mroeschke
You can cross-reference this issue when making a PR, and we can leave this issue open to further discuss a broader scope for "bad line".
Comment From: paul-theorem
I realize I'm commenting on a closed thread. I do think the clarifying documentation is good, but this probably needs an additional issue / feature request.
The meaning of `on_bad_lines` is obviously overloaded. It would be nice to get a callback for any skipped line so they can be counted. With a large dataset with millions of rows, the failure of 1 or 2 lines may be a minor annoyance, while the failure of 100k lines may be a serious defect. Without a general way to know the difference, we cannot programmatically differentiate.
It is further confusing because the errors reported by the exception/warning vary depending on the value of `engine`. For the same file, with the default (unspecified) engine, I see:

```
Skipping line 339779: expected 150 fields, saw 159
```

which clearly looks like "too many fields" and would trigger the callback. But the error is different with `engine='python'`:

```
Skipping line 339779: ',' expected after '"'
```

thus not triggering the callback, and silently skipping without calling back.
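One hedged workaround sketch for counting every skipped line today: instead of the callable, use `on_bad_lines='warn'` and capture the warnings. Recent pandas versions emit a `ParserWarning` per skipped line; older versions printed to stderr instead, in which case this capture finds nothing.

```python
import warnings
import pandas as pd
from pandas.errors import ParserWarning

with open("bad2.csv", "w", encoding="utf-8") as f:
    f.write('country,name\n'
            'united states,heritage equine equipment llc\n'
            'chile,"contacto \\" corporación colina"\n')

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv("bad2.csv", on_bad_lines="warn", engine="python")

# Count the "Skipping line ..." warnings regardless of which parser
# error produced them (field count or csv.Error).
skipped = [w for w in caught if issubclass(w.category, ParserWarning)]
print(len(df), len(skipped))
```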
Comment From: jrhamilton
I'm also getting silent skips from callable functions when using `on_bad_lines`.
First I tried writing to a file, but got blank files. Then I tried `on_bad_lines=print` like @indigoviolet, and got silent skips.
I also get the same errors as @paul-theorem when removing `on_bad_lines`:

```
pandas.errors.ParserError: ',' expected after '"'
```

However, I get the same errors whether I have `engine` set to `python` or not.
Comment From: Dr-Irv
@jrhamilton If you have a reproducible example, it would be best to open a new issue with that example so that the behavior you describe can be investigated. We won't do anything on closed issues.