Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO
import pandas as pd

csvString = """MyId,MyDate,Name,C_D
ABC123,20141031,Name1,DDD
"""
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO)
my_series = file_df["MyDate"]
assert my_series[0] == 20141031
# everything fine


# 3 column headers, 4 values
csvString = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO)
my_series = file_df["MyDate"]
assert my_series[0] == "Name1"
# Issue: column headers and values are misaligned (assuming alignment should start from the left)
# on_bad_lines (default "error") does not work either

csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO, on_bad_lines="error")
my_series = file_df["MyDate"]
assert my_series[0] == "Name1"
# According to the [doc](http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#column-and-index-locations-and-names):
#   on_bad_lines (default error) - Specifies what to do upon encountering a bad line (a line with too many fields).
#   - ‘error’, raise a ParserError when a bad line is encountered.
# Issue: does not raise an error

csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO, on_bad_lines="skip")
my_series = file_df["MyDate"]
assert my_series[0] == "Name1"
# Issue: does not skip the bad line

pd.show_versions()  # prints the report itself and returns None, so no print() is needed

Issue Description

3 issues:

  • Column header and column get misaligned (assuming that the alignment should work from the left).
  • read_csv parameter on_bad_lines="error" does not raise errors.
  • read_csv parameter on_bad_lines="skip" does not skip the lines.
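
For contrast, here is a minimal sketch of a case where on_bad_lines does take effect: the header and the first data row agree on three fields, and only a later row carries an extra one. The data and variable names are made up for illustration and are not part of the original report.

```python
from io import StringIO

import pandas as pd

# Made-up data: header and first row have 3 fields, the second row has 4.
late_bad_csv = """MyId,MyDate,Name
ABC123,20141031,Name1
DEF456,20141101,Name2,DDD
"""

# Default on_bad_lines="error": the over-long second data row is reported.
error_message = None
try:
    pd.read_csv(StringIO(late_bad_csv))
except pd.errors.ParserError as exc:
    error_message = str(exc)
print(error_message)

# on_bad_lines="skip": the over-long row is dropped, the first row is kept.
skipped_df = pd.read_csv(StringIO(late_bad_csv), on_bad_lines="skip")
print(skipped_df)
```

So the parameter itself seems to work. The difference in the report above is that the only data row is also the first one.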

Expected Behavior

I would expect on_bad_lines="error" to raise an error and on_bad_lines="skip" to return an empty DataFrame.

As an enhancement, I would also suggest continuing to return lines with too many values, similar to the current behaviour. However, the column headers and values should be aligned from the left, i.e.

csvString = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO, on_bad_lines="alignleft")
my_series = file_df["MyDate"]
assert my_series[0] == 20141031
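
The "alignleft" value above is only a proposal and does not exist in pandas. Something close to it can be sketched today with the existing index_col=False parameter, which stops read_csv from using the first column as the index. This is a rough sketch under the assumption of a reasonably recent pandas: the surplus field is dropped, and a ParserWarning about the lost data may be emitted, so the warning is captured here rather than asserted on.

```python
import warnings
from io import StringIO

import pandas as pd

csv_text = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""

# index_col=False keeps the header aligned with the leftmost fields.
# The fourth field ("DDD") has no header and is dropped.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    left_aligned_df = pd.read_csv(StringIO(csv_text), index_col=False)

print(left_aligned_df)
print([str(w.message) for w in caught])  # may contain a ParserWarning
```

Note that this loses the extra value instead of keeping it, so it is only a partial substitute for the enhancement suggested above.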

Installed Versions

C:\Users\me\Envs\myenv\lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.9.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.5.2
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.2.2
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.0
snappy : None
sqlalchemy : 1.4.29
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 1.2.0
xlwt : None
zstandard : None
tzdata : None

Comment From: phofl

Hi, thanks for your report.

If you have fewer column names than columns in the data area, then the extra columns (taken from the left) are used as the index. This behaves as intended.
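
To make the index behaviour described here visible, a small sketch using the data from the report. It uses .iloc for positional access, since the resulting index holds strings rather than integers. Variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

csv_text = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""

implicit_index_df = pd.read_csv(StringIO(csv_text))

# The leftmost field became the index; the three headers label the rest.
print(implicit_index_df.index.tolist())
print(implicit_index_df.columns.tolist())
print(implicit_index_df.iloc[0].tolist())
```

With four fields and three names, "ABC123" ends up as the row label, which is why file_df["MyDate"] in the report returns "Name1".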