Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from io import StringIO
import pandas as pd
csvString = """MyId,MyDate,Name,C_D
ABC123,20141031,Name1,DDD
"""
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO)
my_series = file_df["MyDate"]
assert my_series[0] == 20141031
# everything fine
# 3 column headers, 4 values
csvString = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO)
my_series = file_df["MyDate"]
assert my_series[0] == "Name1"
# Issue: column headers and values are misaligned (assuming alignment should work from the left)
# on_bad_lines="error" (the default) does not help either
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO, on_bad_lines="error")
my_series = file_df["MyDate"]
assert my_series[0] == "Name1"
# According to the [doc](http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#column-and-index-locations-and-names):
# on_bad_lines (default error) - Specifies what to do upon encountering a bad line (a line with too many fields).
# - ‘error’, raise a ParserError when a bad line is encountered.
# Issue: does not raise an error
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO, on_bad_lines="skip")
my_series = file_df["MyDate"]
assert my_series[0] == "Name1"
# Issue: does not skip the bad line
print(pd.show_versions())
Issue Description
3 issues:
- Column headers and column values get misaligned (assuming that the alignment should work from the left).
- read_csv parameter on_bad_lines="error" does not raise errors.
- read_csv parameter on_bad_lines="skip" does not skip the lines.
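For contrast, here is a minimal sketch of input on which on_bad_lines does take effect: the header and the first data row agree on three fields, and a later row has four. The data and variable names are made up for illustration and are not part of the report.

```python
from io import StringIO

import pandas as pd

# Illustrative data (not from the report): the second data row has one field too many.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="error" (the default): the long row raises a ParserError.
try:
    pd.read_csv(StringIO(data), on_bad_lines="error")
    raised = False
except pd.errors.ParserError:
    raised = True

# on_bad_lines="skip": the long row is dropped and the other rows are kept.
skipped_df = pd.read_csv(StringIO(data), on_bad_lines="skip")

print(raised)
print(skipped_df)
```

In the reproducible example the surplus field sits in the first and only data row, which pandas appears to handle differently (see the maintainer's comment at the end of this report).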
Expected Behavior
I would expect on_bad_lines="error" to raise an error and on_bad_lines="skip" to return an empty DataFrame.
As an enhancement, I would also suggest an option that keeps lines with too many values, similar to the current behaviour. However, the column headers and values should then be aligned from the left, e.g. with a new (not yet existing) value such as on_bad_lines="alignleft":
csvString = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""
csvStringIO = StringIO(csvString)
file_df = pd.read_csv(csvStringIO, on_bad_lines="alignleft")
my_series = file_df["MyDate"]
assert my_series[0] == 20141031
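Note that on_bad_lines="alignleft" is a proposed value, not an existing one. As a possible workaround, and only a sketch, the existing index_col=False argument tells read_csv not to use the first column as the index. With the data above this should keep the headers aligned from the left and drop the surplus trailing field, and pandas may emit a ParserWarning about the lost data.

```python
import warnings
from io import StringIO

import pandas as pd

csv_string = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""

with warnings.catch_warnings():
    # pandas may warn that the surplus field ("DDD") is dropped; silence it here.
    warnings.simplefilter("ignore", pd.errors.ParserWarning)
    left_df = pd.read_csv(StringIO(csv_string), index_col=False)

# Headers and values line up from the left; MyDate is parsed as an integer.
print(left_df)
print(left_df["MyDate"].iloc[0])
```

This is not equivalent to the suggested enhancement, since the fourth value is discarded rather than kept.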
Installed Versions
INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.9.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.5.2
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.2.2
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.0
snappy : None
sqlalchemy : 1.4.29
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 1.2.0
xlwt : None
zstandard : None
tzdata : None
None
Comment From: phofl
Hi, thanks for your report.
If you have fewer column names than columns in the data area, then the remaining columns are used as the index. This behaves as intended.
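A small sketch of what this means for the data in the report: with three names and four fields, the leftmost field becomes the index and the three names label the fields to its right. The variable names are illustrative.

```python
from io import StringIO

import pandas as pd

csv_string = """MyId,MyDate,Name
ABC123,20141031,Name1,DDD
"""

df = pd.read_csv(StringIO(csv_string))

# The surplus leading field ("ABC123") is used as the index.
print(df.index.tolist())

# The three header names label the three fields to its right.
print(df.columns.tolist())

# .iloc avoids ambiguity between label and position lookups on a string index.
print(df["MyId"].iloc[0], df["MyDate"].iloc[0], df["Name"].iloc[0])
```

This is why file_df["MyDate"] holds "Name1" in the report: the names are matched against the fields after the implicit index column.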