Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas

df = pandas.read_csv(
    "D:\\NIDS Datasets\\CIC-IDS2017\\GeneratedLabelledFlows\\TrafficLabelling\\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
    encoding='utf-8',
    header=0,
    usecols=['Source IP', 'Destination IP', 'Source Port', 'Destination Port'],
    low_memory=False,
)

Issue Description

Reading the CSV file fails with an error suggesting the requested columns are not part of the DataFrame, even though they are clearly present. Specifically, the code results in the following stack trace.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 df = pandas.read_csv(
      2     "D:\\NIDS Datasets\\CIC-IDS2017\\GeneratedLabelledFlows\\TrafficLabelling\\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
      3     encoding='utf-8',
      4     header=0,
      5     usecols=['Source IP', 'Destination IP', 'Source Port', 'Destination Port'],
      6 )
      7 df.head()

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\util\_decorators.py:211, in deprecate_kwarg.._deprecate_kwarg..wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments..decorate..wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:605, in _read(filepath_or_buffer, kwds)
    602 _validate_names(kwds.get("names", None))
    604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
    607 if chunksize or iterator:
    608     return parser

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
   1439     self.options["has_index_names"] = kwds["has_index_names"]
   1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:1753, in TextFileReader._make_engine(self, f, engine)
   1750     raise ValueError(msg)
   1752 try:
-> 1753     return mapping[engine](f, **self.options)
   1754 except Exception:
   1755     if self.handles is not None:

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:135, in CParserWrapper.__init__(self, src, **kwds)
    131 assert self.orig_names is not None
    132 if self.usecols_dtype == "string" and not set(usecols).issubset(
    133     self.orig_names
    134 ):
--> 135     self._validate_usecols_names(usecols, self.orig_names)
    137 # error: Cannot determine type of 'names'
    138 if len(self.names) > len(usecols):  # type: ignore[has-type]
    139     # error: Cannot determine type of 'names'

File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\base_parser.py:917, in ParserBase._validate_usecols_names(self, usecols, names)
    915 missing = [c for c in usecols if c not in names]
    916 if len(missing) > 0:
--> 917     raise ValueError(
    918         f"Usecols do not match columns, columns expected but not found: "
    919         f"{missing}"
    920     )
    922 return usecols

ValueError: Usecols do not match columns, columns expected but not found: ['Destination Port', 'Destination IP', 'Source IP', 'Source Port']

Expected Behavior

The DataFrame should be read correctly given the specified columns are present in the CSV file.

Surprisingly, the DataFrame is read correctly when a list of names is passed, as in the example below. The columns requested via usecols are a subset of the columns provided via names.

df = pandas.read_csv(
    "D:\\NIDS Datasets\\CIC-IDS2017\\GeneratedLabelledFlows\\TrafficLabelling\\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
    encoding='utf-8',
    header=0,
    usecols=['Source IP', 'Destination IP', 'Source Port', 'Destination Port'],
    names=['Flow ID', 'Source IP', 'Source Port', 'Destination IP',
       'Destination Port', 'Protocol', 'Timestamp', 'Flow Duration',
       'Total Fwd Packets', 'Total Backward Packets',
       'Total Length of Fwd Packets', 'Total Length of Bwd Packets',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
       'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
       'Min Packet Length', 'Max Packet Length', 'Packet Length Mean',
       'Packet Length Std', 'Packet Length Variance', 'FIN Flag Count',
       'SYN Flag Count', 'RST Flag Count', 'PSH Flag Count', 'ACK Flag Count',
       'URG Flag Count', 'CWE Flag Count', 'ECE Flag Count', 'Down/Up Ratio',
       'Average Packet Size', 'Avg Fwd Segment Size', 'Avg Bwd Segment Size',
       'Fwd Header Length.1', 'Fwd Avg Bytes/Bulk', 'Fwd Avg Packets/Bulk',
       'Fwd Avg Bulk Rate', 'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk',
       'Bwd Avg Bulk Rate', 'Subflow Fwd Packets', 'Subflow Fwd Bytes',
       'Subflow Bwd Packets', 'Subflow Bwd Bytes', 'Init_Win_bytes_forward',
       'Init_Win_bytes_backward', 'act_data_pkt_fwd', 'min_seg_size_forward',
       'Active Mean', 'Active Std', 'Active Max', 'Active Min', 'Idle Mean',
       'Idle Std', 'Idle Max', 'Idle Min', 'Label']
)

While I doubt this is caused by the dataset itself, I would like to note that the dataset is publicly available at https://www.unb.ca/cic/datasets/ids-2017.html.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.10.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Netherlands.1252

pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.6.0
pip : 23.0.1
Cython : 0.29.32
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: phofl

Hi, thanks for your report.

Please provide a reproducible example that does not require devs to register somewhere; ideally it shouldn't rely on any files.

That said, this is most likely caused by your file, since usecols generally works if the file is correct.

Comment From: Koen1999

While working on an easier-to-reproduce sample, I discovered that the actual problem is different from what I thought. Therefore, I will close the issue and detail the actual cause below in case anyone ever comes across this issue.

The problem is that the first line of the CSV file sometimes contains spaces after the delimiter (while the rest of the CSV file does not), so it looks like this:

Length of Bwd Packets, Fwd Packet Length Max, Fwd Packet Length Min, Fwd Packet Length Mean, Fwd Packet Length Std,Bwd Packet Length Max, Bwd Packet Length Min, Bwd Packet Length Mean, Bwd Packet Length Std,Flow Bytes/s, Flow Packets/s, Flow IAT Mean, Flow IAT Std, Flow IAT Max, Flow IAT Min,Fwd IAT Total, Fwd IAT Mean, Fwd IAT Std, Fwd IAT Max, Fwd IAT Min,Bwd IAT Total, Bwd IAT Mean, Bwd IAT Std, Bwd IAT Max, Bwd IAT Min,Fwd PSH Flags, Bwd PSH Flags, Fwd URG Flags, Bwd URG Flags, Fwd Header Length, Bwd Header Length,Fwd Packets/s, Bwd Packets/s, Min Packet Length, Max Packet Length, Packet Length Mean, Packet Length Std, Packet Length Variance,FIN Flag Count, SYN Flag Count, RST Flag Count, PSH Flag Count, ACK Flag Count, URG Flag Count, CWE Flag Count, ECE Flag Count, Down/Up Ratio, Average Packet Size, Avg Fwd Segment Size, Avg Bwd Segment Size, Fwd Header Length,Fwd Avg Bytes/Bulk, Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets, Subflow Fwd Bytes, Subflow Bwd Packets, Subflow Bwd Bytes,Init_Win_bytes_forward, Init_Win_bytes_backward, act_data_pkt_fwd, min_seg_size_forward,Active Mean, Active Std, Active Max, Active Min,Idle Mean, Idle Std, Idle Max, Idle Min, Label
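A quick way to confirm this (a minimal sketch; the path below is illustrative, substitute the CSV you are reading) is to print the raw header line and look for leading spaces in the field names:

# Illustrative path; replace with your own CSV file.
path = "Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"

# Read the raw header line without any parsing.
with open(path, encoding="utf-8") as f:
    header = f.readline().rstrip("\n")

# Stray spaces after the delimiter show up as leading spaces in the split fields.
print(header.split(",")[:8])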

The result is that those column names are read with a leading space (e.g. ' Source IP' instead of 'Source IP'), so they do not match the values passed to usecols. The mitigation is to set skipinitialspace=True.
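For completeness, a minimal self-contained sketch of the symptom and the mitigation, using an in-memory CSV instead of the original dataset:

import io
import pandas

# The header row has a stray space after the delimiter; the data rows do not.
csv_data = "Flow ID, Source IP\n1,192.168.0.1\n2,10.0.0.2\n"

# Without skipinitialspace the header is parsed as ['Flow ID', ' Source IP'],
# so usecols=['Source IP'] raises "Usecols do not match columns ...".
# pandas.read_csv(io.StringIO(csv_data), usecols=['Source IP'])

# With skipinitialspace=True the space after the delimiter is stripped and the read succeeds.
df = pandas.read_csv(io.StringIO(csv_data), usecols=['Source IP'], skipinitialspace=True)
print(df)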