Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of pandas.
-
[x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
df = pandas.read_csv(
"D:\\NIDS Datasets\\CIC-IDS2017\\GeneratedLabelledFlows\\TrafficLabelling\\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
encoding='utf-8',
header=0,
usecols=['Source IP', 'Destination IP', 'Source Port', 'Destination Port'],
low_memory=False,
)
Issue Description
Reading the CSV file fails and an error is raised suggesting the requested columns are not part of the DataFrame while these are clearly present. In specific, the code results in the following stack trace.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[28], line 1
----> 1 df = pandas.read_csv(
2 "D:\\NIDS Datasets\\CIC-IDS2017\\GeneratedLabelledFlows\\TrafficLabelling\\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
3 encoding='utf-8',
4 header=0,
5 usecols=['Source IP', 'Destination IP', 'Source Port', 'Destination Port'],
6 )
7 df.head()
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\util\_decorators.py:211, in deprecate_kwarg.._deprecate_kwarg..wrapper(*args, **kwargs)
209 else:
210 kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments..decorate..wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
935 kwds_defaults = _refine_defaults_read(
936 dialect,
937 delimiter,
(...)
946 defaults={"delimiter": ","},
947 )
948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:605, in _read(filepath_or_buffer, kwds)
602 _validate_names(kwds.get("names", None))
604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
607 if chunksize or iterator:
608 return parser
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
1439 self.options["has_index_names"] = kwds["has_index_names"]
1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\readers.py:1753, in TextFileReader._make_engine(self, f, engine)
1750 raise ValueError(msg)
1752 try:
-> 1753 return mapping[engine](f, **self.options)
1754 except Exception:
1755 if self.handles is not None:
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:135, in CParserWrapper.__init__(self, src, **kwds)
131 assert self.orig_names is not None
132 if self.usecols_dtype == "string" and not set(usecols).issubset(
133 self.orig_names
134 ):
--> 135 self._validate_usecols_names(usecols, self.orig_names)
137 # error: Cannot determine type of 'names'
138 if len(self.names) > len(usecols): # type: ignore[has-type]
139 # error: Cannot determine type of 'names'
File c:\Users\user\AppData\Local\Programs\Python\venv\lib\site-packages\pandas\io\parsers\base_parser.py:917, in ParserBase._validate_usecols_names(self, usecols, names)
915 missing = [c for c in usecols if c not in names]
916 if len(missing) > 0:
--> 917 raise ValueError(
918 f"Usecols do not match columns, columns expected but not found: "
919 f"{missing}"
920 )
922 return usecols
ValueError: Usecols do not match columns, columns expected but not found: ['Destination Port', 'Destination IP', 'Source IP', 'Source Port']
Expected Behavior
The DataFrame should be read correctly given the specified columns are present in the CSV file.
Surprisingly, the DataFrame is read correctly when a list of names
as passed, such as in the example below. The columns requested using usecols
is a subset of the columns provided using names
.
df = pandas.read_csv(
"D:\\NIDS Datasets\\CIC-IDS2017\\GeneratedLabelledFlows\\TrafficLabelling\\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
encoding='utf-8',
header=0,
usecols=['Source IP', 'Destination IP', 'Source Port', 'Destination Port'],
names=['Flow ID', 'Source IP', 'Source Port', 'Destination IP',
'Destination Port', 'Protocol', 'Timestamp', 'Flow Duration',
'Total Fwd Packets', 'Total Backward Packets',
'Total Length of Fwd Packets', 'Total Length of Bwd Packets',
'Fwd Packet Length Max', 'Fwd Packet Length Min',
'Fwd Packet Length Mean', 'Fwd Packet Length Std',
'Bwd Packet Length Max', 'Bwd Packet Length Min',
'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
'Min Packet Length', 'Max Packet Length', 'Packet Length Mean',
'Packet Length Std', 'Packet Length Variance', 'FIN Flag Count',
'SYN Flag Count', 'RST Flag Count', 'PSH Flag Count', 'ACK Flag Count',
'URG Flag Count', 'CWE Flag Count', 'ECE Flag Count', 'Down/Up Ratio',
'Average Packet Size', 'Avg Fwd Segment Size', 'Avg Bwd Segment Size',
'Fwd Header Length.1', 'Fwd Avg Bytes/Bulk', 'Fwd Avg Packets/Bulk',
'Fwd Avg Bulk Rate', 'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk',
'Bwd Avg Bulk Rate', 'Subflow Fwd Packets', 'Subflow Fwd Bytes',
'Subflow Bwd Packets', 'Subflow Bwd Bytes', 'Init_Win_bytes_forward',
'Init_Win_bytes_backward', 'act_data_pkt_fwd', 'min_seg_size_forward',
'Active Mean', 'Active Std', 'Active Max', 'Active Min', 'Idle Mean',
'Idle Std', 'Idle Max', 'Idle Min', 'Label']
)
While I doubt this is because of the dataset being used, I would like to note that the dataset is publicly available at https://www.unb.ca/cic/datasets/ids-2017.html
Installed Versions
Comment From: phofl
Hi, thanks for your report.
please provide a reproducible example that does not require devs to register somewhere, ideally this shouldn’t rely on any files.
that said this is most likely because of your file since Usecols generally works if the file is correct
Comment From: Koen1999
While working on an easier to reproduce sample, I discovered the actual problem is different from what I thought. Therefore, I will close the issue and detail the actual issue below in case anyone ever comes across this issue.
The problem is that the first line of the CSV file sometimes contains spaces after the delimiter (and the rest of the CSV file does not). So it looks like this
Length of Bwd Packets, Fwd Packet Length Max, Fwd Packet Length Min, Fwd Packet Length Mean, Fwd Packet Length Std,Bwd Packet Length Max, Bwd Packet Length Min, Bwd Packet Length Mean, Bwd Packet Length Std,Flow Bytes/s, Flow Packets/s, Flow IAT Mean, Flow IAT Std, Flow IAT Max, Flow IAT Min,Fwd IAT Total, Fwd IAT Mean, Fwd IAT Std, Fwd IAT Max, Fwd IAT Min,Bwd IAT Total, Bwd IAT Mean, Bwd IAT Std, Bwd IAT Max, Bwd IAT Min,Fwd PSH Flags, Bwd PSH Flags, Fwd URG Flags, Bwd URG Flags, Fwd Header Length, Bwd Header Length,Fwd Packets/s, Bwd Packets/s, Min Packet Length, Max Packet Length, Packet Length Mean, Packet Length Std, Packet Length Variance,FIN Flag Count, SYN Flag Count, RST Flag Count, PSH Flag Count, ACK Flag Count, URG Flag Count, CWE Flag Count, ECE Flag Count, Down/Up Ratio, Average Packet Size, Avg Fwd Segment Size, Avg Bwd Segment Size, Fwd Header Length,Fwd Avg Bytes/Bulk, Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets, Subflow Fwd Bytes, Subflow Bwd Packets, Subflow Bwd Bytes,Init_Win_bytes_forward, Init_Win_bytes_backward, act_data_pkt_fwd, min_seg_size_forward,Active Mean, Active Std, Active Max, Active Min,Idle Mean, Idle Std, Idle Max, Idle Min, Label
.
The result is that column names are read as Source IP
(with a preceding space). The mitigation is to set skipinitialspace=True