Code Sample, a copy-pastable example if possible
from io import StringIO
import pandas as pd
f = StringIO('''A B
C D''')
df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
print(df)
Problem description
Output:
0 1
0 A B
1 NaN NaN
2 C D
The (second) blank line is not skipped, but instead there is a row with two NaN
values. It seems that skip_blank_lines
has no effect on read_fwf
. On the other hand, read_csv(f, sep=' ', header=None)
, produces the expected output below.
Expected Output
0 1
0 A B
1 C D
Output of pd.show_versions()
Comment From: WillAyd
Thanks for the report - investigation and PRs are certainly welcome
Comment From: danpere
The bug appears to be in pandas/io/parsers.py
in the way _remove_empty_lines()
is used with fixed-width files (or probably files with whitespace delimiters for that matter). It sees that the line is ['', '']
which in a CSV would mean the string ","
and therefore non-empty, but for a fixed-width file can happen when the line is empty. ~~Arguably, if the line really were spaces out to the last field, empty strings might be the right thing to extract.~~ (Edit: Nevermind, then the fields would be a series of spaces, not empty.) I'm not sure if the right fix is to change _remove_empty_lines()
or its usages.
Comment From: dvalters
I believe this bug also extends to read_excel
and read_csv
in files that have 'empty' trailing lines, and is more generic than just the read_fwf
function.
Code sample
import pandas as pd
from io import StringIO
csv_f = StringIO('''A,B,C,D
FOO,1,2,3
FOO,4,5,6
,,,
FOO,7,8,9
,10,11,12
,,,
,,,
,,,
,,,
,,,
'''
)
df = pd.read_csv(csv_f, header=None, skip_blank_lines=True)
print(df)
Output
0 1 2 3
0 A B C D
1 FOO 1 2 3
2 FOO 4 5 6
3 NaN NaN NaN NaN
4 FOO 7 8 9
5 NaN 10 11 12
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
With read_excel
if the workbook has 'blank' lines that contain any sort of formula that result in a null string or blank cell (but not empty cell), similar behaviour is exhibited when skip_blank_lines=True
(which is default true anyway according to the docs https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-and-missing-data-handling)
Comment From: dvalters
Also possibly related to: #10164
Comment From: dcdenu4
Any updates on this? I'm still seeing this behavior in pandas 1.0.3
from io import StringIO
csv_file = StringIO('''lucode,desc,val1,val2
1,corn,0.5,2
,,,
3,peas,1,-2
'''
)
df = pandas.read_csv(csv_file, skip_blank_lines=True)
df
lucode desc val1 val2
0 1.0 corn 0.5 2.0
1 NaN NaN NaN NaN
2 3.0 peas 1.0 -2.0
Comment From: jreback
@dcdenu4 or anyone can submit a PR pandas is all volunteer and we have 3000+ open issues
Comment From: mroeschke
This looks to work on master now. Could use a test
In [1]: from io import StringIO
...: import pandas as pd
...:
...: f = StringIO('''A B
...:
...: C D''')
...:
...: df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
...: print(df)
0 1
0 A B
1 C D
Comment From: lucnguyen93
take
Comment From: jnclt
As mentioned above, the original issue isn't reproducible anymore and there is a test covering this case (test_fwf_skip_blank_lines
): https://github.com/pandas-dev/pandas/blob/36dcf519c67a8098572447f7d5a896740fc9c464/pandas/tests/io/parser/test_read_fwf.py#L354
As for read_csv
, I think the behavior is as expected (Nan
for coma-separated blank values).
I guess this issue can be closed?
Comment From: mroeschke
Yeah looks like test_fwf_skip_blank_lines
tests this so closing