Pandas read_fwf: skip_blank_lines does nothing

Code Sample, a copy-pastable example if possible

from io import StringIO
import pandas as pd

f = StringIO('''A B

C D''')

df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
print(df)

Problem description

Output:

     0    1
0    A    B
1  NaN  NaN
2    C    D

The (second) blank line is not skipped, but instead there is a row with two NaN values. It seems that skip_blank_lines has no effect on read_fwf. On the other hand, read_csv(f, sep=' ', header=None), produces the expected output below.

Expected Output

   0  1
0  A  B
1  C  D

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.6.6.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None pandas: 0.23.1 pytest: 3.6.2 pip: 18.0 setuptools: 39.0.1 Cython: None numpy: 1.15.1 scipy: None pyarrow: None xarray: None IPython: 6.5.0 sphinx: 1.5.5 patsy: None dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 1.0.1 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

Comment From: WillAyd

Thanks for the report - investigation and PRs are certainly welcome

Comment From: danpere

The bug appears to be in pandas/io/parsers.py in the way _remove_empty_lines() is used with fixed-width files (or probably files with whitespace delimiters for that matter). It sees that the line is ['', ''] which in a CSV would mean the string "," and therefore non-empty, but for a fixed-width file can happen when the line is empty. ~~Arguably, if the line really were spaces out to the last field, empty strings might be the right thing to extract.~~ (Edit: Nevermind, then the fields would be a series of spaces, not empty.) I'm not sure if the right fix is to change _remove_empty_lines() or its usages.

Comment From: dvalters

I believe this bug also extends to read_excel and read_csv in files that have 'empty' trailing lines, and is more generic than just the read_fwf function.

Code sample

import pandas as pd
from io import StringIO

csv_f = StringIO('''A,B,C,D 
FOO,1,2,3 
FOO,4,5,6 
,,, 
FOO,7,8,9 
,10,11,12 
,,, 
,,, 
,,, 
,,, 
,,, 
''' 
) 

df = pd.read_csv(csv_f, header=None, skip_blank_lines=True)
print(df)

Output

      0    1    2    3
0     A    B    C    D
1   FOO    1    2    3
2   FOO    4    5    6
3   NaN  NaN  NaN  NaN
4   FOO    7    8    9
5   NaN   10   11   12
6   NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN
10  NaN  NaN  NaN  NaN

With read_excel if the workbook has 'blank' lines that contain any sort of formula that result in a null string or blank cell (but not empty cell), similar behaviour is exhibited when skip_blank_lines=True (which is default true anyway according to the docs https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-and-missing-data-handling)

Comment From: dvalters

Also possibly related to: #10164

Comment From: dcdenu4

Any updates on this? I'm still seeing this behavior in pandas 1.0.3

from io import StringIO
csv_file = StringIO('''lucode,desc,val1,val2
1,corn,0.5,2
,,,
3,peas,1,-2
'''
)
df = pandas.read_csv(csv_file, skip_blank_lines=True)
df
   lucode  desc  val1  val2
0     1.0  corn   0.5   2.0
1     NaN   NaN   NaN   NaN
2     3.0  peas   1.0  -2.0

Comment From: jreback

@dcdenu4 or anyone can submit a PR pandas is all volunteer and we have 3000+ open issues

Comment From: mroeschke

This looks to work on master now. Could use a test

In [1]: from io import StringIO
   ...: import pandas as pd
   ...:
   ...: f = StringIO('''A B
   ...:
   ...: C D''')
   ...:
   ...: df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
   ...: print(df)
   0  1
0  A  B
1  C  D

Comment From: lucnguyen93

take

Comment From: jnclt

As mentioned above, the original issue isn't reproducible anymore and there is a test covering this case (test_fwf_skip_blank_lines): https://github.com/pandas-dev/pandas/blob/36dcf519c67a8098572447f7d5a896740fc9c464/pandas/tests/io/parser/test_read_fwf.py#L354

As for read_csv, I think the behavior is as expected (Nan for coma-separated blank values).

I guess this issue can be closed?