Pandas pd.read_fwf removes leading and trailing whitespace

Code Sample, a copy-pastable example if possible

from io import StringIO
import pandas as pd

data = u""" a bbb
 ccdd """

df = pd.read_fwf(StringIO(data), widths=[3, 3], header=None)

The output is

>>> df.iloc[0,0]
u'a'

Expected Output

u' a '

Problem description

Apparently, leading and trailing whitespaces are removed but I want to keep them. Adding dtype options, converters does not solve the problem. Is this expected behaviour?

I do not think this is intended because if we implement the same example with pd.read_csv(), whitespaces are preserved.

from io import StringIO
import pandas as pd

data = u""" a ,bbb
 cc,dd """

df = pd.read_csv(StringIO(data), header=None)

>>> df.iloc[0, 0]
' a '

For consistency, behaviour should be identical.

The problem is also mentioned on Stackoverflow (https://stackoverflow.com/questions/41558138/pandas-read-fwf-removing-leading-and-trailing-whitespace).

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.20.1 pytest: 3.0.7 pip: 9.0.1 setuptools: 27.2.0 Cython: 0.25.2 numpy: 1.12.1 scipy: 0.19.0 xarray: None IPython: 5.3.0 sphinx: 1.5.6 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.2.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.7.3 bs4: 4.6.0 html5lib: 0.999 sqlalchemy: 1.1.9 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None

Comment From: chris-b1

I don't think this is a bug - since fixed width files are by definition white-space padded, stripping that whitespace is a very sane default and probably what most people want.

That said, I think it would be reasonable to add an option to support this.

Comment From: lphk92

I'm working on this at the SciPy 2017 sprints

Comment From: diatomicDisaster

I know this issue is old and closed, but it's the only place I could find where it's been discussed. Is there a way to prevent read_fwf from trimming whitespace? In my particular case I'm trying to split a fixed-width string based on the index of the character in that string. If the first characters are whitespace then this breaks the indexing.

The way I see it: a file where the character widths of each column are fixed, regardless of whether the content of each 'cell' occupies the full width or not. But either way, there are databases that follow this format, so it would perhaps be good to have the option to switch stripping on or off?

Comment From: ignaciohermosillacornejo

Having the same issue, I know this is a niche case but it would be great to have a param to make pandas pandas skip removing leading and trailing white spaces

Comment From: juanr2001

Just following up with this issue, has this been solve?

Comment From: RonaldBarnes

The issue can be solved by setting delimiter="\0", or to any other character that you do not wish stripped from your fields.

Internally, if no delimiter is set (and it seems counter-intuitive to set a delimiter on a fixed width file), then it is set by default to "\r\n\t", and this is used to strip CR/LF from each line, plus CR/LF and all tabs and spaces from each field:

Set delimiter set to whitespace

Then strip delimiters

Perhaps this makes sense when reading in a table, but for a "true" fixed-width file, where data position is important, this is unexpected and undesirable.

I propose having a preserve_whitespace option and eventually deprecating the delimiter option.

I would also set it to True by default, but that would break current behaviour and likely be poorly received.

I've got some code that I could make a pull request from to add this feature and have it work with any explicitly passed delimiter if anyone's interested?

Comment From: RonaldBarnes

I think it's worth repeating: as @tobiasraabe showed, even read_csv will preserve whitespace by default!

Here's an example of using the delimiter to preserve whitespace:

data = """
  leading    |1|a
    white    |2|b
      space  |3|c
"""

df = pd.read_fwf(StringIO(data),
    delimiter="\0",
    names=["col1", "col2", "col3"],
    colspecs=[(0,13),(14,15),(16,17)],
    dtype={"col1": "str", "col2": "str", "col3": "str"},
    )

df.values

[['  leading    ' '1' 'a']
 ['    white    ' '2' 'b']
 ['      space  ' '3' 'c']]

Comment From: RonaldBarnes

Reading through the test suite, is the delimiter used for columns="infer"?

I may have to investigate that, and the use of delimiter(s) there makes perfect sense.

Pandas pd.read_fwf removes leading and trailing whitespace

Code Sample, a copy-pastable example if possible

Expected Output

Problem description

Output of pd.show_versions()

Output of `pd.show_versions()`