Code Sample, a copy-pastable example if possible
from io import StringIO
import pandas as pd
data = u""" a bbb
ccdd """
df = pd.read_fwf(StringIO(data), widths=[3, 3], header=None)
The output is
>>> df.iloc[0,0]
u'a'
Expected Output
u' a '
Problem description
Apparently, leading and trailing whitespaces are removed but I want to keep them. Adding dtype options, converters does not solve the problem. Is this expected behaviour?
I do not think this is intended because if we implement the same example with pd.read_csv()
, whitespaces are preserved.
from io import StringIO
import pandas as pd
data = u""" a ,bbb
cc,dd """
df = pd.read_csv(StringIO(data), header=None)
>>> df.iloc[0, 0]
' a '
For consistency, behaviour should be identical.
The problem is also mentioned on Stackoverflow (https://stackoverflow.com/questions/41558138/pandas-read-fwf-removing-leading-and-trailing-whitespace).
Output of pd.show_versions()
Comment From: chris-b1
I don't think this is a bug - since fixed width files are by definition white-space padded, stripping that whitespace is a very sane default and probably what most people want.
That said, I think it would be reasonable to add an option to support this.
Comment From: lphk92
I'm working on this at the SciPy 2017 sprints
Comment From: diatomicDisaster
I know this issue is old and closed, but it's the only place I could find where it's been discussed. Is there a way to prevent read_fwf from trimming whitespace? In my particular case I'm trying to split a fixed-width string based on the index of the character in that string. If the first characters are whitespace then this breaks the indexing.
The way I see it: a file where the character widths of each column are fixed, regardless of whether the content of each 'cell' occupies the full width or not. But either way, there are databases that follow this format, so it would perhaps be good to have the option to switch stripping on or off?
Comment From: ignaciohermosillacornejo
Having the same issue, I know this is a niche case but it would be great to have a param to make pandas pandas skip removing leading and trailing white spaces
Comment From: juanr2001
Just following up with this issue, has this been solve?
Comment From: RonaldBarnes
The issue can be solved by setting delimiter="\0"
, or to any other character that you do not wish stripped from your fields.
Internally, if no delimiter is set (and it seems counter-intuitive to set a delimiter on a fixed width file), then it is set by default to "\r\n\t", and this is used to strip CR/LF from each line, plus CR/LF and all tabs and spaces from each field:
Set delimiter set to whitespace
Perhaps this makes sense when reading in a table, but for a "true" fixed-width file, where data position is important, this is unexpected and undesirable.
I propose having a preserve_whitespace
option and eventually deprecating the delimiter
option.
I would also set it to True by default, but that would break current behaviour and likely be poorly received.
I've got some code that I could make a pull request from to add this feature and have it work with any explicitly passed delimiter if anyone's interested?
Comment From: RonaldBarnes
I think it's worth repeating: as @tobiasraabe showed, even read_csv will preserve whitespace by default!
Here's an example of using the delimiter to preserve whitespace:
data = """
leading |1|a
white |2|b
space |3|c
"""
df = pd.read_fwf(StringIO(data),
delimiter="\0",
names=["col1", "col2", "col3"],
colspecs=[(0,13),(14,15),(16,17)],
dtype={"col1": "str", "col2": "str", "col3": "str"},
)
df.values
[[' leading ' '1' 'a']
[' white ' '2' 'b']
[' space ' '3' 'c']]
Comment From: RonaldBarnes
Reading through the test suite, is the delimiter
used for columns="infer"
?
I may have to investigate that, and the use of delimiter(s) there makes perfect sense.