As of now, read_fwf infers the field positions using only the first 100 rows of the file, and this number is not easily modifiable. However, if a field has values for only a few rare objects, it can be missed completely! So it would be great if pandas used many more rows by default (100 is quite a small number) - why not something like 10000? Or at least provide a way to increase this number and/or a parameter like infer_using_whole_file=False.
If anyone finds this issue and needs an immediate solution - I personally use monkey-patching:
import pandas as pd

# Wrap the original detect_colspecs so that column positions are inferred
# from up to 100000 rows instead of the default 100.
_detect_colspecs = pd.io.parsers.FixedWidthReader.detect_colspecs
pd.io.parsers.FixedWidthReader.detect_colspecs = lambda self, n=100000, skiprows=None: _detect_colspecs(self, n, skiprows)
Comment From: gfyoung
Indeed! Would be nice to parameterize how many rows you need to infer field positions.
Comment From: aldanor
Yes, this messes things up quite often: columns whose strings get longer towards the end of the data may end up truncated.
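For anyone who wants to see the failure mode concretely, here is a minimal, self-contained sketch (the data is made up, not taken from any real report): a column that only widens after the first 100 rows gets silently truncated, and widening the inference window recovers it.

import io
import pandas as pd

# Fixed-width payload whose second field only grows wide after row 100,
# i.e. outside the default inference window.
lines = [f"{i:03d}   abc" for i in range(150)]
lines += [f"{i:03d}   abcdefghij" for i in range(150, 160)]
data = "\n".join(lines)

# Column boundaries are inferred from the first 100 rows only, so the second
# column is detected as 3 characters wide and the later values are cut off.
truncated = pd.read_fwf(io.StringIO(data), header=None)
print(truncated.iloc[-1, 1])  # 'abc' -- truncated

# A wider inference window (the infer_nrows keyword mentioned further down,
# or the monkey patch above) keeps the full value.
full = pd.read_fwf(io.StringIO(data), header=None, infer_nrows=len(lines))
print(full.iloc[-1, 1])  # 'abcdefghij'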
Comment From: amotl
Hi there,
first things first: Thanks a stack for outlining this important detail, it just saved our implementation. 🌻
Provide a way to increase this number and/or a parameter like infer_using_whole_file=False.
Would be nice to parameterize how many rows you need to infer field positions.
On this matter, we wanted to report that an infer_nrows argument has apparently been added to read_fwf. So, while it still does not satisfy the need for an infer_using_whole_file flag, the original gist can now be implemented as read_fwf(colspecs="infer", infer_nrows=100000), which may be more convenient than the monkey patch, depending on the scenario.
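For completeness, a runnable sketch of that call (the file name is made up and 100000 is an arbitrary budget); note that colspecs="infer" is already the default, so passing infer_nrows alone is enough.

import pandas as pd

# Infer column boundaries from the first 100000 data rows instead of the
# default 100 -- the same effect as the monkey patch above, but via the
# public keyword.
df = pd.read_fwf("observations.fwf", colspecs="infer", infer_nrows=100000)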
With kind regards, Andreas.
Comment From: amotl
the need for an infer_using_whole_file flag
It looks like using infer_nrows=np.Infinity works well, see https://github.com/earthobservations/wetterdienst/commit/a4b15125d.
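A hedged sketch of that approach (the file name is made up). np.Infinity is just an alias for np.inf, and the alias was removed in NumPy 2.0, so np.inf is the safer spelling:

import numpy as np
import pandas as pd

# Passing infinity reportedly makes read_fwf scan the whole file when
# inferring column boundaries (per the commit linked above); a sufficiently
# large integer would work the same way.
df = pd.read_fwf("observations.fwf", colspecs="infer", infer_nrows=np.inf)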