As of now, read_fwf infers the field positions using only the first 100 rows of the file, and this number is not easily modifiable. However, if a field has values for only a few rare objects, it can be missed completely! So it would be great if pandas used many more rows by default (100 is quite a small number) - why not something like 10000? Or at least provide a way to increase this number and/or a parameter like infer_using_whole_file=False.
If anyone finds this issue and needs an immediate solution - I personally use monkey-patching:
import pandas as pd

# Wrap the original detect_colspecs so that column positions are inferred
# from up to 100000 rows instead of the default 100.
_detect_colspecs = pd.io.parsers.FixedWidthReader.detect_colspecs
pd.io.parsers.FixedWidthReader.detect_colspecs = lambda self, n=100000, skiprows=None: _detect_colspecs(self, n, skiprows)
Comment From: gfyoung
Indeed! Would be nice to parameterize how many rows you need to infer field positions.
Comment From: aldanor
Yes, this messes things up quite often: columns whose strings get longer towards the end of the data may end up truncated.
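For anyone who wants to see the failure mode concretely, here is a minimal, self-contained sketch (the data is made up, not taken from any real report): a column that only widens after the first 100 rows gets silently truncated, and widening the inference window recovers it.

import io
import pandas as pd

# Fixed-width payload whose second field only grows wide after row 100,
# i.e. outside the default inference window.
lines = [f"{i:03d}   abc" for i in range(150)]
lines += [f"{i:03d}   abcdefghij" for i in range(150, 160)]
data = "\n".join(lines)

# Column boundaries are inferred from the first 100 rows only, so the second
# column is detected as 3 characters wide and the later values are cut off.
truncated = pd.read_fwf(io.StringIO(data), header=None)
print(truncated.iloc[-1, 1])  # 'abc' -- truncated

# A wider inference window (the infer_nrows keyword mentioned further down,
# or the monkey patch above) keeps the full value.
full = pd.read_fwf(io.StringIO(data), header=None, infer_nrows=len(lines))
print(full.iloc[-1, 1])  # 'abcdefghij'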
Comment From: amotl
Hi there,
first things first: Thanks a stack for outlining this important detail, it just saved our implementation. 🌻
Provide a way to increase this number and/or a parameter like infer_using_whole_file=False.
Would be nice to parameterize how many rows you need to infer field positions.
On this matter, we wanted to report that an infer_nrows argument has apparently been added to read_fwf. So, while it still does not satisfy the need for an infer_using_whole_file flag, the original gist can now be implemented as read_fwf(colspecs="infer", infer_nrows=100000), which may be more convenient than the monkey patch, depending on the scenario.
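For completeness, a runnable sketch of that call (the file name is made up and 100000 is an arbitrary budget); note that colspecs="infer" is already the default, so passing infer_nrows alone is enough.

import pandas as pd

# Infer column boundaries from the first 100000 data rows instead of the
# default 100 -- the same effect as the monkey patch above, but via the
# public keyword.
df = pd.read_fwf("observations.fwf", colspecs="infer", infer_nrows=100000)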
With kind regards, Andreas.
Comment From: amotl
the need for an infer_using_whole_file flag
It looks like using infer_nrows=np.Infinity works well, see https://github.com/earthobservations/wetterdienst/commit/a4b15125d.
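A hedged sketch of that approach (the file name is made up). np.Infinity is just an alias for np.inf, and the alias was removed in NumPy 2.0, so np.inf is the safer spelling:

import numpy as np
import pandas as pd

# Passing infinity reportedly makes read_fwf scan the whole file when
# inferring column boundaries (per the commit linked above); a sufficiently
# large integer would work the same way.
df = pd.read_fwf("observations.fwf", colspecs="infer", infer_nrows=np.inf)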