See https://github.com/pydata/pandas/pull/4955#issuecomment-25352643
id foo
1 a
1 a
1 a
123a
This works with the snippet in the orig #4488 but fails with the merged #4955 due to the last line.
It's a common example of a fwf file.
Comment From: jtratner
So what should happen in this case? Are you thinking that detection should be entirely based on first line?
id foo
123a
123a
123a
123a
Comment From: jtratner
Also, as an aside, this is probably a clearer example (when it's at the end, seems like it's a special footer)
id foo
1 a
123a
1 a
1 a
Comment From: ghost
Not at all. #4488 handles the first case by looking for a pattern which is common to a majority of records,
4955 fails if there's a single record in the whole dataset that occupies the whole column.
In your example, the first row is indeed the only hint as to the width of the columns, and it may be interpreted as a single column with a header label of "id foo".
Comment From: jreback
cc @alefnula
I think that if detection is 'infer' and "doesn't work", then should raise.
not sure if you can make an accurate determination in all cases though?
Comment From: ghost
True, can't make it work for all cases. I think the first example is more common then yours. For the case where there is a full single-character column seperating the fields, as required by #4955, #4488 will (almost) always work.
Comment From: jreback
cc @alefnula
I think you can do a percentage computation on the mask that you are computing. If they are say the same in say 90% of the lines then you can go with that (or even higher per). I think that would allow for some (but not most) no-delimeter lines
Comment From: alefnula
Yes, this is a problem. The current implementation cannot detect the columns properly if they are not separated by the filler. But the suggested implementation in issue #4488 didn't work for any case I tested :( For the example you just wrote here it returns [8]
. And even for simple cases like:
1 2 3
4 5 6
7 8 9
It returns [2, 3]
...
But this is not the primary reason I haven't tried to fix it. The much bigger problem with the splitting approach is that it cannot handle the cases where columns are strings which can contain the filler, for example:
id Firstname and Lastname Age
1 Douglas Adams 44
1 Arthur Conan Doyle 55
The approach I used supports these cases.
So basically these are the reason why I chose to use the bitmask instead of splitting and counting. (and also it's more memory and time efficient - but this is not that important if we read only 100 lines of data). The only drawback is that it cannot handle the columns that are glued together, but that is clearly stated in the documentation.
Maybe an approach where the outliers will be discarded can be implemented. But I'll have to find a way to do it efficiently...
Comment From: jtratner
Ultimately, you can still pass colspecs directly in ambiguous cases.
Comment From: alefnula
Also I don't think this is a bug. The behavior of inferring columns is clearly documented, and the function behaves as the documentation states it does... But, as everything, it can be improved :)
Comment From: jreback
@alefnula can you squash the mas column wise and compare how many equal as per of total rows? (not sure if that is really useful.....) just a thought
Comment From: ghost
Not true. There may be a bug in in #4488 (or not) but the underlying approach should handle strings such as those just fine. The issue is that it has a free parameter (the threshold) which has no "right" value, unless you know the distribution of all datasets out there (we don't). It seems to me that #4955 is a private case of the threshold scheme in #4488, but with a threshold of 0%, and I'm pretty sure whatever the "best" value is, it's not that.
Documented behaviour is great, but ultimately it's preferable to support more cases rather then document the fact that you don't, no?
Comment From: ghost
Removed bug label. It's a common case that isn't handled, which is a shame but not a bug.
Comment From: alefnula
@jreback I dont actually think that this problem can be solved in a way that will satisfy everyone :( since it's too ambiguous. The provided example:
id foo
1 a
1 a
123a
Can be split in two different (both valid) ways:
id foo id foo
1 a 1 a
1 a or 1 a
123 a 12 3a
And here is an even worse case.
c a foo
12 1 5
23 1 6
4561237
If we try to implement any kind of heuristic that will determine where to split, there will be a case when we will get it wrong. The user will, maybe, get the correct number of columns but the data will be incorrect. So he will get a false conclusion (especially if he has a large data set that he cannot skim through) that we correctly split the columns, but the data will not be correct. In the current implementation, all these ambiguous cases will produce wrong number of column in the beginning, so the user will immediately know that something is not right... My opinion is that it's better to get a completely wrong number of columns in the first place, then to get the right number of columns, but have incorrect data in them because of the ambiguous cases.
Comment From: jreback
@alefnula agreed....false positive is bad here. I was just wondering if you can detect the false positive and raise a helpful error message?
Comment From: alefnula
@jreback In the current implementation you wont get false positives. Either the number column will be correct and you will get valid data or the number of columns wont be correct... The thing that I may be able to implement is to "suspect" that some lines are ambiguous and issue a warning:
Line 44 is suspicious. If the inferred columns are incorrect, check the line 44 or provide a correct
colspecs
argument.
Comment From: ghost
In all the examples you provided #4488 would do the right thing.
Comment From: alefnula
@y-p The problem is that there is no right thing :)
c a foo
12 1 5
23 1 6
4561237
Which of the following is right?
c a foo
12 1 5
23 1 6
45 6123 7
c a foo
12 1 5
23 1 6
456 1 237
c a foo
12 1 5
23 1 6
45 61 237
c a foo
12 1 5
23 1 6
456 12 37
c a foo
12 1 5
23 1 6
45 612 37
Also #4488 does not work for any case I tried :( Maybe you pasted the wrong version of the code there?
Comment From: ghost
None of them, the right one is:
c a foo
12 1 5
23 1 6
456 123 7
Both #4488 and your private case of it take the rightmost "filler" column as the edge of each column. Honestly, most of the fwf datasets I've encountered would not work with #4955, that's the only reason I'm pushing for this.
How about adding a detect_threshold
kwarg and setting the defval to match the behaviour of
4955? a conservative default is fine by me, just don't be too conservative to be useful.
Comment From: alefnula
Oh, now I see what you aim for :) Correct me if I'm wrong, but your point is that most of the fwf
files are aligned left. If this is the case, we had a completely different experience with fwf
files, since most of the ones I saw were aligned right so I would probably interpret this one as:
c a foo
12 1 5
23 1 6
45 61 237
Maybe a aligned=left|right|None
kwarg? In case of None
I'll do the thing I'm doing now, and in the case of left
or rigth
I can be a little bit "smarter".
Comment From: ghost
No, it has nothing to do with alignment. Your code is a fine default, I'll add in a kwarg threshold when I get a chance, preserving this default behaviour. np.
Correction: an aligned kwarg makes sense, but still won't handle the cases I'm concerned with,
unless there's a threshold
argument.
Comment From: jtratner
@y-p if you don't have the column headers, you can't know which one is correct and all of the examples could be valid. Only if the column headers are present (and you can assume that they are both left-aligned and don't have spaces in their name) can you unambiguously decide where to split. (especially if the data for a column has filler in it but is supposed to stay together)
Comment From: ghost
@jtratner the example @alefnula gave aligns some of the column headings differently from the data, that's misleading. Remove the bogus headings and assume All columns are aligned the same way (valid assumption, in my experience), Is there still ambiguity there?
Comment From: jtratner
I think you're right that the alignment should be consistent, but the example still seems dependent upon the assumption that the first row are column headings, which isn't always true with fwf that I've seen.
Comment From: ghost
The algo I had in mind doesn't isn't dependent on col headings in any way. We've had plenty of discussion on this at this point, I suggest we leave it for now until I or someone else provides actual code to review. @alefnula's code is a fine enhancement so we're ahead already. In the future, we can do more.
pushed to 0.14, it's not a blocker. (or someday)
Comment From: mroeschke
Since there hasn't been much activity for this feature since and there appears to be significant complexity for determining the correct behavior I am going to close for now, but can reopen if there's more interest