I've noticed that when reading large Stata files using the chunksize parameter, the time it takes to create the StataReader object depends on the size of the file. This is a bit surprising, since all of the metadata it needs is contained in the file header, so it seems like it should take the same time regardless of the total file size.
I took a look at the code and it seems like the culprit is this line that reads the entire file into a BytesIO object before parsing the header. I'm not entirely sure what this accomplishes. Ideally it would be nice to be able to create the StataReader object after processing just the header portion of the file.
https://github.com/pandas-dev/pandas/blob/71fc89cd515c3c19230fbab64e979118858b808a/pandas/io/stata.py#L1167-L1175
Comment From: twoertwein
Is that a behavior change you have noticed since 1.5, or did it also exist in previous versions? I think these particular lines of code have been around since 1.3, and even before that (I think) the logic was similar.
I think the issue is that some IO-like objects are not seekable, but read_stata does a lot of seeking internally (some of the compression IO objects don't support seeking). We might be able to change the above line so that it only reads the whole file when the handle isn't seekable.
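To make the seekable/non-seekable distinction concrete, here is a small standalone sketch (plain standard library, not pandas code). Python file objects report whether they support random access through their `seekable()` method: an in-memory buffer does, while the read end of an OS pipe does not. The pipe is only an illustrative stand-in for a forward-only stream.

```python
import io
import os

# An in-memory buffer supports random access.
buf = io.BytesIO(b"header" + b"data")
buf_seekable = buf.seekable()

# The read end of a pipe is a forward-only stream: it cannot seek.
read_fd, write_fd = os.pipe()
os.close(write_fd)  # nothing is written; we only inspect the reader
with os.fdopen(read_fd, "rb") as pipe_reader:
    pipe_seekable = pipe_reader.seekable()

print(buf_seekable, pipe_seekable)
```

Any code that calls `seek()` on a handle like the pipe reader would fail, which is presumably why the whole file gets buffered first.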
Comment From: sterlinm
I don't think it changed in 1.5; I had already noticed it with 1.4. I didn't look back to see when it was introduced or whether it has always been there.
I see the uses of seek in parsing the header but it seems like it should be possible to avoid that.
EDIT: Commented too soon. I think the suggestion to skip that step when the file is seekable is simpler.
Comment From: twoertwein
Feel free to open a PR!
I think the main change is
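    self.handles = get_handle(...)
    # seekable is a method, so it has to be called; the bare attribute is always truthy
    if hasattr(self.handles.handle, "seekable") and self.handles.handle.seekable():
        # seekable handles can be parsed in place, without copying the whole file
        self.path_or_buf = self.handles.handle
    else:
        # non-seekable handles still need to be buffered in memory
        with self.handles:
            self.path_or_buf = BytesIO(self.handles.handle.read())
    # and then appropriate code to close self.handles (and self.path_or_buf in case of BytesIO)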
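As a rough, self-contained illustration of the same idea outside of pandas: the helper below returns a handle unchanged when it is seekable and only falls back to copying everything into a `BytesIO` when it is not. Both `as_seekable` and `ForwardOnlyStream` are hypothetical names made up for this sketch; they are not part of pandas.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a stream that cannot seek."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        # Copy the next chunk into the caller's buffer; 0 signals EOF.
        chunk = self._inner.read(len(b))
        b[: len(chunk)] = chunk
        return len(chunk)


def as_seekable(handle):
    """Return handle itself if it can seek, else an in-memory copy."""
    seekable = getattr(handle, "seekable", None)
    # seekable is a method, so it must be called, not just looked up.
    if callable(seekable) and seekable():
        return handle
    return io.BytesIO(handle.read())


already_seekable = io.BytesIO(b"abc")
passed_through = as_seekable(already_seekable)      # same object, no copy
buffered = as_seekable(ForwardOnlyStream(b"abc"))   # copied into memory
```

With this shape, only the non-seekable case pays the cost of reading the full file up front.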
Comment From: sterlinm
> Feel free to open a PR!
I'll give it a shot over the weekend. Thanks!
Comment From: akx
By the looks of it, https://github.com/pandas-dev/pandas/commit/2f0ada3d430f0cc49fa9ab1e7e2e2a7ec23c9616 was the commit that changed this behavior, way back when.
(Came here via answering https://stackoverflow.com/a/73934594/51685 :-) )
Comment From: sterlinm
Amazing, thanks so much!