Feature Type
- [ ] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
`pd.read_json()` is powerful with its `chunksize` and `nrows` arguments. However, sometimes I wish I could read just a certain range of lines - e.g. between 2000 and 4000 - rather than iterating over the whole dataset and keeping only that range.
Right now my workaround is:
```python
with pd.read_json("path/to/my/fat.json.gz", lines=True, chunksize=2000, nrows=50_000) as reader:
    # Keep chunks 20-24, i.e. lines 40_000 to 50_000
    raw = pd.concat([chunk for iteration, chunk in enumerate(reader) if 20 <= iteration < 25])
```
This is suboptimal because pandas still has to read and parse chunks 0 to 19 before discarding them. (I iterate in chunks rather than reading all 50,000 lines in one go and slicing, to avoid pulling too much into memory at once.)
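In the meantime, the skipped lines do not have to be JSON-parsed at all if the raw lines are sliced before they reach pandas. A minimal sketch of that alternative workaround, assuming a gzipped JSON-lines file and using only the standard library (the bounds are illustrative):

```python
import gzip
from io import StringIO
from itertools import islice

import pandas as pd

# Skip the first 40_000 raw lines cheaply, then hand only the
# 40_000-50_000 window to pandas for actual JSON parsing.
with gzip.open("path/to/my/fat.json.gz", "rt") as f:
    window = "".join(islice(f, 40_000, 50_000))

raw = pd.read_json(StringIO(window), lines=True)
```

This still streams through the leading lines, but it avoids parsing them, which is the expensive part.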
Feature Description
I would change the default behaviour of `nrows`:
- If it's an integer, read all the lines up to that line (the current behaviour).
- If it's an iterable (`list`, `tuple`, ...) of length two, read only the lines within that range.
My snippet above would become:
```python
raw = pd.read_json("path/to/my/fat.json.gz", lines=True, nrows=(40_000, 50_000))
```
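Until something like this exists, the proposed semantics can be approximated with a small wrapper over the chunked reader. A sketch, where the helper name `read_json_range` is hypothetical; note it still parses the skipped chunks, which a native `nrows=(start, stop)` could avoid at the parser level:

```python
import pandas as pd

def read_json_range(path, start, stop, chunksize=2000):
    # Hypothetical helper approximating nrows=(start, stop):
    # read at most `stop` lines in chunks, keeping only the rows
    # whose absolute line index is >= start.
    parts = []
    offset = 0
    with pd.read_json(path, lines=True, chunksize=chunksize, nrows=stop) as reader:
        for chunk in reader:
            keep_from = max(start - offset, 0)
            if keep_from < len(chunk):
                parts.append(chunk.iloc[keep_from:])
            offset += len(chunk)
    return pd.concat(parts, ignore_index=True)

raw = read_json_range("path/to/my/fat.json.gz", 40_000, 50_000)
```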
Alternative Solutions
Add two new parameters like `from=` / `to=` (not ideal because it would add yet more parameters to an already long signature).
Additional Context
No response
Comment From: Lakshyachitransh
take
Comment From: lithomas1
Thanks for opening a feature request. IMO, this should probably be a separate parameter (like `skiprows` from `read_csv`).
I'll leave this open, but I'd like to see if others are also interested before having it implemented in pandas, since I do think this will be a more niche feature.
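For comparison, `read_csv` already supports this kind of window by combining `skiprows` with `nrows`. A minimal sketch of that precedent (the CSV file name is illustrative):

```python
import pandas as pd

# Skip data rows 1..39_999 (row 0 is kept as the header), then read
# the next 10_000 rows - i.e. a 40_000-50_000 window over the data.
df = pd.read_csv("data.csv", skiprows=range(1, 40_000), nrows=10_000)
```

A similar `skiprows`-style parameter on `read_json` would cover the range use case without changing the meaning of `nrows`.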