Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

pd.read_json() is powerful with its chunksize and nrows arguments. However, sometimes I wish I could read just a specific range of lines - e.g. lines 2000 to 4000 - rather than iterating over the whole dataset and keeping only that range.

Right now my workaround is:

with pd.read_json("path/to/my/fat.json.gz", lines=True, chunksize=2000, nrows=50_000) as reader:
    # keep chunks 20-24, i.e. lines 40_000 to 50_000
    raw = pd.concat([chunk for iteration, chunk in enumerate(reader) if 20 <= iteration < 25])

This is suboptimal because pandas still has to parse chunks 0 through 19 just to discard them. (I read in chunks rather than loading all 50,000 lines at once so I don't pull too much data into memory.)
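The same workaround can be written a bit more tidily with itertools.islice, although this is purely cosmetic - pandas still parses chunks 0 to 19 in order to skip them:

import pandas as pd
from itertools import islice

with pd.read_json("path/to/my/fat.json.gz", lines=True, chunksize=2000) as reader:
    # islice stops the reader after chunk 24, so nrows is no longer needed
    raw = pd.concat(islice(reader, 20, 25))  # chunks 20-24 -> lines 40_000 to 50_000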

Feature Description

I would extend the behaviour of nrows:

  • If it's an integer, read all the lines up to that line (the current behaviour).
  • If it's an iterable (list, tuple...) of length two, read only the lines in that range (see the sketch after this list).
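As a rough sketch of the dispatch this would imply (the helper name is hypothetical; nothing like it exists in pandas today):

def _normalize_nrows(nrows):
    # Hypothetical helper illustrating the proposed nrows semantics.
    if nrows is None:
        return 0, None              # read everything
    if isinstance(nrows, int):
        return 0, nrows             # current behaviour: first nrows lines
    start, stop = nrows             # length-two iterable -> (start, stop) range
    return start, stop

read_json could then skip the first start lines before building any DataFrame, instead of forcing user code to parse and discard them.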

My snippet above would become:

raw = pd.read_json("path/to/my/fat.json.gz", lines=True, nrows=(40_000, 50_000))

Alternative Solutions

Add two new parameters like from= and to= (not ideal, because it would add yet more parameters to an already long signature).

Additional Context

No response

Comment From: Lakshyachitransh

take

Comment From: lithomas1

Thanks for opening a feature request. IMO, this should probably be a separate parameter (like skiprows from read_csv).

I'll leave this open, but I'd like to see if others are also interested before having it implemented in pandas, since I do think this is a fairly niche feature.
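For comparison, read_csv already supports this pattern today via skiprows plus nrows (assuming the header sits on the first line of the file; the skipped rows are still scanned, just never materialised):

import pandas as pd

df = pd.read_csv(
    "path/to/data.csv",
    skiprows=range(1, 40_001),  # skip the first 40_000 data rows (line 0 is the header)
    nrows=10_000,               # then read the next 10_000 rows
)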