Code Sample, a copy-pastable example if possible

import pandas as pd

# proposed: a usecols argument analogous to read_csv (does not currently exist in read_json)
pd.read_json(path_to_big_file, lines=True, usecols=some_subset_of_cols)

Problem description

The read_csv function's usecols argument is very helpful for reading a subset of columns from a very large file. Unfortunately, it does not exist in read_json, so to avoid memory issues I have to manually read the file line by line and parse out the fields I am interested in. This makes loading JSON files considerably slower.
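
For concreteness, the workaround I currently use looks roughly like this (the column names are just placeholders):

import json
import pandas as pd

wanted = ["col_A", "col_B"]  # placeholder column names

records = []
with open(path_to_big_file) as f:
    for line in f:
        rec = json.loads(line)
        # keep only the fields of interest so memory stays bounded
        records.append({k: rec.get(k) for k in wanted})

df = pd.DataFrame(records, columns=wanted)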

One possible implementation worth considering would be to add hooks that allow user-defined transformations after each line is read and after each line is parsed into JSON. That way we could also support an additional use case: applying a custom decode function to lines when dealing with hybrid proprietary file formats.

def strip_preamble(line):
    '''Example transformation applied to each raw line before it is parsed as JSON.'''
    return line.split("|", 1)[1]

def select_subset(json_rec):
    '''Example selection of a subset of columns from each parsed JSON record.'''
    return {k: json_rec.get(k) for k in ["col_A", "col_B"]}

# line_loaded_hook and json_created_hook are the proposed arguments
pd.read_json(path_to_big_file, lines=True, line_loaded_hook=strip_preamble, json_created_hook=select_subset)

Comment From: WillAyd

Correct me if I am wrong, but whitespace is not significant in JSON, so I don't think the hook you envision is generalizable. Are your memory issues from creation of the DataFrame or purely from parsing the JSON? I suppose there are some options with the former, but the latter would be quite the undertaking, if it is even possible.

Comment From: sam-cohan

I am not sure why whitespace not being significant matters. This is specifically for lines=True, so presumably just before and just after each line is parsed into JSON there could be a hook that passes the object for arbitrary manipulation, rather than having to wait for the entire DataFrame to be created before dropping some of the columns.

Comment From: WillAyd

My point was that, as you described it, the hook would not be generalizable without lines=True, so it would be pretty difficult to implement usecols outside of a limited scope, if at all. I'm not an expert on JSON, so perhaps others can chime in.

You are aware of the chunksize argument, right? If you are running into memory issues parsing the file, it can be helpful at the cost of some performance.
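
A rough sketch of what that looks like (assuming the wanted columns are present in every chunk):

import pandas as pd

wanted = ["col_A", "col_B"]  # example column names
# with lines=True and chunksize, read_json returns an iterator of DataFrames
reader = pd.read_json(path_to_big_file, lines=True, chunksize=100000)
df = pd.concat(chunk[wanted] for chunk in reader)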

Comment From: sam-cohan

I see your point, though I still think it would be helpful to have those arguments when lines=True. I guess one way around my current problem is indeed the chunksize parameter.

Comment From: jreback

@sam-cohan having a hook called on every line would make this unbearably slow. It would require quite a major effort, and the JSON parser is just not flexible enough to allow it. This suggestion is already noted in the design of the next-generation parser being contemplated by @wesm, see: https://github.com/pandas-dev/pandas2/issues/71, though that is still very much in the early design phase.

Comment From: sam-cohan

@jreback yes, I am essentially reading the file line by line and doing the same thing myself, and that is very slow. I was hoping some internal optimization could be built in to discard certain keys from the JSON after each line is read. I guess that is what the enhancement you are referring to is for. Thanks.

Comment From: JackOfSpade

One other workaround is to strip the unneeded columns out of the JSON file once, during development, so subsequent reads in production are faster.
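
Roughly, as a one-off preprocessing step (file names and field names here are just examples):

import json

wanted = ["col_A", "col_B"]  # example field names
# rewrite the file keeping only the wanted fields per line
with open("big.jsonl") as src, open("big_slim.jsonl", "w") as dst:
    for line in src:
        rec = json.loads(line)
        dst.write(json.dumps({k: rec.get(k) for k in wanted}) + "\n")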