pyarrow has a read_json function that could be used as an alternative parser for pd.read_json: https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html#pyarrow.json.read_json
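For reference, a minimal sketch of the pyarrow parser on its own (the file name is just an example; pyarrow.json.read_json expects newline-delimited JSON and returns a pyarrow.Table):

import pyarrow.json as pa_json

# Parse a newline-delimited JSON file into a pyarrow.Table.
table = pa_json.read_json("records.jsonl")

# Today a user would convert to pandas themselves; the proposal is to do this inside read_json.
df = table.to_pandas()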

Like we have for read_csv and read_parquet, I would like to propose an engine keyword argument to allow users to pick the parsing backend: engine="ujson"|"pyarrow"
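Roughly, the user-facing call would look like this (both engine values are part of the proposal; read_json has no engine keyword today):

import pandas as pd

df = pd.read_json("data.json", engine="ujson")    # current default parser
df = pd.read_json("data.json", engine="pyarrow")  # proposed pyarrow-backed parser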

One change I would like to propose, compared to read_csv and read_parquet with engine="pyarrow", would be to return ArrowExtensionArrays instead of converting the result to numpy dtypes, so that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath.
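To illustrate the dtype difference, assuming pyarrow.json.read_json hands back a pyarrow.Table and using the existing types_mapper hook (pd.ArrowDtype requires a recent pandas):

import pandas as pd
import pyarrow as pa

table = pa.table({"a": [1, 2, None], "b": ["x", "y", "z"]})

# What read_csv/read_parquet with engine="pyarrow" do today: convert the result to numpy dtypes.
df_numpy = table.to_pandas()

# What is proposed here: keep the data in ArrowExtensionArray-backed columns.
df_arrow = table.to_pandas(types_mapper=pd.ArrowDtype)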

Thoughts?

Comment From: jreback

+1 for engine and returning arrow extension types

Comment From: abkosar

Hi! I would like to work on this, though it will be my first time contributing to pandas. On the other hand, I use pandas a lot and I am familiar with the read_json method; however, I may need some help along the way.

Thanks!

Comment From: mroeschke

Thanks @abkosar. Just noting that for first time contributors we recommend tackling issues labeled good first issue, but if you're still interested in this particular issue a pull request would be welcome.

Comment From: abkosar

Thanks for the reply @mroeschke. Yeah, I looked at those too, but since it's not my first time contributing to open source I thought I could give this one a shot. Cool, I will start looking into it.

Comment From: lithomas1

One change I would like to propose, compared to read_csv and read_parquet with engine="pyarrow", would be to return ArrowExtensionArrays instead of converting the result to numpy dtypes, so that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath.

Thoughts?

I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.

Comment From: mroeschke

I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.

Yeah, understandable. I think https://github.com/pandas-dev/pandas/issues/48957 would allow returning pyarrow types in a more backward-compatible and API-consistent way.
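Something like the following from the user's side, where the default stays numpy-backed and pyarrow dtypes are an explicit opt-in (the keyword name here is illustrative, not necessarily what that issue settles on):

# numpy-backed result, consistent with read_csv/read_parquet
df = pd.read_json("data.json", engine="pyarrow")

# explicit opt-in to pyarrow-backed extension dtypes
df = pd.read_json("data.json", engine="pyarrow", dtype_backend="pyarrow")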

Comment From: abkosar

@mroeschke So I have been looking into the issue. I was mainly looking at read_csv to see how engine is implemented there; however, I have a couple of questions:

- JsonReader doesn't have the _make_engine method (TextFileReader) implemented, so I think I need to implement that too, just confirming?
- read_csv has its own pyarrow engine wrapper in pandas.io.parsers.arrow_parser_wrapper.py. For json, should I extend that class or create a new wrapper for read_json? (I have sketched the rough shape I had in mind below.)
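For concreteness, something along these lines (class and method names are hypothetical, loosely modeled on the existing csv wrapper):

import pyarrow.json as pa_json

# Hypothetical pyarrow-backed reader for read_json, loosely modeled on
# pandas.io.parsers.arrow_parser_wrapper.ArrowParserWrapper.
class ArrowJsonParserWrapper:
    def __init__(self, path_or_buf):
        self.path_or_buf = path_or_buf

    def read(self):
        # Let pyarrow do the parsing, then hand the result back as a DataFrame.
        table = pa_json.read_json(self.path_or_buf)
        return table.to_pandas()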

Also, since this is my first time contributing to pandas, do you expect a fully functional PR, or should I make a PR with what I have (which is not far yet, since I wanted to confirm the two questions above) and we iterate back and forth?

Thanks!

Comment From: mroeschke

I'm not too familiar with the json code base, so it would be easier to discuss those questions in a PR, just to see what it would look like.

Generally I would hope it would be something like:

def read_json(...):
    # maybe some validation first
    if engine == "ujson":
        return ExistingParser().read()
    elif engine == "pyarrow":
        return PyArrowParser().read()

Where PyArrowParser and ExistingParser might be able to share some code.

It's okay to open a PR in any state, especially since GitHub has a draft PR status.

Comment From: abkosar

Sounds good then. I will add a couple more things and create a PR.