pyarrow has a read_json function that could be used as an alternative parser for pd.read_json:
https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html#pyarrow.json.read_json
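For context, a minimal sketch of what using pyarrow's parser directly looks like (example.jsonl is a hypothetical newline-delimited JSON file):

import pyarrow.json as pj

table = pj.read_json("example.jsonl")  # pyarrow parses line-delimited JSON into a pyarrow.Table
df = table.to_pandas()                 # default conversion to numpy-backed dtypes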
Like we have for read_csv and read_parquet, I would like to propose an engine keyword argument to allow users to pick the parsing backend: engine="ujson"|"pyarrow".
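For illustration, this is how the proposed keyword might be used (the engine argument for read_json does not exist yet; this sketch assumes the proposal above):

import pandas as pd

df = pd.read_json("example.jsonl", lines=True, engine="pyarrow")  # hypothetical engine keyword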
One change I would like to propose, compared to read_csv and read_parquet with engine="pyarrow", would be to return ArrowExtensionArrays instead of converting the result to numpy dtypes, so that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath.
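As a rough sketch of that idea, pyarrow's Table.to_pandas accepts a types_mapper, and something like pd.ArrowDtype (shown here purely as an illustration of keeping pyarrow-backed columns) could be used instead of the default numpy conversion:

import pandas as pd
import pyarrow.json as pj

table = pj.read_json("example.jsonl")
df = table.to_pandas(types_mapper=pd.ArrowDtype)  # columns backed by ArrowExtensionArray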
Thoughts?
Comment From: jreback
+1 for engine and returning arrow extension types
Comment From: abkosar
Hi! I would like to work on this, but this will be my first time contributing to pandas. On the other hand, I use pandas a lot and I am familiar with the read_json method; however, I may need some help along the way.
Thanks!
Comment From: mroeschke
Thanks @abkosar. Just noting that for first-time contributors we recommend tackling issues labeled good first issue, but if you're still interested in this particular issue a pull request would be welcome.
Comment From: abkosar
Thanks for the reply @mroeschke. Yeah, I looked at those too, but since it's not my first time contributing to open source I thought I could give this one a shot. Cool, I will start looking into it.
Comment From: lithomas1
One change I would like to propose, compared to read_csv and read_parquet with engine="pyarrow", would be to return ArrowExtensionArrays instead of converting the result to numpy dtypes, so that the pyarrow.Table returned by read_json still propagates pyarrow objects underneath. Thoughts?
I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.
Comment From: mroeschke
I would prefer returning numpy dtypes instead, to be consistent between engines and across IO methods.
Yeah, understandable. I think https://github.com/pandas-dev/pandas/issues/48957 would allow returning pyarrow types in a more backward-compatible and API-consistent way.
Comment From: abkosar
@mroeschke So I have been looking into the issue. I was mainly looking at read_csv to see how engine is implemented there; however, I have a couple of questions:
- JsonReader doesn't have a _make_engine method implemented (as TextFileReader does), so I think I need to implement that too, just confirming?
- read_csv has its own pyarrow engine wrapper in pandas.io.parsers.arrow_parser_wrapper.py. For json, should I extend that class or create a new wrapper for read_json?
Also, since this is my first time contributing to pandas, do you expect a fully functional PR, or should I open a PR with what I have (which is not much so far, since I wanted to confirm the two questions above) and we iterate back and forth?
Thanks!
Comment From: mroeschke
Not too familiar with the json code base, so it would be easier to discuss those questions in a PR just to see what it would look like.
Generally I would hope it would be something like:
def read_json(...):
    # maybe some validation first
    if engine == "ujson":
        return ExistingParser().read()
    elif engine == "pyarrow":
        return PyArrowParser().read()
Where PyArrowParser and ExistingParser might be able to share some code.
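To make that concrete, a hedged sketch of what such a pyarrow-backed parser could look like (PyArrowParser is the hypothetical name from the snippet above, not existing pandas code):

import pyarrow.json

class PyArrowParser:
    # Hypothetical wrapper; names and structure are illustrative only.
    def __init__(self, path_or_buf):
        self.path_or_buf = path_or_buf

    def read(self):
        table = pyarrow.json.read_json(self.path_or_buf)  # parse line-delimited JSON
        return table.to_pandas()  # convert the pyarrow.Table to a DataFrame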
It's okay to open up a PR in any state, especially since GitHub has a draft PR status.
Comment From: abkosar
Sounds good then. I will add a couple more things and create a PR.