see https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469264925
Code Sample, a copy-pastable example if possible
from __future__ import print_function
import pandas as pd
df = pd.DataFrame.from_records(
[
{'id': 'abc', 'data': 'qwerty'},
{'id': 'def', 'data': 'uiop'}
],
index='id'
)
print(df.to_json(orient='records', index=True))
# Prints out:
[{"data":"qwerty"},{"data":"uiop"}]
series = df.squeeze()
print(series.to_json(orient='records', index=True))
# Prints out:
["qwerty","uiop"]
Problem description
When creating a DataFrame
that has two columns, one to be used as an index and another for the data, if you call .to_json(orient='records')
the index is omitted. I know that in theory I should be using a Series
for this, but I'm using it to convert a CSV file into JSONL and I don't know what the CSV file is going to look like ahead of time.
In any case, squeezing the DataFrame
into a Series
doesn't work either. In fact, the bug in Series.to_json
is even worse, as it produces an array of strings instead of an array of dictionaries.
This bug is present in master.
Expected Output
Expected output is:
[{"id":"abc", "data":"qwerty"},{"id":"def","data":"uiop"}]
Output of pd.show_versions()
Comment From: coderop2
@mroeschke in the function description its clearly mentioned that by using orient = "records" the index is not preserved... so what to here..? @dargueta u could use df.reset_index(inplace=True) to make the index as a column before using to_json()
Comment From: TomAugspurger
Right, so the index=
parameter only works for split
and table
orients.
I think we should change the default split
to None
, and
- Make the default
index
beTrue
when orient issplit
ortable
. - Raise if
index
is specified for any other orient.
We should deprecate the current behavior (ignoring index
for other orients) first.
Comment From: coderop2
I think ignoring index for other orients might not be good because sometimes index like dates is used....so in that case the index might get ignored completely
Comment From: TomAugspurger
@coderop2 can you clarify? My proposal is to stop ignoring the index
parameter when it's not valid for the given orient
.
Comment From: dargueta
in the function description its clearly mentioned that by using orient = "records" the index is not preserved.
@coderop2 I disagree about "clearly". The documentation of the parameter states:
Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
I took this to mean the index is always included when the orientation is "records". We should at the very least rephrase this to be more explicit.
@TomAugspurger I agree with most of your proposed solution, though I don't understand why the index should be automatically dropped for orient="records". Dropping it would make sense if the index is the default sequential series, but if I explicitly specify a column as an index, I'd expect it to be included by default.
Comment From: TomAugspurger
We wouldn't want data-dependent behavior for whether the index is included.
On Mon, Mar 4, 2019 at 12:06 PM Diego Argueta notifications@github.com wrote:
in the function description its clearly mentioned that by using orient = "records" the index is not preserved.
@coderop2 https://github.com/coderop2 I disagree about "clearly". The documentation of the parameter states:
Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
I took this to mean the index is always included when the orientation is "records".
@TomAugspurger https://github.com/TomAugspurger I agree with most of your proposed solution, though I don't understand why the index should be automatically dropped for orient="records". Dropping it would make sense if the index is the default sequential series, but if I explicitly specify a column as an index, I'd expect it to be included by default.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469354740, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIl-dMOI4AmRz-6EbJWrOCMhDbBwEks5vTWCpgaJpZM4bZsDw .
Comment From: dargueta
Good point.
Comment From: coderop2
@TomAugspurger I am sorry I misunderstood for what you were explaining...I understand now
@dargueta I was talking about the description given in the code...it goes like this
Encoding/decoding a Dataframe using ``'records'`` formatted JSON.
Note that index labels are not preserved with this encoding.
@TomAugspurger since I am new to open source can I work up a solution and present it here...if that solution is acceptable then I will create a pull req
Comment From: dargueta
I still think dropping the index is a bit of a nasty surprise, especially if my code gets handed a DataFrame that someone else's code created. I don't want to break their code if I change my serialization format on the back end. Do you think there should be a way to override the default behavior, at least for orient="records"
?
Comment From: TomAugspurger
@dargueta we need to consider backwards compatibility. Making df.to_json(orient='records')
include the index would be an API breaking change.
We'll need a period where providing index=True
with the records orient raises. After that time period we can reconsider allowing it.
In the meantime, you'll need to .reset_index()
before calling .to_json
with the records orient.
Comment From: TomAugspurger
@TomAugspurger since I am new to open source can I work up a solution and present it here...if that solution is acceptable then I will create a pull req
That'd be great! Let us know if you need help getting started.
Comment From: dargueta
Oh I didn't mean making it the default, just providing a way to override the default behavior so that we can include it if desired without breaking backwards compatibility.
Comment From: TomAugspurger
I don't think there's a way though, since the default index is currently True, right?
On Tue, Mar 5, 2019 at 10:04 AM Diego Argueta notifications@github.com wrote:
Oh I didn't mean making it the default, just providing a way to override the default behavior so that we can include it if desired without breaking backwards compatibility.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469737412, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIlK3ra6VLIiqn-ybVtscU20H-Pefks5vTpWDgaJpZM4bZsDw .
Comment From: dargueta
Well if we change the default of index
to be None
like you said earlier, then passing True
explicitly would indicate the intent, no?
Comment From: WillAyd
Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.
We already have quite a few JSON serialization formats so I'd be hesitant to add more, especially given this functionality overlaps already with one of the other formats
Comment From: TomAugspurger
Hmm, yes. I worry a bit about people who were previously passing
index=True
... But perhaps that's
not a valid concern.
On Tue, Mar 5, 2019 at 10:40 AM Diego Argueta notifications@github.com wrote:
Well if we change the default of index to be None like you said earlier, then passing True explicitly would indicate the intent, no?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469753148, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIjPmjxRyEGUel0aZkBWwGTJCD63tks5vTp4LgaJpZM4bZsDw .
Comment From: dargueta
Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.
Sadly no, I'm dumping it using lines=True
so that it can be read by a database. I didn't put that in my example to minimize the number of arguments that could be interacting.
We already have quite a few JSON serialization formats so I'd be hesitant to add more
I wasn't suggesting adding another format, only changing the behavior of index
.
Hmm, yes. I worry a bit about people who were previously passing
index=True
I wouldn't say that's an invalid concern, but it does make me question why someone would deliberately pass index=True
and orient='records'
if they weren't trying to get the behavior I wanted, based on what the documentation states. I imagine the change be an issue for users who're dynamically constructing the arguments and are relying on index
being ignored. At the risk of coming off as a pedantic jerk, I'd say relying on that behavior is unwise to begin with.
Comment From: coderop2
@TomAugspurger if we are ignoring the index for the records orient then why cannot we just set it to false. But if we do a error is thrown accrding to:
if not index and orient not in ['split', 'table']:
raise ValueError("'index=False' is only valid when 'orient' is "" 'split' or 'table' ")
Because if you include records in the list and pass index as false no error is thrown
Comment From: ARDivekar
FYI. For those looking for a solution, I got around this by doing a simple: df['index_name'] = df.index
before calling df.to_json(...)
Comment From: tommyjcarpenter
I just want to chime in and say, the docs are very subtle w.r.t. this issue: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html
It is very easy to miss the line Note that index labels are not preserved with this encoding.
The fact that orient=records
, index=True
is a LEGAL combination of parameters, yet it ...doesnt do that, is the unexpected behavior. Pandas basically says "we heard you but nah". I would prefer orient=records
, index=True
be an illegal combination and explicitly tell the user, "you cant do that" [for some reason] rather than just silently ignoring the intent.
Perhaps the solution above df['index_name'] = df.index before calling df.to_json(...)
should be mentioned in those docs to help the reader understand what to do, other than googling and eventually finding this issue.
It would also be nice if index=True
just did this for you, so that it "did" work.