Pandas DataFrame.to_json silently ignores index parameter for most orients.

see https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469264925

Code Sample, a copy-pastable example if possible

from __future__ import print_function

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {'id': 'abc', 'data': 'qwerty'},
        {'id': 'def', 'data': 'uiop'}
    ],
    index='id'
)

print(df.to_json(orient='records', index=True))

# Prints out:
[{"data":"qwerty"},{"data":"uiop"}]

series = df.squeeze()
print(series.to_json(orient='records', index=True))

# Prints out:
["qwerty","uiop"]

Problem description

When creating a DataFrame that has two columns, one to be used as an index and another for the data, if you call .to_json(orient='records') the index is omitted. I know that in theory I should be using a Series for this, but I'm using it to convert a CSV file into JSONL and I don't know what the CSV file is going to look like ahead of time.

In any case, squeezing the DataFrame into a Series doesn't work either. In fact, the bug in Series.to_json is even worse, as it produces an array of strings instead of an array of dictionaries.

This bug is present in master.

Expected Output

Expected output is:

[{"id":"abc", "data":"qwerty"},{"id":"def","data":"uiop"}]

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.24.1 pytest: 4.3.0 pip: 19.0.3 setuptools: 40.8.0 Cython: 0.29.6 numpy: 1.16.2 scipy: None pyarrow: 0.12.1 xarray: None IPython: None sphinx: None patsy: None dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: 2.7.7 (dt dec pq3 ext lo64) jinja2: None s3fs: 0.2.0 fastparquet: 0.2.1 pandas_gbq: None pandas_datareader: None gcsfs: None

Comment From: coderop2

@mroeschke in the function description its clearly mentioned that by using orient = "records" the index is not preserved... so what to here..? @dargueta u could use df.reset_index(inplace=True) to make the index as a column before using to_json()

Comment From: TomAugspurger

Right, so the index= parameter only works for split and table orients.

I think we should change the default split to None, and

Make the default index be True when orient is split or table.
Raise if index is specified for any other orient.

We should deprecate the current behavior (ignoring index for other orients) first.

Comment From: coderop2

I think ignoring index for other orients might not be good because sometimes index like dates is used....so in that case the index might get ignored completely

Comment From: TomAugspurger

@coderop2 can you clarify? My proposal is to stop ignoring the index parameter when it's not valid for the given orient.

Comment From: dargueta

in the function description its clearly mentioned that by using orient = "records" the index is not preserved.

@coderop2 I disagree about "clearly". The documentation of the parameter states:

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

I took this to mean the index is always included when the orientation is "records". We should at the very least rephrase this to be more explicit.

@TomAugspurger I agree with most of your proposed solution, though I don't understand why the index should be automatically dropped for orient="records". Dropping it would make sense if the index is the default sequential series, but if I explicitly specify a column as an index, I'd expect it to be included by default.

Comment From: TomAugspurger

We wouldn't want data-dependent behavior for whether the index is included.

On Mon, Mar 4, 2019 at 12:06 PM Diego Argueta notifications@github.com wrote:

in the function description its clearly mentioned that by using orient = "records" the index is not preserved.

@coderop2 https://github.com/coderop2 I disagree about "clearly". The documentation of the parameter states:

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

I took this to mean the index is always included when the orientation is "records".

@TomAugspurger https://github.com/TomAugspurger I agree with most of your proposed solution, though I don't understand why the index should be automatically dropped for orient="records". Dropping it would make sense if the index is the default sequential series, but if I explicitly specify a column as an index, I'd expect it to be included by default.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469354740, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIl-dMOI4AmRz-6EbJWrOCMhDbBwEks5vTWCpgaJpZM4bZsDw .

Comment From: dargueta

Good point.

Comment From: coderop2

@TomAugspurger I am sorry I misunderstood for what you were explaining...I understand now @dargueta I was talking about the description given in the code...it goes like this Encoding/decoding a Dataframe using ``'records'`` formatted JSON. Note that index labels are not preserved with this encoding. @TomAugspurger since I am new to open source can I work up a solution and present it here...if that solution is acceptable then I will create a pull req

Comment From: dargueta

I still think dropping the index is a bit of a nasty surprise, especially if my code gets handed a DataFrame that someone else's code created. I don't want to break their code if I change my serialization format on the back end. Do you think there should be a way to override the default behavior, at least for orient="records"?

Comment From: TomAugspurger

@dargueta we need to consider backwards compatibility. Making df.to_json(orient='records') include the index would be an API breaking change.

We'll need a period where providing index=True with the records orient raises. After that time period we can reconsider allowing it.

In the meantime, you'll need to .reset_index() before calling .to_json with the records orient.

Comment From: TomAugspurger

@TomAugspurger since I am new to open source can I work up a solution and present it here...if that solution is acceptable then I will create a pull req

That'd be great! Let us know if you need help getting started.

Comment From: dargueta

Oh I didn't mean making it the default, just providing a way to override the default behavior so that we can include it if desired without breaking backwards compatibility.

Comment From: TomAugspurger

I don't think there's a way though, since the default index is currently True, right?

On Tue, Mar 5, 2019 at 10:04 AM Diego Argueta notifications@github.com wrote:

Oh I didn't mean making it the default, just providing a way to override the default behavior so that we can include it if desired without breaking backwards compatibility.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469737412, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIlK3ra6VLIiqn-ybVtscU20H-Pefks5vTpWDgaJpZM4bZsDw .

Comment From: dargueta

Well if we change the default of index to be None like you said earlier, then passing True explicitly would indicate the intent, no?

Comment From: WillAyd

Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.

We already have quite a few JSON serialization formats so I'd be hesitant to add more, especially given this functionality overlaps already with one of the other formats

Comment From: TomAugspurger

Hmm, yes. I worry a bit about people who were previously passing index=True... But perhaps that's not a valid concern.

On Tue, Mar 5, 2019 at 10:40 AM Diego Argueta notifications@github.com wrote:

Well if we change the default of index to be None like you said earlier, then passing True explicitly would indicate the intent, no?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25513#issuecomment-469753148, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIjPmjxRyEGUel0aZkBWwGTJCD63tks5vTp4LgaJpZM4bZsDw .

Comment From: dargueta

Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.

Sadly no, I'm dumping it using lines=True so that it can be read by a database. I didn't put that in my example to minimize the number of arguments that could be interacting.

We already have quite a few JSON serialization formats so I'd be hesitant to add more

I wasn't suggesting adding another format, only changing the behavior of index.

Hmm, yes. I worry a bit about people who were previously passing index=True

I wouldn't say that's an invalid concern, but it does make me question why someone would deliberately pass index=True and orient='records' if they weren't trying to get the behavior I wanted, based on what the documentation states. I imagine the change be an issue for users who're dynamically constructing the arguments and are relying on index being ignored. At the risk of coming off as a pedantic jerk, I'd say relying on that behavior is unwise to begin with.

Comment From: coderop2

@TomAugspurger if we are ignoring the index for the records orient then why cannot we just set it to false. But if we do a error is thrown accrding to: if not index and orient not in ['split', 'table']: raise ValueError("'index=False' is only valid when 'orient' is "" 'split' or 'table' ") Because if you include records in the list and pass index as false no error is thrown

Comment From: ARDivekar

FYI. For those looking for a solution, I got around this by doing a simple: df['index_name'] = df.index before calling df.to_json(...)

Comment From: tommyjcarpenter

I just want to chime in and say, the docs are very subtle w.r.t. this issue: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html

It is very easy to miss the line Note that index labels are not preserved with this encoding.

The fact that orient=records, index=True is a LEGAL combination of parameters, yet it ...doesnt do that, is the unexpected behavior. Pandas basically says "we heard you but nah". I would prefer orient=records, index=True be an illegal combination and explicitly tell the user, "you cant do that" [for some reason] rather than just silently ignoring the intent.

Perhaps the solution above df['index_name'] = df.index before calling df.to_json(...) should be mentioned in those docs to help the reader understand what to do, other than googling and eventually finding this issue.

It would also be nice if index=True just did this for you, so that it "did" work.

Pandas DataFrame.to_json silently ignores index parameter for most orients.

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`