cc @Komnomnomnom

I'm using a recent anaconda build on Windows, which includes pandas v0.12.0. I'm attempting to use a multi-index header, write it out to a json file, import it back, and get the same formatted dataframe. Instead, the column headers come out stacked vertically.

import numpy as np
import pandas as pd
pd.set_option('html', False)

index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'], ['one', 'two', 'three']],
                      labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo', 'bar'])
df = pd.DataFrame(np.random.randn(10, 3), index=index, columns=['A', 'B', 'C'])
dft = df.transpose()
dft
dft.to_json(r'D:\xxmt.json', orient='split')
dftj = pd.read_json(r'D:\xxmt.json', orient='split')
dftj

Comment From: jnorwood

I should also point out that the json output omits the names info that would be required to restore the df completely (names=['foo', 'bar'])

Comment From: jnorwood

ok, if this is going to be an enhancement request, let me make another suggestion... The stored json data should include all the info necessary to restore the same dataframe ... so it shouldn't be necessary for the reader to specify the orient parameter when reading.
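A minimal sketch of this suggestion: wrap the pandas payload in an envelope that records the orient, so the reader can dispatch on it instead of being told out of band. The function names and the envelope keys (`orient`, `body`) are mine for illustration, not a pandas API:

```python
import io
import json

import pandas as pd


def to_self_describing_json(df, orient="split"):
    # Wrap the pandas payload rather than injecting an extra key into it,
    # so list-valued orients ('records', 'values') work the same way.
    payload = json.loads(df.to_json(orient=orient))
    return json.dumps({"orient": orient, "body": payload})


def from_self_describing_json(s):
    # Dispatch on the embedded orient; no out-of-band knowledge needed.
    wrapper = json.loads(s)
    return pd.read_json(io.StringIO(json.dumps(wrapper["body"])),
                        orient=wrapper["orient"])


df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
restored = from_self_describing_json(to_self_describing_json(df))
assert restored.equals(df)
```

A reader that doesn't know about the envelope can still get at the plain pandas payload under `body` and ignore the rest.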

Comment From: jreback

@jnorwood so you are using these for 'storage'? JSON is not traditionally a reflexive serialization format. IOW, it takes a schema to ensure deserialization of a serialization will work. (as types can be subtly different, ordering is not guaranteed, etc.). CSV is like this too, SQL is partially like this. HDF5 can be reflexive as it stores the meta data.

Comment From: jnorwood

The current JSON output does (almost) have the description of a dataframes structure including support for multi-index headers. The current output with orient='split' is already implementing an implied schema based on the dataframe structure. It is missing "names". I think it could also add "orient". Then all that is left is for the reader to read back and restore the info correctly. In general, if pandas is going to support multi-index headers fully, then I think it needs to provide a way to handle it correctly in whatever output you export to and import from. If you just state what the schema is then users can generate the proper format to still take advantage of the multi-index, which is a nice feature of the dataframe.

Here is a sample json output with orient='split'

{"columns":[["foo","one"],["foo","two"],["foo","three"],["bar","one"],["bar","two"],["baz","two"],["baz","three"],["qux","one"],["qux","two"],["qux","three"]],"index":["A","B","C"],"data":[[1.4363830373,-0.3037434653,-1.0501832047,0.3830191489,-0.6174135405,-1.2378511344,-0.6473196664,1.1436025073,0.2124401475,0.5732940926],[0.6103747676,-0.8948600162,0.5987307019,-0.3098951553,-1.659220121,-1.8047032839,1.006122262,-0.9470006317,-0.2870211491,-1.9018432636],[-0.6036721203,0.7321329545,-0.6730598856,0.0215133965,-0.422750222,-0.6237148299,-0.2726430101,-0.7192759505,-0.3399415151,1.6568583209]]}

Comment From: Komnomnomnom

@jnorwood FYI to_json doesn't really support MultiIndex or higher dim containers like Panel and NDPanel right now. I'm hoping to get a PR in for 0.13 to improve the JSON docs.

Also I don't think any metadata (like the orient) should be included to aid deserialisation. I'm thinking of the case where you are deserialising outside of pandas; there it just causes confusion.

Comment From: jnorwood

I think there is more confusion without including the orient option. If someone has generated a json document for you to read in pandas, currently you'd have to guess what orient option was originally used during generation so you can enter the correct reader options. If the writer includes the orient option in the json attributes, then the reader can simply retrieve it as it does the other attributes. If some reader doesn't know what the attribute means it can just ignore it, right?

It isn't clear to me what other reader you are trying to support. I'd guess the most common use case would be pandas generating the json and then pandas reading it back. My particular use case is I'd like to generate a json format file or string, and have pandas read and display it as a multi-index table in some cases, or as a single index table in other cases. Of course we could invent another layer of communication to tell pandas what type of table orientation the json file contains, but it seems more natural to just put the info in the json file.

Really, I think you should try to consistently support all the features that pandas supports with all the readers and writers, and do it in a straightforward way that is documented. Someone has apparently made an attempt to support multi-index columns in the json writer, and it seems like it is a good start. It would be a good feature to support a simple json format that other apps could generate and so enable easy description of a multi-index table. It is a lot simpler solution than using hdf5 generation. I can see the potential importance of both options.

Comment From: jtratner

One point implied by @jnorwood stands out: what's the use-case for to_json? I'd assume it's less space efficient than other formats.

Comment From: Komnomnomnom

Well my main use case is (well, was more than is) interoperability. Specifically with a web frontend, deserialisation in JS, for plotting etc.

Comment From: jreback

@Komnomnomnom progress on this?

Comment From: Komnomnomnom

@jreback as of right now my priorities for pandas json are MultiIndex support and Panel (+ higher dim) support. No work has been done in this direction yet, but I think 0.14 is a good target to aim for.

So this issue to track MultiIndex support and #5428 for tracking Panel support?

(As stated above I'm not a fan of the idea of adding extra metadata onto the JSON-encoded pandas object. I'd favour a 'clean' representation of the data so it's easily portable. I'd welcome others' opinions though.)

Comment From: jreback

@Komnomnomnom sounds good on the priorities

I also agree with keeping clean json

I think this issue concerns some sort of auto-sniffing of the orientation; not sure how hard that would be (and it would probably be ambiguous in some cases)

maybe accept orient='infer' at some point?

Comment From: Komnomnomnom

I like the idea of orient='infer' on decoding, although yes, it would be ambiguous for the index and columns formats.

In [24]: df
Out[24]: 
              Open    High     Low   Close   Volume  Adj Close
Date                                                          
2010-01-04  626.95  629.51  624.24  626.75  1956200     626.75
2010-01-05  627.18  627.84  621.54  623.99  3004700     623.99
2010-01-06  625.86  625.86  606.36  608.26  3978700     608.26
2010-01-07  609.40  610.00  592.65  594.10  6414300     594.10
2010-01-08  592.00  603.25  589.11  602.02  4724300     602.02
2010-01-11  604.46  604.46  594.04  601.11  7212900     601.11
2010-01-12  597.65  598.16  588.00  590.48  4853300     590.48
2010-01-13  576.49  588.38  573.90  587.09  6496600     587.09
2010-01-14  583.90  594.20  582.81  589.85  4240100     589.85
2010-01-15  593.34  593.56  578.04  580.00  5434500     580.00

[10 rows x 6 columns]

In [19]: df.to_json(orient='columns')
Out[19]: '{"Open":{"1262563200000":626.95,"1262649600000":627.18,"1262736000000":625.86,"1262822400000":609.4,"1262908800000":592.0,"1263168000000":604.46,"1263254400000":597.65,"1263340800000":576.49,"1263427200000":583.9,"1263513600000":593.34},"High":{"1262563200000":629.51,"1262649600000":627.84,"1262736000000":625.86,"1262822400000":610.0,"1262908800000":603.25,"1263168000000":604.46,"1263254400000":598.16,"1263340800000":588.38,"1263427200000":594.2,"1263513600000":593.56},"Low":{"1262563200000":624.24,"1262649600000":621.54,"1262736000000":606.36,"1262822400000":592.65,"1262908800000":589.11,"1263168000000":594.04,"1263254400000":588.0,"1263340800000":573.9,"1263427200000":582.81,"1263513600000":578.04},"Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0},"Volume":{"1262563200000":1956200.0,"1262649600000":3004700.0,"1262736000000":3978700.0,"1262822400000":6414300.0,"1262908800000":4724300.0,"1263168000000":7212900.0,"1263254400000":4853300.0,"1263340800000":6496600.0,"1263427200000":4240100.0,"1263513600000":5434500.0},"Adj Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0}}'

In [22]: df.to_json(orient='index')
Out[22]: '{"1262563200000":{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},"1262649600000":{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},"1262736000000":{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},"1262822400000":{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},"1262908800000":{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},"1263168000000":{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},"1263254400000":{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},"1263340800000":{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},"1263427200000":{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},"1263513600000":{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}}'

In [20]: df.to_json(orient='split')
Out[20]: '{"columns":["Open","High","Low","Close","Volume","Adj Close"],"index":[1262563200000,1262649600000,1262736000000,1262822400000,1262908800000,1263168000000,1263254400000,1263340800000,1263427200000,1263513600000],"data":[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]}'

In [21]: df.to_json(orient='records')
Out[21]: '[{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}]'


In [23]: df.to_json(orient='values')
Out[23]: '[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]'

Getting index and columns confused would be a transpose of the data, so either raise if one of those orients were inferred, or do some fancy thinking: date data is usually on the index axis, the index size is usually larger than the column size, etc. Unfortunately index is the default orientation.
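The sniffing above could be sketched roughly like this. `sniff_orient` is a hypothetical helper, not a pandas function, and it cannot tell 'index' from 'columns' apart for exactly the transpose reason just described:

```python
import json


def sniff_orient(s):
    """Best-effort guess at the orient of a pandas-produced JSON string.

    'index' and 'columns' are structurally identical (one is the
    transpose of the other), so this falls back to 'columns' and a
    caller may still need to transpose.
    """
    obj = json.loads(s)
    if isinstance(obj, list):
        # 'records' is a list of dicts, 'values' a list of lists/scalars.
        if obj and isinstance(obj[0], dict):
            return "records"
        return "values"
    if isinstance(obj, dict):
        # 'split' always carries these three keys at the top level.
        if {"columns", "index", "data"} <= set(obj):
            return "split"
        return "columns"  # could equally be 'index'; ambiguous
    raise ValueError("unrecognised JSON layout")
```

Against the sample outputs above: the `orient='records'` string sniffs as `records`, `orient='values'` as `values`, `orient='split'` as `split`, and both `orient='columns'` and `orient='index'` as `columns`.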

Comment From: jreback

@Komnomnomnom from SO: http://stackoverflow.com/questions/22768682/pandas-read-json-not-working-on-multiindex/22768931#22768931

but the same issue

Comment From: watsonix

Last update to this was 2 years ago. I'm assuming it's still open? I have a multi-index situation when I do:

query_df = pd.read_sql_query(sql_text, db_conn, index_col=['user_id', 'mobile_time'])

and then output via:

records_json = query_df.to_json(orient='split')

but then pd.read_json(records_json, orient='split') doesn't work

Comment From: jreback

If you want a more round-trippable JSON format, this is in the upcoming 0.20.0: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#whatsnew-0200-enhancements-table-schema

Comment From: people-can-fly

Or you can just write json with orient='table':

df.to_json(path_or_buf='test.json', orient='table')

and read the multi-index json back:

pd.read_json('test.json', orient='table')
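For what it's worth, a round trip along these lines does preserve a row MultiIndex, including the level names. This is a sketch against a recent pandas, using an in-memory buffer instead of a file; as far as I know orient='table' still does not cover MultiIndex columns:

```python
import io

import pandas as pd

# A frame with a two-level row index, like the read_sql_query example above.
idx = pd.MultiIndex.from_tuples([("u1", 1), ("u1", 2), ("u2", 1)],
                                names=["user_id", "mobile_time"])
df = pd.DataFrame({"value": [10, 20, 30]}, index=idx)

# The table schema records the index levels as fields plus a primaryKey,
# so read_json can rebuild the MultiIndex exactly.
s = df.to_json(orient="table")
restored = pd.read_json(io.StringIO(s), orient="table")

assert restored.equals(df)
assert list(restored.index.names) == ["user_id", "mobile_time"]
```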

Comment From: eddy-geek

Here is a fancy hack if you have a good reason not to use orient='table' -- in my case its output is 3x bigger than orient='split'.

Notes:

* index level names are lost by orient='split', even with this
* it will probably break as soon as you use non-basic types (think: dates).

import ast

import pandas as pd


def to_json_split_roundtrip(df):
    # Flatten a MultiIndex into repr'd tuple strings so 'split' can carry it.
    if isinstance(df.index, pd.MultiIndex):
        values = [repr(v) for v in df.index.ravel()]
        df = df.set_index(pd.Index(values))
    return df.to_json(orient="split")

def from_json_split_roundtrip(j):
    df = pd.read_json(j, orient="split")
    # Tuple-string labels start with '(' -- evaluate them back into tuples.
    if df.index[0].startswith('('):
        df.index = pd.MultiIndex.from_tuples((ast.literal_eval(t) for t in df.index))
    return df

Example use

df = pd.DataFrame([['1', 2, 3, 4], ['1', 4, 5, 6], ['7', 4, 9, 10]],
                  columns=["a", "b", "c", "d"]).groupby(["a", "b"]).sum()

display(df)
print(df.index)

as_json = to_json_split_roundtrip(df)
# send it around
r = from_json_split_roundtrip(as_json)
display(r)
print(r.index)

Pandas to_json of df with multi-index header, followed by read_json results in incorrectly formatted column headers

So, a built-in solution (and/or a more efficient orient='table') is still sorely needed.

Comment From: Helgeb

There seems to be a mix-up in how read_json and to_json handle MultiIndex columns. As a workaround you can manually rearrange the multiindex column order in the json data:

>>> import json
>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,2],'b':[3,4]})
>>> df.columns = pd.MultiIndex.from_tuples([('A','B',),('C','D',)])
>>> print(df)
   A  C
   B  D
0  1  3
1  2  4
>>> json_string = df.to_json(orient='split')
>>> print(json_string)
{"columns":[["A","B"],["C","D"]],"index":[0,1],"data":[[1,3],[2,4]]}
>>> wrong_df = pd.read_json(json_string,orient='split')
>>> print(wrong_df)
   A  B
   C  D
0  1  3
1  2  4
>>> correct_columns = pd.MultiIndex.from_tuples([tuple(c) for c in json.loads(json_string)['columns']])
>>> print(correct_columns)
MultiIndex(levels=[['A', 'C'], ['B', 'D']],
           labels=[[0, 1], [0, 1]])
>>> json_dict = json.loads(json_string)
>>> json_dict['columns'] = list(range(len(json_dict['columns'])))

>>> correct_df = pd.read_json(json.dumps(json_dict),orient='split')
>>> correct_df.columns = correct_columns
>>> print(correct_df)
   A  C
   B  D
0  1  3
1  2  4

Comment From: LarsHill

The above solution from @Helgeb correctly decodes multi-index dataframes but seems to mishandle normal single-index dataframes.

I propose a different workaround to correctly decode multi-index dataframes from serialized json strings. Namely, ast.literal_eval allows safely evaluating tuple strings into actual tuples, which leads to correctly decoded dataframes, since pd.DataFrame parses dict objects with tuple keys correctly (as multi-index columns).

import ast
import json

import numpy as np
import pandas as pd


def create_dataframe_from_json(df: str) -> pd.DataFrame:
    """ Decode dataframe from serialized json string.
    Safely evaluates multi-index keys (tuple strings) as tuples before converting dict to dataframe.
    """

    df = json.loads(df)

    parsed_df = {}
    for k, v in df.items():
        try:
            parsed_key = ast.literal_eval(k)
            if not isinstance(parsed_key, tuple):
                parsed_key = k
        except (ValueError, SyntaxError):
            parsed_key = k
        parsed_df[parsed_key] = v

    return pd.DataFrame(parsed_df)


def main():

    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df1 = pd.DataFrame(inp)

    df2 = pd.DataFrame(np.random.random((4, 4)))
    df2.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']])

    df3 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df3.columns = pd.MultiIndex.from_tuples([('A', 1,), ('C', 'D',)])

    for df in [df1, df2, df3]:
        # original dataframe
        print(df)

        # serialize to json
        json_str = df.to_json()

        # wrong decoding for multi-index dataframes
        wrong_df = pd.read_json(json_str)
        print(wrong_df)

        # correct decoding from json (also for multi-index dataframes)
        correct_df = create_dataframe_from_json(json_str)
        print(correct_df)


if __name__ == '__main__':
    main()

Comment From: jocelyne8

I'm not sure what the status is on this issue (I'm stuck on some older version of pandas here), but in case someone else runs into the same issue, I ended up using a variant of @Helgeb 's solution, which converts both index and columns, and supports both MultiIndex and Index. This was very minimally tested and does very minimal input checking. Notably:

* if you have both tuples and non-tuples in your index list, you'll end up with a single-level index containing tuples.
* if you have tuples of multiple lengths in your index list, it will probably die? (untested)
* if either index list is empty, it will probably die? (untested)
* if you didn't serialize with orient="split", it will also probably die? (also untested)

import json

import pandas as pd


def deserialize_multiindex_dataframe(dataframe_json: str) -> pd.DataFrame:
    """Deserialize the dataframe json into a dataframe object.
    The dataframe json must be generated with DataFrame.to_json(orient="split")
    This function is to address the fact that `pd.read_json()` isn't behaving correctly (yet)
    https://github.com/pandas-dev/pandas/issues/4889
    """
    def convert_index(json_obj):
        to_tuples = [tuple(i) if isinstance(i, list) else i for i in json_obj]
        if all(isinstance(i, list) for i in json_obj):
            return pd.MultiIndex.from_tuples(to_tuples)
        else:
            return pd.Index(to_tuples)
    json_dict = json.loads(dataframe_json)
    columns = convert_index(json_dict['columns'])
    index = convert_index(json_dict['index'])
    dataframe = pd.DataFrame(json_dict["data"], index, columns)
    return dataframe

Comment From: mroeschke

I think the orient="table" schema is the solution going forward here, as it is not really possible to represent a MultiIndex with the other orients, so I think we can close. A new issue can be opened if something else is found.