Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the [main branch](https://pandas.pydata.org/docs/dev/getting_started/install.html#installing-the-development-version-of-pandas) of pandas.

Reproducible Example

import pandas as pd
import json

def json_conversion(df, orient_type = "values"):

  # convert dataframe to a JSON string
  json_str = df.to_json(orient=orient_type)

  # write the JSON string to a file
  with open('data.json', 'w') as f:
      json.dump(json_str, f)

  # read the JSON string from the file
  with open('data.json', 'r') as f:
      json_str = json.load(f)

  # Convert the JSON string back to a dataframe
  df2 = pd.read_json(json_str, orient=orient_type)

  return df2

# Create a dataframe with imaginary numbers
df = pd.DataFrame({'a': [1 + 2j, 3 + 4j], 'b': [5 + 6j, 7 + 8j]})
print(df)
#           a         b
# 0  1.0+2.0j  5.0+6.0j
# 1  3.0+4.0j  7.0+8.0j

# Check with `values`
df_values_json = json_conversion(df, "values")
print(df_values_json)
#                             0                           1
# 0  {'imag': 2.0, 'real': 1.0}  {'imag': 6.0, 'real': 5.0}
# 1  {'imag': 4.0, 'real': 3.0}  {'imag': 8.0, 'real': 7.0}


# Check with `table`
df_table_json = json_conversion(df, "table")
# TypeError: float() argument must be a string or a number, not 'dict'

Issue Description

When round-tripping a dataframe with complex numbers through JSON, pd.read_json() does not restore the complex values for at least two orientations, orient="values" and orient="table". With orient="values", each number comes back as a dictionary with "imag" and "real" entries; with orient="table", the data frame cannot be recreated at all because reading raises a TypeError.

The original data frame is:

      a     b
0  1+2j  5+6j
1  3+4j  7+8j

JSON Output under `orient='values'`
[
    [
        {
            "imag":2.0,
            "real":1.0
        },
        {
            "imag":6.0,
            "real":5.0
        }
    ],
    [
        {
            "imag":4.0,
            "real":3.0
        },
        {
            "imag":8.0,
            "real":7.0
        }
    ]
]

This leads to the reconstructed data frame looking like so:

                            0                           1
0  {'imag': 2.0, 'real': 1.0}  {'imag': 6.0, 'real': 5.0}
1  {'imag': 4.0, 'real': 3.0}  {'imag': 8.0, 'real': 7.0}

In the case of orient='table', we have:

JSON Output under `orient='table'`
{
    "schema":{
        "fields":[
            {
                "name":"index",
                "type":"integer"
            },
            {
                "name":"a",
                "type":"number"
            },
            {
                "name":"b",
                "type":"number"
            }
        ],
        "primaryKey":[
            "index"
        ],
        "pandas_version":"0.20.0"
    },
    "data":[
        {
            "index":0,
            "a":{
                "imag":2.0
            },
            "b":{
                "imag":6.0
            }
        },
        {
            "index":1,
            "a":{
                "imag":4.0
            },
            "b":{
                "imag":8.0
            }
        }
    ]
}

Reading this JSON back with pd.read_json() raises a TypeError:

TypeError: float() argument must be a string or a number, not 'dict'

Expected Behavior

Ideally, with orient="values" the original data frame should be reconstructed except for its column names, and with orient="table" the result should be identical to the original data frame.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.147+
Version : #1 SMP Sat Dec 10 16:00:40 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.2
numpy : 1.21.6
pytz : 2022.7
dateutil : 2.8.2
setuptools : 57.4.0
pip : 22.0.4
Cython : 0.29.32
pytest : 3.6.4
hypothesis : None
sphinx : 3.5.4
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.9.5
jinja2 : 2.11.3
IPython : 7.9.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.2.2
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy : None
sqlalchemy : 1.4.46
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.12.0
xlrd : 1.2.0
xlwt : 1.3.0
zstandard : None
tzdata : None

Comment From: dicristina

The documentation says that the output of df.to_json(orient="values") will be "just the values array" so it is not possible to recover the column names. To recover the actual complex values you can do something like:

df_values_json.applymap(lambda c: complex(**c))
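A slightly fuller sketch of this workaround, for illustration only. It builds the frame of dicts by hand (copied from the output shown in the report) rather than calling read_json, and it uses apply with Series.map instead of applymap so that it should behave the same across pandas versions. The helper name dicts_to_complex and its columns parameter are hypothetical, not part of pandas.

```python
import pandas as pd

# The frame as it comes back from read_json(..., orient="values"):
# every cell is a dict with "imag" and "real" keys.
df_read = pd.DataFrame([
    [{"imag": 2.0, "real": 1.0}, {"imag": 6.0, "real": 5.0}],
    [{"imag": 4.0, "real": 3.0}, {"imag": 8.0, "real": 7.0}],
])


def dicts_to_complex(frame, columns=None):
    """Hypothetical helper: turn {"real": .., "imag": ..} cells into complex."""
    # complex(**c) expands the dict into complex(real=..., imag=...).
    out = frame.apply(lambda col: col.map(lambda c: complex(**c)))
    if columns is not None:
        # orient="values" drops the labels, so they must be supplied again.
        out.columns = list(columns)
    return out


restored = dicts_to_complex(df_read, columns=["a", "b"])
print(restored)
```

The column names have to be passed in separately because, as noted above, orient="values" does not store them.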

In the orient="table" case, unlike the previous one, we have the data type of each column, so in theory we should be able to reconstruct the values without doing any work at all. The problem here is that there is no special handling for complex numbers, and when the values are read they are passed to the float function. The relevant code is in pandas/io/json/_table_schema.py.

Comment From: topper-123

Table Schema doesn't seem to have schema fields for complex numbers, so this isn't possible to fix in pandas under the constraint that we follow Table Schema. I'm not an expert on Table Schema at all, so if I'm wrong there, I appreciate feedback on that, of course.

So, I agree that the solution proposed by @dicristina using apply/applymap is the best possible right now, and I don't think this is fixable while following Table Schema.

Comment From: dicristina

There is a mechanism already in place to add an extDtype key to the field descriptor for extension types. When the table is read the data type indicated by this key is the one used instead of the one derived from the type key. Maybe this can be used for complex numbers even though they are not a pandas extension type.

Even when the correct data type is contained in the field descriptor the representation of the complex numbers presents a small problem. The parse_table_schema function builds a mapping of dtypes and then calls df.astype(dtypes). This does not work when we have the complex numbers represented as a dictionary.
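The astype problem can be reproduced in isolation. The sketch below does not call parse_table_schema; it only mimics the situation described, a column whose cells are dicts, and shows that astype fails for both a float and a complex target dtype while an explicit per-cell conversion works. The helper try_astype is a hypothetical name used for illustration.

```python
import pandas as pd

# A column as it would look after the JSON "data" records are loaded:
# each complex value is still a dict.
cells = pd.Series(
    [{"imag": 2.0, "real": 1.0}, {"imag": 4.0, "real": 3.0}], name="a"
)


def try_astype(series, dtype):
    """Hypothetical helper: return (result, error) instead of raising."""
    try:
        return series.astype(dtype), None
    except (TypeError, ValueError) as exc:
        return None, exc


# "number" fields are cast to float, which is where the reported error arises.
_, float_error = try_astype(cells, "float64")
# Even with the right target dtype, a dict cannot be cast directly.
_, complex_error = try_astype(cells, "complex128")

# An explicit per-cell conversion is needed before any dtype cast.
converted = cells.map(lambda c: complex(**c))

print(type(float_error).__name__, type(complex_error).__name__)
print(converted.tolist())
```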

Comment From: topper-123

Yes I agree.

Looking at the Table Schema number definition, it doesn't look like a dict is a legal value for a "number" field, so the current behavior is a bit strange.

Maybe complex numbers should have type "object" instead (i.e. allowing the dict) and an extDtype field with value "complex". That is, type "object" would by default be read in as a JSON-like object (i.e. the result of json.loads in Python), except that if the field has an "extDtype" with value "complex", it would be converted to a complex type using complex(**val)?
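To make the proposal concrete, here is a rough sketch of what such a reader could look like. This is not how pandas currently behaves: the "object" plus extDtype "complex" field descriptor is the hypothetical encoding suggested above, and read_table_with_complex is an illustrative name, not a pandas function.

```python
import json

import pandas as pd

# Hypothetical payload: complex columns declared as type "object"
# with an extDtype of "complex", as proposed above.
payload = json.loads("""
{
  "schema": {
    "fields": [
      {"name": "index", "type": "integer"},
      {"name": "a", "type": "object", "extDtype": "complex"},
      {"name": "b", "type": "object", "extDtype": "complex"}
    ],
    "primaryKey": ["index"]
  },
  "data": [
    {"index": 0, "a": {"imag": 2.0, "real": 1.0}, "b": {"imag": 6.0, "real": 5.0}},
    {"index": 1, "a": {"imag": 4.0, "real": 3.0}, "b": {"imag": 8.0, "real": 7.0}}
  ]
}
""")


def read_table_with_complex(payload):
    """Illustrative reader for the proposed encoding (not pandas API)."""
    fields = payload["schema"]["fields"]
    df = pd.DataFrame(payload["data"], columns=[f["name"] for f in fields])
    for field in fields:
        # Only fields explicitly marked as complex are converted;
        # other "object" fields stay as plain JSON-like objects.
        if field.get("type") == "object" and field.get("extDtype") == "complex":
            name = field["name"]
            df[name] = df[name].map(lambda v: complex(**v))
    keys = payload["schema"].get("primaryKey")
    if keys:
        df = df.set_index(keys)
    return df


result = read_table_with_complex(payload)
print(result)
```

Keying the conversion on extDtype keeps ordinary "object" fields untouched, which matches the default behavior described in the proposal.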