I am trying to find a clean, concise way of setting the values of a multi-dimensional, multi-indexed dataframe, using data of lower dimensionality. In this case, I am trying to use two-dimensional data to set values in a four-dimensional dataframe.

Unfortunately, the syntax I am using only works with 2D data if the keys I'm using are already in the dataframe's index/column. But empty dataframes do not have any keys (yet) in their indices/columns.

For single points (0D), this is not a problem: pandas just adds the missing key(s) and sets the value. For anything else, it seems the key must already exist, as the code below shows.

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

# Create an empty 2-level mux (multi-index) for the index.
# The first level is run number ('r'). The second is x-axis values ('x').
mux = pd.MultiIndex(levels=[[]]*2,labels=[[]]*2,names=['r','x'])

# Create an empty 2-level mux for the column
# The first level is parameter value ('p'). The second is y-axis values ('y').
mux2 = pd.MultiIndex(levels=[[]]*2,labels=[[]]*2,names=['p','y'])

# Create the empty multi-indexed and multi-columned dataframe
df = pd.DataFrame(index=mux,columns=mux2)

# run number 0 (r=0), using parameter value 1.024 (p=1.024)...
# ... produces 2D data on an x-y grid.
data = np.array([[1,2,3],[4,5,6]])
ys = np.array([0,1,2])
xs = np.array([0,1])

# Now we want to set values in the 4D dataframe with our 2D data. Throws error.
df.loc[(0,list(xs)),(1.024,list(ys))] = data

Traceback (most recent call last):
    KeyError: 0

But single points work fine.

# Single points automatically result in new keys
df.loc[(0,xs[0]),(1.024,ys[0])] = 1
df.loc[(0,xs[0]),(1.024,ys[1])] = 2
df.loc[(0,xs[1]),(1.024,ys[0])] = 3
df.loc[(0,xs[1]),(1.024,ys[1])] = 4

# Keys are now found, and this now works.
df.loc[(0,list(xs[0:2])),(1.024,list(ys[0:2]))] = ((5,6),(7,8))

# But this does not work. '1' is not currently a key.
df.loc[(1,list(xs[0:2])),(1.024,list(ys[0:2]))] = ((1,2),(3,4))

Traceback (most recent call last):
    KeyError: 1L

Problem description

It seems the default behavior for setting single points (auto-creation of keys) is different from the behavior for setting multiple points (no auto-creation of keys). This seems pretty arbitrary from my outsider perspective; I'm not sure why the behavior shouldn't be identical.

If there is another way of accomplishing this, I would love to hear about it. But perhaps the point behavior should be extended to multiple dimensions.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 33.1.0.post20170122
Cython: 0.25.2
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.8.0
xarray: 0.9.1
IPython: 5.2.2
sphinx: 1.5.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.45.0
pandas_datareader: 0.2.1

Comment From: jreback

What you are doing is about as inefficient as possible. You want to create everything in one go. Setting single keys is a convenience; the expansion is what makes this inefficient, because each one copies the entire structure.

In [17]: df = pd.DataFrame([[1, 2], [3, 4]],
    ...:                   columns=pd.MultiIndex.from_product([[1.024], [0, 1]], names=list('py')),
    ...:                   index=pd.MultiIndex.from_product([[0], [0, 1]], names=list('rx')))

In [18]: df
Out[18]: 
p   1.024   
y       0  1
r x         
0 0     1  2
  1     3  4

Comment From: joseortiz3

OK, now to address my question: you generated an xy dataset for r=0, p=1.024. Suppose you now have another xy dataset for r=1, p=1.024. How do you obtain a dataframe with both xy sets, for r = [0,1] and p=1.024?

Continuing on to arbitrary r, arbitrary p, how do you obtain a dataframe with any arbitrary collection of xy data for each r and p? Especially when you do not know ahead of time what r and p values you will end up with? (And hence, cannot create it from scratch).

The efficiency of this is really not important. I don't care if it takes fifty milliseconds or fifty seconds. I just need to obtain the required multi-dimensional data frame.

Comment From: jreback

Show what you mean.

Comment From: joseortiz3

# Pseudocode
higher_df = Higher_Dimensional_Dataframe() # 4-D Dataframe
for i in range(100):
    p = rand() # Some random float
    r = randint() # Some random integer
    # Dataframe of unique xy values for a particular p and r.
    df = DataFrame(data = [[ randint() , randint() ], [ randint() ,  randint() ]], 
        columns=pd.MultiIndex.from_product([[ p ], [0, 1]], names=list('py')),
        index=pd.MultiIndex.from_product([[ r ], [0, 1]], names=list('rx')))
    # Put each of these dataframes into a single higher-dimensional dataframe.
    higher_df.loc[(r,df.index),(p,df.columns)] = df
# Now I have a dataframe with xy-datasets for an arbitrary collection of p's and r's
do_stuff(higher_df)

Comment From: jreback

There are lots of ways to do this.

Here is one. This is a better question for SO, or you can read some tutorials (and the docs).

In [6]: df = pd.DataFrame({(1.024, 0): np.random.randn(10), (1.024, 1): np.random.randint(0, 10, size=10)},
   ...:                   index=pd.MultiIndex.from_product([range(5), [0, 1]], names=list('rx')))
   ...: df.columns.names = ['foo', 'bar']
   ...: df
   ...: 
Out[6]: 
foo     1.024   
bar         0  1
r x             
0 0  1.215597  1
  1  0.475140  3
1 0  1.610304  7
  1 -0.261228  5
2 0  0.476945  6
  1 -0.257677  3
3 0 -2.170884  0
  1  0.743454  3
4 0 -1.721198  4
  1  0.487578  4

Comment From: joseortiz3

Thanks for your help. But I don't think I'm successfully conveying what the problem is. I provided a suggestion, as per the rules.

In the end, I just had to iterate point-by-point to get what I want. Inefficient, but my time is worth more.

Comment From: jorisvandenbossche

@joseortiz3 Using a simpler example (without multi-indexes), what you are trying to do is this:

In [14]: df = pd.DataFrame()

In [15]: df.loc[[1,2], [1,2,3]] = data
...
KeyError: '[1 2] not in index'

And indeed, this is not supported by pandas at the moment.
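One possible workaround for the simple case above, offered as a sketch rather than an official recipe: enlarge the frame first with `reindex`, then assign the block with `.loc`. The helper name `set_block` is hypothetical and not part of pandas, and each call copies the frame, so this shares the inefficiency discussed earlier in the thread.

```python
import numpy as np
import pandas as pd


def set_block(df, rows, cols, values):
    """Hypothetical helper: make sure rows/cols exist in df, then assign values."""
    # reindex returns a new frame; labels that were missing are filled with NaN.
    df = df.reindex(index=df.index.union(rows), columns=df.columns.union(cols))
    # Every requested label now exists, so the block assignment no longer raises.
    df.loc[rows, cols] = values
    return df


data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame()
df = set_block(df, [1, 2], [1, 2, 3], data)
print(df)
```

Because `reindex` returns a new object, the result has to be rebound (`df = set_block(...)`) rather than relied on as an in-place change.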

BTW, if you don't know the keys in advance to create the dataframe, my suggestion would be to gather the data in something else (e.g. append to a list, or to a few lists for the data, index, and columns), and only create the dataframe at the end.
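A minimal sketch of that "gather first, build once" suggestion, applied to the 4D case from the original question. It assumes every (r, p) pair occurs at most once, since `unstack` raises on duplicate index entries. The `results` list and its values are made up for illustration. Each 2D block is flattened into a Series indexed by (r, x, p, y), the pieces are collected in a plain list, and the wide frame is built only at the end.

```python
import numpy as np
import pandas as pd

xs = np.array([0, 1])
ys = np.array([0, 1, 2])

# Illustrative stand-in for results that arrive one (r, p) pair at a time.
results = [
    (0, 1.024, np.arange(6).reshape(2, 3)),       # values 0..5
    (1, 1.024, np.arange(6).reshape(2, 3) + 10),  # values 10..15
    (0, 2.5, np.arange(6).reshape(2, 3) + 20),    # values 20..25
]

pieces = []
for r, p, data in results:
    # from_product varies the last level fastest, which matches the
    # row-major order of data.ravel() for data of shape (len(xs), len(ys)).
    idx = pd.MultiIndex.from_product([[r], xs, [p], ys], names=["r", "x", "p", "y"])
    pieces.append(pd.Series(data.ravel(), index=idx))

# Build the long Series once, then pivot the p and y levels into columns.
long_form = pd.concat(pieces)
higher_df = long_form.unstack(["p", "y"])
print(higher_df)
```

Combinations that were never supplied (here r=1 with p=2.5) come out as NaN in the result, so the values are stored as floats.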