Description
Consider the 3D array
a = array([[[ 0, 1, 2],
[ 3, 4, 5]],
[[ 6, 7, 8],
[ 9, 10, 11]]])
which i wish to convert into a pandas DataFrame, so that each dimension gets mapped onto a column name. Viz,
A B C val
0 0 0 0 0
1 0 0 1 1
2 0 0 2 2
3 0 1 0 3
4 0 1 1 4
5 0 1 2 5
6 1 0 0 6
7 1 0 1 7
8 1 0 2 8
9 1 1 0 9
10 1 1 1 10
11 1 1 2 11
Note that in practice the array a
could have be very large. What's the "pandas" way to do this ? At the moment this is what I'm doing:
def ndarray2dataframe(arr, dim_names=None, dim_values=None,
target_dim_name="val", keep_dtypes=False):
"""Converts an ndarray of shape (d_1,...,d_n) into a dataframe with n + 1
columns containing thesame number of rows as there are elements in the
input array, and for all index tuples (i_1, ..., i_n) in
[0...d_1] x ... x [0...d_n], we have
arr[i_1,...,i_n] == df.loc[mask].iloc[0]["val"],
where msk := (df[df.columns[0]] == i_1) * ... * (df.columns[n] == i_n)
Examples
--------
In [50]: utils.ndarray2dataframe(arange(2 * 2 * 3).reshape((2, 2, 3)))
Out[50]:
0 1 2 val
0 0 0 0 0
1 0 0 1 1
2 0 0 2 2
3 0 1 0 3
4 0 1 1 4
5 0 1 2 5
6 1 0 0 6
7 1 0 1 7
8 1 0 2 8
9 1 1 0 9
10 1 1 1 10
11 1 1 2 11
"""
# Misc
arr = np.array(arr)
if dim_names is None:
dim_names = range(arr.ndim)
else:
assert len(dim_names) == arr.ndim
columns = list(dim_names) + [target_dim_name]
if dim_values is None:
dim_values = map(range, arr.shape)
else:
assert map(len, dim_values) == list(arr.shape)
# Form meshgrid with matricial indexing="ij" (N.B.: "xy" indexing screws
# everything up)
grid = np.meshgrid(*dim_values, indexing="ij", copy=False)
# Explode everything
data = np.reshape(arr, grid[0].shape).ravel().tolist()
grid = list(map(np.ravel, grid))
data = grid + [data]
df = pd.DataFrame(np.transpose(data), columns=columns)
# Pandas thinks it's smarter than us :)
df[target_dim_name] = df[target_dim_name].astype(arr.dtype)
if keep_dtypes:
for column, these_dim_values in zip(dim_names, dim_values):
df[column] = df[column].astype(type(these_dim_values[0]))
return df
I just wanted to be sure I wasn't reinventing the wheel.
Comment From: jreback
In [6]: Series(a.ravel(), pd.MultiIndex.from_product([ range(n) for n in a.shape ]))
Out[6]:
0 0 0 0
1 1
2 2
1 0 3
1 4
2 5
1 0 0 6
1 7
2 8
1 0 9
1 10
2 11
dtype: int64
though not sure if this actually generalizes to your problem. SO might be a better forum for things like this.