Pandas Working with arrays inside cells - mean() works, median() crashes

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas
a = np.eye(4, dtype=float)  # np.float was an alias for the builtin float and was removed in NumPy 1.24
df = pandas.DataFrame()
df['test'] = [a,a,a,a]
# works
df['test'].mean()
# doesn't
df['test'].median()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
    118                 else:
--> 119                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    120             except Exception:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in nanmedian(values, axis, skipna)
    337     if not is_float_dtype(values):
--> 338         values = values.astype('f8')
    339         values[mask] = np.nan

ValueError: setting an array element with a sequence.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
    121                 try:
--> 122                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    123                 except ValueError as e:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in nanmedian(values, axis, skipna)
    337     if not is_float_dtype(values):
--> 338         values = values.astype('f8')
    339         values[mask] = np.nan

ValueError: setting an array element with a sequence.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-195-87f0d4023067> in <module>()
      7 df['test'].mean()
      8 # doesn't
----> 9 df['test'].median()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   6340                                       skipna=skipna)
   6341         return self._reduce(f, name, axis=axis, skipna=skipna,
-> 6342                             numeric_only=numeric_only)
   6343 
   6344     return set_function_name(stat_func, name, cls)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   2379                                           'numeric_only.'.format(name))
   2380             with np.errstate(all='ignore'):
-> 2381                 return op(delegate, skipna=skipna, **kwds)
   2382 
   2383         return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in _f(*args, **kwargs)
     60             try:
     61                 with np.errstate(invalid='ignore'):
---> 62                     return f(*args, **kwargs)
     63             except ValueError as e:
     64                 # we want to transform an object array

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
    128 
    129                     if is_object_dtype(values):
--> 130                         raise TypeError(e)
    131                     raise
    132 

TypeError: setting an array element with a sequence.

Problem description

When chaining pandas operations into a "one-liner", this error is unexpected. Some reductions work on a column of arrays (e.g. sum, cumsum), while others do not (min, max, median, std). I suspect this comes down to how the underlying object array is stored and passed to numpy.
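The traceback points at `values.astype('f8')` in `nanmedian`. A minimal NumPy-only sketch of why that cast fails while a sum still goes through, assuming the column is held as a 1-D object array of 2-D arrays (the variable names here are illustrative and not taken from pandas internals):

```python
import numpy as np

a = np.eye(4)

# Build a 1-D object array whose elements are 2-D arrays, roughly
# what a pandas column of arrays holds.
cells = np.empty(4, dtype=object)
for i in range(len(cells)):
    cells[i] = a

# Object-dtype reductions fall back to Python's `+`, which ndarray supports,
# so summing (and therefore averaging) happens to work elementwise.
total = cells.sum()
mean = total / len(cells)

# Casting to float64 needs one scalar per element, so it fails with
# the same ValueError that appears in the traceback.
try:
    cells.astype("f8")
    cast_error = None
except ValueError as exc:
    cast_error = exc
```

This only illustrates the cast seen in the traceback; other reductions may fail for different reasons.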

Expected Output

This workaround works, but breaks the "pandas flow" completely:

np.nanmedian(np.array(list(df['test'])), axis=0)
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
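One way to keep the workaround inside a method chain is to wrap it in a small function and call it through `Series.pipe`. This is only a sketch: `stacked_nanmedian` is a hypothetical helper name, and it assumes every cell holds an array of the same shape.

```python
import numpy as np
import pandas as pd


def stacked_nanmedian(series):
    """Hypothetical helper: stack equally shaped cell arrays and take
    the NaN-aware median across rows."""
    stacked = np.stack(series.to_list())  # shape: (n_rows, *cell_shape)
    return np.nanmedian(stacked, axis=0)


cells = pd.Series([np.eye(4)] * 4, name="test")

# The result is a plain ndarray with the shape of a single cell.
result = cells.pipe(stacked_nanmedian)
```

The chain still ends in NumPy, so this tidies the call site rather than making `median()` itself work.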

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

Storing data like this is non-idiomatic and non-performant, so it is not surprising that this doesn’t work.

Comment From: kubaraczkowski

It is true that this is non-idiomatic and may be non-performant, but it does actually work for a number of methods (mean, sum, etc).

What would be an "idiomatic" way for operating on (fixed size) arrays inside the cells?

Comment From: jreback

It only works for mean & sum, and that is basically happenstance.

The idiomatic way would be to use a MultiIndex:

# turn your data into a real DataFrame; it's actually much easier to simply construct it this way
In [37]: new_df = df.test.apply(lambda x: pd.Series(x.ravel()))

In [38]: new_df.columns = pd.MultiIndex.from_product([range(4), range(4)], names=['first', 'second'])

In [39]: new_df
Out[39]: 
first     0                   1 ...     2    3               
second    0    1    2    3    0 ...     3    0    1    2    3
0       1.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0  0.0  1.0
1       1.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0  0.0  1.0
2       1.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0  0.0  1.0
3       1.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0  0.0  1.0

[4 rows x 16 columns]

In [41]: new_df.median(level=0 , axis=1)
Out[41]: 
first    0    1    2    3
0      0.0  0.0  0.0  0.0
1      0.0  0.0  0.0  0.0
2      0.0  0.0  0.0  0.0
3      0.0  0.0  0.0  0.0

In [42]: new_df.mean(level=0 , axis=1)
Out[42]: 
first     0     1     2     3
0      0.25  0.25  0.25  0.25
1      0.25  0.25  0.25  0.25
2      0.25  0.25  0.25  0.25
3      0.25  0.25  0.25  0.25
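In more recent pandas releases the `level` argument to reductions such as `median` and `mean` is no longer available. A sketch of an equivalent, assuming the same `first`/`second` column layout as above, groups the transposed frame by the first level instead. Building `new_df` directly from a stacked array is a substitution for the `apply` call, not the approach shown in the session.

```python
import numpy as np
import pandas as pd

a = np.eye(4)
cells = pd.Series([a, a, a, a], name="test")

# One flattened array per row, with (first, second) column labels.
columns = pd.MultiIndex.from_product(
    [range(4), range(4)], names=["first", "second"]
)
flat = np.stack(cells.to_list()).reshape(len(cells), -1)
new_df = pd.DataFrame(flat, columns=columns)

# Group the columns by their first level by grouping the transposed rows,
# then transpose back so rows correspond to the original cells again.
median_by_first = new_df.T.groupby(level="first").median().T
mean_by_first = new_df.T.groupby(level="first").mean().T
```

Like the session above, this reduces over the `second` level within each row. It is not the elementwise median across rows that the `np.nanmedian(..., axis=0)` workaround computes; for that, `new_df.median()` reduces over rows and returns one value per `(first, second)` pair.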

Comment From: mewwts

@jreback what if you have varying numbers of entries in those lists? I don't see how this would work then.

As an example, consider a tags column which can contain zero, one, or many tags.

tags = [
    ['tag1'],
    ['tag1', 'tag2'],
    None,
    ['tag3', 'tag4', 'tag5']
]

Would you have to pad everything with Nones to produce something like

tags = [
    ['tag1', None, None],
    ['tag1', 'tag2', None],
    None,
    ['tag3', 'tag4', 'tag5']
]

or what?

Comment From: jreback

@mewwts there is barely any support for ops inside cells; it happens to work for sum/mean. Without a real community effort, this is not likely to be supported in any real way until pandas2.
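For the ragged tags example, one option that avoids padding is `Series.explode`, which was added in pandas 0.25 and so is not available in the 0.20.3 install from this report. A minimal sketch, assuming a newer pandas:

```python
import pandas as pd

tags = pd.Series(
    [["tag1"], ["tag1", "tag2"], None, ["tag3", "tag4", "tag5"]],
    name="tags",
)

# One row per tag; the original index label is repeated for each tag,
# so every value can still be traced back to its source row.
exploded = tags.explode()

# Missing entries are dropped by default when counting.
counts = exploded.value_counts()
```

This reshapes the data into one scalar per cell rather than operating inside cells, so it sidesteps the limitation described above instead of removing it.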

Comment From: kubaraczkowski

I think this is indeed a nice to have. The original purpose of this was to store images (2d data) in the cells and then use pandas for easy selection and operations on the images. You can then easily take a mean() of the images along a certain column/row. But this certainly should not be a part of "core features". I think it's time to close this issue.

Comment From: mewwts

I'm hijacking this issue a bit @kubaraczkowski, but unstructured data like text is becoming ubiquitous, and I think it's a shame that Pandas isn't very helpful when analyzing / pre-processing this kind of data.

@jreback is there an issue for this somewhere we can continue the discussion?
