Code Sample, a copy-pastable example if possible
import numpy as np
import pandas
a = np.eye(4, dtype=float)
df = pandas.DataFrame()
df['test'] = [a,a,a,a]
# works
df['test'].mean()
# doesn't
df['test'].median()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
118 else:
--> 119 result = alt(values, axis=axis, skipna=skipna, **kwds)
120 except Exception:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in nanmedian(values, axis, skipna)
337 if not is_float_dtype(values):
--> 338 values = values.astype('f8')
339 values[mask] = np.nan
ValueError: setting an array element with a sequence.
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
121 try:
--> 122 result = alt(values, axis=axis, skipna=skipna, **kwds)
123 except ValueError as e:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in nanmedian(values, axis, skipna)
337 if not is_float_dtype(values):
--> 338 values = values.astype('f8')
339 values[mask] = np.nan
ValueError: setting an array element with a sequence.
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-195-87f0d4023067> in <module>()
7 df['test'].mean()
8 # doesn't
----> 9 df['test'].median()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
6340 skipna=skipna)
6341 return self._reduce(f, name, axis=axis, skipna=skipna,
-> 6342 numeric_only=numeric_only)
6343
6344 return set_function_name(stat_func, name, cls)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
2379 'numeric_only.'.format(name))
2380 with np.errstate(all='ignore'):
-> 2381 return op(delegate, skipna=skipna, **kwds)
2382
2383 return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in _f(*args, **kwargs)
60 try:
61 with np.errstate(invalid='ignore'):
---> 62 return f(*args, **kwargs)
63 except ValueError as e:
64 # we want to transform an object array
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py in f(values, axis, skipna, **kwds)
128
129 if is_object_dtype(values):
--> 130 raise TypeError(e)
131 raise
132
TypeError: setting an array element with a sequence.
Problem description
When building a complex "one-liner" command in a "pandas flow", this error is unexpected. Some commands work (e.g. sum, cumsum) and some don't (min/max/median/std). I think this comes down to how the internal numpy array is stored and passed to numpy.
Expected Output
This workaround works, but breaks the "pandas flow" completely:
np.nanmedian(np.array(list(df['test'])), axis=0)
array([[ 1., 0., 0., 0.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 0., 0., 0., 1.]])
Output of pd.show_versions()
Comment From: jreback
It is non-idiomatic and non-performant to store data like this, so it is not surprising that this doesn't work.
Comment From: kubaraczkowski
It is true that this is non-idiomatic and may be non-performant, but it does actually work for a number of methods (mean, sum, etc).
What would be an "idiomatic" way for operating on (fixed size) arrays inside the cells?
Comment From: jreback
It only works for mean & sum, and that is basically happenstance.
The idiomatic way would be to use a MultiIndex.
# turn your data into a real dataframe; it's actually much easier to simply construct it this way
In [37]: new_df = df.test.apply(lambda x: pd.Series(x.ravel()))
In [38]: new_df.columns = pd.MultiIndex.from_product([range(4), range(4)], names=['first', 'second'])
In [39]: new_df
Out[39]:
first 0 1 ... 2 3
second 0 1 2 3 0 ... 3 0 1 2 3
0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0
1 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0
2 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0
3 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0
[4 rows x 16 columns]
In [41]: new_df.median(level=0 , axis=1)
Out[41]:
first 0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
In [42]: new_df.mean(level=0 , axis=1)
Out[42]:
first 0 1 2 3
0 0.25 0.25 0.25 0.25
1 0.25 0.25 0.25 0.25
2 0.25 0.25 0.25 0.25
3 0.25 0.25 0.25 0.25
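The level argument used in the session above was removed from reductions in pandas 2.0. On newer versions the same grouping can probably be written with a transpose and groupby, as in the sketch below. The variable names are illustrative. The last line is an addition that does not appear in the thread: it takes the median down each column, which gives the element-wise median across rows that the original report asked for.

```python
import numpy as np
import pandas as pd

a = np.eye(4, dtype=float)
cells = pd.Series([a, a, a, a], name="test")

# Flatten each 4x4 cell into one row of 16 values.
flat = np.stack([cell.ravel() for cell in cells])
columns = pd.MultiIndex.from_product(
    [range(4), range(4)], names=["first", "second"]
)
new_df = pd.DataFrame(flat, index=cells.index, columns=columns)

# Group the columns by their "first" level: transpose, group the rows, transpose back.
median_by_first = new_df.T.groupby(level="first").median().T
mean_by_first = new_df.T.groupby(level="first").mean().T

# Element-wise median across rows, reshaped back to the shape of one cell.
elementwise_median = new_df.median().to_numpy().reshape(4, 4)
```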
Comment From: mewwts
@jreback what if you have varying numbers of entries in those lists? I don't see how this would work then.
As an example, consider having a tags column which can contain zero, one, or many tags.
tags = [
['tag1'],
['tag1', 'tag2'],
None,
['tag3', 'tag4', 'tag5']
]
Would you have to pad everything with Nones to produce something like
tags = [
['tag1', None, None],
['tag1', 'tag2', None],
None,
['tag3', 'tag4', 'tag5']
]
or what?
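For variable-length lists like these, one alternative to padding is to reshape into a long format with Series.explode (available since pandas 0.25). Nobody in this thread suggests it, so treat the following as a sketch of one possible approach, with illustrative variable names.

```python
import pandas as pd

tags = [
    ["tag1"],
    ["tag1", "tag2"],
    None,
    ["tag3", "tag4", "tag5"],
]
s = pd.Series(tags, name="tags")

# One row per tag; the original row label is repeated in the index.
# The None entry is a scalar, so it stays as a single missing row.
long_tags = s.explode()

# Ordinary reductions now work on the long form.
tag_counts = long_tags.value_counts()               # occurrences of each tag
tags_per_row = long_tags.groupby(level=0).count()   # non-missing tags per original row
```

The long form keeps the link to the original rows through the index, so results can be joined back to the source frame if needed.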
Comment From: jreback
@mewwts there is barely any support for ops inside cells; it happens to work for sum/mean. This is not likely to be supported in any real way until pandas2 without a real community effort.
Comment From: kubaraczkowski
I think this is indeed a nice to have. The original purpose of this was to store images (2d data) in the cells and then use pandas for easy selection and operations on the images. You can then easily take a mean() of the images along a certain column/row. But this certainly should not be a part of "core features". I think it's time to close this issue.
Comment From: mewwts
I'm hijacking this issue a bit @kubaraczkowski, but unstructured data like text is becoming ubiquitous, and I think it's a shame that Pandas isn't very helpful when analyzing / pre-processing this kind of data.
@jreback is there an issue for this somewhere we can continue the discussion?