See also: http://stackoverflow.com/questions/26736745/indexing-a-pandas-panel-counterintuitive-or-a-bug
These are actually two related(?) issues. The first is that the resulting DataFrame is transposed when you index with the major_indexer or minor_indexer:
from pandas import Panel
from numpy import arange
p = Panel(arange(24).reshape(2,3,4))
p.shape
Out[4]: (2, 3, 4)
p.iloc[0].shape # original order
Out[5]: (3, 4)
p.iloc[:,0].shape # I would expect (2,4), but it is transposed
Out[6]: (4, 2)
p.iloc[:,:,0].shape # also transposed
Out[7]: (3, 2)
p.iloc[:,0,:].shape # transposed (same as [6])
Out[8]: (4, 2)
This may be a design choice, but it seems counterintuitive to me and it is not in line with the way numpy indexing works. On a related note, I would expect the following two commands to be equivalent:
p.iloc[1:,0,:].shape # Slicing item_indexer, then transpose
Out[9]: (4, 1)
p.iloc[1:,0].shape # Expected to get the same as [9], but slicing minor_indexer instead????
Out[10]: (3, 2)
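For comparison, here is a small sketch of the same selections on a plain NumPy array of the same shape, which is the behaviour the report above treats as the expected one. The names a and shapes are mine and do not appear in the original session.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# An integer index drops that axis; the remaining axes keep their order.
shapes = {
    "a[0]": a[0].shape,
    "a[:, 0]": a[:, 0].shape,
    "a[:, :, 0]": a[:, :, 0].shape,
    "a[1:, 0, :]": a[1:, 0, :].shape,
    "a[1:, 0]": a[1:, 0].shape,
}
for expr, shape in shapes.items():
    print(expr, shape)
```

In NumPy the last two expressions are equivalent, because a missing trailing index is treated as a full slice.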
INSTALLED VERSIONS
commit: None python: 2.7.6.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel byteorder: little LC_ALL: None LANG: nl_NL
pandas: 0.15.1 nose: 1.3.3 Cython: 0.20.1 numpy: 1.9.1 scipy: 0.14.0 statsmodels: 0.5.0 IPython: 2.2.0 sphinx: 1.2.2 patsy: 0.2.1 dateutil: 1.5 pytz: 2014.9 bottleneck: None tables: 3.1.1 numexpr: 2.3.1 matplotlib: 1.4.2 openpyxl: 1.8.5 xlrd: 0.9.3 xlwt: 0.7.5 xlsxwriter: 0.5.5 lxml: 3.3.5 bs4: 4.3.1 html5lib: None httplib2: None apiclient: None rpy2: None sqlalchemy: 0.9.4 pymysql: None psycopg2: None
Comment From: max-sixty
I've just hit this too, on 0.16.2. Is this intended? Is it related to https://github.com/pydata/pandas/issues/11369?
In [8]: panel = pd.Panel(pd.np.random.rand(2,3,4))
In [10]: panel.shape
Out[10]: (2, 3, 4)
In [11]: panel[:, :, 0].shape
Out[11]: (3, 2)
In numpy:
In [15]: npanel=pd.np.random.rand(2,3,4)
In [16]: npanel.shape
Out[16]: (2, 3, 4)
In [18]: npanel[:,:,0].shape
Out[18]: (2, 3)
CC @jreback, as this seemed like an abandoned issue
Comment From: jreback
Yes, this has always been like this. DataFrame is 'reversed' in that the columns axis (1) is the 'primary' (we call it the info) axis. This translates to indexing, where a Panel is conceptually a dict of DataFrames. Not sure what, if anything, we can do about this, as it would break practically all code.
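One way to picture the 'dict of DataFrames' view is the sketch below. It is only an illustration built from ordinary DataFrames, not how Panel is implemented internally; the names values, frames and row0 are hypothetical. It shows why fixing a single major-axis row ends up with the items on the columns, giving the (4, 2) shape from the report.

```python
import numpy as np
import pandas as pd

values = np.arange(24).reshape(2, 3, 4)

# Hypothetical stand-in for a Panel: item label -> (major x minor) DataFrame.
frames = {item: pd.DataFrame(values[item]) for item in range(2)}

# Fixing one major-axis row in every frame gives one Series per item,
# indexed by the minor axis. Building a DataFrame from a dict of Series
# puts the dict keys (the items) on the columns.
row0 = pd.DataFrame({item: df.iloc[0] for item, df in frames.items()})
print(row0.shape)
```

The result holds the same numbers as values[:, 0, :], only transposed.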
Comment From: max-sixty
This is a bigger issue than one we're going to solve here. But regardless a couple of points:
Panels generally
- I have been working with Panels a lot over the past couple of weeks and - from my humble user perspective - it has felt pretty painful. I know it's a difficult challenge to go from 2 -> n dimensions. DataFrames are so beautiful, and Panels seem like an alpha of their functionality at a different level of quality (i.e. a 'preview' with low documentation & testing, rather than a fully functional subset).
- FWIW, my basic approach now is to use pandas for the initial alignment, and then use numpy functions only. I wonder, as pandas moves to a 1.0 release, whether Panel needs to either be given a lot of love or deprecated to 'experimental' or completely moved to something like xray infrastructure for >2D along with the current options for MultiIndex.
Panel indexing
- I imagine there's something I don't understand, although I don't get why we have this design.
- My understanding is that a DataFrame has row x column dimensions which are consistent across the indexers, and then there are some 'convenience' methods (such as df['a'], which references the info_axis / columns, and df[2:5], which references the rows). In production, using the indexers is rigorous and predictable.
- I would have thought a consistent design could exist for Panels - while there might be convenience methods, standard indexers would apply to items x rows (/ major) x columns (/ minor), and selecting a slice of one would collapse the others, in order. I had thought the info_axis & stat_axis were for convenience only, not affecting the core indexing operations (but sounds like I'm wrong).
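A minimal sketch of the DataFrame behaviour described in the points above, using a made-up frame (the names df, col, rows and explicit are mine): plain [] switches between columns and rows depending on what it is given, while .iloc always takes rows first and columns second.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(18).reshape(6, 3), columns=["a", "b", "c"])

col = df["a"]               # a label in [] selects a column (the info axis)
rows = df[2:5]              # a slice in [] selects rows instead
explicit = df.iloc[2:5, 0]  # .iloc is always rows first, then columns

print(col.shape, rows.shape, explicit.shape)
```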
xray mostly has the design I expected, I think, although does remember the collapsed dimension:
In [22]: panel_x=xray.DataArray(pd.np.random.rand(4,3,2))
In [24]: panel_x
Out[24]:
<xray.DataArray (dim_0: 4, dim_1: 3, dim_2: 2)>
array([[[ 0.81499518, 0.73722039],
...
[ 0.21864764, 0.93710684]]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3
* dim_1 (dim_1) int64 0 1 2
* dim_2 (dim_2) int64 0 1
In [25]: panel_x.loc[:,0,:]
Out[25]:
<xray.DataArray (dim_0: 4, dim_2: 2)>
array([[ 0.81499518, 0.73722039],
[ 0.41809174, 0.28529916],
[ 0.82198192, 0.14365383],
[ 0.55948113, 0.24809068]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3
dim_1 int64 0
* dim_2 (dim_2) int64 0 1
Relevant xref: https://github.com/pydata/pandas/issues/9595, https://github.com/pydata/pandas/issues/10000 CC @shoyer
Comment From: jreback
@maximilianr
Well, @shoyer and I had some discussions w.r.t. essentially making .to_panel() simply return a DataArray directly (then you would work with it), and deprecating Panel.
That's an option; it more closely aligns pandas and x-ray.
However, I think there is a nice use case for a dense Panel, if you allow that x-ray is more 'geared' towards sparse-type nd-arrays (of course it has dense support); it's more that that is its primary use case.
I happen to have (well, in the past) used Panels quite a lot, where I would do things like fields x time-axis x tickers, where the pandas model makes a lot of sense.
So maybe you can elaborate on where you think pandas is lacking (in docs/tests/etc). Pretty much everything is there. So aside from the indexing conventions, not sure what issues there are.
Comment From: max-sixty
Here are a couple of issues I've had in addition to the above; I can provide more on these / others if helpful:
- Very standard functions such as multiply that exist on Panels, when other is a different dimension. Without going to numpy, this is very slow as it iterates through each series combination. SO question here. I just had a go with xray and it seems decent:
In [56]: x
Out[56]:
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
Coordinates:
* dim_0 (dim_0) int64 0 1
* dim_1 (dim_1) int64 0 1 2
* dim_2 (dim_2) int64 0 1 2 3
In [57]: x * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]
Out[57]:
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
Coordinates:
* dim_1 (dim_1) int64 0 1 2
* dim_2 (dim_2) int64 0 1 2 3
* dim_0 (dim_0) int64 0 1
In [58]: x.to_pandas() * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-58-18d40558bcd9> in <module>()
----> 1 x.to_pandas() * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/ops.py in f(self, other)
1050 raise ValueError('Simple arithmetic with %s can only be '
1051 'done with scalar values' %
-> 1052 self._constructor.__name__)
1053
1054 return self._combine(other, op)
ValueError: Simple arithmetic with Panel can only be done with scalar values
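As a sketch of the 'going to numpy' route mentioned above, the same multiplication can be done on the raw values. The names values, weights and scaled are mine; any labels would have to be reattached separately afterwards.

```python
import numpy as np

values = np.arange(24).reshape(2, 3, 4)  # same numbers as x above
weights = np.asarray([0, 1])             # one weight per item (axis 0)

# Reshape weights to (2, 1, 1) so it broadcasts across the other two axes.
scaled = values * weights[:, np.newaxis, np.newaxis]
print(scaled.shape)
```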
- Non-standard functions such as percentile that don't exist on a Panel. I ended up using np.nanpercentile here; the alternative was apply over series combinations, which was extremely slow. (I tried applying the DataFrame percentile over two of the axes and then reorganizing the axes, which I think was a bit faster, but awkward).
- Selecting, as in https://github.com/pydata/pandas/issues/11451. I ended up using np.where:
panel.loc[:, :, :] = pd.np.where(
panel.notnull(),
panel,
fallback_df[:, :, pd.np.newaxis]
)
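Both workarounds can be sketched on plain arrays. This is an illustration on made-up data, not the original code: the array names and the assumed fallback shape of (items, major) are hypothetical.

```python
import numpy as np

# Hypothetical data: items x major x minor, with a few missing values.
panel_values = np.arange(24, dtype=float).reshape(2, 3, 4)
panel_values[0, 1, 2] = np.nan
panel_values[1, 0, 0] = np.nan

# Percentile across the item axis, ignoring NaNs; the result has shape (3, 4).
median = np.nanpercentile(panel_values, 50, axis=0)

# Hypothetical fallback with one value per (item, major) pair, shape (2, 3).
fallback = np.full((2, 3), -1.0)

# Keep existing values; where they are missing, take the fallback, which
# gets a trailing axis so that it broadcasts along the minor axis.
filled = np.where(
    ~np.isnan(panel_values),
    panel_values,
    fallback[:, :, np.newaxis],
)
```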
xray seems decent at this too:
In [61]: x.where(x>5)
Out[61]:
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ nan, nan, nan, nan],
[ nan, nan, 6., 7.],
[ 8., 9., 10., 11.]],
[[ 12., 13., 14., 15.],
[ 16., 17., 18., 19.],
[ 20., 21., 22., 23.]]])
Coordinates:
* dim_1 (dim_1) int64 0 1 2
* dim_2 (dim_2) int64 0 1 2 3
* dim_0 (dim_0) int64 0 1
In [62]: x.where(x[0]>5)
Out[62]:
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ nan, nan, nan, nan],
[ nan, nan, 6., 7.],
[ 8., 9., 10., 11.]],
[[ nan, nan, nan, nan],
[ nan, nan, 18., 19.],
[ 20., 21., 22., 23.]]])
Coordinates:
* dim_1 (dim_1) int64 0 1 2
* dim_2 (dim_2) int64 0 1 2 3
* dim_0 (dim_0) int64 0 1
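For reference, a rough NumPy counterpart of the two calls above, working on values only and without labels. The names masked and masked_by_first are mine.

```python
import numpy as np

x = np.arange(24, dtype=float).reshape(2, 3, 4)

# Element-wise condition: keep values greater than 5, NaN elsewhere.
masked = np.where(x > 5, x, np.nan)

# Condition computed on the first item only, shape (3, 4). NumPy
# broadcasts it across axis 0, so both items share the same mask.
masked_by_first = np.where(x[0] > 5, x, np.nan)
```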
Hope this is helpful - thanks for your engagement @jreback
Comment From: shoyer
Yes, these sorts of issues are exactly why we wrote xray in the first place. The pandas API and internals weren't really designed with n-dimensional data in mind, which makes panels and nd-panel quite awkward.
> xray mostly has the design I expected, I think, although does remember the collapsed dimension:
The collapsed dimension is essentially just metadata and can be safely ignored. I think @jreback was a little confused here, but scalar coordinates are not used for any sort of alignment.
IMO the xray.DataArray is almost strictly more useful than panels. The main feature gap is that we currently don't support MultiIndex in xray, but hopefully that will change soon.
Comment From: jreback
@MaximilianR
Since I understand you recently switched from using Panels to x-ray, can you elaborate on how it went? good-bad-ugly?
If we deprecate Panel entirely and make to_panel return an x-ray object, what are the upsides / downsides?
Comment From: max-sixty
Sure - I'll give a short synthesis, and happy to answer any follow up questions you have.
Good:
- Clear, explicit API, very few surprises. Indexing in particular is very reliable. Stark contrast to Panel!
- Labeled dimensions, and the benefits that come with them - .sel, .isel (which becomes more important for higher dimensional datasets)
- Clear difference between a DataArray and Dataset, independent of dimensionality (the ability to have DataArrays aligned on different dimensions is awesome)
Bad - minor, and very specific to my experience:
- Index issues - for indexes whose .values aren't the same as the index (PeriodIndex, maybe tz?). PeriodIndex is very usable though given some recent minor changes. No MultiIndexes. @shoyer will have a better view here
- Smaller API - greater need to use numpy / bottleneck / numbagg functions. For example, .where doesn't take an other argument
- A bit less magic - for example, you can't slice a date index with a string ['2015']
- I think this is a big plus for XRay generally, but given that DataArrays can only be a single type, that would have to be handled in .to_panel
Overall it's a beautiful library, both for exploratory work and for production. I'm very excited to be using it, and grateful to @shoyer for creating it.
I don't have a strong view on whether we should make to_panel return an XRay DataArray, but I do think we should choose and articulate a vision & roadmap on Panel vs XRay - the time the community spends on improving Panel around the edges is a waste IMHO, and it's the role of the maintainers to ensure that contributors know whether they're working on sustainable products.
Let me know if I can help beyond this at all, Max
Comment From: shoyer
The good news is that almost all of @MaximilianR's issues should be fixable with a bit more work -- there are no fundamental design issues. For example, I just made a PR adding MultiIndex support (https://github.com/xray/xray/pull/702).
> for example, you can't slice an date index with a string ['2015']
Could you share an example where this fails? There may be a bug here -- we've had support for string indexing of datetime indexes since almost the beginning: http://xray.readthedocs.org/en/stable/time-series.html#datetime-indexing
Comment From: max-sixty
That should read PeriodIndex:
In [51]: ds=xray.Dataset(coords={'date':pd.period_range(periods=10,start='2000')})
In [52]: ds['d']=('date', pd.np.random.rand(10))
In [53]: ds.sel(date='2000')
Out[53]:
<xray.Dataset>
Dimensions: ()
Coordinates:
date object 2000-01-01
Data variables:
d float64 0.8965
Confirming it works for DatetimeIndex:
In [54]: ds=xray.Dataset(coords={'date':pd.date_range(periods=10,start='2000')})
In [55]: ds['d']=('date', pd.np.random.rand(10))
In [56]: ds.sel(date='2000')
Out[56]:
<xray.Dataset>
Dimensions: (date: 10)
Coordinates:
* date (date) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
d (date) float64 0.09303 0.5456 0.4934 0.08438 0.1854 0.2823 ...
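For context, here is a sketch of the pandas-side behaviour this comparison alludes to, using plain Series rather than xray objects. It assumes a reasonably recent pandas; the names by_period and by_date are mine. With either index type, a year string passed to .loc is treated as a partial date and selects every entry in that year.

```python
import numpy as np
import pandas as pd

values = np.arange(10.0)

by_period = pd.Series(
    values, index=pd.period_range(start="2000", periods=10, freq="D")
)
by_date = pd.Series(
    values, index=pd.date_range(start="2000", periods=10, freq="D")
)

# A year string selects all ten daily entries, not just the first one.
print(len(by_period.loc["2000"]))
print(len(by_date.loc["2000"]))
```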
Comment From: jreback
closing as Panels are deprecated