Pandas Partial indexing of a Panel

See also: http://stackoverflow.com/questions/26736745/indexing-a-pandas-panel-counterintuitive-or-a-bug

These are actually two related(?) issues. The first is that the resulting DataFrame is transposed when you index with the major_indexer or minor_indexer:

from pandas import Panel
from numpy import arange
p = Panel(arange(24).reshape(2,3,4))
p.shape
Out[4]: (2, 3, 4)
p.iloc[0].shape # original order
Out[5]: (3, 4)
p.iloc[:,0].shape # I would expect (2,4), but it is transposed
Out[6]: (4, 2)
p.iloc[:,:,0].shape # also transposed
Out[7]: (3, 2)
p.iloc[:,0,:].shape # transposed (same as [6])
Out[8]: (4, 2)

This may be a design choice, but it seems counterintuitive to me and it is not in line with the way numpy indexing works. On a related note, I would expect the following two commands to be equivalent:

p.iloc[1:,0,:].shape # Slicing item_indexer, then transpose
Out[9]: (4, 1)
p.iloc[1:,0].shape # Expected to get the same as [9], but slicing minor_indexer instead????
Out[10]: (3, 2)
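
For reference, here is a minimal sketch of the same indexing operations on a plain numpy array of the same shape. It uses numpy only (no Panel) and the variable names are illustrative; it just records the shapes that numpy indexing produces.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# Fixing one axis with an integer drops that axis and keeps the others in order.
first_item = a[0].shape          # (3, 4)
first_major = a[:, 0].shape      # (2, 4)
first_minor = a[:, :, 0].shape   # (2, 3)

# A trailing ":" is implicit in numpy, so these two spellings are equivalent.
explicit = a[1:, 0, :].shape     # (1, 4)
implicit = a[1:, 0].shape        # (1, 4)

print(first_item, first_major, first_minor, explicit, implicit)
```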

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: nl_NL

pandas: 0.15.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.2
patsy: 0.2.1
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

Comment From: max-sixty

I've just hit this too, on 0.16.2. Is this intended? Is it related to https://github.com/pydata/pandas/issues/11369?

In [8]: panel = pd.Panel(pd.np.random.rand(2,3,4))

In [10]: panel.shape
Out[10]: (2, 3, 4)

In [11]: panel[:, :, 0].shape
Out[11]: (3, 2)

In numpy:

In [15]: npanel=pd.np.random.rand(2,3,4)

In [16]: npanel.shape
Out[16]: (2, 3, 4)

In [18]: npanel[:,:,0].shape
Out[18]: (2, 3)

CC @jreback, as this seemed like an abandoned issue

Comment From: jreback

Yes, this has always been like this. DataFrame is 'reversed' in that the columns axis (1) is the 'primary' (we call it the info) axis. This translates to indexing where a Panel is conceptually a dict of DataFrames. Not sure what, if anything, we can do about this, as it would break practically all code.
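
The "dict of DataFrames" picture can be sketched roughly as follows. This is only an illustration under that assumption, built from an ordinary dict and plain DataFrames; it is not how Panel is implemented internally, and the names `frames`, `major_0` and `minor_0` are made up for the example.

```python
import numpy as np
import pandas as pd

values = np.arange(24).reshape(2, 3, 4)

# Hypothetical stand-in for a Panel: one (major x minor) DataFrame per item.
frames = {item: pd.DataFrame(values[item]) for item in range(values.shape[0])}

# Fix major row 0 in every item: each item contributes one column,
# so items land on the columns axis and minor lands on the index.
major_0 = pd.DataFrame({item: df.iloc[0] for item, df in frames.items()})

# Fix minor column 0 in every item: again each item becomes a column.
minor_0 = pd.DataFrame({item: df.iloc[:, 0] for item, df in frames.items()})

print(major_0.shape)  # (4, 2)
print(minor_0.shape)  # (3, 2)
```

Because a dict of Series is assembled column by column, the item keys end up as columns, which is consistent with the (4, 2) and (3, 2) shapes reported at the top of this issue.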

Comment From: max-sixty

This is a bigger issue than one we're going to solve here. But regardless a couple of points:

Panels generally

I have been working with Panels a lot over the past couple of weeks and - from my humble user perspective - it has felt pretty painful. I know it's a difficult challenge to go from 2 -> n dimensions. DataFrames are so beautiful, and Panels seem like an alpha of their functionality at a different level of quality (i.e. a 'preview' with low documentation & testing, rather than a fully functional subset).
FWIW, my basic approach now is to use pandas for the initial alignment, and then use numpy functions only. I wonder as pandas moves to a 1.0 release, whether Panel needs to either be given a lot of love or deprecated to 'experimental' or completely moved to something like xray infrastructure for >2D along with the current options for MultiIndex.
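
A minimal sketch of that "align with pandas, then compute with numpy" approach, assuming two small made-up DataFrames (`a` and `b` are illustrative, not taken from any real workload):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=[0, 1])
b = pd.DataFrame({"y": [5.0, 6.0], "z": [7.0, 8.0]}, index=[1, 2])

# Let pandas do the label alignment on both axes (outer join, NaN where missing).
a_aligned, b_aligned = a.align(b, join="outer")

# From here on, work with a plain 3-D numpy array: (frame, row, column).
cube = np.stack([a_aligned.to_numpy(), b_aligned.to_numpy()])

# NaN-aware reduction across the two frames; all-NaN cells sum to 0.
totals = np.nansum(cube, axis=0)

print(cube.shape)  # (2, 3, 3)
```

The labels live in `a_aligned.index` and `a_aligned.columns`, so a result such as `totals` can be wrapped back into a DataFrame afterwards if needed.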

Panel indexing

I imagine there's something I don't understand, although I don't get why we have this design.
My understanding is that a DataFrame has row x column dimensions which are consistent across the indexers, and then there are some 'convenience' methods (such as df['a'] which reference the info_axis / columns and df[2:5] which reference the rows). In production, using the indexers is rigorous and predictable.
I would have thought a consistent design could exist for Panels - while there might be convenience methods, standard indexers would apply to items x rows (/ major) x columns (/ minor), and selecting a slice of one would collapse the others, in order. I had thought the info_axis & stat_axis were for convenience only, not affecting the core indexing operations (but sounds like I'm wrong).
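
To make the expectation concrete, here is a sketch of the "collapse one axis, keep the others in order" behaviour written with numpy and a plain DataFrame. The helper `select_in_order` and the axis labels are hypothetical and exist only for this illustration; nothing like it is part of pandas.

```python
import numpy as np
import pandas as pd


def select_in_order(values, labels, axis, position):
    """Fix one axis of a 3-D array at `position`, keeping the other two axes in order."""
    remaining = [ax for ax in range(values.ndim) if ax != axis]
    sliced = np.take(values, position, axis=axis)  # drops `axis`, keeps the rest
    return pd.DataFrame(
        sliced, index=labels[remaining[0]], columns=labels[remaining[1]]
    )


values = np.arange(24).reshape(2, 3, 4)
labels = [["i0", "i1"], ["r0", "r1", "r2"], ["c0", "c1", "c2", "c3"]]

by_major = select_in_order(values, labels, axis=1, position=0)  # items x minor
by_minor = select_in_order(values, labels, axis=2, position=0)  # items x major

print(by_major.shape)  # (2, 4)
print(by_minor.shape)  # (2, 3)
```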

xray mostly has the design I expected, I think, although it does remember the collapsed dimension:


In [22]: panel_x=xray.DataArray(pd.np.random.rand(4,3,2))

In [24]: panel_x
Out[24]: 
<xray.DataArray (dim_0: 4, dim_1: 3, dim_2: 2)>
array([[[ 0.81499518,  0.73722039],
...
        [ 0.21864764,  0.93710684]]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1

In [25]: panel_x.loc[:,0,:]
Out[25]: 
<xray.DataArray (dim_0: 4, dim_2: 2)>
array([[ 0.81499518,  0.73722039],
       [ 0.41809174,  0.28529916],
       [ 0.82198192,  0.14365383],
       [ 0.55948113,  0.24809068]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
    dim_1    int64 0
  * dim_2    (dim_2) int64 0 1

Relevant xref: https://github.com/pydata/pandas/issues/9595, https://github.com/pydata/pandas/issues/10000 CC @shoyer

Comment From: jreback

@maximilianr

Well, @shoyer and I had some discussions w.r.t. essentially making .to_panel() simply return a DataArray directly (then you would work with it), and deprecating Panel.

That's an option; more closely aligns pandas and x-ray.

However, I think there is a nice use case for a dense Panel, if you allow that x-ray is more 'geared' towards sparse-type nd-arrays (of course it has dense support; it's more that sparse is its primary use case).

I happen to have (well, in the past) used Panels quite a lot, where I would have things like:

fields x time-axis x tickers, where the pandas model makes a lot of sense.

So maybe you can elaborate on where you think pandas is lacking (in docs/tests/etc). Pretty much everything is there. So aside from the indexing conventions, I'm not sure what issues there are.

Comment From: max-sixty

Here are a couple of issues I've had in addition to the above; I can provide more on these / others if helpful:

- Very standard functions such as multiply that exist on Panels, when other is a different dimension. Without going to numpy, this is very slow as it iterates through each series combination. SO question here. I just had a go with xray and it seems decent:

In [56]: x
Out[56]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * dim_0    (dim_0) int64 0 1
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3

In [57]: x * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]
Out[57]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ 0,  0,  0,  0],
        [ 0,  0,  0,  0],
        [ 0,  0,  0,  0]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3
  * dim_0    (dim_0) int64 0 1

In [58]: x.to_pandas() * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-18d40558bcd9> in <module>()
----> 1 x.to_pandas() * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/ops.py in f(self, other)
   1050             raise ValueError('Simple arithmetic with %s can only be '
   1051                              'done with scalar values' %
-> 1052                              self._constructor.__name__)
   1053 
   1054         return self._combine(other, op)

ValueError: Simple arithmetic with Panel can only be done with scalar values

- Non-standard functions such as percentile that don't exist on a Panel. I ended up using np.nanpercentile here; the alternative was apply over series combinations, which was extremely slow. (I tried applying the DataFrame percentile over two of the axes and then reorganizing the axes, which I think was a bit faster, but awkward).
- Selecting, as in https://github.com/pydata/pandas/issues/11451. I ended up using np.where:

panel.loc[:, :, :] = pd.np.where(
        panel.notnull(),
        panel,
        fallback_df[:, :, pd.np.newaxis]
    )
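
A self-contained numpy-only sketch of that np.where pattern, assuming a 3-D array of values and a 2-D fallback that is broadcast along the last axis. The arrays here are invented for the illustration and stand in for `panel` and `fallback_df` above.

```python
import numpy as np

values = np.arange(24, dtype=float).reshape(2, 3, 4)
values[0, 1, 2] = np.nan
values[1, 0, 3] = np.nan

# One fallback value per (axis 0, axis 1) pair; assumed shape (2, 3).
fallback = np.full((2, 3), -1.0)

# Add a trailing axis so the fallback broadcasts across the last dimension,
# then take the fallback only where the original value is missing.
filled = np.where(np.isnan(values), fallback[:, :, np.newaxis], values)

print(filled[0, 1, 2], filled[1, 0, 3])  # -1.0 -1.0
```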

xray seems decent at this too:

In [61]: x.where(x>5)
Out[61]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ nan,  nan,  nan,  nan],
        [ nan,  nan,   6.,   7.],
        [  8.,   9.,  10.,  11.]],

       [[ 12.,  13.,  14.,  15.],
        [ 16.,  17.,  18.,  19.],
        [ 20.,  21.,  22.,  23.]]])
Coordinates:
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3
  * dim_0    (dim_0) int64 0 1

In [62]: x.where(x[0]>5)
Out[62]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ nan,  nan,  nan,  nan],
        [ nan,  nan,   6.,   7.],
        [  8.,   9.,  10.,  11.]],

       [[ nan,  nan,  nan,  nan],
        [ nan,  nan,  18.,  19.],
        [ 20.,  21.,  22.,  23.]]])
Coordinates:
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3
  * dim_0    (dim_0) int64 0 1
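
Going back to the first two items in the list above, the numpy fallbacks I mentioned look roughly like this. It is a sketch on a made-up array, not the code from my actual project.

```python
import numpy as np

values = np.arange(24, dtype=float).reshape(2, 3, 4)

# Multiply by one weight per entry along axis 0, using broadcasting
# instead of iterating over every series combination.
weights = np.asarray([0, 1])
scaled = values * weights[:, np.newaxis, np.newaxis]

# NaN-aware percentile along axis 0 (here the 50th percentile, i.e. the median).
with_gap = values.copy()
with_gap[0, 0, 0] = np.nan
median_over_axis0 = np.nanpercentile(with_gap, 50, axis=0)

print(scaled.shape, median_over_axis0.shape)  # (2, 3, 4) (3, 4)
```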

Hope this is helpful - thanks for your engagement @jreback

Comment From: shoyer

Yes, these sorts of issues are exactly why we wrote xray in the first place. The pandas API and internals weren't really designed with n-dimensional data in mind, which makes panels and nd-panel quite awkward.

xray mostly has the design I expected, I think, although does remember the collapsed dimension:

The collapsed dimension is essentially just metadata and can be safely ignored. I think @jreback was a little confused here, but scalar coordinates are not used for any sort of alignment.

IMO the xray.DataArray is almost strictly more useful than panels. The main feature gap is that we currently don't support MultiIndex in xray, but hopefully that will change soon.

Comment From: jreback

@MaximilianR

Since I understand you recently switched from using Panels to x-ray, can you elaborate on how it went? Good-bad-ugly?

If we deprecate Panel entirely and make to_panel return an x-ray object, what are the upsides / downsides?

Comment From: max-sixty

Sure - I'll give a short synthesis, and happy to answer any follow up questions you have.

Good:
- Clear, explicit API, very few surprises. Indexing in particular is very reliable. Stark contrast to Panel!
- Labeled dimensions, and the benefits that come with them - .sel, .isel (which becomes more important for higher dimensional datasets)
- Clear difference between a DataArray and Dataset, independent of dimensionality (the ability to have DataArrays aligned on different dimensions is awesome)

Bad - minor, and very specific to my experience:
- Index issues - for indexes whose .values aren't the same as the index (PeriodIndex, maybe tz?). PeriodIndex is very usable though given some recent minor changes. No MultiIndexes. @shoyer will have a better view here
- Smaller API - greater need to use numpy / bottleneck / numbagg functions. For example, .where doesn't take an other argument
- A bit less magic - for example, you can't slice a date index with a string ['2015'] - I think this is a big plus for XRay generally, but given that DataArrays can only be a single type, that would have to be handled in .to_panel

Overall it's a beautiful library, both for exploratory work and for production. I'm very excited to be using it, and grateful to @shoyer for creating it.

I don't have a strong view on whether we should make to_panel return an XRay DataArray, but I do think we should choose and articulate a vision & roadmap on Panel vs XRay - the time the community spends on improving Panel around the edges is a waste IMHO, and it's the role of the maintainers to ensure that contributors know whether they're working on sustainable products.

Let me know if I can help beyond this at all, Max

Comment From: shoyer

The good news is that almost all of @MaximilianR's issues should be fixable with a bit more work -- there are no fundamental design issues. For example, I just made a PR adding MultiIndex support (https://github.com/xray/xray/pull/702).

for example, you can't slice an date index with a string ['2015']

Could you share an example where this fails? There may be a bug here -- we've had support for string indexing of datetime indexes since almost the beginning: http://xray.readthedocs.org/en/stable/time-series.html#datetime-indexing

Comment From: max-sixty

That should read PeriodIndex:

In [51]: ds=xray.Dataset(coords={'date':pd.period_range(periods=10,start='2000')})

In [52]: ds['d']=('date', pd.np.random.rand(10))

In [53]: ds.sel(date='2000')
Out[53]: 
<xray.Dataset>
Dimensions:  ()
Coordinates:
    date     object 2000-01-01
Data variables:
    d        float64 0.8965

Confirming it works for DatetimeIndex:

In [54]: ds=xray.Dataset(coords={'date':pd.date_range(periods=10,start='2000')})

In [55]: ds['d']=('date', pd.np.random.rand(10))

In [56]: ds.sel(date='2000')
Out[56]: 
<xray.Dataset>
Dimensions:  (date: 10)
Coordinates:
  * date     (date) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
    d        (date) float64 0.09303 0.5456 0.4934 0.08438 0.1854 0.2823 ...
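
For comparison, a small sketch of the pandas-side behaviour I had in mind: selecting with a year string via .loc on a plain pandas Series, once with a PeriodIndex and once with a DatetimeIndex. This assumes a reasonably recent pandas and daily frequency; the series names are illustrative only.

```python
import numpy as np
import pandas as pd

period_series = pd.Series(
    np.arange(10), index=pd.period_range(start="2000", periods=10, freq="D")
)
datetime_series = pd.Series(
    np.arange(10), index=pd.date_range(start="2000", periods=10, freq="D")
)

# Partial string selection: the year string selects every entry that falls in 2000.
period_year = period_series.loc["2000"]
datetime_year = datetime_series.loc["2000"]

print(len(period_year), len(datetime_year))
```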

Comment From: jreback

closing as Panels are deprecated