Pandas ENH: set multi-index names as NamedTuples

>>> import pandas as PD
>>> s1=PD.Series((1,4,9,16), name=dict(a=1,b=2))
>>> s2=s1*2
>>> s2.name=dict(c=5,d=7)
>>> s1
0     1
1     4
2     9
3    16
Name: {'a': 1, 'b': 2}, dtype: int32
>>> df = PD.concat((s1,s2), axis=1) # this should raise ValueError as the resulting df is broken
>>> df.shape # ok, this works...
(4, 2)
>>> df # But this doesn't.
) failed: TypeError: unhashable type: 'dict'>
>>> df.columns # This is not a proper index...
Index([{'a': 1, 'b': 2}, {'c': 5, 'd': 7}], dtype='object')

Comments:

1. There may be other ways aside from concat that lead to broken DataFrames if a Series.name attribute is unhashable.
2. If Series.name is a dict, one could try to convert it into a namedtuple, which is hashable.
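The conversion suggested in the second comment can be sketched with the standard library alone. The helper name dict_to_namedtuple and the type name Name are made up for illustration, and the sketch assumes that every dict key is a valid Python identifier and every value is itself hashable.

```python
from collections import namedtuple


def dict_to_namedtuple(mapping, typename="Name"):
    """Return a hashable namedtuple carrying the same key-value pairs.

    Hypothetical helper, not part of pandas. Assumes all keys are valid
    identifiers and all values are hashable.
    """
    cls = namedtuple(typename, list(mapping))
    return cls(**mapping)


name = dict_to_namedtuple({"a": 1, "b": 2})
print(name)              # Name(a=1, b=2)
lookup = {name: "ok"}    # usable as a dict key, unlike the original dict
print(lookup[name])      # ok
```

Each call creates a fresh namedtuple class, so two names built from dicts with the same keys compare equal as tuples but are instances of different classes.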

Comment From: jorisvandenbossche

Shouldn't we just disallow that a Series name is unhashable?

Comment From: dr-leo

Good proposal. This would make the API clear and consistent. It should also be less error-prone than catching all situations in which a Series.name is propagated to an index.

And one can always use namedtuples instead of dict to store multi-dimensional key-value pairs as metadata in Series.name or df.columns. This could be mentioned somewhere in the docs, I think, as namedtuple is not so widely used, although it comes in handy.

The following is a bit off-topic, but I am kind of struggling with building a namedtuple class factory that returns a singleton, i.e. one that does not create a new namedtuple class if the same one has been created before. I found a verbose and hardly elegant solution in pandaSDMX... So I was wondering if pandas could provide a utility function for this common task...
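One possible way to get such a singleton factory, assuming the field names can be passed as a tuple, is to memoize collections.namedtuple with functools.lru_cache. The name cached_namedtuple is hypothetical; this is a sketch, not an existing pandas or pandaSDMX utility.

```python
from collections import namedtuple
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_namedtuple(typename, field_names):
    """Return the same namedtuple class for the same (typename, fields).

    Hypothetical helper. field_names must be a tuple, because lru_cache
    uses the arguments as the cache key and therefore needs them hashable.
    """
    return namedtuple(typename, field_names)


Key1 = cached_namedtuple("Key", ("continent", "country"))
Key2 = cached_namedtuple("Key", ("continent", "country"))
print(Key1 is Key2)  # True: the class was created only once
```

The cache key includes the order of the fields, so ("continent", "country") and ("country", "continent") yield two distinct classes.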

(Reply to Joris Van den Bossche's message of 13.09.2014, 21:11.)

— Reply to this email directly or view it on GitHub https://github.com/pydata/pandas/issues/8263#issuecomment-55503633.

Comment From: jreback

@dr-leo pull-request for this?

I think you can simply intercept the __setattr__ call (and validate) in core/generic.py
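As a rough illustration of this hint, the toy class below validates the name inside __setattr__. It is a stand-in written for this thread, not the code in core/generic.py, and the error message is an assumption.

```python
class Named:
    """Toy stand-in for a pandas object; not the actual pandas code."""

    def __init__(self, name=None):
        self.name = name  # goes through __setattr__ below

    def __setattr__(self, attr, value):
        if attr == "name":
            try:
                hash(value)  # raises TypeError for dict, list, set, ...
            except TypeError:
                raise TypeError("name must be a hashable type") from None
        object.__setattr__(self, attr, value)


s = Named(name=("a", 1))       # tuples of hashables are accepted
try:
    s.name = {"a": 1, "b": 2}  # dicts are rejected
except TypeError as exc:
    print(exc)                 # name must be a hashable type
```

Calling hash() rather than checking against collections.abc.Hashable also catches a tuple that contains an unhashable item.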

Comment From: dr-leo

Do we agree on what we want?

Option 1: require Series.name to be hashable, as suggested by Joris.
Option 2: enforce that df.columns is hashable.

As I wrote, I favour option 1. But your hints for a PR seem to imply option 2.

(Reply to jreback's message of 14.09.2014, 00:49.)

Comment From: dr-leo

@Jeff:

I am unfamiliar with the pandas sources. But I've seen that Series inherits from core.generic.NDFrame. So your hint makes sense even for option 1, i.e. preventing Series.name from being set to an unhashable type.

I will look into this and if I succeed, send my first PR. But due to other projects it will take a couple of weeks. I see no urgency. This is likely for v0.16.

@All: for most of you this is a matter of a couple of minutes. So if you feel this should be done earlier, feel free to jump in.

Leo


Comment From: dr-leo

I have just submitted a minimalist PR addressing the easier part of this issue, i.e., validating that Series.name is hashable.

Comment From: dr-leo

Some thoughts:

Consider the following script:

from numpy.random import randn
import pandas as PD
import pdb
idx = PD.MultiIndex.from_product((('one', 'two'), ('foo', 'bar')), names=('level1', 'level2'))
df = PD.DataFrame(randn(4, 4), columns=idx)
pdb.set_trace()
s = df.one.bar

1. The purpose of this issue is this: when extracting a column from a multi-indexed df, the returned Series should not just be named after the immediate key at the lowest index level. Its name should be a namedtuple whose field names are the MI level names and whose items are the keys at each MI level.
2. Tracing through the execution of a command such as s=df.one.bar shows what one would expect: the Series at df.one.bar is retrieved by walking down the MI levels. Here we have two levels. df.one yields a 2-column df with one level and keys foo and bar. The first index level (one, two) is forgotten as it is not part of the sub-df. The second iteration yields the actual Series with name 'bar'.
3. If we want to set the Series.name to the namedtuple as described above, we need to remember the level names and the labels. How? Where?
4. I find it inconsistent to attach to the Series the full path through the levels (i.e. Level1: one, Level2: bar), whereas df.one only yields a df with a simple index ('foo', 'bar'). df.one should have a multi-index as well, having just 'one' at Level1 and ('foo', 'bar') at Level2. This would reflect the idea of attaching the corresponding subset of the multi-index to whatever subset of columns we extract from a df. In other words: extracting a sub-frame or Series yields a multi-index with the same number of levels as the original frame, but with a smaller subset of tuples. For a Series, the single tuple is stored in a namedtuple, whereas a sub-frame will have a proper MultiIndex.

Do these semantics make sense? This would break some code, sure. But is Series.name as a namedtuple convincing without also changing the behavior for sub-frames?
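The name proposed in the first point can be sketched without pandas. In the snippet below, level_names plays the role of MultiIndex.names and keys is the path of labels selected so far; the function selection_name and the type name "name" are assumptions made for illustration.

```python
from collections import namedtuple


def selection_name(level_names, keys):
    """Build a namedtuple from the level names covered by keys.

    Hypothetical helper: level_names stands in for MultiIndex.names,
    keys is the path of selected labels, outermost level first.
    """
    fields = tuple(level_names[:len(keys)])
    return namedtuple("name", fields)(*keys)


levels = ("level1", "level2")
print(selection_name(levels, ("one",)))        # name(level1='one')
print(selection_name(levels, ("one", "bar")))  # name(level1='one', level2='bar')
```

The open question from the third point remains: something has to carry levels and the partial key path from df to df.one to df.one.bar, which this sketch simply receives as arguments.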

Comment From: toddrjen

I think it should use similar mechanics to now, except with the addition of a name. So it should return the same number of values as it does now. In cases where it currently drops a value, it should continue to do so. In cases where it doesn't currently drop a value, the value should still be maintained.

Comment From: dr-leo

Just to make sure I understand what you mean:

Do you agree that all NDFrames (Series, DataFrame, Index, MultiIndex and perhaps Panel) should have a name attribute?

I think this would be a good idea as it would make things more consistent and provide a place for simple metadata at DataFrame or Index level. So you could name a MultiIndex "Earth" and the levels continents, countries, regions etc. A DataFrame with this MultiIndex could be named 'GDP'. And df.America.Canada would have a 1-level column index with labels for the regions. Its name would be a namedtuple with two fields from the original DF (Continent, Country). If you pick the series for the region Quebec, the Series.name would have three fields: Continent, Country, Region.

This would not break any code.

(Reply to toddrjen's message of 17.01.2015, 09:16.)

Comment From: toddrjen

So names from successive slices should be appended to the existing name? What if the existing name is a string or ordinary tuple?

Comment From: dr-leo

In the Earth example, if you do

df = DataFrame(my_array, columns=my_multiindex, name='GDP')

and then

GDp_Canada = df.Americas.Canada

GDp_Canada would have an automatically generated name. It would be a namedtuple as described earlier.

GDp_Quebec = GDp_Canada.Quebec

would be a Series whose name would also be auto-generated, as it is now, but with more information on the position within the original frame. So there is no conflict with user-generated names.

(Reply to toddrjen's message of 17.01.2015, 13:31.)

Comment From: toddrjen

First, do we agree that both of these versions of GDp_Canada are identical?

>>> GDp_Canada = df.Americas.Canada

and:

>>> GDp_Americas = df.Americas
>>> GDp_Canada = GDp_Americas.Canada

If so, then I see two possible ways that the name generation can play out. The first is that names of previous levels are dropped when generating new names:

>>> df.name  # this has no pre-existing name
''
>>> GDp_Americas = df.Americas
>>> GDp_Americas.name
name(Continent='Americas')
>>> GDp_Canada = GDp_Americas.Canada
>>> GDp_Canada.name
name(Country='Canada')
>>> GDp_Quebec = GDp_Canada.Quebec
>>> GDp_Quebec.name
name(Region='Quebec')
>>> df.Americas.Canada.Quebec.name
name(Region='Quebec')

The other is that names are appended to the existing name:

>>> df.name  # this has no pre-existing name
''
>>> GDp_Americas = df.Americas
>>> GDp_Americas.name
name(Continent='Americas')
>>> GDp_Canada = GDp_Americas.Canada
>>> GDp_Canada.name
name(Continent='Americas', Country='Canada')
>>> GDp_Quebec = GDp_Canada.Quebec
>>> GDp_Quebec.name
name(Continent='Americas', Country='Canada', Region='Quebec')
>>> df.Americas.Canada.Quebec.name
name(Continent='Americas', Country='Canada', Region='Quebec')

The problem arises here, if we use the second scenario:

>>> df.name  # this now HAS a pre-existing name
'GDP'
>>> GDp_Americas = df.Americas
>>> GDp_Americas.name
????

What should be the resulting name from this situation?

Comment From: dr-leo

Please see my responses below.

In summary, names are technically not generated by appending to the previous name, but by creating a namedtuple class using the factory function collections.namedtuple(). Its fields are the index level names up to the current level. The name is set to an instance whose values come from the index levels (keys) of the requested sub-frame (df.Americas.Canada or df.Americas.Mexico etc.) or series (df.Americas.Canada.Quebec or ...Ontario). You cannot append to a namedtuple anyway. You have to make a class first, and then instantiate it. So you need exactly one namedtuple class (singleton) per index level. Any user-provided df-name is discarded.
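A minimal sketch of this "make a class first, then instantiate it" step follows. The helper extend_name is hypothetical, and it treats any current name without a _fields attribute (None, a string such as 'GDP') as a user-provided name to be discarded.

```python
from collections import namedtuple


def extend_name(current, field, value):
    """Return a new namedtuple with one more field than current.

    Hypothetical helper. A namedtuple cannot grow in place, so a new class
    is created from the old fields plus the new one. Anything that is not
    a namedtuple (None, 'GDP', ...) is discarded.
    """
    if hasattr(current, "_fields"):
        fields, values = current._fields, tuple(current)
    else:
        fields, values = (), ()
    cls = namedtuple("name", fields + (field,))
    return cls(*values, value)


americas = extend_name("GDP", "Continent", "Americas")
canada = extend_name(americas, "Country", "Canada")
quebec = extend_name(canada, "Region", "Quebec")
print(americas)  # name(Continent='Americas')
print(quebec)    # name(Continent='Americas', Country='Canada', Region='Quebec')
```

The _fields check cannot tell a user-supplied namedtuple from a generated one, which is the ambiguity toddrjen raises further down in this thread.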

On 17.01.2015 at 23:02, toddrjen wrote:

First, do we agree that both of these versions of GDp_Canada are identical?

>>> GDp_Canada = df.Americas.Canada

and:

>>> GDp_Americas = df.Americas
>>> GDp_Canada = GDp_Americas.Canada

** Yes. The '.' operator is and should remain associative.

If so, then I see two possible ways that the name generation can play out. The first is that names of previous levels are dropped when generating new names:

>>> df.name  # this has no pre-existing name

** It may have a user-provided name like s=Series([1,3], name='hello'). DataFrame() would have an optional kwarg 'name'.

>>> GDp_Americas = df.Americas
>>> GDp_Americas.name
name(Continent='Americas')
>>> GDp_Canada = GDp_Americas.Canada
>>> GDp_Canada.name
name(Country='Canada')
>>> GDp_Quebec = GDp_Canada.Quebec
>>> GDp_Quebec.name
name(Region='Quebec')
>>> df.Americas.Canada.Quebec.name
name(Region='Quebec')

** IMHO this would be second-best as it would provide insufficient info. The big idea is to have name tell the user where the sub-frame/series is derived from.

The other is that names are appended to the existing name:

>>> df.name  # this has no pre-existing name
''

** Same as before: it may have a user-provided name, or defaults to None, not ''.

>>> GDp_Americas = df.Americas
>>> GDp_Americas.name
name(Continent='Americas')

** Yes. df.name is discarded. The appending business starts at the third index level. At the second level you have a namedtuple with one field, i.e. Continent.

>>> GDp_Canada = GDp_Americas.Canada
>>> GDp_Canada.name
name(Continent='Americas', Country='Canada')
>>> GDp_Quebec = GDp_Canada.Quebec
>>> GDp_Quebec.name
name(Continent='Americas', Country='Canada', Region='Quebec')
>>> df.Americas.Canada.Quebec.name
name(Continent='Americas', Country='Canada', Region='Quebec')

** Exactly. That is how I think it should be.

The problem arises here:

>>> df.name  # this now HAS a pre-existing name
'GDP'

** I'm not sure if I see your point. If 'GDP' was the user-provided name, df.Americas.name won't be appended to it but auto-generated from the first index level, whatever df.name might have been. df.Americas is a different frame after all. It is up to the user to recall that its values are about GDP.

>>> GDp_Americas = df.Americas
>>> GDp_Americas.name
????

Comment From: toddrjen

On Sun, Jan 18, 2015 at 12:33 PM, dr-leo notifications@github.com wrote:

The problem arises here:

>>> df.name  # this now HAS a pre-existing name
'GDP'

** I'm not sure if I see your point. If 'GDP' was the user-provided name, df.Americas.name won't be appended to it but auto-generated from the first index level, whatever df.name might have been. df.Americas is a different frame after all. It is up to the user to recall that its values are about GDP.

So what if the existing name is a namedtuple that is otherwise identical to a generated one? How can pandas tell the difference between a user-defined namedtuple and a generated one?

Comment From: dr-leo

You have a point here. In a non-technical sense, generating the new name means appending the first index level name as a field, with its respective value, to the namedtuple of the original df. Technically, to create the namedtuple class for the new name, you take the fields from df.name and add the highest level name, such as country... Hence, the name for the resulting df or series is derived from the current name rather than from the multi-index of the original frame.

The answer to your problem is to do this whole business in a new read-only attribute named context, path, key or whatever. By default, its value would be an empty tuple, like so:

df.context == ()
True
df.Americas.context == ('continent': 'Americas')
True

So we need not care about user-defined names. These could be propagated unchanged like so:

df.name == df.Americas.name == df.Americas.Canada.name == 'GDP' True

But name propagation would be a separate issue. It could even be generalized to other types of column or row selection.
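To make the idea concrete, here is a toy sketch that keeps a user-provided name untouched and records the selection path in a separate read-only context. The Selection class, its select method and the nested dict with dummy values are all hypothetical, and the context is stored as a plain tuple of (level, label) pairs rather than a namedtuple to keep the sketch short.

```python
class Selection:
    """Toy nested container standing in for a frame with hierarchical columns.

    Hypothetical class, not a pandas API. name is user-provided and passed
    on unchanged; context is read-only and records the labels chosen so far.
    """

    def __init__(self, data, level_names, name=None, context=()):
        self._data = data
        self._level_names = tuple(level_names)
        self.name = name
        self._context = tuple(context)

    @property
    def context(self):
        return self._context  # no setter, so assignment raises AttributeError

    def select(self, label):
        level = self._level_names[len(self._context)]
        return Selection(self._data[label], self._level_names,
                         name=self.name,
                         context=self._context + ((level, label),))


# Dummy values; only the nesting matters here.
gdp = {"Americas": {"Canada": {"Quebec": 0, "Ontario": 0}}}
df = Selection(gdp, ("continent", "country", "region"), name="GDP")
canada = df.select("Americas").select("Canada")
print(df.context)      # ()
print(canada.context)  # (('continent', 'Americas'), ('country', 'Canada'))
print(canada.name)     # GDP
```

Because the generated path never touches name, the question of telling a user-defined namedtuple from a generated one does not arise in this design.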

(Reply to toddrjen's message of 18.01.2015, 13:51.)

Comment From: dr-leo

I now think we need one attribute or rather property for this in each dimension. It should be attached to the indices like so:

df.Americas.Canada.columns.context = ('continent': 'Americas')

df2.index.context = ('sex': 'female')

Same for Panel. I withdraw my earlier comment that it should be attached to DataFrame etc. - Clearly, it should be attached to Series directly:

df.Americas.Canada.Quebec.area = (..., 'country': 'Canada')

A more intuitive name might be area or selection rather than context.

An alternative implementation would use tuple rather than namedtuple plus a link to the index instance it has been derived from:

df.Americas.Canada.columns.area = ('Americas', )
df.Americas.Canada.columns.origin = df.columns

But this may have memory implications due to the reference to df.columns.

(Reply to Dr. Leo's message of 18.01.2015, 18:13.)

Comment From: TomAugspurger

Series.name now has to be hashable.

Namedtuples are fine for series.name

Let me know if there were unresolved issues in this thread.