In [8]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'], major_axis=date_range('1/1/2000', periods=5), minor_axis=['A', 'B', 'C', 'D'])
In [9]: wp2 = Panel(randn(2, 5, 4), items=['Item1', 'Item2'], major_axis=date_range('1/1/2000', periods=5), minor_axis=['A', 'B', 'C', 'D'])
In [10]: wp.loc[['Item1', 'Item2'], :, ['A', 'B']]
Out[10]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to B
In [11]: wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = wp2.loc[['Item1', 'Item2'], :, ['A', 'B']]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-275f8d9be664> in <module>()
----> 1 wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = wp2.loc[['Item1', 'Item2'], :, ['A', 'B']]
/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __setitem__(self, key, value)
85 indexer = self._convert_to_indexer(key)
86
---> 87 self._setitem_with_indexer(indexer, value)
88
89 def _has_valid_tuple(self, key):
/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _setitem_with_indexer(self, indexer, value)
182 values = self.obj.values
183 if np.prod(values.shape):
--> 184 values[indexer] = value
185
186 def _align_series(self, indexer, ser):
ValueError: array is not broadcastable to correct shape
Those should be the same shape so I don't see why this assignment shouldn't work. This is with pandas 0.11.0.
Comment From: twiecki
Seems like it's numpy-related. Is this expected slicing behavior?!
In [11]: x = np.arange(3**3).reshape((3, 3, 3))
In [12]: x.shape
Out[12]: (3, 3, 3)
In [15]: x[[0, 1], :, :].shape
Out[15]: (2, 3, 3)
In [16]: x[:, :, [0,1]].shape
Out[16]: (3, 3, 2)
In [17]: x[[0,1], :, [0,1]].shape
Out[17]: (2, 3)
In [20]: x[[0,1], :, 0:2].shape
Out[20]: (2, 3, 2)
Comment From: jreback
hah...just about to write back
1) this wasn't implemented 2) it's a bit harder than I thought to fix, so I have to push it to 0.12
it's straightforward to align the rhs (e.g. the other panel), but then I have to assign the values, which is actually non-trivial because you are (potentially) assigning in a non-contiguous way and have to do a multi-dim put
Comment From: cpcloud
possibly related: #3738
Comment From: twiecki
I suppose one can only fancy-index in one dimension at a time.
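For reference, plain NumPy can fancy-index (and assign) along several axes at once if you build an open-mesh indexer with np.ix_; ordinary lists on multiple axes get zipped together element-wise, which is why x[[0,1], :, [0,1]] above collapsed to shape (2, 3). A minimal NumPy-only sketch, independent of whatever pandas does internally:
import numpy as np
x = np.arange(3 ** 3).reshape((3, 3, 3))
# Lists on two axes are paired element-wise: this picks (0, :, 0) and (1, :, 1).
x[[0, 1], :, [0, 1]].shape           # (2, 3)
# np.ix_ builds an open-mesh (outer-product) indexer that keeps all axes
# and can also be used on the left-hand side of an assignment.
idx = np.ix_([0, 1], [0, 1, 2], [0, 1])
x[idx].shape                         # (2, 3, 2)
x[idx] = 0                           # a multi-dimensional "put"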
Comment From: twiecki
It seems in my case there is a work-around:
wp.update(wp2.loc[['Item1', 'Item2'], :, ['A', 'B']])
Would there be any problem with this?
Comment From: jreback
ahh..yes..that is a good solution
fyi, it will be a fair amount slower since it basically goes frame by frame, which then goes series by series, so if this panel is big at all it won't be fast
another way to approach this is to break the panel into frames, copy/update them as needed, then concat back together
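As a rough illustration of that frame-by-frame approach (a sketch only, written against the old, now-deprecated Panel API, so treat the exact calls as assumptions rather than a tested recipe):
import pandas as pd
# Split the panel into per-item DataFrames, update the slices of interest,
# then rebuild a Panel from the dict of frames.
frames = {item: wp[item].copy() for item in wp.items}
for item in ['Item1', 'Item2']:
    frames[item].loc[:, ['A', 'B']] = wp2[item].loc[:, ['A', 'B']]
wp_new = pd.Panel(frames)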
Comment From: twiecki
OK, thanks for the info!
I might wait until 0.12 then if the performance hit is not too dramatic for my case.
Comment From: dragoljub
I just noticed this issue (or something related) silently caused one of my processing flows to assign erroneous data.
I find myself constantly wanting to group on several columns, compute a new column based on the grouped data, and assign that result column back to the original data frame, indexing the original DataFrame using the group's multi-index name. This worked in 0.10.0.
Assigning to a df column indexed via .ix[] with a multi-index tuple and a column name used to work, e.g. df.ix[(1,2,3,4), 'new_col'] = np.arange(100). Now (in pandas 0.11.0) it just fills the entire assignment target by repeating the first entry of the np.array. Is there any fix for this in a dev version? I might have to drop back down to 0.10.0.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: print pd.__version__
0.11.0
In [4]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [5]: NUM_COLS = 10
In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [7]: index_cols = col_names[:5]
In [8]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [9]: df = df.set_index(index_cols).sort_index()
In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [11]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [12]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [13]: # Create new empty column in DataFrame
....: df['new_col'] = np.nan
....:
In [14]: # Loop through groups and assign an array of new values to the new_col column of the original DataFrame
....: for name, df2 in grp:
....: new_vals = np.arange(df2.shape[0])
....: print 'Group: ', name
....: print 'Expected:\n', pd.Series(new_vals).value_counts()
....: df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
....: print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
....: print '\n'
....:
Group: (0, 0, 0, 0)
Expected:
155 1
48 1
55 1
54 1
53 1
52 1
51 1
50 1
49 1
47 1
57 1
46 1
45 1
44 1
43 1
...
113 1
112 1
111 1
110 1
109 1
108 1
107 1
106 1
105 1
104 1
103 1
102 1
101 1
100 1
0 1
Length: 156, dtype: int64
Assigned:
0 156
dtype: int64
Group: (0, 0, 0, 1)
Expected:
147 1
54 1
52 1
51 1
50 1
49 1
48 1
47 1
46 1
45 1
44 1
43 1
42 1
41 1
40 1
...
108 1
107 1
106 1
105 1
104 1
103 1
101 1
94 1
100 1
99 1
98 1
97 1
96 1
95 1
0 1
Length: 148, dtype: int64
Assigned:
0 148
dtype: int64
Comment From: jreback
@dragoljub I think the issue is related to #3668, which has been fixed in master and will be in the upcoming 0.11.1 (very soon)
The issue described here is related to Panel <-> Panel assignment via ix/loc
Comment From: dragoljub
Any idea if the nightly builds at http://pandas.pydata.org/pandas-build/dev/ have this update? I have not seen any updated binary since April.
Comment From: dragoljub
I'm seeing this issue again when iterating through a DataFrame's groupby object and indexing the original DataFrame with the returned groupby index tuple. If there is a better way of iterating through groups and updating/adding columns for each group in the original DataFrame, I'm all ears.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: print pd.__version__
0.12.0
In [4]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [5]: NUM_COLS = 10
In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [7]: index_cols = col_names[:5]
In [8]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [9]: df = df.set_index(index_cols).sort_index()
In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [11]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [12]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [13]: # Create new empty column in DataFrame
....: df['new_col'] = np.nan
....:
In [14]: # Loop through groups and assign an array of new values to the new_col column of the original DataFrame
....: for name, df2 in grp:
....: new_vals = np.arange(df2.shape[0])
....: print 'Group: ', name
....: print 'Expected:\n', pd.Series(new_vals).value_counts()
....: df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
....: print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
....: print '\n'
....:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-06123ee45450> in <module>()
4 print 'Group: ', name
5 print 'Expected:\n', pd.Series(new_vals).value_counts()
----> 6 df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
7 print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
8 print '\n'
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __setitem__(self, key, value)
86 indexer = self._convert_to_indexer(key)
87
---> 88 self._setitem_with_indexer(indexer, value)
89
90 def _has_valid_tuple(self, key):
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _setitem_with_indexer(self, indexer, value)
156 # we have an equal len list/ndarray
157 elif len(labels) == 1 and (
--> 158 len(self.obj[labels[0]]) == len(value) or len(plane_indexer[0]) == len(value)):
159 setter(labels[0], value)
160
TypeError: object of type 'slice' has no len()
Group: (0, 0, 0, 0)
Expected:
162 1
40 1
58 1
57 1
56 1
55 1
54 1
53 1
52 1
51 1
50 1
49 1
48 1
47 1
46 1
...
117 1
116 1
115 1
114 1
113 1
111 1
103 1
110 1
109 1
108 1
107 1
106 1
105 1
104 1
0 1
Length: 163, dtype: int64
Comment From: jreback
@dragoljub this is a bug, fixing in master...but I wouldn't do it this way anyhow....do something like the following, which I think is what you want, or can at least get you there
In [2]: NUM_ROWS = 100
In [3]: NUM_COLS = 10
In [4]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [5]: index_cols = col_names[:5]
In [6]: df = DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
In [7]: df = df.set_index(index_cols).sort_index()
In [8]: grp = df.groupby(level=index_cols[:4])
In [9]: f_index = np.arange(5)
In [10]: def f(name,df2):
....: return Series(np.arange(df2.shape[0]),name=df2.index.values[0]).reindex(f_index)
....:
In [11]: new_df = pd.concat([ f(name,df2) for name, df2 in grp ],axis=1).T
In [12]: df
Out[12]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100 entries, (0, 0, 0, 3, 0) to (4, 4, 3, 1, 0)
Data columns (total 5 columns):
A5 100 non-null values
A6 100 non-null values
A7 100 non-null values
A8 100 non-null values
A9 100 non-null values
dtypes: int64(5)
In [13]: new_df
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Index: 90 entries, (0, 0, 0, 3, 0) to (4, 4, 3, 1, 0)
Data columns (total 5 columns):
0 90 non-null values
1 9 non-null values
2 1 non-null values
3 0 non-null values
4 0 non-null values
dtypes: float64(5)
In [14]: df['x0'] = new_df[0]
In [15]: df['x1'] = new_df[1]
In [16]: df
Out[16]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100 entries, (0, 0, 0, 3, 0) to (4, 4, 3, 1, 0)
Data columns (total 7 columns):
A5 100 non-null values
A6 100 non-null values
A7 100 non-null values
A8 100 non-null values
A9 100 non-null values
x0 91 non-null values
x1 10 non-null values
dtypes: float64(2), int64(5)
Comment From: dragoljub
Looks interesting. My ultimate goal is to apply various machine learning/clustering algorithms across a DataFrame's subgroups, where each algorithm returns a series of results (one for each row in a subgroup). I think the ultimate solution would be the flexible-apply function (http://pandas.pydata.org/pandas-docs/dev/groupby.html#flexible-apply): df.groupby(['a','b','c','d']).apply(my_ml_function), where the function returns a results series and the groupby apply method then combines everything into a data frame with a new column labeled 'my_ml_function' holding the results of each application of my_ml_function.
Maybe I just need to think about it differently and expect the result of flexible apply to be one long series of results, then join the results back to the original DataFrame (see the sketch below). It would be nice to have the option to augment the data frame with just the column I wish to add. :)
Then one command could run a bunch of analytics across subgroups while keeping the results joined to the original data, for plotting etc.
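As a rough sketch of that join-back pattern (my_ml_function here is just a placeholder that returns one value per row of its group; the real clustering/ML step would go in its place):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1],
                   'b': [1, 2, 1, 2],
                   'x': np.random.randn(4)})

def my_ml_function(group):
    # placeholder: one label per row, indexed like the group
    return pd.Series(np.arange(len(group)), index=group.index)

# compute per-group results, then assign back; the shared index keeps the
# new column aligned with the original rows
labels = pd.concat([my_ml_function(g) for _, g in df.groupby(['a', 'b'])])
df['my_ml_function'] = labels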
Comment From: jreback
you can do this now. What is too complicated is your assignment method. Just compute what you need in the apply (which is essentially what I did, but more 'manually') and create the resulting structure.
Comment From: jreback
@dragoljub see PR #4766 for the fixes for this issue (on master)
Comment From: dragoljub
WOW! groupby apply() is awesome! Last time I checked, apply did not work with a MultiIndex. For anyone who is interested, here is an example of applying ~625 clustering jobs across 625 groups in a 100k-element DataFrame. The syntax is great and the semantics are powerful. There is just one little thing: there is no lazy way to add a named series as a column to a DataFrame in-place. It seems like we have a df.pop() method; maybe we need a df.push() method to do something as simple as df[series.name] = series. df.append() takes a data frame, takes longer, and returns a copy. Why does pandas seem to be doing away with in-place methods?
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from sklearn.cluster import DBSCAN as DBSCAN
In [4]: print pd.__version__
0.12.0
In [5]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [6]: NUM_COLS = 10
In [7]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [8]: index_cols = col_names[:5]
In [9]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [10]: df = df.set_index(index_cols).sort_index()
In [11]: df
Out[11]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [12]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [13]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [14]: # Create function to apply clustering on groups
....: def grp_func(df):
....: """Run clustering on subgroup and return series of results."""
....: db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
....: return pd.Series(db.labels_, name='cluster_id')
....:
....: # Apply clustering on each subgroup of DataFrame
....:
In [15]: %time out = grp.apply(grp_func)
CPU times: user 33.32 s, sys: 0.00 s, total: 33.32 s
Wall time: 33.32 s
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 217 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: %timeit df['cluster_id'] = out
....:
10 loops, best of 3: 48.3 ms per loop
Comment From: jreback
try df.insert(loc,column,value)
Comment From: dragoljub
Thanks! Not sure how I missed insert. Still, having to specify the location and the column name is too much work. :smile_cat:
Anyone for adding a df.push() method? df.push(series) would append the series as a column at the end of a data frame, using the series name, and check for index alignment (I'm pretty sure df.insert() ensures index alignment..). This is essentially a df.insert shortcut, something like:
df.insert(len(df.columns), series.name, series)
Or maybe we could just add defaults for the loc and column parameters?
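For what it's worth, such a push could already be sketched as a plain helper function (push is not an existing pandas method; this just wraps df.insert as described above):
import pandas as pd

def push(df, series):
    # hypothetical helper: append a named Series as the last column,
    # letting df.insert handle the index alignment
    df.insert(len(df.columns), series.name, series)

df = pd.DataFrame({'A': [10, 20, 30]})
s = pd.Series([1.0, 2.0, 3.0], name='foo')
push(df, s)   # df now has columns ['A', 'foo']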
Comment From: jreback
use join
In [3]: df = DataFrame(randn(10,2),columns=list('AB'))
In [4]: s = Series(randn(5),name='foo')
In [5]: df.join(s)
Out[5]:
A B foo
0 -1.544531 -1.000338 -0.155670
1 -0.307918 0.070451 -1.074536
2 0.704317 -1.254720 -1.201768
3 -1.022448 -0.991398 1.531125
4 -1.294532 -1.244509 -0.696183
5 -1.578079 1.447263 NaN
6 -0.820401 1.134300 NaN
7 -0.690274 0.248999 NaN
8 -1.209722 -0.242846 NaN
9 0.599999 -0.300904 NaN
Comment From: dragoljub
Join returns a copy (which I don't need) and therefore appears to be 4x slower than simply inserting a column with alignment. Unless of course the assign statement does not align on MultiIndex.
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 217 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: %timeit df['cluster_id'] = out
....:
10 loops, best of 3: 48.3 ms per loop
Comment From: jreback
assignment does not copy (though it may do an internal copy), e.g. if you already have float data and add a float column then it will 'copy' it, but that's pretty cheap
join will for sure copy
inplace operations are not generally a good thing; state can change when least expected, so avoid them if at all possible.
why is it a problem to specify the column name? that seems a natural thing IMHO.
and your timings on the assignment are not valid (they happen to be faster, but not that much faster),
because only the first time is it a valid timing; after that it's a set, not an insert (which is where the copying happens). A set will just overwrite the data, but an insert can copy it.
You need to do the timings in a separate function, where you first copy the frame and then do the action on it.
Comment From: dragoljub
Interesting. Good info, Jeff. I guess I'm trying to avoid specifying the column name again because in my apply function I set the series name and want that to just propagate. I profiled it again with the functions you described and got similar results.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from sklearn.cluster import DBSCAN as DBSCAN
In [4]: print pd.__version__
0.12.0
In [5]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [6]: NUM_COLS = 10
In [7]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [8]: index_cols = col_names[:5]
In [9]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [10]: df = df.set_index(index_cols).sort_index()
In [11]: df
Out[11]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [12]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [13]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [14]: # Create function to apply clustering on groups
....: def grp_func(df):
....: """Run clustering on subgroup and return series of results."""
....: db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
....: return pd.Series(db.labels_, name='cluster_id')
....:
....: # Apply clustering on each subgroup of DataFrame
....:
In [15]: %time out = grp.apply(grp_func)
CPU times: user 34.27 s, sys: 0.00 s, total: 34.27 s
Wall time: 34.27 s
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 232 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: #%timeit df[out.name] = out
....:
....: # Here is another way to insert
....: #%time df.insert(len(df.columns), out.name + '_2', out)
....:
....: def insert_col(df, ser):
....: df2 = df.copy()
....: ser2 = ser.copy()
....: df2[ser2.name] = ser2
....: return df2
....:
....:
In [18]: def join_col(df, ser):
....: df2 = df.copy()
....: ser2 = ser.copy()
....: df2.join(ser2)
....: return df2
....:
In [19]: %timeit dfa = insert_col(df, out)
10 loops, best of 3: 53.5 ms per loop
In [20]: %timeit dfb = join_col(df, out)
1 loops, best of 3: 223 ms per loop
Comment From: jreback
closing as Panel is deprecated