In [8]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'], major_axis=date_range('1/1/2000', periods=5), minor_axis=['A', 'B', 'C', 'D'])
In [9]: wp2 = Panel(randn(2, 5, 4), items=['Item1', 'Item2'], major_axis=date_range('1/1/2000', periods=5), minor_axis=['A', 'B', 'C', 'D'])
In [10]: wp.loc[['Item1', 'Item2'], :, ['A', 'B']]
Out[10]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to B
In [11]: wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = wp2.loc[['Item1', 'Item2'], :, ['A', 'B']]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-275f8d9be664> in <module>()
----> 1 wp.loc[['Item1', 'Item2'], :, ['A', 'B']] = wp2.loc[['Item1', 'Item2'], :, ['A', 'B']]
/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __setitem__(self, key, value)
85 indexer = self._convert_to_indexer(key)
86
---> 87 self._setitem_with_indexer(indexer, value)
88
89 def _has_valid_tuple(self, key):
/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _setitem_with_indexer(self, indexer, value)
182 values = self.obj.values
183 if np.prod(values.shape):
--> 184 values[indexer] = value
185
186 def _align_series(self, indexer, ser):
ValueError: array is not broadcastable to correct shape
Those should be the same shape so I don't see why this assignment shouldn't work. This is with pandas 0.11.0.
Comment From: twiecki
Seems like it's numpy-related. Is this expected slicing behavior?!
In [11]: x = np.arange(3**3).reshape((3, 3, 3))
In [12]: x.shape
Out[12]: (3, 3, 3)
In [15]: x[[0, 1], :, :].shape
Out[15]: (2, 3, 3)
In [16]: x[:, :, [0,1]].shape
Out[16]: (3, 3, 2)
In [17]: x[[0,1], :, [0,1]].shape
Out[17]: (2, 3)
In [20]: x[[0,1], :, 0:2].shape
Out[20]: (2, 3, 2)
Comment From: jreback
hah...just about to write back
1) this wasn't implemented 2) it's a bit harder than I thought to fix, so I have to push it to 0.12
it's straightforward to align the rhs (e.g. the other panel), but then I have to assign the values, which is actually non-trivial because you are (potentially) assigning in a non-contiguous way and have to do a multi-dim put
Comment From: cpcloud
possibly related: #3738
Comment From: twiecki
I suppose one can only fancy-index in one dimension at a time.
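For reference, plain NumPy can fancy-index (and assign) along several axes at once if you build an open-mesh indexer with np.ix_; ordinary lists on multiple axes get zipped together element-wise, which is why x[[0,1], :, [0,1]] above collapsed to shape (2, 3). A minimal NumPy-only sketch, independent of whatever pandas does internally:
import numpy as np
x = np.arange(3 ** 3).reshape((3, 3, 3))
# Lists on two axes are paired element-wise: this picks (0, :, 0) and (1, :, 1).
x[[0, 1], :, [0, 1]].shape           # (2, 3)
# np.ix_ builds an open-mesh (outer-product) indexer that keeps all axes
# and can also be used on the left-hand side of an assignment.
idx = np.ix_([0, 1], [0, 1, 2], [0, 1])
x[idx].shape                         # (2, 3, 2)
x[idx] = 0                           # a multi-dimensional "put"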
Comment From: twiecki
It seems in my case there is a work-around:
wp.update(wp2.loc[['Item1', 'Item2'], :, ['A', 'B']])
Would there be any problem with this?
Comment From: jreback
ahh..yes..that is a good solution
fyi, it will be a fair amount slower since it basically goes frame by frame, which then goes series by series, so if this panel is big at all it won't be fast
another way to approach this is to break the panel into frames, copy/update them as needed, then concat back together
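As a rough illustration of that frame-by-frame approach (a sketch only, written against the old, now-deprecated Panel API, so treat the exact calls as assumptions rather than a tested recipe):
import pandas as pd
# Split the panel into per-item DataFrames, update the slices of interest,
# then rebuild a Panel from the dict of frames.
frames = {item: wp[item].copy() for item in wp.items}
for item in ['Item1', 'Item2']:
    frames[item].loc[:, ['A', 'B']] = wp2[item].loc[:, ['A', 'B']]
wp_new = pd.Panel(frames)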
Comment From: twiecki
OK, thanks for the info!
I might wait until 0.12 then if the performance hit is not too dramatic for my case.
Comment From: dragoljub
I just noticed this issue (or something related) silently caused one of my processing flows to assign erroneous data.
I find myself constantly wanting to group on several columns, compute a new column based on the grouped data, and assign that result column back to the original data frame, indexing the original DataFrame using the group's multi-index name. This worked in 0.10.0.
Assigning to a df column indexed via .ix[] with a multi-index tuple and a column name used to work, e.g. df.ix[(1,2,3,4), 'new_col'] = np.arange(100). Now (in pandas 0.11.0) it just fills the entire assignment target by repeating the first entry of the np.array. Is there any fix for this in a dev version? I might have to drop back down to 0.10.0.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: print pd.__version__
0.11.0
In [4]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [5]: NUM_COLS = 10
In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [7]: index_cols = col_names[:5]
In [8]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [9]: df = df.set_index(index_cols).sort_index()
In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [11]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [12]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [13]: # Create new empty column in DataFrame
....: df['new_col'] = np.nan
....:
In [14]: # Loop through groups and assign an array of new values to the new_col column of the original DataFrame
....: for name, df2 in grp:
....: new_vals = np.arange(df2.shape[0])
....: print 'Group: ', name
....: print 'Expected:\n', pd.Series(new_vals).value_counts()
....: df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
....: print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
....: print '\n'
....:
Group: (0, 0, 0, 0)
Expected:
155 1
48 1
55 1
54 1
53 1
52 1
51 1
50 1
49 1
47 1
57 1
46 1
45 1
44 1
43 1
...
113 1
112 1
111 1
110 1
109 1
108 1
107 1
106 1
105 1
104 1
103 1
102 1
101 1
100 1
0 1
Length: 156, dtype: int64
Assigned:
0 156
dtype: int64
Group: (0, 0, 0, 1)
Expected:
147 1
54 1
52 1
51 1
50 1
49 1
48 1
47 1
46 1
45 1
44 1
43 1
42 1
41 1
40 1
...
108 1
107 1
106 1
105 1
104 1
103 1
101 1
94 1
100 1
99 1
98 1
97 1
96 1
95 1
0 1
Length: 148, dtype: int64
Assigned:
0 148
dtype: int64
Comment From: jreback
@dragoljub I think the issue is related to #3668, which has been fixed in master and will be in the upcoming 0.11.1 (very soon)
The issue described here is related to Panel <-> Panel assignment via ix/loc
Comment From: dragoljub
Any idea if the nightly builds at http://pandas.pydata.org/pandas-build/dev/ have this update? I have not seen any updated binary since April.
Comment From: dragoljub
I'm seeing this issue again when iterating through a DataFrame's groupby object and indexing the original DataFrame with the returned groupby index tuple. If there is a better way of iterating through groups and updating/adding columns for each group in the original DataFrame, I'm all ears.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: print pd.__version__
0.12.0
In [4]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [5]: NUM_COLS = 10
In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [7]: index_cols = col_names[:5]
In [8]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [9]: df = df.set_index(index_cols).sort_index()
In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [11]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [12]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [13]: # Create new empty column in DataFrame
....: df['new_col'] = np.nan
....:
In [14]: # Loop through groups and assign an array of new values to the new_col column of the original DataFrame
....: for name, df2 in grp:
....: new_vals = np.arange(df2.shape[0])
....: print 'Group: ', name
....: print 'Expected:\n', pd.Series(new_vals).value_counts()
....: df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
....: print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
....: print '\n'
....:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-06123ee45450> in <module>()
4 print 'Group: ', name
5 print 'Expected:\n', pd.Series(new_vals).value_counts()
----> 6 df.ix[name, 'new_col'] = new_vals #This used to work, but now only assigns the first number from the np.array
7 print '\nAssigned:\n', df.ix[name, 'new_col'].value_counts()
8 print '\n'
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __setitem__(self, key, value)
86 indexer = self._convert_to_indexer(key)
87
---> 88 self._setitem_with_indexer(indexer, value)
89
90 def _has_valid_tuple(self, key):
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _setitem_with_indexer(self, indexer, value)
156 # we have an equal len list/ndarray
157 elif len(labels) == 1 and (
--> 158 len(self.obj[labels[0]]) == len(value) or len(plane_indexer[0]) == len(value)):
159 setter(labels[0], value)
160
TypeError: object of type 'slice' has no len()
Group: (0, 0, 0, 0)
Expected:
162 1
40 1
58 1
57 1
56 1
55 1
54 1
53 1
52 1
51 1
50 1
49 1
48 1
47 1
46 1
...
117 1
116 1
115 1
114 1
113 1
111 1
103 1
110 1
109 1
108 1
107 1
106 1
105 1
104 1
0 1
Length: 163, dtype: int64
Comment From: jreback
@dragoljub this is a bug, fixing in master...but I wouldn't do it this way anyhow....do something like the following, which I think is what you want, or can at least get you there
In [2]: NUM_ROWS = 100
In [3]: NUM_COLS = 10
In [4]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [5]: index_cols = col_names[:5]
In [6]: df = DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
In [7]: df = df.set_index(index_cols).sort_index()
In [8]: grp = df.groupby(level=index_cols[:4])
In [9]: f_index = np.arange(5)
In [10]: def f(name,df2):
....: return Series(np.arange(df2.shape[0]),name=df2.index.values[0]).reindex(f_index)
....:
In [11]: new_df = pd.concat([ f(name,df2) for name, df2 in grp ],axis=1).T
In [12]: df
Out[12]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100 entries, (0, 0, 0, 3, 0) to (4, 4, 3, 1, 0)
Data columns (total 5 columns):
A5 100 non-null values
A6 100 non-null values
A7 100 non-null values
A8 100 non-null values
A9 100 non-null values
dtypes: int64(5)
In [13]: new_df
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Index: 90 entries, (0, 0, 0, 3, 0) to (4, 4, 3, 1, 0)
Data columns (total 5 columns):
0 90 non-null values
1 9 non-null values
2 1 non-null values
3 0 non-null values
4 0 non-null values
dtypes: float64(5)
In [14]: df['x0'] = new_df[0]
In [15]: df['x1'] = new_df[1]
In [16]: df
Out[16]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100 entries, (0, 0, 0, 3, 0) to (4, 4, 3, 1, 0)
Data columns (total 7 columns):
A5 100 non-null values
A6 100 non-null values
A7 100 non-null values
A8 100 non-null values
A9 100 non-null values
x0 91 non-null values
x1 10 non-null values
dtypes: float64(2), int64(5)
Comment From: dragoljub
Looks interesting. My ultimate goal is to apply various machine learning/clustering algorithms across a DataFrame's subgroups, where each algorithm returns a series of results (one for each row in a subgroup). I think the ultimate solution would be the flexible-apply function (http://pandas.pydata.org/pandas-docs/dev/groupby.html#flexible-apply): df.groupby(['a','b','c','d']).apply(my_ml_function), where the function returns a results series and the groupby apply method then combines everything into a data frame with a new column labeled 'my_ml_function' holding the results of each application of my_ml_function.
Maybe I just need to think about it differently and expect the result of flexible apply to be one long series of results, then join the results back to the original DataFrame (see the sketch below). It would be nice to have the option to augment the data frame with just the column I wish to add. :)
Then one command could run a bunch of analytics across subgroups while keeping the results joined to the original data, for plotting etc.
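As a rough sketch of that join-back pattern (my_ml_function here is just a placeholder that returns one value per row of its group; the real clustering/ML step would go in its place):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1],
                   'b': [1, 2, 1, 2],
                   'x': np.random.randn(4)})

def my_ml_function(group):
    # placeholder: one label per row, indexed like the group
    return pd.Series(np.arange(len(group)), index=group.index)

# compute per-group results, then assign back; the shared index keeps the
# new column aligned with the original rows
labels = pd.concat([my_ml_function(g) for _, g in df.groupby(['a', 'b'])])
df['my_ml_function'] = labels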
Comment From: jreback
you can do this now. What is too complicated is your assignment method. Just compute what you need in the apply (which is essentially what I did, but more 'manually') and create the resulting structure.
Comment From: jreback
@dragoljub see PR #4766 for the fixes for this issue (on master)
Comment From: dragoljub
WOW! groupby apply() is awesome! Last time I checked, apply did not work with a MultiIndex. For anyone who is interested, here is an example of applying ~625 clustering jobs across 625 groups in a 100k-element DataFrame. The syntax is great and the semantics are powerful. There is just one little thing: there is no lazy way to add a named series as a column to a DataFrame in-place. It seems like we have a df.pop() method; maybe we need a df.push() method to do something as simple as df[series.name] = series. df.append() takes a data frame, takes longer, and returns a copy. Why does pandas seem to be doing away with in-place methods?
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from sklearn.cluster import DBSCAN as DBSCAN
In [4]: print pd.__version__
0.12.0
In [5]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [6]: NUM_COLS = 10
In [7]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [8]: index_cols = col_names[:5]
In [9]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [10]: df = df.set_index(index_cols).sort_index()
In [11]: df
Out[11]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [12]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [13]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [14]: # Create function to apply clustering on groups
....: def grp_func(df):
....: """Run clustering on subgroup and return series of results."""
....: db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
....: return pd.Series(db.labels_, name='cluster_id')
....:
....: # Apply clustering on each subgroup of DataFrame
....:
In [15]: %time out = grp.apply(grp_func)
CPU times: user 33.32 s, sys: 0.00 s, total: 33.32 s
Wall time: 33.32 s
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 217 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: %timeit df['cluster_id'] = out
....:
10 loops, best of 3: 48.3 ms per loop
Comment From: jreback
try df.insert(loc,column,value)
Comment From: dragoljub
Thanks! Not sure how I missed insert. Still, having to specify the location and the column name is too much work. :smile_cat:
Anyone for adding a df.push() method? df.push(series) would append the series as a column at the end of a data frame, using the series name, and check for index alignment (I'm pretty sure df.insert() ensures index alignment..). This is essentially a df.insert shortcut, something like:
df.insert(len(df.columns), series.name, series)
Or maybe we could just add defaults for the loc and column parameters?
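For what it's worth, such a push could already be sketched as a plain helper function (push is not an existing pandas method; this just wraps df.insert as described above):
import pandas as pd

def push(df, series):
    # hypothetical helper: append a named Series as the last column,
    # letting df.insert handle the index alignment
    df.insert(len(df.columns), series.name, series)

df = pd.DataFrame({'A': [10, 20, 30]})
s = pd.Series([1.0, 2.0, 3.0], name='foo')
push(df, s)   # df now has columns ['A', 'foo']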
Comment From: jreback
use join
In [3]: df = DataFrame(randn(10,2),columns=list('AB'))
In [4]: s = Series(randn(5),name='foo')
In [5]: df.join(s)
Out[5]:
A B foo
0 -1.544531 -1.000338 -0.155670
1 -0.307918 0.070451 -1.074536
2 0.704317 -1.254720 -1.201768
3 -1.022448 -0.991398 1.531125
4 -1.294532 -1.244509 -0.696183
5 -1.578079 1.447263 NaN
6 -0.820401 1.134300 NaN
7 -0.690274 0.248999 NaN
8 -1.209722 -0.242846 NaN
9 0.599999 -0.300904 NaN
Comment From: dragoljub
Join returns a copy (which I don't need) and therefore appears to be 4x slower than simply inserting a column with alignment. Unless of course the assign statement does not align on MultiIndex.
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 217 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: %timeit df['cluster_id'] = out
....:
10 loops, best of 3: 48.3 ms per loop
Comment From: jreback
assignment does not copy (though it may do an internal copy), e.g. if you already have float data and add a float column then it will 'copy' it, but that's pretty cheap
join will for sure copy
inplace operations are not generally a good thing; state can change when least expected, so avoid them if at all possible.
why is it a problem to specify the column name? that seems a natural thing IMHO.
and your timings on the assignment are not valid (they happen to be faster, but not that much faster),
because only the first time is it a valid timing; after that it's a set, not an insert (which is where the copying happens). A set will just overwrite the data, but an insert can copy it.
You need to do the timings in a separate function, where you first copy the frame and then do the action on it.
Comment From: dragoljub
Interesting. Good info, Jeff. I guess I'm trying to avoid specifying the column name again because in my apply function I set the series name and want that to just propagate. I profiled it again with the functions you described and got similar results.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from sklearn.cluster import DBSCAN as DBSCAN
In [4]: print pd.__version__
0.12.0
In [5]: # Generate Test DataFrame
...: NUM_ROWS = 100000
...:
In [6]: NUM_COLS = 10
In [7]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
In [8]: index_cols = col_names[:5]
In [9]: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...:
In [10]: df = df.set_index(index_cols).sort_index()
In [11]: df
Out[11]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns (total 5 columns):
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [12]: # Group by first 4 index columns.
....: grp = df.groupby(level=index_cols[:4])
....:
In [13]: # Find index of largest group.
....: big_loc = grp.size().idxmax()
....:
In [14]: # Create function to apply clustering on groups
....: def grp_func(df):
....: """Run clustering on subgroup and return series of results."""
....: db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
....: return pd.Series(db.labels_, name='cluster_id')
....:
....: # Apply clustering on each subgroup of DataFrame
....:
In [15]: %time out = grp.apply(grp_func)
CPU times: user 34.27 s, sys: 0.00 s, total: 34.27 s
Wall time: 34.27 s
In [16]: # Add Cluster ID column to original df, too bad this creates a copy...
....: %timeit df.join(out)
....:
1 loops, best of 3: 232 ms per loop
In [17]: # Much faster but you have to specify column name :(
....: #%timeit df[out.name] = out
....:
....: # Here is another way to insert
....: #%time df.insert(len(df.columns), out.name + '_2', out)
....:
....: def insert_col(df, ser):
....: df2 = df.copy()
....: ser2 = ser.copy()
....: df2[ser2.name] = ser2
....: return df2
....:
....:
In [18]: def join_col(df, ser):
....: df2 = df.copy()
....: ser2 = ser.copy()
....: df2.join(ser2)
....: return df2
....:
In [19]: %timeit dfa = insert_col(df, out)
10 loops, best of 3: 53.5 ms per loop
In [20]: %timeit dfb = join_col(df, out)
1 loops, best of 3: 223 ms per loop
Comment From: jreback
closing as Panel is deprecated