Code Sample, a copy-pastable example if possible
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.23.0rc2'
In [3]: dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz='US/Pacific')
...: for i in range(1, 5)]
...: df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})
...: grouped = df.groupby('A')
...:
In [4]: df
Out[4]:
A B
0 a 2016-01-01 12:00:00-08:00
1 b 2016-01-02 12:00:00-08:00
2 a 2016-01-03 12:00:00-08:00
3 b 2016-01-04 12:00:00-08:00
In [5]: grouped.apply(lambda x: x.iloc[0])[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3062 try:
-> 3063 return self._engine.get_loc(key)
3064 except KeyError:
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
138
--> 139 cpdef get_loc(self, object val):
140 if is_definitely_invalid_key(val):
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
160 try:
--> 161 return self.mapping.get_item(val)
162 except (TypeError, ValueError):
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
1491
-> 1492 cpdef get_item(self, object val):
1493 cdef khiter_t k
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
1499 else:
-> 1500 raise KeyError(val)
1501
KeyError: 0
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-5-2b16555d6e05> in <module>()
----> 1 grouped.apply(lambda x: x.iloc[0])[0]
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in __getitem__(self, key)
2685 return self._getitem_multilevel(key)
2686 else:
-> 2687 return self._getitem_column(key)
2688
2689 def _getitem_column(self, key):
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in _getitem_column(self, key)
2692 # get column
2693 if self.columns.is_unique:
-> 2694 return self._get_item_cache(key)
2695
2696 # duplicate columns & possible reduce dimensionality
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\generic.py in _get_item_cache(self, item)
2485 res = cache.get(item)
2486 if res is None:
-> 2487 values = self._data.get(item)
2488 res = self._box_item_values(item, values)
2489 cache[item] = res
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3063 return self._engine.get_loc(key)
3064 except KeyError:
-> 3065 return self._engine.get_loc(self._maybe_cast_indexer(key))
3066
3067 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
137 util.set_value_at(arr, loc, value)
138
--> 139 cpdef get_loc(self, object val):
140 if is_definitely_invalid_key(val):
141 raise TypeError("'{val}' is an invalid key".format(val=val))
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
159
160 try:
--> 161 return self.mapping.get_item(val)
162 except (TypeError, ValueError):
163 raise KeyError(val)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
1490 sizeof(uint32_t)) # flags
1491
-> 1492 cpdef get_item(self, object val):
1493 cdef khiter_t k
1494 if val != val or val is None:
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
1498 return self.table.vals[k]
1499 else:
-> 1500 raise KeyError(val)
1501
1502 cpdef set_item(self, object key, Py_ssize_t val):
KeyError: 0
Problem description
Related to #20949 (and moved from there at request).
Note that if you do the following, it works:
In [6]: grouped.nth(0)['B'].iloc[0]
Out[6]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')
In [7]: grouped.apply(lambda x: x.iloc[0])[0]
Out[7]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')
So doing one operation (in this case nth) prior to the apply then makes the apply work.
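One way to check for this kind of order dependence is to compare the result of apply on a fresh groupby object with the result on an object that has already had nth called on it. The following is only a sketch: the helper name first_rows and the variable names are made up for illustration, and the warning filter is an assumption to keep newer pandas versions quiet about apply touching the grouping column.

```python
import warnings

import pandas as pd

dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz="US/Pacific")
         for i in range(1, 5)]
df = pd.DataFrame({"A": ["a", "b"] * 2, "B": dates})


def first_rows(grouped):
    # Hypothetical helper: take the first row of every group via apply.
    # Warnings are silenced because some pandas versions warn when apply
    # operates on the grouping column.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return grouped.apply(lambda x: x.iloc[0])


# Result on a groupby object that has not been used yet.
fresh = first_rows(df.groupby("A"))

# Result on a groupby object after nth has been called on it.
reused = df.groupby("A")
reused.nth(0)
after_nth = first_rows(reused)

# If a prior nth call changes what apply returns, the object carries state.
is_stateful = not fresh.equals(after_nth)
print("stateful:", is_stateful)
```

On a pandas version affected by this report the two results would be expected to differ; on versions where the state leak is gone they should match.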
Expected Output
In [5]: grouped.apply(lambda x: x.iloc[0])[0]
Out[5]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')
Output of pd.show_versions()
Comment From: jreback
so after #20959 this is easily fixed
the issue is that .nth holds onto the _group_selection state (and doesn't reset it properly)
e.g.
In [7]: g = df.groupby('A')
In [8]: g.nth(0)
Out[8]:
B
A
a 2016-01-01 12:00:00-08:00
b 2016-01-02 12:00:00-08:00
In [9]: g.apply(lambda x: x.iloc[0])
Out[9]:
A
a 2016-01-01 12:00:00-08:00
b 2016-01-02 12:00:00-08:00
dtype: datetime64[ns, US/Pacific]
but
In [10]: g = df.groupby('A')
In [11]: g.apply(lambda x: x.iloc[0])
Out[11]:
A B
A
a a 2016-01-01 12:00:00-08:00
b b 2016-01-02 12:00:00-08:00
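The pattern described above (one method narrowing a selection and never restoring it) can be illustrated without pandas. This is a toy sketch, not the pandas implementation: the class ToyGroupBy and its methods nth_leaky, nth_safe and _temporary_selection are hypothetical names invented for this example. It shows the leak and one common remedy, a context manager that restores the previous state in a finally block.

```python
from contextlib import contextmanager


class ToyGroupBy:
    """Hypothetical stand-in for a groupby object; not pandas code."""

    def __init__(self, rows, key):
        self.rows = rows          # list of dicts, one per row
        self.key = key            # name of the grouping column
        self._selection = None    # columns to expose; None means all

    def _visible(self, row):
        # Drop the columns hidden by the current selection state.
        if self._selection is None:
            return dict(row)
        return {k: v for k, v in row.items() if k in self._selection}

    def first_rows(self):
        # First row of each group, honouring the selection state.
        out = {}
        for row in self.rows:
            out.setdefault(row[self.key], self._visible(row))
        return out

    def nth_leaky(self):
        # Bug pattern: narrows the selection and never restores it.
        self._selection = [c for c in self.rows[0] if c != self.key]
        return self.first_rows()

    @contextmanager
    def _temporary_selection(self):
        previous = self._selection
        self._selection = [c for c in self.rows[0] if c != self.key]
        try:
            yield
        finally:
            # Always restore the earlier state, even if the body raises.
            self._selection = previous

    def nth_safe(self):
        with self._temporary_selection():
            return self.first_rows()


rows = [{"A": "a", "B": 1}, {"A": "b", "B": 2},
        {"A": "a", "B": 3}, {"A": "b", "B": 4}]

leaky = ToyGroupBy(rows, "A")
before = leaky.first_rows()
leaky.nth_leaky()
after = leaky.first_rows()
print(before == after)            # False: the grouping column has vanished

safe = ToyGroupBy(rows, "A")
safe_before = safe.first_rows()
safe.nth_safe()
safe_after = safe.first_rows()
print(safe_before == safe_after)  # True: the state was restored
```

In the leaky variant the second call loses the grouping column, much as Out[9] above loses column A after nth has been called.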
Comment From: Dr-Irv
@jreback isn't this now closed because of #20959 being done?
Comment From: jreback
no, this particular example still fails because of the ordering issue with .nth. basically the code it is using relies on the group selection being set (and doesn't unset it), so it's still incorrect, see comments on #20959
Comment From: jbrockmendel
This now looks non-stateful to me. The calls in [5] and [7] both raise the same exception regardless of whether [6] has been run first. @Dr-Irv can you confirm?
Comment From: Dr-Irv
Yes, I can confirm.
I don't remember creating this issue, but looking at the notes, it appears I copied the example from pandas/tests/groupby/aggregate/test_other.py in function test_agg_timezone_round_trip. That test code was modified in #27110 (https://github.com/pandas-dev/pandas/pull/27110/files#diff-44e2353ba876d1417a4bc718f02f7fdc03439e3d8464ad8760e175f7f505fc91) (line 424 replaced with line 426) by you, @jbrockmendel.
Looking at that code, I wouldn't expect it to work anyway, because we are indexing into a DataFrame with a key (0) that doesn't exist among its column labels.
So I think this can be closed: the original test case is not expected to work, and it is no longer stateful.
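The point about the missing indexer can be reproduced in a few lines. `[0]` on a DataFrame is a column-label lookup, so it raises KeyError when no column is named 0, while positional access through .iloc works. This is a sketch only: selecting column B explicitly before apply is a choice made here so that the result has the same shape across pandas versions, and it is not what the original report did.

```python
import pandas as pd

dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz="US/Pacific")
         for i in range(1, 5)]
df = pd.DataFrame({"A": ["a", "b"] * 2, "B": dates})

# Select column B explicitly (an assumption of this sketch) so the
# result is a DataFrame with the single column "B" indexed by group key.
result = df.groupby("A")[["B"]].apply(lambda x: x.iloc[0])

# Label-based lookup: there is no column labelled 0, so this raises.
try:
    result[0]
    raised = False
except KeyError:
    raised = True

# Pick the column by label, then the row by position.
first_ts = result["B"].iloc[0]

print("KeyError raised:", raised)
print("first timestamp:", first_ts)
```

Here first_ts compares equal to the first timestamp in the input, which is the value shown under Expected Output.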