When accessing labels with .loc using a list of Timestamp objects or a DatetimeIndex (see attached csv file), the index of the result is shifted to UTC wall-clock times but still carries the original timezone offset.
This seems to happen only in very specific cases: when the index passed to .loc contains labels that do not exist in the DataFrame and also contains duplicates.
@yuval-jether
import numpy as np  # needed for np.random.rand below
import pandas as pd
idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
s = pd.Series(np.random.rand(len(idx)), index=idx)
i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00',]).tz_localize('America/Chicago')
s.loc[i]
Out[39]:
2017-09-30 06:00:00-05:00 0.380138
2017-09-30 23:00:00-05:00 0.774696
2017-09-30 23:00:00-05:00 0.774696
2017-10-01 00:00:00-05:00 0.728027
dtype: float64
# Added a label that does not exist in the Series
import numpy as np  # needed for np.random.rand below
import pandas as pd
idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
s = pd.Series(np.random.rand(len(idx)), index=idx)
i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00', '2017-10-01 01:00:00',]).tz_localize('America/Chicago')
s.loc[i]
Out[40]:
2017-09-30 11:00:00-05:00 0.645350
2017-10-01 04:00:00-05:00 0.099323
2017-10-01 04:00:00-05:00 0.099323
2017-10-01 05:00:00-05:00 0.037136
2017-10-01 06:00:00-05:00 NaN
dtype: float64
Comment From: linar-jether
[deleted - see top post]
Comment From: jorisvandenbossche
@linar-jether Thanks for the report!
First note: in the just-released 0.21.0, using loc with some non-existent labels raises a warning, and this functionality will be removed in a future version.
If you want to do that, the warning points to reindex as an alternative:
In [17]: idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
...: s = pd.Series(np.random.rand(len(idx)), index=idx)
...: i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00',]).tz_localize('America/Chicago')
In [18]: i2 = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00', '2017-10-01 01:00:00',]).tz_localize('America/Chicago')
In [19]: s.loc[i]
Out[19]:
2017-09-30 06:00:00-05:00 0.037572
2017-09-30 23:00:00-05:00 0.771407
2017-09-30 23:00:00-05:00 0.771407
2017-10-01 00:00:00-05:00 0.778859
dtype: float64
In [20]: s.loc[i2]
/home/joris/miniconda3/envs/dev/bin/ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[20]:
2017-09-30 11:00:00-05:00 0.037572 # <------- still buggy (11:00 instead of 06:00)
2017-10-01 04:00:00-05:00 0.771407
2017-10-01 04:00:00-05:00 0.771407
2017-10-01 05:00:00-05:00 0.778859
2017-10-01 06:00:00-05:00 NaN
dtype: float64
In [22]: s.reindex(i)
Out[22]:
2017-09-30 06:00:00-05:00 0.037572
2017-09-30 23:00:00-05:00 0.771407
2017-09-30 23:00:00-05:00 0.771407
2017-10-01 00:00:00-05:00 0.778859
dtype: float64
In [23]: s.reindex(i2)
Out[23]:
2017-09-30 06:00:00-05:00 0.037572 # <------ now correct 06:00
2017-09-30 23:00:00-05:00 0.771407
2017-09-30 23:00:00-05:00 0.771407
2017-10-01 00:00:00-05:00 0.778859
2017-10-01 01:00:00-05:00 NaN
dtype: float64
So the bug in loc is still present, but the recommended alternative, reindex, already works correctly.
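To check this on your own data rather than by eye, the comparison above can be scripted. This is only a minimal sketch: it uses a shorter date range and deterministic values (both my assumptions, not part of the report) and simply compares the labels that come back from reindex with the labels that were requested.

```python
import numpy as np
import pandas as pd

tz = "America/Chicago"
# Shorter range than in the report, chosen only to keep the example small.
idx = pd.date_range("2017-09-29", "2017-10-01 00:00:00", freq="h", tz=tz)
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

labels = pd.to_datetime([
    "2017-09-30 06:00:00",
    "2017-09-30 23:00:00",
    "2017-09-30 23:00:00",  # duplicate label
    "2017-10-01 00:00:00",
    "2017-10-01 01:00:00",  # label that does not exist in s.index
]).tz_localize(tz)

result = s.reindex(labels)

# The returned labels should be exactly the requested ones (no shift to UTC).
labels_preserved = list(result.index) == list(labels)

# Only the label outside the original index should come back as NaN.
missing = list(result.index[result.isna().to_numpy()])

print(labels_preserved)
print(missing)
```

If labels_preserved is ever False for the same call with .loc, you are hitting the behaviour described in this issue.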
Comment From: gfyoung
@linar-jether : BTW, if you could by any chance present a smaller example to demonstrate the bug, that would be easier for us to read in the future.
Comment From: jorisvandenbossche
@gfyoung the second post (not the top one) already includes a smaller example (it is the one I used in my response). But it would be good to update the top post with that for clarity.
Comment From: jorisvandenbossche
So given this is deprecated behaviour, I am not sure we should put time into trying to fix this. But, I am trying to think of similar cases where this bug could come up that are not deprecated?
Comment From: linar-jether
Thanks for the response @jorisvandenbossche,
this bug can easily be overlooked and lead to false results, so maybe even a small patch to raise an exception when using loc with both duplicates and non-existing labels?
- I've updated the top post with the more concise example
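Until such a check exists in pandas itself, something similar can be done on the user side. The sketch below is hypothetical: strict_loc is a name made up for illustration, not a pandas function, and raising KeyError is my choice of exception. It refuses the lookup as soon as any requested label is missing, so the buggy code path is never reached.

```python
import numpy as np
import pandas as pd


def strict_loc(obj, labels):
    """Hypothetical helper: label-based selection that rejects missing labels."""
    labels = pd.Index(labels)
    missing = labels[~labels.isin(obj.index)]
    if len(missing) > 0:
        raise KeyError(f"labels not found in index: {list(missing)}")
    # Every label exists, so duplicates in `labels` are safe to pass to .loc.
    return obj.loc[labels]


tz = "America/Chicago"
idx = pd.date_range("2017-09-29", "2017-10-01 00:00:00", freq="h", tz=tz)
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

i = pd.to_datetime(["2017-09-30 06:00:00", "2017-09-30 23:00:00",
                    "2017-09-30 23:00:00", "2017-10-01 00:00:00"]).tz_localize(tz)
i2 = i.append(pd.to_datetime(["2017-10-01 01:00:00"]).tz_localize(tz))

ok = strict_loc(s, i)  # all labels exist, duplicates are returned as requested

try:
    strict_loc(s, i2)  # the last label is not in s.index
    caught = None
except KeyError as exc:
    caught = exc
```

The trade-off is that a caller who really wants NaN rows for missing labels has to switch to reindex explicitly.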
Comment From: linar-jether
And also note that using reindex is not the same as using loc, as reindex will not work when there are duplicates in the source dataframe.
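A small sketch of that difference, using a made-up three-row Series whose index contains a repeated label (the data and variable names are assumptions for illustration only). The boolean mask at the end is just one possible workaround, not something proposed in this thread.

```python
import pandas as pd

tz = "America/Chicago"
# Source index with a duplicated label.
idx = pd.to_datetime(["2017-09-30 06:00:00", "2017-09-30 06:00:00",
                      "2017-09-30 23:00:00"]).tz_localize(tz)
dup = pd.Series([1.0, 2.0, 3.0], index=idx)

wanted = pd.to_datetime(["2017-09-30 06:00:00"]).tz_localize(tz)

# .loc returns every row that carries the requested label.
by_loc = dup.loc[wanted]

# reindex refuses to work on a source index that has duplicates.
try:
    dup.reindex(wanted)
    reindex_error = None
except ValueError as exc:
    reindex_error = exc

# A boolean mask also works with duplicates and ignores missing labels.
by_mask = dup[dup.index.isin(wanted)]
```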
Comment From: jreback
As @jorisvandenbossche points out, this already shows a deprecation warning; it will be changed to an exception in 1.0 (after 0.22.0). The 'this can be overlooked' concern is the reason.
"And also note that using reindex is not the same as using loc, as reindex will not work when there are duplicates in the source dataframe."
Of course if you have duplicates in an index, then you are on your own, that behavior is also not directly supported.
Comment From: mroeschke
This behavior looks fixed on master, if anyone wants to put up a test.
In [1]: pd.__version__
Out[1]: '0.24.0.dev0+371.g0b7a08b70'
In [3]: s.loc[i]
Out[3]:
2017-09-30 06:00:00-05:00 0.951915
2017-09-30 23:00:00-05:00 0.969771
2017-09-30 23:00:00-05:00 0.969771
2017-10-01 00:00:00-05:00 0.461426
dtype: float64
In [4]: s.reindex(i)
Out[4]:
2017-09-30 06:00:00-05:00 0.951915
2017-09-30 23:00:00-05:00 0.969771
2017-09-30 23:00:00-05:00 0.969771
2017-10-01 00:00:00-05:00 0.461426
dtype: float64
In [5]: s.loc[i2]
/anaconda3/envs/pandas-dev/bin/ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
#!/anaconda3/envs/pandas-dev/bin/python
Out[5]:
2017-09-30 06:00:00-05:00 0.951915
2017-09-30 23:00:00-05:00 0.969771
2017-09-30 23:00:00-05:00 0.969771
2017-10-01 00:00:00-05:00 0.461426
2017-10-01 01:00:00-05:00 NaN
dtype: float64
In [6]: s.reindex(i2)
Out[6]:
2017-09-30 06:00:00-05:00 0.951915
2017-09-30 23:00:00-05:00 0.969771
2017-09-30 23:00:00-05:00 0.969771
2017-10-01 00:00:00-05:00 0.461426
2017-10-01 01:00:00-05:00 NaN
dtype: float64
Comment From: gfyoung
- The deprecation warning is already tested thanks to #17295.
- No one has come forward with examples for how this deprecation could be "missed"
In light of those two facts, I'm going to close this issue.