When accessing labels with .loc using a specific list of Timestamp objects or a DatetimeIndex (see attached csv file), the resulting index is shifted to UTC wall-clock times but still carries the original timezone offset, so the returned labels are wrong (for example 11:00-05:00 instead of 06:00-05:00).

This seems to happen only in very specific cases, when the index passed to .loc contains labels that do not exist in the DataFrame and also contains duplicates.

@yuval-jether

import numpy as np
import pandas as pd
import pytz
idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
s = pd.Series(np.random.rand(len(idx)), index=idx)
i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00',]).tz_localize('America/Chicago')
s.loc[i]

Out[39]: 
2017-09-30 06:00:00-05:00    0.380138
2017-09-30 23:00:00-05:00    0.774696
2017-09-30 23:00:00-05:00    0.774696
2017-10-01 00:00:00-05:00    0.728027
dtype: float64

# Added a label that does not exist in the Series
import numpy as np
import pandas as pd
import pytz
idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
s = pd.Series(np.random.rand(len(idx)), index=idx)
i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00', '2017-10-01 01:00:00',]).tz_localize('America/Chicago')
s.loc[i]

Out[40]: 
2017-09-30 11:00:00-05:00    0.645350
2017-10-01 04:00:00-05:00    0.099323
2017-10-01 04:00:00-05:00    0.099323
2017-10-01 05:00:00-05:00    0.037136
2017-10-01 06:00:00-05:00         NaN
dtype: float64

Output of pd.show_versions()

pd.show_versions()
2017-10-30 03:15:08 [pip.utils] [DEBUG] lzma module is not available
2017-10-30 03:15:08 [pip.vcs] [DEBUG] Registered VCS backend: git
2017-10-30 03:15:08 [pip.vcs] [DEBUG] Registered VCS backend: hg
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_DEBUG' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'WITH_PYMALLOC' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_UNICODE_SIZE' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_DEBUG' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'WITH_PYMALLOC' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.pep425tags] [DEBUG] Config variable 'Py_UNICODE_SIZE' is unset, Python ABI tag may be incorrect
2017-10-30 03:15:08 [pip.vcs] [DEBUG] Registered VCS backend: svn
2017-10-30 03:15:09 [pip.vcs] [DEBUG] Registered VCS backend: bzr

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.13.1
scipy: 0.18.1
xarray: 0.9.6
IPython: 5.5.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: 3.8.0
bs4: 4.5.1
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.4.0

Comment From: linar-jether

[deleted - see top post]

Comment From: jorisvandenbossche

@linar-jether Thanks for the report!

First note: in the just-released 0.21.0, using loc with some non-existent labels raises a warning, and this functionality will be removed in a future version. The warning points to reindex as the alternative if you want that behaviour:

In [17]: idx = pd.date_range('2011-01-01', '2017-10-01 00:00:00', freq='h', tz='America/Chicago')
    ...: s = pd.Series(np.random.rand(len(idx)), index=idx)
    ...: i = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00',]).tz_localize('America/Chicago')

In [18]: i2 = pd.to_datetime(['2017-09-30 06:00:00', '2017-09-30 23:00:00', '2017-09-30 23:00:00', '2017-10-01 00:00:00', '2017-10-01 01:00:00',]).tz_localize('America/Chicago')

In [19]: s.loc[i]
Out[19]: 
2017-09-30 06:00:00-05:00    0.037572
2017-09-30 23:00:00-05:00    0.771407
2017-09-30 23:00:00-05:00    0.771407
2017-10-01 00:00:00-05:00    0.778859
dtype: float64

In [20]: s.loc[i2]
/home/joris/miniconda3/envs/dev/bin/ipython:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[20]: 
2017-09-30 11:00:00-05:00    0.037572     # <------- still buggy (11:00 instead of 06:00)
2017-10-01 04:00:00-05:00    0.771407
2017-10-01 04:00:00-05:00    0.771407
2017-10-01 05:00:00-05:00    0.778859
2017-10-01 06:00:00-05:00         NaN
dtype: float64

In [22]: s.reindex(i)
Out[22]: 
2017-09-30 06:00:00-05:00    0.037572
2017-09-30 23:00:00-05:00    0.771407
2017-09-30 23:00:00-05:00    0.771407
2017-10-01 00:00:00-05:00    0.778859
dtype: float64

In [23]: s.reindex(i2)
Out[23]: 
2017-09-30 06:00:00-05:00    0.037572    # <------ now correct 06:00
2017-09-30 23:00:00-05:00    0.771407
2017-09-30 23:00:00-05:00    0.771407
2017-10-01 00:00:00-05:00    0.778859
2017-10-01 01:00:00-05:00         NaN
dtype: float64

So the bug in loc is still present, but the recommended alternative reindex already works correctly.
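The reindex route can be reproduced on a smaller series. The following is only a minimal sketch, assuming a two-day hourly index instead of the multi-year one above and deterministic values instead of random ones; it checks that the requested wall-clock labels come back unchanged, with NaN for the label that is not in the series.

```python
import numpy as np
import pandas as pd

tz = "America/Chicago"

# Smaller hourly index than in the report (assumption: the range does not matter for the check).
idx = pd.date_range("2017-09-29", "2017-10-01 00:00:00", freq="h", tz=tz)
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

labels = pd.to_datetime([
    "2017-09-30 06:00:00",
    "2017-09-30 23:00:00",
    "2017-09-30 23:00:00",  # duplicate label
    "2017-10-01 00:00:00",
    "2017-10-01 01:00:00",  # label that does not exist in s
]).tz_localize(tz)

# reindex accepts missing labels and fills them with NaN.
result = s.reindex(labels)
print(result)
```

The result index should be identical to `labels`, so the first entry stays at 06:00-05:00 rather than drifting to 11:00-05:00.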

Comment From: gfyoung

@linar-jether : BTW, if you could by any chance present a smaller example to demonstrate the bug, that would be easier for us to read in the future.

Comment From: jorisvandenbossche

@gfyoung the second post (not the top one) already includes a smaller example (it is the one I used in my response). But it would be good to update the top post with that for clarity.

Comment From: jorisvandenbossche

So given this is deprecated behaviour, I am not sure we should put time into trying to fix this. But, I am trying to think of similar cases where this bug could come up that are not deprecated?

Comment From: linar-jether

Thanks for the response @jorisvandenbossche,

this bug can easily be overlooked and lead to false results, so maybe even a small patch to raise an exception when using loc with both duplicates and non-existing labels?

  • I've updated the top post with the more concise example
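As an illustration of the kind of guard proposed here, a user-side wrapper could refuse missing labels before delegating to loc. This is only a sketch: `strict_loc` is a hypothetical helper name, not a pandas API, and it is not the patch that was eventually applied in pandas.

```python
import pandas as pd


def strict_loc(obj, labels):
    """Hypothetical helper: label-based selection that raises on any missing label."""
    labels = pd.Index(labels)
    # Collect requested labels that are absent from the object's index.
    missing = labels[~labels.isin(obj.index)]
    if len(missing) > 0:
        raise KeyError(f"labels not in index: {list(missing.unique())}")
    return obj.loc[labels]


tz = "America/Chicago"
idx = pd.date_range("2017-09-30", "2017-10-01 00:00:00", freq="h", tz=tz)
s = pd.Series(range(len(idx)), index=idx)

ok = pd.to_datetime(["2017-09-30 23:00:00", "2017-09-30 23:00:00"]).tz_localize(tz)
bad = pd.to_datetime(["2017-09-30 23:00:00", "2017-10-01 01:00:00"]).tz_localize(tz)

print(strict_loc(s, ok))  # duplicates are fine as long as every label exists

try:
    strict_loc(s, bad)
except KeyError as exc:
    print("refused:", exc)
```

Raising early keeps the silent mislabelling from reaching downstream code, at the cost of one extra membership check per lookup.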

Comment From: linar-jether

And also note that using reindex is not the same as using loc, as reindex will not work when there are duplicates in the source dataframe.
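A small sketch of that limitation, using made-up data: when the source index itself contains duplicates, reindex raises a ValueError. The boolean-mask selection shown afterwards is one possible workaround that keeps only existing labels; it is an assumption about what the caller wants, not a drop-in replacement for loc, since it neither repeats rows for repeated labels nor adds NaN rows for missing ones.

```python
import pandas as pd

tz = "America/Chicago"

# Source series whose index contains a duplicated timestamp.
dup_idx = pd.to_datetime(
    ["2017-09-30 06:00", "2017-09-30 06:00", "2017-09-30 07:00"]
).tz_localize(tz)
s = pd.Series([1.0, 2.0, 3.0], index=dup_idx)

# One existing label and one label that is not in the series.
wanted = pd.to_datetime(["2017-09-30 06:00", "2017-09-30 08:00"]).tz_localize(tz)

try:
    s.reindex(wanted)
    reindex_error = None
except ValueError as exc:
    # pandas cannot reindex from an axis with duplicate labels.
    reindex_error = exc

print("reindex failed with:", reindex_error)

# Workaround sketch: select every row whose label is among the wanted ones.
present = s[s.index.isin(wanted)]
print(present)
```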

Comment From: jreback

As @jorisvandenbossche points out, this already shows a deprecation warning; it will be changed to an exception in 1.0 (after 0.22.0). The fact that this can be overlooked is the reason.

@linar-jether wrote: "And also note that using reindex is not the same as using loc, as reindex will not work when there are duplicates in the source dataframe."

Of course if you have duplicates in an index, then you are on your own, that behavior is also not directly supported.

Comment From: mroeschke

This behavior looks fixed on master, if anyone wants to put up a test.

In [1]: pd.__version__
Out[1]: '0.24.0.dev0+371.g0b7a08b70'

In [3]: s.loc[i]
Out[3]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
dtype: float64

In [4]: s.reindex(i)
Out[4]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
dtype: float64

In [5]: s.loc[i2]
/anaconda3/envs/pandas-dev/bin/ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  #!/anaconda3/envs/pandas-dev/bin/python
Out[5]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
2017-10-01 01:00:00-05:00         NaN
dtype: float64

In [6]: s.reindex(i2)
Out[6]:
2017-09-30 06:00:00-05:00    0.951915
2017-09-30 23:00:00-05:00    0.969771
2017-09-30 23:00:00-05:00    0.969771
2017-10-01 00:00:00-05:00    0.461426
2017-10-01 01:00:00-05:00         NaN
dtype: float64
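A regression test along these lines might look like the sketch below. The test name and the small date range are assumptions, not taken from the pandas test suite. It accepts either outcome discussed in this thread: loc agreeing with reindex (with the FutureWarning silenced), or loc raising KeyError on versions where missing labels are no longer allowed.

```python
import warnings

import numpy as np
import pandas as pd


def test_loc_tz_aware_missing_and_duplicate_labels():
    tz = "America/Chicago"
    idx = pd.date_range("2017-09-29", "2017-10-01 00:00:00", freq="h", tz=tz)
    s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

    labels = pd.to_datetime([
        "2017-09-30 06:00:00",
        "2017-09-30 23:00:00",
        "2017-09-30 23:00:00",  # duplicate label
        "2017-10-01 00:00:00",
        "2017-10-01 01:00:00",  # label missing from s
    ]).tz_localize(tz)

    expected = s.reindex(labels)

    try:
        with warnings.catch_warnings():
            # Silence the deprecation warning on versions that still allow this.
            warnings.simplefilter("ignore", FutureWarning)
            result = s.loc[labels]
    except KeyError:
        # Versions that reject missing labels outright cannot hit the bug.
        return

    pd.testing.assert_series_equal(result, expected)


test_loc_tz_aware_missing_and_duplicate_labels()
```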

Comment From: gfyoung

  • The deprecation warning is already tested thanks to #17295.
  • No one has come forward with examples for how this deprecation could be "missed".

In light of those two facts, I'm going to close this issue.