Pandas Series accessed with timezoned timestamps inexplicably returns NaN

Somewhere between versions 0.16.1 and 0.18.1, indexing a Series with an array of timezone-aware timestamps stopped working.

Is this expected behavior?

Here's our failing test case.

Code Sample, a copy-pastable example if possible

    import pytz
    import pandas as pp
    MTV_TZ = pytz.timezone('America/Los_Angeles')

    def D(*args):
      return pp.Timestamp(pp.datetime(*args), tz=MTV_TZ)
    series = pp.Series(
        index=[
            D(2011, 1, 1, 8),
            D(2011, 1, 1, 9),
            D(2011, 1, 1, 10),
            D(2011, 2, 2, 1),
            D(2011, 2, 2, 2),
            D(2011, 2, 2, 3),
        ], data=[1, 2, 3, 4, 5, 6])
    index = pp.Series(
        index=[D(2011, 1, 1), D(2011, 2, 2)],
        data=[D(2011, 1, 1, 9), D(2011, 2, 2, 3)])
    expected = pp.Series(
        index=[D(2011, 1, 1, 9), D(2011, 2, 2, 3)],
        data=[2, 6])
    # These are OK.
    print((series[index] == expected).all())
    print((series[(t for t in index)] == expected).all())
    print(series[index[0]])  # expected: 2
    print(series[index.values[0]])  # expected: 2
    print(series[index.values[1]])  # expected: 6
    # These fail. In 0.16.1 they passed.
    print((series[index.values] == expected).all())
    print((series[(t for t in index.values)] == expected).all())
    # Examining the output
    print('SERIES')
    print(series[index.values])

Expected Output

True
True
2
2
6
True
True
SERIES
2011-01-01 09:00:00-08:00    2
2011-02-02 03:00:00-08:00    6
dtype: int64

Actual Output

True
True
2
2
6
False
False
SERIES
2011-01-01 17:00:00   NaN
2011-02-02 11:00:00   NaN
dtype: float64

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-83-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: None
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 2.0.0
sphinx: None
patsy: 0.2.1
dateutil: 2.4.2
pytz: 2016.4
bottleneck: None
tables: 3.1.1
numexpr: 2.5
matplotlib: 1.3.0
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: 1.0b8
httplib2: 0.9.2
apiclient: 1.3.1
sqlalchemy: None
pymysql: None
psycopg2: None

Comment From: jreback

It's very confusing that you named a Series `index`.

But in any event, you are misusing the API. You are expecting a numpy array to index something when it cannot even hold the data type (yes, it did hold it prior to 0.17.0, as object dtype, but in a very inefficient way). The point is that you should not be using `.values` for this at all, and there is no reason to anyhow.

In [31]: series[index]
Out[31]: 
2011-01-01 09:00:00-08:00    2
2011-02-02 03:00:00-08:00    6
dtype: int64

In [32]: index
Out[32]: 
2011-01-01 00:00:00-08:00   2011-01-01 09:00:00-08:00
2011-02-02 00:00:00-08:00   2011-02-02 03:00:00-08:00
dtype: datetime64[ns, America/Los_Angeles]

# this doesn't understand timezones at all (yes it is DISPLAYING the local timezone, but this is a bug as well in numpy)
In [33]: index.values
Out[33]: array(['2011-01-01T17:00:00.000000000', '2011-02-02T11:00:00.000000000'], dtype='datetime64[ns]')
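
As a rough illustration of the point above, here is a minimal sketch, assuming pandas 0.17 or later. The names `lookup`, `raw`, and `utc_naive` are made up for this example and do not appear in the original report. It shows that `.values` on a timezone-aware Series hands back a plain NumPy datetime64 array holding the UTC times with no timezone attached, which is why those values no longer match the timezone-aware labels in the Series being indexed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reporter's `index` Series:
# two timezone-aware timestamps in America/Los_Angeles.
lookup = pd.Series([
    pd.Timestamp("2011-01-01 09:00", tz="America/Los_Angeles"),
    pd.Timestamp("2011-02-02 03:00", tz="America/Los_Angeles"),
])

# .values returns a plain NumPy array; NumPy's datetime64 has no
# timezone, so the timestamps come back as naive UTC instants.
raw = lookup.values

# The same conversion spelled out explicitly within pandas:
# convert to UTC, then drop the timezone.
utc_naive = lookup.dt.tz_convert("UTC").dt.tz_localize(None)

# 09:00 in Los Angeles (UTC-8 in January) is 17:00 UTC.
first_as_utc = np.datetime64("2011-01-01T17:00:00")
```

The Series itself still carries the timezone (`lookup.dt.tz` is set); only the NumPy array extracted from it loses that information.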

Comment From: jorisvandenbossche

@jeremywhelchel As @jreback pointed out, datetime data with a timezone were previously stored as object dtype (which is not efficient), and this was changed in 0.17.

This was a huge improvement for dealing with timezone data in columns, but as a consequence of the refactor, `.values` no longer returns Timestamp objects but a datetime64 array:

In [19]: index.values
Out[19]:
array([Timestamp('2011-01-01 09:00:00-0800', tz='America/Los_Angeles'),
       Timestamp('2011-02-02 03:00:00-0800', tz='America/Los_Angeles')], dtype=object)

In [20]: pd.__version__
Out[20]: '0.16.2'

In [24]: index.values
Out[24]:
array(['2011-01-01T18:00:00.000000000+0100',
       '2011-02-02T12:00:00.000000000+0100'], dtype='datetime64[ns]')

In [25]: pd.__version__
Out[25]: '0.18.1'

And this is the reason you can no longer access the elements using `.values` (but as @jreback said, you don't need to).
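
For code that currently relies on `.values`, one way to keep the timezone during the lookup might look like the following. This is only a sketch, assuming a reasonably recent pandas; the names `ts`, `lookup`, `by_list`, and `by_index` are illustrative and not from the original report. The idea is to pass pandas objects (a list of Timestamps or a DatetimeIndex) as the key instead of a raw NumPy array.

```python
import pandas as pd

TZ = "America/Los_Angeles"


def ts(text):
    """Build a timezone-aware Timestamp (hypothetical helper for this sketch)."""
    return pd.Timestamp(text, tz=TZ)


# Data indexed by timezone-aware timestamps, mirroring the report.
series = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.DatetimeIndex([
        ts("2011-01-01 08:00"),
        ts("2011-01-01 09:00"),
        ts("2011-01-01 10:00"),
        ts("2011-02-02 01:00"),
        ts("2011-02-02 02:00"),
        ts("2011-02-02 03:00"),
    ]),
)

# The timestamps to look up, held in a Series as in the report.
lookup = pd.Series([ts("2011-01-01 09:00"), ts("2011-02-02 03:00")])

# Option 1: a list of Timestamp objects keeps the timezone on each key.
by_list = series.loc[lookup.tolist()]

# Option 2: a DatetimeIndex built from the Series keeps the timezone too.
by_index = series.loc[pd.DatetimeIndex(lookup)]
```

Both keys stay timezone-aware, so they match the labels in `series` and the lookups return the values 2 and 6, as the original test case expected.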

Comment From: jeremywhelchel

Thanks for the explanation @jreback and @jorisvandenbossche. I'm upgrading Pandas at my organization. This behavior caused a handful of breakages in our codebase, so we just wanted to make sure it was intended. I agree storing the array rather than a series of individual timestamps is an improvement. Thanks for the fix!
