Somewhere between versions 0.16.1 and 0.18.1, indexing a Series (Series[]) with an array of timezone-aware timestamps stopped working.
Is this expected behavior?
Here's our failing test case.
Code Sample, a copy-pastable example if possible
import pytz
import pandas as pp

MTV_TZ = pytz.timezone('America/Los_Angeles')

def D(*args):
    return pp.Timestamp(pp.datetime(*args), tz=MTV_TZ)

series = pp.Series(
    index=[
        D(2011, 1, 1, 8),
        D(2011, 1, 1, 9),
        D(2011, 1, 1, 10),
        D(2011, 2, 2, 1),
        D(2011, 2, 2, 2),
        D(2011, 2, 2, 3),
    ], data=[1, 2, 3, 4, 5, 6])

index = pp.Series(
    index=[D(2011, 1, 1), D(2011, 2, 2)],
    data=[D(2011, 1, 1, 9), D(2011, 2, 2, 3)])

expected = pp.Series(
    index=[D(2011, 1, 1, 9), D(2011, 2, 2, 3)],
    data=[2, 6])

# These are OK.
print (series[index] == expected).all()
print (series[(t for t in index)] == expected).all()
print series[index[0]]         # 2
print series[index.values[0]]  # 2
print series[index.values[1]]  # 6

# These fail. In 0.16.1 they passed.
print (series[index.values] == expected).all()
print (series[(t for t in index.values)] == expected).all()

# Examining the output
print 'SERIES'
print series[index.values]
Expected Output
True
True
2
2
6
True
True
SERIES
2011-01-01 09:00:00-08:00 2
2011-02-02 03:00:00-08:00 6
dtype: int64
Actual Output
True
True
2
2
6
False
False
SERIES
2011-01-01 17:00:00 NaN
2011-02-02 11:00:00 NaN
dtype: float64
output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-83-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.1
nose: None
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 2.0.0
sphinx: None
patsy: 0.2.1
dateutil: 2.4.2
pytz: 2016.4
bottleneck: None
tables: 3.1.1
numexpr: 2.5
matplotlib: 1.3.0
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: 1.0b8
httplib2: 0.9.2
apiclient: 1.3.1
sqlalchemy: None
pymysql: None
psycopg2: None
Comment From: jreback
It's very confusing that you are naming a Series as index.
But in any event, you are mis/ab-using the API. You are expecting a numpy array to index something when it cannot even hold the data type (yes, it did hold it prior to 0.17.0, as object dtype, but in a very inefficient way). The point is you should not be using this at all. And there is no reason to anyhow.
In [31]: series[index]
Out[31]:
2011-01-01 09:00:00-08:00 2
2011-02-02 03:00:00-08:00 6
dtype: int64
In [32]: index
Out[32]:
2011-01-01 00:00:00-08:00 2011-01-01 09:00:00-08:00
2011-02-02 00:00:00-08:00 2011-02-02 03:00:00-08:00
dtype: datetime64[ns, America/Los_Angeles]
# this doesn't understand timezones at all (yes it is DISPLAYING the local timezone, but this is a bug as well in numpy)
In [33]: index.values
Out[33]: array(['2011-01-01T17:00:00.000000000', '2011-02-02T11:00:00.000000000'], dtype='datetime64[ns]')
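To make the point about .values concrete, here is a minimal sketch, assuming a recent pandas on Python 3 (not the 0.18.1 / Python 2.7 setup from the report). The names TZ, aware, naive_utc and restored are made up for illustration. It shows that .values hands back a plain NumPy datetime64 array holding the instants in UTC, and that the timezone has to be re-attached explicitly if you start from that array.

```python
import numpy as np
import pandas as pd

TZ = "America/Los_Angeles"  # same zone as in the report

# A tz-aware Series, analogous to the `index` Series in the report.
aware = pd.Series([
    pd.Timestamp("2011-01-01 09:00", tz=TZ),
    pd.Timestamp("2011-02-02 03:00", tz=TZ),
])

# .values drops the timezone: the result is a plain NumPy datetime64
# array holding the same instants expressed in UTC.
naive_utc = aware.values

# Re-attaching the timezone recovers the original tz-aware timestamps.
restored = pd.DatetimeIndex(naive_utc).tz_localize("UTC").tz_convert(TZ)
```

The UTC wall-clock times (17:00 and 11:00) are the same ones that show up in the Actual Output above, which is why the lookup finds no matching labels.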
Comment From: jorisvandenbossche
@jeremywhelchel As @jreback pointed out, datetime data with a timezone was previously stored as object dtype (which is not efficient), and this was changed in 0.17.
This was a huge improvement for dealing with timezone data in columns, but as a consequence of the refactor, .values no longer returns Timestamp objects, but a datetime64 array:
In [19]: index.values
Out[19]:
array([Timestamp('2011-01-01 09:00:00-0800', tz='America/Los_Angeles'),
       Timestamp('2011-02-02 03:00:00-0800', tz='America/Los_Angeles')], dtype=object)
In [20]: pd.__version__
Out[20]: '0.16.2'
vs
In [24]: index.values
Out[24]:
array(['2011-01-01T18:00:00.000000000+0100',
'2011-02-02T12:00:00.000000000+0100'], dtype='datetime64[ns]')
In [25]: pd.__version__
Out[25]: '0.18.1'
And this is the reason you can no longer access the elements using .values (but as @jreback said, you don't need to).
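For anyone hitting the same breakage, a rough sketch of tz-preserving ways to do the lookup, again assuming a recent pandas on Python 3. The helper ts and the names keys, by_index and by_list are hypothetical, and this is one possible approach rather than an officially recommended one. The idea is to keep the keys in a pandas container (a DatetimeIndex or a list of Timestamp objects) instead of going through .values.

```python
import pandas as pd

TZ = "America/Los_Angeles"


def ts(text):
    """Hypothetical helper: build a tz-aware Timestamp in TZ."""
    return pd.Timestamp(text, tz=TZ)


series = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.DatetimeIndex([
        ts("2011-01-01 08:00"), ts("2011-01-01 09:00"), ts("2011-01-01 10:00"),
        ts("2011-02-02 01:00"), ts("2011-02-02 02:00"), ts("2011-02-02 03:00"),
    ]),
)

# The lookup keys, stored as a tz-aware Series (like `index` in the report).
keys = pd.Series([ts("2011-01-01 09:00"), ts("2011-02-02 03:00")])

# Option 1: wrap the keys in a DatetimeIndex, which keeps the timezone.
by_index = series.loc[pd.DatetimeIndex(keys)]

# Option 2: use a plain list of tz-aware Timestamp objects.
by_list = series.loc[keys.tolist()]
```

Both variants select the rows with values 2 and 6, matching the Expected Output from the report.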
Comment From: jeremywhelchel
Thanks for the explanation @jreback and @jorisvandenbossche. I'm upgrading Pandas at my organization. This behavior caused a handful of breakages in our codebase, so we just wanted to make sure it was intended. I agree storing the array rather than a series of individual timestamps is an improvement. Thanks for the fix!