Pandas unique(), values discards timezone information in some cases

Not sure if this is expected behavior, but it seems strange to me.

A small, complete example of the issue

rng = pd.date_range('1/1/2011', periods=5, freq='D', tz='utc')
rng  # Timezone is good

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'], dtype='datetime64[ns, UTC]', freq='D')

rng.unique()  # Timezone is still good

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'], dtype='datetime64[ns, UTC]', freq='D')

rng.values  # Timezone lost!

array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000', '2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000', '2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')

serie = rng.to_series().reset_index()['index']
serie  # Timezone is good

0   2011-01-01 00:00:00+00:00
1   2011-01-02 00:00:00+00:00
2   2011-01-03 00:00:00+00:00
3   2011-01-04 00:00:00+00:00
4   2011-01-05 00:00:00+00:00
Name: index, dtype: datetime64[ns, UTC]

serie.unique()  # Timezone is lost!

array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000', '2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000', '2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')

serie.values  # Timezone is lost!

array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000', '2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000', '2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')

Expected Output

The outputs of pd.DatetimeIndex.values, pd.Series.unique() and pd.Series.values should preserve the timezone information.

Note: I did a quick test with Python 3.4.5 and pandas 0.19. The only difference is the following (to be clear, rng.values and serie.values still discard the tzinfo):

serie.unique()  # Timezone is good again!! but dtype is object (preserving pd.Timestamp)

array([Timestamp('2011-01-01 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-02 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-03 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-04 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-05 00:00:00+0000', tz='UTC', freq='D')], dtype=object)

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1

Comment From: jreback

discussion here: https://github.com/pandas-dev/pandas/issues/13395

To summarize:

.values returns a numpy array; datetime64 values are converted to naive because numpy doesn't support anything else. .unique will give you back an object array if the tz is set.

A DTI returns another DTI from .unique(), and its .values behaves the same as for a Series.

So this IS different for a DTI and a Series for .unique.

It has been this way for quite some time, and it's not worth it to return a '1-D' object for Series.unique, which you would think would be a DTI. Note that this is the only thing that makes sense, as the index is pretty meaningless. Apparently a DTI is unexpected for numpy users.

See my comments in the xref issue.
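
For readers landing here, a minimal sketch of the .values behavior summarized above. The zone name and variable names are illustrative assumptions, not taken from the original report; a non-UTC zone is used only to make the conversion visible. It suggests that the naive numpy values are the UTC instants, so the timezone can be re-attached by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a non-UTC zone makes the conversion visible.
rng = pd.date_range("2011-01-01", periods=3, freq="D", tz="America/New_York")

# .values hands back a plain numpy datetime64 array with no timezone attached.
naive = rng.values

# Midnight in New York on 2011-01-01 is 05:00 UTC, which is what numpy stores.
first_is_utc = naive[0] == np.datetime64("2011-01-01T05:00")

# Re-attach the timezone manually: treat the raw values as UTC, then convert.
restored = pd.DatetimeIndex(naive).tz_localize("UTC").tz_convert("America/New_York")
```

The round trip only works because the caller remembers the original zone; the numpy array itself carries no trace of it.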

Comment From: jorisvandenbossche

Maybe we should add an explanation about this somewhere in the docs as a FAQ or gotcha, to which we can refer when this comes up.

Comment From: jreback

could update the doc strings with more xref

to be honest .values use should be discouraged

Comment From: aavanian

Thanks for the clear answer and sorry about not catching these other issues (somehow forgot to remove is:open while searching for past similar issues).

As for .values, I can't say I have found a very clear/consistent pattern to get the values out of a Series/DF to pass to non-pandas functions/objects. I often resort to .values or .values.tolist(), which I suppose is probably ugly but which I find the most predictable. As in, I can use that everywhere without checking too much what comes in: I'm sure I get float-like for float-like and 'us since epoch' for datetime-like, the issue being of course the latter.

Comment From: jorisvandenbossche

@aavanian many functions that expect a numpy array also can handle a Series, but if you need the values, then .values is the recommended way to get it
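
As a rough sketch of the options discussed in this thread for getting tz-aware data out of a Series, assuming the same `rng` and `serie` as in the report (the other variable names are illustrative). Which option fits depends on what the receiving function expects, so treat this as a starting point rather than a recommendation.

```python
import pandas as pd

rng = pd.date_range("1/1/2011", periods=5, freq="D", tz="utc")
serie = rng.to_series().reset_index(drop=True)

# Option 1: a Python list of tz-aware pd.Timestamp objects.
as_list = serie.tolist()

# Option 2: wrap the Series in a DatetimeIndex, which keeps the timezone,
# including through .unique().
uniques = pd.DatetimeIndex(serie).unique()

# Option 3: take the naive numpy values and carry the timezone alongside them.
tz = serie.dt.tz
raw = serie.values
```

Option 3 is the closest to plain .values, with the difference that the timezone is kept in a separate variable instead of being silently dropped.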
