Not sure if it's an expected behavior. This seems strange to me.
A small, complete example of the issue
import pandas as pd
rng = pd.date_range('1/1/2011', periods=5, freq='D', tz='utc')
rng # Timezone is good
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'], dtype='datetime64[ns, UTC]', freq='D')
rng.unique() # Timezone is still good
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'], dtype='datetime64[ns, UTC]', freq='D')
rng.values # Timezone lost!
array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000', '2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000', '2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
serie = rng.to_series().reset_index()['index']
serie # Timezone is good
0   2011-01-01 00:00:00+00:00
1   2011-01-02 00:00:00+00:00
2   2011-01-03 00:00:00+00:00
3   2011-01-04 00:00:00+00:00
4   2011-01-05 00:00:00+00:00
Name: index, dtype: datetime64[ns, UTC]
serie.unique() # Timezone is lost!
array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000', '2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000', '2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
serie.values # Timezone is lost!
array(['2011-01-01T00:00:00.000000000', '2011-01-02T00:00:00.000000000', '2011-01-03T00:00:00.000000000', '2011-01-04T00:00:00.000000000', '2011-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
Expected Output
The outputs of pd.DatetimeIndex.values, pd.Series.unique() and pd.Series.values should preserve the timezone information.
Note: I did a quick test with Python 3.4.5 and pandas 0.19. The only difference is in serie.unique() (to be clear, rng.values and serie.values still discard the tzinfo):
serie.unique() # Timezone is good again!! but dtype is object (preserving pd.Timestamp)
array([Timestamp('2011-01-01 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-02 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-03 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-04 00:00:00+0000', tz='UTC', freq='D'), Timestamp('2011-01-05 00:00:00+0000', tz='UTC', freq='D')], dtype=object)
Output of pd.show_versions()
Comment From: jreback
discussion here: https://github.com/pandas-dev/pandas/issues/13395
To summarize:
On a Series, .values returns a numpy array; datetime64 values are converted to naive, as numpy doesn't support anything else. .unique will give you back an object array if the tz is set.
A DTI returns another DTI from .unique(), and .values is the same as for Series.
So this IS different for a DTI and a Series for .unique.
It has been this way for quite some time and it's not worth it to return a '1-D' object for Series.unique, which you would think would be a DTI. Note that this is the only thing that makes sense as the index is pretty meaningless. Apparently a DTI is unexpected for numpy users.
See my comments in the xref issue.
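For anyone landing here who needs timezone-aware unique values from a Series, the following is a minimal sketch of two workarounds, not an official recommendation from this thread. It assumes a reasonably recent pandas; the variable names unique_idx and unique_series are illustrative only.

```python
import pandas as pd

rng = pd.date_range('1/1/2011', periods=5, freq='D', tz='utc')
serie = rng.to_series().reset_index(drop=True)

# Option 1: wrap the Series in a DatetimeIndex. DatetimeIndex.unique()
# returns another DatetimeIndex, so the timezone is kept.
unique_idx = pd.DatetimeIndex(serie).unique()

# Option 2: stay with a Series. drop_duplicates() returns a Series
# with the same tz-aware dtype as the input.
unique_series = serie.drop_duplicates()

print(unique_idx.tz)
print(unique_series.dtype)
```

Both options avoid going through a plain numpy array, which is where the timezone gets dropped.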
Comment From: jorisvandenbossche
Maybe we should add an explanation about this somewhere in the docs as a FAQ or gotcha, which we can refer to when this comes up.
Comment From: jreback
could update the doc strings with more xref
to be honest .values use should be discouraged
Comment From: aavanian
Thanks for the clear answer and sorry about not catching these other issues (somehow forgot to remove is:open while searching for past similar issues).
As for .values, I can't say I have found a clear, consistent pattern for getting the values out of a Series/DF to pass to non-pandas functions/objects. I often resort to .values or .values.tolist(), which I assumed was probably ugly but found the most predictable: I can use it everywhere without checking too much what comes in, and I'm sure to get float-like for float-like and 'us since epoch' for datetime-like, the issue of course being the latter.
Comment From: jorisvandenbossche
@aavanian many functions that expect a numpy array also can handle a Series, but if you need the values, then .values is the recommended way to get it
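To make the trade-off discussed above concrete, here is a small sketch of what comes out of a tz-aware Series when handing data to non-pandas code. It assumes a recent pandas with time zone data available; the 'Europe/Paris' zone and the names naive_utc, explicit_utc and aware are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=3, freq='D', tz='Europe/Paris')
serie = rng.to_series().reset_index(drop=True)

# .values gives a numpy datetime64 array: the instants are converted
# to UTC and the timezone information is dropped.
naive_utc = serie.values

# The same conversion spelled out step by step, so the intent is visible.
explicit_utc = serie.dt.tz_convert('UTC').dt.tz_localize(None).values

# To keep the timezone, use a plain Python list of pandas Timestamps.
aware = serie.tolist()

print(naive_utc[0])
print(aware[0])
```

The numpy array is compact but timezone-naive, while the list keeps each Timestamp's tzinfo at the cost of holding Python objects.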