Pandas API: Wrong DateOffset behaviour with DST changes

Given a timezone-aware Timestamp:

foo = pd.Timestamp('2016-10-30 00:00:00', tz=pytz.timezone('Europe/Helsinki'))

(Please note that 2016-10-30 is a 25-hour day in this timezone because of a DST change. On this day the UTC offset changes from +0300 to +0200.)

I'm trying to get the next day. If I understand the DateOffset behaviour correctly, all these lines should be equivalent:

foo + pd.tseries.frequencies.to_offset('D')
foo + pd.tseries.offsets.Day()
foo + pd.DateOffset()
foo + pd.DateOffset(1)
foo + pd.DateOffset(days=1)

But the first four return a wrong date, presumably because they're not adding a day, but 24 hours:

Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

And for some reason, the last one returns the correct date:

Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')
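
To make the one-hour gap concrete, here is a minimal sketch that contrasts adding 24 elapsed hours with adding one calendar day. It assumes the timezone can be passed as the string "Europe/Helsinki" instead of a pytz object, and it uses pd.Timedelta(hours=24) as a stand-in for the "24 hours" behaviour, since what Day() itself does may depend on the pandas version.

```python
import pandas as pd

# Midnight at the start of the 25-hour day; the UTC offset is still +0300 here.
foo = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

# Absolute arithmetic: exactly 24 elapsed hours, so the wall clock lands on 23:00.
plus_24_hours = foo + pd.Timedelta(hours=24)

# Calendar arithmetic: same wall-clock time on the next calendar day.
plus_one_day = foo + pd.DateOffset(days=1)

print(plus_24_hours)                 # 2016-10-30 23:00:00+02:00
print(plus_one_day)                  # 2016-10-31 00:00:00+02:00
print(plus_one_day - plus_24_hours)  # 0 days 01:00:00
```

The difference between the two results is exactly the extra hour that the DST change inserts into that day.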

Output of `pd.show_versions()`

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.6.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: es_ES.UTF-8
LOCALE: es_ES.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

xref to https://github.com/pandas-dev/pandas/issues/8774 and #7825

The first four are equivalent to Day, so what you are really asking is why these two differ:

In [44]: foo + pd.DateOffset(days=1)
Out[44]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')

In [45]: foo + pd.offsets.Day()
Out[45]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

Day is a subclass of Tick which, like Minute and Second, is applied to the UTC time directly (in other words, the timestamp is converted to UTC, the offset is added, and the result is converted back to local time). So D is effectively 24 hours, not a calendar day.

Other offsets, e.g. Month and such, seek to preserve the DST semantics exactly.

DateOffset(days=1) works like Month and such, and respects DST.
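
The two semantics described above can be imitated by hand, which may help to see where the hour goes. This is only an illustrative sketch, not the pandas internals: it assumes a string timezone name and uses a plain pd.Timedelta for the addition in both cases.

```python
import pandas as pd

foo = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")
one_day = pd.Timedelta(days=1)

# Tick-style: convert to UTC, add, then convert back to local time.
tick_style = (foo.tz_convert("UTC") + one_day).tz_convert("Europe/Helsinki")

# Calendar-style: drop the timezone, add on the wall clock, then re-localize.
calendar_style = (foo.tz_localize(None) + one_day).tz_localize("Europe/Helsinki")

print(tick_style)      # 2016-10-30 23:00:00+02:00
print(calendar_style)  # 2016-10-31 00:00:00+02:00
```

Note that the calendar-style version re-localizes a naive timestamp, so it can raise if the resulting wall-clock time is ambiguous or does not exist in the target timezone.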

So this is a bit confusing. I don't really recall why it was structured like this. You are welcome to delve into the linked issues and code and see if you can write up a better explanation (which maybe we want to add to the docs).

Comment From: tapia

The problem is that I need a way to write a string offset ('D', '10m', whatever) that is timezone aware. If I'm going to add a day, I need to add an actual day, not 24 hours. How can I achieve this using to_offset?

I can't use DateOffset to do this, because my code receives the offset string as a parameter, and writing a parser that "overrides" the Pandas parser looks like a very bad idea.

Comment From: gfyoung

I'm not sure if you can, but at the very least, what you can do is check for "D" in the string parameter and use pd.DateOffset(days=n) as the workaround. Does that work for you?
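
A rough sketch of that workaround, letting pandas do the parsing and only swapping the result afterwards, so no custom string parser is needed. The helper name calendar_aware_offset is hypothetical, and the sketch assumes that to_offset returns a Day instance for strings such as 'D' or '2D'.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def calendar_aware_offset(freq):
    """Parse freq with pandas, swapping Day for a calendar-day DateOffset."""
    offset = to_offset(freq)
    if isinstance(offset, pd.offsets.Day):
        # Hypothetical substitution: keep the multiple, change the semantics.
        return pd.DateOffset(days=offset.n)
    # Every other offset is returned exactly as pandas parsed it.
    return offset


foo = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

next_day = foo + calendar_aware_offset("D")
two_days = foo + calendar_aware_offset("2D")
ten_minutes = foo + calendar_aware_offset("10min")

print(next_day)     # 2016-10-31 00:00:00+02:00
print(two_days)     # 2016-11-01 00:00:00+02:00
print(ten_minutes)  # 2016-10-30 00:10:00+03:00
```

Checking the parsed offset's type, rather than searching the string for "D", avoids false matches on other aliases that happen to contain that letter.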

Comment From: tapia

Yes, of course, that's what I ended up doing. But regardless of how I work around the problem, I still think this is a bug in Pandas.

Thank you guys for your help :-)

Comment From: jreback

If you want to take a detailed look at the code and tests for this, that would be great. I don't remember exactly why it is the way it is.

Comment From: jreback

consolidated to #20633
