Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
i1 = pd.date_range('2020-10-25 1:00', freq='H', periods= 4, tz='Europe/Berlin')
# DatetimeIndex(['2020-10-25 01:00:00+02:00', '2020-10-25 02:00:00+02:00',
#                '2020-10-25 02:00:00+01:00', '2020-10-25 03:00:00+01:00'],
#               dtype='datetime64[ns, Europe/Berlin]', freq='H')
ts1 = i1[1]
# Timestamp('2020-10-25 02:00:00+0200', tz='Europe/Berlin', freq='H')

# This error...
i1.floor('H') # AmbiguousTimeError: Cannot infer dst time from 2020-10-25 02:00:00, try using the 'ambiguous' argument
# ...can be solved using the `ambiguous` argument if we floor an index:
i1.floor('H', ambiguous='infer')

# But that won't work on the single timestamp:
ts1.floor('H') # AmbiguousTimeError: Cannot infer dst time from 2020-10-25 02:00:00, try using the 'ambiguous' argument
ts1.floor('H', ambiguous='infer') # ValueError: Cannot infer offset with only one time.

Issue Description

All timestamps in the above example are well-defined having both a timezone (Europe/Berlin) and a UTC-offset (+2 or +1). Flooring them to the nearest full (quarter)hour should be possible, but raises an error instead.

Expected Behavior

In this case, I don't think the result of either flooring operation is ambiguous. The first should return i1, and the second should return ts1.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 06d230151e6f18fdb8139d09abf539867a8cd481 python : 3.9.7.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19044 machine : AMD64 processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : de_DE.cp1252 pandas : 1.4.1 numpy : 1.22.3 pytz : 2022.1 dateutil : 2.8.2 pip : 21.2.4 setuptools : 58.0.4 Cython : None pytest : 7.1.1 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : 8.1.1 pandas_datareader: None bs4 : None bottleneck : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.5.1 numba : None numexpr : None odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.8.0 sqlalchemy : 1.4.32 tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None

Comment From: mroeschke

I think this is the expected behavior as stated in the docs in the Notes section https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.floor.html

If the timestamps have a timezone, flooring will take place relative to the local (“wall”) time and re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Comment From: rwijtvliet

Thank you for looking into this.

The nonexistent does not apply here (we're dealing only with existing timestamps), and ambiguous is only used for DatetimeIndex. (I've updated the example above to use Timestamp as well, as this too is part of my use-case.)

Current (imperfect) workaround for timestamps:

floored_ts1 = ts1.tz_convert('UTC').floor('H').tz_convert(ts1.tz) # NB: incorrect if ts1 has fractional offset (like 03:30 or 03:45) to UTC.
floored_ts1.freq = ts1.freq

Comment From: mroeschke

and ambiguous is only used for DatetimeIndex

This is a DatetimeIndex

In [4]: type(i1)
Out[4]: pandas.core.indexes.datetimes.DatetimeIndex

Since this DatetimeIndex has Timestamps that fall within the DST transition, this is an ambiguous operation since it's equivalent to

In [5]: i1.tz_localize(None).floor("H").tz_localize("Europe/Berlin")
AmbiguousTimeError: Cannot infer dst time from 2020-10-25 02:00:00, try using the 'ambiguous' argument

So you'll need to specify one of the ambiguous options

‘infer’ will attempt to infer fall dst-transition hours based on order

bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)

‘NaT’ will return NaT where there are ambiguous times

Comment From: rwijtvliet

Thanks @mroeschke for your answer.

My question was originally about DatetimeIndex, but I now recognise the issue is primarily with Timestamp, which is why I changed the bugreport yesterday afternoon.

Either way, maybe the method below is helpful. Compared to the current implementation, it...

  • ...does work for Timestamp;
  • ...does not require the ambiguous argument for DatetimeIndex; and
  • ...retains the freq information.
import pandas as pd
import pytz


def floor2(self, freq):
    if self.tz is None:
        return self.__class__(self.floor(freq), freq=self.freq)

    # All timestamps are converted to the anchor's UTC offset.
    anchor = self if isinstance(self, pd.Timestamp) else self[0]

    # Turn into fixed offset to eliminate ambiguity...
    fo = pytz.FixedOffset(anchor.utcoffset().total_seconds() / 60)
    # (fo = 'UTC' gives incorrect result if UTC-offset is not integer number of hours)
    # ...then floor and convert back to original timezone...
    newinstance = self.tz_convert(fo).floor(freq).tz_convert(self.tz)
    # ...and keep original frequency.
    return self.__class__(newinstance, freq=self.freq)


pd.DatetimeIndex.floor2 = floor2
pd.Timestamp.floor2 = floor2

# Works for flooring DatetimeIndex to quarterhour
# ------------------------------------------------------
i0 = pd.date_range("2020-10-25 1:11", freq="H", periods=4)
i0.floor2("15T")
#DatetimeIndex(['2020-10-25 01:00:00', '2020-10-25 02:00:00',
#               '2020-10-25 03:00:00', '2020-10-25 04:00:00'],
#              dtype='datetime64[ns]', freq='H')

i1 = pd.date_range("2020-10-25 1:11", freq="H", periods=4, tz="Europe/Berlin")
i1.floor2("15T")
# DatetimeIndex(['2020-10-25 01:00:00+02:00', '2020-10-25 02:00:00+02:00',
#                '2020-10-25 02:00:00+01:00', '2020-10-25 03:00:00+01:00'],
#               dtype='datetime64[ns, Europe/Berlin]', freq='H')

i2 = pd.date_range("2020-10-25 1:11", freq="H", periods=4, tz="Asia/Kolkata")
i2.floor2("15T")
# DatetimeIndex(['2020-10-25 01:00:00+05:30', '2020-10-25 02:00:00+05:30',
#                '2020-10-25 03:00:00+05:30', '2020-10-25 04:00:00+05:30'],
#               dtype='datetime64[ns, Asia/Kolkata]', freq='H')

# Works for flooring Timestamp to quarterhour
# --------------------------------------------------
i0[0].floor2("15T")
# Timestamp('2020-10-25 01:00:00', freq='H')
i1[0].floor2("15T")
# Timestamp('2020-10-25 01:00:00+0200', tz='Europe/Berlin', freq='H')
i2[0].floor2("15T")
# Timestamp('2020-10-25 01:00:00+0530', tz='Asia/Kolkata', freq='H')

It's not finished/perfect yet - e.g. flooring to days is not done correctly. What's needed is a check: if the flooring is to the nearest hour, or shorter, the above code can be used. If it's to a frequency that's longer than an hour, the original code could be used.

But this is quickly turning into a code review. Let me know if you think it's helpful if I turn this into a pull request.

Comment From: lucasjamar

I think this is the expected behavior as stated in the docs in the Notes section https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.floor.html

If the timestamps have a timezone, flooring will take place relative to the local (“wall”) time and re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

@mroeschke , why is "wall" time selected for flooring the datetime? why are times not converted by default to "UTC", floored and then returned again to set timezone?