Pandas BUG: date_range with freq="C" (business days) return value changed on 1.5.0

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime as dt
import pandas as pd
import pytz

paris = pytz.timezone("Europe/Paris")
start = dt.datetime.fromisoformat("2022-07-22 22:00:00+00:00").astimezone(paris)
print(pd.date_range(start, periods=1, freq="C"))

Issue Description

On Pandas 1.4.4, the above shows:

DatetimeIndex(['2022-07-25 00:00:00+02:00'], dtype='datetime64[ns, Europe/Paris]', freq='C')

On Pandas 1.5.1, it shows:

DatetimeIndex(['2022-07-23 00:00:00+02:00', '2022-07-25 00:00:00+02:00'], dtype='datetime64[ns, Europe/Paris]', freq='C')

There are no warnings from the previous code.

Expected Behavior

THe same results, unless I have missed something in the release notes (I tried reading all results matching date/time/business, couldn't see anything relevant).

Installed Versions


INSTALLED VERSIONS
------------------
commit           : ca60aab7340d9989d9428e11a51467658190bb6b
python           : 3.10.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Mon Aug 22 20:19:52 PDT 2022; root:xnu-8020.140.49~2/RELEASE_ARM64_T6000
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.4.4
numpy            : 1.23.4
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

Comment From: mroeschke

Thanks for the report. I'm seeing the same on main.

From a quick look, https://github.com/pandas-dev/pandas/pull/48385 touched the related code path most recently but the core logic didn't seem to change. cc @jbrockmendel

This looks to be the root cause of the behavior

(Pdb) start
Timestamp('2022-07-23 00:00:00+0200', tz='Europe/Paris')
(Pdb) offset
<CustomBusinessDay>
(Pdb) start + (0 * offset) # = end
Timestamp('2022-07-25 00:00:00+0200', tz='Europe/Paris')  # date rolled forward

Comment From: lohmanndouglas

@mroeschke I would like to contribute to this issue, but I believe that I will require some support as it is my first contribution to pandas.

I did a git bisect and figure out that the issue was introduced in #47019. More specifically in the lines from 273 to 276 .

I reverted the modification, only in these lines, and it worked. I need a better understanding of what is happing and figure out how to write a test case for it.

One possible solution is just to revert does lines, but I think we are looking for a better solution :)

I will continue studying the code to propose a good solution.

Comment From: mroeschke

Thanks for investigating @lohmanndouglas. Since this is a regression, it would be valid to revert those line if it fixes this issue and doesn't fail any existing test. A test should be included in pandas/tests/indexes/datetimes/test_date_range.py that asserts that the OP code example exhibits the 1.4.4 behavior

Comment From: lohmanndouglas

Thanks @mroeschke

I will do it and open a PR as soon as possible :)

Comment From: lohmanndouglas

take

Comment From: Natim

One possible solution is just to revert does lines

Yes it looks good to me indeed.

Comment From: mroeschke

@lohmanndouglas while git bisect points to that commit, IMO the "root" issue still points to the behavior of https://github.com/pandas-dev/pandas/issues/49441#issuecomment-1302692153, namely why in CustomBusinessDay._apply why we have

    @apply_wraps
    def _apply(self, other):
        if self.n <= 0: # why does this include 0
            roll = "forward"
        else:
            roll = "backward"

Comment From: lohmanndouglas

@mroeschke I did not open the PR because I would like to get a deep understanding of what is happening. We might have changed the function _to_dt64D(dt) behavior, consequently revealing another bug in the _apply function.

From the following snippet

    paris = pytz.timezone("Europe/Paris")
    start = dt.datetime.fromisoformat("2022-07-22 22:00:00+00:00").astimezone(paris)
    print(pd.date_range(start, periods=1, freq="C"))

I got in _to_dt64D(dt) function:

dt (input)	After the #47019	Before the #47019
2022-07-23 00:00:00+02:00	2022-07-22	2022-07-23

1) Is the new behavior correct? It should ignore the TZ?

Knowing that we have two different execution flows in the _generate_range function (here is the logic and the apply function)

FLOW 1: Before the #47019

    if start and not offset.is_on_offset(start):
        # Incompatible types in assignment (expression has type "datetime",
        # variable has type "Optional[Timestamp]")
        start = offset.rollforward(start)  # type: ignore[assignment]

The offset.is_on_offset(start) returns False, and the IF conditional evaluates to True, setting the start to 2022-07-25 00:00:00+02:00

Then, we set the end to end = start + (periods - 1) * offset, which means that we set the end to 2022-07-25 00:00:00+02:00.

Finally, we check:

```python cur = start if offset.n >= 0: while cur <= end: yield cur if cur == end: # GH#24252 avoid overflows by not performing the addition # in offset.apply unless we have to break

...

```

As the cur == end is True we just return the solution. In this flow (before #47019) it does not pass through the apply function.

### FLOW 2: After the #47019

After #47019, the offset.is_on_offset(start) (from the second python snippet) evaluates to True, and the IF condition turns to False (does not perform the roll forward).

Then we also set the end to end = start + (periods - 1) * offset, but here we have start = 2022-07-23 00:00:00+02:00 and end = 2022-07-25 00:00:00+02:00.

Finally, we start the while to find out the periods. As the start is not equal to the end we call the apply.

    cur = start
    if offset.n >= 0:
        while cur <= end:
            yield cur         
            if cur == end:
                break
            [... omitted code]
            next_date = offset._apply(cur)
           [... omitted code]
            cur = next_date

2) Here is another point: Should the start consider that day 23 is Saturday and avoid starting with an invalid custom business day?

The apply function returns the correct value 2022-07-25 00:00:00+02:00. In the second iteration, the cur is
cur == end finishing the while loop. At this point, we have the start (2022-07-23 00:00:00+02:00) and the cur (2022-07-25 00:00:00+02:00) as the result of our function.

So far I have this understanding. We need to certify what should be the correct behavior for the _to_dt64D(dt) function. If the changes from #47019 introduced an undesired behavior, it is simple to undo the modification, otherwise, we may investigate more about the start, it should not start with an invalid day.

DOING -> Studying the is_on_offset that uses the _to_dt64D(dt) function. This function is the root evidence of bad behavior from _to_dt64D.

    def is_on_offset(self, dt: datetime) -> bool:
        if self.normalize and not _is_normalized(dt):
            return False
        day64 = _to_dt64D(dt)
        return np.is_busday(day64, busdaycal=self.calendar)

I do not have a deep understanding of this code, this is my first issue, and I am very glad for any help.

Comment From: mroeschke

I got in _to_dt64D(dt) function:

dt (input)	After the #47019	Before the #47019
2022-07-23 00:00:00+02:00	2022-07-22	2022-07-23

Yeah I think this is the crux of the issue as I think we want the naive wall time (dt.replace(tzinfo=None)) instead of the naive absolute time (dt.astimezone(None)) from this function. This diff fixes the original example if you want to open a pull request

diff --git a/pandas/_libs/tslibs/offsets.pyx b/pandas/_libs/tslibs/offsets.pyx
index 50d6a0a02b..64099310f7 100644
--- a/pandas/_libs/tslibs/offsets.pyx
+++ b/pandas/_libs/tslibs/offsets.pyx
@@ -254,7 +254,7 @@ cdef _to_dt64D(dt):
     if getattr(dt, 'tzinfo', None) is not None:
         # Get the nanosecond timestamp,
         #  equiv `Timestamp(dt).value` or `dt.timestamp() * 10**9`
-        naive = dt.astimezone(None)
+        naive = dt.replace(tzinfo=None)
         dt = np.datetime64(naive, "D")
     else:
         dt = np.datetime64(dt)

Comment From: lohmanndouglas

@mroeschke tanks

sure, I would like to open the PR :)