Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The construction of a DatetimeIndex from a list of Timestamps with a timezone is very slow when pandas 1.5 is installed in a virtual environment (venv) on Windows.

Here is a minimal script that runs slowly on my machine:

# Pandas 1.5.0
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 1.38 sec per loop

Remarks:

- Pandas 1.4.4 works fine on the same machine (see below)
- There is no issue on Linux
- There is no issue if the timezone is UTC or None
- There is no issue when pandas 1.5 is installed in a conda environment on the same machine
- Here is the script I used to set up the virtual environment:

  python -m venv venv
  .\venv\Scripts\activate
  pip install pandas

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python           : 3.9.12.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22000
machine          : AMD64
processor        : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : fr_FR.cp1252
pandas           : 1.5.0
numpy            : 1.23.3
pytz             : 2022.4
dateutil         : 2.8.2
setuptools       : 58.1.0
pip              : 22.0.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Prior Performance

# Pandas 1.4.4
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 2.25 msec per loop
# :0: UserWarning: The test results are likely unreliable. The worst time (176 msec) was more than four times slower than the best time (2.25 msec).

Comment From: jbrockmendel

I'm seeing about 2.5ms on main.

Comment From: bernardkaplan

I installed pandas from source (version 1.6.0.dev0+319.g98323eec0e) and the issue still persists (almost 2 seconds per loop).

INSTALLED VERSIONS
------------------
commit           : 98323eec0e94ea103412a1a20a1e6b4c6fa0599b
python           : 3.9.12.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22621
machine          : AMD64
processor        : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : fr_FR.cp1252
pandas           : 1.6.0.dev0+319.g98323eec0e
numpy            : 1.23.4
pytz             : 2022.4
dateutil         : 2.8.2
setuptools       : 58.1.0
pip              : 22.2.2
Cython           : 0.29.32
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Comment From: borlaugen

This has a very high impact on our code. Some of the code we develop has gone from 100 seconds to 4000 seconds just by upgrading from pandas 1.4.4 to 1.5.

Our profiling shows that the function pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns) spends 18 seconds on average per call in 1.5, but only 0.003 seconds in 1.4.4.

I too can confirm that this is not an issue on Linux (Debian), but it is on Windows 11.

The issue seems to be present also in non virtual environments.

Comment From: jbrockmendel

Our profiling shows that the function pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns) spends 18 seconds on average per call in 1.5, but only 0.003 seconds in 1.4.4.

Do the profiling results point to anything more specific?

Comment From: borlaugen

After looking more closely at the results, it looks like nt.stat is called roughly 10,000 times more often in 1.5 than in 1.4 for one of our tests. Below is an excerpt of cProfile's output sorted by total time. This is the same run where I got the 18 seconds on average per call for _sequence_to_dt64ns. I am still a newbie at reading cProfile's results; let me know if there is anything I could do to provide additional information.

ncalls     tottime  percall  cumtime   percall  filename:lineno(function)
18213004   732.81   0        732.81    0        {built-in method nt.stat}
91057598   696.116  0        1002.256  0        <frozen importlib._bootstrap_external>:96(_path_join)
18211525   460.503  0        2272.45   0        <frozen importlib._bootstrap_external>:1536(find_spec)
1821172    113.012  0        2417.21   0.001    <frozen importlib._bootstrap_external>:1399(_get_spec)
91057598   94.61    0        119.267   0        <frozen importlib._bootstrap_external>:119(<listcomp>)
1821172    77.473   0        2549.822  0.001    <frozen importlib._bootstrap>:921(_find_spec)
273173609  55.334   0        55.334    0        {method 'endswith' of 'str' objects}
194868241  48.031   0        48.031    0        {method 'startswith' of 'str' objects}
274996319  43.611   0        43.611    0        {method 'rstrip' of 'str' objects}

edit: tried to make the table pretty

Comment From: jbrockmendel

Can you use .sort_stats("cumtime") before printing the profiler results? We also need more lines of output, as what you've posted doesn't look like it's even getting to the pandas code.
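For reference, a minimal, pandas-free sketch of producing cumtime-sorted output with cProfile/pstats; here workload() is a hypothetical stand-in for the actual benchmark body:

```python
import cProfile
import io
import pstats

def workload():
    # Placeholder for the code being profiled (e.g. the DatetimeIndex call)
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and print the top 10 entries
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumtime").print_stats(10)
print(buf.getvalue())
```

Sorting by "cumtime" surfaces the high-level callers (like _sequence_to_dt64ns) rather than the leaf built-ins that dominate a tottime sort.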

Comment From: borlaugen

Attached is the entire output from two different runs with 1.4.4 and 1.5.1 respectively.

profiler_1_5_github.txt profiler_1_4_github.txt

Comment From: jbrockmendel

@pganssle it looks like zoneinfo load_tzdata is taking something like 93% of the runtime here. any idea what's going on?

Comment From: pganssle

@jbrockmendel Are you doing something weird to load the ZoneInfo objects? I don't really understand why load_tzdata is being called 118k times, since you should be hitting the cache after the first one.

Do you see the same issue if you add TZ=zoneinfo.ZoneInfo("Europe/Paris") to the setup code? This might be an artifact of the way you are doing the benchmark.

Edit: Looking at this more closely, I don't actually see why zoneinfo is getting called at all. From what I can tell, the slow MWE code exclusively uses pytz, so presumably something is calling some zoneinfo thing when it doesn't need to. The repeated calls to load_tzdata make me think that it never actually succeeds in constructing a ZoneInfo object, so it's probably a side effect of some "check if this is a ZI object" code that isn't caching properly.
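For context on why the repeated load_tzdata calls are suspicious: the zoneinfo primary constructor caches instances by key, so repeated construction of the same zone should not hit the data files again. A small check (this needs IANA data to be available; on a bare Windows venv without the tzdata package the lookup itself fails):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

try:
    a = ZoneInfo("Europe/Paris")
    b = ZoneInfo("Europe/Paris")
    # The primary constructor caches by key, so both names
    # refer to the very same object.
    cached = a is b
except ZoneInfoNotFoundError:
    # No IANA data at all (e.g. Windows without the tzdata package)
    cached = None

print(cached)
```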

@bernardkaplan What happens if you install the tzdata Python package before running the tests? Does that fix it?
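One way to verify whether a given environment has the tzdata package before and after the install (a sketch; has_tzdata_package is a hypothetical helper, not a pandas API):

```python
import importlib.util

def has_tzdata_package() -> bool:
    # True if the first-party "tzdata" package (IANA data for zoneinfo)
    # is importable. Windows has no system time zone database, so
    # zoneinfo must fall back to this package there.
    return importlib.util.find_spec("tzdata") is not None

print(has_tzdata_package())
```

This may also explain the venv/conda difference reported above, since conda environments typically ship IANA time zone data while a bare venv on Windows does not.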

Comment From: borlaugen

@pganssle, I installed tzdata and now the run times are normal. See the attached cProfile output.

profiler_1_5_tzdata.txt

Comment From: stalehuse

python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"

The code above speeds up from 2 seconds to 3 milliseconds after installing tzdata.

Comment From: pganssle

OK, I think the problem is probably this. If you don't have zoneinfo data, it tries to construct ZoneInfo("UTC") every time it's called, and fails only after searching a bunch of places looking for time zone data. It's not wrong to do this, since it's possible that you will install some time zone data while the application is running, but in the common case where tzdata is not installed, that's going to be expensive.

You can probably get closer to ideal by doing something like this (the line starting with if tz is None is the part that changed):

cdef inline bint is_utc_zoneinfo(tzinfo tz):
    # Workaround for cases with missing tzdata
    #  https://github.com/pandas-dev/pandas/pull/46425#discussion_r830633025
    if tz is None or zoneinfo is None or not isinstance(tz, ZoneInfo):
        return False

    global utc_zoneinfo
    if utc_zoneinfo is None:
        try:
            utc_zoneinfo = ZoneInfo("UTC")
        except zoneinfo.ZoneInfoNotFoundError:
            return False
        # Warn if tzdata is too old, even if there is a system tzdata to alert
        # users about the mismatch between local/system tzdata
        import_optional_dependency("tzdata", errors="warn", min_version="2022.1")

    return tz is utc_zoneinfo

ZoneInfo("UTC") will always be an instance of zoneinfo.ZoneInfo, and there's a very slim chance that someone will pass you a zoneinfo.ZoneInfo object on a system where there's no UTC TZif file. Those people will have the same performance regression, and if that's a real problem then you may need to decide between the two very uncommon use cases of "no UTC time zone data but actively using zoneinfo.ZoneInfo" and "wants to be able to install time zone data while the program is running and have pandas still be able to recognize that it's a UTC object".

You can also make this a bit more targeted by moving the isinstance check into the guard, as in if utc_zoneinfo is None and isinstance(tz, ZoneInfo), if you feel that the isinstance check slows down the case where the user has actually passed ZoneInfo("UTC") enough that it's preferable to only run it while populating utc_zoneinfo.
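For readers less familiar with Cython, here is a pure-Python sketch of the same guarded, cached check (names mirror the snippet above; this is illustrative, not the actual pandas implementation):

```python
import zoneinfo
from zoneinfo import ZoneInfo

_utc_zoneinfo = None  # module-level cache, populated at most once on success

def is_utc_zoneinfo(tz) -> bool:
    """Return True only if tz is the cached ZoneInfo("UTC") singleton."""
    global _utc_zoneinfo
    # The isinstance guard short-circuits for pytz/None inputs, so the
    # expensive ZoneInfo("UTC") lookup never runs for them. This is the
    # key difference from the slow path described above.
    if tz is None or not isinstance(tz, ZoneInfo):
        return False
    if _utc_zoneinfo is None:
        try:
            _utc_zoneinfo = ZoneInfo("UTC")
        except zoneinfo.ZoneInfoNotFoundError:
            return False
    return tz is _utc_zoneinfo
```

With this shape, a pytz timezone (what the original MWE actually uses) never triggers the repeated, failing tzdata search.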

Current users can install tzdata as a workaround until the bug is fixed.

Comment From: bernardkaplan

@bernardkaplan What happens if you install the tzdata Python package before running the tests. Does that fix it?

I confirm that installing tzdata solved the performance issue.

Comment From: MarcoGorelli

Anyone know how to reproduce this on Windows on main?

I followed the contributing guide for Windows (which I hadn't done before, as I was always using Linux), then ran pip uninstall tzdata -y. But then I see:

(.venv) PS C:\Users\User\pandas-dev> python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
5 loops, best of 5: 1.83 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (123 msec) was more than four times slower than the best time (1.83 msec).