Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example
The construction of a DatetimeIndex from a list of Timestamp with timezone appears to be slow when pandas 1.5 is installed in a virtual environment (venv) on Windows.
Here is a minimal script that runs slowly on my machine:

```bash
# Pandas 1.5.0
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 1.38 sec per loop
```
Remarks:
- Pandas 1.4.4 works fine on the same machine (see below)
- There is no issue on Linux
- There is no issue if the timezone is UTC or None
- There is no issue when pandas 1.5 is installed in a conda environment on the same machine
- Here is the script I used to create the virtual environment:

```bash
python -m venv venv
.\venv\Scripts\activate
pip install pandas
```
Installed Versions
Prior Performance
```bash
# Pandas 1.4.4
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
# output: 5 loops, best of 5: 2.25 msec per loop
# :0: UserWarning: The test results are likely unreliable. The worst time (176 msec) was more than four times slower than the best time (2.25 msec).
```
Comment From: jbrockmendel
I'm seeing about 2.5ms on main.
Comment From: bernardkaplan
I installed pandas from source (version 1.6.0.dev0+319.g98323eec0e) and the issue still persists (almost 2 seconds per loop).
Comment From: borlaugen
This has a very high impact on our code. Some of the code we develop has gone from 100 seconds to 4000 seconds just by upgrading from pandas 1.4.4 to 1.5.
Our profiling shows that the function `\pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns)` spends 18 seconds on average in 1.5, but only 0.003 seconds in 1.4.4.
I too can confirm that this is not an issue on Linux (Debian) but is on Windows 11.
The issue also seems to be present in non-virtual environments.
Comment From: jbrockmendel
> Our profiling shows that the function `\pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns)` spends 18 seconds on average in 1.5, but only 0.003 seconds in 1.4.4.
Do the profiling results point to anything more specific?
Comment From: borlaugen
After looking more closely at the results, it looks like nt.stat is called roughly 10 000 times more often in 1.5 than in 1.4 for one of our tests. Below is an excerpt of cProfile's output sorted on total time. This is the same run where I got the 18 seconds on average per call for _sequence_to_dt64ns. I am still a newbie at reading cProfile's results, so let me know if there is anything I could do to provide additional information.
```
   ncalls    tottime  percall    cumtime  percall filename:lineno(function)
 18213004    732.810        0    732.810        0 {built-in method nt.stat}
 91057598    696.116        0   1002.256        0 <frozen importlib._bootstrap_external>:96(_path_join)
 18211525    460.503        0   2272.450        0 <frozen importlib._bootstrap_external>:1536(find_spec)
  1821172    113.012        0   2417.210    0.001 <frozen importlib._bootstrap_external>:1399(_get_spec)
 91057598     94.610        0    119.267        0 <frozen importlib._bootstrap_external>:119(<listcomp>)
  1821172     77.473        0   2549.822    0.001 <frozen importlib._bootstrap>:921(_find_spec)
273173609     55.334        0     55.334        0 {method 'endswith' of 'str' objects}
194868241     48.031        0     48.031        0 {method 'startswith' of 'str' objects}
274996319     43.611        0     43.611        0 {method 'rstrip' of 'str' objects}
```
edit: tried to make the table pretty
Comment From: jbrockmendel
Can you use `.sort_stats("cumtime")` before printing the profiler results? Also, we need more lines of output, as what you've posted doesn't look like it's even getting to the pandas code.
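For anyone following along, a minimal cProfile session sorted on cumulative time looks like this; the profiled function here is a placeholder, not the pandas call:

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the slow call being investigated; any function works
    # for demonstrating how to sort and print the profile.
    return sum(i * i for i in range(1000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# Sort by cumulative time and keep only the top entries, as requested above.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumtime").print_stats(10)
report = buf.getvalue()
print("cumtime" in report)  # the report header includes the cumtime column
```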
Comment From: borlaugen
Attached is the entire output from two different runs with 1.4.4 and 1.5.1 respectively.
profiler_1_5_github.txt profiler_1_4_github.txt
Comment From: jbrockmendel
@pganssle it looks like zoneinfo load_tzdata is taking something like 93% of the runtime here. any idea what's going on?
Comment From: pganssle
@jbrockmendel Are you doing something weird to load the ZoneInfo objects? I don't really understand why load_tzdata is being called 118k times, since you should be hitting the cache after the first one.
Do you see the same issue if you add `TZ = zoneinfo.ZoneInfo("Europe/Paris")` to the setup code? This might be an artifact of the way you are doing the benchmark.
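The cache being referred to is built into the `ZoneInfo` constructor itself, which can be checked directly (this sketch assumes time zone data is available, e.g. a system database or the tzdata package):

```python
from zoneinfo import ZoneInfo

# zoneinfo caches constructed instances per key, so repeated lookups of the
# same zone return the very same object instead of reloading data from disk.
a = ZoneInfo("Europe/Paris")
b = ZoneInfo("Europe/Paris")
print(a is b)
```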
Edit: Looking at this more closely, I don't actually see why zoneinfo is getting called at all. From what I can tell the slow MWE code exclusively uses pytz, so presumably something is calling some zoneinfo thing when it doesn't need to. The repeated calls to load_tzdata make me think that it never actually succeeds in constructing a ZoneInfo file, so it's probably a side effect of some "check if this is a ZI object" code that isn't caching properly.
@bernardkaplan What happens if you install the tzdata Python package before running the tests? Does that fix it?
Comment From: borlaugen
@pganssle, I installed tzdata and now the run times are normal. See attached cProfile output.
Comment From: stalehuse
```bash
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
```

The code above speeds up from 2 seconds to 3 milliseconds after installing tzdata.
Comment From: pganssle
OK, I think the problem is probably this: if you don't have zoneinfo data, it tries to construct `ZoneInfo("UTC")` every time it's called, and fails only after searching a bunch of places looking for time zone data. It's not wrong to do this, since it's possible that you will install some time zone data after the application is open, but in the case where tzdata is not installed, that's going to be expensive.
You can probably get closer to ideal by doing something like this (the line starting `if tz is None` is the part that is changed):
```cython
cdef inline bint is_utc_zoneinfo(tzinfo tz):
    # Workaround for cases with missing tzdata
    # https://github.com/pandas-dev/pandas/pull/46425#discussion_r830633025
    if tz is None or zoneinfo is None or not isinstance(tz, ZoneInfo):
        return False
    global utc_zoneinfo
    if utc_zoneinfo is None:
        try:
            utc_zoneinfo = ZoneInfo("UTC")
        except zoneinfo.ZoneInfoNotFoundError:
            return False
        # Warn if tzdata is too old, even if there is a system tzdata to alert
        # users about the mismatch between local/system tzdata
        import_optional_dependency("tzdata", errors="warn", min_version="2022.1")
    return tz is utc_zoneinfo
```
`ZoneInfo("UTC")` will always be an instance of `zoneinfo.ZoneInfo`, and it's a very slim chance that someone will pass you a `zoneinfo.ZoneInfo` file on a system where there's no UTC tzif file. Those people will have the same performance regression, and if that's a real problem then you may need to decide between the two very uncommon use cases of "no UTC time zone data but they are actively using `zoneinfo.ZoneInfo`" and "wants to be able to install time zone data while the program is running and have pandas still be able to recognize that it's a UTC object".
You can also make this a bit more targeted by moving the isinstance check to `if utc_zoneinfo is None and isinstance(tz, ZoneInfo)`, if you feel that having the isinstance check slows down the case where the user actually has passed `ZoneInfo("UTC")` enough that it's preferable to put it under the guard where we are trying to populate `utc_zoneinfo`.
Current users can install tzdata as a workaround until the bug is fixed.
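For readers less familiar with Cython, the caching pattern above can be sketched in plain Python; the names here are illustrative, and the pandas-specific tzdata version warning is omitted:

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

_utc_zoneinfo = None  # module-level cache, mirroring utc_zoneinfo above

def is_utc_zoneinfo(tz):
    """Return True only for the cached ZoneInfo("UTC") instance."""
    global _utc_zoneinfo
    if tz is None or not isinstance(tz, ZoneInfo):
        return False
    if _utc_zoneinfo is None:
        try:
            _utc_zoneinfo = ZoneInfo("UTC")
        except ZoneInfoNotFoundError:
            # No time zone data at all: cache nothing and report False,
            # which is the expensive path described in the comment above.
            return False
    return tz is _utc_zoneinfo

# ZoneInfo's constructor cache makes every ZoneInfo("UTC") the same object,
# so the identity check works; other zones compare as not-UTC.
print(is_utc_zoneinfo(ZoneInfo("UTC")))
print(is_utc_zoneinfo(ZoneInfo("Europe/Paris")))
```

The early `isinstance` check keeps the fast path cheap for pytz and fixed-offset tzinfo objects, which never need the `ZoneInfo("UTC")` lookup at all.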
Comment From: bernardkaplan
> @bernardkaplan What happens if you install the tzdata Python package before running the tests? Does that fix it?

I confirm that installing tzdata solved the performance issue.
Comment From: MarcoGorelli
Anyone know how to reproduce this on Windows on main?
I followed the contributing guide for Windows (which I hadn't done before, I was always using Linux), then did `pip uninstall tzdata -y`. But then I see:

```
(.venv) PS C:\Users\User\pandas-dev> python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
5 loops, best of 5: 1.83 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (123 msec) was more than four times slower than the best time (1.83 msec).
```