Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this issue exists on the latest version of pandas.
-
[X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I have been seeing slow performance when using the pandas read_sql_query function when the returned data includes datetimes with a non UTC timezone offset.
I have located the issue to the objects_to_datetime64ns function called when constructing the dataframe. In pandas 1.4. the data function would try to call tslib.array_to_datetime and fail, then having a failover to conversion.datetime_to_datetime64, but for pandas 1.5. the tslib.array_to_datetime is executed which takes a significant amount of time compared to the 1.4.* execution time.
Installed Versions
Prior Performance
On my computer, the difference between the two version is Pandas 1.4.4: Execution time: 0:00:00.013004 Pandas 1.5.2: Execution time: 0:00:21.586688
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
conversion,
)
from pandas._libs import (
tslib,
)
date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values
print('Starting conversion')
if pd.__version__ == '1.5.2':
start = dt.datetime.now()
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
require_iso8601=True,
allow_mixed=True)
print(f'Execution time: {dt.datetime.now() - start}')
if pd.__version__ == '1.4.4':
start = dt.datetime.now()
try:
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
require_iso8601=True,
allow_mixed=True)
except ValueError:
values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
print(f'Execution time: {dt.datetime.now() - start}')
Comment From: MarcoGorelli
thanks @kenha37 for your report
I can't reproduce, see below:
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.5.2
Collecting pandas==1.5.2
Using cached pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2.8.2)
Requirement already satisfied: numpy>=1.20.3 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2022.6)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.5.2) (1.16.0)
Installing collected packages: pandas
Attempting uninstall: pandas
Found existing installation: pandas 1.4.4
Uninstalling pandas-1.4.4:
Successfully uninstalled pandas-1.4.4
Successfully installed pandas-1.5.2
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008793
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.4.4
Collecting pandas==1.4.4
Using cached pandas-1.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Requirement already satisfied: numpy>=1.18.5 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2022.6)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2.8.2)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.4.4) (1.16.0)
Installing collected packages: pandas
Attempting uninstall: pandas
Found existing installation: pandas 1.5.2
Uninstalling pandas-1.5.2:
Successfully uninstalled pandas-1.5.2
Successfully installed pandas-1.4.4
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008626
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
conversion,
)
from pandas._libs import (
tslib,
)
date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values
print('Starting conversion')
if pd.__version__ == '1.5.2':
start = dt.datetime.now()
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
require_iso8601=True,
allow_mixed=True)
print(f'Execution time: {dt.datetime.now() - start}')
if pd.__version__ == '1.4.4':
start = dt.datetime.now()
try:
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
require_iso8601=True,
allow_mixed=True)
except ValueError:
values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
print(f'Execution time: {dt.datetime.now() - start}')
Comment From: kenha37
Thank you for looking at the issue.
I have done some further testing, and the issue seems to only show itself on windows systems. I only have ubuntu and windows computers available, and the issue is consistent on windows but doesn't appear on ubuntu.
When trying to step through the array_to_datetime method in a debugger, it seems the code is rotating through the find_tzfile and load_tzdata functions. Counting the function calls to find_tzfile, it is called 43809 times in the above code.
I have tested that this is a problem both when using virtual environments and the system interpreter.
Comment From: MarcoGorelli
that's useful, thanks
could you try on the main
branch too please?
Comment From: lithomas1
Maybe related to #49048.
Comment From: kenha37
Maybe related to #49048.
I can confirm that installing the tzdata package removes the problem.
that's useful, thanks
could you try on the
main
branch too please?
The main branch by itself retains the problem.
Comment From: MarcoGorelli
I tried building from source on the main
branch on windows, and uninstalling tzdata, and can't reproduce:
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ pip uninstall tzdata
Found existing installation: tzdata 2022.7
Uninstalling tzdata-2022.7:
Would remove:
c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata-2022.7.dist-info\*
c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata\*
Proceed (Y/n)? y
Successfully uninstalled tzdata-2022.7
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ python t.py
Execution time: 0:00:00.011298
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
conversion,
)
from pandas._libs import (
tslib,
)
date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values
start = dt.datetime.now()
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
)
print(f'Execution time: {dt.datetime.now() - start}')
How did you run on the main branch?
Comment From: kenha37
I installed it in a pipenv using "pipenv install PATH/TO/LOCAL/PANDAS-GIT".
Here I also add and remove tzdata from the pipenv to reproduce the effect of having tzdata in the environment.
I have tested this on a couple of windows 11 pc's, most running Version 10.0.22621 Build 22621.
Comment From: MarcoGorelli
thanks for the info
no idea how to debug this TBH if I can't reproduce it, sorry - hopefully someone else can help
Comment From: kenha37
I have installed pandas from source again just to be sure the install was successful, and I am still seeing the same problem when the tzdata package isn't installed.
I am not sure how the timezone information normally would be installed on windows systems, but if it is installed externally from the python environment on your system, you probably wont see the issue.
There are some suggested solutions in #49048, but I wouldn't know where to start to help with a solution.
Comment From: jbrockmendel
The solution is either
a) require tzdata b) edit is_utz_zoneinfo as suggested https://github.com/pandas-dev/pandas/issues/49048#issuecomment-1287280872
Comment From: tkarni
Not sure if that's the same core issue, however, here's a simple reproduction script of a performance issue introduced in Pandas 1.5.0, tested on Windows 10: https://stackoverflow.com/questions/74765179/python-pandas-1-5-x-takes-too-long-to-retrieve-dict-values-when-the-values-are This is too resolved by merely installing tzdata.
Comment From: jbrockmendel
@MarcoGorelli is this fixed by #51247?
Comment From: MarcoGorelli
thanks for the ping - yes, that's right! let's close then