Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [X] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I have been seeing slow performance when using the pandas read_sql_query function when the returned data includes datetimes with a non UTC timezone offset.

I have located the issue to the objects_to_datetime64ns function called when constructing the dataframe. In pandas 1.4. the data function would try to call tslib.array_to_datetime and fail, then having a failover to conversion.datetime_to_datetime64, but for pandas 1.5. the tslib.array_to_datetime is executed which takes a significant amount of time compared to the 1.4.* execution time.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7 python : 3.9.13.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22000 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : Danish_Denmark.1252 pandas : 1.5.2 numpy : 1.24.1 pytz : 2022.7 dateutil : 2.8.2 setuptools : 65.6.3 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None

Prior Performance

On my computer, the difference between the two version is Pandas 1.4.4: Execution time: 0:00:00.013004 Pandas 1.5.2: Execution time: 0:00:21.586688

import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
    conversion,
)
from pandas._libs import (
    tslib,
)

date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values

print('Starting conversion')
if pd.__version__ == '1.5.2':
    start = dt.datetime.now()
    result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                errors='raise',
                                                utc=False,
                                                dayfirst=False,
                                                yearfirst=False,
                                                require_iso8601=True,
                                                allow_mixed=True)
    print(f'Execution time: {dt.datetime.now() - start}')

if pd.__version__ == '1.4.4':
    start = dt.datetime.now()
    try:
        result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                    errors='raise',
                                                    utc=False,
                                                    dayfirst=False,
                                                    yearfirst=False,
                                                    require_iso8601=True,
                                                    allow_mixed=True)
    except ValueError:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
    print(f'Execution time: {dt.datetime.now() - start}')

Comment From: MarcoGorelli

thanks @kenha37 for your report

I can't reproduce, see below:

(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.5.2
Collecting pandas==1.5.2
  Using cached pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2.8.2)
Requirement already satisfied: numpy>=1.20.3 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2022.6)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.5.2) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.4.4
    Uninstalling pandas-1.4.4:
      Successfully uninstalled pandas-1.4.4
Successfully installed pandas-1.5.2
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008793
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.4.4
Collecting pandas==1.4.4
  Using cached pandas-1.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Requirement already satisfied: numpy>=1.18.5 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2022.6)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2.8.2)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.4.4) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.2
    Uninstalling pandas-1.5.2:
      Successfully uninstalled pandas-1.5.2
Successfully installed pandas-1.4.4
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008626
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
    conversion,
)
from pandas._libs import (
    tslib,
)

date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values

print('Starting conversion')
if pd.__version__ == '1.5.2':
    start = dt.datetime.now()
    result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                errors='raise',
                                                utc=False,
                                                dayfirst=False,
                                                yearfirst=False,
                                                require_iso8601=True,
                                                allow_mixed=True)
    print(f'Execution time: {dt.datetime.now() - start}')

if pd.__version__ == '1.4.4':
    start = dt.datetime.now()
    try:
        result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                    errors='raise',
                                                    utc=False,
                                                    dayfirst=False,
                                                    yearfirst=False,
                                                    require_iso8601=True,
                                                    allow_mixed=True)
    except ValueError:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
    print(f'Execution time: {dt.datetime.now() - start}')

Comment From: kenha37

Thank you for looking at the issue.

I have done some further testing, and the issue seems to only show itself on windows systems. I only have ubuntu and windows computers available, and the issue is consistent on windows but doesn't appear on ubuntu.

When trying to step through the array_to_datetime method in a debugger, it seems the code is rotating through the find_tzfile and load_tzdata functions. Counting the function calls to find_tzfile, it is called 43809 times in the above code.

I have tested that this is a problem both when using virtual environments and the system interpreter.

Comment From: MarcoGorelli

that's useful, thanks

could you try on the main branch too please?

Comment From: lithomas1

Maybe related to #49048.

Comment From: kenha37

Maybe related to #49048.

I can confirm that installing the tzdata package removes the problem.

that's useful, thanks

could you try on the main branch too please?

The main branch by itself retains the problem.

Comment From: MarcoGorelli

I tried building from source on the main branch on windows, and uninstalling tzdata, and can't reproduce:

User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ pip uninstall tzdata
Found existing installation: tzdata 2022.7
Uninstalling tzdata-2022.7:
  Would remove:
    c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata-2022.7.dist-info\*
    c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata\*
Proceed (Y/n)? y
  Successfully uninstalled tzdata-2022.7
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ python t.py
Execution time: 0:00:00.011298
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
    conversion,
)
from pandas._libs import (
    tslib,
)

date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values

start = dt.datetime.now()
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                            errors='raise',
                                            utc=False,
                                            dayfirst=False,
                                            yearfirst=False,
                                            )
print(f'Execution time: {dt.datetime.now() - start}')

How did you run on the main branch?

Comment From: kenha37

I installed it in a pipenv using "pipenv install PATH/TO/LOCAL/PANDAS-GIT".

Here I also add and remove tzdata from the pipenv to reproduce the effect of having tzdata in the environment.

I have tested this on a couple of windows 11 pc's, most running Version 10.0.22621 Build 22621.

Comment From: MarcoGorelli

thanks for the info

no idea how to debug this TBH if I can't reproduce it, sorry - hopefully someone else can help

Comment From: kenha37

I have installed pandas from source again just to be sure the install was successful, and I am still seeing the same problem when the tzdata package isn't installed.

I am not sure how the timezone information normally would be installed on windows systems, but if it is installed externally from the python environment on your system, you probably wont see the issue.

There are some suggested solutions in #49048, but I wouldn't know where to start to help with a solution.

Comment From: jbrockmendel

The solution is either

a) require tzdata b) edit is_utz_zoneinfo as suggested https://github.com/pandas-dev/pandas/issues/49048#issuecomment-1287280872

Comment From: tkarni

Not sure if that's the same core issue, however, here's a simple reproduction script of a performance issue introduced in Pandas 1.5.0, tested on Windows 10: https://stackoverflow.com/questions/74765179/python-pandas-1-5-x-takes-too-long-to-retrieve-dict-values-when-the-values-are This is too resolved by merely installing tzdata.

Comment From: jbrockmendel

@MarcoGorelli is this fixed by #51247?

Comment From: MarcoGorelli

thanks for the ping - yes, that's right! let's close then