Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
.
Installed Versions
INSTALLED VERSIONS
------------------
commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
python           : 3.6.9.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-191-generic
Version          : #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.1.5
numpy            : 1.19.4
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.3.1
setuptools       : 41.6.0
Cython           : 0.23.4
pytest           : 4.0.1
hypothesis       : 3.66.8
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 0.7.3
lxml.etree       : 4.3.2
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.8
IPython          : 3.2.1
pandas_datareader: None
bs4              : 4.4.1
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.3
numexpr          : None
odfpy            : None
openpyxl         : 3.0.8
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : 1.3.1
tables           : None
tabulate         : 0.8.5
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : 0.31.0
Prior Performance
While upgrading a large, pandas-heavy codebase from 0.19.2, I was looking for areas to improve performance, as there has been a fairly consistent ~25% performance drop across various slices of the codebase when moving to 1.1.5 (other libraries were upgraded at the same time, so it is not necessarily just pandas contributing to that). Making blocks showed up in profiling as taking quite a lot longer, and this method is an easy place to boost performance as it is heavily used. It looks like this is the case on master as well.
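For context, a minimal sketch of the kind of profiling run that surfaces block creation is below; the frame shapes and counts are arbitrary illustrations, not the original codebase's workload.

```python
import cProfile

import numpy as np
import pandas as pd


def build_frames(n_frames=200, n_rows=1_000, n_cols=50):
    # Mixed integer/float columns force pandas to create and consolidate
    # several internal blocks per frame, which is where block creation and
    # the dtype checks get exercised.
    data = {f"i{i}": np.arange(n_rows) for i in range(n_cols)}
    data.update({f"f{i}": np.random.rand(n_rows) for i in range(n_cols)})
    for _ in range(n_frames):
        pd.DataFrame(data)


# Sort by cumulative time and look for pandas internals near the top.
cProfile.run("build_frames()", sort="cumulative")
```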
Caching the results of the method produced a ~5% performance increase in quite a large test suite, but it would be nice to see the change on asv.
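As a rough illustration of the kind of memoisation meant here (not the actual pandas method), a cached stand-in for a hot dtype-classification helper could look like this; classify_dtype is a hypothetical name.

```python
from functools import lru_cache

import numpy as np
from pandas.api.types import is_datetime64_any_dtype, is_integer_dtype


@lru_cache(maxsize=None)
def classify_dtype(dtype):
    # numpy/pandas dtype objects are hashable, so repeated calls with the
    # same dtype hit the cache instead of re-running the is_* checks.
    if is_integer_dtype(dtype):
        return "integer"
    if is_datetime64_any_dtype(dtype):
        return "datetime"
    return "other"


classify_dtype(np.dtype("int64"))  # first call does the work
classify_dtype(np.dtype("int64"))  # subsequent calls return the cached result
```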
Generally it looks like the is_*_dtype(...)-related places might benefit from attention, so I will look into those if a usable pattern comes out of this. IIRC it was an extension dtype added in 2016, probably just after 0.19.2, that looked like it took a fair bit of time.
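A quick, hedged way to compare the per-call cost of these helpers (absolute timings are machine- and version-dependent):

```python
import timeit

setup = (
    "import numpy as np; "
    "from pandas.api.types import is_integer_dtype, is_extension_array_dtype; "
    "dt = np.dtype('int64')"
)
for stmt in ("is_integer_dtype(dt)", "is_extension_array_dtype(dt)"):
    per_call = timeit.timeit(stmt, setup=setup, number=100_000) / 100_000
    print(f"{stmt}: {per_call * 1e6:.2f} µs per call")
```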
Comment From: jbrockmendel
This would be great, go for it!
Comment From: Code0x58
The current codebase's performance is much better than the 1.1.5 version I had to work with, so, broadly, caching does not seem to give an advantage when applied as widely as in this branch.
There were still some obvious and easy speed gains in the is_*_dtype(...) and get_block_type(...) methods, so I have them in a PR (#49112).
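To show the shape of the idea rather than its exact content, here is a generic sketch of caching a dtype-to-block lookup; the block classes are made-up stand-ins, and this is not the code in the PR.

```python
from functools import lru_cache

import numpy as np


class NumericBlock:
    pass


class DatetimeBlock:
    pass


class ObjectBlock:
    pass


@lru_cache(maxsize=None)
def get_block_type_cached(dtype):
    # dtype objects hash and compare cheaply, so this dispatch runs once per
    # distinct dtype rather than once per column of every new DataFrame.
    if dtype.kind in "iufb":
        return NumericBlock
    if dtype.kind == "M":
        return DatetimeBlock
    return ObjectBlock


assert get_block_type_cached(np.dtype("float64")) is NumericBlock
```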
Maybe ExtensionDtype.is_dtype(...) could be sped up by looking at the cls.construct_from_string(...) part of it, but applying caching in the base implementation causes some speed-ups and some slowdowns, so it may be better to add a specialised implementation for dtypes with a more complex ExtensionDtype.construct_from_string(...) implementation. I haven't pursued that line of thought in the PR for the main branch. That said, the construct_from_string(...) implementations may benefit from less costly implementations (e.g. not using the re module).
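As a hedged sketch of that last point: for a dtype whose string form is a fixed name, construct_from_string can be a plain string comparison rather than a regular-expression match. MyDtype below is a made-up example, not a pandas class, and a real dtype would also need construct_array_type and friends.

```python
from pandas.api.extensions import ExtensionDtype


class MyDtype(ExtensionDtype):
    # Only construct_from_string is sketched here; a complete ExtensionDtype
    # would also implement construct_array_type().
    name = "my_dtype"
    type = object

    @classmethod
    def construct_from_string(cls, string):
        # A plain equality check is enough for a parameter-less dtype and
        # avoids compiling and matching a regular expression on every call.
        if not isinstance(string, str):
            raise TypeError(
                f"'construct_from_string' expects a string, got {type(string)}"
            )
        if string == cls.name:
            return cls()
        raise TypeError(f"Cannot construct a 'MyDtype' from '{string}'")


dtype = MyDtype.construct_from_string("my_dtype")
assert isinstance(dtype, MyDtype)
```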