Right now, pandas lets you register your extension dtype. This does a few things, but the primary one is letting pandas find your dtype given a string alias provided by the user.
An unfortunate consequence is that
pd.array([1, 2], dtype="mydtype")
may fail, while
import myarray # registers the dtype as a side-effect
pd.array([1, 2], dtype="mydtype")
succeeds. We might consider adding an entrypoint.
# Key is the string alias, value is the path to your Dtype.
"pandas_extension_dtypes": []
and 3rd parties register to it with
"pandas_extension_dtypes": ["mydtype = myarray:MyDtype"]
This avoids any awkward import-time side effects.
Comment From: TomAugspurger
@jorisvandenbossche would the same be appropriate for pyarrow?
Comment From: jorisvandenbossche
I need to think that through a bit.
For pandas, we would then check the entrypoints on pandas import (and not lazily, as for the plotting backend)? Do you know the performance of this? (I seem to recall that it can be problematic for certain use cases.)
Comment From: TomAugspurger
we would then check the entrypoints on pandas import
Would it need to be at pandas import time? I would think we could do it on demand, e.g. pd.array() (or perhaps in pandas_dtype).
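To make "on demand" concrete, here is a rough sketch of a registry that consults a discovery step only on the first lookup that misses. LazyRegistry and the discover callable are hypothetical; this is not pandas' actual registry, and the discovery step stands in for whatever entrypoint scan would be used.

```python
class LazyRegistry:
    """Toy stand-in for a dtype registry (hypothetical, not pandas' own)."""

    def __init__(self, discover):
        self._dtypes = {}
        # `discover` is assumed to return a mapping {alias: dtype}, e.g. the
        # result of scanning an entrypoint group.
        self._discover = discover
        self._scanned = False

    def register(self, alias, dtype):
        """Explicit registration, as extension authors do today."""
        self._dtypes[alias] = dtype

    def find(self, alias):
        """Return the dtype for `alias`, scanning at most once on a miss."""
        if alias not in self._dtypes and not self._scanned:
            self._scanned = True
            for name, dtype in self._discover().items():
                # Explicit registrations take precedence over discovered ones.
                self._dtypes.setdefault(name, dtype)
        return self._dtypes.get(alias)


scan_calls = []


def fake_discover():
    # Records each scan so the laziness is observable.
    scan_calls.append(1)
    return {"mydtype": int}


registry = LazyRegistry(fake_discover)
registry.register("builtin", float)
```

With this shape, importing the library costs nothing extra, and lookups of already-registered aliases never trigger a scan; only the first unknown alias pays for discovery.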
Do you know the performance of this?
Testing with our plotting entrypoint and the holoviews backend, we have the following. Note that I had already imported hvplot, as that takes ~2s on its own.
Total time: 1.18729 s
File: <ipython-input-3-65909a4792ee>
Function: _find_backend at line 4
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def _find_backend(backend: str):
     5         1     352168.0 352168.0     29.7      import pkg_resources  # Delay import for performance.
     6
     7         3      37380.0  12460.0      3.1      for entry_point in pkg_resources.iter_entry_points("pandas_plotting_backends"):
     8         2          4.0      2.0      0.0          if entry_point.name == "matplotlib":
     9                                                       # matplotlib is an optional dependency. When
    10                                                       # missing, this would raise.
    11         1          0.0      0.0      0.0              continue
    12         1        143.0    143.0      0.0          print(entry_point, entry_point.name)
    13         1     797400.0 797400.0     67.2          _backends[entry_point.name] = entry_point.load()
    14
    15         1          2.0      2.0      0.0      try:
    16         1        188.0    188.0      0.0          print(backend, _backends)
    17         1          1.0      1.0      0.0          return _backends[backend]
For reference, import pkg_resources takes ~138ms on my machine (outside of line-profiling, which slows things down).
We might also try using / vendoring the entrypoints package, which I believe is simpler & faster.
Comment From: jpivarski
(As an extension_dtype author, this is a feature I'd very much like to see! So now I'm watching this thread.)
Comment From: jorisvandenbossche
For Arrow, I am not sure it is possible to avoid doing this at import time (that is, to do it lazily), as there are several places where types could be created (and if some C++ is called, they should already be registered). I would need to think a bit more about this. Also, I think the slowness of entrypoints is a problem for pyarrow (certainly if it happens on import).
Comment From: jbrockmendel
I'm unclear on what the problem is with the current registration system. Seems like there should be Just One Way.
Comment From: jpivarski
Ideally, the One Way would be entrypoints, rather than explicit registration. The problem with the current method is:
An unfortunate consequence is that
pd.array([1, 2], dtype="mydtype")
may fail, while
import myarray # registers the dtype as a side-effect
pd.array([1, 2], dtype="mydtype")
succeeds.
Whether or not a dtype is registered depends on explicit importing of a non-Pandas package. Users have to know that "mydtype" traces back to myarray, which might not be obvious if it's a dependency of a dependency of a dependency.
As an extension author, I had another problem: myarray.Array had to be a pd.api.extensions.ExtensionArray subclass, but myarray was supposed to be usable either with Pandas or without it. The myarray package couldn't have Pandas as a dependency because that would interfere with users whose workflows did not involve Pandas. So the array class that inherits from pd.api.extensions.ExtensionArray had to be provisionally imported, and then it looked like this:
import myarray # does not depend on or import Pandas
import myarray.pandas # imports Pandas; raises ModuleNotFoundError if it is not installed
# now stuff works
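As a rough illustration of that kind of conditional import, a base package can check for an optional dependency without importing it. This is only a sketch using the standard library's importlib.util.find_spec; has_module and require are hypothetical helpers, not part of myarray or Pandas.

```python
import importlib.util


def has_module(name):
    """Return True if `name` is importable, without actually importing it."""
    return importlib.util.find_spec(name) is not None


def require(name):
    """Raise a descriptive ModuleNotFoundError if an optional dependency is missing."""
    if not has_module(name):
        raise ModuleNotFoundError(
            f"{name!r} is required for this optional feature but is not installed"
        )


# A Pandas-specific submodule could call require("pandas") at the top,
# leaving the base package importable for users who do not have Pandas.
stdlib_present = has_module("json")
bogus_present = has_module("surely_not_an_installed_module_xyz")
```

This keeps the dependency optional, but it does not solve the discoverability problem described above: users still have to know which submodule to import.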
It's been 3 years, so the way we're doing it eventually evolved into 2 packages: one base package without a Pandas dependency and another with it, so
import awkward as ak # does not depend on or import Pandas
import awkward_pandas # does depend on Pandas, in its pyproject.toml
# now stuff works
(see awkward-pandas). We pushed the problem into the PyPI namespace.
We use Numba in much the same way: some users need it, others don't, and Numba requires up-front registration. But Numba has entrypoints, so the whole problem is resolved by 2 lines in pyproject.toml:
https://github.com/scikit-hep/awkward/blob/791d198ad35298e0b1957d5f03256daf3faad818/pyproject.toml#L52-L53
(and the function these lines call, which is executed only through the entrypoint, not by import awkward).