Our documentation has some information on how to use user-defined functions (UDFs) in pandas: the API pages of the relevant methods, and these sections:

  • https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation-with-user-defined-functions
  • https://pandas.pydata.org/docs/user_guide/gotchas.html#gotchas-udf-mutation

My understanding is that we've mostly been discouraging the use of functions like apply, or at least the community has, with many posts and comments saying that apply is slow, which seems fair. With the ongoing work to support JIT compilers in these functions (see https://github.com/pandas-dev/pandas/pull/54666 and https://github.com/pandas-dev/pandas/pull/61032), this can hopefully change, allowing clearer code in some cases without compromising speed.
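To make the "apply is slow" point concrete, here is a minimal sketch of the pattern those posts usually complain about. The frame and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise UDF: the lambda is called once per row from Python,
# and each row is materialized as a Series first
via_apply = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized equivalent: a single operation on whole columns
vectorized = df["a"] + df["b"]

# Both approaches produce the same values
assert via_apply.tolist() == vectorized.tolist()
```

For arithmetic this simple the vectorized form is the obvious choice. The UDF form becomes attractive when the per-row logic is hard to express with vectorized operations, which is where faster execution engines would matter.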

I think it may be difficult to communicate all the information related to UDFs in the existing groupby and FAQ sections and in the API docs. A dedicated page in the user guide that explains when to use UDFs, gives a general idea of the API, and covers the differences between the methods and the options available seems like a better idea.

Also, the APIs of the different methods are quite inconsistent, and in some cases cumbersome. I think writing this page will be a good exercise to identify cases where explaining the functionality to users is complex and unintuitive, and to see if we can address them.
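As a rough illustration of the kind of differences such a page would need to explain, the sketch below runs small UDFs through a few of the methods. The data and the `spread` helper are hypothetical, and this is not a complete list of the methods that accept UDFs:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def spread(values):
    # Hypothetical reducing UDF: difference between max and min
    return values.max() - values.min()


# Series.map calls the function once per element
doubled = df["a"].map(lambda x: x * 2)

# DataFrame.apply passes each column as a Series (axis=0 is the default)
col_spread = df.apply(spread)

# DataFrame.agg with a single callable also reduces each column
agg_spread = df.agg(spread)

# DataFrame.transform expects a result with the same length as the input
centered = df.transform(lambda col: col - col.mean())
```

Here `apply` and `agg` give the same result for a reducing function, while `map` and `transform` differ in what the function receives (a scalar versus a whole column). These are the kinds of overlaps and distinctions that are hard to discover from the individual API pages.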