The APIs for the apply and map methods don't seem ideal. They were created in the very early days of pandas. Since then both pandas and Python have changed a lot, we have much more experience, and the environment is different, with type checking among other things.
A good first example is the na_action parameter of map. I assume it was designed with the idea that different actions could be applied to missing values in an elementwise operation. In practice, more than 15 years later, no action other than "ignore" has been implemented, and the resulting API is, in my opinion, far from ideal:
df.map(func, na_action=None)
df.map(func, na_action="ignore")
This also makes type checking unnecessarily complex. A better API would use a plain boolean, skip_na or ignore_na:
df.map(func, skip_na=False)
df.map(func, skip_na=True)
df.map(func, skip_na=action == "ignore")
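To make the comparison concrete, here is a minimal sketch of how a boolean flag could be layered on top of the current parameter. The helper name map_skip_na and its skip_na argument are hypothetical and only illustrate the proposal; the only pandas behavior relied on is that na_action accepts None or "ignore".

```python
import pandas as pd


def describe(value):
    # Any elementwise function; here it just formats the value.
    return f"value={value}"


def map_skip_na(obj, func, skip_na=False):
    """Hypothetical helper: translate a boolean into today's na_action."""
    return obj.map(func, na_action="ignore" if skip_na else None)


s = pd.Series([1.0, None, 3.0])

# skip_na=False: func is also called on the missing value.
mapped_all = map_skip_na(s, describe)

# skip_na=True: the missing value is propagated without calling func.
mapped_skip = map_skip_na(s, describe, skip_na=True)

print(mapped_all.isna().tolist())
print(mapped_skip.isna().tolist())
```

With a boolean, the type annotation becomes a plain bool instead of a union of None and a string literal.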
Another example is the inconsistency with args and kwargs. Some functions have both, some have just kwargs, and we've recently been adding a few that were missing... Also, when it exists, args is a regular parameter while kwargs is a ** parameter. This is inconsistent in itself, and it has become more confusing as the number of parameters has slowly grown. For example:
df.apply(func, 0, result_type=None, result_format="reduction", engine=numba.njit, engine_params={"val": 0})
I don't think even advanced pandas users would be able to easily tell what parameters will be passed to the function. A much clearer API would be:
df.apply(func, kwargs={"result_format": "reduction", "engine_params": {"val": 0}}, axis=0, result_type=None, engine=numba.njit)
I think with this call it's immediately clear to users which arguments belong to apply and which belong to func.
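As a rough sketch of the explicit style, the wrapper below takes the function's arguments as args and kwargs and forwards everything else to DataFrame.apply. The name apply_explicit is hypothetical, and binding the arguments in a closure is just one way to emulate the proposed signature with the current API.

```python
import pandas as pd


def apply_explicit(df, func, args=(), kwargs=None, **apply_options):
    """Hypothetical wrapper: args/kwargs go to func, the rest to apply."""
    kwargs = {} if kwargs is None else kwargs
    # Binding the arguments in a closure means a func keyword can never
    # collide with an apply parameter of the same name.
    return df.apply(lambda col: func(col, *args, **kwargs), **apply_options)


def scale(col, factor, offset=0):
    return col * factor + offset


def shift(col, axis):
    # A func parameter that happens to share its name with apply's axis.
    return col + axis


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

scaled = apply_explicit(df, scale, args=(10,), kwargs={"offset": 1}, axis=0)
shifted = apply_explicit(df, shift, kwargs={"axis": 5}, axis=0)

print(scaled)
print(shifted)
```

The shift call shows the case that is ambiguous today: with a ** parameter, a keyword named axis could not be passed through to the function at all.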
Another inconsistency is the arg / func parameter in Series.map and DataFrame.map. The two methods are conceptually the same, just applying the function elementwise to either a Series or a DataFrame, but the signature and the behavior differ slightly: Series will accept a dictionary, and DataFrame won't. Given that a dictionary can be converted to a function by just appending .get to it, I think it'd be better to make both methods consistently accept only Python callables or numpy ufuncs.
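A small sketch of the dictionary case, using only Series.map so it runs on older pandas versions too. The variable names are illustrative. One difference to keep in mind is that dict.get returns None for a missing key, whereas mapping with the dictionary itself yields NaN; both count as missing for isna.

```python
import pandas as pd

codes = pd.Series(["a", "b", "c"])
labels = {"a": "alpha", "b": "beta"}  # no entry for "c"

# Current Series-only behavior: the dictionary is passed directly.
from_dict = codes.map(labels)

# Callable form: the bound method dict.get is an ordinary function.
from_callable = codes.map(labels.get)

print(from_dict.tolist())
print(from_callable.tolist())
```

Both forms return the same labels for keys that are present and a missing value for the key that is not.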
Finally, these methods have evolved over time, including the introduction and later removal of applymap, but at this point it is probably also a good idea to deprecate the legacy behavior of Series.apply behaving like Series.map depending on the value of by_row, which it does by default. This is a bit tricky for backward compatibility reasons, but I think it eventually needs to be done, as it makes the API very counter-intuitive. Having map always be elementwise and apply always be axis-wise would make users' lives much easier, and the usage much easier to learn and explain.
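The following sketch shows the current behavior by checking what the function actually receives in each case. The helper kind is made up for illustration; the calls use only default arguments.

```python
import pandas as pd


def kind(x):
    # Report whether the function received a whole Series or one element.
    return "Series" if isinstance(x, pd.Series) else "scalar"


s = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

series_map = s.map(kind).tolist()      # elementwise
series_apply = s.apply(kind).tolist()  # also elementwise by default
frame_apply = df.apply(kind).tolist()  # one call per column

print(series_map)
print(series_apply)
print(frame_apply)
```

With default arguments, Series.apply hands the function scalars just like Series.map does, while DataFrame.apply hands it whole columns.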
We can also discuss result_type and by_row in DataFrame.apply, which are very hard to understand.
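As an illustration of why result_type is hard to reason about, the sketch below runs the same row-wise function, which returns a list, under three settings. It is adapted from the kind of example used in the pandas documentation, and the variable names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])


def pair(row):
    # Returns a list-like for every row.
    return [1, 2]


# Default: the lists are kept as values of a Series.
as_series = df.apply(pair, axis=1)

# "expand": each list is spread over new integer-labelled columns.
expanded = df.apply(pair, axis=1, result_type="expand")

# "broadcast": the result keeps the original shape and column labels.
broadcast = df.apply(pair, axis=1, result_type="broadcast")

print(type(as_series).__name__)
print(list(expanded.columns))
print(list(broadcast.columns))
```

The same function yields a Series or two differently labelled DataFrames depending on a single string argument, which is difficult to predict from the call site alone.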