Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc

Documentation problem

Surprisingly, the following code works:

import pandas as pd

df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]], columns=["let", "num"]).set_index(
    "let"
)

dfl = df.loc[set(["a", "b"]), :]

print(dfl)

dd = {"a": 10, "c": 12}

dfd = df.loc[dd, "num"]

print(dfd)

In the documentation, we don't say that you can pass a set or a dict as one of the arguments to .loc .

So either we update the documentation, or maybe we should be raising an Exception?

Comment From: attack68

+1 raising and documenting. also restricting .loc to lists only and not sequence inputs: see https://github.com/pandas-dev/pandas/pull/42329#issuecomment-872249189

Comment From: Delengowski

I'm assuming for speed the implementation of .loc[] tries to iterate rather doing an isinstance check. This is unfortunate because isinstance(dict, typing.Sequence) == False as does isinstance(set, typing.Sequence) iirc.

Comment From: phofl

cc @jbrockmendel Should we decide for 1.4 here?

Comment From: jbrockmendel

it'd be nice to get in, yah

Comment From: phofl

Will look into this the coming days

Comment From: nsfinkelstein

Can I ask what was the reasoning behind deprecating the ability to use a set as an indexer? I've followed the discussion and links above, but cannot discern why the decision was made to deprecate the use of sets as indexers. In my experience, it's incredibly helpful to be able to use sets as indexers, and I'd be happy to contribute code towards un-deprecating this option if possible.

Looking through the code a bit, my understanding is that the index will be converted into a list in any case here: https://github.com/phofl/pandas/blob/f04ec539eaf4ab1d03c1654e6f90b32c6b56f078/pandas/core/frame.py#L3505-L3507.

Comment From: phofl

-1 on undeprecating, sets only work in some cases, in others they naturally raise an error. iloc is one of those cases.

Comment From: nsfinkelstein

Thanks for the response. I meant to ask about why it was deprecated in .loc[], not .iloc[].

Is the issue that if it is deprecated for one, it must be deprecated for both?

Comment From: phofl

Loc is already to flexible, so we restricted this since it was not documented anyway

Comment From: nsfinkelstein

I see, thanks. Would you be open to a pull request that adds documentation rather than alters behavior?

I have a lot of code that relies on this behavior.

Comment From: phofl

Again, I am -1 on undeprecating

Comment From: nsfinkelstein

Can I ask why?

With such an established and widely used library, I would think removing functionality that people rely on should only happen if there's a substantive reason it shouldn't be maintained. I'm trying to understand whether that's the case here. So far I can't see any reason for this functionality to have been removed.

Comment From: phofl

Set has random order, e.g. the order of your DataFrame is random as well, which leads to all sorts of weird behavior, you will probably end up implicitly aligning your DataFrame in follow up operations, which costs performance and ist just counterintuitive. A set as an indexer is in general not a good idea

Comment From: nsfinkelstein

I see, thanks for explaining your reasoning.

I disagree with your assessment for the following reasons. There are many cases in which dataframe order does not matter, and there are also many cases in which dataframe order should be determined by elements of the dataframe, such that it must be sorted after selection in any case.

If the indexer is already a set, as it often is, than in either of these cases, converting it to a list just to satisfy an apparently arbitrary requirement of pandas is inefficient in terms of performance, and also adds unnecessary code complexity.

If the reasoning for the deprectaion is just what's contained in the above comment, I'll go ahead and submit a pull request to un-deprecate the use of sets as indexers for .loc[]. Please let me know if there were additional reasons it was deprecated.

Comment From: attack68

I have been a supporter of formalising and reducing the number of potential input combinations to loc. It is a complex part of the code base especially when considering hierarchical indexes. I believe there are various scattered historical discussions around this.

You are welcome to submit to a PR but my instinct tells me it will be difficult for you to garner the support to get it approved.

Are there any alternatives you can use to set, i.e. is there a sensible workaround to your problems?

Comment From: phofl

Loc is complicated enough, inconsistency with iloc is another reason. Behavior is just counterintuitive. It is literally not covered at all in our testing suite. No clue which corner cases work and which don’t. It was never documented, so people shouldn’t be relying on it anyway. We deprecated this a year ago and nobody complained till now.

Comment From: nsfinkelstein

Thanks to you both for explaining more of your reasoning.

Set operations are very well suited to creating indexers in a wide variety of settings, and so I've used sets for that purpose extensively over the course of many years, both for hierarchical and non-hierarchical indices. I could go through quite a lot of code and wrap the indexer in all my calls to .loc[] in list(), before I update to the most recent version of pandas, in all of the various places the code is deployed. But again, I would be surprised at the need to do this with such a core part of a very established library.

My understanding is that this conversion to a list is exactly what is currently happening anyway in the following lines: https://github.com/phofl/pandas/blob/f04ec539eaf4ab1d03c1654e6f90b32c6b56f078/pandas/core/frame.py#L3505-L3507. Is this understanding incorrect?

After exploring the code a bit more, my understanding is that the issue with sets with hierarchical indices arise not when the indexer is a set, but when the indexer contains sets. I.e. if the collection of indices is a set, there is no problem -- it just gets turned into a list in the lines linked above. If the collection contains sets, then there's a problem because the level of the index that each element in the indexer refers to is ambiguous. Does this match your understanding of the issues you are referencing above?

My suggestion is that we just de-deprecate using sets as indexers, not using indexers that contain sets. I'd be happy to add the appropriate documentation and tests. Given that set indexers are just converted to list indexers, unless I'm mistaken, I'm curious to know what about the behavior you find counter-intuitive. Please let me know your thoughts.

Comment From: jbrockmendel

If updating your code in too many places is the issue, then consider patching Loc.__getitem__/__setitem__ to convert lists in just one place.

I agree with @phofl and @attack68; maintaining Loc is a beast and simplifying the behavior where feasible is a positive.

Comment From: nsfinkelstein

I understand that this functionality will not be undeprecated. Thank you all for your time and consideration, and your work on pandas.

------- Original Message ------- On Thursday, December 1st, 2022 at 7:58 PM, jbrockmendel @.***> wrote:

If updating your code in too many places is the issue, then consider patching Loc.getitem/setitem to convert lists in just one place.

I agree with @.(https://github.com/phofl) and @.(https://github.com/attack68); maintaining Loc is a beast and simplifying the behavior where feasible is a positive.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>