from this SO question
Comment From: hayd
Would this just be on the Series values (i.e. ignore index)?
I had a look in algos and couldn't see anything (np intersect1d seems slow).
OT but a weird thing from that Q is that you can't call Series on a set.
Comment From: cpcloud
yet a frozenset works....weird...i would've thought that frozenset was a subclass of set, guess not
Comment From: cpcloud
well it doesn't work...it returns a frozenset
...
Comment From: hayd
(I don't see why we enforce this, list is happy to take a set, why shouldn't Series?)
Comment From: cpcloud
because there's no way to map indices to set, they are arbitrary since a set object is unordered
Comment From: hayd
ha! Series lets a lot of stuff drop through.... e.g. Series(1)
Comment From: cpcloud
but you're right....list does it so there must be some arbitrary indices assigned
Comment From: cpcloud
i just discovered that Series takes generators! i had no idea.
Comment From: hayd
and that's the workaround, right? pass it to list first... kinda sucks. Also, isn't a dict similarly unordered (and yet we allow that)? :s
Interestingly I thought we used np.fromiter to do that, but apparently it's just list.
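The workaround discussed here — materializing an order via `list()` before handing the set to the constructor — can be sketched as follows (the generator case mentioned above is included for comparison):

```python
import pandas as pd

# Constructing a Series directly from a set raises a TypeError
# (sets are unordered), so materialize an order with list() first.
s = pd.Series(list({3, 1, 2}))

# Generators, by contrast, are accepted and consumed in order.
g = pd.Series(x * 2 for x in range(3))
```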
Comment From: cpcloud
but dict has keys which are used as the index
Comment From: cpcloud
i guess it's a workaround, but how else would you do it?
Comment From: cpcloud
i made the change....nothing breaks...i'll submit
Comment From: cpcloud
for set construction that is
Comment From: cpcloud
#4482
Comment From: jreback
fyi...the Series(1) stuff is getting squashed in #3862 (by #3482), it's just odd
Comment From: hayd
Cool bananas
Comment From: cpcloud
won't be adding this, so closing
Comment From: ghost
@cpcloud, just making sure you're closing this is unrelated to the discussion in #4482, which was only about modifying the Series ctor to accept sets, not about adding set operations to Series.
Comment From: cpcloud
you're right...that pr was just to disallow frozenset and was only related to the ctor...reopening
Comment From: jreback
@cpcloud what's the status on this?
Comment From: cpcloud
gone by the wayside .... i don't really have time to implement this ... but i think we should leave it open ... marking as someday
Comment From: jreback
ok..gr8 thxs
Comment From: jreback
most of these I just push to 0.14.....someday is a box very rarely opened :)
Comment From: makmanalp
Hmm, is this closed now since 0.14 is out?
Comment From: jreback
@makmanalp what are you trying to do?
Comment From: makmanalp
Efficiently calculate which of the rows of a column in df1 also exist in another column in df2 (or perhaps indices instead of columns).
Comment From: jreback
this issue is a bit different than that (see the linked question), did you try isin?
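The `isin` suggestion covers the "which rows of one column also exist in another" question directly, since it compares values and ignores the index. A minimal sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([3, 4, 5, 6])

# Rows of s1 whose values also appear in s2; indexes are ignored
common = s1[s1.isin(s2)]
```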
Comment From: hayd
In the original question it looks like the OP wants to ignore the index... in which case they can use the set operations in Index:
pd.Index(s0.values) & pd.Index(s1.values)
Comment From: hayd
Wow, that's really slow, I take that back...
Comment From: jreback
Index keeps things ordered; shouldn't do it that way, better to drop into numpy or python, do the set operation and reconstruct.
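The "drop into numpy or python, do the set operation and reconstruct" pattern looks like this (both variants shown; `np.intersect1d` returns the sorted unique intersection):

```python
import numpy as np
import pandas as pd

s0 = pd.Series([1, 2, 3, 4])
s1 = pd.Series([3, 4, 5])

# Via numpy: set intersection of the values, original index discarded
result = pd.Series(np.intersect1d(s0.values, s1.values))

# Via plain Python sets, then reconstruct a Series
result_py = pd.Series(sorted(set(s0) & set(s1)))
```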
Comment From: milindsmart
Any update on this? Is this still impossible? I was looking at doing a symmetric difference.
Comment From: h-vetinari
This is something I would have needed several times in the last half year. Using apply is just terribly slow for large data sets (and numpy has fast set implementations like np.intersect1d) - for example, I have code where apply + intersect is 98% of running time.
To recap (since there's a lot of tangential discussion in this thread), I think there is a good case to be made for a .set accessor, providing access to universal functions operating on sets, like there currently is for .str and .dt. As an example:
import pandas as pd # 0.21.0
import numpy as np # 1.13.3
# this function is just for demonstration purposes
def random_sets(n = 100):
length = 10
# strings of random numbers, padded to common length
s = pd.Series(np.random.randint(0, 10**length-1, (n,), dtype = np.int64)).astype(str).str.zfill(length)
# split into set of individual numbers and cast back to integers
return s.map(set).apply(lambda s: set(int(x) for x in s))
a = random_sets(5)
# WANTED: a.set.intersect(set([1, 2])) should result in:
a.apply(lambda r: r & set([1, 2])) # intersection of 'a' with {1, 2}, e. g.
# 0 {1, 2}
# 1 {1}
# 2 {}
# 3 {1, 2}
# 4 {}
b = random_sets(5)
# WANTED: a.set.intersect(b) should result in:
pd.concat([a, b], keys = ['a', 'b'], axis = 1).apply(lambda row: row['a'] & row['b'], axis = 1) # intersection of 'a' and 'b' (per row!), e. g.
# 0 {8, 1, 2}
# 1 {1, 3, 5, 7, 9}
# 2 {0, 8, 5, 9}
# 3 {8, 2, 3, 6}
# 4 {8, 9, 3, 6}
Like the .str methods, it should work with either another Series (including index alignment), or broadcast a 'scalar' set correspondingly. Some important methods I think should be implemented (the function signature tries to indicate the action per row; the names are just suggestions):
(set, set) -> set:
- a.set.intersect(b) - as above
- a.set.union(b) - row-wise union of a and b
- a.set.diff(b) - row-wise set difference of a and b
- a.set.xor(b) - row-wise symmetric difference of a and b

(set, set) -> bool:
- a.set.subset(b) - row-wise check if a is a subset of b
- a.set.superset(b) - row-wise check if a is a superset of b

(set, obj) -> bool:
- a.set.contains(c) - row-wise check if a contains c
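For illustration, the proposed API shape could be prototyped today with pandas' extension-accessor machinery (`pandas.api.extensions.register_series_accessor`, available since 0.23). Note this is entirely hypothetical — no `.set` accessor exists in pandas, and this sketch falls back to the slow apply/combine path that the proposal wants to replace with fast compiled code; it only demonstrates the interface:

```python
import pandas as pd

# Hypothetical prototype: the "set" accessor and its methods are
# illustrative only, not real pandas API.
@pd.api.extensions.register_series_accessor("set")
class SetAccessor:
    def __init__(self, series):
        self._s = series

    def intersect(self, other):
        if isinstance(other, pd.Series):
            # row-wise, with index alignment
            return self._s.combine(other, lambda x, y: x & y)
        # broadcast a 'scalar' set against every row
        return self._s.apply(lambda x: x & other)

a = pd.Series([{1, 2, 3}, {2, 4}])
b = pd.Series([{2, 3}, {4, 5}])
row_wise = a.set.intersect(b)
broadcast = a.set.intersect({2})
```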
Comment From: jreback
@h-vetinari sets are not efficiently stored, so this offers only an api benefit, for which I have yet to see an interesting use case. You can use Index for these operations individually and that is quite efficient. Series.isin is pretty much .contains. IntervalIndex will work for some of these cases as well.
Comment From: h-vetinari
@jreback, well, I was hoping not just for an API improvement, but some fast cython code to back it up (like for the .str methods). ;-)
Do I understand you correctly that you propose to work with Series of (short) pd.Indexes? How would you then do something like a.set.intersect(b) (as described above)?
Comment From: jreback
@h-vetinari and you are welcome to contribute things. I think the current impl would be quite inefficient and there's no easy way to get around this ATM.
Comment From: chinchillaLiao
import pandas as pd
df = pd.DataFrame({'a':{1,2,3}, 'b':{2,3,4}})
Difference operator works between Series.
df['a - b'] = df['a'] - df['b']
Set intersection operator doesn't work between Series.
df['a & b'] = df['a'] & df['b']
A very slow way to do intersection between Series:
df['a & b'] = df.apply(lambda row: row['a'] & row['b'], axis = 1)
I found it is much faster to do intersection this way:
df['a & b'] = df['a'] - (df['a'] - df['b'])
I don't know why.
Comment From: h-vetinari
@chinchillaLiao : cool, didn't know set difference worked on Series! It's the only one that works at the pandas level though.
But an even better work-around is to go down to the numpy implementation with .values. In particular, this shouldn't suffer from the speed degradation you're reporting.
(@jreback; my comment half a year ago about a .set accessor now seems very superfluous - why not just enable the numpy behaviour directly in pandas?)
df = pd.DataFrame([[{1,2}, {2,3}],[{2,4}, {3, 1}]], columns=['A', 'B'])
df
# A B
# 0 {1, 2} {2, 3}
# 1 {2, 4} {1, 3}
df['A - B'] = df.A - df.B # only one that works out of the box in pandas
df['A - B']
# 0 {1}
# 1 {2, 4}
# dtype: object
df['A & B'] = df.A & df.B
# TypeError: unsupported operand type(s) for &: 'set' and 'bool'
df['A & B'] = df.A.values & df.B.values
df['A & B']
# 0 {2}
# 1 {}
# Name: A & B, dtype: object
df['A | B'] = df.A | df.B
# TypeError: unsupported operand type(s) for |: 'set' and 'bool'
df['A | B'] = df.A.values | df.B.values
df['A | B']
# 0 {1, 2, 3}
# 1 {1, 2, 3, 4}
# Name: A | B, dtype: object
df['A ^ B'] = df.A ^ df.B
# TypeError: unsupported operand type(s) for ^: 'set' and 'bool'
df['A ^ B'] = df.A.values ^ df.B.values
df['A ^ B']
# 0 {1, 3}
# 1 {1, 2, 3, 4}
# Name: A ^ B, dtype: object
df
# A B A - B A & B A | B A ^ B
# 0 {1, 2} {2, 3} {1} {2} {1, 2, 3} {1, 3}
# 1 {2, 4} {1, 3} {2, 4} {} {1, 2, 3, 4} {1, 2, 3, 4}
In terms of usability, the really cool thing is that this also works for many-to-one comparisons.
dd = df.A.to_frame()
C = {2, 5}
dd['A - C'] = df.A - C
dd['A & C'] = df.A.values & C
dd['A | C'] = df.A.values | C
dd['A ^ C'] = df.A.values ^ C
dd
# A A - C A & C A | C A ^ C
# 0 {1, 2} {1} {2} {1, 2, 5} {1, 5}
# 1 {2, 4} {4} {2} {2, 4, 5} {4, 5}
Comment From: jreback
sets are not first class and actually completely inefficient in a Series
Comment From: h-vetinari
Inefficient as opposed to what? Some situations fundamentally require processing sets.
And even so, why make treating sets harder than it needs to be? I used to think (see my response from December) that this wasn't implemented at all, but since it's in numpy already, why not just expose that functionality on a pandas level? Sure as hell beats writing your own .apply() loops, both in terms of speed and code complexity.
Comment From: jreback
complexity in terms of implementation and code
sure if u wanted to contribute would be great
but it’s not trivial to do in a first class supported way
Comment From: hayd
@h-vetinari what is your use case for this? How does this come up?
IMO a nice way to contribute this would be with an extension type in a library. Depending on your use case. If you have a smallish finite super-set you can describe each set as a bitarray (and hence do set operations cheaply).
Note: This is quite different from the original issue: set operations like set(s1) & set(s2)...
Comment From: h-vetinari
@jreback
OK, I'm thinking about contributing that. Since the numpy-methods I showed above are actually not nan-safe,
np.array([{1,2}, np.nan]) | np.array([{2,4}, {3, 1}])
# TypeError: unsupported operand type(s) for |: 'float' and 'set'
I'm back to thinking that a set accessor for Series (not for Index) would be the best. And, since I wouldn't have to write the cython for those methods, I think I can come up with such a wrapper relatively easily.
@hayd Thanks for the link. I've had several use cases over time, e.g. joining email-addresses/telephone numbers when deduplicating user information. But it keeps cropping up. I'm actually really happy I found out about those numpy-methods some days ago. ;-)
Re:
Note: This is quite different from the original issue: set operations like set(s1) & set(s2)...
I've chosen to comment on this issue (rather than opening a new one) due to the title, which imo has a much larger scope. I could easily open a more general issue, if desired.
Comment From: jbrockmendel
Discussed on today's dev call and the consensus was to convert your Series to Index and do setops on those. Closing.
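The recommended pattern — converting each Series to an Index and using the Index set operations — looks like this (Index set ops return sorted results for these inputs):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([3, 4, 5])

idx1, idx2 = pd.Index(s1), pd.Index(s2)

both = idx1.intersection(idx2)
either = idx1.union(idx2)
only_s1 = idx1.difference(idx2)
sym = idx1.symmetric_difference(idx2)
```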