Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [[0, 1], [5], [], [2, 3]], "B": [9, 8, 7, 6], "C": [[1, 2], np.nan, [], [3, 4]]})
df.explode(["A", "C"])
Issue Description
The DataFrame.explode function does not handle the example above when exploding on multiple columns. For the DataFrame df defined above, exploding columns "A" and "C" individually produces the correct output:
>>> df.explode("A")
     A  B       C
0    0  9  [1, 2]
0    1  9  [1, 2]
1    5  8     NaN
2  NaN  7      []
3    2  6  [3, 4]
3    3  6  [3, 4]
>>> df.explode("C")
        A  B    C
0  [0, 1]  9    1
0  [0, 1]  9    2
1     [5]  8  NaN
2      []  7  NaN
3  [2, 3]  6    3
3  [2, 3]  6    4
However, when attempting df.explode(["A", "C"]), one receives the error
>>> df.explode(["A", "C"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/emaanhariri/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 8254, in explode
    raise ValueError("columns must have matching element counts")
ValueError: columns must have matching element counts
Here the mylen function on line 8335 in the explode function, defined as mylen = lambda x: len(x) if is_list_like(x) else -1, induces the error above: the entries [] and np.nan are computed as having different lengths (0 and -1, respectively) despite both ultimately occupying a single entry in the expansion.
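To make the mismatch concrete, the sketch below applies the same length rule as the quoted lambda to columns "A" and "C" of the example. It re-declares mylen as a standalone function for illustration only and is not the pandas source itself; it assumes is_list_like is the public helper in pandas.api.types.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

df = pd.DataFrame(
    {
        "A": [[0, 1], [5], [], [2, 3]],
        "B": [9, 8, 7, 6],
        "C": [[1, 2], np.nan, [], [3, 4]],
    }
)


def mylen(x):
    # Same rule as the lambda quoted above: non-list-likes (such as NaN) count as -1.
    return len(x) if is_list_like(x) else -1


counts_a = [mylen(x) for x in df["A"]]
counts_c = [mylen(x) for x in df["C"]]

print(counts_a)  # [2, 1, 0, 2]
print(counts_c)  # [2, -1, 0, 2]
```

In this example the counts disagree at row 1, where [5] has length 1 and np.nan is assigned -1. An empty list is assigned 0, which differs from both.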
Expected Behavior
We expect the output to be an appropriate mix of the outputs of df.explode("A") and df.explode("C"), as follows.
>>> df.explode(["A", "C"])
     A  B    C
0    0  9    1
0    1  9    2
1    5  8  NaN
2  NaN  7  NaN
3    2  6    3
3    3  6    4
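One way to obtain a result of this shape today, offered only as a sketch and not as something proposed in this issue, is to replace NaN and empty lists with a one-element [np.nan] list before exploding, so that every row has matching element counts. pad_empty is a hypothetical helper name, and the sketch assumes a pandas version in which explode accepts a list of columns (1.3 or later).

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

df = pd.DataFrame(
    {
        "A": [[0, 1], [5], [], [2, 3]],
        "B": [9, 8, 7, 6],
        "C": [[1, 2], np.nan, [], [3, 4]],
    }
)


def pad_empty(x):
    # Hypothetical helper: keep non-empty list-likes as they are and turn
    # scalars (such as NaN) and empty lists into a single-element list.
    if is_list_like(x) and len(x) > 0:
        return x
    return [np.nan]


padded = df.assign(A=df["A"].map(pad_empty), C=df["C"].map(pad_empty))
result = padded.explode(["A", "C"])
print(result)
```

Rows 1 and 2 each contribute exactly one output row, with NaN filling whichever column had no elements.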
Comment From: rhshadrach
Thanks for the report! Confirmed on main. It looks like using 1 instead of -1 may fix it, though I haven't checked thoroughly. Would you be interested in submitting a PR to fix this, @ehariri?
Comment From: ehariri
Hi @rhshadrach, sure, I could submit a PR! I believe the fix would also need to change len(x) to max(1, len(x)) to account for empty lists as well?
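A quick check of that suggestion, as a sketch: proposed_len below is a hypothetical standalone function with both changes applied (1 instead of -1 for non-list-likes, and max(1, len(x)) for list-likes). It is written for illustration and is not taken from the pandas source.

```python
import numpy as np
from pandas.api.types import is_list_like


def proposed_len(x):
    # Hypothetical rule: NaN and empty lists both occupy one output row.
    return max(1, len(x)) if is_list_like(x) else 1


col_a = [[0, 1], [5], [], [2, 3]]
col_c = [[1, 2], np.nan, [], [3, 4]]

new_counts_a = [proposed_len(x) for x in col_a]
new_counts_c = [proposed_len(x) for x in col_c]

print(new_counts_a)  # [2, 1, 1, 2]
print(new_counts_c)  # [2, 1, 1, 2]
```

With this rule the per-row counts of "A" and "C" agree, so the matching-element-counts check would no longer reject the example.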
Comment From: rhshadrach
@ehariri - that sounds right to me, but I haven't taken an in-depth look. A PR would be great!
Comment From: GYHHAHA
Personally, I think the current behavior is reasonable. Since the NaN in "C" actually means either zero elements or a value of some type we don't know, why would we still want to match it with [5]? What is the desired output when [5] is changed to [5, 5]? Should we also expand the NaN value, or raise a ValueError?
Comment From: rhshadrach
@GYHHAHA - shouldn't df.explode("A").explode("C") be the same as df.explode(["A", "C"])?
Comment From: GYHHAHA
For example, will the multi-column explode result be the same as a.explode("A").explode("B")? Does this mean that for any row where both entries are iterable, we would perform a cross-join-like explosion? If so, why not simply call the single-column explode iteratively for each column in the list? @rhshadrach
>>> a = pd.DataFrame({"A":[[1,2,3]],"B":[[1,2]]})
>>> a
           A       B
0  [1, 2, 3]  [1, 2]
>>> a.explode("A")
   A       B
0  1  [1, 2]
0  2  [1, 2]
0  3  [1, 2]
>>> a.explode("A").explode("B")
   A  B
0  1  1
0  1  2
0  2  1
0  2  2
0  3  1
0  3  2
>>> a.explode(["A", "B"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\gyh\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py", line 8337, in explode
    raise ValueError("columns must have matching element counts")
ValueError: columns must have matching element counts
Comment From: ehariri
@GYHHAHA, I believe your point is valid with regard to whether we should expect df.explode(['a', 'b', 'c', ...]) to match df.explode('a').explode('b').explode('c').... However, I believe that point is moot, because it does not change the fact that NaN and [] should each have cardinality 1 when exploded on multiple columns.
Comment From: rhshadrach
Ah, thanks for the correction @GYHHAHA - I did not realize that .explode(['a', 'b']) will zip rather than take the product.
@ehariri - +1 on your expectation of cardinality 1.
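The zip-versus-product distinction discussed above can be seen in a small sketch. The frame here is a variation on the earlier one, with "B" given three elements so that the multi-column call succeeds; it assumes a pandas version in which explode accepts a list of columns (1.3 or later).

```python
import pandas as pd

a = pd.DataFrame({"A": [[1, 2, 3]], "B": [[4, 5, 6]]})

# Multi-column explode pairs elements positionally (zip).
zipped = a.explode(["A", "B"])

# Chained single-column explodes pair every A with every B (product).
product = a.explode("A").explode("B")

print(len(zipped))   # 3
print(len(product))  # 9
```

The multi-column call yields the pairs (1, 4), (2, 5), (3, 6), while the chained calls yield all nine combinations.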