Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [[0, 1], [5], [], [2, 3]],
    "B": [9, 8, 7, 6],
    "C": [[1, 2], np.nan, [], [3, 4]],
})

df.explode(["A", "C"])

Issue Description

DataFrame.explode fails on the example above when exploding on multiple columns. Exploding columns "A" and "C" individually produces the correct output:

>>> df.explode("A")
     A  B       C
0    0  9  [1, 2]
0    1  9  [1, 2]
1    5  8     NaN
2  NaN  7      []
3    2  6  [3, 4]
3    3  6  [3, 4]

>>> df.explode("C")
        A  B    C
0  [0, 1]  9    1
0  [0, 1]  9    2
1     [5]  8  NaN
2      []  7  NaN
3  [2, 3]  6    3
3  [2, 3]  6    4

However, df.explode(["A", "C"]) raises the following error:

>>> df.explode(["A", "C"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/emaanhariri/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py", line 8254, in explode
    raise ValueError("columns must have matching element counts")
ValueError: columns must have matching element counts

The mylen function on line 8335 in the explode function, defined as mylen = lambda x: len(x) if is_list_like(x) else -1, causes the error above: [] and np.nan entries are computed as having different lengths (0 and -1) even though each ultimately occupies a single entry in the expansion.
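The mismatch can be reproduced outside of pandas internals. The sketch below copies the mylen helper quoted above and applies it to both columns; the variable names counts_a and counts_c are only illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

df = pd.DataFrame({
    "A": [[0, 1], [5], [], [2, 3]],
    "B": [9, 8, 7, 6],
    "C": [[1, 2], np.nan, [], [3, 4]],
})

# Copy of the helper quoted from frame.py (not imported from pandas).
mylen = lambda x: len(x) if is_list_like(x) else -1

counts_a = df["A"].apply(mylen).tolist()
counts_c = df["C"].apply(mylen).tolist()

print(counts_a)  # [2, 1, 0, 2]
print(counts_c)  # [2, -1, 0, 2]
```

For this particular frame the counts disagree in row 1, where [5] is counted as 1 and np.nan as -1, which is enough to trigger the ValueError.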

Expected Behavior

We expect the output to combine the outputs of df.explode("A") and df.explode("C"), as follows.

>>> df.explode(["A", "C"])
     A  B    C
0    0  9    1
0    1  9    2
1    5  8  NaN
2  NaN  7  NaN
3    2  6    3
3    3  6    4
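As a possible workaround on versions where the call raises, empty lists and NaN can be replaced with a one-element [np.nan] list before exploding, so that every row has matching element counts. This is only a sketch, not the proposed pandas fix, and the helper name as_nonempty_list is hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

df = pd.DataFrame({
    "A": [[0, 1], [5], [], [2, 3]],
    "B": [9, 8, 7, 6],
    "C": [[1, 2], np.nan, [], [3, 4]],
})

def as_nonempty_list(x):
    """Hypothetical helper: map [] and scalars such as NaN to [np.nan]."""
    if is_list_like(x) and len(x) > 0:
        return x
    return [np.nan]

normalized = df.copy()
for col in ["A", "C"]:
    normalized[col] = normalized[col].apply(as_nonempty_list)

# Every row now has the same element count in "A" and "C".
result = normalized.explode(["A", "C"])
print(result)
```

Note that this maps a scalar NaN and an empty list to the same placeholder, so the two can no longer be told apart afterwards.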

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python           : 3.8.12.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.3.0
Version          : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.3.4
numpy            : 1.20.3
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : 0.29.24
pytest           : 6.2.5
hypothesis       : None
sphinx           : 4.4.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.7.1
html5lib         : None
pymysql          : 1.0.2
psycopg2         : None
jinja2           : 3.0.2
IPython          : 7.29.0
pandas_datareader: None
bs4              : 4.10.0
bottleneck       : None
fsspec           : 2021.11.1
fastparquet      : None
gcsfs            : 2021.11.1
matplotlib       : 3.4.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 5.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : 1.4.27
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.55.0

Comment From: rhshadrach

Thanks for the report! Confirmed on main. It looks like using 1 instead of -1 may fix it, though I haven't checked thoroughly. Would you be interested in submitting a PR with a fix, @ehariri?

Comment From: ehariri

Hi @rhshadrach, sure I could submit a PR! I believe the fix would also need to change len(x) to max(1, len(x)) to account for empty lists as well?
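A rough sketch of the counting rule being proposed here, assuming NaN (or any non-list-like value) counts as 1 and empty lists are bumped to 1 via max(1, len(x)). The name proposed_len is hypothetical and this is not the actual pandas patch.

```python
import numpy as np
from pandas.api.types import is_list_like

def proposed_len(x):
    """Hypothetical replacement for mylen: every entry occupies at least one row."""
    return max(1, len(x)) if is_list_like(x) else 1

col_a = [[0, 1], [5], [], [2, 3]]
col_c = [[1, 2], np.nan, [], [3, 4]]

counts_a = [proposed_len(x) for x in col_a]
counts_c = [proposed_len(x) for x in col_c]

print(counts_a)  # [2, 1, 1, 2]
print(counts_c)  # [2, 1, 1, 2]
```

With this rule the per-row counts of "A" and "C" agree, so the matching-element-counts check would pass for the example frame.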

Comment From: rhshadrach

@ehariri - that sounds right to me, but I haven't taken an in-depth look. A PR would be great!

Comment From: GYHHAHA

Personally, I think the current behavior is reasonable. Since the NaN in "C" actually means zero elements or a value of some type we don't know, why would we still want it to match [5]? What's the desired output when [5] is changed to [5, 5]? Should we also expand the NaN value, or raise a ValueError?

Comment From: rhshadrach

@GYHHAHA - shouldn't df.explode("A").explode("C") be the same as df.explode(["A", "C"])?

Comment From: GYHHAHA

For example, will the multi-column explode result be the same as a.explode("A").explode("B")? Does this mean that for any row where both entries are iterable, we will produce a cross-join-like explosion? If so, why not simply call the single-column explode iteratively for each element of the column list? @rhshadrach

>>> a = pd.DataFrame({"A":[[1,2,3]],"B":[[1,2]]})
>>> a
           A       B
0  [1, 2, 3]  [1, 2]
>>> a.explode("A")
   A       B
0  1  [1, 2]
0  2  [1, 2]
0  3  [1, 2]
>>> a.explode("A").explode("B")
   A  B
0  1  1
0  1  2
0  2  1
0  2  2
0  3  1
0  3  2
>>> a.explode(["A", "B"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\gyh\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py", line 8337, in explode
    raise ValueError("columns must have matching element counts")
ValueError: columns must have matching element counts

Comment From: ehariri

@GYHHAHA, I believe your point is valid with regard to whether df.explode(['a', 'b', 'c', ...]) should always be expected to match df.explode('a').explode('b').explode('c').... However, I believe that point is moot because it does not change the fact that NaN and [] should each have cardinality 1 when exploded on multiple columns.

Comment From: rhshadrach

Ah, thanks for the correction @GYHHAHA - I did not realize that .explode(['a', 'b']) will zip rather than take the product.

@ehariri - +1 on your expectation of cardinality 1.
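To make the zip-versus-product distinction concrete, here is a small sketch using a frame whose list lengths match, so both calls succeed. It assumes a pandas version that supports multi-column explode (1.3 or later); the frame and variable names are illustrative.

```python
import pandas as pd

a = pd.DataFrame({"A": [[1, 2]], "B": [[3, 4]]})

# Multi-column explode pairs elements positionally (zip).
zipped = a.explode(["A", "B"])

# Chained single-column explodes yield every combination (product).
chained = a.explode("A").explode("B")

print(zipped)
print(chained)
print(len(zipped), len(chained))  # 2 4
```

The zipped result pairs 1 with 3 and 2 with 4, while the chained result contains all four combinations.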