Is your feature request related to a problem?
Yes. Currently, the df.explode method always returns object dtype for the column being exploded. This loses the information about the dtype of the exploded column.
E.g.
import pandas as pd
s = pd.Series([1, 2, 3]) # <- dtype('int64')
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})
df.explode("A").dtypes
0 | |
---|---|
A | object |
B | int64 |
It would be great if pandas could return the underlying dtype if it was consistent across all rows. (Or return the best dtype (int -> float -> object).)
Describe the solution you'd like
- solution 1: The best-case scenario would be for pandas to infer the dtype directly if it is consistent (ignoring NaNs) across all rows.
s = pd.Series([1,None,3]) # <- dtype('float64')
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1}) # <- empty list is converted to NaN
df.explode("A").dtypes
0 | |
---|---|
A | float64 |
B | int64 |
- solution 2: Providing an argument to force inferring the dtype:
s = pd.Series([1,None,3]) # <- dtype('float64')
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1}) # <- empty list is converted to NaN
df.explode("A", infer_type=True).dtypes
0 | |
---|---|
A | float64 |
B | int64 |
Describe alternatives you've considered
Currently, I use the following workaround:
s = pd.Series([1,None,3]) # <- dtype('float64')
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1}) # <- empty list is converted to NaN
d = df.A[0].dtype
df2 = df.explode("A")
df2.A = df2.A.astype(d)
API breaking implications
Not sure.
Comment From: WillAyd
I think this is reasonable. Can probably add a maybe_infer_objects call on the result of the exploded column (you'll find this in pandas._libs.lib)
Comment From: erfannariman
Can you elaborate @WillAyd? I looked in lib, but there was nothing even starting with maybe_infer. Also, in the whole project I could only find:
- maybe_infer_freq
- maybe_infer_tz
- maybe_infer_dtype_type
Comment From: WillAyd
My mistake - I meant to say maybe_convert_objects, which is in pandas._libs.lib
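To illustrate what that function does to an object array, here is a minimal sketch. It assumes that `maybe_convert_objects` is importable from the private module `pandas._libs.lib` and accepts a one-dimensional object ndarray; since this is internal API, its signature and behavior may differ between pandas versions.

```python
import numpy as np
from pandas._libs import lib  # private module, not a stable public API

# An object array holding plain Python ints, similar to what explode yields.
values = np.array([1, 2, 3], dtype=object)

# Ask pandas to pick a more specific dtype for the object array.
converted = lib.maybe_convert_objects(values)

print(values.dtype, "->", converted.dtype)
```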
Comment From: jreback
ok this is not going to work as-is as it's a breaking change. would suggest that we add a downcast=None|'infer' option which folks can opt in to (and at some point we could deprecate and default).
Comment From: jbrockmendel
if the implementation is just going to be if infer: return result.infer_objects(copy=False), then we should just tell users to do that directly
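For reference, doing that directly looks roughly like the sketch below. It uses plain lists of ints rather than the Series elements from the original report, and it omits the `copy` keyword, which may not be accepted or needed in every pandas version.

```python
import pandas as pd

# Each cell of "A" holds a list of ints; "B" is a broadcast scalar.
df = pd.DataFrame({"A": [[1, 2, 3]] * 4, "B": 1})

# explode keeps the exploded column as object dtype.
exploded = df.explode("A")
print(exploded.dtypes["A"])  # object

# infer_objects re-infers a better dtype for object columns.
inferred = exploded.infer_objects()
print(inferred.dtypes["A"])  # int64
```

Note that `infer_objects` looks at every object column, not only the exploded one, which may or may not be what the caller wants.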
Comment From: daskol
> then we should just tell users to do that directly
I think the issue is that pandas does not preserve element type at the moment. From my perspective, explode should create a column with the common dtype of all elements of the lists by default, because there is no meaningful reason to change the element type from np.int64 to object. So, explode should act like [[a]] -> [a], but this requires that the column elements are actually of a common list type [a].
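A user-level sketch of that `[[a]] -> [a]` behavior is shown below. The helper name `explode_common_dtype` is hypothetical and not part of pandas; it assumes the common dtype can be computed with `np.result_type` and only casts when every element is integer or float typed, leaving the column as object otherwise.

```python
import numpy as np
import pandas as pd


def explode_common_dtype(df, column):
    """Hypothetical helper: explode `column`, then cast it to the
    common numeric dtype of its list-like elements, if there is one."""
    # dtype NumPy would assign to each element on its own.
    dtypes = [np.asarray(x).dtype for x in df[column]]
    out = df.explode(column)
    # Only cast when every element is int, unsigned int, or float.
    if dtypes and all(dt.kind in "iuf" for dt in dtypes):
        common = np.result_type(*dtypes)
        out = out.astype({column: common})
    return out


ints = pd.DataFrame({"A": [[1, 2], [3]], "B": 1})
mixed = pd.DataFrame({"A": [[1, 2], [1.5]], "B": 1})

print(explode_common_dtype(ints, "A").dtypes)
print(explode_common_dtype(mixed, "A").dtypes)
```

Mixed int and float lists promote to a float dtype, which matches the int -> float -> object ordering suggested in the original request; lists containing `None` or strings are left as object in this sketch.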