Code Sample, a copy-pastable example if possible
In [62]: df = pd.DataFrame([(float(x) for x in range(0, 10)), (float(x) for x in range(10,20))])
In [63]: df
Out[63]:
      0     1     2     3     4     5     6     7     8     9
0   0.0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10.0  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0
In [64]: df[0]
Out[64]:
0 0.0
1 10.0
Name: 0, dtype: float64
In [65]: df[0].astype(int)
Out[65]:
0 0
1 10
Name: 0, dtype: int64
In [66]: df[0] = df[0].astype(int)
In [67]: df
Out[67]:
    0     1     2     3     4     5     6     7     8     9
0   0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0
In [68]: df.iloc[0]
Out[68]:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: 0, dtype: float64
Problem description
After I reassign the 0th column as int, I expect it to stay int, and it appears to be that way. But when I do a .iloc on the dataframe, the values seem to turn back into floats somehow!
This is the narrowed-down version of a more insidious problem: instead of doing an iloc[], I was running a .apply(f) on a dataframe, and the dtypes of the resulting dataframe were all messed up even though the function f wasn't doing anything discernible with the types, so I narrowed it down to this.
My current workaround is to re-cast all the types inside f, but that gets frustrating quickly depending on the number of columns.
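A minimal sketch of that workaround pulled outside of f (the frame, the column names, and f here are made-up stand-ins):

import pandas as pd

df = pd.DataFrame({"a": [0, 10], "b": [1.0, 11.0]})   # one int column, one float column

def f(row):
    # stand-in for the real row-wise function; it doesn't touch the types
    return row

# apply(axis=1) hands each row over as a Series, so the int column gets upcast
# to float on the way through; re-casting with the original dtypes afterwards
# restores them without having to do it inside f
result = df.apply(f, axis=1).astype(df.dtypes.to_dict())
print(result.dtypes)   # "a" back to int64, "b" still float64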
Expected Output
I expect the row to come back as a mixed-dtype object Series, with the dtype of each cell matching that of its column:
In [68]: df.iloc[0]
Out[68]:
0 0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: 0, dtype: object
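One way I can get a row in roughly this shape today, at the cost of an all-object frame up front (a sketch with a throwaway frame):

import pandas as pd

df = pd.DataFrame([[0, 1.0], [10, 11.0]])    # column 0 is int64, column 1 is float64
row = df.astype(object).iloc[0]               # cast to object first so nothing gets upcast
print(row.dtype)                              # object
print(type(row[0]).__name__, type(row[1]).__name__)   # each cell keeps its own column's type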
Output of pd.show_versions()
Comment From: jreback
duplicate of https://github.com/pandas-dev/pandas/issues/12859
This is as expected: mixed int-float gets upcast to float. In particular, on a cross section you can generally get upcasting because you are cutting across mixed dtypes (you won't get upcast if it's a single dtype).
I don't think there is an easy way to get around this, though for perf reasons you never want this to be object.
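a quick illustration of the rule (throwaway frames):

import pandas as pd

mixed = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(mixed.iloc[0].dtype)    # float64 -- the int column is upcast to match the float column

uniform = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(uniform.iloc[0].dtype)  # int64 -- single dtype across columns, no upcasting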
Comment From: makmanalp
@jreback I see. More philosophically, if we had a string or datetime or other dtype column in there, wouldn't the dtype of the result of the .iloc[] necessarily have to be object? It seems like that's the more sensible thing to happen. Besides, the implicit upcasting behavior on .apply or .iloc or .to_dict is surprising and not easy to track down.
Or is this upcasting behavior considered normal because .iloc[] returns a Series, in which case we've flipped things sideways and switched from the multi-columnar (and thus multi-dtype) dataframe format to a single "column" (which has one dtype, unless we make it object)?
I hit the .to_dict() version almost immediately after :-) I'm trying to clean out float numbers that are polluting some JSON output, essentially, and these are two things I ran into back to back.
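For the .to_dict() case, the workaround I'm leaning towards looks like this sketch (column names made up), casting to object first so each cell keeps its own type:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.75]})

# building records through a row cross-section turns the ints into 1.0 / 2.0;
# casting to object first keeps the id cells integral in the JSON-bound dicts
records = df.astype(object).to_dict(orient="records")
print(records)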
Comment From: jreback
if you had any other mixed dtypes it would upcast appropriately (to object if needed)
this comes fundamentally from numpy
In [1]: np.array([1, 1.0])
Out[1]: array([ 1., 1.])
since holding mixed dtypes in a 1-d array is not generally supported, this is how it is.
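e.g. with a string column in the mix the cross section does fall back to object and the int cell keeps its value (throwaway frame):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df.iloc[0])
# a    1
# b    x
# Name: 0, dtype: object   <- no common numeric dtype, so the row is object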
Comment From: makmanalp
OK, thank you!