Code Sample, a copy-pastable example if possible
In [62]: df = pd.DataFrame([(float(x) for x in range(0, 10)), (float(x) for x in range(10,20))])
In [63]: df
Out[63]:
      0     1     2     3     4     5     6     7     8     9
0   0.0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10.0  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0
In [64]: df[0]
Out[64]:
0 0.0
1 10.0
Name: 0, dtype: float64
In [65]: df[0].astype(int)
Out[65]:
0 0
1 10
Name: 0, dtype: int64
In [66]: df[0] = df[0].astype(int)
In [67]: df
Out[67]:
    0     1     2     3     4     5     6     7     8     9
0   0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0
In [68]: df.iloc[0]
Out[68]:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: 0, dtype: float64
Problem description
After I reassign the 0th column as int, I expect it to stay int, and it appears to be that way. But when I do a .iloc on the dataframe, the values seem to turn back into floats somehow!
This is the narrowed-down version of a more insidious problem: instead of doing an iloc[], I was running a .apply(f) on a dataframe, and the dtypes of the resulting dataframe were all messed up even though the function f wasn't doing anything discernible with the types, so I narrowed it down to this.
My current workaround is to re-cast all the types inside f, but that gets frustrating quickly depending on the number of columns.
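A minimal sketch of that workaround pulled outside of f (the frame, the column names, and f here are made-up stand-ins):

import pandas as pd

df = pd.DataFrame({"a": [0, 10], "b": [1.0, 11.0]})   # one int column, one float column

def f(row):
    # stand-in for the real row-wise function; it doesn't touch the types
    return row

# apply(axis=1) hands each row over as a Series, so the int column gets upcast
# to float on the way through; re-casting with the original dtypes afterwards
# restores them without having to do it inside f
result = df.apply(f, axis=1).astype(df.dtypes.to_dict())
print(result.dtypes)   # "a" back to int64, "b" still float64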
Expected Output
I expect the row to come back as a mixed-dtype object Series, with the dtype of each cell matching that of its column:
In [68]: df.iloc[0]
Out[68]:
0 0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: 0, dtype: object
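One way I can get a row in roughly this shape today, at the cost of an all-object frame up front (a sketch with a throwaway frame):

import pandas as pd

df = pd.DataFrame([[0, 1.0], [10, 11.0]])    # column 0 is int64, column 1 is float64
row = df.astype(object).iloc[0]               # cast to object first so nothing gets upcast
print(row.dtype)                              # object
print(type(row[0]).__name__, type(row[1]).__name__)   # each cell keeps its own column's type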
Output of pd.show_versions()
Comment From: jreback
duplicate of https://github.com/pandas-dev/pandas/issues/12859
This is as expected: mixed int-float gets upcast to float. In particular, on a cross section you can generally get upcasting because you are cutting across mixed dtypes (you won't get upcast if it's a single dtype).
I don't think there is an easy way to get around this, though for perf reasons you never want this to be object.
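a quick illustration of the rule (throwaway frames):

import pandas as pd

mixed = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(mixed.iloc[0].dtype)    # float64 -- the int column is upcast to match the float column

uniform = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(uniform.iloc[0].dtype)  # int64 -- single dtype across columns, no upcasting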
Comment From: makmanalp
@jreback I see. More philosophically, if we had a string or datetime or other dtype column in there, wouldn't the dtype of the result of the .iloc[] necessarily have to be object? It seems like that's the more sensible thing to happen. Besides, the implicit upcasting behavior on .apply or .iloc or .to_dict is surprising and not easy to track down.
Or is this upcasting behavior considered normal because .iloc[] returns a Series, in which case we've flipped things sideways and switched from the multi-columnar (and thus multi-dtype) dataframe format to a single "column" (which has one dtype, unless we make it object)?
I hit the .to_dict() version almost immediately after :-) I'm trying to clean out float numbers that are polluting some JSON output, essentially, and these are two things I ran into back to back.
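For the .to_dict() case, the workaround I'm leaning towards looks like this sketch (column names made up), casting to object first so each cell keeps its own type:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.75]})

# building records through a row cross-section turns the ints into 1.0 / 2.0;
# casting to object first keeps the id cells integral in the JSON-bound dicts
records = df.astype(object).to_dict(orient="records")
print(records)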
Comment From: jreback
if you had any other mixed dtypes it would upcast appropriately (to object if needed)
this comes fundamentally from numpy
In [1]: np.array([1, 1.0])
Out[1]: array([ 1., 1.])
since holding mixed dtypes in a 1-d array is not generally supported, this is how it is.
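e.g. with a string column in the mix the cross section does fall back to object and the int cell keeps its value (throwaway frame):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df.iloc[0])
# a    1
# b    x
# Name: 0, dtype: object   <- no common numeric dtype, so the row is object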
Comment From: makmanalp
OK, thank you!