Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from typing import Union

import numpy as np
import pandas as pd

def get_sample_df():
    return pd.DataFrame(
        [
            ["1", np.nan],
            ["2", np.nan],
            ["3", np.nan],
            ["4", np.nan]
        ],
        dtype="object" # Reproduce the dataframe I'm working with.
    )

def try_cast_int(s: str) -> Union[int, str]:
    # some complex preprocessing here
    # I removed those to make my code shorter
    try:
        return int(s)
    except ValueError:
        return s

df = get_sample_df()

# Due to business reasons, apply `try_cast_int` only to the top half of `df`.
df.iloc[:2, :] = df.iloc[:2, :].applymap(try_cast_int)

Issue Description

Hello pandas team!

I found what looks like a bug today. I expected that `df` after the last line of the "Reproducible Example" would be a dataframe like the one below.

| | 0 | 1 |
|---:|----:|----:|
| 0 | 1 | np.NaN |
| 1 | 2 | np.NaN |
| 2 | 3 | np.NaN |
| 3 | 4 | np.NaN |

But what I got is shown below. I have no idea why I got floats such as 1.0 and 2.0 instead of ints.

| | 0 | 1 |
|---:|----:|----:|
| 0 | 1.0 | np.NaN |
| 1 | 2.0 | np.NaN |
| 2 | 3 | np.NaN |
| 3 | 4 | np.NaN |

One interesting fact is that `df.iloc[:2, :].applymap(try_cast_int)`, before the reassignment, returns a dataframe like the one below.

| | 0 | 1 |
|---:|----:|----:|
| 0 | 1 | np.NaN |
| 1 | 2 | np.NaN |

It seems that the integers in the first column are converted into float values during the partial reassignment.

My questions are:
- Why does the type conversion happen?
- How can we avoid this behavior? (Is there a context manager or something for that?)
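To narrow down where the conversion happens, it can help to check the dtypes of the mapped frame before it is assigned back. The sketch below is a minimal reproduction that rebuilds the sample frame and uses `apply` with `Series.map` (equivalent to `applymap` here, and it also runs on newer pandas where `applymap` is deprecated); it is an illustration, not a diagnosis of the internals.

```python
import numpy as np
import pandas as pd

def try_cast_int(s):
    try:
        return int(s)
    except (ValueError, TypeError):
        return s

df = pd.DataFrame(
    [["1", np.nan], ["2", np.nan], ["3", np.nan], ["4", np.nan]],
    dtype="object",
)

# Map over the top half only, column by column.
converted = df.iloc[:2, :].apply(lambda col: col.map(try_cast_int))

print(converted.dtypes)
# Column 0 already has an integer dtype here, *before* the assignment,
# so any floats in the final frame are introduced by `df.iloc[:2, :] = ...`.
```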

Expected Behavior

The `df` after the last line in the "Reproducible Example" will be a dataframe like the one below.

| | 0 | 1 |
|---:|----:|----:|
| 0 | 1 | np.NaN |
| 1 | 2 | np.NaN |
| 2 | 3 | np.NaN |
| 3 | 4 | np.NaN |

I guess that applying `try_cast_int` to only the top half of `df` is the cause of this issue. Pandas is not designed for this kind of task, right?


I found that I can avoid this behavior and get what I want with the following steps:
1. Split the `df` into two before applying `try_cast_int`.
2. Apply `try_cast_int` to the top half of the `df`.
3. Leave the bottom half of the `df` unprocessed.
4. Concatenate the two halves back into one.

example code

df = get_sample_df()

df_top = df.iloc[:2, :]
df_top = df_top.applymap(try_cast_int)

df_bottom = df.iloc[2:, :]

df_full = pd.concat([df_top, df_bottom], axis=0)
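As a sanity check, the concatenated frame can be verified to keep integer values in the top half and the original strings in the bottom half. This sketch inlines `get_sample_df` and `try_cast_int` from above, substituting `apply` with `Series.map` for `applymap` so it also runs on newer pandas.

```python
import numpy as np
import pandas as pd

def get_sample_df():
    return pd.DataFrame(
        [["1", np.nan], ["2", np.nan], ["3", np.nan], ["4", np.nan]],
        dtype="object",
    )

def try_cast_int(s):
    try:
        return int(s)
    except (ValueError, TypeError):
        return s

df = get_sample_df()

# Split, process only the top half, then concatenate.
df_top = df.iloc[:2, :].apply(lambda col: col.map(try_cast_int))
df_bottom = df.iloc[2:, :]
df_full = pd.concat([df_top, df_bottom], axis=0)

# The top half keeps integer values; the bottom half keeps the raw strings.
print(df_full[0].tolist())
```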

Best of luck.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.133+
Version : #1 SMP Fri Aug 26 08:44:51 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.21.6
pytz : 2022.6
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.32
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.9.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.9.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : None
fsspec : 2022.11.0
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.45
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.12.0
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.56.4

Comment From: phofl

Hi, thanks for your report. This looks buggy, investigations are welcome.

Edit: Any reason you are not using astype here? Just asking, this should be much faster.
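For reference, when every value in a column is known to be a clean integer string, the vectorized cast mentioned above looks like this. A minimal sketch on the sample frame, casting only the first column since the second is all-NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["1", np.nan], ["2", np.nan], ["3", np.nan], ["4", np.nan]],
    dtype="object",
)

# Vectorized cast: no per-element Python function call, so it is much
# faster than applymap on large frames -- but unlike try_cast_int, it
# raises on any value that cannot be parsed as an integer.
df[0] = df[0].astype(int)
print(df[0].dtype)
```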

Comment From: petrov826

@phofl Thank you for your response!

Any reason you are not using astype here? Just asking, this should be much faster.

Do you mean my custom function `try_cast_int`? It is similar to `.astype(int)` but does some extra work: it removes the `,` character to cast a string like `3,000` to the int `3000`, and more. We could achieve the same thing by preprocessing the strings with `.str.replace(",", "")` and then calling `.astype(int)`, though.
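A sketch of that kind of best-effort parse, with the comma handling folded into the function (illustrative only; the real preprocessing described in this report is more involved):

```python
import pandas as pd

def try_cast_int(s):
    """Best-effort int parse: strip thousands separators, fall back to the input."""
    try:
        return int(str(s).replace(",", ""))
    except (ValueError, TypeError):
        return s

s = pd.Series(["3,000", "abc", "42"], dtype="object")
out = s.map(try_cast_int)
print(out.tolist())  # parsed ints where possible, original strings otherwise
```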


Edit: I want to do my best to parse integer strings. If the conversion fails, the value should be left as it is. This is why I don't preprocess in place.

I recognize that my `try_cast_int` clearly violates the Single Responsibility Principle…

Comment From: phofl

Ah, got you. It seemed a bit overcomplicated here, but it makes sense if you do more preprocessing.