Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
from typing import Union

import numpy as np
import pandas as pd


def get_sample_df():
    return pd.DataFrame(
        [
            ["1", np.nan],
            ["2", np.nan],
            ["3", np.nan],
            ["4", np.nan],
        ],
        dtype="object",  # Reproduce the dataframe I'm working with.
    )


def try_cast_int(s: str) -> Union[int, str]:
    # some complex preprocessing here
    # I removed those to make my code shorter
    try:
        return int(s)
    except ValueError:
        return s


df = get_sample_df()

# Due to business reasons, apply `try_cast_int` only to the top half of `df`.
df.iloc[:2, :] = df.iloc[:2, :].applymap(try_cast_int)
```
Issue Description
Hello pandas team!
I found a bug(?) today. I expected the `df` after the last line of the "Reproducible Example" to be a dataframe like below.
|    |   0 |   1 |
|---:|----:|----:|
|  0 |   1 | np.NaN |
|  1 |   2 | np.NaN |
|  2 |   3 | np.NaN |
|  3 |   4 | np.NaN |
But what I actually got is below. I have no idea why I got floats like `1.0` and `2.0` instead of ints.
| | 0 | 1 |
|---:|----:|----:|
| 0 | 1.0 | np.NaN |
| 1 | 2.0 | np.NaN |
| 2 | 3 | np.NaN |
| 3 | 4 | np.NaN |
One interesting fact: `df.iloc[:2, :].applymap(try_cast_int)` on its own, before reassigning, returns a dataframe like below.
| | 0 | 1 |
|---:|----:|----:|
| 0 | 1 | np.NaN |
| 1 | 2 | np.NaN |
It seems that the integers in the first column are converted into floats during the partial reassignment.
My questions are:
- Why does this type conversion happen?
- How can we avoid this behavior? (Is there a context manager or something for that?)
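To make the coercion easier to inspect, here is a minimal diagnostic sketch of the same setup. It uses `Series.map` inside `apply` as a version-portable stand-in for `applymap` (which was renamed to `DataFrame.map` in pandas 2.1); the variable names are made up for illustration, and whether the reassigned cells come back as ints or floats depends on the pandas version, which is exactly what this report is about.

```python
import numpy as np
import pandas as pd


def try_cast_int(s):
    try:
        return int(s)
    except ValueError:
        return s


df = pd.DataFrame(
    [["1", np.nan], ["2", np.nan], ["3", np.nan], ["4", np.nan]],
    dtype="object",
)

# Apply try_cast_int elementwise; Series.map behaves like applymap here.
intermediate = df.iloc[:2, :].apply(lambda col: col.map(try_cast_int))

# Before the reassignment, the parsed values are integers, not floats.
print(type(intermediate.iloc[0, 0]))

# After the partial reassignment, inspect what actually landed in df.
df.iloc[:2, :] = intermediate
print(df.iloc[:2, 0].map(type).tolist())
```

Printing the element types directly, rather than just `df.dtypes` (which stays `object`), is what reveals whether the stored values were silently cast to float.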
Expected Behavior
The df after the last line in the "Reproducible Example" will be a dataframe like below.
| | 0 | 1 |
|---:|----:|----:|
| 0 | 1 | np.NaN |
| 1 | 2 | np.NaN |
| 2 | 3 | np.NaN |
| 3 | 4 | np.NaN |
I guess that "applying try_cast_int only to the top half of df" is the cause of this issue. Pandas is not designed for this kind of task, right?
I found that I can avoid this behavior and get what I want with the following steps:
1. split the df into 2 before applying try_cast_int
2. apply try_cast_int to the top half of the df
3. do no processing on the bottom half of the df
4. concat them back into 1
Example code:

```python
df = get_sample_df()
df_top = df.iloc[:2, :]
df_top = df_top.applymap(try_cast_int)
df_bottom = df.iloc[2:, :]
df_full = pd.concat([df_top, df_bottom], axis=0)
```
Best of luck.
Installed Versions
Comment From: phofl
Hi, thanks for your report. This looks buggy, investigations are welcome.
Edit: Any reason you are not using `astype` here? Just asking, this should be much faster.
Comment From: petrov826
@phofl Thank you for your response!
> Any reason you are not using `astype` here? Just asking, this should be much faster.
Do you mean my custom function `try_cast_int`? It is similar to `.astype(int)` but does some extra work: it removes the `,` character to cast a string like `"3,000"` to the int `3000`, and more. We could achieve that by preprocessing the strings with `.str.replace(",", "")` and then calling `.astype(int)`, though.
Edit: I want to do my best to parse integer strings. If the conversion fails, the value is supposed to be left as it is. This is why I don't do in-place preprocessing.
I recognize that my `try_cast_int` clearly violates the Single Responsibility Principle…
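For reference, the replace-then-parse approach described above can be sketched in vectorized form with `pd.to_numeric(errors="coerce")`, keeping the original value wherever parsing fails, which mirrors `try_cast_int`'s "leave it as it is" fallback. The sample data here is made up for illustration, and the `astype("int64")` step assumes all parseable values are whole numbers.

```python
import pandas as pd

s = pd.Series(["3,000", "42", "n/a"], dtype="object")

# Strip thousands separators, then try to parse everything at once;
# entries that fail to parse become NaN instead of raising.
cleaned = s.str.replace(",", "", regex=False)
numbers = pd.to_numeric(cleaned, errors="coerce")

# Keep the original value wherever parsing failed.
result = s.copy()
mask = numbers.notna()
result[mask] = numbers[mask].astype("int64")

print(result.tolist())
```

This keeps the column as object dtype with a mix of ints and untouched strings, the same shape of result the custom function produces, but without a Python-level loop per element.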
Comment From: phofl
Ah, got you. It seemed a bit overcomplicated here, but it makes sense if you do more preprocessing.