Code Sample, a copy-pastable example if possible
import pandas as pd

df = pd.DataFrame({'a': [1], 'val': [1.35]})
# no groupby, result is 'object', correct behavior
print(df['val'].transform(lambda x: x.map('+{}'.format)).dtype)
# result can't be converted to float, result is 'object', correct behavior
print(df.groupby('a')['val'].transform(lambda x: x.map('(+{})'.format)).dtype)
# result is 'float64' and plus sign is lost, INCORRECT behavior (should be 'object')
print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)
# convert column to object type before transform:
df['val'] = df['val'].astype(object)
# same code as before, but now result is 'object' and plus sign is shown, correct behavior
print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)
Problem description
It is sometimes useful to use DataFrame.groupby()[col].transform() to create a string column based on a float64 column. In the case leading up to this issue report, I wanted to create a string column showing the actual value for the group min and, e.g., "+1.2" for the difference of the other rows from the min. However, Pandas unexpectedly converts this column back to float64 values, dropping the '+' sign for the non-min rows. This only happens with groupby, not when the transformation is applied to the whole column. It also only happens if the original column is a float type and the column of strings can be successfully converted back to floats.
I don't think users expect string columns to be converted back to the type of the original column (float in this case), even if that is possible. Pandas also doesn't seem to promise this behavior, since it only occurs with groupby and not when applying transform to the whole column.
It is possible to work around this by converting the original column to object type before applying the groupby and transform, but I don't think that should be necessary.
I would recommend dropping this conversion and creating a new series with the same type returned by the transform function.
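For concreteness, here is a minimal sketch of the kind of use case described above: the group minimum is shown as-is and every other row is shown as a signed difference from it. The helper name label_against_group_min, the column name label, and the sample values are made up for illustration and are not part of the original report.

```python
import pandas as pd


def label_against_group_min(values: pd.Series) -> pd.Series:
    """Hypothetical helper: keep the group minimum as text, show other rows as '+diff'."""
    group_min = values.min()
    return values.map(
        lambda v: f"{v}" if v == group_min else f"+{round(v - group_min, 2)}"
    )


# Illustrative data only: two rows in group 1, one row in group 2.
df = pd.DataFrame({"a": [1, 1, 2], "val": [1.5, 2.0, 4.0]})

# Every label here could be parsed as a float, which is exactly the
# situation in which the cast back to float64 was reported.
df["label"] = df.groupby("a")["val"].transform(label_against_group_min)
print(df["label"].tolist())
```

If the result is cast back to float64, the '+' prefix on the non-min row disappears, so the labels no longer distinguish a minimum from a difference.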
Expected Output
object
object
object
object
Output of pd.show_versions()
Comment From: mroeschke
Looks like we get the expected dtypes on master. Could use a test.
In [3]: df = pd.DataFrame({'a': [1], 'val': [1.35]})
...: # no groupby, result is 'object', correct behavior
...: print(df['val'].transform(lambda x: x.map('+{}'.format)).dtype)
...: # result can't be converted to float, result is 'object', correct behavior
...: print(df.groupby('a')['val'].transform(lambda x: x.map('(+{})'.format)).dtype)
...: # result is 'float64' and plus sign is lost, INCORRECT behavior (should be 'object')
...: print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)
...: # convert column to object type before transform:
...: df['val']=df['val'].astype(object)
...: # same code as before, but now result is 'object' and plus sign is shown, correct behavior
...: print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)
object
object
object
object
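A regression test for this could look roughly like the sketch below. It reuses the reproducer from the report and checks that the '+' sign survives. The test name is hypothetical, and the actual pandas test suite may prefer its own helpers and file layout, so treat this only as a starting point.

```python
import pandas as pd


def test_groupby_transform_keeps_string_result():
    # Hypothetical test name; data taken from the original reproducer.
    df = pd.DataFrame({"a": [1], "val": [1.35]})

    result = df.groupby("a")["val"].transform(lambda x: x.map("+{}".format))

    # The plus sign is only preserved if the strings were not cast back to float.
    assert result.tolist() == ["+1.35"]
    assert not pd.api.types.is_float_dtype(result)


# Call the test directly so the sketch also runs without pytest.
test_groupby_transform_keeps_string_result()
```

The check uses is_float_dtype rather than comparing against object, on the assumption that the exact string dtype may differ between pandas versions.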
Comment From: xmu23
Hi, I am a newcomer and want to try writing a test for this good first issue. Can anyone tell me how I should figure out which file to put this test method in?
Comment From: jackgoldsmith4
Hey @xmu23, I'm also a newcomer and have been grabbing some of these issues. If you're still interested, this page and this page are really useful for writing tests and figuring out where to put them.
Comment From: andrewchen1216
Does this issue still require a test? If so, can I take it?