I'm having trouble figuring out how DataFrame.replace() is supposed to work. I'm not sure if this is a bug or a documentation issue.
In [1]: import pandas
In [2]: df = pandas.DataFrame({"col1":range(5), "col2":[0.5]*3+[1.0]*2})
In [3]: df
Out[3]:
col1 col2
0 0 0.5
1 1 0.5
2 2 0.5
3 3 1.0
4 4 1.0
In [4]: df.replace(1.0, "a")
Out[4]:
col1 col2
0 0 0.5
1 a 0.5
2 2 0.5
3 3 a
4 4 a
In [5]: df.replace(1.0, "a").replace(0.5, "b")
Out[5]:
col1 col2
0 0 b
1 a b
2 2 b
3 3 a
4 4 a
So far, so good, everything makes sense. But I would have expected this to accomplish the same as above:
In [6]: df.replace({1.0:"a", 0.5:"b"})
Out[6]:
col1 col2
0 b b
1 a a
2 2 b
3 3 a
4 4 b
As you can see, I'm getting alternating "b" and "a". From a quick browse of the source code, it seems that the dictionary-replacement option should result in the same outcome as the following (which gives what I would have expected):
In [15]: df.replace([1.0, 0.5], ["a", "b"])
Out[15]:
col1 col2
a 0 b
b a b
c 2 b
d 3 a
e 4 a
I'm not sure what the to_replace=dict option is supposed to be doing but (at least for pandas v 0.12.0) it isn't doing what I would have expected.
Whether this is a bug or not, the df.replace() method needs better documentation. It's not enough to include a disclaimer that "This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works."
Comment From: cpcloud
What version of pandas are you using? I believe this is fixed in master.
On Monday, November 18, 2013, Noah wrote:
I'm having trouble figuring out how DataFrame.replace() is supposed to work. I'm not sure if this is a bug or a documentation issue.
In [1]: import pandas
In [2]: df = pandas.DataFrame({"col1":range(5), "col2":[0.5]_3+[1.0]_2})
In [3]: df Out[3]: col1 col2 0 0 0.5 1 1 0.5 2 2 0.5 3 3 1.0 4 4 1.0
In [4]: df.replace(1.0, "a") Out[4]: col1 col2 0 0 0.5 1 a 0.5 2 2 0.5 3 3 a 4 4 a
In [5]: df.replace(1.0, "a").replace(0.5, "b") Out[5]: col1 col2 0 0 b 1 a b 2 2 b 3 3 a 4 4 a
So far, so good, everything makes sense. But I would have expected this to accomplish the same as above:
In [6]: df.replace({1.0:"a", 0.5:"b"}) Out[6]: col1 col2 0 b b 1 a a 2 2 b 3 3 a 4 4 b
As you can see, I'm getting alternating "b" and "a". From a quick browse of the source code, it seems that the dictionary-replacement option should result in the same outcome as the following (which gives what I would have expected):
In [15]: df.replace([1.0, 0.5], ["a", "b"]) Out[15]: col1 col2 a 0 b b a b c 2 b d 3 a e 4 a
I'm not sure what the to_replace=dict option is supposed to be doing but (at least for pandas v 0.12.0) it isn't doing what I would have expected.
Whether this is a bug or not, the df.replace() method needs better documentation. It's not enough to include a disclaimer that "This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works."
— Reply to this email directly or view it on GitHubhttps://github.com/pydata/pandas/issues/5541 .
Best, Phillip Cloud
Comment From: nspies
You are correct. Let me amend the above then to regard simply the documentation of what is clearly a really complicated replace interface. There should at least be examples of the different types of syntax (I couldn't find any anywhere in the docs).
Comment From: jankatins
See also #5338
Comment From: jreback
@cpcloud ok...not sure what is to be done here
@nspies would you care to provide a doc update as a PR seeing that their is some confusion
Comment From: cpcloud
Yes ... the replace
API is quite crowded. It's extremely flexible, but that flexibility seems to cause confusion among users (me too) and it's (sometimes) a PITA to debug (though debugging could be made easier by an internal refactor with no API changes).
@jreback maybe we should refactor this API a bit. I think replace
tries to do too much. I think it might actually explode very soon. Maybe something like
- replace
: scalar replacement
- replace_list
: replaces anything where to_replace
or value
is a list
- replace_mapping
: likewise for dict
s/Series
- replace_regex
: replaces only regexes, alternatively allow a regex
argument in the others, similar to how it works now.
While this does add more names to the namespace, it's less of a strain on the working memory since you can look at some code with this API and say "oh, this is replacement using a [dict, list, etc]" rather than tracing all the way back up the stack to discover that the types weren't what you thought.
Comment From: cpcloud
This is a breaking API change, so it may have to wait until 1.0 or whatever the first breaking API release is.
Comment From: jreback
@cpcloud nice suggestions
maybe could do instead use a an object for replace
e.g.
- df.replace(...)
- df.replace.list(...)
- df.replace.mapping(...)
- df.replace.regex(...)
in theory this would also allow to dispatch in a backwards compat way, e.g. the __call__
of the replace object could just be current replace with dispatching (and eventually deprecate it to only scalars) (you can do this compat dispatching whether you do it with an object based replace or not I think)
Comment From: cpcloud
Nice. I like the object based idea. Gives time for users to adapt.
Comment From: nspies
Yes, I approve of some refactoring here! The current method is too powerful for its own good (although the documentation has improved substantially since I filed this ticket).
I haven't done any contributions to the project as of yet, but I'm happy to do at least the documentation parts if someone refactors the code into bits that I can understand.
On Apr 6, 2014, at 8:33 AM, Phillip Cloud notifications@github.com wrote:
Nice. I like the object based idea. Gives time for users to adapt.
— Reply to this email directly or view it on GitHub.
Comment From: mroeschke
This looks to work correctly on main now and I believe we have tests for this so closing