I was caught by surprise the other day while doing some vector subtraction on the output of pd.get_dummies. The issue is that the default dtype is np.uint8, which means that subtracting a 1 from a 0 results in 255.
I tweeted about this surprise (with an example of this issue) and the overwhelming response was that this felt like a pretty big surprise and, in most cases, undesired default behavior. Bill Tubbs then made a mention of this in another issue where it was recommended that a new issue be created for this.
My guess is that defaulting to np.uint8 is meant to reduce memory demands on what are potentially large, very sparse matrices. While I'm sure there are use cases that demand this, it seems like the risk of defaulting to np.uint8 outweighs the benefits compared with just choosing a signed representation.
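To put rough numbers on the memory argument, here is a minimal sketch comparing per-value storage for a few candidate dtypes. The matrix shape and the helper name `dense_size_mib` are made up for illustration; the point is only that a signed 8-bit integer costs the same as an unsigned one.

```python
import numpy as np

# Hypothetical shape of a dense dummy matrix, chosen only for illustration.
N_ROWS, N_COLS = 1_000_000, 50


def dense_size_mib(dtype):
    """Size in MiB of a dense N_ROWS x N_COLS array of the given dtype."""
    return N_ROWS * N_COLS * np.dtype(dtype).itemsize / 2**20


for dtype in (np.uint8, np.int8, np.bool_, np.int64, np.float64):
    name = np.dtype(dtype).name
    print(f"{name:>8}: {np.dtype(dtype).itemsize} byte(s) per value, "
          f"{dense_size_mib(dtype):,.1f} MiB dense")
```

If memory is the concern, int8 and bool would keep the one-byte footprint of uint8, while int64 and float64 would take eight times as much.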
Comment From: phofl
Please provide the output of pd.show_versions and provide something reproducible.
we have a bug template for bug reports or an enhancement template if you think that this is more of an enhancement.
Comment From: willkurt
To be clear, this is definitely an enhancement and not a bug since this is the documented behavior of get_dummies.
The issue is that defaulting to np.uint8 is not what most people expect to be the default behavior, leads to unexpected results in pretty common use cases (subtracting vectors), and is considered to be a pretty severe 'gotcha'.
I've included the requested information as well as an example below. If you need more info I'm happy to use a template, just provide a pointer where I could find one.
Here's the output of pd.show_versions:
INSTALLED VERSIONS
------------------
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.10.0.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Thu Sep 16 20:58:47 PDT 2021; root:xnu-6153.141.40.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Here's an example of the issue:
import pandas as pd

vec1 = pd.get_dummies(["okay", "gotcha", "okay"])
vec2 = pd.get_dummies(["gotcha", "okay", "okay"])
diff_default = vec1 - vec2
diff_default
returns
   gotcha  okay
0     255     1
1       1   255
2       0     0
The intended behavior can be achieved by specifying any signed dtype as you can see here:
import numpy as np
import pandas as pd

vec1 = pd.get_dummies(["okay", "gotcha", "okay"],
                      dtype=np.float32)
vec2 = pd.get_dummies(["gotcha", "okay", "okay"],
                      dtype=np.float32)
diff_correct = vec1 - vec2
diff_correct
returns the generally expected result:
   gotcha  okay
0    -1.0   1.0
1     1.0  -1.0
2     0.0   0.0
My (and many other people's) issue with this is that the default should not lead to unexpected results. It doesn't need to be a np.float32, but it should be a signed dtype.
Comment From: phofl
Thx, we would probably need a deprecation cycle if we would change that
Comment From: willkurt
Makes sense to me. I appreciate you looking into this, thanks!
Comment From: MarcoGorelli
From git log it looks like it's been like this for years, but I can't tell why uint8 was chosen over int8.
I'd be in favour of deprecating the current default in favour of int8
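For reference, a small sketch of what the example from the original report would give with int8. The dtype is passed explicitly here, so this does not depend on whatever the installed pandas version uses as its default.

```python
import numpy as np
import pandas as pd

# Same inputs as in the original report, but with an explicit signed 8-bit dtype.
vec1 = pd.get_dummies(["okay", "gotcha", "okay"], dtype=np.int8)
vec2 = pd.get_dummies(["gotcha", "okay", "okay"], dtype=np.int8)

diff_int8 = vec1 - vec2
print(diff_int8)         # differences are -1, 0 or 1 rather than wrapping to 255
print(diff_int8.dtypes)  # both columns stay int8
```

The values match the float32 result from the report, at the same one byte per value as uint8.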
Comment From: dhvalden
I support this change. uint8 can lead to hard to track errors
Comment From: MarcoGorelli
@pandas-dev/pandas-core anyone got any objections to changing the default type? Any reason to not just make the default type bool? Then, the return type would be clear, and if people need to do arithmetic operations on the dummy values, they can do their own dtype conversion. But at least they wouldn't run into unexpected behaviour like this
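A rough sketch of what "do their own dtype conversion" could look like with bool dummies. Here dtype=bool is passed explicitly to stand in for the proposed default, and int8 is just one possible choice of target dtype.

```python
import numpy as np
import pandas as pd

# dtype=bool is passed explicitly to emulate the proposed default.
dummies1 = pd.get_dummies(["okay", "gotcha", "okay"], dtype=bool)
dummies2 = pd.get_dummies(["gotcha", "okay", "okay"], dtype=bool)

# Convert to a signed integer dtype before doing arithmetic.
diff = dummies1.astype(np.int8) - dummies2.astype(np.int8)
print(diff)
```

The conversion step makes the choice of numeric dtype explicit at the point where the arithmetic happens.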
General suggestion for others who would like this changed: please use reactions to express support, no need to add comments just indicating that you'd also like to see this
Comment From: jreback
bool is a problem as doesn't play nice with missing values
could certainly return Boolean
this would be a breaking change and so needs to wait for 2.0 (i think there is a tracking issue)
Comment From: MarcoGorelli
"bool is a problem as doesn't play nice with missing values"
Sure, but why would get_dummies return a missing value anyway? Unless I'm missing something, the return values would always be 0 or 1
Comment From: bashtage
IMO you either go with int64 or just make them float, if you want to move away from the idea of using the smallest int dtype that can represent the encoded categorical variable. float is probably the most sensible (between int64 and float) since it has the same storage requirements and handles nan fine.
Comment From: bashtage
"The intended behavior can be achieved by specifying any signed dtype as you can see here:"
This isn't quite right. Any int dtype can wrap under the right conditions. It wouldn't happen subtracting 2 dummies, but you cannot know that there isn't some edge case out there.
I agree that uint is particularly problematic here since np.iinfo(dt).max is always 1 to the left of 0.
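A short NumPy-only sketch of both points: any fixed-width integer dtype wraps at its limits, but for unsigned types the wrap sits right next to 0. The array names are made up for illustration.

```python
import numpy as np

# Unsigned: 0 - 1 wraps around to the maximum value of the dtype.
u = np.array([0, 1], dtype=np.uint8)
ones_u = np.array([1, 1], dtype=np.uint8)
uint8_result = u - ones_u
print(uint8_result, np.iinfo(np.uint8).max)

# Signed: values around 0 are fine, but the dtype still wraps at its own limit.
top = np.array([127], dtype=np.int8)
one_i = np.array([1], dtype=np.int8)
int8_result = top + one_i
print(int8_result, np.iinfo(np.int8).min)
```

Subtracting two 0/1 dummies never gets near the int8 limits, which is why a signed dtype avoids this particular surprise without ruling out wraparound in general.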
Comment From: bashtage
Another funny behavior of get_dummies
In [26]: pd.get_dummies(c,dummy_na=True)
Out[26]:
   a  b  NaN
0  1  0    0
1  0  1    0
2  1  0    0
3  0  0    1

In [25]: ~pd.get_dummies(c,dummy_na=True)
Out[25]:
     a    b  NaN
0  254  255  255
1  255  254  255
2  254  255  255
3  255  255  254
The suggestion to use bool would avoid this issue.
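A sketch that reproduces the session above side by side with bool. The definition of `c` is not shown in the session, so the Series below is an assumption inferred from the printed output, and both dtypes are passed explicitly.

```python
import numpy as np
import pandas as pd

# Assumed input: inferred from the output above, not taken from the session itself.
c = pd.Series(["a", "b", "a", np.nan])

as_uint8 = pd.get_dummies(c, dummy_na=True, dtype=np.uint8)
as_bool = pd.get_dummies(c, dummy_na=True, dtype=bool)

inverted_uint8 = ~as_uint8  # bitwise NOT on uint8: 1 -> 254, 0 -> 255
inverted_bool = ~as_bool    # logical NOT on bool: True <-> False

print(inverted_uint8)
print(inverted_bool)
```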
Comment From: MarcoGorelli
Thanks @bashtage
"float is probably the most sensible (between int64 and float) since it has the same storage requirements and handles nan fine."
Regarding handling nan - is there an example of a case when get_dummies returns nan? If not, then bool should be fine, right?
Comment From: MarcoGorelli
@willkurt do you want to open a PR for this? This'd involve:
- in pandas/tests/reshape/test_get_dummies.py, for tests which don't already specify a dtype, pass np.dtype(np.uint8)
- add a test which doesn't specify a dtype, and assert that a FutureWarning with a message like "the default dtype will change from 'uint8' to 'bool', please specify a dtype to silence this warning" is raised
- in pandas/core/reshape/encoding.py, in _get_dummies_1d, add a FutureWarning with a message like the above if dtype wasn't passed by the user
Not strictly necessary, but I think dtype=None could also be changed to dtype=lib.no_default
Sounds like there's agreement on changing the default type away from uint8, we can always revisit the message about what it'll be changed to in the PR review
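The deprecation steps above could be sketched roughly as follows. This is not the actual pandas implementation: `get_dummies_with_warning` is a hypothetical wrapper, and `_NO_DEFAULT` is a stand-in for the internal lib.no_default sentinel, used so that "dtype not passed" can be told apart from an explicit value.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical stand-in for pandas' internal lib.no_default sentinel.
_NO_DEFAULT = object()


def get_dummies_with_warning(data, dtype=_NO_DEFAULT):
    """Hypothetical wrapper that warns when the caller relies on the default dtype."""
    if dtype is _NO_DEFAULT:
        warnings.warn(
            "the default dtype will change from 'uint8' to 'bool', "
            "please specify a dtype to silence this warning",
            FutureWarning,
            stacklevel=2,
        )
        dtype = np.uint8  # keep the old behaviour during the deprecation period
    return pd.get_dummies(data, dtype=dtype)


# Relying on the default triggers the warning; passing a dtype does not.
with warnings.catch_warnings(record=True) as caught_default:
    warnings.simplefilter("always")
    default_result = get_dummies_with_warning(["okay", "gotcha", "okay"])

with warnings.catch_warnings(record=True) as caught_explicit:
    warnings.simplefilter("always")
    explicit_result = get_dummies_with_warning(["okay", "gotcha", "okay"], dtype=bool)
```

In the test suite, the same check would more likely be written with the project's own warning-assertion helpers than with warnings.catch_warnings.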
If anyone wants to work on this, here's the contributing guide, and feel free to ask for help if anything's unclear
Comment From: WillAyd
Probably in the minority but I think uint8 is a natural return type. bool would also be ok
Int64 and float are pretty heavy handed - I think memory usage is really important here.
Comment From: Dev-Khant
Hi everyone, I am starting in Open Source and willing to contribute. Can I work on this issue and can anyone help me get started?
Comment From: MarcoGorelli
Hey - thanks, but there's already a PR open
Comment From: Dev-Khant
OK, can you suggest any beginner issue for me to work on?
Comment From: wany-oh
Would this change also apply to Series.str.get_dummies()?
Comment From: MarcoGorelli
that's a good point, .str.get_dummies still defaults to int64 - want to open a separate issue about changing that to bool too?
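In the meantime, a minimal sketch of getting bool output from .str.get_dummies by casting afterwards. The sample Series and the "|" separator are made up for illustration, and the default dtype printed first may depend on the installed pandas version.

```python
import pandas as pd

# Hypothetical sample data: labels separated by "|".
s = pd.Series(["a|b", "a", "b"])

str_dummies = s.str.get_dummies(sep="|")
print(str_dummies.dtypes)  # inspect the default dtype of the installed version

# Explicit cast to bool after the fact.
str_dummies_bool = str_dummies.astype(bool)
print(str_dummies_bool)
```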