Pandas ENH: pd.get_dummies should not default to dtype np.uint8

I was caught by surprise the other day while doing some vector subtraction on the output of pd.get_dummies. The issue is that the default dtype is np.uint8, which means that subtracting a 1 from a 0 wraps around to 255.
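For anyone unfamiliar with the wraparound, here is a minimal sketch using plain NumPy arrays (the variable names are only illustrative). It shows that unsigned 8-bit arithmetic is done modulo 256, so 0 - 1 cannot be represented as -1:

```python
import numpy as np

zero = np.array([0], dtype=np.uint8)
one = np.array([1], dtype=np.uint8)

# uint8 cannot hold -1, so the result wraps around modulo 256.
wrapped = zero - one

# Casting to a signed dtype first gives the arithmetic result most people expect.
signed = zero.astype(np.int8) - one.astype(np.int8)

print(wrapped)  # [255]
print(signed)   # [-1]
```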

I tweeted about this surprise (with an example of this issue) and the overwhelming response was that this felt like a pretty big surprise and, in most cases, undesired default behavior. Bill Tubbs then made a mention of this in another issue where it was recommended that a new issue be created for this.

My guess is that the default of np.uint8 is meant to reduce memory demands on what are potentially large, very sparse matrices. While I'm sure there are use cases that demand this, it seems like the risk of defaulting to np.uint8 outweighs the benefits compared with just choosing a signed representation.

Comment From: phofl

Please provide the output of pd.show_versions and provide something reproducible.

We have a bug template for bug reports, or an enhancement template if you think that this is more of an enhancement.

Comment From: willkurt

To be clear, this is definitely an enhancement and not a bug since this is the documented behavior of get_dummies.

The issue is that defaulting to np.uint8 is not what most people expect, it leads to unexpected results in pretty common use cases (subtracting vectors), and it is considered to be a pretty severe 'gotcha'.

I've included the requested information as well as an example below. If you need more info I'm happy to use a template, just provide a pointer where I could find one.

Here's the output of pd.show_versions:

INSTALLED VERSIONS
------------------
commit           : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python           : 3.10.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Thu Sep 16 20:58:47 PDT 2021; root:xnu-6153.141.40.1~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.4
numpy            : 1.21.4
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 57.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.3
IPython          : 7.29.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.5.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Here's an example of the issue:

import pandas as pd

vec1 = pd.get_dummies(["okay", "gotcha", "okay"])
vec2 = pd.get_dummies(["gotcha", "okay", "okay"])
diff_default = vec1 - vec2
diff_default

returns

   gotcha  okay
0     255     1
1       1   255
2       0     0

The intended behavior can be achieved by specifying any signed dtype as you can see here:

import numpy as np
import pandas as pd

vec1 = pd.get_dummies(["okay", "gotcha", "okay"],
                     dtype=np.float32)
vec2 = pd.get_dummies(["gotcha", "okay", "okay"],
                     dtype=np.float32)
diff_correct = vec1 - vec2
diff_correct

returns the generally expected result:

   gotcha  okay
0    -1.0   1.0
1     1.0  -1.0
2     0.0   0.0

My (and many other people's) issue with this is that the default should not lead to unexpected results. It doesn't need to be a np.float32, but it should be a signed dtype.
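To illustrate the "any signed dtype" point, here is a sketch of the same example with np.int8, which keeps the one-byte-per-value footprint of np.uint8. Passing dtype explicitly also makes the result independent of whatever the default is in a given pandas version. The name diff_int8 is only illustrative.

```python
import numpy as np
import pandas as pd

# Same inputs as above, but with an explicit signed 8-bit dtype.
vec1 = pd.get_dummies(["okay", "gotcha", "okay"], dtype=np.int8)
vec2 = pd.get_dummies(["gotcha", "okay", "okay"], dtype=np.int8)

diff_int8 = vec1 - vec2
print(diff_int8)
#    gotcha  okay
# 0      -1     1
# 1       1    -1
# 2       0     0
```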

Comment From: phofl

Thx, we would probably need a deprecation cycle if we would change that

Comment From: willkurt

Makes sense to me. I appreciate you looking into this, thanks!

Comment From: MarcoGorelli

From git log it looks like it's been like this for years, but I can't tell why uint8 was chosen over int8.

I'd be in favour of deprecating the current default in favour of int8

Comment From: dhvalden

I support this change. uint8 can lead to hard-to-track errors.

Comment From: MarcoGorelli

@pandas-dev/pandas-core anyone got any objections to changing the default type? Any reason to not just make the default type bool? Then, the return type would be clear, and if people need to do arithmetic operations on the dummy values, they can do their own dtype conversion. But at least they wouldn't run into unexpected behaviour like this

General suggestion for others who would like this changed: please use reactions to express support, no need to add comments just indicating that you'd also like to see this

Comment From: jreback

bool is a problem as it doesn't play nice with missing values

could certainly return Boolean

this would be a breaking change and so needs to wait for 2.0 (I think there is a tracking issue)

Comment From: MarcoGorelli

> bool is a problem as doesn't play nice with missing values

Sure, but why would get_dummies return a missing value anyway? Unless I'm missing something, the return values would always be 0 or 1
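A quick way to check this is to feed get_dummies an input that contains a missing value. The sketch below assumes a small hypothetical Series and passes dtype=bool explicitly. With the default dummy_na=False, the missing row comes back as all False rather than as a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical input with one missing entry.
s = pd.Series(["a", np.nan, "b"])

dummies = pd.get_dummies(s, dtype=bool)
print(dummies)

# No cell of the output is missing; the NaN row is simply all False.
has_missing = bool(dummies.isna().any().any())
nan_row_all_false = not bool(dummies.iloc[1].any())
print(has_missing, nan_row_all_false)  # False True
```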

Comment From: bashtage

IMO you either go with int64 or just make them float, if you want to move away from the idea of using the smallest int dtype that can represent the encoded categorical variable. float is probably the most sensible (between int64 and float) since it has the same storage requirements and handles nan fine.

Comment From: bashtage

> The intended behavior can be achieved by specifying any signed dtype as you can see here:

This isn't quite right. Any int dtype can wrap under the right conditions. It wouldn't happen subtracting 2 dummies, but you cannot know that there isn't some edge case out there.

I agree that uint is particularly problematic here since np.iinfo(dt).max is always 1 to the left of 0.
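A small NumPy sketch of both points, with illustrative values: a signed dtype also wraps once a result leaves its range, but for unsigned types the wrap point sits directly next to 0, which is exactly where dummy arithmetic lives.

```python
import numpy as np

# Signed integers wrap too, just far away from 0: 100 + 100 does not fit in int8.
a = np.array([100], dtype=np.int8)
wrapped_signed = a + a
print(wrapped_signed)  # [-56]

# For an unsigned dtype, the maximum value is what you get for 0 - 1.
uint8_max = np.iinfo(np.uint8).max
zero_minus_one = np.array([0], dtype=np.uint8) - np.array([1], dtype=np.uint8)
print(uint8_max, zero_minus_one[0])  # 255 255
```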

Comment From: bashtage

Another funny behavior of get_dummies:


In [26]: pd.get_dummies(c,dummy_na=True)
Out[26]:
   a  b  NaN
0  1  0    0
1  0  1    0
2  1  0    0
3  0  0    1


In [25]: ~pd.get_dummies(c,dummy_na=True)
Out[25]:
     a    b  NaN
0  254  255  255
1  255  254  255
2  254  255  255
3  255  255  254

The suggestion to use bool would avoid this issue.
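As a sketch of the difference, the snippet below rebuilds a comparable input (the definition of c is an assumption, since it is not shown above) and applies ~ to both a uint8 and a bool result. On uint8 the operator is a bitwise NOT; on bool it is a logical negation.

```python
import numpy as np
import pandas as pd

# Assumed input; the original definition of `c` is not shown in the thread.
c = pd.Series(["a", "b", "a", np.nan])

d_uint8 = pd.get_dummies(c, dummy_na=True, dtype=np.uint8)
d_bool = pd.get_dummies(c, dummy_na=True, dtype=bool)

inv_uint8 = ~d_uint8  # bitwise NOT: 1 -> 254, 0 -> 255
inv_bool = ~d_bool    # logical NOT: True -> False, False -> True

print(inv_uint8.to_numpy().tolist()[0])  # [254, 255, 255]
print(inv_bool.to_numpy().tolist()[0])   # [False, True, True]
```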

Comment From: MarcoGorelli

Thanks @bashtage

> float is probably the most sensible (between int64 and float) since it has the same storage requirements and handles nan fine.

Regarding handling nan - is there an example of a case when get_dummies returns nan? If not, then bool should be fine, right?

Comment From: MarcoGorelli

@willkurt do you want to open a PR for this? This'd involve:

- in pandas/tests/reshape/test_get_dummies.py, for tests which don't already specify a dtype, pass np.dtype(np.uint8)
- add a test which doesn't specify a dtype, and assert that a FutureWarning with a message like "the default dtype will change from 'uint8' to 'bool', please specify a dtype to silence this warning" is raised
- in pandas/core/reshape/encoding.py, in _get_dummies_1d, add a FutureWarning with a message like the above if dtype wasn't passed by the user

Not strictly necessary, but I think dtype=None could also be changed to dtype=lib.no_default
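The deprecation steps above could be sketched roughly as follows. This is not the actual pandas code: resolve_dummy_dtype and _NO_DEFAULT are hypothetical names, with _NO_DEFAULT standing in for a "not passed" sentinel such as lib.no_default, and the warning text is just the message suggested above.

```python
import warnings

import numpy as np

# Hypothetical sentinel that distinguishes "dtype not passed" from dtype=None.
_NO_DEFAULT = object()


def resolve_dummy_dtype(dtype=_NO_DEFAULT):
    """Return the dtype to use, warning if the caller relied on the default."""
    if dtype is _NO_DEFAULT:
        warnings.warn(
            "the default dtype will change from 'uint8' to 'bool', "
            "please specify a dtype to silence this warning",
            FutureWarning,
            stacklevel=2,
        )
        return np.dtype(np.uint8)  # keep the old default during the deprecation
    return np.dtype(dtype)


# Relying on the default emits a FutureWarning; an explicit dtype does not.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_dtype = resolve_dummy_dtype()
    explicit_dtype = resolve_dummy_dtype(bool)

print(default_dtype, explicit_dtype, len(caught))  # uint8 bool 1
```

A test for this would look much like the last few lines: call the function without a dtype and assert that exactly one FutureWarning was recorded.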

Sounds like there's agreement on changing the default type away from uint8, we can always revisit the message about what it'll be changed to in the PR review

If anyone wants to work on this, here's the contributing guide, and feel free to ask for help if anything's unclear

Comment From: WillAyd

Probably in the minority but I think uint8 is a natural return type. bool would also be ok

Int64 and float are pretty heavy handed - I think memory usage is really important here.
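To put rough numbers on the memory argument, here is a back-of-the-envelope sketch for a dense dummy matrix. The shape (one million rows, 50 categories) is an arbitrary assumption, and the sizes are computed from the dtype item sizes rather than by allocating anything.

```python
import numpy as np

# Assumed shape of a dense dummy matrix, for illustration only.
n_rows, n_categories = 1_000_000, 50

sizes = {}
for dt in (np.bool_, np.uint8, np.int8, np.int64, np.float64):
    dtype = np.dtype(dt)
    # Bytes needed for a dense array of this shape and dtype.
    sizes[dtype.name] = n_rows * n_categories * dtype.itemsize

for name, nbytes in sizes.items():
    print(f"{name:>8}: {nbytes / 1e6:.0f} MB")
```

Under these assumptions bool, uint8 and int8 all need 50 MB, while int64 and float64 need 400 MB each.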

Comment From: Dev-Khant

Hi everyone, I am starting in Open Source and willing to contribute. Can I work on this issue and can anyone help me get started?

Comment From: MarcoGorelli

Hey - thanks, but there's already a PR open

Comment From: Dev-Khant

OK, can you suggest a beginner issue for me to work on?

Comment From: wany-oh

Would this change also apply to Series.str.get_dummies()?

Comment From: MarcoGorelli

that's a good point, .str.get_dummies still defaults to int64 - want to open a separate issue about changing that to bool too?
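In the meantime, a possible workaround is to convert the result of .str.get_dummies explicitly. The sketch below uses a hypothetical pipe-separated Series and casts the indicator columns to bool with astype:

```python
import pandas as pd

# Hypothetical input: pipe-separated labels per row.
s = pd.Series(["a|b", "a", "b|c"])

# Cast the integer indicator columns to bool explicitly.
flags = s.str.get_dummies(sep="|").astype(bool)
print(flags)
#        a      b      c
# 0   True   True  False
# 1   True  False  False
# 2  False   True   True
```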