Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
I have a dataset with a few categories that are coded as the presence or absence of an intervention. I am fitting a simple linear regression using OLS. Multiple interventions may be present at the same time, so I am adding an effect for each intervention. However, the dummy variable is not being encoded the way I want it to be, and it makes the effects hard to interpret.
import pandas as pd

data = {
    "cat1": [0, 0, 0, 1, 1, 1, 1],
    "cat2": [0, 0, 0, 0, 0, 1, 1],
    "cat3": [1, 1, 0, 0, 1, 1, 1],
    "cat4": [0, 1, 0, 1, 1, 1, 1],
}
# Load data into a DataFrame object
dftest = pd.DataFrame(data)
# Variable to store the label mapping in
label_map = {}
for fact in dftest.columns:
    dftest[fact], label_map[fact] = pd.factorize(dftest[fact])
    print(label_map[fact])
This produces the following dummy coding output:
Int64Index([0, 1], dtype='int64')
Int64Index([0, 1], dtype='int64')
Int64Index([1, 0], dtype='int64')
Int64Index([0, 1], dtype='int64')
Issue Description
How do I ensure that 0 in the original mapping is always the dummy for all features? Can I specify which level in a factor should be the dummy?
Expected Behavior
for fact in dftest.columns:
    dftest[fact], label_map[fact] = pd.factorize(dftest[fact])
    print(label_map[fact])
If this were fixed, printing the label mappings should produce the following output:
Int64Index([0, 1], dtype='int64')
Int64Index([0, 1], dtype='int64')
Int64Index([0, 1], dtype='int64')
Int64Index([0, 1], dtype='int64')
Installed Versions
Comment From: phofl
The order is preserved. Did you try sort=True in factorize?
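For illustration, here is a minimal sketch of that suggestion, reusing the data from the report. By default pd.factorize assigns codes in order of first appearance, which is why cat3 (whose first value is 1) maps 1 to code 0. Passing sort=True sorts the unique values first, so 0 gets code 0 in every column. The second loop shows one possible way to pin the level order explicitly with pd.Categorical. The names encoded and explicit are illustrative only and are not part of the original report.

```python
import pandas as pd

data = {
    "cat1": [0, 0, 0, 1, 1, 1, 1],
    "cat2": [0, 0, 0, 0, 0, 1, 1],
    "cat3": [1, 1, 0, 0, 1, 1, 1],
    "cat4": [0, 1, 0, 1, 1, 1, 1],
}
dftest = pd.DataFrame(data)

label_map = {}
encoded = {}  # hypothetical name: integer codes per column
for fact in dftest.columns:
    # sort=True sorts the uniques, so the smallest value (0) gets code 0
    codes, uniques = pd.factorize(dftest[fact], sort=True)
    encoded[fact] = codes
    label_map[fact] = uniques
    print(fact, uniques.tolist())

# Alternative: state the level order explicitly, so the first listed
# category always receives code 0 regardless of the data order.
explicit = {}  # hypothetical name: codes from an explicit level order
for fact in dftest.columns:
    explicit[fact] = pd.Categorical(dftest[fact], categories=[0, 1]).codes
```

With an explicit categories list, the reference level is whichever value is listed first, so it can be chosen per column rather than relying on sort order.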
Comment From: phofl
Closing for now, please ping to reopen when you can address comments