I use the get_dummies function while feature engineering for scikit-learn classifiers.
One thing I noticed: if one of the values of a dummy-encoded column is missing from a DataFrame, the resulting matrix ends up with a different number of columns.
import pandas as pd

df_1 = pd.DataFrame([
{'size' : 's', 'backorder' : True},
{'size' : 'm', 'backorder' : True},
{'size' : 'l', 'backorder' : True},
])
df_2 = pd.DataFrame([
{'size' : 's', 'backorder' : True},
{'size' : 's', 'backorder' : True},
{'size' : 'l', 'backorder' : True},
])
pd.get_dummies(df_1, 'size').shape # returns (3,4)
pd.get_dummies(df_2, 'size').shape # returns (3,3)
It's not elegant, but would it be reasonable to add a parameter listing the unique values of a dummy column (to preserve shape)?
I'm happy to give this a shot if it sounds like a useful feature.
Comment From: jreback
The typical way to do this is to convert the column to a categorical that lists all of your categories. In fact, get_dummies does exactly this:
In [17]: pd.get_dummies(df_2['size'].astype('category', categories=list('sml')))
Out[17]:
   s  m  l
0  1  0  0
1  1  0  0
2  0  0  1
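For anyone reading this on a newer pandas: as far as I know, the `categories=` keyword to `astype` used above was deprecated and later removed, so the same idea is usually spelled with a `CategoricalDtype`. Below is a minimal sketch under that assumption. The name `size_dtype` is made up for illustration, and it assumes 's', 'm', 'l' are the only sizes that can occur.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the full set of categories once (assumed here to be s, m, l).
size_dtype = CategoricalDtype(categories=['s', 'm', 'l'])

df_1 = pd.DataFrame({'size': ['s', 'm', 'l'], 'backorder': [True, True, True]})
df_2 = pd.DataFrame({'size': ['s', 's', 'l'], 'backorder': [True, True, True]})

# Casting to the shared dtype makes get_dummies emit one column per
# declared category, even when a category never appears in the data.
dummies_1 = pd.get_dummies(df_1['size'].astype(size_dtype))
dummies_2 = pd.get_dummies(df_2['size'].astype(size_dtype))

print(dummies_1.shape, dummies_2.shape)  # both (3, 3)
print(list(dummies_2.columns))           # ['s', 'm', 'l']
```

Reusing the same dtype object for the training and prediction frames should keep the column order and count consistent between them, which is what the classifier needs.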