Code Sample, a copy-pastable example if possible
Consider the dataframe df = pd.DataFrame([{'apple': 1, 'pear':'a', 'carrot': 1}, {'apple':'a', 'pear':2, 'carrot':3}, {'apple': 2, 'pear':3, 'carrot':1}, {'apple': 3, 'pear':'b', 'carrot': 1}, {'apple': 4, 'pear':4, 'carrot': 1}])
which equals:
apple carrot pear
0 1 1 a
1 a 3 2
2 2 1 3
3 3 1 b
4 4 1 4
If I use get_dummies I get
carrot apple_1 apple_2 apple_3 apple_4 apple_a pear_2 pear_3 \
0 1 1 0 0 0 0 0 0
1 3 0 0 0 0 1 1 0
2 1 0 1 0 0 0 0 1
3 1 0 0 1 0 0 0 0
4 1 0 0 0 1 0 0 0
pear_4 pear_a pear_b
0 0 1 0
1 0 0 0
2 0 0 0
3 0 0 1
4 1 0 0
This makes a new column for every value in the apple and pear columns (11 in total). This would be many more if we had more rows with numerical values in them in df
.
But much better (at least for me) is what sklearn DictVectorizer does.
from sklearn.feature_extraction import DictVectorizer
enc = DictVectorizer(sparse = False)
enc.fit_transform(df.to_dict(orient='r'))
This gives:
array([[ 1., 0., 1., 0., 1., 0.],
[ 0., 1., 3., 2., 0., 0.],
[ 2., 0., 1., 3., 0., 0.],
[ 3., 0., 1., 0., 0., 1.],
[ 4., 0., 1., 4., 0., 0.]])
We can see the features names of the columns with:
enc.feature_names_
['apple', 'apple=a', 'carrot', 'pear', 'pear=a', 'pear=b']
So we can that this has only created new columns for the non-numerical values in the apple and pear columns.
Problem description
get_dummies creates a huge number of new columns for numerical values in a column where only one or two entries are in fact non-numerical.
Expected Output
It would be great if there were an option for get_dummies to only make new columns for the non-numerical values in a column.
Output of pd.show_versions()
Comment From: pratapvardhan
Just for this usecase, here are two ways you could do it.
In [958]: df.replace('', 0).join(df.eq('').add_suffix('=')).astype(float)
Out[958]:
apple carrot pear apple= carrot= pear=
0 3.0 1.0 0.0 0.0 0.0 1.0
1 0.0 3.0 2.0 1.0 0.0 0.0
2 4.0 1.0 3.0 0.0 0.0 0.0
Or, use pd.to_numeric
instead of replacing blanks, for any non-numeric values,
In [959]: dfn = df.apply(pd.to_numeric)
In [960]: dfn.fillna(0).join(dfn.isnull().add_suffix('=')).astype(float)
Out[960]:
apple carrot pear apple= carrot= pear=
0 3.0 1.0 0.0 0.0 0.0 1.0
1 0.0 3.0 2.0 1.0 0.0 0.0
2 4.0 1.0 3.0 0.0 0.0 0.0
and, drop columns which are all-zero perhaps?
Comment From: pratapvardhan
And, instead of
enc.fit_transform(df.T.to_dict().values())
you could just do
enc.fit_transform(df.to_dict(orient='records'))
orient=records
will give you records list-like.
Comment From: lesshaste
@pratapvardhan Those are nice solutions to the particular example I gave. I guess one can drop all zero columns with dfn.loc[:, (dfn != 0).any(axis=0)]
? It would be interesting to benchmark the solutions you gave. In fact I thought pd.get_dummies(dfn, dummy_na=True)
should work too for my simple example but it appears not to.
It would have been better if I had shown an example with more different categorical values. Take this very slightly bigger example:
apple carrot pear
0 3 1 a
1 a 3 2
2 4 1 3
3 3 1 b
4 3 1 4
enc.fit_transform(df.to_dict(orient='r'))
array([[ 3., 0., 1., 0., 1., 0.],
[ 0., 1., 3., 2., 0., 0.],
[ 4., 0., 1., 3., 0., 0.],
[ 3., 0., 1., 0., 0., 1.],
[ 3., 0., 1., 4., 0., 0.]])
But using get_dummies we get:
pd.get_dummies(df)
carrot apple_3 apple_4 apple_a pear_2 pear_3 pear_4 pear_a pear_b
0 1 1 0 0 0 0 0 1 0
1 3 0 0 1 1 0 0 0 0
2 1 0 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0 0 1
4 1 1 0 0 0 0 1 0 0
In general, my data will have dozens of categorical values and tens of thousands of numerical values per column. It would be great if pandas had an easy way to deal with this situation.
Comment From: pratapvardhan
Ah, thanks, can reproduce it, removing the above comment, could you add this to first post under Code Sample
section?
Comment From: pratapvardhan
@lesshaste -- for time-being, you could use pd.get_dummies
like?
In [1192]: dfn = df.apply(pd.to_numeric, errors='coerce').isnull() # or df.applymap(np.isreal)
In [1193]: df.mask(dfn, 0).join(pd.get_dummies(df.where(dfn)).filter(like='_'))
Out[1193]:
apple carrot pear apple_a pear_a pear_b
0 3 1 0 0 1 0
1 0 3 2 1 0 0
2 4 1 3 0 0 0
3 3 1 0 0 0 1
4 3 1 4 0 0 0
In [1194]: pd.DataFrame(enc.fit_transform(df.to_dict(orient='r')), columns=enc.feature_names_)
Out[1194]:
apple apple=a carrot pear pear=a pear=b
0 3.0 0.0 1.0 0.0 1.0 0.0
1 0.0 1.0 3.0 2.0 0.0 0.0
2 4.0 0.0 1.0 3.0 0.0 0.0
3 3.0 0.0 1.0 0.0 0.0 1.0
4 3.0 0.0 1.0 4.0 0.0 0.0
Comment From: lesshaste
@pratapvardhan Your new solution is ingenuous! Is there an easy way to make a large test dataframe so we can benchmark it?
Comment From: MarcoGorelli
Hi @lesshaste - what's the use-case for this?
Comment From: MarcoGorelli
Doesn't look like there's been any uptake on this in 5+ years, so without a use-case let's close - can reopen if necessary
thanks anyway for the suggestion!