Code Sample, a copy-pastable example if possible

Consider the dataframe df = pd.DataFrame([{'apple': 1, 'pear':'a', 'carrot': 1}, {'apple':'a', 'pear':2, 'carrot':3}, {'apple': 2, 'pear':3, 'carrot':1}, {'apple': 3, 'pear':'b', 'carrot': 1}, {'apple': 4, 'pear':4, 'carrot': 1}]) which equals:

  apple  carrot pear
0     1       1    a
1     a       3    2
2     2       1    3
3     3       1    b
4     4       1    4

If I use get_dummies I get

   carrot  apple_1  apple_2  apple_3  apple_4  apple_a  pear_2  pear_3  \
0       1        1        0        0        0        0       0       0   
1       3        0        0        0        0        1       1       0   
2       1        0        1        0        0        0       0       1   
3       1        0        0        1        0        0       0       0   
4       1        0        0        0        1        0       0       0   

   pear_4  pear_a  pear_b  
0       0       1       0  
1       0       0       0  
2       0       0       0  
3       0       0       1  
4       1       0       0  

This makes a new column for every value in the apple and pear columns (11 in total). This would be many more if we had more rows with numerical values in them in df.

But much better (at least for me) is what sklearn DictVectorizer does.

    from sklearn.feature_extraction import DictVectorizer
    enc = DictVectorizer(sparse = False)
    enc.fit_transform(df.to_dict(orient='r'))

This gives:

array([[ 1.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  3.,  2.,  0.,  0.],
       [ 2.,  0.,  1.,  3.,  0.,  0.],
       [ 3.,  0.,  1.,  0.,  0.,  1.],
       [ 4.,  0.,  1.,  4.,  0.,  0.]])

We can see the features names of the columns with:

    enc.feature_names_
    ['apple', 'apple=a', 'carrot', 'pear', 'pear=a', 'pear=b']

So we can that this has only created new columns for the non-numerical values in the apple and pear columns.

Problem description

get_dummies creates a huge number of new columns for numerical values in a column where only one or two entries are in fact non-numerical.

Expected Output

It would be great if there were an option for get_dummies to only make new columns for the non-numerical values in a column.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-93-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: None.None pandas: 0.19.2 nose: 1.3.7 pip: 9.0.1 setuptools: 34.3.2 Cython: 0.25.2 numpy: 1.12.1 scipy: 0.19.0 statsmodels: 0.8.0 xarray: None IPython: 5.3.0 sphinx: None patsy: 0.4.1 dateutil: 2.6.0 pytz: 2016.10 blosc: None bottleneck: None tables: None numexpr: None matplotlib: 2.0.0 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: 4.5.3 html5lib: 0.9999999 httplib2: None apiclient: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.9.5 boto: None pandas_datareader: None

Comment From: pratapvardhan

Just for this usecase, here are two ways you could do it.

In [958]: df.replace('', 0).join(df.eq('').add_suffix('=')).astype(float)
Out[958]:
   apple  carrot  pear  apple=  carrot=  pear=
0    3.0     1.0   0.0     0.0      0.0    1.0
1    0.0     3.0   2.0     1.0      0.0    0.0
2    4.0     1.0   3.0     0.0      0.0    0.0

Or, use pd.to_numeric instead of replacing blanks, for any non-numeric values,

In [959]: dfn = df.apply(pd.to_numeric)

In [960]: dfn.fillna(0).join(dfn.isnull().add_suffix('=')).astype(float)
Out[960]:
   apple  carrot  pear  apple=  carrot=  pear=
0    3.0     1.0   0.0     0.0      0.0    1.0
1    0.0     3.0   2.0     1.0      0.0    0.0
2    4.0     1.0   3.0     0.0      0.0    0.0

and, drop columns which are all-zero perhaps?

Comment From: pratapvardhan

And, instead of enc.fit_transform(df.T.to_dict().values()) you could just do enc.fit_transform(df.to_dict(orient='records'))

orient=records will give you records list-like.

Comment From: lesshaste

@pratapvardhan Those are nice solutions to the particular example I gave. I guess one can drop all zero columns with dfn.loc[:, (dfn != 0).any(axis=0)]? It would be interesting to benchmark the solutions you gave. In fact I thought pd.get_dummies(dfn, dummy_na=True) should work too for my simple example but it appears not to.

It would have been better if I had shown an example with more different categorical values. Take this very slightly bigger example:

  apple  carrot pear
0     3       1    a
1     a       3    2
2     4       1    3
3     3       1    b
4     3       1    4
enc.fit_transform(df.to_dict(orient='r'))
array([[ 3.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  3.,  2.,  0.,  0.],
       [ 4.,  0.,  1.,  3.,  0.,  0.],
       [ 3.,  0.,  1.,  0.,  0.,  1.],
       [ 3.,  0.,  1.,  4.,  0.,  0.]])

But using get_dummies we get:

pd.get_dummies(df)
   carrot  apple_3  apple_4  apple_a  pear_2  pear_3  pear_4  pear_a  pear_b
0       1        1            0             0            0           0           0           1           0
1       3        0            0             1            1           0           0           0           0
2       1        0            1             0            0           1           0           0           0
3       1        1            0             0            0           0           0           0           1
4       1        1            0             0            0           0           1           0           0

In general, my data will have dozens of categorical values and tens of thousands of numerical values per column. It would be great if pandas had an easy way to deal with this situation.

Comment From: pratapvardhan

Ah, thanks, can reproduce it, removing the above comment, could you add this to first post under Code Sample section?

Comment From: pratapvardhan

@lesshaste -- for time-being, you could use pd.get_dummies like?

In [1192]: dfn = df.apply(pd.to_numeric, errors='coerce').isnull() # or df.applymap(np.isreal)

In [1193]: df.mask(dfn, 0).join(pd.get_dummies(df.where(dfn)).filter(like='_'))
Out[1193]:
  apple  carrot pear  apple_a  pear_a  pear_b
0     3       1    0        0       1       0
1     0       3    2        1       0       0
2     4       1    3        0       0       0
3     3       1    0        0       0       1
4     3       1    4        0       0       0

In [1194]: pd.DataFrame(enc.fit_transform(df.to_dict(orient='r')), columns=enc.feature_names_)
Out[1194]:
   apple  apple=a  carrot  pear  pear=a  pear=b
0    3.0      0.0     1.0   0.0     1.0     0.0
1    0.0      1.0     3.0   2.0     0.0     0.0
2    4.0      0.0     1.0   3.0     0.0     0.0
3    3.0      0.0     1.0   0.0     0.0     1.0
4    3.0      0.0     1.0   4.0     0.0     0.0

Comment From: lesshaste

@pratapvardhan Your new solution is ingenuous! Is there an easy way to make a large test dataframe so we can benchmark it?

Comment From: MarcoGorelli

Hi @lesshaste - what's the use-case for this?

Comment From: MarcoGorelli

Doesn't look like there's been any uptake on this in 5+ years, so without a use-case let's close - can reopen if necessary

thanks anyway for the suggestion!