Pandas DataFrame.from_dict automatic dtype conversion?

I would expect this syntax to work without problems:

import pandas as pd
d = [{'a':'1','b':'2'},{'a':'3','b':'4'}]
pd.DataFrame.from_dict(d, orient='columns', dtype={'a':int,'b':int})

Expected Output

Expected DataFrame:

       a  b
    0  1  2
    1  3  4

with dtypes:

    a    int64
    b    int64
    dtype: object

Instead, the call raises:

/usr/lib/python2.7/dist-packages/numpy/core/_internal.pyc in _makenames_list(adict, align)
     24     for fname in fnames:
     25         obj = adict[fname]
---> 26         n = len(obj)
     27         if not isinstance(obj, tuple) or n not in [2, 3]:
     28             raise ValueError("entry not a 2- or 3- tuple")

TypeError: object of type 'type' has no len()

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-68-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.4
pip: None
setuptools: 26.1.1
Cython: 0.24.1
numpy: 1.8.2
scipy: 0.14.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: 2.3.5
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.4.2
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9.2
apiclient: 1.5.3
sqlalchemy: 0.9.8
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.34.0
pandas_datareader: None

Comment From: jorisvandenbossche

From the docstring of from_dict:

dtype : dtype, default None
    Data type to force, otherwise infer

So I think the dtype argument here only supports single dtypes, not dicts of dtypes (as some other pandas functions do). In the example case this is even simpler:

In [16]: pd.DataFrame.from_dict(d, orient='columns', dtype=int)
Out[16]: 
   a  b
0  1  2
1  3  4

In [17]: pd.DataFrame.from_dict(d, orient='columns', dtype=int).dtypes
Out[17]: 
a    int64
b    int64
dtype: object

But for the general case, accepting dicts of dtypes would be an enhancement to from_dict.
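For context on the traceback in the original report: the failing frames are inside NumPy, which suggests the dict reached np.dtype() and was read as a structured-dtype field specification rather than as a column-to-dtype mapping. A minimal sketch of that failure, assuming only NumPy is involved (the exact exception type and message may differ between NumPy versions, so both TypeError and ValueError are caught here):

```python
import numpy as np

# Same dict as in the report, passed straight to NumPy.
spec = {'a': int, 'b': int}

error = None
try:
    # NumPy tries to read each value as a field spec (a 2- or 3-tuple),
    # so a bare type such as int is rejected.
    np.dtype(spec)
except (TypeError, ValueError) as exc:
    error = exc

print(type(error).__name__, error)
```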

Comment From: JoaoAparicio

Alright, so imagine that I have one int column and one float column. The problem still stands, no?

Should this be improved?

Comment From: jorisvandenbossche

Yes, a PR to improve this would be welcomed.

Comment From: JoaoAparicio

Like this?

my_dtypes = { ( ... ) }
# convert each listed column with its target type, skipping unknown columns
for k, v in my_dtypes.items():
    if k in df.columns:
        df[k] = df[k].apply(v)

Comment From: jreback

duplicate issue: https://github.com/pandas-dev/pandas/issues/4464

Comment From: jorisvandenbossche

@JoaoAparicio basically, yes, but the dataframe constructor code is rather complex (many options / code paths), so you would have to see where this fits. There are possibly also ways to do it more efficiently during dataframe creation instead of using astype afterwards. See also how pd.DataFrame().astype(dict) implements this, which, BTW, is something you can already use at the moment:

pd.DataFrame.from_dict(d, orient='columns').astype({'a':int,'b':int})

This works fine with different dtypes (and it is also more explicit that the conversion happens after the dataframe creation).

Comment From: avnishbm

However, astype() fails if the column has a missing value (np.nan), because NaN cannot be converted from float to int. The other problem is that if a row in the input is missing a value for some column, pandas fills it with np.nan and converts the entire column to float, even though the rest of the elements in the column were int. Converting it back to int with astype() then fails as mentioned (because of the np.nan value), i.e. the column remains of float type.
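A short sketch of the behaviour described here, with made-up records. The last step is an assumption on my side: it uses the nullable 'Int64' extension dtype, which only exists in pandas 0.24 and later, so it is not available in the 0.18.1 version from the original report.

```python
import pandas as pd

# Hypothetical records: the second one has no 'b' key.
records = [{'a': 1, 'b': 2}, {'a': 3}]
df = pd.DataFrame.from_dict(records, orient='columns')

# The missing value becomes NaN, which upcasts column 'b' to float64.
print(df.dtypes)

# Casting back to a plain int fails because NaN has no integer representation.
try:
    df.astype({'b': int})
    cast_failed = False
except ValueError:
    cast_failed = True

# Assumption: pandas >= 0.24, where 'Int64' (capital I) can hold missing values.
nullable = df.astype({'b': 'Int64'})
print(nullable.dtypes)
```

With the nullable dtype the column stays integer-typed and the missing entry is kept as a missing value instead of forcing the whole column to float.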
