Introduction
The pandas.read_* functions and the constructors do an excellent job of assigning the most general dtype that can hold all values of a column. But such inference is lacking for DataFrames created by other methods (stack and unstack are prime examples).
There has been a lot of discussion about dtypes here (ref. #9216, #5902, and especially #9589), and I understand it is a well-rehearsed topic with no general consensus. An unfortunate result of those discussions was the deprecation of the .convert_objects method for being too forceful. However, the undercurrent in those discussions (IMHO) points to, and my needs often require, a (DataFrame and Series) method which intelligently assigns the lowest generic dtype based on the data.
The method could optionally take a list of dtypes, or a dictionary mapping column names to dtypes, to assign user-specified dtypes. Note that I am proposing this in addition to the existing to_* methods. The following example illustrates:
In [1]: import numpy as np
        import pandas as pd

        df = pd.DataFrame({'c1': list('AAABBBCCC'),
                           'c2': list('abcdefghi'),
                           'c3': np.random.randn(9),
                           'c4': np.arange(9)})
        df.dtypes
Out[1]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [2]: df = df.stack().unstack()
        df.dtypes
Out[2]: c1    object
        c2    object
        c3    object
        c4    object
        dtype: object
Expected Output
Define a method .set_dtypes which does the following:
1. Either takes a boolean keyword argument infer to infer and reset each column's dtype to the least general dtype such that no values are lost.
2. Or takes a list or dictionary of dtypes to force each column into user-specified dtypes, with an optional errors keyword argument to control how casting errors are handled.

As illustrated below:
In [3]: df.set_dtypes(infer=True).dtypes
Out[3]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [4]: df.set_dtypes(types=[np.int64]*4, errors='coerce').dtypes
Out[4]: c1    int64
        c2    int64
        c3    int64
        c4    int64
        dtype: object

In [5]: df.set_dtypes(types=[np.int64]*4, errors='coerce')  # Note loss of data
Out[5]:     c1   c2  c3  c4
        0  NaN  NaN   1   0
        1  NaN  NaN   1   1
        2  NaN  NaN   0   2
        3  NaN  NaN   0   3
        4  NaN  NaN   0   4
        5  NaN  NaN   0   5
        6  NaN  NaN   2   6
        7  NaN  NaN   0   7
        8  NaN  NaN   1   8

In [6]: df.set_dtypes(types=[np.int64]*4, errors='ignore').dtypes
Out[6]: c1     object
        c2     object
        c3     object
        c4      int64
        dtype: object
Additional Notes
I understand that date and time types will be a little more difficult to infer. However, following the logic powering the pandas.read_* functions, date and time types would not be inferred automatically but would be passed explicitly by the user.
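For instance, read_csv leaves dates as object unless told otherwise (a minimal sketch using an inline CSV):

In [7]: import io
        pd.read_csv(io.StringIO('d,x\n2016-01-01,1\n2016-01-02,2')).dtypes
Out[7]: d    object
        x     int64
        dtype: object

In [8]: pd.read_csv(io.StringIO('d,x\n2016-01-01,1\n2016-01-02,2'),
                    parse_dates=['d']).dtypes
Out[8]: d    datetime64[ns]
        x             int64
        dtype: object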
It would be a one-size-fits-all solution if users were also allowed to pass True or False in place of a dtype when specifying dtypes per column. True would mean infer automatically (set the best dtype), while False would mean exclude the column from conversion.
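A hypothetical call mixing all three behaviors might then look like this (none of this exists today; the sketch only illustrates the proposal):

In [9]: df.set_dtypes(types={'c1': False,       # exclude this column from conversion
                             'c2': True,        # infer the best dtype automatically
                             'c3': np.float32,  # force an explicit dtype
                             'c4': np.int8},
                      errors='raise')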
Comment From: jreback
In [8]: df.dtypes
Out[8]:
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [9]: df.stack().unstack().dtypes
Out[9]:
c1    object
c2    object
c3    object
c4    object
dtype: object

In [10]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='ignore')).dtypes
Out[10]:
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [11]: pd.to_numeric?

# you can even request downcasting
In [12]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer')).dtypes
Out[12]:
c1     object
c2     object
c3    float64
c4       int8
dtype: object
Comment From: jreback
If you really want to coerce:

In [13]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='coerce')).dtypes
Out[13]:
c1    float64
c2    float64
c3    float64
c4      int64
dtype: object
Comment From: jreback
Note that the downcast keyword is new in 0.19.0; .to_numeric has been around for a while.
Comment From: jreback
docs are: http://pandas.pydata.org/pandas-docs/stable/basics.html#object-conversion
Comment From: jreback
Explicit astyping (per column, via a dict) is also available in 0.19.0:

In [9]: df.stack().unstack().astype({'c3': 'float', 'c4': 'int8'}).dtypes
Out[9]:
c1     object
c2     object
c3    float64
c4       int8
dtype: object
All of the above said, if you think the docs could use an enhancement, then please submit a doc PR.
Comment From: jreback
Further note that the stack/unstack idiom does not generally play nice with mixed object and other dtypes, so a more typical pattern is:
In [15]: df.set_index(['c1', 'c2']).stack().unstack().reset_index().dtypes
Out[15]:
c1     object
c2     object
c3    float64
c4    float64
dtype: object
Comment From: dragonator4
@jreback, Thank you for all the replies. However, how would you explain the following behavior?
In [1]: df = pd.DataFrame({'c1': list('ABC'),
                           'c2': list('123'),
                           'c3': np.random.randn(3),
                           'c4': np.arange(3)})
        df.dtypes
Out[1]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [2]: df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=0).dtypes
Out[2]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object
The expected output (using the .set_dtypes method I proposed):

In [2]: df.set_dtypes(infer=True).dtypes
Out[2]: c1     object
        c2      int64
        c3    float64
        c4      int64
        dtype: object
Note that this cannot be done by ignoring errors in pd.to_numeric. Errors will have to be coerced to get the desired output, which loses data in the object column while upcasting the int64 columns to float64. downcast='integer' does nothing in this case.
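To make the data loss concrete, here is a sketch of the coerce route on this frame, where every letter in c1 is replaced by NaN even though the other columns convert cleanly:

In [3]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dtypes
Out[3]: c1    float64
        c2      int64
        c3    float64
        c4      int64
        dtype: object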
In my present use case, I am fitting models to data from thousands of sensors and returning a Series with model result information (see this SO question). The resulting DataFrame is then unstacked (see the answer to that question), and I get a frame with many columns and hundreds of thousands of rows.
Unfortunately, not all rows in a given column have the same type. For example, if the model fits one parameter, I get a single number; if it fits multiple parameters, I get a list. Since I cannot be expected to manually examine every value in the frame, and I cannot afford to lose data through typecasting, I need a better method to cast intelligently so that I can build further analysis on the model results.
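A minimal sketch of such a column (hypothetical values): a scalar for one sensor and a list for another, so the column can only ever be object, and any blanket coercion destroys the lists.

In [4]: pd.Series({'sensor_1': 0.57, 'sensor_2': [0.57, 1.2]})
Out[4]: sensor_1           0.57
        sensor_2    [0.57, 1.2]
        dtype: object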
Even the following doesn't help:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'), axis=0).fillna(df)
Comment From: TomAugspurger
Your call here

df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=1).dtypes

is applying pd.to_numeric to each row. Are you sure you want axis=1?
Comment From: dragonator4
Sorry @TomAugspurger, that was a typo, out of habit. I want axis=0. I have edited accordingly.
Actually, correcting that typo results in this behavior:
In [3]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df).dtypes
Out[3]: c1     object
        c2      int64
        c3     object
        c4      int64
        dtype: object
Why is column c3 converted from float64 to object?
Comment From: jreback
In [45]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df).apply(lambda x: pd.to_numeric(x, errors='ignore')).dtypes
Out[45]:
c1     object
c2      int64
c3    float64
c4      int64
dtype: object
Filling a formerly float column (e.g. c1) with objects works, but it also coerces c3 to object because it is in the same block. This is an implementation detail ATM, but I'll open an issue for it.
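One way to sidestep the block-level coercion in the meantime (an untested sketch, not an official idiom) is to fill each column separately, so a fill on one column can never touch another:

In [46]: converted = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
         pd.DataFrame({c: converted[c].fillna(df[c]) for c in df.columns},
                      columns=df.columns).dtypes
Out[46]:
c1     object
c2      int64
c3    float64
c4      int64
dtype: object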
Having another method for this operation is really not worth it. We already have .astype() and explicit conversions.
Comment From: dragonator4
I agree, having another method is not worth it if the method is not doing anything new. But the proposed method can, in principle, replace .astype, the pd.to_* functions, and the erstwhile .convert_objects method. It could look like pd.set_dtypes(<frame or series>, infer=<boolean>, types=<list or dict>, errors=<'ignore', 'coerce', 'raise'>). Basically, it is a one-size-fits-all solution.
The current solution of applying to_numeric twice and filling NaNs is a bit too roundabout and clunky. It will also mess up datetime columns. In my use case, datetime is part of the index (the raw data is grouped on datetime), but if it weren't (as in, if I were returning datetimes), I can foresee it having problems. After converting with to_numeric (applied twice), one would also have to apply to_datetime in the case of datetime columns.
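For example, with a hypothetical frame df2 holding a datetime-string column, a third, guarded pass would be needed (a sketch; the dtype check only avoids reinterpreting numeric columns as epochs):

In [3]: df2 = pd.DataFrame({'t': ['2016-01-01', '2016-01-02'],
                            'x': ['1', '2']})
        df2 = df2.apply(lambda x: pd.to_numeric(x, errors='ignore'))
        df2.apply(lambda x: pd.to_datetime(x, errors='ignore')
                  if x.dtype == object else x).dtypes
Out[3]: t    datetime64[ns]
        x             int64
        dtype: object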
Pandas already does a great job of converting datetime and timedelta types by inferring the format. Basically, what I am driving at is a universal method to handle all generic forms of typecasting, with the ability to infer types automatically if needed and to take user input as desired. I agree the algorithm for such a method might get a bit hairy with datetime types, but pandas is known for its simple flexibility.
A method that handles all simple cases of typecasting does not exist. I rest my case here and leave the decision to you. But a universal type setting and resetting method is worth considering.
Comment From: jreback
@dragonator4 Methods with lots of potentially overlapping keyword arguments are confusing and non-pythonic.
You should re-examine your pipeline to see why, in fact, you are doing .stack().unstack(); as I pointed out above, this is not a normal idiom.
Further, the reason .convert_objects() was removed is ambiguity in type detection, especially w.r.t. datetimes/timedeltas: multiple answers are correct and it is an underspecified problem.
If you want to suggest a small API addition that fills a need that is not currently satisfied, please do so. But, for example, removing .astype() is a non-starter; it is a long-standing API method across many ecosystems.
As I xrefed, the convert-fill-convert idiom is buggy ATM in this specific case. Pull requests to fix it are also welcome.