Introduction
The pandas.read_* functions and the constructors do an excellent job of assigning the most general dtype that can hold all values of a column. But such inference is lacking for DataFrames created by other methods (stack and unstack are prime examples).
There has been a lot of discussion about dtypes here (ref. #9216, #5902, and especially #9589), and I understand it is a well-rehearsed topic with no general consensus. An unfortunate result of those discussions was the deprecation of the .convert_objects method for being too forceful. However, the undercurrent in those discussions (IMHO) points to, and my needs often require, a (DataFrame and Series) method which intelligently assigns the lowest generic dtype based on the data.
The method could optionally take a list of dtypes, or a dictionary mapping column names to dtypes, to assign user-specified dtypes. Note that I am proposing this in addition to the existing to_* methods. The following example illustrates:
In [1]: import numpy as np
        import pandas as pd

        df = pd.DataFrame({'c1': list('AAABBBCCC'),
                           'c2': list('abcdefghi'),
                           'c3': np.random.randn(9),
                           'c4': np.arange(9)})
        df.dtypes
Out[1]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [2]: df = df.stack().unstack()
        df.dtypes
Out[2]: c1    object
        c2    object
        c3    object
        c4    object
        dtype: object
Expected Output
Define a method .set_dtypes which does the following:
1. Either takes a boolean keyword argument infer to infer and reset each column's dtype to the least general dtype such that no values are lost.
2. Or takes a list or dictionary of dtypes to force each column into user-specified dtypes, with an optional errors keyword argument to control how casting errors are handled.

As illustrated below:
In [3]: df.set_dtypes(infer=True).dtypes
Out[3]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [4]: df.set_dtypes(types=[np.int64]*4, errors='coerce').dtypes
Out[4]: c1    int64
        c2    int64
        c3    int64
        c4    int64
        dtype: object

In [5]: df.set_dtypes(types=[np.int64]*4, errors='coerce')  # Note loss of data
Out[5]:     c1   c2  c3  c4
        0  NaN  NaN   1   0
        1  NaN  NaN   1   1
        2  NaN  NaN   0   2
        3  NaN  NaN   0   3
        4  NaN  NaN   0   4
        5  NaN  NaN   0   5
        6  NaN  NaN   2   6
        7  NaN  NaN   0   7
        8  NaN  NaN   1   8

In [6]: df.set_dtypes(types=[np.int64]*4, errors='ignore').dtypes
Out[6]: c1     object
        c2     object
        c3     object
        c4      int64
        dtype: object
Additional Notes
I understand that date and time types will be a little more difficult to infer. However, following the logic powering the pandas.read_* functions, date and time types would not be inferred automatically but would be passed explicitly by the user.
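For instance, read_csv leaves dates as object unless told otherwise (a minimal sketch using an inline CSV):

In [7]: import io
        pd.read_csv(io.StringIO('d,x\n2016-01-01,1\n2016-01-02,2')).dtypes
Out[7]: d    object
        x     int64
        dtype: object

In [8]: pd.read_csv(io.StringIO('d,x\n2016-01-01,1\n2016-01-02,2'),
                    parse_dates=['d']).dtypes
Out[8]: d    datetime64[ns]
        x             int64
        dtype: object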
It would be a one-size-fits-all solution if users were also allowed to pass True or False in place of a dtype when specifying dtypes per column. True would mean infer automatically (set the best dtype), while False would mean exclude the column from conversion.
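A hypothetical call mixing all three behaviors might then look like this (none of this exists today; the sketch only illustrates the proposal):

In [9]: df.set_dtypes(types={'c1': False,       # exclude this column from conversion
                             'c2': True,        # infer the best dtype automatically
                             'c3': np.float32,  # force an explicit dtype
                             'c4': np.int8},
                      errors='raise')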
Comment From: jreback
In [8]: df.dtypes
Out[8]:
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [9]: df.stack().unstack().dtypes
Out[9]:
c1    object
c2    object
c3    object
c4    object
dtype: object

In [10]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='ignore')).dtypes
Out[10]:
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [11]: pd.to_numeric?

# you can even request downcasting
In [12]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer')).dtypes
Out[12]:
c1     object
c2     object
c3    float64
c4       int8
dtype: object
Comment From: jreback
If you really want to coerce:

In [13]: df.stack().unstack().apply(lambda x: pd.to_numeric(x, errors='coerce')).dtypes
Out[13]:
c1    float64
c2    float64
c3    float64
c4      int64
dtype: object
Comment From: jreback
Note that the downcast keyword is new in 0.19.0; .to_numeric has been around for a while.
Comment From: jreback
docs are: http://pandas.pydata.org/pandas-docs/stable/basics.html#object-conversion
Comment From: jreback
Explicit astyping (per column, via a dict) is also available in 0.19.0:

In [9]: df.stack().unstack().astype({'c3': 'float', 'c4': 'int8'}).dtypes
Out[9]:
c1     object
c2     object
c3    float64
c4       int8
dtype: object
All of the above said, if you think the docs could use an enhancement, then please submit a doc PR.
Comment From: jreback
Further note that the stack/unstack idiom does not generally play nice with mixed object and other dtypes, so a more typical pattern is:
In [15]: df.set_index(['c1', 'c2']).stack().unstack().reset_index().dtypes
Out[15]:
c1     object
c2     object
c3    float64
c4    float64
dtype: object
Comment From: dragonator4
@jreback, Thank you for all the replies. However, how would you explain the following behavior?
In [1]: df = pd.DataFrame({'c1': list('ABC'),
                           'c2': list('123'),
                           'c3': np.random.randn(3),
                           'c4': np.arange(3)})
        df.dtypes
Out[1]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [2]: df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=0).dtypes
Out[2]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object
The expected output (using the .set_dtypes method I proposed):

In [2]: df.set_dtypes(infer=True).dtypes
Out[2]: c1     object
        c2      int64
        c3    float64
        c4      int64
        dtype: object
Note that this cannot be done by ignoring errors in pd.to_numeric. Errors will have to be coerced to get the desired output, which loses data in the object column while upcasting the int64 columns to float64. downcast='integer' does nothing in this case.
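To make the data loss concrete, here is a sketch of the coerce route on this frame, where every letter in c1 is replaced by NaN even though the other columns convert cleanly:

In [3]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dtypes
Out[3]: c1    float64
        c2      int64
        c3    float64
        c4      int64
        dtype: object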
In my present use case, I am fitting models to data from thousands of sensors and returning a Series with model result information (see this SO question). The resulting DataFrame is then unstacked (see the answer to that question), and I get a frame with many columns and hundreds of thousands of rows.
Unfortunately, not all rows in a given column have the same type. For example, if the model fits one parameter, I get a single number; if it fits multiple parameters, I get a list. Since I cannot be expected to manually examine every value in the frame, and I cannot afford to lose data through typecasting, I need a better method to cast intelligently so that I can build further analysis on the model results.
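A minimal sketch of such a column (hypothetical values): a scalar for one sensor and a list for another, so the column can only ever be object, and any blanket coercion destroys the lists.

In [4]: pd.Series({'sensor_1': 0.57, 'sensor_2': [0.57, 1.2]})
Out[4]: sensor_1           0.57
        sensor_2    [0.57, 1.2]
        dtype: object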
Even the following doesn't help:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'), axis=0).fillna(df)
Comment From: TomAugspurger
Your call here

df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=1).dtypes

is applying pd.to_numeric to each row. Are you sure you want axis=1?
Comment From: dragonator4
Sorry @TomAugspurger, that was a typo, out of habit. I want axis=0. I have edited accordingly.
Actually, correcting that typo results in this behavior:
In [3]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df).dtypes
Out[3]: c1     object
        c2      int64
        c3     object
        c4      int64
        dtype: object
Why is column c3 converted from float64 to object?
Comment From: jreback
In [45]: df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df).apply(lambda x: pd.to_numeric(x, errors='ignore')).dtypes
Out[45]:
c1     object
c2      int64
c3    float64
c4      int64
dtype: object
Filling a formerly float column (e.g. c1) with objects works, but it also coerces c3 to object because it is in the same block. This is an implementation detail ATM, but I'll open an issue for it.
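One way to sidestep the block-level coercion in the meantime (an untested sketch, not an official idiom) is to fill each column separately, so a fill on one column can never touch another:

In [46]: converted = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
         pd.DataFrame({c: converted[c].fillna(df[c]) for c in df.columns},
                      columns=df.columns).dtypes
Out[46]:
c1     object
c2      int64
c3    float64
c4      int64
dtype: object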
Having another method for this operation is really not worth it. We already have .astype() and explicit conversions.
Comment From: dragonator4
I agree, having another method is not worth it if the method is not doing anything new. But the proposed method can, in principle, replace .astype, the pd.to_* functions, and the erstwhile .convert_objects method. It could look like pd.set_dtypes(<frame or series>, infer=<boolean>, types=<list or dict>, errors=<'ignore', 'coerce', 'raise'>). Basically, it is a one-size-fits-all solution.
The current solution of applying to_numeric twice and filling NaNs is a bit too roundabout and clunky. It will also mess up datetime columns. In my use case, datetime is part of the index (the raw data is grouped on datetime), but if it weren't (as in, if I were returning datetimes), I can foresee it having problems. After converting with to_numeric (applied twice), one would also have to apply to_datetime in the case of datetime columns.
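For example, with a hypothetical frame df2 holding a datetime-string column, a third, guarded pass would be needed (a sketch; the dtype check only avoids reinterpreting numeric columns as epochs):

In [3]: df2 = pd.DataFrame({'t': ['2016-01-01', '2016-01-02'],
                            'x': ['1', '2']})
        df2 = df2.apply(lambda x: pd.to_numeric(x, errors='ignore'))
        df2.apply(lambda x: pd.to_datetime(x, errors='ignore')
                  if x.dtype == object else x).dtypes
Out[3]: t    datetime64[ns]
        x             int64
        dtype: object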
Pandas already does a great job of converting datetime and timedelta types by inferring the format. Basically, what I am driving at is a universal method to handle all generic forms of typecasting, with the ability to infer types automatically if needed and to take user input as desired. I agree the algorithm for such a method might get a bit hairy with datetime types, but pandas is known for its simple flexibility.
A method that handles all simple cases of typecasting does not exist. I rest my case here and leave the decision to you. But a universal type setting and resetting method is worth considering.
Comment From: jreback
@dragonator4 Methods with lots of potentially overlapping keyword arguments are confusing and non-pythonic.
You should re-examine your pipeline to see why, in fact, you are doing .stack().unstack(); as I pointed out above, this is not a normal idiom.
Further, the reason .convert_objects() was removed is ambiguity in type detection, especially w.r.t. datetimes/timedeltas: multiple answers are correct and it is an underspecified problem.
If you want to suggest a small API addition that fills a need that is not currently satisfied, please do so. But, for example, removing .astype() is a non-starter; it is a long-standing API method across many ecosystems.
As I xrefed, the convert-fill-convert idiom is buggy ATM in this specific case. Pull requests to fix it are also welcome.