Code Sample
from pandas import DataFrame, to_datetime
df1 = DataFrame({'A': [1, None], 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})
df1.update(df2, overwrite=False)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-a766b5317aac> in <module>()
1 df1 = DataFrame({'A': [1,None], 'B':[to_datetime('abc', errors='coerce'),to_datetime('2016-01-01')]})
2 df2 = DataFrame({'A': [2,3]})
----> 3 df1.update(df2, overwrite=False)
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/frame.py in update(self, other, join, overwrite, filter_func, raise_conflict)
3897
3898 self[col] = expressions.where(mask, this, that,
-> 3899 raise_on_error=True)
3900
3901 # ----------------------------------------------------------------------
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in where(cond, a, b, raise_on_error, use_numexpr)
229
230 if use_numexpr:
--> 231 return _where(cond, a, b, raise_on_error=raise_on_error)
232 return _where_standard(cond, a, b, raise_on_error=raise_on_error)
233
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in _where_numexpr(cond, a, b, raise_on_error)
152
153 if result is None:
--> 154 result = _where_standard(cond, a, b, raise_on_error)
155
156 return result
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in _where_standard(cond, a, b, raise_on_error)
127 def _where_standard(cond, a, b, raise_on_error=True):
128 return np.where(_values_from_object(cond), _values_from_object(a),
--> 129 _values_from_object(b))
130
131
TypeError: invalid type promotion
Problem description
A similar problem to issue #15593, which was fixed in pandas 0.20.2: NaT values anywhere in the DataFrame still trigger the exception TypeError: invalid type promotion.
Output of pd.show_versions()
Comment From: olizhu
I've just tested some more, and it seems the error occurs whenever there is a null object in a column containing datetimes. So replacing NaT with NaN still produces the same error.
Comment From: TomAugspurger
So when we reindex df2 like df1, we end up with different dtypes:

In [22]: df2.reindex_like(df1).dtypes
Out[22]:
A      int64
B    float64
dtype: object

I wonder if we could add a parameter to reindex_like for controlling the dtypes of columns that are created. I could see that being broadly useful.
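The dtype mismatch described above can be reproduced on its own. This is a minimal sketch (assuming pandas is installed) showing how reindex_like fabricates an all-NaN float64 column where df1 has a datetime column:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, None],
                    'B': [pd.NaT, pd.Timestamp('2016-01-01')]})
df2 = pd.DataFrame({'A': [2, 3]})

# df2 has no column 'B', so reindex_like creates one filled with NaN,
# and an all-NaN column defaults to float64 rather than datetime64[ns].
reindexed = df2.reindex_like(df1)
print(reindexed.dtypes)
```

Combining that float64 column with df1's datetime64[ns] column inside update() is what ultimately raises "invalid type promotion" in np.where.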
Comment From: sboltz
I just encountered this issue and was wondering if there were any updates or workarounds for it? Thanks.
Comment From: IanFLee
I'm also having this issue in sklearn.preprocessing with StandardScaler(). It definitely seems to be a datetime issue, so I've dropped that column for the time being, but eventually I'll need it back, so fingers crossed.
Comment From: wxing11
take
Comment From: wxing11
Hi, new contributor here so please correct me if I'm wrong!
This seems to be caused by situations where the DataFrame to be updated has a datetime column with NaT values and the input DataFrame has either
- A matching column by index but of a type that isn't datetime/object. I assume an error here is expected.
- No matching column by index, so the call to reindex_like in the update function creates a column that isn't of type datetime/object (the example case above).
Since in the second case the created column contains only NA values, would it be reasonable to solve this by adding a check that skips updating any column consisting entirely of NA values?
I created a PR with an implementation of this as well as a couple new test cases including the one introduced above.
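Until a fix lands, one user-side workaround is to restrict the update to the columns both frames actually share. This is just a hedged sketch of that idea, not the PR's implementation:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, None],
                    'B': [pd.NaT, pd.Timestamp('2016-01-01')]})
df2 = pd.DataFrame({'A': [2, 3]})

# Restrict the update to shared columns, so update() never sees the
# all-NaN float64 column that reindex_like would otherwise create for 'B'.
shared = df1.columns.intersection(df2.columns)
updated = df1[shared].copy()
updated.update(df2[shared], overwrite=False)
df1[shared] = updated
print(df1)
```

With overwrite=False, only the null entries of df1 are filled from df2, and the datetime column is left untouched.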
Comment From: MarcoGorelli
I wonder if we could add a parameter to reindex_like for controlling the dtypes of columns that are created

How would this work? Would the dtype be taken from the other DataFrame in reindex? Because if so, one issue would be null columns getting converted to bool:

In [1]: pd.Series([np.nan, np.nan]).astype(bool)
Out[1]:
0    True
1    True
dtype: bool

Alternatively, there could be an option to exclude null columns from the result of reindex_like, but then that would still require an update to
https://github.com/pandas-dev/pandas/blob/2d126dd0c5fd9768a772ffefede956dfff827667/pandas/core/frame.py#L8196-L8198
to skip over columns which aren't in both this and that.

At the moment, I'm struggling to see a simpler solution than that proposed in https://github.com/pandas-dev/pandas/pull/49395 cc @mroeschke (as you'd commented on the PR)
Comment From: mroeschke
Maybe a full reindex_like is not needed at all, as only the shared columns should be updated?
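That suggestion could look roughly like the following: align rows only, then iterate over the shared columns. This is a hypothetical sketch of the idea (update_shared_only is an invented name, not the actual pandas internals):

```python
import pandas as pd

def update_shared_only(target: pd.DataFrame, other: pd.DataFrame) -> None:
    """Hypothetical helper mimicking DataFrame.update(overwrite=False),
    but touching only columns present in both frames."""
    # Align rows only; columns missing from `other` are simply skipped,
    # so no all-NaN placeholder column is ever created.
    other = other.reindex(target.index)
    for col in target.columns.intersection(other.columns):
        this = target[col]
        that = other[col]
        # overwrite=False semantics: keep existing non-null values,
        # fill nulls from the other frame.
        target[col] = this.where(this.notna(), that)

df1 = pd.DataFrame({'A': [1, None],
                    'B': [pd.NaT, pd.Timestamp('2016-01-01')]})
df2 = pd.DataFrame({'A': [2, 3]})
update_shared_only(df1, df2)
print(df1)
```

Because 'B' never enters the loop, the datetime column and the int column are never promoted against each other, and the original TypeError cannot occur.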
Comment From: wxing11
I pushed a new commit to my PR that only reindexes rows and then skips non-matching columns. Does that seem right for what you were suggesting?