Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.concat([
    pd.date_range(start='1/1/2018', end='1/30/2018').to_series().dt.tz_localize("UTC"),
    pd.date_range(start='1/1/2018', end='1/20/2018').to_series().dt.tz_localize("UTC")
], axis=1)
df.min(axis=1)
Problem description
Without tz_localize, everything works as expected. Also, if both columns have the same length and time zone, it works just fine. However, the combination of time zones and different lengths results in:
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 NaN
2018-01-09 NaN
2018-01-10 NaN
2018-01-11 NaN
...
of type float64.
Expected Output
List of dates
Output of pd.show_versions()
I have found some related fixed issues, but not exactly this one.
Comment From: MYUNGJE
May I take a look at it?
Comment From: TomAugspurger
That'd be great!
Comment From: MaximeWeyl
Here is a workaround I'm using, but it is slow:
def __column_max_without_nans(df):
    """Takes the row-wise max across columns, avoiding a bug in pandas.

    Link to GitHub issue: https://github.com/pandas-dev/pandas/issues/27794
    """
    # Reduce only the rows without missing values, then restore the full index.
    return df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)
Comment From: randomgambit
Great! My workaround: get rid of the timezones altogether with dt.tz_localize(None). But this might not be ideal in certain cases.
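A minimal sketch of that idea, using the frame from the original report: strip the time zone from each column, reduce, then re-attach UTC. The names naive and row_min are illustrative only, and re-localizing to "UTC" assumes every column started in UTC.

```python
import pandas as pd

# Frame from the original report: two UTC columns of different lengths.
df = pd.concat([
    pd.date_range(start='1/1/2018', end='1/30/2018').to_series().dt.tz_localize("UTC"),
    pd.date_range(start='1/1/2018', end='1/20/2018').to_series().dt.tz_localize("UTC"),
], axis=1)

# Drop the time zone column by column; the values become naive datetimes.
naive = df.apply(lambda col: col.dt.tz_localize(None))

# The row-wise reduction works on naive datetimes, so reduce first and
# then re-attach the time zone (assumption: all columns were UTC).
row_min = naive.min(axis=1).dt.tz_localize("UTC")
print(row_min.tail())
```

This loses information if the columns carry different time zones, which is one of the cases where the approach is not ideal.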
Comment From: jkukul
@MaximeWeyl, your workaround only solves the issue partially, i.e. when all timestamps (with timezone) are present in an axis.
Consider the following example:
s1 = pd.Series([pd.Timestamp('2017-01-01T10', tz='UTC'), pd.Timestamp('2017-01-01T11', tz='UTC')])
s2 = pd.Series([pd.Timestamp('2017-01-01T12', tz='UTC'), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))
Expected output of df.min(axis=1) (without the bug) is:
0 2017-01-01 10:00:00+00:00
1 2017-01-01 11:00:00+00:00
dtype: datetime64[ns, UTC]
Applying your workaround:
df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)
gives:
0 2017-01-01 12:00:00+00:00
1 NaT
dtype: datetime64[ns, UTC]
As you can see, your workaround helps for the 0th row, but returns an incorrect value for the 1st row, which has a NaT value in one of the columns.
Comment From: jkukul
Here's my workaround - instead of using:
df.min(axis=1)
I'm using:
df.apply(lambda s: s[s.notnull()].min(), axis=1)
Comment From: jbrockmendel
The underlying issue here is that df.min(axis=1) is going through a path that accesses df.values instead of operating block-wise.
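A small sketch of what accessing df.values means for this frame, assuming the pandas versions I am aware of: each column keeps its timezone-aware dtype, but the combined 2D array is object dtype, so the datetime information is no longer carried by the array itself. The exact internal code path is not shown in this thread; this only illustrates the dtype change.

```python
import pandas as pd

# Frame from the original report: two UTC columns of different lengths.
df = pd.concat([
    pd.date_range(start='1/1/2018', end='1/30/2018').to_series().dt.tz_localize("UTC"),
    pd.date_range(start='1/1/2018', end='1/20/2018').to_series().dt.tz_localize("UTC"),
], axis=1)

# Column-wise, the dtypes are timezone-aware datetimes.
print(df.dtypes)

# The interleaved 2D array cannot hold a tz-aware dtype, so it is a plain
# object array of Timestamp / NaT values.
values = df.values
print(values.dtype)
```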
Comment From: LanSaid
Sadly, it's still alive in 1.2.3. Any idea how we can fix it?
Comment From: jbrockmendel
Part of the way there: passing numeric_only=False gives the expected result for https://github.com/pandas-dev/pandas/issues/27794#issuecomment-596760964
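A hedged sketch of that call, assuming the linked comment refers to the two-row s1/s2 example earlier in this thread. It also runs the apply-based workaround for comparison; row_min and fallback are illustrative names.

```python
import pandas as pd

# Assumption: this is the example the linked comment refers to.
s1 = pd.Series([pd.Timestamp('2017-01-01T10', tz='UTC'), pd.Timestamp('2017-01-01T11', tz='UTC')])
s2 = pd.Series([pd.Timestamp('2017-01-01T12', tz='UTC'), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))

# Explicitly ask for a reduction over all columns, not only numeric ones.
row_min = df.min(axis=1, numeric_only=False)

# Row-by-row workaround from earlier in the thread, for comparison.
fallback = df.apply(lambda s: s[s.notnull()].min(), axis=1)

print(row_min)
print(fallback)
```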
Comment From: jbrockmendel
Fixed by the removal of numeric_only=None, closing.