Pandas Min/max does not work for dates with timezones if there are missing values in the data frame

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.concat([
  pd.date_range(start='1/1/2018', end='1/30/2018').to_series().dt.tz_localize("UTC"),
  pd.date_range(start='1/1/2018', end='1/20/2018').to_series().dt.tz_localize("UTC")
], axis=1)

df.min(axis=1)

Problem description

Without tz_localize, everything works as expected. Also, if both columns have the same length and time zones, it works just fine. However, the combination of time zones and different lengths results in:

2018-01-01   NaN
2018-01-02   NaN
2018-01-03   NaN
2018-01-04   NaN
2018-01-05   NaN
2018-01-06   NaN
2018-01-07   NaN
2018-01-08   NaN
2018-01-09   NaN
2018-01-10   NaN
2018-01-11   NaN
...

of type float64.

Expected Output

A list of dates (the row-wise minimum of the non-missing values in each row).

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-17134-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.10
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.3
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.1.8

I have found some related fixed issues, but not exactly this one.

Comment From: MYUNGJE

May I take a look at it?

Comment From: TomAugspurger

That'd be great!

Comment From: MaximeWeyl

Here is a workaround I'm using, but it is slow:

def __column_max_without_nans(df):
    """Takes the max across columns, avoiding a bug in pandas

    Link to github issue : https://github.com/pandas-dev/pandas/issues/27794
    """
    return df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)

Comment From: randomgambit

Great! My workaround: get rid of the timezones altogether with dt.tz_localize(None). This might not be ideal in certain cases, though.
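A minimal sketch of that idea, assuming every column is tz-aware in the same zone (UTC here) so that stripping and re-attaching the zone is lossless. The helper name `row_min_via_naive` and the toy data are illustrative, not from the original report.

```python
import pandas as pd


def row_min_via_naive(df, tz="UTC"):
    """Row-wise min computed on tz-naive copies, then re-localized.

    Assumption: every column of `df` is tz-aware in the same zone `tz`.
    """
    # Strip the time zone column by column (keeps the wall-clock times).
    naive = df.apply(lambda col: col.dt.tz_localize(None))
    # Reduce on the naive data, then attach the zone again.
    return naive.min(axis=1).dt.tz_localize(tz)


df = pd.DataFrame({
    "a": pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                    pd.Timestamp("2017-01-01T11", tz="UTC")]),
    "b": pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT]),
})

result = row_min_via_naive(df)
print(result)
```

For zones with daylight saving time, re-localizing wall-clock times can be ambiguous, which is one of the cases where this approach is not ideal.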

Comment From: jkukul

@MaximeWeyl, your workaround only solves the issue partially, i.e. when all timestamps (with timezone) are present in an axis.

Consider the following example:

s1 = pd.Series([pd.Timestamp('2017-01-01T10', tz='UTC'), pd.Timestamp('2017-01-01T11', tz='UTC')])
s2 = pd.Series([pd.Timestamp('2017-01-01T12', tz='UTC'), pd.NaT])

df = pd.DataFrame(dict(s1=s1, s2=s2))

Expected output of df.min(axis=1) (without the bug) is:

0   2017-01-01 10:00:00+00:00
1   2017-01-01 11:00:00+00:00
dtype: datetime64[ns, UTC]

Applying your workaround:

df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)

gives:

0   2017-01-01 12:00:00+00:00
1                         NaT
dtype: datetime64[ns, UTC]

As you can see, your workaround helps for the 0th row, but returns an incorrect value for the 1st row, which has a NaT value in one of the columns.

Comment From: jkukul

Here's my workaround - instead of using:

df.min(axis=1)

I'm using:

df.apply(lambda s: s[s.notnull()].min(), axis=1)
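For reference, a self-contained sketch that runs this row-wise workaround on a small two-column frame with one missing value. The frame is illustrative data, and the approach is slow on large frames because `apply` calls the lambda once per row.

```python
import pandas as pd

s1 = pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                pd.Timestamp("2017-01-01T11", tz="UTC")])
s2 = pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))

# Drop the missing values inside each row before taking the minimum.
result = df.apply(lambda s: s[s.notnull()].min(), axis=1)
print(result)
```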

Comment From: jbrockmendel

The underlying issue here is that df.min(axis=1) is going through a path that accesses df.values instead of operating block-wise.
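A small sketch of what accessing df.values does to tz-aware columns, using illustrative data. It only shows the dtype change and does not trace the reduction code path itself. The assumption here is that pandas packs tz-aware columns into a 2-D `object` array of `Timestamp`/`NaT` values.

```python
import pandas as pd

df = pd.DataFrame({
    "s1": pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                     pd.Timestamp("2017-01-01T11", tz="UTC")]),
    "s2": pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT]),
})

# Each column keeps its own tz-aware datetime dtype.
print(df.dtypes)

# .values has to pick one NumPy dtype for the whole 2-D array;
# tz-aware data has no native NumPy dtype, so the per-column dtype is lost.
values = df.values
print(values.dtype)
```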

Comment From: LanSaid

Sadly, it's still alive in 1.2.3. Any idea how we can fix it?

Comment From: jbrockmendel

Part of the way there: passing numeric_only=False gives the expected result for https://github.com/pandas-dev/pandas/issues/27794#issuecomment-596760964
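A sketch of that call. The linked comment is not reproduced here, so a two-column frame with one missing value stands in for it. Results depend on the pandas version: on releases where `numeric_only` no longer accepts `None`, passing `False` explicitly is redundant but harmless.

```python
import pandas as pd

s1 = pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                pd.Timestamp("2017-01-01T11", tz="UTC")])
s2 = pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))

# Ask for the reduction over all columns explicitly instead of relying on
# the old numeric_only=None fallback behaviour.
result = df.min(axis=1, numeric_only=False)
print(result)
```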

Comment From: jbrockmendel

Fixed by the removal of numeric_only=None, closing.
