Pandas: Poor performance when constructing a DataFrame from Series whose indexes have different time zones

Code Sample, a copy-pastable example if possible

import time

import pandas as pd

a = pd.Series([1, 1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T01:00:00+01:00'),
                                              pd.Timestamp('2019-04-07T01:00:00+01:00')]))
a = a.resample('1s').sum()  # ~500k items

# same TZ
b = pd.Series([1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T02:00:00+01:00')]))
b = b.resample('1s').sum()  # 1 item

# different TZ
c = pd.Series([1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T02:00:00+02:00')]))
c = c.resample('1s').sum()  # 1 item

start = time.time()
pd.DataFrame({'A': a, 'B': b})
print('same TZ:', time.time() - start)

start = time.time()
pd.DataFrame({'A': a, 'C': c})
print('different TZ:', time.time() - start)

Problem description

A DataFrame is constructed from two Series. Performance degrades significantly when the Series indexes have different time zones. The code above prints:

same TZ: 0.037207841873168945
different TZ: 8.304282188415527

Expected Output

Construction time should be roughly the same in both cases.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.7.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-58-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.16.4
pytz             : 2019.1
dateutil         : 2.8.0
pip              : 19.1.1
setuptools       : 41.0.1
Cython           : None
pytest           : 4.4.1
hypothesis       : None
sphinx           : 1.8.5
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : 7.6.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.0.3
numexpr          : 2.6.9
odfpy            : None
openpyxl         : 2.6.2
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.2.1
sqlalchemy       : None
tables           : 3.5.1
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None

Comment From: jreback

You are trying to combine indexes that have different time zones:

In [17]: c.index.tz                                                                                                                                         
Out[17]: pytz.FixedOffset(120)

In [18]: a.index.tz                                                                                                                                         
Out[18]: pytz.FixedOffset(60)

These indexes need to be aligned, so by definition they must first be converted to UTC.
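To illustrate why alignment goes through UTC, here is a minimal sketch (not taken from the pandas internals; the names `idx_a`, `idx_c`, `tz_differs` and `same_instant` are made up for this example). It builds two single-element indexes with the offsets from the report and checks that the time zones differ while the underlying instants coincide once both are expressed in UTC.

```python
import pandas as pd

# Two indexes holding the same instant, written with different fixed offsets.
idx_a = pd.DatetimeIndex([pd.Timestamp('2019-04-01T01:00:00+01:00')])
idx_c = pd.DatetimeIndex([pd.Timestamp('2019-04-01T02:00:00+02:00')])

# The tz attributes are not equal, so the indexes cannot be aligned as-is.
tz_differs = idx_a.tz != idx_c.tz

# After converting both to UTC, the values can be compared directly.
same_instant = idx_a.tz_convert('UTC').equals(idx_c.tz_convert('UTC'))

print('time zones differ:', tz_differs)
print('same instant in UTC:', same_instant)
```

`DatetimeIndex.tz_convert` operates on the whole index at once, which is the kind of vectorized conversion one would hope the constructor uses.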

`%prun` shows:

In [19]: %prun pd.DataFrame({'A': a, 'C': c})                                                                                                               

         11409056 function calls (11409012 primitive calls) in 18.003 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1555212   12.046    0.000   14.302    0.000 datetimes.py:601(<lambda>)
        2    1.416    0.708    1.515    0.758 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        4    0.955    0.239   15.257    3.814 {pandas._libs.lib.map_infer}
        2    0.886    0.443    0.950    0.475 base.py:1726(is_unique)
  1555231    0.838    0.000    1.924    0.000 datetimes.py:625(tz)
  1555747    0.632    0.000    0.872    0.000 {built-in method builtins.getattr}
  1555249    0.241    0.000    0.241    0.000 dtypes.py:701(tz)

We are likely taking an inefficient path here when attempting the conversion: the profile shows about 1.5 million per-element calls to a lambda in datetimes.py, when we know up front that we should instead use a vectorized conversion to UTC.
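Until the slow path is addressed, a possible workaround (my assumption, not something confirmed in this thread) is to convert each Series to UTC explicitly before building the DataFrame, so that the constructor only sees indexes sharing one time zone. The sketch below uses a much shorter range than the original report to keep it quick. The variable names `a_utc`, `c_utc` and `df` are illustrative.

```python
import pandas as pd

# Shortened version of the series from the report (11 items instead of ~500k).
a = pd.Series([1, 1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T01:00:00+01:00'),
                                              pd.Timestamp('2019-04-01T01:00:10+01:00')]))
a = a.resample('1s').sum()

# Different fixed offset, same instant as the first element of `a`.
c = pd.Series([1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T02:00:00+02:00')]))
c = c.resample('1s').sum()

# Vectorized conversion of each index to UTC before alignment.
a_utc = a.tz_convert('UTC')
c_utc = c.tz_convert('UTC')

df = pd.DataFrame({'A': a_utc, 'C': c_utc})
print(df.index.tz)
print(df.head())
```

The resulting index is in UTC. If the original offset matters for display, `df.tz_convert(...)` can be applied afterwards. Whether this avoids the slowdown on a given pandas version should be measured rather than assumed.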
