Code Sample, a copy-pastable example if possible
import time
import pandas as pd
a = pd.Series([1, 1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T01:00:00+01:00'),
                                              pd.Timestamp('2019-04-07T01:00:00+01:00')]))
a = a.resample('1s').sum()  # ~500k items (6 days at 1-second resolution)
# same TZ
b = pd.Series([1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T02:00:00+01:00')]))
b = b.resample('1s').sum() # 1 item
# different TZ
c = pd.Series([1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T02:00:00+02:00')]))
c = c.resample('1s').sum() # 1 item
start = time.time()
pd.DataFrame({'A': a, 'B': b})
print('same TZ:', time.time() - start)
start = time.time()
pd.DataFrame({'A': a, 'C': c})
print('different TZ:', time.time() - start)
Problem description
A DataFrame is constructed from two Series. Performance degrades significantly when the Series indexes have different time zones. The code above prints (times in seconds):
same TZ: 0.037207841873168945
different TZ: 8.304282188415527
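A single `time.time()` delta can be noisy, so the comparison could also be repeated with `timeit`. The following is only a sketch: `make_series` and `best_time` are hypothetical helper names, and the series is deliberately much shorter than in the report so it finishes quickly even on affected versions.

```python
import timeit

import pandas as pd


def make_series(start, periods):
    """Build a Series of ones on a 1-second, tz-aware index (hypothetical helper)."""
    index = pd.date_range(pd.Timestamp(start), periods=periods, freq='s')
    return pd.Series(1, index=index)


a = make_series('2019-04-01T01:00:00+01:00', 10000)
b = make_series('2019-04-01T02:00:00+01:00', 1)  # same offset as a
c = make_series('2019-04-01T02:00:00+02:00', 1)  # different offset


def best_time(func, repeat=3):
    """Return the fastest of `repeat` single runs, in seconds (hypothetical helper)."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


same_tz = best_time(lambda: pd.DataFrame({'A': a, 'B': b}))
diff_tz = best_time(lambda: pd.DataFrame({'A': a, 'C': c}))
print('same TZ: {:.4f}s'.format(same_tz))
print('different TZ: {:.4f}s'.format(diff_tz))
```

Taking the minimum over a few runs reduces the influence of one-off system noise; absolute numbers will differ by machine and pandas version.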
Expected Output
Construction time is expected to be roughly the same in both cases.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-58-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : 4.4.1
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.0.3
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : 3.5.1
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
Comment From: jreback
you are trying to combine disparate indexes
In [17]: c.index.tz
Out[17]: pytz.FixedOffset(120)
In [18]: a.index.tz
Out[18]: pytz.FixedOffset(60)
So these need to be aligned, which by definition means they must be converted to UTC.
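To illustrate why UTC is the common ground, here is a minimal sketch using the two timestamps from the report. It only uses `tz_convert`; the variable names are illustrative.

```python
import pandas as pd

ts_a = pd.Timestamp('2019-04-01T01:00:00+01:00')  # offset +01:00, as in series A
ts_c = pd.Timestamp('2019-04-01T02:00:00+02:00')  # offset +02:00, as in series C

# Different wall-clock representations of the same instant.
print(ts_a == ts_c)  # True

idx_a = pd.DatetimeIndex([ts_a])
idx_c = pd.DatetimeIndex([ts_c])

# After converting both to UTC the indexes share one time zone and can be compared directly.
utc_a = idx_a.tz_convert('UTC')
utc_c = idx_c.tz_convert('UTC')
print(utc_a.equals(utc_c))
```

Both timestamps correspond to 2019-04-01 00:00:00 UTC, so the single row of C lines up with the first row of A once both indexes are expressed in UTC.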
prun shows
In [19]: %prun pd.DataFrame({'A': a, 'C': c})
11409056 function calls (11409012 primitive calls) in 18.003 seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1555212   12.046    0.000   14.302    0.000 datetimes.py:601(<lambda>)
        2    1.416    0.708    1.515    0.758 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        4    0.955    0.239   15.257    3.814 {pandas._libs.lib.map_infer}
        2    0.886    0.443    0.950    0.475 base.py:1726(is_unique)
  1555231    0.838    0.000    1.924    0.000 datetimes.py:625(tz)
  1555747    0.632    0.000    0.872    0.000 {built-in method builtins.getattr}
  1555249    0.241    0.000    0.241    0.000 dtypes.py:701(tz)
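We are likely taking an inefficient path here when converting: the profile shows a lambda in datetimes.py being called about 1.5 million times through map_infer, i.e. once per element, when we know up front that we should instead use a vectorized conversion to UTC.
If you'd like to look into this in more detail, that would be great.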
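Until the slow path is fixed, one possible user-side workaround (a sketch, assuming a UTC index on the result is acceptable) is to convert each Series to UTC explicitly before building the DataFrame, so that no mixed-offset alignment is needed. `to_utc` is a hypothetical helper name, and the series is shortened to one hour for brevity.

```python
import pandas as pd


def to_utc(series):
    """Return a copy of `series` whose tz-aware index is converted to UTC (hypothetical helper)."""
    return series.tz_convert('UTC')


# One hour of 1-second data at offset +01:00.
a = pd.Series(1, index=pd.date_range(pd.Timestamp('2019-04-01T01:00:00+01:00'),
                                     periods=3600, freq='s'))
# A single point at offset +02:00 (the same instant as the first row of a).
c = pd.Series([1], index=pd.DatetimeIndex([pd.Timestamp('2019-04-01T02:00:00+02:00')]))

# Both indexes are already in UTC when the constructor aligns them.
df = pd.DataFrame({'A': to_utc(a), 'C': to_utc(c)})
print(df.index.tz)
print(df.head(3))
```

`Series.tz_convert` operates on the whole index at once, so the conversion cost stays small compared with per-element handling.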
Comment From: jreback
cc @jbrockmendel
Comment From: jbrockmendel
I'm now seeing 25.5ms where this used to be 14.3s. Closing as complete.