There is a huge performance difference between to_datetime and astype for an epoch time series:
Pandas to_datetime performance:
%timeit pd.to_datetime(pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000)))
Output: 1.51 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Pandas astype('datetime64[ns]') performance:
%timeit pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000)).astype('datetime64[ns]')
57.5 ms ± 987 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am unable to find the reason for this performance difference. Any help would be great.
Thanks!
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-79-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: 3.1.2
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.0
openpyxl: 2.5.0a2
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.0
pandas_gbq: None
pandas_datareader: 0.4.0
Comment From: jorisvandenbossche
If you specify the unit, the difference is already much smaller:
In [15]: s = pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000))
In [16]: %timeit pd.to_datetime(s)
1.65 s ± 54.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [17]: %timeit s.astype('datetime64[ns]')
64 ms ± 965 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [18]: %timeit pd.to_datetime(s, unit='ns')
259 ms ± 4.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(but still the difference seems larger than it should be)
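For anyone who wants to reproduce this comparison outside IPython, here is a minimal sketch using the standard-library `timeit` module. The shorter range (100,000 values), the repeat count, and the variable names are assumptions chosen so the script runs quickly; they are not from the measurements above. It also checks that both conversions produce the same values, so only the timing should differ.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a shorter epoch range than in the report, to keep the run fast.
start = 1352500101000000000
step = 6000000000
s = pd.Series(np.arange(start, start + step * 100000, step))

# The two conversions being compared in this thread.
via_to_datetime = pd.to_datetime(s, unit="ns")
via_astype = s.astype("datetime64[ns]")

# Both should give identical datetime64[ns] values.
same_values = bool((via_to_datetime == via_astype).all())
print("identical results:", same_values)

# Time each variant; absolute numbers depend on the machine and pandas version.
for label, func in [
    ("to_datetime(s)", lambda: pd.to_datetime(s)),
    ("to_datetime(s, unit='ns')", lambda: pd.to_datetime(s, unit="ns")),
    ("s.astype('datetime64[ns]')", lambda: s.astype("datetime64[ns]")),
]:
    seconds = timeit.timeit(func, number=3) / 3
    print(label, "->", round(seconds * 1000, 2), "ms per call")
```

Timings will vary by pandas version, so treat the output as a relative comparison only.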
Comment From: jreback
The rest of the difference is related to #17449: the data ends up being copied 3 times internally.
In [10]: %timeit pd.to_datetime(s, unit='ns')
107 ms +- 2.16 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
In [11]: %timeit s.astype('datetime64[ns]')
20.3 ms +- 283 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
In [12]: %timeit s.copy()
22.3 ms +- 533 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
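To illustrate why the number of copies matters here, the following is a rough NumPy-level sketch, not pandas' internal code path. It assumes a small int64 array of epoch nanoseconds and shows that reinterpreting the buffer with `view` allocates nothing, while `astype` produces a new array, which is the kind of cost that `s.copy()` approximates above.

```python
import numpy as np

# Assumption: a small array of epoch nanoseconds, for illustration only.
start = 1352500101000000000
step = 6000000000
arr = np.arange(start, start + step * 1000, step, dtype="int64")

# Reinterpret the same 8-byte integers as datetime64[ns]: no data is copied.
as_view = arr.view("datetime64[ns]")

# astype allocates a new array holding the converted values.
as_copy = arr.astype("datetime64[ns]")

view_shares = np.shares_memory(arr, as_view)
copy_shares = np.shares_memory(arr, as_copy)
print("view shares memory with source:", view_shares)
print("astype result shares memory with source:", copy_shares)
print("values equal:", bool((as_view == as_copy).all()))
```

Each extra internal copy costs roughly one such allocation plus a pass over the data, which is consistent with the gap shrinking once the copies are removed.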
Comment From: jreback
Closing, but if you want to help on that other issue, that would be great.