There is a huge performance difference between to_datetime and astype for an epoch time series:

Pandas to_datetime performance:

%timeit pd.to_datetime(pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000)))

Output: 1.51 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas astype('datetime64[ns]') performance:

%timeit pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000)).astype('datetime64[ns]')

Output: 57.5 ms ± 987 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I am unable to find the reason for this performance difference. Any help would be great.

Thanks!

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-79-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.0
openpyxl: 2.5.0a2
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.0
pandas_gbq: None
pandas_datareader: 0.4.0

Comment From: jorisvandenbossche

If you specify the unit, the difference is already much smaller:

In [15]: s = pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000))

In [16]: %timeit pd.to_datetime(s)
1.65 s ± 54.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]: %timeit s.astype('datetime64[ns]')
64 ms ± 965 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [18]: %timeit pd.to_datetime(s, unit='ns')
259 ms ± 4.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(but still the difference seems larger than it should be)
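
Before comparing timings, it may be worth confirming that the three calls really produce the same values. The following is only a minimal sketch: the shortened series and the variable names are assumptions made for illustration and are not part of the original report.

```python
import numpy as np
import pandas as pd

# Same start and step as in the report, but a much shorter series
# (assumption, so the check runs quickly).
start = 1352500101000000000  # epoch time in nanoseconds
step = 6000000000            # 6 seconds in nanoseconds
s = pd.Series(np.arange(start, start + 1000 * step, step, dtype=np.int64))

via_astype = s.astype('datetime64[ns]')
via_default = pd.to_datetime(s)            # integers are read as nanoseconds by default
via_unit = pd.to_datetime(s, unit='ns')    # unit stated explicitly

# Element-wise comparison: all three routes should agree on every timestamp.
same_default = bool((via_astype == via_default).all())
same_unit = bool((via_astype == via_unit).all())
print(same_default, same_unit)
```

If both comparisons hold, the timing gap is purely a matter of how each call gets to the result, not of what it returns.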

Comment From: jreback

The rest of the difference is related to #17449: the data ends up being copied 3 times internally.

In [10]: %timeit pd.to_datetime(s, unit='ns')
107 ms +- 2.16 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)

In [11]: %timeit s.astype('datetime64[ns]')
20.3 ms +- 283 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

In [12]: %timeit s.copy()
22.3 ms +- 533 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
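
The s.copy() timing above acts as a baseline: a conversion that costs about one copy is close to the minimum possible. A rough way to reproduce that comparison outside IPython is sketched below; the helper best_seconds, the series length, and the repeat count are hypothetical choices, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

start = 1352500101000000000  # epoch time in nanoseconds
step = 6000000000            # 6 seconds in nanoseconds
# One million points (assumption); the report used a longer range.
s = pd.Series(np.arange(start, start + 1000000 * step, step, dtype=np.int64))


def best_seconds(func, repeat=5):
    """Return the fastest single-call wall time over `repeat` runs (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=1))


results = {
    'copy': best_seconds(lambda: s.copy()),
    'astype': best_seconds(lambda: s.astype('datetime64[ns]')),
    'to_datetime(unit=ns)': best_seconds(lambda: pd.to_datetime(s, unit='ns')),
}

# Report each timing and its cost relative to a plain copy of the series.
for name, seconds in results.items():
    ratio = seconds / results['copy']
    print('{}: {:.2f} ms ({:.1f}x copy)'.format(name, seconds * 1e3, ratio))
```

A ratio well above 1 for a given call suggests it is doing more than a single pass over the data, which is consistent with the extra internal copies mentioned above.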

Comment From: jreback

Closing, but if you want to help on that other issue, that would be great.