pandas.to_datetime called with an int is too slow for my use case. Basically, I have a loop that sequentially gets an integer from a generator of about 1,000,000 numbers, converts it to a pandas.Timestamp, and passes it to a function. A profiler says that the call to pandas.to_datetime takes about 40% of the total run time of my program.
Compared to datetime.datetime.fromtimestamp, it's more than 60 times slower:
$ python -m timeit -n 1000000 -s 'import datetime' 'datetime.datetime.fromtimestamp(30, tz=datetime.timezone.utc)'
1000000 loops, best of 3: 0.889 usec per loop
$ python -m timeit -n 1000000 -s 'import pandas' 'pandas.to_datetime(30, utc=True, unit="s")'
1000000 loops, best of 3: 62.8 usec per loop
$ python -c 'import pandas;pandas.show_versions()'
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
Can you please provide/document a faster way to create a pandas.Timestamp from an epoch time?
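To make the reported pattern concrete, here is a minimal sketch of the per-item loop being profiled; gen_epochs and consume are hypothetical stand-ins for the real generator and the function the Timestamps are passed to:

```python
import pandas as pd

def gen_epochs(n):
    """Hypothetical stand-in for the real generator of ~1,000,000 epoch seconds."""
    for i in range(n):
        yield 30 + i

def consume(ts):
    """Hypothetical stand-in for the real per-Timestamp function."""
    return ts.year

# The slow pattern: one pd.to_datetime call per integer.
results = [consume(pd.to_datetime(v, utc=True, unit="s"))
           for v in gen_epochs(1000)]
```

With ~1,000,000 items, the per-call overhead of pd.to_datetime dominates, which matches the ~40% figure from the profiler.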
Comment From: jreback
Why would you do this in a loop? Simply pass the entire list.
Comment From: jreback
In [10]: r = list(range(100000))
In [11]: %timeit [ datetime.datetime.fromtimestamp(30+v, tz=datetime.timezone.utc) for v in r ]
1 loop, best of 3: 251 ms per loop
In [12]: %timeit pd.to_datetime(r, utc=True, unit='s')
10 loops, best of 3: 84.5 ms per loop
Comment From: radekholy24
Because the whole dataset from the generator does not fit into memory? But yeah, I can do that in my case.
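The memory concern above can be bridged by chunking: buffer the generator into fixed-size lists and convert each list with one vectorized pd.to_datetime call. A minimal sketch, where gen_epochs and in_chunks are hypothetical helpers, not pandas API:

```python
import itertools
import pandas as pd

def gen_epochs(n):
    """Hypothetical stand-in for the real generator of epoch seconds."""
    for i in range(n):
        yield 30 + i

def in_chunks(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

# One vectorized call per chunk instead of one call per integer,
# while only `size` items are ever held in memory at once.
stamps = []
for chunk in in_chunks(gen_epochs(2500), 1000):
    stamps.extend(pd.to_datetime(chunk, utc=True, unit="s"))
```

The chunk size trades memory for amortization of the per-call overhead; a few thousand items per chunk is usually enough to make the fixed cost negligible.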
Comment From: TomAugspurger
FYI @PyDeq
In [652]: %timeit pd.Timestamp.utcfromtimestamp(30)
The slowest run took 11.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.76 µs per loop
vs
In [653]: %timeit datetime.datetime.fromtimestamp(30, tz=datetime.timezone.utc)
The slowest run took 14.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.62 µs per loop
But agreed with @jreback, you're much better off using vectorized methods in pandas.
Comment From: radekholy24
@TomAugspurger thanks. Unfortunately, pd.Timestamp.utcfromtimestamp is not documented.
Comment From: TomAugspurger
Mind opening a PR to fix that?
Comment From: radekholy24
No promises, but I can consider doing a PR if I find some spare time, sure.
Comment From: TomAugspurger
https://github.com/pandas-dev/pandas/issues/5218 seems to be the reason it's not in the API docs at the moment.
Comment From: jorisvandenbossche
Using just the plain Timestamp constructor is actually also fast:
In [51]: %timeit pd.Timestamp.utcfromtimestamp(30)
The slowest run took 39.10 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.77 µs per loop
In [52]: %timeit pd.Timestamp(30, unit='s', tz='UTC')
The slowest run took 14.56 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.97 µs per loop
And this one is documented (so I would prefer it over Timestamp.utcfromtimestamp).
(It also gives me the impression that the performance of to_datetime can certainly be improved for this case.)
Comment From: radekholy24
@jorisvandenbossche, what document do you mean? So far, I've found only examples with strings as the first argument and no unit or tz arguments.
Anyway, thanks. That is actually what I expected to be the resolution of this request.