We have a check for whether format == '%Y%m%d', but this actually seems to be slower:
In [86]: s = pd.Series(['20120101']*1000000)
In [87]: %timeit pd.to_datetime(s)
229 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [88]: %timeit pd.to_datetime(s, format='%Y%m%d')
749 ms ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Comment From: jreback
Did something change recently with this? The path now raises an exception every time (inside _attempt_YYYYMMDD). This should be much faster.
Comment From: pratapvardhan
I suspect that for format='%Y%m%d' it goes through _attempt_YYYYMMDD in pandas.core.tools.datetimes, which converts the type from object to int to object again and then calls lib.try_parse_year_month_day before tslib.array_to_datetime, and that seems to take up the time. Without format, it uses tslib.array_to_datetime directly.
Comment From: chris-b1
One change, though not especially recent, is that the ISO 8601 path will now handle '%Y%m%d', so it will be much faster than it once was. It still seems like it should be possible to make _attempt_YYYYMMDD faster; it's not clear to me why we're casting back and forth to objects.
Comment From: mroeschke
I looked into this briefly in https://github.com/mroeschke/pandas/tree/non_object_parsing and was able to get it down ~2x from master by getting rid of casting to object, but it's still slower than not providing a format:
In [13]: pd.__version__
Out[13]: '0.24.0.dev0+890.gd57115285.dirty'
In [14]: s = pd.Series(['20120101']*1000000)
In [15]: %timeit pd.to_datetime(s)
135 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Branch
In [16]: %timeit pd.to_datetime(s, format='%Y%m%d')
240 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# master
In [3]: %timeit pd.to_datetime(s, format='%Y%m%d')
531 ms ± 34.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Given that this branch can provide a slowdown, it might be worth removing this path for now.
Comment From: jreback
This used to be way faster than naive parsing; something changed.
Comment From: jreback
This patch makes the speed almost the same. But since it's just duplicating a lot of paths, it may be easier to simplify this (e.g. if format is %Y%m%d, just drop the format and parse).
I think this got way slower because floating point math was introduced (e.g. / rather than //).
diff --git a/pandas/_libs/tslibs/conversion.pyx b/pandas/_libs/tslibs/conversion.pyx
index 8cf42bf93..2f303e1f2 100644
--- a/pandas/_libs/tslibs/conversion.pyx
+++ b/pandas/_libs/tslibs/conversion.pyx
@@ -481,6 +481,34 @@ cdef _TSObject convert_str_to_tsobject(object ts, object tz, object unit,
return convert_to_tsobject(ts, tz, unit, dayfirst, yearfirst)
+def try_parse_year_month_day(int64_t[:] years, int64_t[:] months,
+ int64_t[:] days):
+ cdef:
+ Py_ssize_t i, n
+ int64_t[:] iresult
+ npy_datetimestruct dts
+
+ n = len(years)
+ if len(months) != n or len(days) != n:
+ raise ValueError('Length of years/months/days must all be equal')
+ result = np.empty(n, dtype='M8[ns]')
+ iresult = result.view('i8')
+
+ dts.hour = 0
+ dts.min = 0
+ dts.sec = 0
+ dts.us = 0
+ dts.ps = 0
+
+ for i in range(n):
+ dts.year = years[i]
+ dts.month = months[i]
+ dts.day = days[i]
+ result[i] = dtstruct_to_dt64(&dts)
+
+ return np.asarray(result)
+
+
cdef inline check_overflows(_TSObject obj):
"""
Check that we haven't silently overflowed in timezone conversion
diff --git a/pandas/_libs/tslibs/parsing.pyx b/pandas/_libs/tslibs/parsing.pyx
index 71bb8f796..cd911153c 100644
--- a/pandas/_libs/tslibs/parsing.pyx
+++ b/pandas/_libs/tslibs/parsing.pyx
@@ -10,7 +10,6 @@ from cython import Py_ssize_t
from cpython.datetime cimport datetime
-
import numpy as np
# Avoid import from outside _libs
@@ -19,7 +18,6 @@ if sys.version_info.major == 2:
else:
from io import StringIO
-
# dateutil compat
from dateutil.tz import (tzoffset,
tzlocal as _dateutil_tzlocal,
@@ -465,23 +463,6 @@ def try_parse_date_and_time(object[:] dates, object[:] times,
return result.base # .base to access underlying ndarray
-def try_parse_year_month_day(object[:] years, object[:] months,
- object[:] days):
- cdef:
- Py_ssize_t i, n
- object[:] result
-
- n = len(years)
- if len(months) != n or len(days) != n:
- raise ValueError('Length of years/months/days must all be equal')
- result = np.empty(n, dtype='O')
-
- for i in range(n):
- result[i] = datetime(int(years[i]), int(months[i]), int(days[i]))
-
- return result.base # .base to access underlying ndarray
-
-
def try_parse_datetime_components(object[:] years,
object[:] months,
object[:] days,
diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index dcba51d26..c82140eff 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -244,6 +244,7 @@ def _convert_listlike_datetimes(arg, box, format, name=None, tz=None,
if format == '%Y%m%d':
try:
result = _attempt_YYYYMMDD(arg, errors=errors)
+ return DatetimeIndex._simple_new(result, name=name, tz=tz)
except (ValueError, TypeError, tslibs.OutOfBoundsDatetime):
raise ValueError("cannot convert the input to "
"'%Y%m%d' date format")
@@ -713,12 +714,9 @@ def _attempt_YYYYMMDD(arg, errors):
"""
def calc(carg):
- # calculate the actual result
- carg = carg.astype(object)
- parsed = parsing.try_parse_year_month_day(carg / 10000,
- carg / 100 % 100,
- carg % 100)
- return tslib.array_to_datetime(parsed, errors=errors)[0]
+ return conversion.try_parse_year_month_day(carg // 10000,
+ (carg // 100) % 100,
+ carg % 100)
def calc_with_mask(carg, mask):
result = np.empty(carg.shape, dtype='M8[ns]')
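The integer decomposition that calc performs in the patch above can be sketched in plain Python. This is only an illustration of the arithmetic, not the pandas implementation; split_yyyymmdd is a hypothetical helper name, and it works on a single Python int rather than on an array.

```python
def split_yyyymmdd(value: int) -> tuple[int, int, int]:
    """Split an integer such as 20120101 into (year, month, day)."""
    year = value // 10000          # floor division keeps the result an int
    month = (value // 100) % 100   # drop the day digits, keep the last two
    day = value % 100              # last two digits
    return year, month, day


# True division yields a float instead, which is the kind of
# floating point math the comment above suspects of slowing things down.
float_year = 20120101 / 10000

print(split_yyyymmdd(20120101))
print(type(float_year).__name__)
```

With floor division every intermediate value stays an integer, so nothing has to be truncated or cast back afterwards.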
Comment From: jreback
Of course, for this example, cache=True helps quite a bit :>
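A rough sketch of why caching helps on this benchmark, where all one million strings are identical: each distinct string only needs to be parsed once. The code below uses the standard library only and assumes a simple dict-based cache; parse_with_cache is a hypothetical helper and is not how pandas implements cache=True internally.

```python
from datetime import datetime


def parse_with_cache(values, fmt="%Y%m%d"):
    """Parse each distinct string once and reuse the result for repeats."""
    cache = {}
    parsed = []
    for value in values:
        if value not in cache:
            # Only the first occurrence of a string pays the parsing cost.
            cache[value] = datetime.strptime(value, fmt)
        parsed.append(cache[value])
    return parsed, len(cache)


dates, unique_parses = parse_with_cache(["20120101"] * 1000)
print(len(dates), unique_parses)
```

On input with mostly unique strings the cache would give little or no benefit, so the benchmark in this thread is close to a best case for it.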
Comment From: MarcoGorelli
This path is buggy anyway; I'd suggest just removing it: https://github.com/pandas-dev/pandas/pull/50054
Comment From: MarcoGorelli
Sending it down the same path that it would go down if format isn't specified isn't always possible, as that path doesn't handle 6-digit %Y%m%d dates:
In [5]: to_datetime('199934', format='%Y%m%d')
Out[5]: Timestamp('1999-03-04 00:00:00')
In [6]: to_datetime('199934')
---------------------------------------------------------------------------
ParserError: month must be in 1..12: 199934 present at position 0
I still think we should remove this path though, even if it means going down the strptime path. If the %Y%m%d fastpath is buggy, then it doesn't matter how fast it is.
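For comparison, the standard library's datetime.strptime also accepts unpadded month and day digits under '%Y%m%d', which matches the 6-digit result shown above. A small sketch using only the standard library (it does not exercise any pandas code path):

```python
from datetime import datetime

# Eight-digit input: the usual zero-padded form.
padded = datetime.strptime("19990304", "%Y%m%d")

# Six-digit input: month and day are each a single unpadded digit.
unpadded = datetime.strptime("199934", "%Y%m%d")

print(padded, unpadded)
```

So a strptime-style parser treats '199934' as a valid '%Y%m%d' date, while the format-less parse in the example above rejects the same string.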