Pandas PERF: to_datime fastpath for %Y%m%d is slower

We have a check for whether format == '%Y%m%d', but this actually seems to be slower:

In [86]: s = pd.Series(['20120101']*1000000)

In [87]: %timeit pd.to_datetime(s)
229 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [88]: %timeit pd.to_datetime(s, format='%Y%m%d')
749 ms ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comment From: jreback

something changed recently? with this; the path now excepts every time (inside the _attemptYYMMDD). this should be much faster.

Comment From: pratapvardhan

I suspect, for format='%Y%m%d' it goes through _attempt_YYYYMMDD in pandas.core.tools.datetimes and is reconverting the type from object to int to object again and then having lib.try_parse_year_month_day prior to tslib.array_to_datetime seems taking up time. Whereas, without format, it directly uses tslib.array_to_datetime

Comment From: chris-b1

One change, though not especially recent, is that the iso 8601 path will now handle '%Y%m%d', so it will be much faster than it once was. Still seems like it would be possible to make _attempt_YYYYMMDD faster, not clear to me why we're casting back and forth to objects.

Comment From: mroeschke

I looked into this briefly in https://github.com/mroeschke/pandas/tree/non_object_parsing and was able to get it down ~2x from master by getting rid of casting to object, but it's still slower than not providing a format:

In [13]: pd.__version__
Out[13]: '0.24.0.dev0+890.gd57115285.dirty'

In [14]: s = pd.Series(['20120101']*1000000)

In [15]: %timeit pd.to_datetime(s)
135 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Branch
In [16]: %timeit pd.to_datetime(s, format='%Y%m%d')
240 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# master
In [3]: %timeit pd.to_datetime(s, format='%Y%m%d')
531 ms ± 34.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Given that this branch can provide a slowdown, it might be worth removing this path for now.

Comment From: jreback

this used to be way faster than naive parsing something changed -

Comment From: jreback

This patch makes the speed almost the same. But since its just duplicating a lot of paths, maybe just easier to simplify this (e.g. if format is %Y%m%d then just remove the format and parse).

I think this got way slower because of floating point math was introduced (e.g. / rather than //)

diff --git a/pandas/_libs/tslibs/conversion.pyx b/pandas/_libs/tslibs/conversion.pyx
index 8cf42bf93..2f303e1f2 100644
--- a/pandas/_libs/tslibs/conversion.pyx
+++ b/pandas/_libs/tslibs/conversion.pyx
@@ -481,6 +481,34 @@ cdef _TSObject convert_str_to_tsobject(object ts, object tz, object unit,
     return convert_to_tsobject(ts, tz, unit, dayfirst, yearfirst)


+def try_parse_year_month_day(int64_t[:] years, int64_t[:] months,
+                             int64_t[:] days):
+    cdef:
+        Py_ssize_t i, n
+        int64_t[:] iresult
+        npy_datetimestruct dts
+
+    n = len(years)
+    if len(months) != n or len(days) != n:
+        raise ValueError('Length of years/months/days must all be equal')
+    result = np.empty(n, dtype='M8[ns]')
+    iresult = result.view('i8')
+
+    dts.hour = 0
+    dts.min = 0
+    dts.sec = 0
+    dts.us = 0
+    dts.ps = 0
+
+    for i in range(n):
+        dts.year = years[i]
+        dts.month = months[i]
+        dts.day = days[i]
+        result[i] = dtstruct_to_dt64(&dts)
+
+    return np.asarray(result)
+
+
 cdef inline check_overflows(_TSObject obj):
     """
     Check that we haven't silently overflowed in timezone conversion
diff --git a/pandas/_libs/tslibs/parsing.pyx b/pandas/_libs/tslibs/parsing.pyx
index 71bb8f796..cd911153c 100644
--- a/pandas/_libs/tslibs/parsing.pyx
+++ b/pandas/_libs/tslibs/parsing.pyx
@@ -10,7 +10,6 @@ from cython import Py_ssize_t

 from cpython.datetime cimport datetime

-
 import numpy as np

 # Avoid import from outside _libs
@@ -19,7 +18,6 @@ if sys.version_info.major == 2:
 else:
     from io import StringIO

-
 # dateutil compat
 from dateutil.tz import (tzoffset,
                          tzlocal as _dateutil_tzlocal,
@@ -465,23 +463,6 @@ def try_parse_date_and_time(object[:] dates, object[:] times,
     return result.base  # .base to access underlying ndarray


-def try_parse_year_month_day(object[:] years, object[:] months,
-                             object[:] days):
-    cdef:
-        Py_ssize_t i, n
-        object[:] result
-
-    n = len(years)
-    if len(months) != n or len(days) != n:
-        raise ValueError('Length of years/months/days must all be equal')
-    result = np.empty(n, dtype='O')
-
-    for i in range(n):
-        result[i] = datetime(int(years[i]), int(months[i]), int(days[i]))
-
-    return result.base  # .base to access underlying ndarray
-
-
 def try_parse_datetime_components(object[:] years,
                                   object[:] months,
                                   object[:] days,
diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index dcba51d26..c82140eff 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -244,6 +244,7 @@ def _convert_listlike_datetimes(arg, box, format, name=None, tz=None,
             if format == '%Y%m%d':
                 try:
                     result = _attempt_YYYYMMDD(arg, errors=errors)
+                    return DatetimeIndex._simple_new(result, name=name, tz=tz)
                 except (ValueError, TypeError, tslibs.OutOfBoundsDatetime):
                     raise ValueError("cannot convert the input to "
                                      "'%Y%m%d' date format")
@@ -713,12 +714,9 @@ def _attempt_YYYYMMDD(arg, errors):
     """

     def calc(carg):
-        # calculate the actual result
-        carg = carg.astype(object)
-        parsed = parsing.try_parse_year_month_day(carg / 10000,
-                                                  carg / 100 % 100,
-                                                  carg % 100)
-        return tslib.array_to_datetime(parsed, errors=errors)[0]
+        return conversion.try_parse_year_month_day(carg // 10000,
+                                                   (carg // 100) % 100,
+                                                   carg % 100)

     def calc_with_mask(carg, mask):
         result = np.empty(carg.shape, dtype='M8[ns]')

Comment From: jreback

of course for this example, cache=True helps quite a bit :>

Comment From: MarcoGorelli

This path is buggy anyway, I'd suggest just removing it https://github.com/pandas-dev/pandas/pull/50054

Comment From: MarcoGorelli

Sending it down the same path that it would go down if format isn't specified isn't always possible, as it doesn't handle 6-digit %Y%m%d dates:

In [5]: to_datetime('199934', format='%Y%m%d')
Out[5]: Timestamp('1999-03-04 00:00:00')

In [6]: to_datetime('199934')
---------------------------------------------------------------------------
ParserError: month must be in 1..12: 199934 present at position 0

I still think we should remove this path though, even if it means going down the strptime path - if the %Y%m%d fastpath is buggy, then it doesn't matter how fast it is