I hacked asv to log the total execution time, including setup, for each benchmark. Some of these are parametrized over several cases, so they may not actually be slow. The raw log records time in seconds; the table below formats it as hours:minutes:seconds.
| | time | file | klass | method |
|---|---|---|---|---|
| 68 | 00:03:11.379267 | frame_ctor | FrameConstructorDTIndexFromOffsets | time_frame_ctor |
| 394 | 00:01:01.671905 | inference | to_numeric_downcast | time_downcast |
| 199 | 00:00:46.330012 | groupby | GroupBySuite | time_describe |
| 559 | 00:00:24.698904 | replace | replace_large_dict | time_replace_large_dict |
| 204 | 00:00:24.212386 | groupby | GroupBySuite | time_mad |
| 210 | 00:00:22.481497 | groupby | GroupBySuite | time_pct_change |
| 215 | 00:00:18.909368 | groupby | GroupBySuite | time_skew |
| 200 | 00:00:18.732072 | groupby | GroupBySuite | time_diff |
| 212 | 00:00:18.317290 | groupby | GroupBySuite | time_rank |
| 219 | 00:00:16.845357 | groupby | GroupBySuite | time_unique |
Ideally we could optimize the setup time for these as well. We could maybe modify the benchmarks themselves to do less work / run faster, but I'd like to avoid that if possible.
Link to the full CSV: https://gist.github.com/9d80aa45750224d7453863f2f754160d
Comment From: jorisvandenbossche
One of the things I have already thought about is that we could probably do less in the setup. I mean, is it needed to create the DataFrames in `setup` (which gets repeated a lot), or would it be sufficient to just create them once in the file (as long as the benchmarks don't modify them)?
E.g. in https://github.com/pandas-dev/pandas/blob/master/asv_bench/benchmarks/groupby.py#L399 we could do the actual creation of the frame outside `setup`, and in `setup` only pick the right one based on the parameters.
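A rough sketch of what I mean (the class name and data below are made up, not the actual groupby.py code):

```python
import numpy as np
import pandas as pd

N = 100_000
# Built once at module import, rather than in setup(), which asv
# re-runs for every benchmark / parameter / repeat combination.
_frames = {
    'int': pd.DataFrame({'key': np.random.randint(0, 100, N),
                         'value': np.random.randn(N)}),
    'float': pd.DataFrame({'key': np.random.randint(0, 100, N),
                           'value': 100 * np.random.randn(N)}),
}


class GroupByMethods:
    params = ['int', 'float']
    param_names = ['dtype']

    def setup(self, dtype):
        # Cheap: just pick the pre-built, read-only frame.
        self.df = _frames[dtype]

    def time_describe(self, dtype):
        self.df.groupby('key').describe()

    def time_skew(self, dtype):
        self.df.groupby('key').skew()
```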
Comment From: TomAugspurger
There's also the `setup_cache` method, which executes once.
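For example (a sketch with invented names and data), asv caches whatever `setup_cache` returns and passes it as the first argument to the benchmark methods:

```python
import numpy as np
import pandas as pd


class ReplaceLargeDict:
    def setup_cache(self):
        # Runs once and is cached by asv; the return value must be
        # picklable and is handed to the benchmark methods below.
        n = 10 ** 6
        data = {
            'series': pd.Series(np.random.randint(0, 1000, n)),
            'mapping': {i: i + 1 for i in range(1000)},
        }
        return data

    def time_replace(self, data):
        data['series'].replace(data['mapping'])
```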
Comment From: jreback
Is this actually a big deal?
Comment From: jbrockmendel
Is there a measure of the relative importance of these benchmarks? i.e. if I find improvements in subset A and slowdowns in subset B, when can this count as a win?
Comment From: jorisvandenbossche
@jbrockmendel there is no 'measure' for that, apart from common sense.
But note that this issue is not about that; it is about the fact that some tests take a long time to run (not because pandas is too slow, but because of the setup of the test).
Comment From: jbrockmendel
For some of these, e.g. reindex.Reindexing, there is a lot of unnecessary setup being done. `df2` is constructed for every call of all three benchmarks, but is only used in one. Same for `s1` and `s2`. It isn't obvious how much mileage can be gained by trimming these down, though.
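A rough illustration of that pattern (invented data and method names, not the actual reindex.py code):

```python
import numpy as np
import pandas as pd


class Reindexing:
    def setup(self):
        # setup() runs before *every* benchmark in the class, so the
        # objects below are rebuilt even for benchmarks that never use them.
        self.df = pd.DataFrame(np.random.randn(100_000, 10))
        self.df2 = pd.DataFrame(np.random.randn(100_000, 10))  # only time_reindex_df2 needs this
        self.s1 = pd.Series(np.random.randn(100_000))          # only time_align needs these
        self.s2 = pd.Series(np.random.randn(100_000))

    def time_reindex(self):
        self.df.reindex(range(0, 200_000, 2))

    def time_reindex_df2(self):
        self.df2.reindex(range(0, 200_000, 2))

    def time_align(self):
        self.s1.align(self.s2)
```

Moving `df2`, `s1`, and `s2` into a separate class (or building them only in the benchmarks that need them) would avoid the repeated work, though as noted the payoff is unclear.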
Comment From: jbrockmendel
@TomAugspurger can you post the hack you used to measure setup time? I'm poking at this instead of doing real work.
Comment From: TomAugspurger
Happy to encourage productive procrastination :)
This is the diff in my local copy (which is based on a pretty old commit now):
```diff
diff --git a/asv/benchmarks.py b/asv/benchmarks.py
index 60442b9..b3b96ba 100644
--- a/asv/benchmarks.py
+++ b/asv/benchmarks.py
@@ -89,6 +89,8 @@ def run_benchmark(benchmark, root, env, show_stderr=False,
         - `errcode`: Error return code.
     """
+    import time
+    t0 = time.time()
     name = benchmark['name']
     result = {'stderr': '', 'profile': None, 'errcode': 0}
     bench_results = []
@@ -230,6 +232,10 @@ def run_benchmark(benchmark, root, env, show_stderr=False,
             with log.indent():
                 log.error(result['stderr'])
+    t1 = time.time()
+    line = '{},{}\n'.format(name, t1 - t0)
+    with open("log.csv", "a") as f:
+        f.write(line)
     return result
@@ -535,7 +541,6 @@ class Benchmarks(dict):
            and be a byte string containing the cProfile data.
         """
         log.info("Benchmarking {0}".format(env.name))
-
         with log.indent():
             benchmarks = sorted(list(six.iteritems(self)))
```
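With that patch applied, the log can be summarized with something like this (a sketch; per the diff, the file has no header row, just name,seconds pairs):

```python
import pandas as pd

# Each run_benchmark() call appends "name,total_seconds" to log.csv.
log = pd.read_csv("log.csv", names=["name", "seconds"])
print(log.sort_values("seconds", ascending=False).head(10))
```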
Comment From: jbrockmendel
Thanks, worked like a charm. What didn't work was my attempt to short-cut `setup` calls by defining attributes at the class level.
Comment From: jorisvandenbossche
> Thanks, worked like a charm. What didn't work was my attempt to short-cut `setup` calls by defining attributes at the class level.

I think using 'global' data defined at the module level works (as long as it is only used by benchmarks that don't modify the data). In some cases (when the module is not too long and something generic can be reused) that approach can be used, I think.
Comment From: mroeschke
I think this has been superseded by https://github.com/pandas-dev/pandas/issues/44450, so closing.