http://pydata.github.io/pandas/# is a view since 0.14 (it's not every tag, but a sampling). The regressions page is now working here:

- [ ] http://pydata.github.io/pandas/#groupby.groupby_indices.time_groupby_indices
- [ ] http://pydata.github.io/pandas/#frame_ctor.frame_ctor_nested_dict_int64.time_frame_ctor_nested_dict_int64 (#11158)
- [ ] http://pydata.github.io/pandas/#index_object.index_int64_intersection.time_index_int64_intersection
- [ ] http://pydata.github.io/pandas/#groupby.groupby_ngroups_100_cumprod.time_groupby_ngroups_100_cumprod
- [ ] http://pydata.github.io/pandas/#frame_methods.frame_get_dtype_counts.time_frame_get_dtype_counts
- [ ] http://pydata.github.io/pandas/#ctors.index_from_series_ctor.time_index_from_series_ctor
- [ ] http://pydata.github.io/pandas/#binary_ops.frame_multi_and_st.time_frame_multi_and_st
- [ ] http://pydata.github.io/pandas/#indexing.series_loc_list_like.time_series_loc_list_like
Timeseries / period related:

- [x] http://pydata.github.io/pandas/#plotting.plot_timeseries_period.time_plot_timeseries_period (#11194)
- [ ] http://pydata.github.io/pandas/#timeseries.timeseries_iter_periodindex.time_timeseries_iter_periodindex
- [ ] http://pydata.github.io/pandas/#timeseries.period_setitem.time_period_setitem
- [ ] http://pydata.github.io/pandas/#timeseries.datetimeindex_normalize.time_datetimeindex_normalize
- [ ] http://pydata.github.io/pandas/#timeseries.dataframe_resample_max_string.time_dataframe_resample_max_string
Comment From: jorisvandenbossche
@sinhrks I was looking at the time series plotting slowdown (`time_plot_timeseries_period`; there is roughly a 5x slowdown in timeseries plotting since 0.16.2).

It is related to some of the Period changes, namely that `freq` is no longer a string but a DateOffset object. If you profile `df.plot()`, most of the time is caused by `to_offset`. At a certain point (in `converter.py:convert`), an object-dtype array of Period objects is converted back to a PeriodIndex:
```python
In [1]: values = pd.period_range('1/1/1975', periods=2000).astype(object).values

In [2]: values
Out[2]:
array([Period('1975-01-01', 'D'), Period('1975-01-02', 'D'),
       Period('1975-01-03', 'D'), ..., Period('1980-06-20', 'D'),
       Period('1980-06-21', 'D'), Period('1980-06-22', 'D')], dtype=object)

In [3]: %timeit pd.PeriodIndex(values, freq='D')
100 loops, best of 3: 1.86 ms per loop
```
The above is with 0.16.2; on master this gives me 109 ms instead of 1.86 ms. The reason for the slowdown is that `PeriodIndex._from_arraylike` will try to extract the freq from each object and check whether that freq is equal to the given freq. Previously this was a string equality check; now it is a DateOffset/string equality check.
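To make the cost difference concrete, here is a minimal self-contained sketch (not pandas internals — `Offset`, `extract_freq_slow`, and `extract_freq_fast` are hypothetical stand-ins) of why a per-element object equality check is more expensive than comparing a cached freq string:

```python
class Offset:
    """Stand-in for a DateOffset-like object (hypothetical, not pandas' DateOffset)."""

    def __init__(self, freqstr):
        self.freqstr = freqstr

    def __eq__(self, other):
        # Equality against a string first requires wrapping/parsing it --
        # in the real code, this kind of per-element work dominated the profile.
        if isinstance(other, str):
            other = Offset(other)
        return isinstance(other, Offset) and self.freqstr == other.freqstr


def extract_freq_slow(periods, target):
    """Per-element object comparison: one __eq__ (with wrapping) per element."""
    return all(p == target for p in periods)


def extract_freq_fast(periods, target_str):
    """Compare the cached freq *string* instead: plain string equality per element."""
    return all(p.freqstr == target_str for p in periods)


periods = [Offset("D") for _ in range(2000)]
assert extract_freq_slow(periods, Offset("D"))
assert extract_freq_fast(periods, "D")
```

The linked commit takes essentially the second approach: compare strings rather than offset objects inside the loop.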
Now, looking for a possible fix, this commit: https://github.com/jorisvandenbossche/pandas/commit/55ecbf0f4b0c1126f45fd738245e12d7c1139b75 (making it compare strings again) solves the perf issue for the most part. But I was wondering, do you know a better approach? Maybe we could prevent this step (array of Period objects -> PeriodIndex) altogether in the plotting code? (although this is initially called from
Comment From: sinhrks
Thanks for catching this. In addition to your ideas, caching the str -> freq mapping in `to_offset` may work. This is already done in `get_offset`, and I think the same cache can be used.

Let me look into this.
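The caching idea can be sketched as follows; this is a hedged illustration of memoising a string-to-offset parser, not pandas' actual `to_offset`/`get_offset` implementation (the `Offset` class and `to_offset_cached` name are hypothetical):

```python
from functools import lru_cache


class Offset:
    """Minimal stand-in for a DateOffset-like object (hypothetical)."""

    def __init__(self, freqstr):
        self.freqstr = freqstr


@lru_cache(maxsize=None)
def to_offset_cached(freqstr):
    # The (expensive) parse of the freq string happens once per distinct
    # string; repeated calls with the same string hit the cache.
    return Offset(freqstr)


a = to_offset_cached("D")
b = to_offset_cached("D")
assert a is b  # the second call returns the cached object
```

Since converting an array of Period objects repeatedly asks for the same handful of freq strings, a cache like this turns the per-element parse into a dict lookup.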
Comment From: jreback
closing as stale