http://pydata.github.io/pandas/# is a view since 0.14 (it's not every tag, but a sampling). The regressions page is now working here:

- [ ] http://pydata.github.io/pandas/#groupby.groupby_indices.time_groupby_indices
- [ ] http://pydata.github.io/pandas/#frame_ctor.frame_ctor_nested_dict_int64.time_frame_ctor_nested_dict_int64 (#11158)
- [ ] http://pydata.github.io/pandas/#index_object.index_int64_intersection.time_index_int64_intersection
- [ ] http://pydata.github.io/pandas/#groupby.groupby_ngroups_100_cumprod.time_groupby_ngroups_100_cumprod
- [ ] http://pydata.github.io/pandas/#frame_methods.frame_get_dtype_counts.time_frame_get_dtype_counts
- [ ] http://pydata.github.io/pandas/#ctors.index_from_series_ctor.time_index_from_series_ctor
- [ ] http://pydata.github.io/pandas/#binary_ops.frame_multi_and_st.time_frame_multi_and_st
- [ ] http://pydata.github.io/pandas/#indexing.series_loc_list_like.time_series_loc_list_like
Timeseries / period related:

- [x] http://pydata.github.io/pandas/#plotting.plot_timeseries_period.time_plot_timeseries_period (#11194)
- [ ] http://pydata.github.io/pandas/#timeseries.timeseries_iter_periodindex.time_timeseries_iter_periodindex
- [ ] http://pydata.github.io/pandas/#timeseries.period_setitem.time_period_setitem
- [ ] http://pydata.github.io/pandas/#timeseries.datetimeindex_normalize.time_datetimeindex_normalize
- [ ] http://pydata.github.io/pandas/#timeseries.dataframe_resample_max_string.time_dataframe_resample_max_string
Comment From: jorisvandenbossche
@sinhrks I was looking at the time series plotting slowdown (`time_plot_timeseries_period`; there is roughly a 5x slowdown in timeseries plotting since 0.16.2).

It is related to some of the Period changes, namely that `freq` is no longer a string but a DateOffset object. If you profile `df.plot()`, most of the time is caused by `to_offset`. At a certain point (in `converter.py:convert`), an object-dtype array of Period objects is converted back to a PeriodIndex:
```python
In [1]: values = pd.period_range('1/1/1975', periods=2000).astype(object).values

In [2]: values
Out[2]:
array([Period('1975-01-01', 'D'), Period('1975-01-02', 'D'),
       Period('1975-01-03', 'D'), ..., Period('1980-06-20', 'D'),
       Period('1980-06-21', 'D'), Period('1980-06-22', 'D')], dtype=object)

In [3]: %timeit pd.PeriodIndex(values, freq='D')
100 loops, best of 3: 1.86 ms per loop
```
The above is with 0.16.2; on master this gives me 109 ms instead of 1.86 ms. The reason for the slowdown is that `PeriodIndex._from_arraylike` will try to extract the freq from each object and check whether that freq is equal to the given freq. Previously this was a string equality check; now it is a DateOffset/string equality check.
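To make the cost difference concrete, here is a minimal self-contained sketch (not pandas internals — `Offset`, `extract_freq_slow`, and `extract_freq_fast` are hypothetical stand-ins) of why a per-element object equality check is more expensive than comparing a cached freq string:

```python
class Offset:
    """Stand-in for a DateOffset-like object (hypothetical, not pandas' DateOffset)."""

    def __init__(self, freqstr):
        self.freqstr = freqstr

    def __eq__(self, other):
        # Equality against a string first requires wrapping/parsing it --
        # in the real code, this kind of per-element work dominated the profile.
        if isinstance(other, str):
            other = Offset(other)
        return isinstance(other, Offset) and self.freqstr == other.freqstr


def extract_freq_slow(periods, target):
    """Per-element object comparison: one __eq__ (with wrapping) per element."""
    return all(p == target for p in periods)


def extract_freq_fast(periods, target_str):
    """Compare the cached freq *string* instead: plain string equality per element."""
    return all(p.freqstr == target_str for p in periods)


periods = [Offset("D") for _ in range(2000)]
assert extract_freq_slow(periods, Offset("D"))
assert extract_freq_fast(periods, "D")
```

The linked commit takes essentially the second approach: compare strings rather than offset objects inside the loop.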
Now, looking for a possible fix, this commit: https://github.com/jorisvandenbossche/pandas/commit/55ecbf0f4b0c1126f45fd738245e12d7c1139b75 (making it compare strings again) solves the perf issue for the most part. But I was wondering, do you know a better approach? Maybe we could prevent this step (array of Period objects -> PeriodIndex) altogether in the plotting code? (although this is initially called from
Comment From: sinhrks
Thanks for catching this. In addition to your ideas, caching the str -> freq mapping in `to_offset` may work. This is already done in `get_offset`, and I think the same cache can be used.

Let me look into this.
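The caching idea can be sketched as follows; this is a hedged illustration of memoising a string-to-offset parser, not pandas' actual `to_offset`/`get_offset` implementation (the `Offset` class and `to_offset_cached` name are hypothetical):

```python
from functools import lru_cache


class Offset:
    """Minimal stand-in for a DateOffset-like object (hypothetical)."""

    def __init__(self, freqstr):
        self.freqstr = freqstr


@lru_cache(maxsize=None)
def to_offset_cached(freqstr):
    # The (expensive) parse of the freq string happens once per distinct
    # string; repeated calls with the same string hit the cache.
    return Offset(freqstr)


a = to_offset_cached("D")
b = to_offset_cached("D")
assert a is b  # the second call returns the cached object
```

Since converting an array of Period objects repeatedly asks for the same handful of freq strings, a cache like this turns the per-element parse into a dict lookup.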
Comment From: jreback
closing as stale