Right now, when we calculate something like a rolling max on a float32 dataset, it gets upcast to float64. This considerably increases the memory requirements for a big dataset.

This is because the skiplist implementation only works with doubles. Would you be opposed to my adding instantiations for int32, int64, float32, uint64, and uint32?

This way we can avoid copies for big datasets, which cost a lot of both memory and time.
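
For context, a minimal repro sketch of the upcast (using the current `.rolling` spelling; older pandas versions spell it `pd.rolling_max`):

```python
import numpy as np
import pandas as pd

# Demonstrates the float32 -> float64 upcast described above.
s = pd.Series(np.zeros(1_000_000, dtype=np.float32))
result = s.rolling(window=100).max()
print(s.dtype, "->", result.dtype)  # float32 -> float64 where the upcast occurs
```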

cc @kawochen

xref #3007

Comment From: wesm

Why don't we adapt the algorithms from bottleneck (which are faster and do not require skiplists)?

Comment From: kawochen

rolling_max and rolling_min are from bottleneck. I think skiplists are used in rolling_median and rolling_quantile (an alternative is a max-median-min heap).

Comment From: jennolsen84

Yeah, I was thinking of rolling_median. Bottleneck's implementation doesn't ignore NaNs (http://berkeleyanalytics.com/bottleneck/reference.html#bottleneck.move_median). I will see if I can contribute an implementation there that accepts NaNs.

This one and only commit comment doesn't look that encouraging though :) https://github.com/kwgoodman/bottleneck/commits/5a9ecb818b63ffcb8d6e787e51a236a109302c6a/bottleneck/src/auto_pyx/csrc/move_median.c

Comment From: kawochen

Wow, bottleneck is quite a bit faster. Should work for rolling_quantile too.

Comment From: jreback

@kawochen can you post any benchmarks? As an aside, we still need to keep these routines, as bottleneck is optional (though that is a separate issue).

Comment From: jennolsen84

When benchmarking, please remember to benchmark across different window sizes and different axes.

When testing median implementations, I came across cases where running times differed greatly depending on window_size because of the algorithms being used. E.g. when window sizes are small, maintaining multiple heaps hurts performance compared to running quickselect repeatedly, whereas with big window sizes the asymptotic complexity (big O) starts to matter a lot and multi-heap implementations start to win.

Also, with bigger matrices, iterating across different axes means cache effects can matter a bit too. So, when you benchmark, please be careful about these effects.
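
For example, a rough sweep along these lines (window sizes here are arbitrary, and a similar sweep over axes would probe the cache effects; this uses the current `.rolling` spelling):

```python
import timeit

import numpy as np
import pandas as pd

# Rough benchmark sketch: sweep window sizes to catch the big-O crossover
# between algorithms; the sizes below are arbitrary examples.
df = pd.DataFrame(np.random.randn(200_000, 20))
for window in (5, 100, 10_000):
    t = timeit.timeit(lambda: df.rolling(window).median(), number=3) / 3
    print(f"window={window:>6}: {t:.3f}s per call")
```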

Comment From: kawochen

Will work on that :)

Comment From: jennolsen84

rolling_median is an operation that can easily be sped up using multi-threading. The C library can export a re-entrant function that doesn't use the GIL and takes an input and an output buffer.

What is the pandas policy regarding this? E.g. numexpr evals automatically use multiple cores (unless explicitly told to use only one), but bottleneck uses one core by default. If it is OK to use multiple threads, should I just create a concurrent.futures.ThreadPoolExecutor and submit the tasks there (assuming the input size is big enough and the window size is small enough, of course)?
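
For concreteness, a sketch of what that could look like; `move_median_1d` stands in for a hypothetical re-entrant, GIL-releasing C kernel with signature `move_median_1d(in_buf, out_buf, window)`:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def parallel_move_median(arr2d, window, move_median_1d, max_workers=4):
    # Fortran order makes each column a contiguous 1-D buffer, which a C
    # kernel typically wants.
    a = np.asfortranarray(arr2d)
    out = np.empty(a.shape, dtype=a.dtype, order="F")
    # Columns are independent, so they can genuinely run in parallel as
    # long as the kernel releases the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(move_median_1d, a[:, j], out[:, j], window)
                   for j in range(a.shape[1])]
        for f in futures:
            f.result()  # surface any exceptions raised in the workers
    return out
```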

Comment From: jennolsen84

@jreback @wesm any feedback on the pandas policy? I would like to avoid doing unnecessary work if possible; in this case, that means writing a threaded implementation only to have the PR rejected in pandas.

I have upgraded the move_median algorithm in bottleneck to accept NaNs and min_periods (called min_count in bottleneck).

I benchmarked it in different scenarios and am consistently getting ~3.5-5.5x the performance of pandas master (except for the trivial case of window_size=1, which bn optimizes). So I am guessing that pandas' rolling median already uses an algorithm with O(n * lg window_size) running time.

The double-heap algorithm implementation in bottleneck (contributed by someone else) is pretty efficient. Also, bottleneck generates code for the double, float, int32, and int64 data types at compile time.
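
For reference, here is a toy O(n * lg window_size) rolling median. This is not bottleneck's double-heap code, just a sketch with the same asymptotics; it assumes NaN-free input and min_periods equal to the window:

```python
import numpy as np
from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

def rolling_median_sketch(a, window):
    out = np.full(len(a), np.nan)
    sl = SortedList()
    for i, x in enumerate(a):
        sl.add(x)                      # O(log w) insert
        if i >= window:
            sl.remove(a[i - window])   # O(log w) evict the oldest value
        if i >= window - 1:
            mid = len(sl) // 2
            out[i] = sl[mid] if len(sl) % 2 else 0.5 * (sl[mid - 1] + sl[mid])
    return out
```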

In case you're interested in the move median implementation:

https://github.com/jennolsen84/bottleneck/tree/movemedian

I still have to clean up the function names a little and add more documentation. I have tested the implementation thoroughly, so it should be good. Test cases were a bit tough to generate; I could use advice on this. I ended up hammering the implementation with random datasets, which helped shake out a lot of bugs (hopefully all of them?).
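
One way to structure that kind of randomized testing is to compare the fast implementation against a slow but obviously correct reference; `fast_move_median` below stands in for the hypothetical function under test, and the min_count semantics are only approximated:

```python
import numpy as np

def reference_move_median(a, window, min_count):
    # Slow but obviously correct NaN-aware rolling median.
    out = np.full(len(a), np.nan)
    for i in range(len(a)):
        chunk = a[max(0, i - window + 1): i + 1]
        valid = chunk[~np.isnan(chunk)]
        if len(valid) >= min_count:
            out[i] = np.median(valid)
    return out

rng = np.random.default_rng(0)
for _ in range(100):
    n = int(rng.integers(1, 500))
    a = rng.normal(size=n)
    a[rng.random(n) < 0.3] = np.nan  # inject NaNs at random positions
    # np.testing.assert_allclose(fast_move_median(a, 20, 5),
    #                            reference_move_median(a, 20, 5))
```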

Comment From: jennolsen84

here is the timing data (I stopped it after a while, as it wasn't that interesting):

https://gist.github.com/jennolsen84/70cf058cd0e02ef299d3

Comment From: wesm

@jennolsen84 is it possible to handle the multithreading using pthreads in C rather than via Python threads?

Comment From: jreback

@jennolsen84 so in general, to play nice, you should:

- release the GIL whenever possible (though be careful, as this can add overhead in some cases)
- avoid Python threads and Python multiprocessing, as these don't play nice with layers above pandas that want to use Python threads themselves.

Note that numexpr recently (IIRC in 2.5, just released) now actually plays nice by holding a lock around its own thread usage (before, it was not threadsafe).

Comment From: jennolsen84

@jreback yep, I sent in the PR to numexpr for that lock :) https://github.com/pydata/numexpr/pull/200

@wesm yes, can definitely use pthreads. How many threads should be used? I was thinking of peeking at numexpr settings, but numexpr is not required by pandas.

Bigger picture: I think the best option might be to expose something that returns a list of functions, which can then be scheduled as the user pleases. This lets the user reuse an existing thread pool (one that might be partially in use, which avoids oversaturating the CPU, etc.). At the moment, this cannot be implemented efficiently on top of pandas because of the function signatures in pandas. (The out parameter gets us part of the way there; we would also need a warmup window.)

Thoughts? I can move the discussion of the last paragraph to a separate issue, if you would like.
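
To make the last paragraph concrete, here is a hypothetical sketch. `rolling_median_tasks` is not a real pandas API, and a real speedup would still require the underlying kernel to release the GIL:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def rolling_median_tasks(df, window):
    # Return one zero-argument callable per column; each computes and
    # returns (column_name, result_series), so the caller owns scheduling.
    def make_task(col):
        return lambda: (col, df[col].rolling(window).median())
    return [make_task(c) for c in df.columns]

# The caller can feed the tasks to an executor it already owns:
df = pd.DataFrame(np.random.randn(1_000, 8))
tasks = rolling_median_tasks(df, window=50)
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(t) for t in tasks]
out = pd.DataFrame(dict(f.result() for f in futures))
```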

Comment From: jennolsen84

bottleneck 1.1 now has a good implementation of move_median. It can handle NaNs, etc. There is talk of implementing parallel algorithms in bottleneck as well. Perhaps pandas can just use those.

Comment From: mroeschke

Seems like the general feeling is that we want to move towards using bottleneck (or maybe even numba) in the future, so extending the skiplist to handle more types is not really a priority. Closing.