Hi,
Hi, I'm working with two colleagues (@FrancescElies & @JaviBi4) on adding groupby to bcolz. As a first step, we looked at making Pandas' excellent klib-based hash table implementation more library-independent: see https://github.com/CarstVaartjes/khash. A while back, khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:
2013-05-02 (0.2.8):
* Use quadratic probing. When the capacity is power of 2, stepping function
i*(i+1)/2 guarantees to traverse each bucket. It is better than double
hashing on cache performance and is more robust than linear probing.
In theory, double hashing should be more robust than quadratic probing.
However, my implementation is probably not for large hash tables, because
the second hash function is closely tied to the first hash function,
which reduce the effectiveness of double hashing.
Reference: http://research.cs.vt.edu/AVresearch/hashing/quadratic.php
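For intuition, the coverage claim in the changelog (the triangular-number step visits every bucket when the capacity is a power of two) can be checked with a small sketch; this is illustrative Python, not code from khash itself:

```python
def probe_sequence(start, m):
    """Buckets visited by triangular-number (quadratic) probing:
    offsets 0, 1, 3, 6, 10, ... i.e. i*(i+1)/2, taken mod m."""
    return [(start + i * (i + 1) // 2) % m for i in range(m)]

# With m a power of two, the first m probes hit every bucket exactly once,
# so a lookup or insert can never cycle without seeing a free slot.
m = 16
assert sorted(probe_sequence(5, m)) == list(range(m))
```

Double hashing gives the same full-coverage guarantee only with a carefully chosen second hash function, which is the robustness trade-off the changelog mentions.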
We updated our klib copy to 0.2.8 (which was something of a headache, to be honest), and the quadratic probing seems to be quite a bit faster, which might make it interesting for Pandas to retrofit into the general Pandas master. See the example:
import numpy as np
import random
import pandas as pd
from khash.hashtable import Int64HashTable, StringHashTable, Float64HashTable

def new_klib_int(input_array):
    ht = Int64HashTable(len(input_array))
    return ht.get_labels_groupby(input_array)

def new_klib_str(input_array):
    ht = StringHashTable(len(input_array))
    return ht.factorize(input_array)

def new_klib_float(input_array):
    ht = Float64HashTable(len(input_array))
    return ht.factorize(input_array)

a = np.random.randint(100, size=10000000)
b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in range(10000000)), dtype='S2')
b = pd.Series(b).values
c = np.random.random_sample(10000000) * 10000
%timeit pd.factorize(a)
%timeit new_klib_int(a)
%timeit pd.factorize(b)
%timeit new_klib_str(b)
%timeit pd.factorize(c)
%timeit new_klib_float(c)
In [20]: %timeit pd.factorize(a)
10 loops, best of 3: 122 ms per loop
In [21]: %timeit new_klib_int(a)
10 loops, best of 3: 101 ms per loop
In [22]: %timeit pd.factorize(b)
1 loops, best of 3: 496 ms per loop
In [23]: %timeit new_klib_str(b)
10 loops, best of 3: 165 ms per loop
In [36]: %timeit pd.factorize(c)
1 loops, best of 3: 1.61 s per loop
In [37]: %timeit new_klib_float(c)
1 loops, best of 3: 1.44 s per loop
So a 20%-ish improvement for int64 and a 65% improvement for strings in this example. Just thought to give you a heads-up about this, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything, so this might help to make Pandas even faster).
Edit: added float64 example too (10-15% improvement)
Comment From: jreback
thanks @CarstVaartjes
klib is shipped with pandas, so it should be straightforward to update I think, just a couple of files.
You are welcome to take a shot; otherwise we'll try to get it in for 0.15.1.
Comment From: CarstVaartjes
We'll take a shot tomorrow afternoon. There were quite a few changes, and I think Wes also added some inline code of his own to the current khash.c, but with some diff comparisons we hope we'll be okay. The performance increase is significant, so it would be nice to make it in time for 0.15.0.
Comment From: jreback
ok, if you could get it done in the next day or 2 and everything passes, then we could do it (run a vbench as well)
Comment From: FrancescElies
@jreback we would love to make it happen, but unfortunately with my current knowledge of C, Python and pandas I cannot get python setup.py build_ext --inplace to execute correctly with the new version of klib.
First error out of many others:
In file included from pandas/parser.c:360:
In file included from pandas/src/parser/tokenizer.h:36:
pandas/src/klib/khash.h:187:21: error: redefinition of '__ac_HASH_UPPER'
static const double __ac_HASH_UPPER = 0.77;
In case you would like to have a look:
- Changes under: https://github.com/FrancescElies/pandas/tree/klib_upgrade_0.2.8/pandas/src/klib
- Pandas-independent implementation of klib: https://github.com/CarstVaartjes/khash
Everything we made is based on the pandas team's work; any tips or suggestions are welcome.
I am sorry to say that this won't be possible in the next few days.
Thanks
Comment From: jreback
@FrancescElies
ok, no problem. Maybe we can do this for 0.15.1.
Something that might help is seeing what was originally changed, e.g.
git blame pandas/src/klib/khash.h
shows what/when things were changed (i.e. like a diff from the original file).
I'm not sure what @wesm changed, but maybe you can figure it out from this.
Comment From: FrancescElies
@jreback thanks for the info; that is actually exactly how we adapted klib 0.2.8, which already seems to work as an independent package. But when I port these changes to pandas, there seem to be problems with other files (like parser.c), files that we don't use in the other project.
I suppose I need to take a deeper look at the other files in pandas to understand the interconnections between them before being able to make it work properly.
Thanks again for your rapid answer.
Comment From: jreback
@FrancescElies np. I think it's appropriate to keep it as an inlined package, as we don't want to add an explicit dependency (the license is compatible and it works well that way).
That said, please have a closer look! I imagine the perf boost IS worth it.
and FYI, this library is also vastly faster than almost anything else,
e.g. see commentary here: http://article.gmane.org/gmane.comp.python.numeric.general/58724/match=unique+performance
essentially numpy should use klib (or something similar) and use hash tables to do unique.
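The suggestion above (hash-based unique rather than numpy's sort-based one) can be sketched in plain Python; the dict here is just a stand-in for a khash table:

```python
import numpy as np

def hash_unique(arr):
    """Illustrative hash-based unique: one pass, expected O(n),
    preserving first-seen order (np.unique sorts instead)."""
    seen = {}          # a Python dict is itself a hash table
    for x in arr.tolist():
        if x not in seen:
            seen[x] = None
    return np.array(list(seen), dtype=arr.dtype)

a = np.array([3, 1, 3, 2, 1])
hash_unique(a)   # array([3, 1, 2]) -- first-seen order, no sort needed
```

Sorting the result gives the same values as np.unique, but the hash pass avoids the O(n log n) sort when n is large and the number of distinct values is small.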
Comment From: immerrr
I might have time to look into this at the weekend as this definitely looks interesting, but I can't guarantee anything.
Comment From: jreback
go @immerrr !
Comment From: wesm
Very interesting, this is good to know that the quadratic probing has a practical impact in many cases.
I can try to look at the compilation issues if you can't sort them out, let me know.
Comment From: immerrr
I've integrated the code from 0.2.8 in #8547, but for some reason vbench results are not as astounding: new version is systematically faster, especially in all-unique cases, but the increase is nowhere near 60% (https://gist.github.com/immerrr/98a02338df8cbabb3a86).
I did refactor the benchmarks somewhat and made string values 10 chars each to be closer to real-life conditions, so that may have influenced the result.
Comment From: CarstVaartjes
My example might have been a bit skewed because of the limited number of distinct values and the short strings, I guess. Also, the big increase was only for strings in my example; the rest was 10-20%, but that's still decent, and good to see that it's still ~10% in your gist :) P.s. great work!
Comment From: immerrr
I wonder why your example compares StringHashTable(len(x)).factorize(x) to pd.factorize(x); the latter is a lot more involved:
In [5]: np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
In [6]: %timeit pd.factorize(strs)
1000 loops, best of 3: 285 µs per loop
In [7]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
...:
10000 loops, best of 3: 171 µs per loop
As for the actual performance, I'm struggling to get quadratic probing on par with the old version: in the short-string scenario it's about 5 times slower, and valgrind shows that there's a lot more probing going on, with rather expensive strcmp calls:
old version
In [2]: strs = pd.util.testing.rands_array(2, 10000)
In [3]: strs[:5]
Out[3]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)
In [4]: %%timeit
...: ht = pd.hashtable.StringHashTable(len(strs))
...: ht.factorize(strs)
...:
1000 loops, best of 3: 544 µs per loop
new version
In [4]: np.random.seed(0); strs = pd.util.testing.rands_array(2, 10000)
In [5]: strs[:5]
Out[5]: array(['SV', '1a', 'd7', 'dN', 'jt'], dtype=object)
In [6]: %%timeit
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
...:
100 loops, best of 3: 2.55 ms per loop
Comment From: immerrr
Ok, I've realized that 3.5k two-character strings that only slightly differ from each other may be a corner case for a hash function as simple as x31 and a probing scheme that favours neighbouring buckets. I've re-run it on short, heavily duplicated strings, and the newer version is indeed slightly faster.
old version
In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 158 µs per loop
In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 176 µs per loop
In [18]: pd.__version__
Out[18]: '0.15.0-6-g403f38d'
new version
In [16]: %%timeit np.random.seed(0); strs = np.array(['NY', 'IL', 'OH', 'CA'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 155 µs per loop
In [17]: %%timeit np.random.seed(0); strs = np.array(['AA', 'BB', 'CC', 'DD'], dtype=np.object_).take(np.random.randint(4, size=10000))
ht = pd.hashtable.StringHashTable(len(strs))
ht.factorize(strs)
....:
10000 loops, best of 3: 169 µs per loop
In [18]: pd.__version__
Out[18]: '0.15.0-9-g7d0d41b'
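The short-string corner case above is easy to see from the hash itself: the classic x31 string hash (h = h*31 + c, the scheme khash uses for strings) maps strings that differ only in the last character to consecutive values, so in a power-of-two table they start in adjacent buckets. A quick illustration in Python:

```python
def x31(s):
    """Classic x31 string hash: h = h*31 + byte, truncated to 32 bits."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

# Similar short strings hash to consecutive values and therefore land in
# neighbouring buckets of a power-of-two table, clustering from the start.
buckets = [x31(s) % 1024 for s in ['AA', 'AB', 'AC', 'AD']]
print(buckets)   # [32, 33, 34, 35]
```

With nearly all keys packed into one small region, any probing scheme that favours nearby buckets (linear, or the early steps of triangular probing) does far more work than on well-spread hashes.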
Comment From: jbrockmendel
@CarstVaartjes was a conclusion ever reached about this possible upgrade?
Comment From: CarstVaartjes
Hi! Not really, as I missed the later update! It's still an interesting ~10% performance increase in a core function, I think. Also, khash is really stable (it's not a moving target, so an upgrade would really get the best out of it) -> https://github.com/attractivechaos/klib/commits/master/khash.h
Comment From: WillAyd
This was attempted in https://github.com/pandas-dev/pandas/pull/49197 but ultimately reverted due to a performance degradation with isin. The hashing expert @realead has a separate issue open, https://github.com/pandas-dev/pandas/issues/39799, to convert isin to a set implementation rather than a hash-map implementation. Paired together, those changes might get us moving forward on this.
Comment From: realead
@WillAyd we investigated quadratic probing back while merging #36729. If I remember the results correctly, the downside of quadratic probing was that there were some "normal" sequences for which performance degenerated quite a bit. It was decided to keep the current approach, because its probability of degenerating on "normal" sequences is much smaller.
We have some such "normal" sequences in the performance suite, which would lead to degenerate behavior with quadratic probing. This is probably what you observed.
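As a toy illustration of the degeneration mechanism (not the exact case from the benchmark suite): with quadratic probing, every key that hashes to the same bucket walks the identical probe path, so inserting n such keys costs on the order of n²/2 probes in total:

```python
def probes_until_free(start, occupied, m):
    """Count triangular-probe steps from `start` to the first free bucket."""
    i = 0
    while (start + i * (i + 1) // 2) % m in occupied:
        i += 1
    return i

m = 1024
occupied = set()
total = 0
for n in range(100):          # 100 keys that all hash to bucket 0
    steps = probes_until_free(0, occupied, m)
    occupied.add(steps * (steps + 1) // 2 % m)
    total += steps

print(total)   # 4950 == 0 + 1 + ... + 99: quadratic in the number of keys
```

Double hashing avoids this secondary clustering because the probe stride depends on the key, at the cost of computing a second hash.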
Comment From: WillAyd
Yea, I think this comment was a clear example of that: https://github.com/pandas-dev/pandas/pull/49197#issuecomment-1320930369