Split off from #3114 ...
It doesn't seem (from my testing) that numpy guarantees 128-bit alignment at the start of an array (which is odd, since malloc usually does that no matter what, from my testing), and it certainly doesn't 128-bit align each row/column, since rows/columns are always stored contiguously.
Apparently it can be done manually via a method like the one described below: http://mail.scipy.org/pipermail/scipy-user/2009-March/020283.html (This just aligns the start of the array, but if you overallocate more and play with strides you can do the same thing to every c-contig row or f-contig column.)
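In numpy terms, the trick from that post boils down to something like this (a minimal sketch; the helper name is made up, and this only aligns the start of the array):

import numpy as np

def aligned_start_empty(n, dtype=np.float64, align=16):
    # Over-allocate by `align` extra bytes, then slice off the front so the
    # returned view's data pointer is a multiple of `align` (16 bytes =
    # 128 bits, the SSE register width).
    itemsize = np.dtype(dtype).itemsize
    buf = np.empty(n * itemsize + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    return buf[offset:offset + n * itemsize].view(dtype)

a = aligned_start_empty(1000)
assert a.ctypes.data % 16 == 0  # holds regardless of what malloc returned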
The downside is that this has to be done everywhere allocation currently happens...even with a helper function to do it, it's going to be somewhat messy.
Also, to avoid too much overhead, I think this should only be done when each row/column is at least 256 bits (32 bytes) naturally, so the absolute worst case space penalty is 33% + 96 bits constant (on a 32-bit machine) or 20% + 64 bits constant (on a 64-bit machine)...
We can bump that up a bit higher to 512 bits if that still seems like too much, but it means that DataFrames with 5-7 float64/int64 columns could be pathologically misaligned. (So basically, this won't help at all until at least 9 columns are present...)
Comment From: stephenwlin
@y-p
alignment can yield big wins. most def.
need to validate that this is what's actually happening though.
Already almost 99% sure that arrays are not always 128-bit aligned at the start (I tried testing for it manually at runtime from Cython, requiring both input and output arrays to be 128-bit aligned basically eliminates the effect of my refinement, meaning they're almost never both aligned unless randomly so), I don't know if that's causing the entire performance/variability issue but it can't be helping.
Comment From: ghost
This sounds like it might involve some pain to apply throughout?
Just getting some perf numbers on a toy example, measuring the benefit if any, would be real interesting...
This really is turning into an ATLAS/BLAS style line of research.. :)
Comment From: stephenwlin
@y-p can probably just target a few key cases (block consolidation, DataFrame construction, etc.) to handle 90%...and with a helper function
Comment From: ghost
Making performant code often involves making the codebase more fragmented and less readable, and it raises the entry barrier for new developers to come in. It's just a good idea to do a spot check to make sure the perf gains are significant enough before monkey-patching the libc with inline asm....
Would this be something users can employ in their own code, so that if they invest the effort, pandas perf will benefit? Or does it strictly require changes to pandas itself?
Comment From: stephenwlin
@y-p I think just handling allocations that Python does internally is good enough...if someone passes us a non-aligned array then we just won't trigger the optimized case (detecting alignment at runtime is easy, just convert the pointer to an integer and mod 16)...to be honest, just handling the block consolidation case might do 75% of what we need in practice (but not for the test suite, since it almost never consolidates), and that's just one line of code.
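(For reference, the check itself is trivial from pure Python too; a sketch using numpy's ctypes interface:)

import numpy as np

def is_128bit_aligned(arr):
    # numpy exposes the data pointer as a plain integer; mod 16 tests
    # 16-byte (128-bit) alignment.
    return arr.ctypes.data % 16 == 0

print(is_128bit_aligned(np.empty(100, dtype=np.float64)))  # frequently False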
Comment From: stephenwlin
Also, notice that on the low end, I've gotten results with my SIMD-triggering small-array optimization like:
frame_reindex_axis0 0.4789 1.0887 0.4399
frame_reindex_axis0 0.4191 0.9355 0.4480
This happened more than once and on an overnight run with no competing processes, so it's probably not a fluke.
The high end results:
frame_reindex_axis0 1.4141 0.4546 3.1107
frame_reindex_axis0 1.4211 0.4163 3.4140
make sense too because it's repeatedly calling the vectorized code in a loop, but the vectorized code needs to check for alignment at the beginning and end of both input and output source arrays. If the stride is not a multiple of 128 bits, then the alignments will be random (since the indexing is random as well) and branch prediction will not only be defeated, you'll actually pay a significant penalty having to revert the mispredicted pipeline. (See "Why is processing a sorted array faster than an unsorted array?" on Stack Overflow.)
Comment From: ghost
That looks really promising.
Comment From: stephenwlin
@wesm any thoughts on the space overhead? (Again, the worst case would be at 288 bits per row, bumped up to 384 bits per row, all downhill from there. Guess we have to see how much performance improves at that size first, of course; just curious if you think it's enough overhead to be relevant.)
Comment From: jreback
I suspect there might be a nice trade off here
say you increase memory by 1.5x but get a 1.5x speed improvement; I would take that for <10000 rows, for sure
Comment From: stephenwlin
@jreback looks more like absolute-worst-case 33% (i.e. 1.33x) space for up to 125% (i.e. 2.25x) improvement (or possibly more, with further refinement); we'll see after more testing :)...we can always bump up the minimum size for alignment, but that means that anyone with fewer columns per block than the minimum gets short shrift even if they'd prefer the speed
Comment From: jreback
then that's even better
wonder how that scales to lots of rows ...
Comment From: stephenwlin
@y-p non-cached builds have the same variability, apparently, so your build cacher is vindicated :) (sorry for the bad guess!)
Comment From: stephenwlin
if 288 -> 384 is ok, then 192 -> 256 should be ok too (same ratio)...so this allows speedup anytime >= 3 int64/float64 are stored together
Comment From: stephenwlin
wonder how that scales to lots of rows ...
The percent improvement should stay constant (or even get better, due to better branch prediction) as the number of rows increases; the real variable is the size of the copied row (not sure if that's what you meant...)...presumably it should improve the most with smaller rows and then gradually even out until you get to 256 bytes (which is where we start calling memmove() anyway).
Comment From: ghost
@stephenwlin , do you have a direct way of checking the alignment of arrays? or are you inferring it from benchmarks?
I thought perhaps it would be worthwhile to repeat the tests with py33, just in case some of the internal wiring saw changes during the great migration; if improvements to this were made in the last couple of years, they probably would have landed in 3.x, not 2.7.
Comment From: stephenwlin
@y-p I can check at runtime easily; it's just hard to get that information out as output from the test suite directly. Basically, I just turned off the optimization except in 128-bit alignment cases, and almost all the performance changes (positive and negative) disappeared (because doubles are only 32-bit aligned on my machine by default, so having two arrays that both happen to be 128-bit aligned is rare). Turning it off except in 64-bit alignment cases made the effect larger than in the 128-bit requirement case, but not as large as turning off the requirement entirely. (That's where I'm getting the results that vary from ~0.45 to ~3.4.) EDIT: I have definitely confirmed that numpy arrays are not 128-bit aligned by default.
Comment From: stephenwlin
this works for allocating SIMD-friendly arrays: http://pastebin.com/RnmmVsiz
output: http://pastebin.com/fQVG4Nks
(it handles row/column alignment as long as it can do so within 4/3 of the original size, and it's better than the version above in that it uses the largest usable type for allocation rather than wasting a constant 15 bytes every time...i.e. if uint64 is usable, then the maximum penalty is 8 bytes, if uint32 then 12 bytes, if uint16 then 14 bytes...it even works with float96. EDIT: will now also fall back to smaller power-of-2 alignments if possible within 4/3 of the original size)
can one of you on 64-bit run it and make sure all the assertions pass? I can't imagine that they wouldn't...
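(In case the pastebin goes away: the core idea is roughly the following. This is a reconstruction/sketch, not the actual script, which is smarter about picking the largest usable allocation type; the helper name is made up, and it assumes the alignment is a multiple of the itemsize:)

import numpy as np
from numpy.lib.stride_tricks import as_strided

def aligned_rows_empty(rows, cols, dtype=np.float64, align=16, max_ratio=4.0 / 3.0):
    # Pad the row stride up to a multiple of `align` bytes so every
    # C-contiguous row starts on a 128-bit boundary, but only if the
    # padding stays within 4/3 of the original size.
    dtype = np.dtype(dtype)
    row_bytes = cols * dtype.itemsize
    padded = -(-row_bytes // align) * align  # round up to a multiple of align
    if padded > row_bytes * max_ratio:
        return np.empty((rows, cols), dtype)  # padding too costly; fall back
    buf = np.empty(rows * padded + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align  # shift the start to an aligned address
    base = buf[offset:offset + rows * padded].view(dtype)  # itemsize divides align
    return as_strided(base, shape=(rows, cols), strides=(padded, dtype.itemsize))

a = aligned_rows_empty(4, 3)  # 3 float64s/row: 24 bytes padded to 32
assert all(a[i].ctypes.data % 16 == 0 for i in range(4))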
Comment From: jreback
In [21]: %run /home/jreback/aligned_empty.py.txt
1 uint8 1 => 1 (1.0x)
2 uint8 2 => 2 (1.0x)
3 uint8 3 => 4 (1.33333333333x)
4 uint8 4 => 4 (1.0x)
5 uint8 5 => 6 (1.2x)
6 uint8 6 => 8 (1.33333333333x)
7 uint8 7 => 8 (1.14285714286x)
8 uint8 8 => 8 (1.0x)
9 uint8 9 => 12 (1.33333333333x)
10 uint8 10 => 12 (1.2x)
11 uint8 11 => 12 (1.09090909091x)
12 uint8 12 => 16 (1.33333333333x)
13 uint8 13 => 16 (1.23076923077x)
14 uint8 14 => 16 (1.14285714286x)
15 uint8 15 => 16 (1.06666666667x)
16 uint8 16 => 16 (1.0x)
1 uint16 2 => 2 (1.0x)
2 uint16 4 => 4 (1.0x)
3 uint16 6 => 8 (1.33333333333x)
4 uint16 8 => 8 (1.0x)
5 uint16 10 => 12 (1.2x)
6 uint16 12 => 16 (1.33333333333x)
7 uint16 14 => 16 (1.14285714286x)
8 uint16 16 => 16 (1.0x)
9 uint16 18 => 24 (1.33333333333x)
10 uint16 20 => 24 (1.2x)
11 uint16 22 => 24 (1.09090909091x)
12 uint16 24 => 32 (1.33333333333x)
13 uint16 26 => 32 (1.23076923077x)
14 uint16 28 => 32 (1.14285714286x)
15 uint16 30 => 32 (1.06666666667x)
16 uint16 32 => 32 (1.0x)
1 uint32 4 => 4 (1.0x)
2 uint32 8 => 8 (1.0x)
3 uint32 12 => 16 (1.33333333333x)
4 uint32 16 => 16 (1.0x)
5 uint32 20 => 24 (1.2x)
6 uint32 24 => 32 (1.33333333333x)
7 uint32 28 => 32 (1.14285714286x)
8 uint32 32 => 32 (1.0x)
9 uint32 36 => 48 (1.33333333333x)
10 uint32 40 => 48 (1.2x)
11 uint32 44 => 48 (1.09090909091x)
12 uint32 48 => 48 (1.0x)
13 uint32 52 => 64 (1.23076923077x)
14 uint32 56 => 64 (1.14285714286x)
15 uint32 60 => 64 (1.06666666667x)
16 uint32 64 => 64 (1.0x)
1 float64 8 => 8 (1.0x)
2 float64 16 => 16 (1.0x)
3 float64 24 => 32 (1.33333333333x)
4 float64 32 => 32 (1.0x)
5 float64 40 => 48 (1.2x)
6 float64 48 => 48 (1.0x)
7 float64 56 => 64 (1.14285714286x)
8 float64 64 => 64 (1.0x)
9 float64 72 => 80 (1.11111111111x)
10 float64 80 => 80 (1.0x)
11 float64 88 => 96 (1.09090909091x)
12 float64 96 => 96 (1.0x)
13 float64 104 => 112 (1.07692307692x)
14 float64 112 => 112 (1.0x)
15 float64 120 => 128 (1.06666666667x)
16 float64 128 => 128 (1.0x)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python2.7/site-packages/ipython-0.13.1-py2.7.egg/IPython/utils/py3compat.pyc in execfile(fname, *where)
176 else:
177 filename = fname
--> 178 __builtin__.execfile(filename, *where)
/home/jreback/aligned_empty.py.txt in <module>()
78 test(16, dtype='uint32')
79 test(16, dtype='float64')
---> 80 test(16, dtype='float96')
Comment From: jreback
fyi...this is how I set up a virtual 32-bit box; you can easily do the same for 64-bit. It has maintained a persistent ssh connection (I have not restarted it), and I've built many times (e.g. I just installed numpy 1.7.0 to test out a change the other day). This is not the exact link I used, but it should get you started.
http://www.vagrantup.com/ http://docs.openstack.org/diablo/openstack-compute/admin/content/openstack-compute-installation-using-virtualbox-vagrant-and-chef.html
Comment From: stephenwlin
Hmm, can you do a 64-bit virtual box within a 32-bit host?
TypeError Traceback (most recent call last)
/usr/local/lib/python2.7/site-packages/ipython-0.13.1-py2.7.egg/IPython/utils/py3compat.pyc in execfile(fname, *where)
176 else:
177 filename = fname
--> 178 __builtin__.execfile(filename, *where)
/home/jreback/aligned_empty.py.txt in <module>()
78 test(16, dtype='uint32')
79 test(16, dtype='float64')
---> 80 test(16, dtype='float96')
I guess your machine doesn't support 'float96' (at least in 64-bit mode?). 96 doesn't divide evenly into 64 bits, so that makes sense. Anyway, I'll take a look when I get a 64-bit build either way; 'float96' is a corner case anyway, but I wanted to test a type whose size wasn't a factor of 128 bits.
Comment From: jreback
these are remote hosts, so you can pick exactly what you want (arch and os), then you build your environment and such
Comment From: stephenwlin
oh ok...well, I might just dual boot, it won't be that hard
Comment From: stephenwlin
OK, something very screwy is going on with frame_reindex_axis0...I've verified in a C testbed that taking advantage of alignment and contiguity for small arrays can help by up to 40% (and this can be applied to other algos too, probably with greater benefits if more than just copying is being done), but there's too much noise from other factors and unpredictable outliers to get a clean read from vb_suite.
I'll update this in a week or so after doing a lot of systematic testing (50 vb-suite overnights) with targeted changes; that should get to the bottom of this.
EDIT: ok, just found this, testing a dummy commit (just changed a release notes file) against its immediate parent
frame_reindex_axis0 1.4083 0.4687 3.0049
something is definitely amiss and it will make it hard to get clean performance numbers until we figure this out...
Independent of the frame_reindex_axis0 issue though, I do get performance results like:
reindex_frame_level_reindex 0.8219 1.6047 0.5122
reindex_frame_level_reindex 0.8178 1.4096 0.5802
reindex_frame_level_reindex 0.8347 1.4042 0.5944
reindex_frame_level_align 0.8747 1.4033 0.6233
reindex_frame_level_reindex 0.8170 1.2948 0.6310
reindex_frame_level_align 0.8664 1.3697 0.6325
reindex_frame_level_reindex 0.8344 1.3103 0.6368
reindex_frame_level_reindex 0.8205 1.2783 0.6419
reindex_frame_level_reindex 0.7967 1.2393 0.6428
reindex_frame_level_reindex 0.8046 1.2517 0.6428
for the top 10 improvements in reindex_frame_level_*, versus:
reindex_frame_level_align 1.3920 1.3016 1.0695
reindex_frame_level_reindex 1.3499 1.2603 1.0711
reindex_frame_level_align 1.3774 1.2707 1.0839
reindex_frame_level_align 1.4130 1.2864 1.0984
reindex_frame_level_reindex 1.4262 1.2599 1.1320
reindex_frame_level_reindex 1.4292 1.2603 1.1340
reindex_frame_level_align 1.4512 1.2794 1.1343
reindex_frame_level_reindex 1.4277 1.2287 1.1620
reindex_frame_level_reindex 1.4711 1.2557 1.1715
reindex_frame_level_reindex 1.4835 1.2424 1.1941
for the top 10 worst regressions, across 50 tests (0.84586 mean +- 0.17275 standard deviation)...these two tests specifically copy small arrays over and over, so they would be particularly alignment-sensitive. So I'm still betting on alignment as part of the issue: if we fix the alignment problem we ought to be able to lock in the improvements (modulo whatever random variation is coming from the unknown noise affecting frame_reindex_axis0 in particular, but possibly the rest of the suite as well).
Comment From: ghost
tweaked test_perf a bit; it should have less overhead now, which is significant if you're running a short test many times:
λ time ./test_perf.sh -t 7d11531 -b da54321 -r frame_reindex_axis0
real 0m15.694s
user 0m12.598s
sys 0m1.804s
Comment From: stephenwlin
0.84586 mean +- 0.17275 standard deviation
actually, I just realized the geometric mean is more important here, since these are ratios (ratios skew the arithmetic mean higher, since the average of 1/x and x is at least 1 for positive x)
In [28]: np.exp(np.log(df[3]).mean()) - 1.0
Out[28]: -0.17197250073993742
In [29]: np.exp(np.log(df[3]).std()) - 1.0
Out[29]: 0.23146744636695282
so it's an average of 17% improvement, +- 23% std deviation...if the deviation is due to a factor that can be controlled somehow, then the improvement could easily be greater than the current average.
Comment From: stephenwlin
Here are performance results from C with SSE, 32-bit Ubuntu Linux with GCC 4.7.2, -O3 flag.
naive memmove contig sse
count
2 1 0.694165 1.886186 5.275820
3 1 0.690138 1.393083 4.196950
4 1 0.491735 1.142775 3.224919
5 1 0.608015 1.122087 3.527070
6 1 0.546905 1.279570 4.005941
7 1 0.358374 1.449010 3.783525
8 1 0.565996 1.465419 3.898268
9 1 0.641629 1.338270 3.209346
10 1 0.666388 1.251177 3.018939
11 1 0.495000 1.115493 2.635607
12 1 0.715945 1.387879 3.212425
13 1 0.803279 1.288380 2.797422
14 1 0.784871 1.330000 2.979633
15 1 0.574341 1.358223 2.288217
16 1 0.773923 1.365252 2.237864
17 1 0.837805 1.227882 1.839357
18 1 0.865311 1.298379 1.956897
19 1 0.708071 1.302797 1.714467
20 1 0.880435 1.406537 1.860811
21 1 0.914513 1.289720 1.877551
22 1 0.965116 1.321393 1.814208
23 1 0.855867 1.386364 1.775132
24 1 0.953092 1.419048 1.977876
25 1 0.995519 1.318497 1.821038
26 1 1.023810 1.367179 1.893466
27 1 0.977011 1.436114 2.023810
28 1 0.996992 1.364198 1.894286
29 1 1.083062 1.303922 1.950147
30 1 1.092715 1.330645 1.976048
31 1 0.988848 1.379668 1.996997
32 1 1.108534 1.390852 2.030349
33 1 1.185121 1.418219 1.932299
34 1 1.451556 1.659553 2.373547
35 1 1.229607 1.652792 2.319088
36 1 1.175795 1.389353 1.858939
37 1 1.313297 1.416503 1.912467
38 1 1.233550 1.336345 1.945906
39 1 1.411504 1.689619 2.155405
40 1 1.454049 1.676810 2.315942
41 1 1.477778 1.645361 2.225941
42 1 1.499051 1.628866 2.234795
43 1 1.398762 1.564787 2.164159
44 1 1.482662 1.641079 2.225035
45 1 1.528155 1.581910 2.162088
46 1 1.564307 1.520349 2.254310
47 1 1.414923 1.553447 2.243867
48 1 1.515444 1.686359 2.350299
49 1 1.570707 1.640295 2.260174
50 1 1.587810 1.607741 2.247076
51 1 1.430698 1.574207 2.121379
52 1 1.531000 1.616684 2.225291
53 1 1.596074 1.526680 2.210300
54 1 1.592516 1.609244 2.197991
55 1 1.440419 1.577244 2.134181
56 1 1.528398 1.566528 2.159026
57 1 1.593717 1.592050 2.137640
58 1 1.680131 1.626850 2.246715
59 1 1.494516 1.539014 2.172464
60 1 1.561458 1.556594 2.105337
61 1 1.638368 1.517875 2.141210
62 1 1.597430 1.541322 2.152958
63 1 1.491474 1.494472 2.121255
64 1 1.541841 1.568085 2.099715
...
497 1 1.359706 1.174881 1.393025
498 1 1.349453 1.151869 1.371985
499 1 1.324887 1.167464 1.365672
500 1 1.315406 1.124708 1.356203
501 1 1.339171 1.155008 1.354147
502 1 1.302785 1.165595 1.384909
503 1 1.349309 1.170264 1.394286
504 1 1.297445 1.144928 1.346591
505 1 1.327869 1.151659 1.376771
506 1 1.345725 1.130367 1.379048
507 1 1.337037 1.138801 1.342007
508 1 1.378505 1.179057 1.412835
509 1 1.327256 1.153724 1.376181
510 1 1.313474 1.146400 1.364762
511 1 1.351054 1.183936 1.415946
512 1 1.345725 1.149206 1.388303
"count" is the number of doubles to be copied in a loop, "naive" is the equivalent of the simple cython loop (variable strides), "memmove" is with a simple call to memmove (no threshold), "contig" is loop that assumes contiguous copies, and "sse" is an aligned sse version, higher numbers are better.
So SSE and alignment can definitely make a difference (probably even more so in other algorithms, this is just a basic example).
EDIT: Updated the results above with a further refinement.
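(If anyone wants to sanity-check the alignment sensitivity from pure Python without the C testbed, something like the following can manufacture deliberately misaligned views to time against aligned ones; the helper name is made up, and how visible the difference is will depend on numpy version and hardware:)

import numpy as np
import timeit

def view_at_offset(n, offset, dtype=np.float64, align=16):
    # Return an n-element view whose data pointer sits `offset` bytes past
    # a 16-byte boundary (offset=0 gives a 128-bit-aligned view).
    itemsize = np.dtype(dtype).itemsize
    buf = np.empty(n * itemsize + align + offset, dtype=np.uint8)
    start = (-buf.ctypes.data) % align + offset
    return buf[start:start + n * itemsize].view(dtype)

src_a, dst_a = view_at_offset(16, 0), view_at_offset(16, 0)  # both 16-byte aligned
src_m, dst_m = view_at_offset(16, 8), view_at_offset(16, 8)  # both 8 bytes off

def copy(dst, src):
    dst[:] = src

print(timeit.timeit(lambda: copy(dst_a, src_a), number=100000))
print(timeit.timeit(lambda: copy(dst_m, src_m), number=100000))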
Comment From: ghost
test_perf can now save the results frame in a pickle file, as of a12d480. Hope it helps you collect and analyze run data.
λ ./test_perf.sh -t 85e7db7 -b fefdcd8 -r axis0 -d o.pickle
...
-----------------------------------------------------------------------
Test name | target[ms] | base[ms] | ratio |
-----------------------------------------------------------------------
stats_rank2d_axis0_average 28.9161 28.8471 1.0024
frame_reindex_axis0 0.8263 0.8147 1.0143
-----------------------------------------------------------------------
Test name | target[ms] | base[ms] | ratio |
-----------------------------------------------------------------------
Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234
Target [85e7db7] : BUG/CLN: Exception in HDFStore are now ValueError or TypeError
Base [fefdcd8] : BUG: GH3163 fixed to_csv with a boundry condition issue at the chunksize break
*** Results were also written to the logfile at '/home/user1/src/pandas/vb_suite.log'
*** The results DataFrame was written to '/home/user1/src/pandas/vb_suite/o.pickle'
In [22]: pd.load("vb_suite/o.pickle")
Out[22]:
t_head t_baseline ratio
name
stats_rank2d_axis0_average 28.916097 28.847098 1.002392
frame_reindex_axis0 0.826338 0.814686 1.014302
Comment From: stephenwlin
@y-p thanks--I'm already scripted up to use cat, grep, sed, sort, and cut from the command line, but maybe I'll transition to this next time since it's probably more robust.
Comment From: stephenwlin
results are updated above; now small arrays with <=10 doubles improve 3x-5x, arrays with <=64 doubles improve up to 2x, and big arrays seem to get a steady 1.36x improvement.
Comment From: stephenwlin
Wow! Here's what you get when you apply this to copying 32-bit values (i.e. float32/int32):
naive memmove contig sse
count
1 1 0.465275 0.885417 3.491656
2 1 0.598496 1.037311 2.432639
3 1 0.573469 0.685784 3.568254
4 1 0.672474 0.860908 4.546039
5 1 0.808379 0.752618 2.009643
6 1 0.859197 0.828036 4.660550
7 1 0.954169 0.935468 5.056848
8 1 0.975012 1.070801 5.876506
9 1 1.078116 0.701514 5.273713
10 1 1.073359 0.753096 5.791667
11 1 1.155860 0.821912 6.476667
12 1 1.074197 0.920304 6.978417
13 1 0.572901 1.554756 7.072727
14 1 0.728261 1.664953 7.619608
15 1 0.662453 1.787097 8.012397
16 1 1.252747 1.900000 8.500000
17 1 0.767801 1.291417 6.670103
18 1 1.498067 1.351710 7.069343
19 1 0.882809 1.400868 7.533074
20 1 1.636210 1.484267 7.798387
21 1 0.880237 1.816729 6.978339
22 1 1.183797 1.921492 7.278810
23 1 0.972320 1.918570 7.697211
24 1 1.770147 2.074034 7.954733
25 1 1.120949 1.444444 6.588435
26 1 2.043340 1.616221 6.854610
27 1 1.120000 1.659794 6.974729
28 1 2.070513 1.719610 7.231343
29 1 1.152109 2.181102 6.059375
30 1 1.562197 2.222989 6.238710
31 1 1.253896 2.307049 6.394040
32 1 2.194318 2.404732 6.590444
Comment From: jreback
that's on a 32-bit arch, so would you expect similar on 64-bit with 64-bit types?
Comment From: stephenwlin
The performance of the optimized version for 64-bit types should be similar to the first results I posted before, since the SSE2 instructions are the same (also, I'm actually using a 64-bit processor anyway, just in 32-bit mode; I don't think pure SSE2 code is different depending on the mode).
However, the baseline speed for 64-bit types is probably faster on 64-bit arch, so the speedup might not be as noticeable (i.e. the optimized version will be just as optimized, but the baseline it's compared against will not be as slow).
Comment From: jreback
makes sense. the point of 64-bit sse is to do more operations at a time. the advantage we have (with your knowledge) is that we know things about the data that the general case cannot know and thus cannot handle
Comment From: stephenwlin
Here are performance numbers on x86-64: instead of normalizing by row, I decided to print out the raw times this time; however, I normalized the number of iterations to keep the total number of bytes copied as close to constant as possible, barring rounding...(learned a lot of fun facts about x86 performance tuning while doing this, by the way!)
First, the results using double as the element type (i.e. numpy.float64)..."count" is the number of double columns copied at a time over 100 randomly re-indexed rows:
naive memmove contig sse
count
1 3.01s 7.07s 2.83s 2.19s
2 2.04s 3.53s 1.72s 1.12s
3 1.53s 2.35s 1.37s 0.91s
4 1.39s 1.90s 1.50s 0.71s
5 1.41s 1.60s 1.29s 0.63s
6 1.70s 1.26s 1.70s 0.52s
7 1.69s 1.08s 1.69s 0.52s
8 1.67s 0.94s 0.93s 0.54s
9 1.66s 0.84s 0.90s 0.53s
10 1.66s 0.80s 0.88s 0.49s
11 1.65s 0.73s 0.86s 0.48s
12 1.65s 0.67s 0.86s 0.44s
13 1.64s 0.62s 0.84s 0.45s
14 1.64s 0.58s 0.81s 0.41s
15 1.63s 0.58s 0.80s 0.43s
16 1.63s 0.51s 0.76s 0.40s
17 1.63s 0.58s 0.75s 0.43s
18 1.62s 0.74s 0.77s 0.39s
19 1.63s 0.75s 0.73s 0.40s
20 1.62s 0.73s 0.84s 0.44s
32 1.98s 0.63s 0.72s 0.48s
33 1.84s 0.70s 0.75s 0.53s
64 1.47s 0.58s 0.75s 0.56s
65 1.46s 0.66s 0.77s 0.56s
128 1.27s 0.63s 0.85s 0.61s
129 1.28s 0.64s 0.78s 0.59s
256 1.20s 0.73s 0.91s 0.71s
257 1.19s 0.90s 0.88s 0.72s
512 1.19s 0.89s 1.03s 0.87s
513 1.20s 1.05s 1.03s 0.87s
1024 1.19s 0.90s 1.04s 0.88s
1025 1.20s 1.09s 1.06s 0.91s
2048 1.22s 0.89s 1.08s 0.92s
2049 1.22s 1.06s 1.08s 0.92s
4096 1.31s 1.11s 1.24s 1.06s
4097 1.32s 1.16s 1.19s 1.15s
8192 1.68s 1.60s 1.82s 1.51s
8193 1.81s 1.62s 1.84s 1.54s
16384 2.11s 2.02s 2.01s 1.96s
16385 2.10s 2.07s 2.01s 2.02s
And here's using unsigned char as the element type (i.e. numpy.uint8, which is what numpy.bool_ uses under the hood)...unlike in some previous tests I've posted, this does not rely on over-copying arrays, since it's not possible to know in general whether you're copying into rounded-up alignment slack space or into real data, so it's actually slower to copy 15 bytes (1 uint64 + 1 uint32 + 1 uint16 + 1 uint8) than 16 bytes (1 __m128i, the SSE intrinsic type):
naive memmove contig sse
count
1 23.97s 52.21s 22.74s 17.73s <-- no SSE until size 16, actually,
2 14.96s 26.08s 13.75s 9.00s -- but uses other alignment-dependent optimizations
3 12.26s 17.43s 10.54s 7.08s
4 11.19s 13.03s 9.98s 4.47s
5 10.43s 10.48s 8.90s 4.84s
6 13.64s 8.69s 8.18s 3.66s
7 13.50s 7.46s 8.20s 3.45s
8 13.40s 6.51s 6.52s 2.51s
9 13.34s 5.79s 6.78s 2.67s
10 13.28s 5.21s 6.72s 2.38s
11 13.40s 4.82s 6.92s 2.55s
12 13.19s 4.34s 6.71s 1.97s
13 13.16s 4.01s 6.49s 2.16s
14 13.12s 3.73s 6.42s 1.94s
15 13.11s 3.50s 6.16s 2.11s
16 13.08s 3.26s 5.98s 1.08s <-- pure SSE
17 13.07s 3.31s 5.85s 1.90s
18 13.05s 2.89s 5.96s 1.79s
19 13.01s 2.97s 5.88s 1.92s
20 13.02s 2.60s 5.69s 1.64s
31 15.24s 1.84s 5.29s 1.65s
32 14.76s 1.78s 5.11s 0.65s <-- pure SSE
33 14.58s 1.71s 5.12s 1.08s
63 11.70s 0.90s 4.77s 1.00s
64 11.65s 0.90s 4.70s 0.53s
65 11.59s 0.96s 4.70s 0.92s
127 10.08s 0.57s 4.52s 0.76s
128 10.07s 0.52s 4.48s 0.40s <-- pure SSE
129 10.06s 0.64s 4.48s 0.85s
255 9.29s 1.34s 4.69s 0.66s
256 9.28s 0.61s 4.69s 0.48s
257 9.28s 1.35s 4.74s 0.82s
511 8.90s 0.92s 4.91s 0.61s
512 8.86s 0.56s 4.85s 0.55s <-- pure SSE
513 8.88s 0.87s 4.85s 0.81s
1023 8.69s 0.71s 4.87s 0.66s
1024 8.69s 0.61s 4.84s 0.59s <-- pure SSE
1025 8.69s 0.70s 4.88s 0.87s
2047 8.60s 0.96s 4.99s 0.74s
2048 8.60s 0.73s 4.96s 0.71s <-- pure SSE
2049 8.59s 1.01s 4.84s 0.77s
4095 8.70s 1.15s 5.14s 0.89s
4096 8.56s 0.88s 5.15s 0.87s <-- pure SSE
4097 8.55s 1.18s 4.85s 0.87s
8191 8.52s 1.14s 5.13s 0.89s
8192 8.53s 0.90s 5.12s 0.87s <-- pure SSE
8193 8.53s 1.19s 4.84s 0.91s
16383 8.52s 1.17s 5.14s 0.93s
16384 8.53s 0.85s 5.14s 0.92s <-- pure SSE
16385 8.52s 1.18s 4.86s 0.94s
Results for other data types are somewhere in between these two extremes.
Interestingly, the naive version (basically what Cython does by default, with generic code handling variable strides) is basically just as performant as any of the optimized ones for large enough double array sizes, which is probably why the memmove() optimization doesn't show up very much on vbenches. It appears that the processor (a new Intel Core i5 in this case, but same results on an older Core 2, more or less) is able to figure out what's actually happening even without the alignment and SIMD hints and do the optimal thing behind the scenes, once the block size is big enough. (The compiler is definitely not doing anything tricky; I checked the assembly output.)
Anyway, so this definitely does make a difference, just not in the case we happen to be exercising in vbench (double elements, and large double arrays in particular). But the code is currently, to put it charitably, a wee bit less maintainable than what's currently in Cython (don't even ask =D...beyond the dependency on SSE2 headers, there's lots of macros, and autogenerated code where macros won't work easily), so I'm not going to productionize it until I find a more compelling reason than what we have so far...
In particular, since take is basically just copying over and over, I'm guessing it's more amenable to processor pipeline optimization, even without SIMD, than other algorithms are. So when I find some time to continue this I'm going to try to see if I can get similar results for other algos, using the same approach...will report findings when I have some...
Comment From: jreback
@stephenwlin any follow up on this?
Comment From: wesm
I'm sorry that @stephenwlin is no longer with us. I have created #33 to ensure we do aligned / padded allocations in pandas 2.0 where we are allocating our own memory.