macstl beats Intel autovectorization

Thousand Operations per Second (greater = faster)
compiler gcc 3.3 CW 9.3 VC++ 2003 ICC 8.1
os OS X 10.3.6 WinXP SP2
cpu G5 2.0 P4 2.8 PM 1.6 P4 2.8 PM 1.6
multiply add 2000 2000 641 714 493 500
inner product 909 909 709 1250 714 909
polynomial 667 714 337 313 355 323
hypotenuse 345 370 214 131 213 134
complex multiply add 500 --- --- --- --- ---
predicate 2000 1111 --- --- --- ---
slicing 270 238 --- --- --- ---
power 60.6 50.3 --- --- --- ---
trigonometric 39.7 71.9 20.9 15.3 30.9 22.9

Autovectorization promises the power of SIMD without the pain of SIMD. An autovectorizing compiler, such as Intel ICC 8.1 or gcc 4.0, will magically convert your scalar loop code into vector instructions without you having to do it yourself. Unfortunately, vectors are finicky customers — they want to be aligned properly, ordered sequentially in memory and fed branchless code for best performance. The autovectorizer has to discover these patterns in your code that may or may not be there — an uphill task for a mere machine.

macstl offers something better than autovectorization and manual vectorization. Use the inherently parallel, standard construct of valarray, and macstl has enough information to align, order and unbranch the generated code. The proof is in the running — pitting macstl 0.2.1 against its first autovectorizing contender, Intel ICC 8.1.

Times Faster than Scalar Loops (greater = faster)
compiler gcc 3.3 CW 9.3 VC++ 2003 ICC 8.1
os OS X 10.3.6 WinXP SP2
cpu G5 2.0 P4 2.8 PM 1.6 P4 2.8 PM 1.6
multiply add 3.4 3.4 1.4 2.4 1.9 2.6
inner product 2.7 2.7 3.3 4.5 1.1 0.9
polynomial 2.3 3.2 1.0 1.4 1.4 2.25
hypotenuse 4.1 6.9 4.6 5.1 2.1 1.6
complex multiply add 3.2 --- --- --- --- ---
predicate 2.8 2.0 --- --- --- ---
slicing 0.84 0.76 --- --- --- ---
power 6.7 5.4 --- --- --- ---
trigonometric 11.7 16.2 6.2 2.8 15.6 8.7

The commentary

In terms of raw speed (thousand operations per second) for macstl-generated code, ICC 8.1 loses to VC++ 2003 in the simple multiply add and inner product, draws even on polynomial and hypotenuse, and wins in the difficult trigonometric — as befits its reputation as a powerhouse compiler.

What’s more interesting though, is how many times faster macstl was over the scalar loops. In the ICC case, several of the scalar loops were autovectorized by the compiler, yet macstl was faster than the autovectorizer in all cases except one — the inner product on a Pentium M 1.6Ghz.

In fact, the difficult trigometric test autovectorized the worst: 15.6x vs. 6.2x on Pentium 4 2.8 Ghz — macstl inlined its specially tuned trig functions, while ICC generated slow calls to an external DLL. I’ve drawn the trigonometric test for your reference at the top of the page; as you can see, you get more mileage out of macstl!

Mon, 28 Mar 2005. © Pixelglow Software.
Links
Search
Download
Purchase
Discuss