How Fast is the Valarray, Really?

Operations per Clock Tick (larger = faster)
operation               gcc 3.1 libstdc++  macstl 0.1, Altivec off  macstl 0.1, Altivec on
inline arithmetic                     446                      888                    3460
inline transcendental                  74                       78                    1052
outline transcendental                 80                       99                      39
inline scalarization                 1485                     1488                    4291
unchunked apply                       408                      408                     404
unchunked slice                      2932                     2865                    2890
unchunked mask                        221                      161                     168
unchunked indirect                    358                      425                     540

Here is a set of benchmarks that show how the implementation stacks up.

Each test is run untimed for a few loops, then timed for many loops, and a throughput value is calculated. The source is available in the main.cpp inside the download.
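In modern C++, that warm-up-then-time pattern can be sketched roughly as follows. The throughput helper and its parameters are illustrative names, not the actual code in main.cpp:

```cpp
#include <chrono>
#include <cstddef>

// Sketch of the benchmark pattern: run the operation untimed a few times
// to warm caches, then time many iterations and derive a throughput.
// (throughput and op are hypothetical names for illustration only.)
template <typename Op>
double throughput (Op op, std::size_t warmup, std::size_t timed)
{
    for (std::size_t i = 0; i != warmup; ++i)
        op ();                                      // untimed warm-up loops

    auto start = std::chrono::steady_clock::now ();
    for (std::size_t i = 0; i != timed; ++i)
        op ();                                      // timed loops
    auto stop = std::chrono::steady_clock::now ();

    std::chrono::duration <double> elapsed = stop - start;
    return timed / elapsed.count ();                // operations per second
}
```

The warm-up pass matters on a cache-bound workload like this one, since the first few iterations would otherwise charge cold-cache misses to the timed loop.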

I used a size of 1000 elements for all valarrays. This intentionally keeps the data cache-bound, which maximizes the effect of Altivec code on slow-bus architectures like the test machine: a dual 450 MHz Power Macintosh G4 with 2 x 1 MB L2 cache and 1 GB of memory, running Mac OS X 10.2.6. Mileage will differ on a bandwidth-tuned Power Macintosh G5.

The code is compiled using the gcc 3.1 libstdc++ valarray classes and also with the macstl 0.1 valarray implementation, both with the Altivec optimizations turned off (by commenting out the appropriate chunk_traits specialization) and with them turned on. The compiler switches used were:

-O3 -faltivec -fstrict-aliasing -save-temps

As you can see, the inline arithmetic test is 7.76x faster, the inline transcendental test is 14.21x faster and the inline scalarization test is 2.89x faster than gcc scalar code. The combination of vector code and inlining is unbeatable.

The outline transcendental test is actually slower than gcc scalar code, showing how much is lost by calling into separately compiled modules. And the unchunked rates are comparable to or worse than gcc, indicating areas for more performance tuning.

A Closer Look

Even in the non-optimized case, macstl code is almost twice as fast as gcc's on the inline arithmetic test. A look at the compiled PowerPC opcodes for the inner loop of the following expression reveals why: keep an eye on the all-important loads and stores, which could access slow memory.

std::valarray <float> vf1 (vf2 + vf3 + vf4);
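One way such a fused inner loop can arise is through expression templates. The following is a toy sketch of the mechanism, not the actual macstl internals: each + builds a lightweight proxy, and only the final assignment walks the data, so the loop body does exactly three loads and one store per element.

```cpp
#include <cstddef>
#include <vector>

// Toy expression-template sketch (illustrative, not macstl's real code).
template <typename E> struct expr { };          // CRTP tag for indexable things

struct vec : expr <vec>
{
    std::vector <float> data;
    explicit vec (std::size_t n, float x = 0): data (n, x) { }
    float operator[] (std::size_t i) const { return data [i]; }
    std::size_t size () const { return data.size (); }

    // Assigning an expression runs one fused loop: 3 loads, 2 adds, 1 store.
    template <typename E> vec& operator= (const expr <E>& e)
    {
        const E& ee = static_cast <const E&> (e);
        for (std::size_t i = 0; i != data.size (); ++i)
            data [i] = ee [i];
        return *this;
    }
};

// Proxy representing l + r without evaluating it.
template <typename L, typename R>
struct add_expr : expr <add_expr <L, R> >
{
    const L& l; const R& r;
    add_expr (const L& l, const R& r): l (l), r (r) { }
    float operator[] (std::size_t i) const { return l [i] + r [i]; }
    std::size_t size () const { return l.size (); }
};

template <typename L, typename R>
add_expr <L, R> operator+ (const expr <L>& l, const expr <R>& r)
{
    return add_expr <L, R> (static_cast <const L&> (l), static_cast <const R&> (r));
}
```

With this in place, `d = a + b + c;` compiles down to a single loop over the elements instead of allocating and filling an intermediate array for each +.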

Compiled Opcodes
gcc 3.1 libstdc++:

    lwz r9,0(r3)
    slwi r2,r12,2
    lwz r4,4(r3)
    addi r12,r12,1
    lwz r11,4(r9)
    lwz r10,0(r9)
    lwz r7,4(r11)
    lwz r6,4(r10)
    lfsx f0,r7,r2
    lfsx f1,r6,r2
    lwz r0,4(r4)
    fadds f2,f1,f0
    lfsx f3,r2,r0
    fadds f1,f2,f3
    stfs f1,0(r5)
    addi r5,r5,4
    bdnz L172

macstl 0.1, Altivec off:

    slwi r2,r11,2
    addi r11,r11,1
    lfsx f4,r4,r2
    lfsx f0,r5,r2
    lfsx f3,r6,r2
    fadds f2,f4,f0
    fadds f1,f2,f3
    stfsx f1,r2,r8
    bdnz L452

macstl 0.1, Altivec on:

    slwi r2,r9,4
    addi r9,r9,1
    lvx v1,r5,r2
    lvx v0,r4,r2
    lvx v13,r6,r2
    vaddfp v0,v0,v1
    vaddfp v1,v0,v13
    stvx v1,r2,r8
    bdnz L504

Since the expression is adding 3 valarrays and storing into 1 valarray, the theoretical minimum number of loads is 3 and stores is 1.
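That theoretical minimum is what a hand-written scalar loop achieves: three loads, two adds and one store per element. (add3 is an illustrative name, not part of either library.)

```cpp
#include <cstddef>

// Hand-written equivalent of vf1 = vf2 + vf3 + vf4.
// Each iteration: 3 loads, 2 adds, 1 store -- the theoretical minimum.
void add3 (float* out, const float* a, const float* b, const float* c,
           std::size_t n)
{
    for (std::size_t i = 0; i != n; ++i)
        out [i] = a [i] + b [i] + c [i];
}
```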

In the gcc case, there are 7 extraneous lwz to load various pointers, 3 lfsx to load the actual floats and 1 stfs to store the result. The lwz are strictly unnecessary, since these pointers are not modified within the loop.

In the macstl without Altivec case, the code has eliminated the 7 lwz, removed 1 loop index increment addi and replaced the stfs with a stfsx to reuse the loop index r2. In the macstl with Altivec case, the code has replaced the scalar lfsx with the vector lvx, the scalar fadds with vaddfp and the stfsx with the vector stvx, within the same number of opcodes.

Thus as you can see, my library succeeded in getting rid of all the extraneous loads and stores, and reducing opcode count from 17 to just 9 — hand tuning would save an additional opcode at most. The optimized version exactly replaced all the vectorizable opcodes as well.

More Results

A set of results for the new Power Mac G5 would really tell how good the new architecture is. Gentle reader, you could run this on your G5; or better still, donate a G5 to me, and I won't complain.

Mon, 29 Sep 2003. © Pixelglow Software.