[Benchmark table: operation | gcc 3.1 libstdc++ | macstl 0.1, Altivec off | macstl 0.1, Altivec on]
Here's a set of benchmarks that show how the implementation stacks up.
Each test is run untimed for a few loops, then timed for many loops, and a throughput value is calculated. The source is available in main.cpp inside the download.
I used a size of 1000 elements for all valarrays. This intentionally keeps the data cache-bound, which maximizes the effect of Altivec code on slow bus architectures like that of the test machine, a dual 450 MHz Power Macintosh G4 with 2 x 1 MB L2 cache and 1 GB memory, running Mac OS X 10.2.6. Mileage will differ with a bandwidth-tuned Power Macintosh G5.
The code is compiled with the gcc 3.1 libstdc++ valarray classes and also with the macstl 0.1 valarray implementation, both with the Altivec optimizations turned off (by commenting out the appropriate chunk_traits specialization) and with them turned on. The compiler switches used were:
-O3 -faltivec -fstrict-aliasing -save-temps
As you can see, the inline arithmetic test is 7.76x faster, the inline transcendental test is 14.21x faster and the inline scalarization test is 2.89x faster than gcc scalar code. The combination of vector code and inlining is unbeatable.
The outline transcendental test is actually slower than gcc scalar code, showing how much is lost by calling into separately compiled modules. And the unchunked rates are comparable to or worse than gcc, indicating areas for more performance tuning.
Even in the non-optimized case, macstl code is almost twice as fast as gcc's in the inline arithmetic case. A look at the compiled PowerPC opcodes for the inner loop of the following expression reveals why: keep an eye on the all-important loads and stores, which may access slow memory.
std::valarray <float> vf1 (vf2 + vf3 + vf4);
[Opcode listings: gcc 3.1 libstdc++ | macstl 0.1, Altivec off | macstl 0.1, Altivec on]
Since the expression is adding 3 valarrays and storing into 1 valarray, the theoretical minimum number of loads is 3 and stores is 1.
In the gcc case, there are 7 extraneous lwz to load various pointers, 3 lfsx to load the actual floats and 1 stfs to store the result. The lwz are strictly unnecessary, as these pointers are not modified within the loop.
In the macstl without Altivec case, the code has eliminated the 7 lwz, removed 1 loop index increment addi, and replaced the stfs with a stfsx to reuse the loop index r2.
In the macstl with Altivec case, the code has replaced the scalar lfsx with the vector lvx and the scalar stfsx with the vector stvx, within the same number of opcodes.
Thus, as you can see, my library succeeded in getting rid of all the extraneous loads and stores, reducing the opcode count from 17 to just 9; hand tuning would save one additional opcode at most. The optimized version also replaced all the vectorizable opcodes exactly.
A set of results for the new Power Mac G5 would really tell how good the new architecture is. Gentle reader, you could run this on your G5, or better still, donate me a G5; I won't complain.