Pentium vs G5

Thousand Operations per Second (greater = faster)
compiler              gcc 3.3      CW 9.3       VC++ 2003    VC++ 2003
os                    OS X 10.3.6  OS X 10.3.6  WinXP SP2    WinXP SP2
cpu                   G5 2.0 GHz   G5 2.0 GHz   P4 2.8 GHz   PM 1.6 GHz
multiply add          2000         2083         538          704
inner product         943          943          641          1220
polynomial            676          704          321          311
hypotenuse            347          370          214          131
complex multiply add  485          ---          ---          ---
predicate             2500         1220         ---          ---
slicing               275          238          118          78
power                 60.5         50.4         ---          ---
trigonometric         39.8         71.7         29.1         19.8

macstl 0.2 and the fighting compilers are back after a hiatus of over a year. For your viewing pleasure, we have cleaned out the ring, trained up the incumbent and brought in two new contenders, the everyman Pentium 4 at 2.8 GHz and the new kid Pentium M (Centrino) at 1.6 GHz, to challenge our dual PowerPC G5 at 2.0 GHz.

The wrestling ring

The benchmarks are all-new over the ones featured in macstl 0.1.5. They target significantly longer expressions such as multiply-adds, polynomials and trigonometric functions — the kind of expressions you’d use in real life. All are tuned to be single-threaded, live within L2 cache and have denormal handling off, minimizing the skew of fast dual processors and slow main memory. We’ve also set compiler options to the highest optimization levels, including strict aliasing and loop unrolling.

We’re also measuring speed-up over hand-coded scalar loops, which tells you directly how much benefit you’d get out of macstl on your platform. This test will be handy for seeing how we fare against auto-vectorizing compilers, if and when they become generally available.
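To make the "times faster than scalar" idea concrete, here is a minimal sketch of how such a ratio could be measured for the multiply add expression. It is not the actual macstl benchmark: it uses the standard std::valarray as a stand-in for the macstl-accelerated version, and the names multiply_add_scalar, multiply_add_valarray, seconds_per_call and times_faster are made up for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <valarray>
#include <vector>

// Hand-coded scalar baseline: out[i] = a[i] * b[i] + c[i].
void multiply_add_scalar(const std::vector<float>& a, const std::vector<float>& b,
                         const std::vector<float>& c, std::vector<float>& out) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = a[i] * b[i] + c[i];
}

// The same expression written once over whole arrays.
// std::valarray stands in here for a SIMD-accelerated valarray.
void multiply_add_valarray(const std::valarray<float>& a, const std::valarray<float>& b,
                           const std::valarray<float>& c, std::valarray<float>& out) {
    out = a * b + c;
}

// Hypothetical helper: average wall-clock seconds per call of f.
template <typename F>
double seconds_per_call(F f, int repeats) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}

// "Times faster than scalar": above 1 the array version wins, below 1 it loses.
double times_faster(double scalar_seconds, double array_seconds) {
    return scalar_seconds / array_seconds;
}
```

You would time both functions with seconds_per_call on arrays small enough to stay within L2 cache, then feed the two timings to times_faster. A real harness would also have to stop the optimizer from discarding the repeated calls, which this sketch does not attempt.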

Times Faster than Scalar Loops (greater = faster)
compiler              gcc 3.3      CW 9.3       VC++ 2003    VC++ 2003
os                    OS X 10.3.6  OS X 10.3.6  WinXP SP2    WinXP SP2
cpu                   G5 2.0 GHz   G5 2.0 GHz   P4 2.8 GHz   PM 1.6 GHz
multiply add          3.5          3.6          1.2          2.4
inner product         2.8          2.8          3.0          4.1
polynomial            2.3          3.2          1.1          1.4
hypotenuse            4.1          6.8          4.7          5.2
complex multiply add  3.1          ---          ---          ---
predicate             3.5          2.2          ---          ---
slicing               0.84         0.75         0.33         0.51
power                 6.7          5.4          ---          ---
trigonometric         11.8         16.1         9.6          3.6

The commentary

The speed crown is won deservedly by CodeWarrior 9.3 on the PowerPC G5. Note how on the complicated expressions like polynomial, hypotenuse and trigonometric, its throughput is higher than with gcc 3.3, markedly so for trigonometric. On those tests the macstl-generated code also shows the greatest speed-up over scalar loops.

The G5 roundly trashes both the Pentium 4 and the Pentium M, despite running at a lower clock rate than the Pentium 4. So much for the MHz myth. I put this down to the abundance of registers available in the PowerPC ISA and the proper design of the AltiVec unit, which let SIMD calculations run at full speed on the CPU rather than being hampered by loads from cache or memory.

Interestingly enough, the Pentium M holds its own against the faster-clocked Pentium 4, especially with the simpler expressions, a result of Intel’s redesign of the architecture and cache. This factors into its overall win for inner product, which makes the least use of store opcodes.

Clearly, macstl will accelerate your code on all sorts of compilers, operating systems and CPUs. Only the slicing test showed an actual slowdown over writing your own loops, while the trigonometric test showed a speed-up of 3.6x to 16.1x. So why don’t you download macstl, run the benchmark on your own system and see if it’s worth the sticker price!

Mon, 31 Jan 2005. © Pixelglow Software.