A high-performance implementation of `valarray` and associated `lib.numerics` classes, optimized for SIMD architectures like the PowerPC G4 and G5 Altivec, and the Pentium 3, Pentium 4 and Itanium MMX/SSE/SSE2/SSE3.

The valarray lets you access the raw parallel processing power of these machines without worrying about the nitty-gritty of new opcodes and hand scheduling.

- For SIMD programmers, it offers a clear, natural syntax without sacrificing speed: what could be easier than writing `sin (v1) * cos (v2) + sin (v2) * cos (v1)`? Yet this will still run 3.6x to 16.2x faster than a hand-coded scalar loop, depending on the architecture.
- For valarray programmers, it is faster than the standard implementations: the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++. Extensions smooth over some of the inconsistencies of the standard and pump it up with fixed-size and memory-mapped arrays.
- For numerics programmers, it has a wealth of parallelized algorithms such as integer division, trigonometric functions and complex number arithmetic, all carefully tested for speed and accuracy.
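
To make the comparison in the list above concrete, here is a minimal sketch of the same expression written both ways. It uses only the standard `std::valarray` from `<valarray>`, because this section does not show the library's own header or namespace; the function names `sine_of_sum` and `sine_of_sum_scalar` are hypothetical and chosen for illustration. The sketch shows the syntax being compared, not the quoted speedups.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Whole-array form: the expression from the text, with no explicit loop.
// (Hypothetical helper name; std::valarray stands in for the library's class.)
std::valarray<float> sine_of_sum(const std::valarray<float>& v1,
                                 const std::valarray<float>& v2)
{
    // Element-wise valarray operations require operands of equal size.
    assert(v1.size() == v2.size());
    std::valarray<float> result = sin(v1) * cos(v2) + sin(v2) * cos(v1);
    return result;
}

// Hand-coded scalar loop computing the same values one element at a time.
std::valarray<float> sine_of_sum_scalar(const std::valarray<float>& v1,
                                        const std::valarray<float>& v2)
{
    assert(v1.size() == v2.size());
    std::valarray<float> result(v1.size());
    for (std::size_t i = 0; i < v1.size(); ++i)
        result[i] = std::sin(v1[i]) * std::cos(v2[i])
                  + std::sin(v2[i]) * std::cos(v1[i]);
    return result;
}
```

Both functions return the same values up to floating-point rounding. By the angle-addition identity each element equals `sin(v1[i] + v2[i])`, which gives a convenient accuracy check when comparing implementations.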

The implementation is built on top of the portable SIMD vec classes, which allows new SIMD architectures to be easily swapped in.