[macstl-dev] macstl 0.3.0.32
Glen Low
glen.low at pixelglow.com
Fri Aug 12 23:56:08 WST 2005
On 11/08/2005, at 11:14 PM, Glen Low wrote:
> Dear All
>
> Just committed the new version of macstl to the Subversion
> repository. This has extensive but largely transparent changes to
> support better CSE and inlining, especially when compiling
> -faltivec without -maltivec on Apple gcc 4.0. I have optimized
> literal terms in valarray expressions as well as element access to
> valarray.
A little elaboration on the improvements in 0.3.0.32:
1. Better CSE. Compiling with -faltivec without -maltivec on Apple
gcc 4.0, the following sorts of expressions will have minimal loads:

    valarray <float> v1, v2, vr;

    // 1 load, 3 adds in the inner loop instead of 4 loads, 3 adds
    vr = v1 + v1 + v1 + v1;

    // 2 loads, 2 adds in the inner loop instead of 4 loads, 3 adds
    vr = (v1 + v2) + (v1 + v2);

You will get slightly worse results with -faltivec and -maltivec on
together.
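To make the load counts concrete, here is a minimal scalar sketch of what the two inner loops reduce to once the common subexpressions are shared. It is only an illustration: it uses std::valarray and hand-written loops rather than macstl's chunked (vector) iterators, and the function names sum4_cse and pair_sum_cse are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical helper: vr = v1 + v1 + v1 + v1 with the load shared.
std::valarray<float> sum4_cse(const std::valarray<float>& v1)
{
    std::valarray<float> vr(v1.size());
    for (std::size_t i = 0; i != v1.size(); ++i)
    {
        const float x = v1[i];      // 1 load
        vr[i] = x + x + x + x;      // 3 adds
    }
    return vr;
}

// Hypothetical helper: vr = (v1 + v2) + (v1 + v2) with the sum shared.
std::valarray<float> pair_sum_cse(const std::valarray<float>& v1,
                                  const std::valarray<float>& v2)
{
    assert(v1.size() == v2.size());
    std::valarray<float> vr(v1.size());
    for (std::size_t i = 0; i != v1.size(); ++i)
    {
        const float s = v1[i] + v2[i];  // 2 loads, 1 add
        vr[i] = s + s;                  // 1 more add
    }
    return vr;
}
```

Without CSE, each mention of v1 or v2 in the source expression would turn into its own load inside the loop, which is where the "4 loads" figure comes from.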
2. Inlining. macstl no longer requires turning inline limits up
to the wazoo on gcc, Visual C++ and ICC; it uses the minimal amount
of forced inlining to get inner loops to compile as one code path.
Besides helping with compile times, this also helps with -faltivec
without -maltivec, which otherwise won't inline vector code.
3. Literal terms. Any literal terms should be faster both in the
inner loop (especially with -faltivec without -maltivec) and in the
prolog/epilog sections. E.g.

    vr = 3.0 + v1;  // 3.0 is a literal term
4. Element access to valarray. Previous versions had poor element
access code for chunked valarrays, so v1 [0] would generate poor code.
Also, if the valarray was chunked but the entire expression wasn't,
evaluating it would be much slower than the equivalent scalar code.
Thanks to gcc 4.0 having better support for proxies and temporaries,
I was able to rearrange the iterators so that element access is now
as fast as C element access and unchunked expressions should evaluate
almost as fast as the C equivalent. All this rework is still fully
aliasing compliant, so you should still be able to access by element
and do chunked operations without worrying that the compiler is going
to reorder them wrongly.
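As a rough picture of what "access by element and do chunked operations" means in practice, the sketch below interleaves element writes with a whole-array add. It uses std::valarray as a stand-in for macstl's valarray, and mixed_access is a hypothetical function name; the point is only that each whole-array operation has to see the element writes that precede it.

```cpp
#include <valarray>

// Hypothetical demo function, not part of macstl.
std::valarray<float> mixed_access()
{
    std::valarray<float> v1(1.0f, 4);   // four elements, all 1.0f
    std::valarray<float> v2(2.0f, 4);   // four elements, all 2.0f

    v1[0] = 10.0f;                      // element write ...
    std::valarray<float> vr = v1 + v2;  // ... must be visible to the whole-array add
    vr[1] = 0.0f;                       // element write after the add
    return vr;                          // {12, 0, 3, 3}
}
```

If the compiler were allowed to move the element write past the whole-array add, vr[0] would come out as 3 instead of 12.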
Hint: if you want to see if a particular expression is chunked or
not, look for the chunk_begin member e.g.

    // compiles because (v1 + v2) + v3 is chunked
    ((v1 + v2) + v3).chunk_begin ();

    // doesn't compile because atan (v1) isn't chunked -- no vectorized
    // version of atan available yet. atan (v1) on 0.3 and earlier used
    // to be slower than the implied hand-coded loop, but now it should
    // be almost the same speed
    atan (v1).chunk_begin ();

or if you're using gcc or an equivalent, you can use __typeof and
look for the const_chunk_iterator typedef e.g.

    __typeof ((v1 + v2) + v3)::const_chunk_iterator;  // exists
    __typeof (atan (v1))::const_chunk_iterator;       // doesn't exist
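The "does it compile?" check can also be turned into a compile-time boolean. The following is a sketch only, assuming a C++11 compiler: chunked_expr and unchunked_expr are hypothetical stand-ins for expression types (macstl's real types are different), and is_chunked is a made-up trait name, not something macstl provides.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for a chunked and an unchunked expression type.
struct chunked_expr   { const float* chunk_begin() const { return nullptr; } };
struct unchunked_expr { };

// Primary template: used when T has no callable chunk_begin member.
template <typename T, typename = void>
struct is_chunked : std::false_type { };

// Partial specialization: selected only when t.chunk_begin() is a valid
// expression for a const T; otherwise substitution fails quietly (SFINAE).
template <typename T>
struct is_chunked<T, decltype(void(std::declval<const T&>().chunk_begin()))>
    : std::true_type { };

static_assert(is_chunked<chunked_expr>::value, "has chunk_begin");
static_assert(!is_chunked<unchunked_expr>::value, "no chunk_begin");
```

With a trait like this, is_chunked<decltype(expr)>::value reports the same thing as trying to call chunk_begin by hand, without breaking the build when the answer is no.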
Hint: Using __typeof is also handy if you want to store the expression
off for subsequent (re)evaluation, rather than evaluate down to a
valarray or statarray e.g.

    __typeof (v1 + v2) temp = v1 + v2;
    vr = temp + temp;

is more efficient than

    valarray <float> temp = v1 + v2;
    vr = temp + temp;

Not as good as having CSE, but at least you don't pay for extraneous
stores and temp memory.
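Why storing the expression avoids the stores and the temp memory is easier to see with a toy expression template. This is a minimal sketch, not macstl's implementation: sum_expr, lazy_add, evaluate and twice_sum are all hypothetical names, std::vector stands in for valarray, and C++11 auto plays the role of __typeof.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical lazy sum node: holds references to its operands and
// computes each element only when asked for it.
template <typename L, typename R>
struct sum_expr
{
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

template <typename L, typename R>
sum_expr<L, R> lazy_add(const L& l, const R& r)
{
    return sum_expr<L, R>{l, r};
}

// The only place elements are written to memory.
template <typename E>
std::vector<float> evaluate(const E& e)
{
    std::vector<float> out(e.size());
    for (std::size_t i = 0; i != e.size(); ++i)
        out[i] = e[i];
    return out;
}

// Equivalent of: __typeof (v1 + v2) temp = v1 + v2; vr = temp + temp;
std::vector<float> twice_sum(const std::vector<float>& v1,
                             const std::vector<float>& v2)
{
    auto temp = lazy_add(v1, v2);           // no element storage, no loop yet
    return evaluate(lazy_add(temp, temp));  // one loop, one store per element
}
```

In this sketch temp holds only two references, so nothing is stored until evaluate runs. The price is that v1 + v2 is recomputed for each use of temp, which matches the "not as good as having CSE" caveat. The operands must also outlive the stored expression, since it only refers to them.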
5. -faltivec without -maltivec. Code compiled with these options
should be comparable to, if not faster than, code compiled with
-faltivec and -maltivec: several spurious non-vectorized memcpy's
were eliminated.
Cheers, Glen Low
---
pixelglow software | simply brilliant stuff
www.pixelglow.com
aim: pixglen