[macstl-dev] not all rosey in the gcc-4.0.0 land
glen.low at pixelglow.com
Sun Jul 3 13:46:51 WST 2005
On 03/07/2005, at 3:10 AM, Ilya Lipovsky wrote:
>>> Could you please try that with valarray <stdext::complex <float> > ?
>>> Because this is where my code fails.
>> Sure thing.
>> using namespace stdext;
>> valarray <complex <float> > v1 (complex <float> (1.0f, 2.0f), 100);
>> valarray <complex <float> > v2 (complex <float> (3.0f, 4.0f), 100);
>> std::cout << (v1 * v2).sum ();
>> benchmark has exited with status 0.
>> on my Mac as expected.
>> Perhaps the error is with particular values of v1, v2 etc.? (BTW,
>> the complex multiply then sum is also optimized to use some
>> combination of vectorized fma, from recollection, so any error
>> would start at valarray_altivec.h:154 -- test that is involved by
>> inserting a std::cout << "x" in the static "call" function.) You
>> can do a random search of the problem space by looking at
>> exhaustive.cpp and configuring it with the right functor template,
>> stdext::accumulator <stdext::plus>.
> I'd like to reiterate that the expression above works just fine
> with my code as well. It works well with -O0 and -O1 but *not* with
> -O2 and -O3. Again, I disassembled the code and ran it instruction
> by instruction to see its flow. Even with some C++ code
> rearrangement inside complex_fma the same code is generated with O2
> and O3: a ppc-decrement-counter-and-branch into itself (<label
> +offset>: bdnz label+offset) -- i.e. an empty loop that goes on
> decrementing the counter until it's zero. Afterwards it fp-loads
> the supposedly calculated value and fp-stores it into my variable.
> And it contains gibberish.
> With -O1 the loop looks & works perfectly normal (I'd say, even
The accumulating loop is found in valarray_altivec.h:154. Some of the
things I would try that you may or may not have tried already:
1. Throw a spanner into the optimizer's works. The optimizer usually
cannot optimize around an output statement or a volatile memory
write, so you can try either. E.g. create a global volatile static
int, then write to it inside the loop and at the various places you
think might be overoptimized. The place which successfully breaks the
overoptimization would give you a clue as to what level it's
2. If you're getting this error only with sum () and not with regular
assignments, e.g. vr = v1 * v2 or vr = v1 * v2 + v3, then it's a
pretty good bet it has something to do with the init parameter of the
accumulating function at line 154. Try changing the parameter
declaration there from T init to const T& init, and copying init to a
private init_copy within the function. Try making it volatile etc.
3. The code at line 154 is called from valarray_algorithm.h:60, which
is another place to try 1, 2 and other things to see whether this is
where the overoptimization happens. This is where the valarray is
examined so that only the initial sequence is vectorized, while the
left-over tail elements use a scalar loop (called tail).
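
As a rough sketch of suggestions 1 and 2 together (this is not the
macstl code; accumulate_probe, g_probe and the chunk size of 4 are
made-up names and values for illustration only), an accumulating
function with a volatile probe in the loop and a private copy of init
might look like this. The head/tail split only imitates the shape
described in 3, using plain scalar adds:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical probe: a volatile write that the compiler has to keep,
// so the loop around it cannot be emptied out entirely.
static volatile int g_probe = 0;

// Hypothetical stand-in for the accumulating "call" function.
// init is taken by const reference and copied locally (suggestion 2).
template <typename T>
T accumulate_probe (const T* data, std::size_t n, const T& init)
{
    assert (data != 0 || n == 0);

    T init_copy = init;                      // private copy of init
    const std::size_t chunk = 4;             // stands in for the vector width
    const std::size_t head = n - n % chunk;  // length of the initial sequence

    // Initial sequence, one chunk at a time (scalar here, vectorized in
    // the real thing).
    for (std::size_t i = 0; i < head; i += chunk)
    {
        for (std::size_t j = 0; j < chunk; ++j)
            init_copy += data [i + j];
        g_probe = static_cast <int> (i);     // suggestion 1: volatile write
    }

    // Tail: the left-over elements use a scalar loop.
    for (std::size_t i = head; i < n; ++i)
        init_copy += data [i];

    return init_copy;
}
```

Moving the g_probe write to different spots (inside the inner loop,
before the tail, just before the return) and rebuilding at -O2 each
time is one way to narrow down where the loop body gets dropped.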
The FSF gcc 4.0 release is dated 20 April 2005, and I suspect Apple
put in a lot of effort above and beyond that to get it working with
Altivec code for Tiger and Xcode 2.1 -- more's the pity they seem to
be all for switching to Intel. So if we can't resolve the
overoptimization we may be better off waiting for 4.0.1, and leaving
3.4.x as the only supported compiler for YDL at all optimization
levels -- according to the gcc.gnu.org site the 4.0 branch has been
frozen as of 13 June in preparation for the 4.0.1 release.
Cheers, Glen Low
pixelglow software | simply brilliant stuff