[macstl-dev] Re: Question
glen.low at pixelglow.com
Thu Jun 2 08:50:28 WST 2005
On 02/06/2005, at 8:11 AM, Ilya Lipovsky wrote:
> Please accept my apology for the belated reply. I cannot answer your
> gcc 3.4 related questions, as I am not working with it. Currently,
> however, for optimization's sake add the following lines (in
> addition to Mike's) to config.h:
> // enable templated function classes to be expanded into
> // code by default
> #define __VEC__
> // maximize inlining
> #define inline inline __attribute__ ((always_inline))
> The last directive is essential due to the compiler being stubbornly
> lazy about inlining certain nested template functions (such as the
> complex fused multiply-add when used as optimization in certain
Yes, on Apple gcc 4.0 I had to put in this to get it to inline properly:
-finline-limit=10000 --param large-function-growth=50000 --param
Note the inline limit is actually lower than in 3.3, which still
seems to work OK with the benchmark.
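A narrower alternative to redefining inline for the whole translation unit is to attach the attribute only to the functions that need it. The following is a minimal sketch, assuming a GCC-compatible compiler; FORCE_INLINE, fused_multiply_add and axpy_element are hypothetical names for illustration and are not taken from macstl.

```cpp
// Assumption: GCC or a compatible compiler that understands
// __attribute__((always_inline)); other compilers fall back to plain inline.
#if defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Hypothetical inner helper: a multiply-add written as ordinary arithmetic.
template <typename T>
FORCE_INLINE T fused_multiply_add(T a, T b, T c)
{
    return a * b + c;
}

// Hypothetical outer helper: a nested template call, the kind of case
// where the compiler's own inlining heuristics may give up.
template <typename T>
FORCE_INLINE T axpy_element(T a, T x, T y)
{
    return fused_multiply_add(a, x, y);
}
```

Calling axpy_element(2.0f, 3.0f, 4.0f) yields 10.0f. Marking individual functions leaves the inline keyword untouched for any headers included later, at the cost of annotating each hot function by hand.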
> Also, I'd like to note that it may be unprofitable to avoid a
> double load. As hard to believe as it may be, branch misprediction
> is more costly than a single load. If this is true on Freescale's
> 7450 (G4) then I'm more than 80% sure that it's true on the 970
> (G5) as well (or even more costly!). Branch misprediction will be
> happening every 2nd time on the G4, independently of whether you
> have dynamic prediction enabled or not.
We can still do a double load and avoid branching, possibly by using
the appropriate lvsr or lvsl on the even/odd index.
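To make the idea concrete, here is a portable C++ emulation of the usual AltiVec pattern: load both aligned blocks unconditionally, then select the wanted elements with a shift derived from the index (the role lvsl plus vec_perm play on real hardware). It is only a sketch under assumptions: Vec4, load_aligned_block and load_unaligned are hypothetical names, and real code would use the AltiVec intrinsics instead of scalar copies.

```cpp
#include <cstddef>

// Assumption: four floats stand in for one 16-byte AltiVec register.
struct Vec4 { float v[4]; };

// Hypothetical helper: returns the 4-float block containing element i,
// mimicking an aligned vector load that ignores the low index bits.
inline Vec4 load_aligned_block(const float* base, std::size_t i)
{
    const float* p = base + (i & ~static_cast<std::size_t>(3));
    Vec4 r = {{p[0], p[1], p[2], p[3]}};
    return r;
}

// Loads base[i] .. base[i + 3] with no branch on the value of i:
// both blocks are always loaded, and the shift picks the elements.
// The caller must make sure the block after i is readable.
inline Vec4 load_unaligned(const float* base, std::size_t i)
{
    const Vec4 lo = load_aligned_block(base, i);
    const Vec4 hi = load_aligned_block(base, i + 4);
    const std::size_t shift = i & 3;   // stands in for the lvsl mask

    float both[8];
    for (int k = 0; k != 4; ++k) {
        both[k] = lo.v[k];
        both[k + 4] = hi.v[k];
    }

    Vec4 r;
    for (int k = 0; k != 4; ++k)
        r.v[k] = both[shift + k];      // stands in for vec_perm
    return r;
}
```

With an array holding 0, 1, 2, ... 15, load_unaligned(a, 5) returns the elements 5, 6, 7, 8. The second load is wasted when i is already a multiple of four, which is the trade being discussed: one extra load in exchange for no mispredicted branch.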
The other alternative is to increment the vector iterator a float at
a time instead of "half" a float at a time. That might involve a
change to the fundamentals of macstl, so I'll have to think carefully
> Also, regarding your question [in the latest email sent to the dev
> mail list] about why gcc 4.0 doesn't help vectorizing your code: it
> will help if you check out: http://gcc.gnu.org/projects/tree-ssa/
> You expand your code into vector operations already... what other
> improvements do you expect to get? gcc needs conventional scalar
> code as input to do that.
The autovectorization I was expecting was on the scalar side of the
benchmark. Thus I would expect that my vector throughput would remain
relatively similar to 3.3, while the * over raw should go down,
almost to 1 if autovectorization is supposed to be as fast as macstl.
But the results still stubbornly show at least 3x speed up over even
the simplest loops. Mind you, the benchmarks all exercise moderately
complicated expressions, even the first multiply add is something like:
    for (int i = 0; i != size; ++i)
        a [i] = b [i] * c [i] + d [i];
But the ICC autovectorizer successfully tackles that.
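For anyone who wants to try the scalar side in isolation, the loop can be wrapped in a function like the one below. This is a sketch, not the actual benchmark code: multiply_add is a hypothetical name, and the __restrict qualifiers are my assumption that the arrays never overlap, which is typically what an autovectorizer needs to know. On gcc 4.0 I believe the vectorizer also has to be requested with -ftree-vectorize (plus -maltivec on PowerPC), since it is not part of the default optimization levels there.

```cpp
#include <cstddef>

// Hypothetical wrapper around the benchmark's multiply-add loop.
// Assumption: a, b, c and d point to non-overlapping arrays of at
// least `size` floats; __restrict is a compiler extension in C++.
void multiply_add(float* __restrict a,
                  const float* __restrict b,
                  const float* __restrict c,
                  const float* __restrict d,
                  std::size_t size)
{
    for (std::size_t i = 0; i != size; ++i)
        a[i] = b[i] * c[i] + d[i];
}
```

Compiling this one function with and without the vectorizer and comparing the generated assembly is a quick way to see whether the compiler transformed the loop at all, independent of the benchmark timings.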
> I do not have an opinion on c = f issue, I am not sure if I fully
> understand what the problem truly is.
I'll see if I can come up with a summary of the issues, and perhaps a
way forward, and ask for opinions from the others.
Cheers, Glen Low
pixelglow software | simply brilliant stuff