[macstl-dev] Re: Question

Glen Low glen.low at pixelglow.com
Thu Jun 2 08:50:28 WST 2005



On 02/06/2005, at 8:11 AM, Ilya Lipovsky wrote:

> Glen,
>
> Please accept my apology for the belated reply. I cannot answer your
> gcc 3.4 related questions, as I am not working with it. For now,
> however, for optimization's sake, add the following lines (in
> addition to Mike's) to config.h:
>
>         // enable templated function classes to be expanded into
>         // AltiVec code by default
>         #define __VEC__
>         // maximize inlining
>         #define inline inline __attribute__ ((always_inline))
>
> The last directive is essential due to the compiler being stubbornly
> lazy about inlining certain nested template functions (such as the
> complex fused multiply-add when used as an optimization in certain
> situations).

Yes, on Apple gcc 4.0 I had to pass these flags to get it to inline properly:

-finline-limit=10000 --param large-function-growth=50000 --param  
inline-unit-growth=50000

Note the inline limit is actually lower than in 3.3, which still  
seems to work OK with the benchmark.
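
A narrower alternative to redefining inline globally is to put the attribute only on the hot helpers. This is a minimal sketch, not how macstl does it: MACSTL_FORCE_INLINE, fused_multiply_add and axpy_sum are hypothetical names made up for illustration. It also sidesteps the fact that the C++ standard does not allow a macro named after a keyword in a translation unit that includes standard headers.

```cpp
#include <cassert>

// Hypothetical macro, not part of macstl: force inlining on GCC-compatible
// compilers and fall back to plain inline elsewhere.
#if defined(__GNUC__)
#define MACSTL_FORCE_INLINE inline __attribute__((always_inline))
#else
#define MACSTL_FORCE_INLINE inline
#endif

// Hypothetical stand-in for a nested template helper that the compiler
// might otherwise decline to inline.
template <typename T>
MACSTL_FORCE_INLINE T fused_multiply_add(const T& a, const T& b, const T& c)
{
    return a * b + c;
}

// Hypothetical caller: sums a * x[i] + y[i] over n elements.
template <typename T>
MACSTL_FORCE_INLINE T axpy_sum(const T* x, const T* y, T a, int n)
{
    assert(n >= 0);
    T sum = T();
    for (int i = 0; i != n; ++i)
        sum = sum + fused_multiply_add(a, x[i], y[i]);
    return sum;
}
```

Whether this removes the need for the -finline-limit and --param settings above would have to be measured against the benchmark.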
>
> Also, I'd like to note that avoiding a double load may be
> unprofitable. As hard to believe as it may be, a branch misprediction
> is more costly than a single load. If this is true on Freescale's
> 7450 (G4) then I'm more than 80% sure that it's true on the 970
> (G5) as well (or even more costly!). Branch misprediction will
> happen every second time on the G4, regardless of whether you
> have dynamic prediction enabled or not.

We can still do a double load and avoid branching, possibly by using  
the appropriate lvsr or lvsl on the even/odd index.

The other alternative is to increment the vector iterator a float at  
a time instead of "half" a float at a time. That might involve a  
change to the fundamentals of macstl, so I'll have to think carefully  
about it.
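
The first option, a double load with no branch, can be sketched in portable C++ that models the usual AltiVec idiom vec_perm(vec_ld(0, p), vec_ld(15, p), vec_lvsl(0, p)). This is only an emulation for illustration, not macstl code: Block16 and the function names are hypothetical, and memory is modelled as a byte array plus an offset so that the low four bits of the offset play the role of the address alignment. I am assuming the even index lands on a 16-byte boundary and the odd index 8 bytes past one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const std::size_t kVectorBytes = 16;

// Hypothetical stand-in for a 16-byte AltiVec register.
struct Block16 { unsigned char bytes[kVectorBytes]; };

// Models vec_ld (lvx): the low four bits of the address are ignored,
// so the load always fetches the enclosing aligned block.
inline Block16 load_block(const unsigned char* mem, std::size_t offset)
{
    Block16 r;
    std::memcpy(r.bytes, mem + (offset & ~(kVectorBytes - 1)), kVectorBytes);
    return r;
}

// Models vec_lvsl: permute control {s, s+1, ..., s+15}, where s is the
// low four bits of the address.
inline Block16 shift_left_control(std::size_t offset)
{
    Block16 r;
    for (std::size_t i = 0; i != kVectorBytes; ++i)
        r.bytes[i] = static_cast<unsigned char>((offset & 15) + i);
    return r;
}

// Models vec_perm: each control byte indexes the 32-byte concatenation a:b.
inline Block16 permute(const Block16& a, const Block16& b, const Block16& control)
{
    unsigned char both[2 * kVectorBytes];
    std::memcpy(both, a.bytes, kVectorBytes);
    std::memcpy(both + kVectorBytes, b.bytes, kVectorBytes);
    Block16 r;
    for (std::size_t i = 0; i != kVectorBytes; ++i)
        r.bytes[i] = both[control.bytes[i] & 31];
    return r;
}

// Same instruction sequence for aligned and misaligned offsets: two loads
// and one permute, with no test on the alignment.
inline Block16 load_unaligned(const unsigned char* mem, std::size_t size,
                              std::size_t offset)
{
    assert(size % kVectorBytes == 0 && offset + kVectorBytes <= size);
    Block16 lo = load_block(mem, offset);       // vec_ld(0, p)
    Block16 hi = load_block(mem, offset + 15);  // vec_ld(15, p)
    return permute(lo, hi, shift_left_control(offset));
}
```

When the offset is already aligned, both loads hit the same block and the control vector selects it unchanged, so the even case pays for a redundant load but never for a mispredicted branch.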

>
> Also, regarding your question [in the latest email sent to the dev  
> mail list] about why gcc 4.0 doesn't help vectorizing your code: it  
> will help if you check out:
> http://gcc.gnu.org/projects/tree-ssa/vectorization.html
>
> You expand your code into vector operations already... what other  
> improvements do you expect to get? gcc needs conventional scalar  
> code as input to do that.

The autovectorization I was expecting was on the scalar side of the
benchmark. Thus I would expect my vector throughput to remain
roughly the same as with 3.3, while the speedup over raw (the
"* over raw" figure) should go down, almost to 1 if autovectorization
is supposed to be as fast as macstl. But the results still stubbornly
show at least a 3x speedup on even the simplest loops. Mind you, the
benchmarks all exercise moderately complicated expressions; even the
first multiply-add is something like:

for (int i = 0; i != size; ++i)
     a [i] = b [i] * c [i] + d [i];

But the ICC autovectorizer successfully tackles that.
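
For anyone who wants to try the scalar side in isolation, here is the same loop as a self-contained function. It is a sketch under my own assumptions: multiply_add is a hypothetical name, not the benchmark's, and the __restrict__ qualifiers (a GCC extension) are added only because possible aliasing between the arrays is a common reason for a vectorizer to give up. According to the gcc vectorization page, the vectorizer is enabled with -ftree-vectorize, and on PowerPC it also needs -maltivec.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical name for the scalar ("raw") side of the multiply-add test.
// __restrict__ promises the compiler that the four arrays do not overlap.
void multiply_add(float* __restrict__ a,
                  const float* __restrict__ b,
                  const float* __restrict__ c,
                  const float* __restrict__ d,
                  std::size_t size)
{
    assert(a && b && c && d);
    for (std::size_t i = 0; i != size; ++i)
        a[i] = b[i] * c[i] + d[i];
}
```

I have not checked whether these hints are enough for 4.0 to vectorize it.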
>
> I do not have an opinion on the c = f issue, as I am not sure I fully
> understand what the problem truly is.

I'll see if I can come up with a summary of the issues, and perhaps a  
way forward, and ask for opinions from the others.

Cheers, Glen Low


---
pixelglow software | simply brilliant stuff
www.pixelglow.com
aim: pixglen
