[macstl-dev] Re: Question
glen.low at pixelglow.com
Tue Jul 19 07:59:56 WST 2005
On 19/07/2005, at 12:15 AM, Ilya Lipovsky wrote:
>> Thus 0.3.1 or 0.3.2 will have always_inline turned on, but in
>> this fashion:
>> A. We'll define a macro called ALWAYS_INLINE (any other
>> suggestions?) in config.h and define it appropriately for gcc and
>> Visual C++ (and possibly ICC). Hopefully the attribute will work
>> in the same syntactic position as inline already does in C++. (I
>> don't want to redefine the built-in "inline", because at the least it
>> would conflict with client code, and I feel it's dangerous to
>> redefine built-ins.)
>> B. We'll have to tag ALL candidate functions with this, even
>> the ones within classes and class templates that are implicitly
>> inline, in order to force them to be inlined. In general this
>> should be OK, since many of them are generated by private
>> macros anyway.
>> C. A clean compile with no explicit inline tuning at the
>> compiler level should still provide the same performance level as
>> the current high inlining levels.
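A minimal sketch of how the ALWAYS_INLINE macro from point A might be spelled. These spellings are assumptions, not the contents of macstl's config.h: gcc's `__attribute__((always_inline))` (also accepted by Clang, and by ICC on Linux/Mac OS X, both of which define `__GNUC__`), Visual C++'s `__forceinline`, and plain `inline` as a fallback. The class member illustrates point B, tagging a function that is already implicitly inline. `add` and `scalar_term` are hypothetical names.

```cpp
#include <cassert>

#if defined(__GNUC__)
// gcc; Clang, and ICC on Linux/Mac OS X, also define __GNUC__.
#define ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
// Visual C++ spells a forced-inline request as __forceinline.
#define ALWAYS_INLINE __forceinline
#else
// Unknown compiler: fall back to an ordinary inline hint.
#define ALWAYS_INLINE inline
#endif

// The macro sits where "inline" would normally go (point A).
template <typename T>
ALWAYS_INLINE T add(T a, T b) { return a + b; }

// Defined inside the class, so already implicitly inline, but still
// tagged so that inlining is forced rather than merely suggested (point B).
template <typename T>
struct scalar_term {
    T value;
    ALWAYS_INLINE T operator[](unsigned) const { return value; }
};
```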
> Seems like a good idea to me. One note, however: *some* of my
> benchmarks run slightly slower with always_inline in place of regular
> inline (e.g. 54 ms as opposed to 50 ms for about 1K points). I don't
> know exactly how the compiler optimizes things, but it seems to me
> that always_inline is applied before any optimizations take place,
> which makes it harder for the optimizer to do its job because, I
> think, it is harder for it to discern the original structure. For
> gcc 3.4 there is practically no difference. gcc 4.0 is pickier
> (though it still produces correct results). It may be using some kind
> of heuristic to decide when "enough is enough" for code size, and
> then avoid any further implicit inlining. This can be avoided, I
> surmise, if we implement the ALWAYS_INLINE macro the way you've
> proposed, thus forcing the compiler to inline everything.
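One way to reproduce a comparison like the 54 ms vs. 50 ms one above is to time the same kernel in two builds, one with plain inline helpers and one with ALWAYS_INLINE helpers. This harness is an assumed sketch, not the benchmark described. It uses C++11 `std::chrono`, which postdates this thread, and `mean_ms`, `scale_add` and `kernel` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs body() reps times and returns the mean wall-clock time per run in ms.
template <typename F>
double mean_ms(F body, int reps) {
    typedef std::chrono::steady_clock timer;
    const timer::time_point start = timer::now();
    for (int i = 0; i < reps; ++i)
        body();
    const std::chrono::duration<double, std::milli> elapsed = timer::now() - start;
    return elapsed.count() / reps;
}

// Hypothetical helper: swap "inline" for ALWAYS_INLINE in the second build.
inline float scale_add(float x, float a, float b) { return x * a + b; }

// Hypothetical kernel over roughly 1K points; it returns the sum of the
// updated elements so the work cannot be discarded by the optimizer.
inline float kernel(std::vector<float>& v) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = scale_add(v[i], 0.5f, 1.0f);
        sum += v[i];
    }
    return sum;
}
```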
Interesting, I've been having similar problems. gcc 4.0 has very good
CSE (common subexpression elimination) powers, but they tend to tire
out when the chain of expressions is too long or when always_inline
interrupts it. Results have been hit and miss so far, though I'm
narrowing in on a couple of solutions. At the very least you'd get

    valarray <float> f;
    f + f

compiling to only one lvx, vs. two lvx's if the compiler cannot detect
the common subexpression "f".
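To show where the common subexpression in `f + f` comes from, here is a stripped-down expression template in scalar form. `array_ref`, `plus_term` and `plus_self_cse` are hypothetical stand-ins, not macstl's classes. Each `lhs[i]` and `rhs[i]` read is a load (an lvx in the vector case), and CSE is what lets the compiler fold the two reads of the same element into one, which `plus_self_cse` does by hand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an array term: each subscript is one load.
struct array_ref {
    const float* data;
    float operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical node for "lhs + rhs", evaluated element by element.
template <typename L, typename R>
struct plus_term {
    L lhs;
    R rhs;
    // For f + f, lhs and rhs point at the same array, so this reads
    // data[i] twice unless the compiler eliminates the common subexpression.
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

inline plus_term<array_ref, array_ref> operator+(const array_ref& lhs,
                                                 const array_ref& rhs) {
    plus_term<array_ref, array_ref> t = {lhs, rhs};
    return t;
}

// Roughly what CSE turns (f + f)[i] into: one load, reused twice.
inline float plus_self_cse(const array_ref& f, std::size_t i) {
    const float x = f[i];
    return x + x;
}
```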
The two things that have helped so far are (1) turning off -maltivec
or -mcpu=G5, which for some strange reason even affects CSE of scalar
code (I'm not sure whether the FSF version of gcc has the same
bug/feature), and (2) having array_term store elements as __vector
float instead of vec <float, 4>, and having the chunk_iterator
convert them into vec <float, 4> on the fly.
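Solution (2) might be sketched roughly as follows, under clearly stated assumptions: the GCC/Clang vector extension type `native_v4` stands in for Altivec's `__vector float` so the code builds off PowerPC, `vec4` is a hypothetical stand-in for `vec <float, 4>`, and these `array_term`/`chunk_iterator` shapes are guesses, not macstl's real definitions. The term stores raw native vectors, and the iterator wraps each one only when dereferenced, which is the function that ends up needing always_inline.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: GCC/Clang vector extension standing in for Altivec's
// __vector float, so the sketch builds on non-PowerPC machines.
typedef float native_v4 __attribute__((vector_size(16)));

// Hypothetical stand-in for vec <float, 4>: a thin wrapper.
struct vec4 {
    native_v4 data;
    vec4 operator+(const vec4& other) const {
        vec4 r;
        r.data = data + other.data;  // element-wise vector add
        return r;
    }
    float operator[](int i) const { return data[i]; }
};

// The term keeps its elements as raw native vectors, not vec4s...
struct array_term {
    const native_v4* chunks;
    std::size_t count;  // number of 4-float chunks
};

// ...and the chunk iterator converts each chunk to a vec4 on the fly.
struct chunk_iterator {
    const native_v4* p;
    // Forced inline so the wrapping can be optimized away at call sites.
    inline __attribute__((always_inline)) vec4 operator*() const {
        vec4 v;
        v.data = *p;
        return v;
    }
    chunk_iterator& operator++() { ++p; return *this; }
    bool operator!=(const chunk_iterator& other) const { return p != other.p; }
};

inline chunk_iterator chunk_begin(const array_term& t) {
    chunk_iterator it = {t.chunks};
    return it;
}

inline chunk_iterator chunk_end(const array_term& t) {
    chunk_iterator it = {t.chunks + t.count};
    return it;
}
```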
always_inline sometimes seems to kill CSE, but it's unavoidable for
solution (2), since the chunk_iterator needs to return a
Cheers, Glen Low
pixelglow software | simply brilliant stuff