[macstl-dev] Re: Question
Ilya Lipovsky
lipovsky at skycomputers.com
Thu Jun 2 08:11:04 WST 2005
Glen,
Please accept my apology for the belated reply. I cannot answer your gcc 3.4
related questions, as I am not working with it. For now, however, for
optimization's sake, add the following lines (in addition to Mike's) to
config.h:
// enable templated function classes to be expanded into AltiVec
// code by default
#define __VEC__
// maximize inlining
#define inline inline __attribute__ ((always_inline))
The last directive is essential due to the compiler being stubbornly lazy
about inlining certain nested template functions (such as the complex
fused multiply-add when used as optimization in certain situations).
Also, I'd like to note that it may be unprofitable to avoid a double load.
As hard to believe as it may be, branch misprediction is more costly than
a single load. If this is true on Freescale's 7450 (G4) then I'm more than
80% sure that it's true on the 970 (G5) as well (or even more costly!).
Branch misprediction will be happening every 2nd time on the G4,
independently of whether you have dynamic prediction enabled or not.
Also, regarding your question [in the latest email sent to the dev
mail list] about why gcc 4.0 doesn't help vectorize
your code: it will help if you read:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html
You expand your code into vector operations already... what other
improvements do you expect to get? gcc needs conventional scalar code
as input to do that.
I do not have an opinion on the c = f issue; I am not sure that I fully
understand what the problem truly is.
-Ilya
On Tue, 24 May 2005, Glen Low wrote:
> On 18/05/2005, at 2:51 AM, Ilya Lipovsky wrote:
>>>
>
>>> Now element_cast <> isn't as bad as you think. Consider that the valarray
>>> expression template engine I wrote can actually reconfigure expressions at
>>> compile time for efficiency e.g.
>>>
>>> (a * b) + c
>>>
>>> actually recomposes the expression to use something like madd (c, a, b)
>>> i.e. what looks like two separate operations in the expression can be
>>> merged into a single.
>>>
>>> That means
>>>
>>> element_cast <complex> (a) * b
>>>
>>> need not actually unpack the float a into a complex then multiply by b,
>>> but invoke some sort of merged operation which multiplies 2 complex by 2
>>> float at a time. (The only limitation I see is that the iterator would
>>> need to step through 2 complex at a time, so the float vector may have to
>>> be loaded twice -- it would take a smart loop unroller in the compiler to
>>> see that double load and optimize to a single one...)
>>>
>>>
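A minimal sketch of the (a * b) + c recomposition described above, in plain scalar C++. The names Array, Mul, MAdd and evaluate are hypothetical and are not macstl's actual types; the only point is that an operator+ overload on a "product" node can return a fused node at compile time, so no intermediate product array is ever built.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical leaf term: a plain array of floats (not macstl's valarray).
struct Array {
    std::vector<float> data;
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// Expression node for a * b. Nothing is computed until operator[] is called.
struct Mul {
    const Array& a;
    const Array& b;
    float operator[](std::size_t i) const { return a[i] * b[i]; }
    std::size_t size() const { return a.size(); }
};

// Fused node standing in for madd (c, a, b), i.e. a * b + c in one step.
struct MAdd {
    const Array& a;
    const Array& b;
    const Array& c;
    float operator[](std::size_t i) const { return a[i] * b[i] + c[i]; }
    std::size_t size() const { return a.size(); }
};

inline Mul operator*(const Array& a, const Array& b) {
    assert(a.size() == b.size());
    return Mul{a, b};
}

// The overload is chosen on the *type* of the left operand: a Mul node plus
// an Array is rewritten into a single MAdd node during overload resolution.
inline MAdd operator+(const Mul& m, const Array& c) {
    assert(m.size() == c.size());
    return MAdd{m.a, m.b, c};
}

// Compile-time check that the recomposition really yields the fused type.
static_assert(std::is_same<decltype(std::declval<Mul>() + std::declval<Array>()),
                           MAdd>::value,
              "(a * b) + c should become a MAdd node");

// Walks any expression node once and materializes the result.
template <typename Expr>
Array evaluate(const Expr& e) {
    Array out;
    out.data.reserve(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) out.data.push_back(e[i]);
    return out;
}
```

In a SIMD version the MAdd node's element access would presumably map to one vector multiply-add per chunk instead of a scalar expression; that part is left out here.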
>>
>> What bothers me is this:
>>
>> Is the chunking mechanism going to correctly iterate thru the arrays? E.g.,
>> if we're stepping thru 2 complex entries at a time, aren't we in danger of
>> stepping thru 4 real entries? Or are my fears completely unfounded, and the
>> mechanism is such that it only iterates by the least possible number of
>> elements instead of the 128 bits atom? Could you check?
>
> The chunking works off expression template objects. Each term in an
> expression becomes an expression template object, and complex terms are then
> made up of simple terms, all the way down to the leaf terms which are either
> literals or arrays (valarrays). An operator overload (or function overload)
> on 1 or 2 expression template objects yields another (arbitrary) expression
> template object; that's how the composition happens.
>
> Now each expression template object declares a chunk_begin that is a
> STL-style iterator into the chunks produced by the expression template
> object. This chunk_iterator (and const_chunk_iterator) must yield vector
> elements, whose scalar elements are the same type as the expression template
> object's value type. So for an expression template of value type float, its
> chunk iterators must have type vec <float, n>.
>
> Thus a hypothetical combination of a float valarray and a complex float
> valarray should have a chunking iterator whose type is vec <complex <float>,
> n> since complex <float> is the type of the expression. Now vec <xxx> must
> fit into a vector register, so therefore n == 2 for Altivec.
>
> A binary function ET simply takes the chunk iterator of its two constituent
> ET's and yields its own iterator. Therefore, it must take the float valarray
> chunk iterator which returns vec <float, 4> and the complex float valarray
> which returns vec <complex <float>, 2> and somehow produce a vec <complex
> <float>, 2>. A typical way to do this is for the iterator to know if it's on
> an even or odd index; if even, take the lower 2 floats of vec <float, 4> and
> combine it with the vec <complex <float>, 2>; if odd, do the corresponding
> thing. When you increment the binary function iterator, it increments the
> complex valarray iterator similarly, but only increments the float valarray
> iterator every other time and sets even or odd index appropriately. Hope that
> makes sense...
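The even/odd scheme above can be sketched in portable C++ roughly as follows. This is only an illustration under stated assumptions: std::array stands in for vec <float, 4> and vec <complex <float>, 2>, and MixedMulIterator and multiply are hypothetical names, not part of macstl. The float chunk is read on both the even and the odd step, which is the "double load" discussed earlier in the thread.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Stand-ins for the 128-bit chunk types (assumption: not the real vec <> class).
using FloatChunk   = std::array<float, 4>;                // vec <float, 4>
using ComplexChunk = std::array<std::complex<float>, 2>;  // vec <complex <float>, 2>

// Hypothetical chunk iterator for (float array) * (complex array).
// Each step yields 2 complex results, so one float chunk serves two steps.
class MixedMulIterator {
public:
    MixedMulIterator(const FloatChunk* f, const ComplexChunk* c)
        : f_(f), c_(c), odd_(false) {}

    // Even step: use floats 0 and 1 of the float chunk; odd step: floats 2 and 3.
    ComplexChunk operator*() const {
        const std::size_t base = odd_ ? 2 : 0;
        return ComplexChunk{{(*c_)[0] * (*f_)[base],
                             (*c_)[1] * (*f_)[base + 1]}};
    }

    // The complex iterator advances every step; the float iterator advances
    // only after the odd step, i.e. every other time.
    MixedMulIterator& operator++() {
        ++c_;
        if (odd_) ++f_;
        odd_ = !odd_;
        return *this;
    }

private:
    const FloatChunk* f_;
    const ComplexChunk* c_;
    bool odd_;
};

// Drives the iterator over whole arrays of chunks and flattens the result.
std::vector<std::complex<float>> multiply(const std::vector<FloatChunk>& f,
                                          const std::vector<ComplexChunk>& c) {
    assert(c.size() == 2 * f.size());  // 4 floats pair with 4 complex = 2 chunks
    std::vector<std::complex<float>> out;
    out.reserve(2 * c.size());
    MixedMulIterator it(f.data(), c.data());
    for (std::size_t i = 0; i < c.size(); ++i, ++it) {
        const ComplexChunk chunk = *it;
        out.push_back(chunk[0]);
        out.push_back(chunk[1]);
    }
    return out;
}
```

Note that the parity is carried as iterator state rather than tested against a data value, so whether it turns into an actual branch or into a select/permute depends on how the dereference is written for the target.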
>>
>>> The issue then becomes whether it is convenient for users to use
>>> element_cast. The valarray expression engine works on identical types and
>>> has no notion of type promotion (yet). Some people have said element_cast
>>> is rather clunky and would rather automatic promotions like regular C
>>> (i.e. float -> complex float, integer -> float etc.). My worry from a
>>> syntactic point of view is that these conversions aren't free, more so
>>> with SIMD architectures, so there's a need to highlight expensive
>>> conversions. What do you think?
>>>
>>>
>>
>> I side with the people [who want automatic promotion] in this case. Just
>> because a conversion is of SIMD-type doesn't mean it's not expensive.
>> Consider integer->float. I don't know anything about x86, but on ppc you
>> have to save the GPR on the stack and then reload the data into an FPU
>> register. Cheap? I don't think it's cheaper than converting a float into a
>> complex float thru the VPU, which takes 1 temp register and does:
>>
>> vxor vtemp, vtemp, vtemp /* puts +0.0 (all zero bits) in vtemp */
>>
>> and then finish with
>>
>> vmrghw vdest, vdest, vtemp /* you get the original first 2 floats in
>> complex format with the other 2 original
>> floats discarded */
>>
>> This all is going to be cheaper on VPU (at least on G4) than on scalar
>> hardware. Converting from float to int is even more trivial, requiring the
>> use of only one specialized AltiVec instruction.
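To make the two-instruction float to complex float conversion concrete, here is a scalar C++ model of it. It assumes only the documented behaviour of the two instructions: vxor is a bitwise XOR of two registers, and vmrghw interleaves the two high-order (first) words of its operands. The function names model_vxor, model_vmrghw and float_to_complex are hypothetical helpers for illustration, not AltiVec intrinsics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// A 128-bit vector register viewed as four 32-bit words, word 0 first.
using Words = std::array<std::uint32_t, 4>;

inline std::uint32_t to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Model of vxor: bitwise XOR, word by word.
inline Words model_vxor(const Words& a, const Words& b) {
    return Words{{a[0] ^ b[0], a[1] ^ b[1], a[2] ^ b[2], a[3] ^ b[3]}};
}

// Model of vmrghw: interleave the first two words of a and b.
inline Words model_vmrghw(const Words& a, const Words& b) {
    return Words{{a[0], b[0], a[1], b[1]}};
}

// Turns {f0, f1, f2, f3} into {f0, 0, f1, 0}, i.e. two complex values
// (f0 + 0i) and (f1 + 0i). The last two floats are discarded.
inline std::array<float, 4> float_to_complex(const std::array<float, 4>& src) {
    Words v;
    for (std::size_t i = 0; i < 4; ++i) v[i] = to_bits(src[i]);

    // XOR of a register with itself gives all-zero bits, which is +0.0f.
    const Words zero = model_vxor(v, v);
    const Words merged = model_vmrghw(v, zero);

    std::array<float, 4> out;
    for (std::size_t i = 0; i < 4; ++i) out[i] = from_bits(merged[i]);
    return out;
}
```

Converting the other two floats would take a second merge on the low-order words, so a full 4-float chunk expands into two complex chunks.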
>
> The conversion is cheap, but not free. People might still blithely write i +
> f where i is an int and f a float, expecting it to do the right thing, not
> aware that it costs some. If they realized that, they might be able to choose
> an algorithm which used only ints or floats. However I'm beginning to be
> persuaded to your stand, mainly because that's how (fortunately or
> unfortunately) C already works, and even the supposedly more typesafe
> descendants like Java and C# do the same.
>
> The remaining issue is a hairy one though. ET's are composed of sub-ET's,
> all the way up to the assignment operator, which then actually "does"
> something (the actual copy). So we could say c + f is always c, c * f is
> always c etc. But what about:
>
> c = f
>
> There will have to be some rewiring happening around the assignment operator
> which in C++ must be a member function, so you don't get the same flexibility
> with the binary function ET's working through free functions.
>
>
>
>