[macstl-dev] Re: Question
glen.low at pixelglow.com
Tue May 17 08:29:40 WST 2005
On 17/05/2005, at 4:51 AM, Ilya Lipovsky wrote:
> I think your element_cast <> may not be such a bad idea. However, I
> am not sure it is the best, either. I think the right approach in this
> case is to extend the chunking mechanism to accommodate varying
> types, so that they can be combined natively within one loop.
> Why do I believe it to be a better idea? Because, for example, in the
> case provided in my previous email, element_cast <> will convert a
> compact representation into a strided one. This will require 2 extra
> vmrghw instructions + 1 load per iteration (to make a <float> into a
> <complex <float> >), as opposed to simply loading the 4 floats
> natively and multiplying them with the 2 registers that contain the 4
> complex <float>s. I am not even counting the wasted vperms of
> operator* <complex <float> >. We don't need the extra vperms and
> vmaddfps in a natively implemented operator* <complex <float>, float>.
> The operator could be implemented as:
> template <> struct multiplies <macstl::vec <float, 4>,
>     macstl::vec <stdext::complex <float>, 4> >
> {
>     typedef macstl::vec <float, 4> first_argument_type;
>     typedef macstl::vec <stdext::complex <float>, 4> second_argument_type;
>     typedef macstl::vec <stdext::complex <float>, 4> result_type;
>
>     result_type operator() (const first_argument_type& lhs,
>         const second_argument_type& rhs) const
>     {
>         using namespace macstl;
>         return ..... ; // this is nontrivial ;-)
>     }
> };
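For reference, the per-element arithmetic such a specialization has to perform can be sketched in scalar C++. This is a sketch under stated assumptions: the float4/cfloat4 aliases and both function names are hypothetical stand-ins, not macstl types or API, and the loops only model the work per element, not AltiVec code. It contrasts promoting each float to a complex and doing a general complex multiply with scaling the real and imaginary parts directly.

```cpp
#include <array>
#include <complex>

// Hypothetical stand-ins (not macstl types): four floats, i.e. one 128-bit
// register's worth, and the four complex values they multiply.
using float4 = std::array<float, 4>;
using cfloat4 = std::array<std::complex<float>, 4>;

// Promote-then-multiply: models element_cast <complex <float> > followed by a
// general complex multiply (4 multiplies and 2 adds in the textbook formula,
// two of the multiplies being by the zero imaginary part of the promoted value).
cfloat4 promote_then_multiply(const float4& lhs, const cfloat4& rhs) {
    cfloat4 out{};
    for (int i = 0; i < 4; ++i) {
        const std::complex<float> promoted(lhs[i], 0.0f);
        out[i] = promoted * rhs[i];
    }
    return out;
}

// Native mixed-type multiply: scale real and imaginary parts directly,
// 2 multiplies per element and nothing to permute.
cfloat4 native_multiply(const float4& lhs, const cfloat4& rhs) {
    cfloat4 out{};
    for (int i = 0; i < 4; ++i)
        out[i] = std::complex<float>(lhs[i] * rhs[i].real(),
                                     lhs[i] * rhs[i].imag());
    return out;
}
```

For finite inputs both functions give the same values; the native form simply skips the work on the zero imaginary part.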
The problem is that macstl::vec (on AltiVec) is defined as a 128-bit
quantity corresponding exactly to one vector register. In practice,
stuffing anything more into it spoils gcc 3.3's ability to enregister
macstl::vec; we need to ensure that it only ever contains one field
of native type in order to get gcc to keep it in registers.
Therefore vec <complex <float>, 4> can't work: a complex float is
64 bits, so such a beast would be 256 bits.
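The size arithmetic behind that, written as compile-time checks. This assumes 8-bit bytes and a 32-bit IEEE float; the constant names are illustrative only.

```cpp
#include <complex>

// One AltiVec vector register is 128 bits wide.
constexpr unsigned register_bits = 128;

// std::complex<float> holds a real and an imaginary float back to back
// (array-compatible since C++11), so with a 32-bit float it is 64 bits.
constexpr unsigned complex_float_bits = sizeof(std::complex<float>) * 8;

// Four of them need 256 bits, i.e. two registers rather than one.
constexpr unsigned vec_complex4_bits = 4 * complex_float_bits;
constexpr unsigned registers_needed = vec_complex4_bits / register_bits;

static_assert(sizeof(float) == 4, "assumes a 32-bit float");
static_assert(complex_float_bits == 64, "complex<float> is two floats");
static_assert(vec_complex4_bits == 256, "four complex floats are 256 bits");
static_assert(registers_needed == 2, "too big for one 128-bit register");
```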
We have to tackle the multiplication at a higher level, at the
valarray expression level.
Now element_cast <> isn't as bad as you think. Consider that the
valarray expression template engine I wrote can actually reconfigure
expressions at compile time for efficiency, e.g.
(a * b) + c
is actually recomposed into something like madd (c, a, b), i.e. what
looks like two separate operations in the expression is merged into a
single one.
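A toy illustration of that kind of compile-time recomposition. This is a minimal sketch, not the macstl engine: Array, MulExpr, MulAddExpr and evaluate are hypothetical names, and std::fma stands in for a vector multiply-add. Overload resolution sees a multiply node meeting operator+ and builds a single fused node, so evaluation runs one loop with one multiply-add per element and no temporary array for a * b.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Leaf node: a concrete array of floats.
struct Array {
    std::vector<float> data;
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// a * b builds a lazy node instead of a temporary array.
struct MulExpr {
    const Array& a;
    const Array& b;
    float operator[](std::size_t i) const { return a[i] * b[i]; }
    std::size_t size() const { return a.size(); }
};

// (a * b) + c is rewritten into one fused node, evaluated as madd (c, a, b).
struct MulAddExpr {
    const Array& c;
    const Array& a;
    const Array& b;
    float operator[](std::size_t i) const { return std::fma(a[i], b[i], c[i]); }
    std::size_t size() const { return a.size(); }
};

inline MulExpr operator*(const Array& a, const Array& b) { return {a, b}; }

// The recomposition: a MulExpr followed by + collapses into a MulAddExpr.
inline MulAddExpr operator+(const MulExpr& ab, const Array& c) {
    return {c, ab.a, ab.b};
}

// Evaluate any expression node into a concrete array in a single loop.
template <typename Expr>
Array evaluate(const Expr& e) {
    Array out;
    out.data.resize(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) out.data[i] = e[i];
    return out;
}
```

For example, with a = {1, 2, 3}, b = {4, 5, 6} and c = {7, 8, 9}, evaluate ((a * b) + c) yields {11, 18, 27}. Nodes hold references, so they are meant to be evaluated within the same full expression.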
Likewise,
element_cast <complex> (a) * b
need not actually unpack the float a into a complex and then multiply
by b; it can invoke some sort of merged operation that multiplies 2
complex by 2 float at a time. (The only limitation I see is that the
iterator would need to step through 2 complex at a time, so the float
vector may have to be loaded twice; it would take a smart loop
unroller in the compiler to see that double load and optimize it to a
single one.)
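A scalar emulation of that merged loop, to show the stepping and the double load. Assumptions: Reg is a hypothetical four-lane stand-in for a 128-bit register, fused_scale is an illustrative name, and a.size() == b.size() is taken to be a multiple of 4. A real AltiVec version would presumably load with vec_ld and build the (x, x, y, y) scale vector with vec_mergeh/vec_mergel of the float register with itself.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical emulation of one 128-bit vector register: four float lanes.
using Reg = std::array<float, 4>;

// Merged kernel for element_cast <complex> (a) * b. Steps through b one
// register (2 complex) at a time and returns the number of float-register
// loads, to show that each float register is loaded twice.
// Precondition (assumed): a.size() == b.size() and it is a multiple of 4.
std::size_t fused_scale(const std::vector<float>& a,
                        const std::vector<std::complex<float>>& b,
                        std::vector<std::complex<float>>& out) {
    std::size_t float_loads = 0;
    out.resize(b.size());
    for (std::size_t j = 0; 2 * j < b.size(); ++j) {
        // Complex register j holds b[2j], b[2j+1]; the matching floats
        // a[2j], a[2j+1] live in float register k = j / 2.
        const std::size_t k = j / 2;
        Reg fa{};
        for (int lane = 0; lane < 4; ++lane) fa[lane] = a[4 * k + lane];
        ++float_loads;
        // Duplicate the two floats we need into (x, x, y, y).
        const std::size_t off = (j % 2) * 2;
        const Reg scale{fa[off], fa[off], fa[off + 1], fa[off + 1]};
        // Interleaved (re, im, re, im) layout of the complex register.
        const Reg cb{b[2 * j].real(), b[2 * j].imag(),
                     b[2 * j + 1].real(), b[2 * j + 1].imag()};
        // One lane-wise multiply; no promotion to complex, no permutes.
        Reg prod{};
        for (int lane = 0; lane < 4; ++lane) prod[lane] = scale[lane] * cb[lane];
        out[2 * j] = {prod[0], prod[1]};
        out[2 * j + 1] = {prod[2], prod[3]};
    }
    return float_loads;
}
```

With 8 elements this does 4 float-register loads for only 2 distinct float registers, which is exactly the redundancy a smart unroller would have to spot.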
The issue then becomes whether it is convenient for users to write
element_cast. The valarray expression engine works on identical types
and has no notion of type promotion (yet). Some people have said
element_cast is rather clunky and would prefer automatic promotions
like regular C (i.e. float -> complex float, integer -> float etc.).
My worry from a syntactic point of view is that these conversions
aren't free, especially on SIMD architectures, so expensive
conversions need to be highlighted. What do you think?
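To make the trade-off concrete, here is a minimal sketch of what automatic promotion might rest on. The promote trait and mixed_multiply are hypothetical and not part of macstl. The trait picks the result element type of a mixed expression, and the mixed multiply then converts silently; that silent R (a) is exactly the conversion an explicit element_cast would force the user to spell out.

```cpp
#include <complex>
#include <type_traits>

// Hypothetical promotion trait, in the spirit of C's usual conversions.
template <typename A, typename B> struct promote;
template <typename T> struct promote<T, T> { using type = T; };
template <> struct promote<int, float> { using type = float; };
template <> struct promote<float, int> { using type = float; };
template <> struct promote<float, std::complex<float>> {
    using type = std::complex<float>;
};
template <> struct promote<std::complex<float>, float> {
    using type = std::complex<float>;
};

// Mixed multiply driven by the trait: float * complex compiles with no
// visible conversion, which is both the convenience and the hidden cost.
template <typename A, typename B>
typename promote<A, B>::type mixed_multiply(const A& a, const B& b) {
    using R = typename promote<A, B>::type;
    return R(a) * R(b);
}
```

Anything the trait does not list simply fails to compile, so the set of implicit promotions stays a deliberate, reviewable choice.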
This conversation is interesting, so I'm going to suggest we continue
it on the mailing list.
> The question, then, is how hard it is to implement such a beast. What
> is your opinion?
> I don't mind doing some coding as long as my manager(s) approve. I am
> just a soldier ;).
OK thanks for the offer. Once we thrash out what you need and what
the others are happy with, we can work something out.
Cheers, Glen Low
pixelglow software | simply brilliant stuff