[macstl-dev] Re: Question
lipovsky at skycomputers.com
Wed May 18 02:51:16 WST 2005
On Tue, 17 May 2005, MacSTL wrote:
> On 17/05/2005, at 4:51 AM, Ilya Lipovsky wrote:
>> Hi Glen,
>> I think your element_cast <> may not be such a bad idea, but I am not
>> sure it is the best one either. I think the right approach here is to
>> expand the chunking mechanism so that it can adapt to varying types,
>> combining them natively within one loop.
>> Why do I believe this is a better idea? Because, in the case provided in
>> my previous email, element_cast <> will convert a compact representation
>> into a strided one. That costs 2 extra vmrghw instructions plus 1 load
>> per iteration (to turn a <float> into a <complex <float>>), as opposed
>> to simply loading the 4 floats natively and multiplying them with the 2
>> registers that hold the 4 complex<float>'s. I am not even counting the
>> wasted vperm's of operator* <complex<float> >. We don't need the extra
>> vperm's and vmaddfp's in a natively implemented
>> operator* <float, complex<float> > case. The operator should be able to
>> be implemented as follows:
>> template <> struct multiplies <macstl::vec <float, 4>,
>>     macstl::vec <stdext::complex <float>, 4> >
>>     {
>>         typedef macstl::vec <float, 4> first_argument_type;
>>         typedef macstl::vec <stdext::complex <float>, 4> second_argument_type;
>>         typedef macstl::vec <stdext::complex <float>, 4> result_type;
>>
>>         result_type operator() (const first_argument_type& lhs,
>>             const second_argument_type& rhs) const
>>             {
>>                 using namespace macstl;
>>                 return ..... ; // this is nontrivial ;-)
>>             }
>>     };
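For reference, the scalar semantics that specialization has to vectorize are just per-element scaling -- here is a plain C++ sketch of what I mean (a hypothetical reference function, no AltiVec involved):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Scalar reference for elementwise vec<float,4> * vec<complex<float>,4>:
// each complex element is scaled by its float, out[i] = lhs[i] * rhs[i].
// The nontrivial part on AltiVec is only the data layout: the 4 floats
// fit in one register while the 4 complex<float>'s occupy two.
std::vector<std::complex<float> >
scale_by_floats (const std::vector<float>& lhs,
                 const std::vector<std::complex<float> >& rhs)
    {
        std::vector<std::complex<float> > out (rhs.size ());
        for (std::size_t i = 0; i != rhs.size (); ++i)
            out [i] = lhs [i] * rhs [i];
        return out;
    }
```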
> The problem is that macstl::vec (on Altivec) is defined as a 128-bit quantity
> corresponding exactly to one vector register. In practice stuffing anything
> more spoils gcc 3.3's ability to enregister macstl::vec -- we need to ensure
> that it only ever contains one field of native type in order to get gcc to
> keep it in registers only.
> Therefore vec <complex <float>, 4> can't work -- a complex float is 64 bits,
> and such a beast would therefore be 256 bits.
> We have to tackle the multiplication at a higher level, at the valarray
> expression level.
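To spell out the size arithmetic behind that constraint (the layouts below are illustrative, not macstl's actual definitions):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Illustrative layouts only. A wrapper that gcc 3.3 can keep enregistered
// must span exactly one 128-bit (16-byte) AltiVec vector register, with a
// single field of native vector type; plain float arrays stand in here.
struct vec_float_4            { float data [4]; };               // 16 bytes: fits
struct vec_complex_float_4    { std::complex<float> data [4]; }; // 32 bytes: too big

bool fits_one_altivec_register (std::size_t bytes)
    { return bytes <= 16; }
```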
Could you elaborate on that? Also, how involved would such an
implementation be? Can we change something to make it easier to
implement? It's very important for me to understand this in as thorough a
manner as possible. Thanks in advance.
> Now element_cast <> isn't as bad as you think. Consider that the valarray
> expression template engine I wrote can actually reconfigure expressions at
> compile time for efficiency e.g.
> (a * b) + c
> actually recomposes the expression to use something like madd (c, a, b) i.e.
> what looks like two separate operations in the expression can be merged
> into a single one.
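If I understand the recomposition right, it can be modeled in a few lines of expression-template C++ -- a toy sketch of the idea, not the actual engine:

```cpp
#include <cassert>
#include <cstddef>

// Toy expression-template model: operator+ pattern-matches a multiply
// node and builds a fused node, so (a * b) + c is evaluated as one
// madd-style operation per element, never materializing the product.
struct Arr  { const float* p; };
struct Mul  { Arr a, b; };
struct Madd
    {
        Arr a, b, c;
        float operator[] (std::size_t i) const
            { return a.p [i] * b.p [i] + c.p [i]; }
    };

inline Mul  operator* (Arr a, Arr b)  { Mul  m = { a, b };        return m; }
inline Madd operator+ (Mul m, Arr c)  { Madd f = { m.a, m.b, c }; return f; }
```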
> That means
> element_cast <complex> (a) * b
> need not actually unpack the float a into a complex then multiply by b, but
> invoke some sort of merged operation which multiplies 2 complex by 2 float at
> a time. (The only limitation I see is that the iterator would need to step
> through 2 complex at a time, so the float vector may have to be loaded twice
> -- it would take a smart loop unroller in the compiler to see that double
> load and optimize to a single one...)
What bothers me is this:
Will the chunking mechanism iterate through the arrays correctly?
E.g., if we're stepping through 2 complex entries at a time, aren't we in
danger of stepping through 4 real entries? Or are my fears completely
unfounded, and the mechanism iterates by the least possible number of
elements rather than by the 128-bit atom? Could you check?
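To make the concern concrete, here is a scalar model (a hypothetical helper, not macstl code) of the stepping I have in mind:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of the iteration only: stepping n_complex elements 2 at a time
// consumes floats [i, i+2), i.e. half of a 4-float chunk per step, so
// each float chunk ends up loaded twice -- the double load that a smart
// loop unroller would have to merge into one.
std::vector<std::size_t> float_chunk_loads (std::size_t n_complex)
    {
        std::vector<std::size_t> loads ((n_complex + 3) / 4, 0);
        for (std::size_t i = 0; i < n_complex; i += 2)  // 2 complex per 128-bit step
            ++loads [i / 4];                            // which 4-float chunk is touched
        return loads;
    }
```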
> The issue then becomes whether it is convenient for users to use
> element_cast. The valarray expression engine works on identical types and has
> no notion of type promotion (yet). Some people have said element_cast is
> rather clunky and would rather have automatic promotions as in regular C (i.e.
> float -> complex float, integer -> float etc.). My worry from a syntactic
> point of view is that these conversions aren't free, more so with SIMD
> architectures, so there's a need to highlight expensive conversions. What do
> you think?
I side with the people [who want automatic promotion] in this case. Just
because a conversion runs on SIMD hardware doesn't mean the scalar
alternative isn't expensive. Consider integer->float: I don't know
anything about x86, but on PPC you have to store the GPR to the stack and
then reload the data into an FPR. Cheap? I don't think it's any cheaper
than converting a float into a complex float through the VPU, which needs
just one temporary register:
vxor    vtemp, vtemp, vtemp   /* puts 0.0 in every element of vtemp */
and then finish with
vmrghw  vdest, vdest, vtemp   /* interleaves the first 2 original floats
                                 with zeros, yielding 2 floats in complex
                                 format; the other 2 original floats are
                                 discarded */
All of this is going to be cheaper on the VPU (at least on a G4) than on
scalar hardware. Converting from float to int is even more trivial,
requiring only a single specialized AltiVec instruction (vctsxs).
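The merge trick above has simple element semantics; here is a scalar model of what that vxor + vmrghw pair computes (purely illustrative, hypothetical names):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Scalar model of the vxor + vmrghw sequence: the first 2 of 4 floats
// are interleaved with 0.0f, yielding 2 complex<float> (one 128-bit
// register's worth); the remaining 2 floats are discarded and would need
// a second merge (vmrglw) of their own.
void promote_high_pair (const float (&in) [4], std::complex<float> (&out) [2])
    {
        for (std::size_t i = 0; i != 2; ++i)
            out [i] = std::complex<float> (in [i], 0.0f);
    }
```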
> This conversation is interesting, so I'm going to suggest we continue it in
> the mailing list.
>> The question, then, is how hard is it to implement such a beast. What is
>> your opinion?
>> I don't mind doing some coding as long as my manager(s) approve. I am just
>> a soldier ;).
> OK thanks for the offer. Once we thrash out what you need and what the others
> are happy with, we can work something out.
> Cheers, Glen Low
> pixelglow software | simply brilliant stuff
> aim: pixglen