[macstl-dev] Re: Question
lipovsky at skycomputers.com
Fri Jun 3 04:44:34 WST 2005
On Thu, 2 Jun 2005, Glen Low wrote:
>> Also, I'd like to note that it may be unprofitable to avoid a double load.
>> As hard to believe as it may be, branch misprediction is more costly than a
>> single load. If this is true on Freescale's 7450 (G4) then I'm more than
>> 80% sure that it's true on the 970 (G5) as well (or even more costly!).
>> Branch misprediction will be happening every 2nd time on the G4,
>> independently of whether you have dynamic prediction enabled or not.
> We can still do a double load and avoid branching, possibly by using the
> appropriate lvsr or lvsl on the even/odd index.
> The other alternative is to increment the vector iterator a float at a time
> instead of "half" a float at a time. That might involve a change to the
> fundamentals of macstl, so I'll have to think carefully about it.
Ok, regarding the first alternative: you can use lvsl on the *advancing*
pointer to load and realign the data correctly, so that the pair of
floats to be used lands in the leftmost position. Do a vperm. Then do
another vperm to place the floats in interleaved fashion, with the
imaginary entries taken from a vector filled with 0's. This achieves the
element_cast<vector <complex <float>, 2> > (vector <float, 2>)
conversion.
You can also do this: keep the pattern 00 01 ... 07 00 01 ... 07. Load
your vector element. Do a vperm using the abovementioned pattern, then
another vperm like the second vperm from the paragraph above. That gives
you the conversion. However, right after it [i.e. before finishing the
loop iteration] use the vaddubm instruction to add the value 08 08 ... 08
(16 times) to the 00 01 ... 07 00 01 ... 07 pattern, which gives you
08 09 ... 0F 08 09 ... 0F. In the next iteration you use this value plus
the 2nd vperm to achieve your element_cast conversion yet again. In the
iteration after that one you will have 10 11 ... 17 10 11 ... 17, etc.
Since you're vperming the loaded value with itself, it doesn't matter
what you have in the upper 4 bits of each byte of your permute pattern
vector. So even though the wraparound will produce some trash in the
upper bits, the lower 4 bits are always going to oscillate between x and
x+8. So every iteration will produce the correct conversion, as long as
you keep incrementing your pattern at the end.
The first approach is more elegant; the second is more efficient. So
better choose the 2nd if you're going with the element_cast alternative.
Now, regarding the second alternative, "incrementing the vector iterator a
float at a time instead of 'half' a float at a time": I assume you meant
"increment the vector iterator 4 floats at a time instead of 2 floats at a
time." That is the best way of going about the situation. It is both
elegant and very efficient (in terms of run time). However, as you've
mentioned, design time is the trade-off.
> The autovectorization I was expecting was on the scalar side of the
> benchmark. Thus I would expect that my vector throughput would remain
> relatively similar to 3.3, while the * over raw should go down, almost to 1
> if autovectorization is supposed to be as fast as macstl. But the results
> still stubbornly show at least 3x speed up over even the simplest loops. Mind
> you, the benchmarks all exercise moderately complicated expressions, even the
> first multiply add is something like:
> for (int i = 0; i != size; ++i)
>     a [i] = b [i] * c [i] + d [i];
> But the ICC autovectorizer successfully tackles that.
Use gdb to disassemble the compiler-generated code and check whether it
actually generates AltiVec instructions.