What kinds of valarray expressions will take the express lane of Altivec optimization? It all comes down to the chunk.

The Altivec optimization is invoked whenever a const-chunkable expression constructs or is assigned to a chunkable expression, or a const-chunkable expression is summarized using sum, min or max.

Currently, only Altivec base types are optimizable: char, unsigned char, short, unsigned short, long, unsigned long and float. I expect to add some more types to the list eventually: long long, unsigned long long and certain std::complex types.

A chunkable expression is an l-value that can be written to in chunks. Only valarrays of Altivec types and `std::valarray <bool>` are chunkable.

A const-chunkable expression is an r-value that can be read from in chunks. An expression is const-chunkable if it is either a valarray of an Altivec type, a scalar of an Altivec type or an expression containing only such valarrays and scalars, arithmetic operators and transcendental functions — except for certain boolean expressions.
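
To make "read from in chunks" and "written to in chunks" concrete, here is a minimal portable sketch of chunk-wise evaluation. It is only an illustration under stated assumptions, not macstl's implementation: it uses no Altivec instructions, the names `chunk`, `load_chunk`, `store_chunk` and `multiply_add` are hypothetical, and the 4-float chunk merely mirrors the 16 bytes of an Altivec register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical stand-in for a 16-byte Altivec register holding 4 floats.
const std::size_t chunk_size = 4;
typedef std::array<float, chunk_size> chunk;

// Read one chunk from a const-chunkable operand (here, a plain valarray).
chunk load_chunk(const std::valarray<float>& v, std::size_t i) {
    chunk c;
    for (std::size_t k = 0; k < chunk_size; ++k) c[k] = v[i + k];
    return c;
}

// Write one chunk into a chunkable destination.
void store_chunk(std::valarray<float>& v, std::size_t i, const chunk& c) {
    for (std::size_t k = 0; k < chunk_size; ++k) v[i + k] = c[k];
}

// Evaluate dst = a * b + c chunk by chunk, then finish the tail with scalar code.
void multiply_add(std::valarray<float>& dst, const std::valarray<float>& a,
                  const std::valarray<float>& b, const std::valarray<float>& c) {
    assert(a.size() == dst.size() && b.size() == dst.size() && c.size() == dst.size());
    const std::size_t n = dst.size();
    std::size_t i = 0;
    for (; i + chunk_size <= n; i += chunk_size) {
        const chunk ca = load_chunk(a, i);
        const chunk cb = load_chunk(b, i);
        const chunk cc = load_chunk(c, i);
        chunk r;
        for (std::size_t k = 0; k < chunk_size; ++k) r[k] = ca[k] * cb[k] + cc[k];
        store_chunk(dst, i, r);
    }
    for (; i < n; ++i) dst[i] = a[i] * b[i] + c[i];  // scalar remainder
}
```

With six-element operands, the first four elements go through the chunk loop and the last two through the scalar remainder.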

The following sorts of expressions are unchunkable (neither chunkable nor const-chunkable) and thus won’t be optimized:

- valarrays of non-Altivec types
- certain boolean expressions (see below)
- apply called on expressions
- subsets of expressions, i.e. slice, gslice, mask and indirect

For example:

    std::valarray <float> vf1, vf2, vf3;
    std::valarray <double> vd1, vd2, vd3;
    std::valarray <std::size_t> vl1; // assumed declaration: index array for the indirect subset

    vf1                      // const-chunkable, Altivec base type
    vf1 * vf2 + vf3          // const-chunkable, only arithmetic
    cos (vf1) + sin (vf2)    // const-chunkable, arithmetic and transcendental
    vd1                      // unchunkable, not Altivec base type
    vd1 * vd2 + vd3          // unchunkable, not Altivec base type
    vf1 [vl1]                // unchunkable, indirect subset
    vf1 [vf2 == vf3]         // unchunkable, mask subset
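
The subset cases correspond to the standard library's own proxy types, which is one way to see that they are not plain valarrays. The check below uses only standard C++ and says nothing about macstl's internals; `subset_demo`, `idx` and `mask` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <valarray>

// Builds each kind of subset and returns the sum of the selected elements.
float subset_demo() {
    std::valarray<float> vf1(4);
    vf1[0] = 1.0f; vf1[1] = 2.0f; vf1[2] = 3.0f; vf1[3] = 4.0f;

    std::valarray<std::size_t> idx(2);
    idx[0] = 0; idx[1] = 3;
    std::valarray<bool> mask(vf1 > 2.0f);

    // Subscripting a non-const valarray with an index array, a mask or a slice
    // yields a proxy type rather than a valarray.
    static_assert(std::is_same<decltype(vf1[idx]), std::indirect_array<float> >::value,
                  "indirect subset");
    static_assert(std::is_same<decltype(vf1[mask]), std::mask_array<float> >::value,
                  "mask subset");
    static_assert(std::is_same<decltype(vf1[std::slice(0, 2, 2)]), std::slice_array<float> >::value,
                  "slice subset");

    std::valarray<float> picked = vf1[idx];   // elements 0 and 3
    std::valarray<float> masked = vf1[mask];  // elements greater than 2
    assert(picked.size() == 2 && masked.size() == 2);
    return picked.sum() + masked.sum();
}
```

The selected elements of such a subset are generally not contiguous in memory, so they do not line up with fixed 16-byte chunks the way a whole valarray does.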

C++ has the type bool, whose size is implementation-defined; on the PowerPC it matches the processor word size, so it is 4 bytes long. Though a bool is logically either true or false, its storage can actually hold any word-sized integer. C++ simply treats zero values as false, and nonzero values as true.

Altivec introduces the concept of sized booleans: booleans that are either 1, 2 or 4 bytes long. These are the results of various boolean-valued Altivec functions, sized to match the element size of their operands, and each must have all bits 0 or all bits 1.

I’ve encapsulated these sized booleans in the `macstl::boolean` template. For example, where `vs1` and `vs2` are vector signed shorts, then `vec_eq (vs1, vs2)` is a vector bool short, whose elements are 2-byte sized booleans, or `macstl::boolean <short>` objects.
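
As a rough picture of what a sized boolean looks like, here is a hypothetical `sized_boolean` template. It is a sketch only and not the actual definition of `macstl::boolean`: the class name, its members and the use of fixed-width unsigned types are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical sketch of a sized boolean: it occupies exactly sizeof(T) bytes
// and is stored as all bits 0 (false) or all bits 1 (true).
template <typename T>
class sized_boolean {
public:
    constexpr sized_boolean(bool b = false)
        : bits_(b ? static_cast<T>(~T(0)) : T(0)) {}

    // Any nonzero pattern reads back as true; a well-formed value is only ever 0 or all ones.
    explicit constexpr operator bool() const { return bits_ != 0; }

    // Raw bit pattern, as a vector compare would leave it in one element.
    constexpr T bits() const { return bits_; }

private:
    T bits_;
};
```

Under these assumptions a `sized_boolean<std::uint16_t>` is 2 bytes wide and holds `0xFFFF` when true, which matches the shape of one element of a vector bool short.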

These differences complicate boolean expressions somewhat.

First, expressions that combine differently sized boolean chunks are not const-chunkable at all, since the corresponding Altivec types would have different numbers of elements.
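
The element-count mismatch follows from simple arithmetic on the 16-byte Altivec register. The helper below is a hypothetical illustration (the name `elements_per_chunk` is not from macstl), and it uses fixed-width types because `long` is not 4 bytes on every platform.

```cpp
#include <cstddef>
#include <cstdint>

// A 16-byte Altivec register holds 16 / sizeof(T) elements of type T.
template <typename T>
constexpr std::size_t elements_per_chunk() {
    return 16 / sizeof(T);
}

// 1-byte booleans: 16 per chunk; 2-byte: 8 per chunk; 4-byte: 4 per chunk.
static_assert(elements_per_chunk<std::int8_t>() == 16, "char-sized chunk");
static_assert(elements_per_chunk<std::int16_t>() == 8, "short-sized chunk");
static_assert(elements_per_chunk<std::int32_t>() == 4, "long-sized chunk on 32-bit PowerPC");
```

So a comparison of floats produces 4 booleans per chunk while a comparison of shorts produces 8, and the two results cannot be combined chunk for chunk.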

Second, although `std::valarray <bool>` is chunkable, it is not const-chunkable: you can write chunks into a `std::valarray <bool>` from some boolean expression, but you can’t read chunks from an expression involving `std::valarray <bool>`. I put this restriction in because each `std::valarray <bool>` element can store an arbitrary integer thanks to its C++ legacy, but Altivec would expect it to have all bits 1 or 0.
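
One way to see the hazard: Altivec's logical operations are bitwise, so they only agree with C++ logical operations when every true value is all ones. The sketch below works on single 32-bit words rather than vectors; `to_mask` and `mask_and` are hypothetical helper names, not part of macstl.

```cpp
#include <cstdint>

// Normalize a word that C++ would treat as true or false into an Altivec-style mask:
// zero stays all bits 0, any nonzero value becomes all bits 1.
constexpr std::uint32_t to_mask(std::uint32_t word) {
    return word != 0 ? 0xFFFFFFFFu : 0u;
}

// Logical AND carried out bitwise; only meaningful on normalized masks.
constexpr std::uint32_t mask_and(std::uint32_t a, std::uint32_t b) {
    return a & b;
}

// Raw words 2 and 1 are both "true" in C++, yet their bitwise AND is 0 ("false").
static_assert(mask_and(2u, 1u) == 0u, "bitwise AND of raw words loses the truth value");

// After normalization the bitwise result matches the logical one.
static_assert(mask_and(to_mask(2u), to_mask(1u)) == 0xFFFFFFFFu, "normalized masks agree");
```

Reading chunks straight out of storage that may hold such raw words would therefore require a normalization pass first.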

Third, even if a boolean expression is const-chunkable, it must have long-sized chunks to construct or assign to a `std::valarray <bool>`. This follows from the fact that bool is actually long-sized on the PowerPC. The other kinds of const-chunkable expressions may still be summarized, or construct or be assigned to `std::valarray <macstl::boolean <T> >`, where `T` is either `char`, `short` or `long`.

For example:

    std::valarray <bool> vb1, vb2;
    std::valarray <float> vf1, vf2;
    std::valarray <short> vs1, vs2;
    bool b;

    vb1 = vf1 == vf2;                          // optimized, since expression has long-sized chunks
    vb1 = vs1 == vs2;                          // not optimized, since expression has short-sized chunks
    vb1 = (vf1 == vf2) && (vs1 == vs2);        // not optimized, combining different sized chunks
    vb2 = vb1;                                 // not optimized, vb1 not const-chunkable
    vb2 = vb1 && (vf1 == vf2);                 // not optimized, vb1 not const-chunkable
    b = (vf1 == vf2).sum ();                   // optimized, vf1 == vf2 is const-chunkable
    b = (vs1 == vs2).sum ();                   // optimized, vs1 == vs2 is const-chunkable
    b = ((vf1 == vf2) && (vs1 == vs2)).sum (); // not optimized, combining different sized chunks

Benchmarks showed that trying to optimize unchunkable expressions yielded little performance gain over scalar code. But even though they are not optimized for Altivec, the scalar code they produce will still be clean and tight.

In the future, I may investigate making more of these chunkable.
