What kinds of valarray expressions will take the express lane of Altivec optimization? It all comes down to the chunk.
The Altivec optimization is invoked whenever a const-chunkable expression constructs or is assigned to a chunkable expression, or a const-chunkable expression is summarized using sum, min or max.
Currently, only Altivec base types are optimizable: char, unsigned char, short, unsigned short, long, unsigned long and float. I expect to add some more types to the list eventually: long long, unsigned long long and certain std::complex types.
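To make the list above concrete, here is a minimal sketch of a compile-time check for it. The trait name is_altivec_base is hypothetical and is not part of macstl; it simply encodes the seven types named in the paragraph above.

```cpp
#include <type_traits>

// Hypothetical trait (an assumption for illustration, not macstl's own code):
// true only for the element types listed above as currently optimizable.
template <typename T> struct is_altivec_base: std::false_type { };

template <> struct is_altivec_base <char>: std::true_type { };
template <> struct is_altivec_base <unsigned char>: std::true_type { };
template <> struct is_altivec_base <short>: std::true_type { };
template <> struct is_altivec_base <unsigned short>: std::true_type { };
template <> struct is_altivec_base <long>: std::true_type { };
template <> struct is_altivec_base <unsigned long>: std::true_type { };
template <> struct is_altivec_base <float>: std::true_type { };

// Types not on the list, such as double or long long, fall through to the
// primary template and report false.
```

With this sketch, is_altivec_base <float>::value is true while is_altivec_base <double>::value is false, which mirrors the distinction the rest of this section relies on.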
A chunkable expression is an l-value that can be written to in chunks. Only valarrays of Altivec types and std::valarray <bool> are chunkable.
A const-chunkable expression is an r-value that can be read from in chunks. An expression is const-chunkable if it is either a valarray of an Altivec type, a scalar of an Altivec type or an expression containing only such valarrays and scalars, arithmetic operators and transcendental functions, except for certain boolean expressions.
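The idea of reading and writing in chunks can be sketched in portable C++. The function below is a hypothetical illustration, not macstl's implementation: it assumes a chunk of four floats (the number that fit in one 16-byte Altivec register) and uses an inner scalar loop as a stand-in for a single vector multiply-add.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical helper (not macstl code): computes dst = a * b + c by walking
// the arrays four floats at a time, then finishing leftovers one by one.
inline void chunked_multiply_add (std::valarray <float>& dst,
    const std::valarray <float>& a,
    const std::valarray <float>& b,
    const std::valarray <float>& c)
{
    const std::size_t n = dst.size ();
    assert (a.size () == n && b.size () == n && c.size () == n);

    const std::size_t chunk = 4;              // assumed: 4 floats per 16-byte vector
    const std::size_t whole = n - n % chunk;  // elements covered by full chunks

    std::size_t i = 0;
    for (; i < whole; i += chunk)
    {
        // Stand-in for one vector operation: read a chunk from each operand,
        // write a chunk to the destination.
        for (std::size_t j = 0; j < chunk; ++j)
            dst [i + j] = a [i + j] * b [i + j] + c [i + j];
    }

    // Scalar tail for sizes that are not a multiple of the chunk.
    for (; i < n; ++i)
        dst [i] = a [i] * b [i] + c [i];
}
```

Here dst plays the role of the chunkable l-value, and a * b + c plays the role of the const-chunkable r-value.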
The following sorts of expressions are unchunkable (neither chunkable nor const-chunkable) and thus won't be optimized:
std::valarray <float> vf1, vf2, vf3;
std::valarray <double> vd1, vd2, vd3;
std::valarray <std::size_t> vl1; // index array for the indirect subset
vf1 // const-chunkable, Altivec base type
vf1 * vf2 + vf3 // const-chunkable, only arithmetic
cos (vf1) + sin (vf2) // const-chunkable, arithmetic and transcendental
vd1 // unchunkable, not Altivec base type
vd1 * vd2 + vd3 // unchunkable, not Altivec base type
vf1 [vl1] // unchunkable, indirect subset
vf1 [vf2 == vf3] // unchunkable, mask subset
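For readers who want to try these expression shapes, the following sketch wraps each one in a small function using only the standard library. The function names are hypothetical, and nothing here exercises Altivec; it only shows what each kind of expression computes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Arithmetic only: the kind of expression described above as const-chunkable.
inline std::valarray <float> arithmetic_only (const std::valarray <float>& vf1,
    const std::valarray <float>& vf2, const std::valarray <float>& vf3)
{
    return std::valarray <float> (vf1 * vf2 + vf3);
}

// Arithmetic plus transcendental functions: also described as const-chunkable.
inline std::valarray <float> with_transcendentals (const std::valarray <float>& vf1,
    const std::valarray <float>& vf2)
{
    return std::valarray <float> (cos (vf1) + sin (vf2));
}

// Indirect subset: picks elements by index, described above as unchunkable.
inline std::valarray <float> indirect_subset (const std::valarray <float>& vf1,
    const std::valarray <std::size_t>& vl1)
{
    return std::valarray <float> (vf1 [vl1]);
}

// Mask subset: picks elements where vf2 == vf3, described above as unchunkable.
inline std::valarray <float> mask_subset (const std::valarray <float>& vf1,
    const std::valarray <float>& vf2, const std::valarray <float>& vf3)
{
    std::valarray <bool> mask (vf2 == vf3);
    return std::valarray <float> (vf1 [mask]);
}
```

The two subset functions gather elements from scattered positions, which gives some intuition for why they do not map onto contiguous chunks.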
C++ has the type bool, whose size depends on the platform; on the PowerPC it is the same size as the processor word, 4 bytes long. Although a bool is logically either true or false, its storage can actually hold any word-sized integer. C++ simply treats zero values as false, and nonzero values as true.
Altivec introduces the concept of sized booleans: booleans that are either 1, 2 or 4 bytes long. These are the results of various boolean-valued Altivec functions, sized to match the element size, and must have either all bits 0 or all bits 1.
I've encapsulated these sized booleans in the macstl::boolean template. For example, where vs1 and vs2 are vector signed shorts, then vec_eq (vs1, vs2) is a vector bool short, whose elements are 2-byte sized booleans, or macstl::boolean <short> objects.
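A sized boolean can be sketched in portable C++ as a wrapper around an unsigned integer of the same size as the element type. The names sized_boolean and mask_of_size below are hypothetical; this is an illustration of the all-bits-0 or all-bits-1 rule under that assumption, not the actual definition of macstl::boolean.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: maps a size in bytes to an unsigned integer of that size.
template <std::size_t Bytes> struct mask_of_size;
template <> struct mask_of_size <1> { typedef std::uint8_t type; };
template <> struct mask_of_size <2> { typedef std::uint16_t type; };
template <> struct mask_of_size <4> { typedef std::uint32_t type; };

// Hypothetical sketch of a sized boolean: as wide as T, and always holding
// either all bits 0 (false) or all bits 1 (true).
template <typename T> class sized_boolean
{
    public:
        typedef typename mask_of_size <sizeof (T)>::type mask_type;

        sized_boolean (bool b = false):
            bits_ (b ? static_cast <mask_type> (~mask_type (0)) : mask_type (0))
        {
        }

        // Reading back as a plain bool only asks whether any bit is set.
        operator bool () const { return bits_ != 0; }

        mask_type bits () const { return bits_; }

    private:
        mask_type bits_;
};
```

Because the constructor normalizes its argument, a sized_boolean <short> built from true always carries the bit pattern 0xFFFF, which is the form a vector comparison result would have for 2-byte elements.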
These differences complicate boolean expressions somewhat.
First, expressions that combine differently sized boolean chunks are not const-chunkable at all, since the corresponding Altivec types would have different numbers of elements.
Second, although std::valarray <bool> is chunkable, it is not const-chunkable: while you can write chunks into a std::valarray <bool> from some boolean expression, you can't read chunks from an expression involving std::valarray <bool>. I put this restriction in because each std::valarray <bool> element can store an arbitrary integer thanks to its C++ legacy, but Altivec would expect it to have all bits 1 or 0.
Third, even if a boolean expression is const-chunkable, it must have long-sized chunks to construct or assign to a std::valarray <bool>. This follows from the fact that bool is actually long-sized on the PowerPC. The other kinds of const-chunkable boolean expressions may still be summarized, or construct or be assigned to std::valarray <macstl::boolean <T> >, where T matches the chunk size of the expression.
std::valarray <bool> vb1, vb2;
std::valarray <float> vf1, vf2;
std::valarray <short> vs1, vs2;
vb1 = vf1 == vf2; // optimized, since expression has long-sized chunks
vb1 = vs1 == vs2; // not optimized, since expression has short-sized chunks
vb1 = (vf1 == vf2) && (vs1 == vs2);
// not optimized, combining different sized chunks
vb2 = vb1; // not optimized, vb1 not const-chunkable
vb2 = vb1 && (vf1 == vf2); // not optimized, vb1 not const-chunkable
b = (vf1 == vf2).sum (); // optimized, vf1 == vf2 is const-chunkable
b = (vs1 == vs2).sum (); // optimized, vs1 == vs2 is const-chunkable
b = ((vf1 == vf2) && (vs1 == vs2)).sum (); // not optimized, combining different sized chunks
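The boolean expressions above can also be tried with the standard library alone. The sketch below uses hypothetical function names and says nothing about which forms macstl optimizes; it only shows the values these expressions produce. Note that with plain bool arithmetic, sum () on a std::valarray <bool> behaves like "any element is true".

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Element-wise comparison of two float arrays.
inline std::valarray <bool> equal_floats (const std::valarray <float>& vf1,
    const std::valarray <float>& vf2)
{
    return std::valarray <bool> (vf1 == vf2);
}

// Combines a float comparison with a short comparison, the case described
// above as mixing differently sized chunks.
inline std::valarray <bool> both_equal (const std::valarray <float>& vf1,
    const std::valarray <float>& vf2,
    const std::valarray <short>& vs1,
    const std::valarray <short>& vs2)
{
    std::valarray <bool> bf (vf1 == vf2);
    std::valarray <bool> bs (vs1 == vs2);
    return std::valarray <bool> (bf && bs);
}

// Summarizes a comparison; the arrays must be non-empty for sum () to be valid.
inline bool any_equal (const std::valarray <float>& vf1,
    const std::valarray <float>& vf2)
{
    return std::valarray <bool> (vf1 == vf2).sum ();
}
```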
Benchmarks have shown that trying to optimize unchunkable expressions didn't yield much of a performance gain over scalar code. But even though they are not optimized for Altivec, the scalar code they produce will still be clean and tight.
In the future, I may investigate making more of these chunkable.