[macstl-dev] Re: Windows build without SSE2/3 (e.g. athlon XP)
vectest and benchmark
glen.low at pixelglow.com
Thu Feb 3 08:07:14 WST 2005
> For both projects:
> Adjust the project defines to __MMX__ and __SSE__ only (no __SSE2__)
> and set the project build arch to SSE (rather than SSE2).
> Tests of vectest project:
> Compilation problems in vec_mmx.h
> From line 3779 there is a section for SSE/MMX that has some problems
> (the definition of <unsigned short,8> for the maximum function).
> Replacing the 8 with 4 (including the subsequent <short,8> inside the
> function) allows compilation and passes the unsigned short tests.
> The same applies to minimum at line 3858 onwards.
> test_func() is OK, but test_accum() is reported as undefined, due (I
> think) to the lack of
> template <> struct accumulator <maximum <macstl::vec <unsigned short,
> 4> > >
I'll look into it.
> Other points:
> vec<float,4> fails at min and max in both test_func and test_accum.
> Don't know why. QNaN handling? Maybe different handling in scalar and
> vector code?
Yes it's NaN handling. C89 and C++98 are silent about NaN handling in
max and min code, but C99 says that fmax and fmin are supposed to
ignore NaN rather than propagate NaN ("C99-style max"). However both
Altivec and MMX/SSE use Java-style max, which propagates the NaN. So
for the common interface I've adopted C99-style max, which is fairly
easy to get on Altivec and requires a little more thought on SSE, and
it would devolve to faster Java-style max when finite math optimization
is on (is there a macro or option for this on VC++?).
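
As a scalar illustration of the two conventions described above, here is a minimal sketch. The names c99_style_max and java_style_max are made up for this example and are not part of macstl; the vector versions would need the equivalent logic per element.

```cpp
#include <cmath>

// C99-style max (like fmax): a NaN operand is ignored when the other
// operand is a number, so NaN is returned only if both operands are NaN.
inline float c99_style_max(float a, float b) {
    if (std::isnan(a)) return b;
    if (std::isnan(b)) return a;
    return a > b ? a : b;
}

// Java-style max: a NaN in either operand propagates to the result.
inline float java_style_max(float a, float b) {
    if (std::isnan(a)) return a;
    if (std::isnan(b)) return b;
    return a > b ? a : b;
}
```

The two functions agree on every pair of ordinary numbers and differ only when exactly one operand is NaN, which is the case the failing min and max tests would exercise.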
> Benchmark test:
> With __MMX__, __SSE__ and arch set to SSE
> Program crashes at entry to main
> Changing the denormal activation to
> #ifdef __SSE__
> // _mm_setcsr (_mm_getcsr () | 0x8040); // on Intel, treat denormals
> as zero for full speed
> as per recommendations some way down the thread at
> then allowed the program to execute on my Athlon XP (which I then
> showed to be equivalent to changing the bit mask to 0x8000, i.e. just the
> FTZ bit).
> I also tried 0x40 (DAZ, bit 6, undocumented) as recommended in the same
> thread, but suspect that this further (P4-specific?) assistance may not
> be available on Athlons, as it repeated the crash.
> Querying _mm_getcsr() after the revised function call on the Athlon
> gives me a value of 0x98F0, which looks like the FTZ bit (15) is set but
> not the DAZ bit (6).
> It would be interesting to know whether the function call is
> equivalent to a mask of 0x8040 on P4's and 0x8000 on Athlons
> automatically. Perhaps you can try it on a P4 and let me know?
> Otherwise you could consider the function at
> This code sets bit 15 FTZ and then queries cpuid before deciding on
> bit 6.
Yes, I'm considering making it part of the common interface, i.e. a
single function that sets or clears denormal handling on all
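
A bit-level sketch of what such a function might compute, assuming the MXCSR layout discussed in this thread (FTZ at bit 15, DAZ at bit 6). The helper names and the daz_supported parameter are hypothetical, and how DAZ support is detected is left to the caller.

```cpp
#include <cstdint>

// MXCSR control bits named in the thread.
const std::uint32_t kFlushToZero      = 0x8000; // FTZ, bit 15
const std::uint32_t kDenormalsAreZero = 0x0040; // DAZ, bit 6

// Hypothetical helper: MXCSR value that treats denormals as zero.
// DAZ is requested only when the caller reports that the CPU supports it,
// because writing an unsupported MXCSR bit faults.
inline std::uint32_t flush_denormals(std::uint32_t csr, bool daz_supported) {
    std::uint32_t mask = kFlushToZero;
    if (daz_supported) mask |= kDenormalsAreZero;
    return csr | mask;
}

// Hypothetical helper: MXCSR value with ordinary denormal handling restored.
inline std::uint32_t keep_denormals(std::uint32_t csr) {
    return csr & ~(kFlushToZero | kDenormalsAreZero);
}
```

On an SSE build the result would be handed to _mm_setcsr, for example _mm_setcsr (flush_denormals (_mm_getcsr (), false)) for the FTZ-only case reported on the Athlon XP.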
Cheers, Glen Low
pixelglow software | simply brilliant stuff