Questions & Answers

How to cast __m128 to a union when returning

I want to return the result of _mm_add_ps(), but the return type should be a custom union that has an __m128 member inside.

I tested the performance of returning __m128 and a custom union. It seems that on MSVC this:

return _mm_add_ps(V1, V2);

is faster than this:

vector4 x = { .vector = _mm_add_ps(left.vector, right.vector) };
return x;

where vector4 is defined as:

typedef union vector4 {
    struct { float x; float y; float z; float w; };
    struct { float r; float g; float b; float a; };
    struct { float s; float t; float m; float q; };
    float points[4];
    __m128 vector;
}  __declspec(align(16)) vector4;

I wonder if I can just cast the result of _mm_add_ps(), which is an __m128, directly to union vector4 to avoid this performance difference. Both versions were measured in a release build.

I tried to use: return (vector4) _mm_add_ps(left.vector, right.vector); but it doesn't work. It fails with the error: No suitable conversion from __m128 to "float" exists.
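For reference, standard C has no cast to a union type, which is why the attempt above is rejected. The closest single-expression form is a C99 compound literal with a designated initializer. The following is a minimal sketch, assuming the file is compiled as C (not C++); the function name vector4_add is hypothetical, and the union is a trimmed, portable stand-in for the one above with the alignment attribute omitted. It only shows the syntax; whether it changes the code MSVC generates is not something this sketch establishes.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps */

/* Trimmed stand-in for the vector4 union from the question.
   The __m128 member already gives the union 16-byte alignment. */
typedef union vector4 {
    struct { float x, y, z, w; };  /* anonymous struct (C11) */
    float points[4];
    __m128 vector;
} vector4;

/* Hypothetical helper: builds the union and returns it in one
   expression by using a compound literal instead of a cast. */
static inline vector4 vector4_add(vector4 left, vector4 right)
{
    return (vector4){ .vector = _mm_add_ps(left.vector, right.vector) };
}
```

This is semantically the same as declaring a named temporary and returning it, so an optimizing compiler would be expected to treat both forms alike.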

2023-01-11 09:12:57
Does the function with a vector4 return value inline, or does that return value have to actually exist in the asm across function boundaries? If the latter, a difference in calling convention might prevent returning a union in a single vector register. In that case there's no solution except to actually return __m128 and construct the vector4 in the caller.
2023-01-11 09:12:57
Related: How to force the compiler to pass a "vector of 4" wrapper class as single XMM register? but I think that's about x86-64 System V, not the Windows calling convention. (Are you compiling for 32 or 64-bit?)
2023-01-11 09:12:57
It's 64-bit. The function is declared as inline, but declaring it as something else doesn't change the error, if that is what you are referring to by "inline". And what do you mean by "or does that return value have to actually exist in the asm across function boundaries"? Can you elaborate on that a bit more?
2023-01-11 09:12:57
If the function inlines into its caller (i.e. the definition is visible in the same compilation unit, or via link-time optimization), getting the value into and out of a vector4 should optimize away; otherwise your compiler is doing a bad job and/or your microbenchmark isn't measuring what it should be. If it does not inline, the compiler has to respect what the calling convention says about how to return a union of those types when it generates assembly code for a stand-alone definition of this function (which callers that don't inline it will reach with a call instruction).
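The workaround mentioned earlier in the comments (return __m128 and construct the vector4 in the caller) could look roughly like this. It is a sketch under assumptions: add_raw and sum_of_added are hypothetical names, and the union is a trimmed stand-in for the one in the question. The point is that only __m128 crosses the function boundary, so the union's return convention never comes into play even if add_raw is not inlined.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps */

/* Trimmed stand-in for the vector4 union from the question. */
typedef union vector4 {
    struct { float x, y, z, w; };  /* anonymous struct (C11) */
    float points[4];
    __m128 vector;
} vector4;

/* Hypothetical callee: takes and returns the raw __m128 type,
   so no union has to be returned across the call. */
static inline __m128 add_raw(__m128 left, __m128 right)
{
    return _mm_add_ps(left, right);
}

/* Hypothetical caller: wraps the __m128 result in a local vector4
   and then uses the named float members. */
static float sum_of_added(vector4 left, vector4 right)
{
    vector4 result;
    result.vector = add_raw(left.vector, right.vector);
    return result.x + result.y + result.z + result.w;
}
```

Comparing the generated assembly of both variants, with and without inlining, is the most direct way to check whether the measured difference comes from the return convention or from the benchmark itself.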
Answers(0) :