So starting my ramble, this is a typical (albeit somewhat contrived) use of a C++ object. An object is declared somewhere, someone passes a pointer to it and it does something:

class MyObject {
public:
    MyObject();
    ~MyObject();
    void JumpUpAndDown();
private:
    int m_myInternalBits;
};

void myFunc(MyObject *theThing) {
    if (theThing == NULL) {
        return;
    }
    theThing->JumpUpAndDown();
}

In my opinion, this approach is more difficult than it needs to be for several reasons:

First, I don’t need to know what MyObject’s private bits are. I don’t care how it stores whatever it is that it stores. If I do, I more than likely have to read the implementation anyway. This isn’t data hiding, it’s strongly-suggested-access-restriction.

Second, every time I go to use theThing, I have to check that it’s a valid pointer. Why? I’m calling a function on a piece of data (C++ actually passes the data to the function as a hidden argument, but that’s for another post.) If the pointer isn’t valid, shouldn’t the function be intelligent enough to not use it? Also, the majority of times that theThing is compared against null, it isn’t null. And if it is null, that’s an error that should have been caught at object creation, not object use.

Third, the act of checking is contagious. Whatever calls myFunc likely did some validity checking, so theThing was probably validated before the function was called. The computer is spending cycles and branches doing these checks, but worse, we humans are spending time and brain power handling cases that occur only in extremely rare circumstances.

Finally, it makes code and debugging very complex. If we have to validate data at every step, we have to handle multiple conditions and have multiple ways of doing things. We could also be hiding issues by inserting null checks until the crash goes away. Sure, it doesn’t crash anymore, but that’s because that critical line of code also isn’t running!

Let’s try something completely different. What happens if we only check for pointer validity when it’s critical, at the point where it would actually cause the crash? How difficult is it to write code such that, if the pointer isn’t valid, it doesn’t matter?

Let’s start by converting the above to a more C style OOP approach:

typedef struct _MyThing_t* MyThing_t;

MyThing_t MyThing_Create();
void MyThing_Destroy(MyThing_t);
void MyThing_JumpUpAndDown(MyThing_t theThing);

void myFunc(MyThing_t theThing) {
    MyThing_JumpUpAndDown(theThing);
}

Interestingly, this has a number of immediate impacts.

First, it’s half the size. Hopefully, smaller code means easier maintenance.

Second, changes to the interface or implementation have different impacts. If the interface changes (add/remove/modify a function), both styles must recompile everything that references the interface. If internal data changes, only the original example must recompile everything, since it has published its workings to the world (even though they’re not supposed to use them.) This potentially improves iteration time, especially in large, complicated code bases that take a while to compile or have complicated include trees.

Third, validity checks are no longer necessary which reduces complexity and eliminates checking contagion. Because a MyThing_t is only ever accessed internally, we actually have no way to validate it. In fact, I’d argue that we don’t have a right to validate it as that crosses the “need to know” boundary. We could check to see if it’s a null pointer, but that’s about it. The caller has no idea what’s in the object, so they can’t possibly know if it’s valid or not, so why check it?
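To make the opaque-handle idea concrete, here’s a minimal sketch of what the implementation side might look like. The jumpCount field, the GetJumpCount accessor, and the single internal null check are illustrative assumptions, not the original code; the point is that the struct definition lives only in the implementation file, so callers genuinely cannot see or validate the internals:

```cpp
#include <cassert>
#include <cstdlib>

// Public header side: callers only ever see the typedef'd pointer.
typedef struct _MyThing_t* MyThing_t;

// Implementation side: the "private bits" are truly hidden here.
struct _MyThing_t {
    int jumpCount;  // hypothetical internal state
};

MyThing_t MyThing_Create() {
    MyThing_t t = (MyThing_t)malloc(sizeof(struct _MyThing_t));
    if (t) { t->jumpCount = 0; }
    return t;  // creation is the one place failure gets reported
}

void MyThing_Destroy(MyThing_t t) {
    free(t);  // free(NULL) is already a safe no-op
}

void MyThing_JumpUpAndDown(MyThing_t t) {
    if (!t) { return; }  // one internal check, instead of one per caller
    t->jumpCount += 1;
}

// Hypothetical accessor, just so the behavior is observable.
int MyThing_GetJumpCount(MyThing_t t) {
    return t ? t->jumpCount : 0;
}
```

Note that the null check lives in exactly one place, inside the module that owns the data, rather than being smeared across every call site.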

Here’s a more realistic example. This is a snippet of UI code in my engine:

void add_child(window_t parent, window_t child) {
    if (instance(parent)) {
        if (instance(child)) {
            make_orphan(child->ui, child);
            add_to_head(parent->ui, child);
            child->parent = parent;
            send_message(child, WMSG_EVENT_PARENT_CHANGED, 0, (uintptr_t)parent);
            send_message(parent, WMSG_EVENT_GAIN_CHILD, 0, (uintptr_t)child);
            release(child);
        }
        release(parent);
    }
}

The only checks in this code are the reference counting. If the child didn’t have a previous UI (it’s sort of like a window group), make_orphan instantly succeeds. After all, it technically did exactly what it advertised: the window is an orphan. Besides, the only time this is possible is the first time the window is created. If the window ever changes parents or moves to another UI hierarchy, the UI would be valid.
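For readers wondering what instance/release do: this is my guess at their shape, assuming a simple intrusive reference count. The real engine code almost certainly differs; this just illustrates the pattern of pinning an object for the duration of the work and then releasing it:

```cpp
#include <cassert>

// Hypothetical object header carrying an intrusive reference count.
struct refcounted {
    int refs;  // starts at 1 on creation
};

// Pins the object; returns false only if the handle is already dead.
bool instance(refcounted* obj) {
    if (!obj || obj->refs <= 0) { return false; }
    obj->refs += 1;
    return true;
}

// Drops a pin; a real implementation would destroy the object at zero.
void release(refcounted* obj) {
    if (obj) { obj->refs -= 1; }
}
```

The `if (instance(x)) { ... release(x); }` bracketing in add_child then guarantees neither window can be destroyed mid-operation.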

Here’s more code, which renders a sphere. This is the lowest level rendering interface there is and, as such, is the most verbose:

data_cluster_t cmd = data_cluster_create();
render::push_fill_mode_to_cluster(cmd, FM_SOLID);
render::push_alpha_blend_enable_to_cluster(cmd, true);
render::push_alpha_blend_mode_to_cluster(cmd, ABM_ONE_MINUS_SOURCE);
render::push_vertex_buffer_to_cluster(cmd, sphere_vertex_buffer);
render::push_index_buffer_to_cluster(cmd, sphere_index_buffer);
render::set_shader_param_mat44_to_cluster(cmd, modelview_index, modelview);
render::set_shader_param_vec4_to_cluster(cmd, color_index, color);
const uint32_t index_count = render::index_buffer::length(sphere_index_buffer);
render::draw_object_to_cluster(cmd, PM_TRIANGLE, index_count);
render::pop_index_buffer_to_cluster(cmd);
render::pop_vertex_buffer_to_cluster(cmd);
render::pop_alpha_blend_mode_to_cluster(cmd);
render::pop_alpha_blend_enable_to_cluster(cmd);
render::pop_fill_mode_to_cluster(cmd);
render::insert_cluster(cmd);
data_cluster_release(cmd);

My engine sends all render commands to a command buffer that gets executed later. Individual commands are not guaranteed order so they are pushed to a “cluster” which can then be inserted as an atomic unit into the command buffer. This guarantees state atomicity, allowing me to “render” from all threads at the same time without having to worry about state changes.

After I moved to this style, I was able to remove about 30% of the code in my user interface code and even simplified a number of algorithms.

Of course, this style of coding isn’t appropriate for all use cases. For example, if code is written in such a way that a system needs to do something whether or not it has a valid MyThing_t, you would need some kind of validity check (or modify the code so it only runs that path when it actually does have a valid object). It’s also definitely not intended to be a topic that I expect people to jump up and down and go “Wow! That’s the best thing evar! Everyone Should Convert!” because, frankly, it may be a terrible idea. Who knows, right?

And, finally, in what feels like a litany of caveats and commas, this isn’t meant to be about Defensive or Offensive Programming, or a critique of error codes versus exception handling. I’m more than happy to discuss those, but I didn’t want to address them here; this was about simplifying code.

The test code is in C++ and is compiled in Microsoft Visual C++ 2010 express with default console project settings, release mode with /O2 (Maximize Speed.)

In my tests, I have the code broken out into a main.cpp, test_impl.h and test_impl.cpp to enforce separation. I found that some function calls were inlined even with __declspec(noinline) attached. To prevent the functions from being compiled out entirely, I have a global ‘int test_value’ that each function simply increments.

If you’re interested in the actual timings and cost of the call, see Agner Fog’s excellent instruction tables document. I reference the tables for the AMD K8 processor.

Here’s the summary of the code:

__declspec(noinline) void test_function() { test_value += 1; }

void (*test_functionptr)() = test_function;

class test_class_novtable {
public:
    __declspec(noinline) void test_function() { test_value += 1; }
};

class test_class_abstract {
public:
    virtual ~test_class_abstract() {}
    virtual void test_function() = 0;
};

class test_class_abstract2 : public test_class_abstract {
public:
    virtual ~test_class_abstract2() {}
    __declspec(noinline) virtual void test_function() { test_value += 1; }
};

test_class_abstract2 test_object_abstract2;
test_class_abstract *test_object_abstract = &test_object_abstract2;

The first test was a single function call and generated the expected assembly, simply calling the mangled function name. Assuming I’m understanding the CALL instruction, it is 16-22 macro ops and 23-32 cycles of latency.

call ?test_function@@YAXXZ

Next is the function pointer, which issues the call on a memory address. This is slightly more expensive at 16-22 macro ops and 24-33 cycles of latency:

call DWORD PTR ?test_functionptr@@3P6AXXZA

Next, the standard class member call, which is identical to the normal function call but with more mangling to identify the class name. It also passes the hidden ‘this’ parameter (in ECX, under MSVC’s thiscall convention), so even though the call instruction is the same, the overall cost may not be:

call ?test_function@test_class_novtable@@QAEXXZ

And now the pure abstract virtual class:

mov  eax, DWORD PTR ?test_object_abstract2@@3Vtest_class_abstract2@@A
mov  edx, DWORD PTR [eax+4]
mov  ecx, OFFSET ?test_object_abstract2@@3Vtest_class_abstract2@@A
call edx

As it turns out, calling a virtual function is significantly more expensive, even assuming there are no cache misses. The reason is that the CPU can’t simply call the function. It has to load the object, load the function address from the vtable, set up ‘this’, and finally perform the indirect call.

Assuming the AMD K8 processor and no cache misses, each of those MOVs is 3 cycles of latency. This means an extra ~9 cycles per call, or roughly a third more time *per call*.

The vtable (and its calling cost) can be represented in C as below. The object is a pointer to a struct containing an array of function pointers:

struct test_vtobject {
    void (**vtable)();
};

void (*test_vtable[])() = { test_function };
test_vtobject test_vtobject_impl = { test_vtable };
test_vtobject *test_vtobject_ptr = &test_vtobject_impl;

Calling it with an integer (I reused test_value) to select the function looked like this:

test_vtobject_ptr->vtable[test_value]();

And resulted in this assembly:

mov  eax, DWORD PTR ?test_value@@3HA
mov  ecx, DWORD PTR ?test_vtobject_impl@@3Utest_vtobject@@A
mov  edx, DWORD PTR [ecx+eax*4]
call edx

An alternative implementation is the traditional C object which ditches the array in favor of named function pointers.

struct test_cobject {
    void (*test_function0)();
    void (*test_function1)();
    void (*test_function2)();
    void (*test_function3)();
    // data
};

test_cobject test_cobject_impl = { test_fn_0, test_fn_1, test_fn_2, test_function };
test_cobject *test_cobject_ptr = &test_cobject_impl;

This implementation resulted in function pointer calling:

call DWORD PTR ?test_cobject_impl@@3Utest_cobject@@A+12

I’ll keep playing around with it.

As a recap, this series is on using SIMD (specifically, 128 bit SSE 4 float vectors) to optimize a batch normalizing of vectors. While the end result may not be the most super-useful-awesomest-thing-in-the-world, it is a simple target for covering vec3 operations in SIMD and some optimization techniques.

This post is mostly about instruction pipelining, referring to a lot of vector timing data.

The super-mini introduction to instruction pipelines is that you can issue one instruction before the previous one has finished and that each instruction takes time to complete.

Assume, for a moment, that we have this super awesome computer that can do asynchronous operations. Among the commands we have available is an ADD command that takes two variables, adds them together, and returns the result. ADD takes 1 second to complete, but you can start as many of them as you like in that one second.

x0 = ADD(a, b);
x1 = ADD(c, d);
x2 = ADD(x0, x1);

Here, both ADD(a, b) and ADD(c, d) can be processed at the same time since they don’t depend on each other. The final ADD, however, relies on the outputs of both of the previous two in order to do its computation, so it has to wait the full second to get the results back from both x0 and x1 before it can start. This is called a hazard, and the resulting delay is a pipeline stall or bubble.
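The same hazard shows up in ordinary code. Both functions below compute the same sum, but the second splits the work into two independent accumulator chains, so the CPU can overlap the additions instead of waiting on each result before starting the next. This is an illustrative sketch, not code from the post; any actual speedup depends on the CPU (and, for floats, on the compiler being allowed to reassociate):

```cpp
#include <cassert>

int sum_one_chain(const int* data, int count) {
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += data[i];  // each add depends on the previous add's result
    }
    return sum;
}

int sum_two_chains(const int* data, int count) {
    int sum0 = 0;
    int sum1 = 0;
    int i = 0;
    for (; i + 1 < count; i += 2) {
        sum0 += data[i];      // chain 0
        sum1 += data[i + 1];  // chain 1, independent of chain 0
    }
    for (; i < count; ++i) { sum0 += data[i]; }  // leftover element
    return sum0 + sum1;  // the only point where the chains meet
}
```

The two chains merge only once, at the very end, so the bubble from the dependency is paid a single time rather than on every iteration.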

When discussing pipelines, you’ll hear the same two terms a lot: Latency and Throughput. Latency means “how long does it take for the instruction to finish” and Throughput means “how many can I issue to the pipeline in one cycle.” One very good analogy uses pipes transporting water to illustrate the difference.

I’ve slightly reorganized the code since the first post as we’ll be moving instructions around quite a bit.

The vec3 structure:

typedef struct _vec3 {
    float x;
    float y;
    float z;
} vec3;

The scalar normalize function:

inline void vec3Normalize(const vec3 * const in, vec3 * const out) {
    const float len = sqrt(in->x * in->x + in->y * in->y + in->z * in->z);
    out->x = in->x / len;
    out->y = in->y / len;
    out->z = in->z / len;
}

The current batch normalization function:

void vec3BatchNormalizeSSE(const vec3 * const in, const uint32_t count, vec3 * const out) {
    const uint32_t prepass = ((uintptr_t)in & 0xf) / 4;
    const uint32_t midpass = (count - prepass) & ~3;

    // process as many vectors as it takes to get to aligned data
    for (uint32_t i = 0; i < prepass; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }

    for (uint32_t i = prepass; i < prepass + midpass; i += 4) {
        // load the vectors from the source data
        const float * const src = &in[i].x;
        const __m128 source_0 = _mm_load_ps(src + 0);
        const __m128 source_1 = _mm_load_ps(src + 4);
        const __m128 source_2 = _mm_load_ps(src + 8);

        // compute the square of each vector
        const __m128 square_0 = _mm_mul_ps(source_0, source_0);
        const __m128 square_1 = _mm_mul_ps(source_1, source_1);
        const __m128 square_2 = _mm_mul_ps(source_2, source_2);

        // prepare to add, transposing the data into place
        // first transpose
        //
        // x0, y0, z0, x1      x0, y0, y1, z1
        // y1, z1, x2, y2  =>  x2, y2, y3, z3
        // z2, x3, y3, z3      z0, x1, z2, x3
        const __m128 xpose1_0 = _mm_shuffle_ps(square_0, square_1, _MM_SHUFFLE(1, 0, 1, 0));
        const __m128 xpose1_1 = _mm_shuffle_ps(square_1, square_2, _MM_SHUFFLE(3, 2, 3, 2));
        const __m128 xpose1_2 = _mm_shuffle_ps(square_0, square_2, _MM_SHUFFLE(1, 0, 3, 2));

        // second transpose
        //
        // x0, y0, y1, z1      x0, y1, z2, x3
        // x2, y2, y3, z3  =>  y0, z1, y2, z3
        // z0, x1, z2, x3      z0, x1, x2, y3
        const __m128 xpose2_0 = _mm_shuffle_ps(xpose1_0, xpose1_2, _MM_SHUFFLE(3, 2, 2, 0));
        const __m128 xpose2_1 = _mm_shuffle_ps(xpose1_0, xpose1_1, _MM_SHUFFLE(3, 1, 3, 1));
        const __m128 xpose2_2 = _mm_shuffle_ps(xpose1_2, xpose1_1, _MM_SHUFFLE(2, 0, 1, 0));

        // sum the components to get the squared length
        const __m128 sum1 = _mm_add_ps(xpose2_1, xpose2_2);
        const __m128 lensq = _mm_add_ps(xpose2_0, sum1);

        // calculate the scale as a reciprocal sqrt
        const __m128 rcpsqrt = _mm_rsqrt_ps(lensq);

        // to apply it, we have to mangle it around again
        //                  s0, s0, s0, s1
        // x, y, z, w  =>   s1, s1, s2, s2
        //                  s2, s3, s3, s3
        const __m128 scale_0 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(1, 0, 0, 0));
        const __m128 scale_1 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(2, 2, 1, 1));
        const __m128 scale_2 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(3, 3, 3, 2));

        // multiply the original vector by the scale, completing the normalize
        const __m128 norm_0 = _mm_mul_ps(source_0, scale_0);
        const __m128 norm_1 = _mm_mul_ps(source_1, scale_1);
        const __m128 norm_2 = _mm_mul_ps(source_2, scale_2);

        // store the result into the output array (unaligned data)
        float * const dst = &out[i].x;
        _mm_storeu_ps(dst + 0, norm_0);
        _mm_storeu_ps(dst + 4, norm_1);
        _mm_storeu_ps(dst + 8, norm_2);
    }

    // process any leftover vectors individually
    for (uint32_t i = prepass + midpass; i < count; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }
}

When optimizing at this low of a level, you need to be aware of your target CPU. Just because you’re developing on one CPU doesn’t mean that it’ll be faster for others. The CPU I’m optimizing on is an AMD Turion 64 from the AMD K8 series. According to Agner Fog’s instruction tables, these are the timings we need to look at:

Intrinsic        Instruction  Latency  Throughput
_mm_load_ps      MOVAPS       ???      1
_mm_storeu_ps    MOVUPS       ???      0.5
_mm_shuffle_ps   SHUFPS       3        0.5
_mm_add_ps       ADDPS        4        0.5
_mm_mul_ps       MULPS        4        0.5
_mm_rsqrt_ps     RSQRTPS      3        0.5

Note that the ones marked with ‘???’ don’t currently have measurements in the tables.

Looking through the code, we immediately see that the initial three loads are used immediately. This leaves no time for the instruction to finish (latency) but isn’t a problem for how fast we issue them (throughput):

const float * const src = &in[i].x;
const __m128 source_0 = _mm_load_ps(src + 0);
const __m128 source_1 = _mm_load_ps(src + 4);
const __m128 source_2 = _mm_load_ps(src + 8);

At the end of the loop, however, is where we store the result back out to the destination pointer. These instructions, MOVUPS, have a throughput of 0.5, which means only one MOVUPS can be issued every two cycles.

We can alleviate both problems by loading the data for the next iteration at the end of the loop, interleaved with the loads to cover their throughput stall. This gives the loads a little more time to process and fits the stores tighter together:

__m128 active_0;
__m128 active_1;
__m128 active_2;
{
    const float * const src = &in[prepass].x;
    active_0 = _mm_load_ps(src + 0);
    active_1 = _mm_load_ps(src + 4);
    active_2 = _mm_load_ps(src + 8);
}

for (uint32_t i = prepass; i < prepass + midpass; i += 4) {
    // load the vectors from the source data
    // const float * const src = &in[i].x;
    // const __m128 source_0 = _mm_load_ps(src + 0);
    // const __m128 source_1 = _mm_load_ps(src + 4);
    // const __m128 source_2 = _mm_load_ps(src + 8);

    // store the result into the output array (unaligned data)
    const float * const src = &in[i].x;
    float * const dst = &out[i].x;
    _mm_storeu_ps(dst + 0, norm_0);
    active_0 = _mm_load_ps(src + 12);
    _mm_storeu_ps(dst + 4, norm_1);
    active_1 = _mm_load_ps(src + 16);
    _mm_storeu_ps(dst + 8, norm_2);
    active_2 = _mm_load_ps(src + 20);
}

The end result:

Name         Total      Shortest  Longest  Average  Percent of Original
first sse    1,204,622  115       225,868  120      17.65%
load change  1,108,781  106       64,499   110      16.24%

That’s a pretty good improvement for a fairly simple change.

The next thing we can look at is replacing some of these commands with more appropriate versions. A couple of suspects jump out pretty quickly:

const __m128 xpose1_0 = _mm_shuffle_ps(square_0, square_1, _MM_SHUFFLE(1, 0, 1, 0));
const __m128 xpose1_1 = _mm_shuffle_ps(square_1, square_2, _MM_SHUFFLE(3, 2, 3, 2));

Any time I see patterns like this, I’m suspicious. As it turns out, shuffling with 0101 or 2323 is equivalent to _mm_movelh_ps and _mm_movehl_ps. Agner Fog’s instruction tables list a latency of 2 and a throughput of 2 for both instructions. If we insert both and try them out, it’s slightly faster. And when I say slightly, I mean it. The new version runs at 99% of the time of the previous version, or about one clock per iteration.

Rather than implement this change, I looked at the code in the light of using MOVLHPS and MOVHLPS instead. Turns out that there’s a better change to be had by changing this:

// prepare to add, transposing the data into place
// first transpose
//
// x0, y0, z0, x1      x0, y0, y1, z1
// y1, z1, x2, y2  =>  x2, y2, y3, z3
// z2, x3, y3, z3      z0, x1, z2, x3
const __m128 xpose1_0 = _mm_shuffle_ps(square_0, square_1, _MM_SHUFFLE(1, 0, 1, 0));
const __m128 xpose1_1 = _mm_shuffle_ps(square_1, square_2, _MM_SHUFFLE(3, 2, 3, 2));
const __m128 xpose1_2 = _mm_shuffle_ps(square_0, square_2, _MM_SHUFFLE(1, 0, 3, 2));

// second transpose
//
// x0, y0, y1, z1      x0, y1, z2, x3
// x2, y2, y3, z3  =>  y0, z1, y2, z3
// z0, x1, z2, x3      z0, x1, x2, y3
const __m128 xpose2_0 = _mm_shuffle_ps(xpose1_0, xpose1_2, _MM_SHUFFLE(3, 2, 2, 0));
const __m128 xpose2_1 = _mm_shuffle_ps(xpose1_0, xpose1_1, _MM_SHUFFLE(3, 1, 3, 1));
const __m128 xpose2_2 = _mm_shuffle_ps(xpose1_2, xpose1_1, _MM_SHUFFLE(2, 0, 1, 0));

// sum the components to get the squared length
const __m128 sum1 = _mm_add_ps(xpose2_1, xpose2_2);
const __m128 lensq = _mm_add_ps(xpose2_0, sum1);

To this:

// prepare to add, transposing the data into place
// first transpose
//
// 0: x0 y0 z0 x1      x0 y0 y1 z1
// 1: y1 z1 x2 y2  =>  z0 x1 x2 y2
// 2: z2 x3 y3 z3      z2 x3 y3 z3
const __m128 xpose1_0 = _mm_movelh_ps(square_0, square_1);
const __m128 xpose1_1 = _mm_movehl_ps(square_1, square_0);
const __m128 xpose1_2 = square_2;

// second transpose
//
// 0: x0 y0 y1 z1      x0 y0 y1 z1
// 1: z0 x1 x2 y2  =>  z0 x1 z2 x3
// 2: z2 x3 y3 z3      x2 y2 y3 z3
const __m128 xpose2_0 = xpose1_0;
const __m128 xpose2_1 = _mm_movelh_ps(xpose1_1, xpose1_2);
const __m128 xpose2_2 = _mm_movehl_ps(xpose1_2, xpose1_1);

// third transpose
//
// 0: x0 y0 y1 z1      x0 y1 x2 y3
// 1: z0 x1 z2 x3  =>  z0 x1 z2 x3
// 2: x2 y2 y3 z3      y0 z1 y2 z3
const __m128 xpose3_0 = _mm_shuffle_ps(xpose2_0, xpose2_2, _MM_SHUFFLE(2, 0, 2, 0));
const __m128 xpose3_1 = xpose2_1;
const __m128 xpose3_2 = _mm_shuffle_ps(xpose2_0, xpose2_2, _MM_SHUFFLE(3, 1, 3, 1));

// sum the components to get the squared length
const __m128 sum1 = _mm_add_ps(xpose3_0, xpose3_1);
const __m128 lensq = _mm_add_ps(xpose3_2, sum1);

I left in the extraneous rows (e.g., xpose3_1 = xpose2_1) just to improve clarity. Compiling in release removes them since they’re not really necessary.

This change makes a pretty decent difference:

Name         Total      Shortest  Longest  Average  Percent
load change  1,108,781  106       64,499   110      16.24%
move hl/lh   1,008,740  97        120,744  100      14.78%

There are still many repeated instructions whose throughput is 0.5. If we could interleave the different operations, we could save nearly one cycle per instruction. Unfortunately, the operations are fairly dependent on each other, and where they aren’t, interleaving would likely cause more harm than good because of latency.

One option we could consider is to process 8 vectors at a time instead of 4. This would give us the ability to mix the instructions better, decreasing the time we spend stalled on throughput. But there’s a problem: we only have 16 SIMD registers. By processing 8 vec3’s at a time, we’d need at least 18 registers. If more registers are needed than are available, the compiler spills values to memory, which is very slow. I’m not entirely ruling this out yet. I’m sure we could rearrange the code so we don’t use 18 registers but, for now, the lost readability is too much. I’ll put this method in my back pocket for later.

Let’s recap the progression of each version tried:

Name         Total      Shortest  Longest  Average  Percent of Original
scalar       6,826,228  658       234,622  682      100.00%
first sse    1,204,622  115       225,868  120      17.65%
load change  1,108,781  106       64,499   110      16.24%
move hl/lh   1,008,740  97        120,744  100      14.78%

It’s not the fastest it could be, but at some point you have to weigh the gain versus the cost. Honestly, if I weren’t doing this for fun, I probably would have stopped at “first sse” unless this was a critical loop for a well known platform. Reducing the time by that much would usually put another offender on the top of the list, changing the optimization target.

Overall, I’m pretty happy with the result. It satisfies the original constraints: it doesn’t require or change data alignment, the function signature is identical to the original so there’s no other refactoring work, and the change is invisible to most applications. We did lose some precision by moving from scalar float math to SSE (mostly the rsqrt approximation), but it’s within what I would consider a tolerable difference.

Here’s the final code:

typedef struct _vec3 {
    float x;
    float y;
    float z;
} vec3;

inline void vec3Normalize(const vec3 * const in, vec3 * const out) {
    const float len = sqrt(in->x * in->x + in->y * in->y + in->z * in->z);
    out->x = in->x / len;
    out->y = in->y / len;
    out->z = in->z / len;
}

void vec3BatchNormalizeSSE(const vec3 * const in, const uint32_t count, vec3 * const out) {
    const uint32_t prepass = ((uintptr_t)in & 0xf) / 4;
    const uint32_t midpass = (count - prepass) & ~3;

    // process as many vectors as it takes to get to aligned data
    for (uint32_t i = 0; i < prepass; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }

    __m128 active_0;
    __m128 active_1;
    __m128 active_2;
    {
        const float * const src = &in[prepass].x;
        active_0 = _mm_load_ps(src + 0);
        active_1 = _mm_load_ps(src + 4);
        active_2 = _mm_load_ps(src + 8);
    }

    for (uint32_t i = prepass; i < prepass + midpass; i += 4) {
        // compute the square of each vector
        const __m128 square_0 = _mm_mul_ps(active_0, active_0);
        const __m128 square_1 = _mm_mul_ps(active_1, active_1);
        const __m128 square_2 = _mm_mul_ps(active_2, active_2);

        // prepare to add, transposing the data into place
        // first transpose
        //
        // 0: x0 y0 z0 x1      x0 y0 y1 z1
        // 1: y1 z1 x2 y2  =>  z0 x1 x2 y2
        // 2: z2 x3 y3 z3      z2 x3 y3 z3
        const __m128 xpose1_0 = _mm_movelh_ps(square_0, square_1);
        const __m128 xpose1_1 = _mm_movehl_ps(square_1, square_0);

        // second transpose
        //
        // 0: x0 y0 y1 z1      x0 y0 y1 z1
        // 1: z0 x1 x2 y2  =>  z0 x1 z2 x3
        // 2: z2 x3 y3 z3      x2 y2 y3 z3
        const __m128 xpose2_1 = _mm_movelh_ps(xpose1_1, square_2);
        const __m128 xpose2_2 = _mm_movehl_ps(square_2, xpose1_1);

        // third transpose
        //
        // 0: x0 y0 y1 z1      x0 y1 x2 y3
        // 1: z0 x1 z2 x3  =>  z0 x1 z2 x3
        // 2: x2 y2 y3 z3      y0 z1 y2 z3
        const __m128 xpose3_0 = _mm_shuffle_ps(xpose1_0, xpose2_2, _MM_SHUFFLE(2, 0, 2, 0));
        const __m128 xpose3_2 = _mm_shuffle_ps(xpose1_0, xpose2_2, _MM_SHUFFLE(3, 1, 3, 1));

        // sum the components to get the squared length
        const __m128 sum1 = _mm_add_ps(xpose3_0, xpose2_1);
        const __m128 lensq = _mm_add_ps(xpose3_2, sum1);

        // calculate the scale as a reciprocal sqrt
        const __m128 rcpsqrt = _mm_rsqrt_ps(lensq);

        // to apply it, we have to mangle it around again
        //                  s0, s0, s0, s1
        // x, y, z, w  =>   s1, s1, s2, s2
        //                  s2, s3, s3, s3
        const __m128 scale_0 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(1, 0, 0, 0));
        const __m128 scale_1 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(2, 2, 1, 1));
        const __m128 scale_2 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(3, 3, 3, 2));

        // multiply the original vector by the scale, completing the normalize
        const __m128 norm_0 = _mm_mul_ps(active_0, scale_0);
        const __m128 norm_1 = _mm_mul_ps(active_1, scale_1);
        const __m128 norm_2 = _mm_mul_ps(active_2, scale_2);

        // store the result into the output array (unaligned data), interleaved
        // with the loads for the next iteration (note: on the final iteration
        // this reads ahead of the mid-pass data)
        const float * const src = &in[i].x;
        float * const dst = &out[i].x;
        _mm_storeu_ps(dst + 0, norm_0);
        active_0 = _mm_load_ps(src + 12);
        _mm_storeu_ps(dst + 4, norm_1);
        active_1 = _mm_load_ps(src + 16);
        _mm_storeu_ps(dst + 8, norm_2);
        active_2 = _mm_load_ps(src + 20);
    }

    // process any leftover vectors individually
    for (uint32_t i = prepass + midpass; i < count; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }
}


While I do cover some of the basics of what SIMD is, why it’s faster, etc., this is by no means meant to be a tutorial on the subject. If you’re looking for one, there are lots of them out there, try searching.

SIMD, or Single Instruction Multiple Data, is a way of having one instruction apply to multiple pieces of data. For example, if I have two SIMD vectors and I add them together, it’s one instruction (ADDPS, for example) that applies to every element of each of the vectors:

When done with scalar math, each addition is a single instruction, so theoretically you could run at 4x speed if you have perfect scalar to SIMD math conversions. Of course, it never works out this way, but that’s a topic for another discussion…

Vector math libraries typically come in one of two varieties: a 3 element { x, y, z } or a 4 element { x, y, z, w } version. Since the w component of a 4 element vector is usually 1 for a point or 0 for a vector, it’s typically left off. Keeping it can cost a fairly significant amount of memory and the usage (point versus ray) can be assumed in most uses.
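The memory cost is easy to quantify. On typical platforms where a float is 4 bytes and neither struct gets padding, dropping w makes a packed array of 3-element vectors 25% smaller than the same array padded to 4 elements:

```cpp
#include <cassert>

// The two common layouts: packed 3-element and padded 4-element.
struct vec3_packed { float x, y, z; };     // 12 bytes per vector
struct vec4_padded { float x, y, z, w; };  // 16 bytes per vector
```

For, say, a mesh of 100,000 positions, that is roughly 400 KB saved, which also means fewer cache lines touched when streaming through the data.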

Here’s a SIMD vector4 in memory:

Here’s a scalar vector3 in memory:

The colored block [–] marks 16 byte boundaries. CPUs are really good at loading aligned data and SIMD vectors always start with X on an even 16 byte boundary, making them really fast. The scalar version aligns on a natural boundary of a float, or 4 bytes. Not being on an even 16 byte boundary, it is very slow to get into or out of SIMD.

Here’s our scalar vec3 definition:

typedef struct _vec3 {
    float x;
    float y;
    float z;
} vec3;

Here’s a function to normalize a single vector:

inline void vec3Normalize(const vec3 * const in, vec3 * const out) {
    const float len = sqrt(in->x * in->x + in->y * in->y + in->z * in->z);
    out->x = in->x / len;
    out->y = in->y / len;
    out->z = in->z / len;
}

And here’s our naïve method of a scalar batch normalize:

void vec3BatchNormalize(const vec3 * const in, const uint32_t count, vec3 * const out) {
    for (uint32_t i = 0; i < count; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }
}

We could convert this code to SIMD by loading the vec3’s into SIMD registers, but we have a few problems. First, our data isn’t aligned, so it’s going to be slow. Second, you can’t load a vec3 into a SIMD vec4 without reading either the X component of the next vector or invalid memory.

To fix the first issue, we know that the input data will be on a natural alignment of 4 bytes. This means that a 16 byte aligned vec3 will always be found within the first 4 vec3’s (12 floats); at most 3 of them need to be handled individually. We’ll call these our “pre-pass” vectors, since these are the ones we’ll handle individually before the real batch processing begins.

To fix the second issue, we could just process the data differently. Rather than handling a single vector at a time, we can do four at a time. This also relates to alignment – if we process four at a time, we’re guaranteed to always be aligned. To determine how many we can do, the math is simply (count – prepass) & (~3). We’ll call this our “mid-pass” count.

Of course, we have to process the trailing vectors as well, but that’s simply whatever’s left over, or (count – (prepass + midpass)).
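The three-way split described above can be pulled out into a helper so the arithmetic can be checked on its own. This is my restatement of the math from the post, not code from it; it assumes, as the post does, that the input pointer is at least 4-byte aligned and that count covers at least the pre-pass vectors. The trick that makes the pre-pass count work is that each vec3 is 12 bytes, and 12 is -4 mod 16, so skipping (offset / 4) vectors always lands the pointer back on a 16 byte boundary:

```cpp
#include <cassert>
#include <cstdint>

// Splits a batch of vec3s (12 bytes each) starting at address 'addr' into
// the three passes: scalar pre-pass up to alignment, SIMD mid-pass in
// groups of 4, and scalar post-pass for the leftovers.
void split_batch(uintptr_t addr, uint32_t count,
                 uint32_t* prepass, uint32_t* midpass, uint32_t* postpass) {
    // byte offset from 16-byte alignment is 0, 4, 8, or 12;
    // dividing by 4 gives the number of vec3s to skip (0..3)
    *prepass = (uint32_t)((addr & 0xf) / 4);
    // largest multiple of 4 that fits in what's left
    *midpass = (count - *prepass) & ~3u;
    // whatever remains is handled one at a time at the end
    *postpass = count - (*prepass + *midpass);
}
```

For example, a pointer 4 bytes past a 16 byte boundary needs 1 pre-pass vector, because 4 + 12 = 16; one 12 bytes past needs 3, because 12 + 36 = 48.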

Before inserting SIMD, we have this code as a template to work with:

void vec3BatchNormalizeSSE0(const vec3 * const in, const uint32_t count, vec3 * const out) {
    const uint32_t prepass = ((uintptr_t)in & 0xf) / 4;
    const uint32_t midpass = (count - prepass) & ~3;

    for (uint32_t i = 0; i < prepass; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }
    for (uint32_t i = prepass; i < prepass + midpass; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }
    for (uint32_t i = prepass + midpass; i < count; ++i) {
        vec3Normalize(&in[i], &out[i]);
    }
}

Processing the vectors four at a time makes things a bit more interesting. We start by loading four vec3’s into three vec4’s:

const float * const src = &in[i].x;
const __m128 source[] = {
    _mm_load_ps(src + 0),
    _mm_load_ps(src + 4),
    _mm_load_ps(src + 8)
};

It looks like this in memory (I’ve colored the scalar vectors to make them easier to see):

To normalize a vector, we divide it by its length, which is sqrt(x*x + y*y + z*z). So our first SIMD operation is (x*x + y*y + z*z), but there’s a problem: that sum is per-vec3, and our first vec4 is filled with { x0, y0, z0, x1 } instead.

No worries! By processing them four at a time, we can make use of the w component to do some work for us as well. Rather than simply leaving the w component as zero or one, or worse, taking the time to zero it out, we use that slot for data in the next vector. We’re doing the same calculation later, so we may as well use the slot!

Multiply the vectors by themselves to get the squared values:

const __m128 square[] = {
    _mm_mul_ps(source[0], source[0]),
    _mm_mul_ps(source[1], source[1]),
    _mm_mul_ps(source[2], source[2])
};

In memory, the result of the multiply looks like this:

Using the shuffle instruction, we can move the vector around so we can sum the results in one pass. Because of our data organization, we have to do this in two steps:

const __m128 xpose1[] = {
    _mm_shuffle_ps(square[0], square[1], _MM_SHUFFLE(1, 0, 1, 0)),
    _mm_shuffle_ps(square[1], square[2], _MM_SHUFFLE(3, 2, 3, 2)),
    _mm_shuffle_ps(square[0], square[2], _MM_SHUFFLE(1, 0, 3, 2))
};
const __m128 xpose2[] = {
    _mm_shuffle_ps(xpose1[0], xpose1[2], _MM_SHUFFLE(3, 2, 2, 0)),
    _mm_shuffle_ps(xpose1[0], xpose1[1], _MM_SHUFFLE(3, 1, 3, 1)),
    _mm_shuffle_ps(xpose1[2], xpose1[1], _MM_SHUFFLE(2, 0, 1, 0))
};

Now we can easily sum all three of the vectors into one final squared length vector and calculate the reciprocal square root:

const __m128 sum1 = _mm_add_ps(xpose2[1], xpose2[2]);
const __m128 lensq = _mm_add_ps(xpose2[0], sum1);
const __m128 rcpsqrt = _mm_rsqrt_ps(lensq);

We now have a vector that contains the scale for the original data, but is, yet again, out of order. To fix it, we reshuffle the resulting vector out to three vectors that we can multiply against the original input:

const __m128 scale[] =
{
    _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(1, 0, 0, 0)),
    _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(2, 2, 1, 1)),
    _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(3, 3, 3, 2))
};

Multiply the original input vectors by the scale, creating the normal vectors:

const __m128 norm[] =
{
    _mm_mul_ps(source[0], scale[0]),
    _mm_mul_ps(source[1], scale[1]),
    _mm_mul_ps(source[2], scale[2])
};

And, finally, store the normals to the output buffer. Since we don’t know the alignment of the output, we have to do unaligned stores, which are very slow:

float * const dst = &out[i].x;
_mm_storeu_ps(dst + 0, norm[0]);
_mm_storeu_ps(dst + 4, norm[1]);
_mm_storeu_ps(dst + 8, norm[2]);
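That penalty is avoidable when the destination happens to be 16-byte aligned: each batch writes 48 bytes, so an aligned start stays aligned on every iteration, and a single check could select the fast path. A hedged sketch (store3 is a hypothetical helper, not part of the code above):

```cpp
#include <xmmintrin.h>
#include <cstdint>

// Pick aligned or unaligned stores based on the destination address.
// A real implementation would hoist the alignment test out of the
// batch loop entirely, since 48-byte strides preserve 16-byte alignment.
static inline void store3(float * const dst,
                          const __m128 a, const __m128 b, const __m128 c)
{
    if (((uintptr_t)dst & 0xf) == 0)
    {
        // dst is 16-byte aligned; +4 and +8 floats stay aligned
        _mm_store_ps(dst + 0, a);
        _mm_store_ps(dst + 4, b);
        _mm_store_ps(dst + 8, c);
    }
    else
    {
        _mm_storeu_ps(dst + 0, a);
        _mm_storeu_ps(dst + 4, b);
        _mm_storeu_ps(dst + 8, c);
    }
}
```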

Here’s the full code, ready for your copy/paste pleasure:

void vec3BatchNormalizeSSE1(const vec3 * const in, const uint32_t count, vec3 * const out)
{
    const uint32_t prepass = ((uintptr_t)in & 0xf) / 4;
    const uint32_t midpass = (count - prepass) & ~3;

    // process as many vectors as it takes to get to aligned data
    for (uint32_t i = 0; i < prepass; ++i)
    {
        vec3Normalize(&in[i], &out[i]);
    }

    for (uint32_t i = prepass; i < prepass + midpass; i += 4)
    {
        // load the vectors from the source data
        const float * const src = &in[i].x;
        const __m128 source[] =
        {
            _mm_load_ps(src + 0),
            _mm_load_ps(src + 4),
            _mm_load_ps(src + 8)
        };

        // compute the square of each vector
        const __m128 square[] =
        {
            _mm_mul_ps(source[0], source[0]),
            _mm_mul_ps(source[1], source[1]),
            _mm_mul_ps(source[2], source[2])
        };

        // prepare to add, transposing the data into place
        // first transpose
        //
        // x0, y0, z0, x1    x0, y0, y1, z1
        // y1, z1, x2, y2 => x2, y2, y3, z3
        // z2, x3, y3, z3    z0, x1, z2, x3
        const __m128 xpose1[] =
        {
            _mm_shuffle_ps(square[0], square[1], _MM_SHUFFLE(1, 0, 1, 0)),
            _mm_shuffle_ps(square[1], square[2], _MM_SHUFFLE(3, 2, 3, 2)),
            _mm_shuffle_ps(square[0], square[2], _MM_SHUFFLE(1, 0, 3, 2))
        };

        // second transpose
        //
        // x0, y0, y1, z1    x0, y1, z2, x3
        // x2, y2, y3, z3 => y0, z1, y2, z3
        // z0, x1, z2, x3    z0, x1, x2, y3
        const __m128 xpose2[] =
        {
            _mm_shuffle_ps(xpose1[0], xpose1[2], _MM_SHUFFLE(3, 2, 2, 0)),
            _mm_shuffle_ps(xpose1[0], xpose1[1], _MM_SHUFFLE(3, 1, 3, 1)),
            _mm_shuffle_ps(xpose1[2], xpose1[1], _MM_SHUFFLE(2, 0, 1, 0))
        };

        // sum the components to get the squared length
        const __m128 sum1 = _mm_add_ps(xpose2[1], xpose2[2]);
        const __m128 lensq = _mm_add_ps(xpose2[0], sum1);

        // calculate the scale as a reciprocal sqrt
        const __m128 rcpsqrt = _mm_rsqrt_ps(lensq);

        // to apply it, we have to mangle it around again
        //               s0, s0, s0, s1
        // x, y, z, w => s1, s1, s2, s2
        //               s2, s3, s3, s3
        const __m128 scale[] =
        {
            _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(1, 0, 0, 0)),
            _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(2, 2, 1, 1)),
            _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(3, 3, 3, 2))
        };

        // multiply the original vector by the scale, completing the normalize
        const __m128 norm[] =
        {
            _mm_mul_ps(source[0], scale[0]),
            _mm_mul_ps(source[1], scale[1]),
            _mm_mul_ps(source[2], scale[2])
        };

        // store the result into the output array (unaligned data)
        float * const dst = &out[i].x;
        _mm_storeu_ps(dst + 0, norm[0]);
        _mm_storeu_ps(dst + 4, norm[1]);
        _mm_storeu_ps(dst + 8, norm[2]);
    }

    // process any leftover vectors individually
    for (uint32_t i = prepass + midpass; i < count; ++i)
    {
        vec3Normalize(&in[i], &out[i]);
    }
}

As a first pass, that’s a pretty significant improvement. My perf testing in release (MS Visual Studio in /O2) reports these clock cycles per batch normalize (10,000 runs with 1,000 warm-up passes of 4107 pre-generated vec3’s):

Name           Total      Shortest  Longest  Average
scalar         7,278,466  658       180,093  659
first attempt  1,312,945  117       112,395  117

That’s not too shabby of a start! The SIMD version runs in about 18% of the total time of the original.

In fact, here’s the same run with /Ox optimizations turned on with the SIMD version running in about 19% of the scalar time:

Name           Total      Shortest  Longest  Average
scalar         7,532,213  658       246,995  659
first attempt  1,429,562  117       112,923  117

And just to cover some more bases, here’s the debug run with optimizations disabled, running in about 34% of the scalar time:

Name           Total       Shortest  Longest  Average
scalar         16,559,810  1530      273,216  1530
first attempt  5,659,434   520       95,957   520

Here’s where it gets really fun, though. The above version is somewhat naïve in that it doesn’t take into account data loads, pipelining, etc. The only pipelining being done is by the compiler and the CPU at runtime. We can definitely do better.

But that’s for the next post!
