Have you read Part 1?

As a recap, this series is about using SIMD (specifically, 128-bit SSE vectors of four floats) to optimize a batch vector normalization routine. While the end result may not be the most super-useful-awesomest-thing-in-the-world, it's a simple target for covering vec3 operations in SIMD and some optimization techniques.

This post is mostly about instruction pipelining, and it leans heavily on timing data for the vector instructions involved.

Instruction Pipeline

The super-mini introduction to instruction pipelines is this: every instruction takes time to complete, but you can issue the next instruction before the previous one has finished.

Assume, for a moment, that we have this super awesome computer that can do asynchronous operations. Among the commands we have available is an ADD command that takes two variables, adds them together, and returns the result. ADD takes 1 second to complete, but you can start as many of them as you like in that one second.
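The original listing isn't reproduced here, but it boils down to something like this (pseudocode for our imaginary machine):

```
x0 = ADD(a, b)   // issued at t = 0, result ready at t = 1
x1 = ADD(c, d)   // also issued at t = 0 - it doesn't depend on the first ADD
r  = ADD(x0, x1) // needs both x0 and x1, so it can't even start until t = 1
```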

Here, both ADD(a, b) and ADD(c, d) can be processed at the same time since they don't depend on each other. The final ADD, however, relies on the outputs of both of the previous two, so it has to wait the full second for x0 and x1 to come back before it can do its computation. This dependency is called a hazard, and the delay it causes is a pipeline stall, or bubble.

When discussing pipelines, you'll hear two terms a lot: latency and throughput. Latency means "how long does it take for this instruction to finish," and throughput means "how many of these can I issue to the pipeline in one cycle." For example, an instruction with a latency of 4 and a throughput of 1 can be issued every cycle, but each result only becomes available 4 cycles after its instruction was issued. One very good analogy uses pipes transporting water to illustrate the difference.

The Code

I’ve slightly reorganized the code since the first post as we’ll be moving instructions around quite a bit.

The vec3 structure:
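The exact listing isn't reproduced here; as a sketch, it's just three tightly packed floats with no alignment requirements:

```cpp
// Sketch of the vec3 structure from Part 1 - a plain, unaligned three-float vector.
struct vec3
{
    float x, y, z;
};
```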

The scalar normalize function:
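Again a sketch rather than the original listing (the name and exact form are assumptions): the scalar version simply divides each component by the vector's length.

```cpp
#include <cmath>

// Scalar reference: divide each component by the vector's length.
// (Illustrative sketch - not the post's exact code.)
inline vec3 normalize(const vec3& v)
{
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    vec3 r;
    r.x = v.x / len;
    r.y = v.y / len;
    r.z = v.z / len;
    return r;
}
```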

The current batch normalization function:
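The original listing isn't reproduced here either, so the following is a representative reconstruction of the shape of the Part 1 function: three unaligned loads, an AoS-to-SoA transpose via shuffles, the normalize math, a transpose back, and three unaligned stores. The function name, signature, shuffle choices, and use of a full square root (rather than a reciprocal estimate) are all assumptions, not the post's exact code.

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics (MOVUPS, SHUFPS, SQRTPS, ...)

// Representative sketch: normalize vec3s four at a time.
// Assumes count is a multiple of 4; remainder handling is omitted for brevity.
void NormalizeBatch(vec3* dest, const vec3* source, std::size_t count)
{
    const float* src = &source->x;
    float* dst = &dest->x;

    for (std::size_t i = 0; i < count; i += 4, src += 12, dst += 12)
    {
        // Three unaligned loads pull in four packed vec3s (12 floats).
        __m128 v0 = _mm_loadu_ps(src + 0); // x0 y0 z0 x1
        __m128 v1 = _mm_loadu_ps(src + 4); // y1 z1 x2 y2
        __m128 v2 = _mm_loadu_ps(src + 8); // z2 x3 y3 z3

        // Transpose AoS -> SoA so each register holds one component of all four vectors.
        __m128 t0 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(2, 1, 3, 2)); // x2 y2 x3 y3
        __m128 t1 = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(1, 0, 2, 1)); // y0 z0 y1 z1
        __m128 x  = _mm_shuffle_ps(v0, t0, _MM_SHUFFLE(2, 0, 3, 0)); // x0 x1 x2 x3
        __m128 y  = _mm_shuffle_ps(t1, t0, _MM_SHUFFLE(3, 1, 2, 0)); // y0 y1 y2 y3
        __m128 z  = _mm_shuffle_ps(t1, v2, _MM_SHUFFLE(3, 0, 3, 1)); // z0 z1 z2 z3

        // length = sqrt(x*x + y*y + z*z), then divide each component by it.
        __m128 len = _mm_sqrt_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                                       _mm_mul_ps(y, y)),
                                            _mm_mul_ps(z, z)));
        x = _mm_div_ps(x, len);
        y = _mm_div_ps(y, len);
        z = _mm_div_ps(z, len);

        // Transpose SoA -> AoS again for the packed stores.
        __m128 u0 = _mm_shuffle_ps(x, y, _MM_SHUFFLE(2, 0, 2, 0));   // x0 x2 y0 y2
        __m128 u1 = _mm_shuffle_ps(y, z, _MM_SHUFFLE(3, 1, 3, 1));   // y1 y3 z1 z3
        __m128 u2 = _mm_shuffle_ps(z, x, _MM_SHUFFLE(3, 1, 2, 0));   // z0 z2 x1 x3
        __m128 r0 = _mm_shuffle_ps(u0, u2, _MM_SHUFFLE(2, 0, 2, 0)); // x0 y0 z0 x1
        __m128 r1 = _mm_shuffle_ps(u1, u0, _MM_SHUFFLE(3, 1, 2, 0)); // y1 z1 x2 y2
        __m128 r2 = _mm_shuffle_ps(u2, u1, _MM_SHUFFLE(3, 1, 3, 1)); // z2 x3 y3 z3

        // Three unaligned stores write the results back out.
        _mm_storeu_ps(dst + 0, r0);
        _mm_storeu_ps(dst + 4, r1);
        _mm_storeu_ps(dst + 8, r2);
    }
}
```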

When optimizing at this low a level, you need to be aware of your target CPU. Just because a change is faster on the CPU you're developing on doesn't mean it'll be faster on others. The CPU I'm optimizing for is an AMD Turion 64 from the AMD K8 series. According to Agner Fog's instruction tables, these are the timings we need to look at:

Note that the ones marked with ‘???’ don’t currently have measurements in the tables.

Looking For Change

Looking through the code, we can see that the results of the initial three loads are consumed immediately. That leaves no time for the loads to complete (a latency problem), although issuing them back to back is fine (throughput isn't the issue here):
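In the sketch above, that's the top of the loop body, where the last load feeds straight into the first shuffle:

```cpp
__m128 v0 = _mm_loadu_ps(src + 0); // x0 y0 z0 x1
__m128 v1 = _mm_loadu_ps(src + 4); // y1 z1 x2 y2
__m128 v2 = _mm_loadu_ps(src + 8); // z2 x3 y3 z3

// v1 and v2 are consumed by the very next instruction - no time to hide the load latency.
__m128 t0 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(2, 1, 3, 2));
```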

At the end of the loop, however, is where we store the results back out to the destination pointer. These instructions, MOVUPS, have a throughput of 0.5, which means only one MOVUPS can be issued every two cycles.

We can alleviate both problems by loading the data for the next iteration at the end of the loop, interleaving those loads with the stores to cover the stores' throughput stall. This gives the loads more time to complete before their results are needed and fills the otherwise idle cycles between the stores:
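Here's a sketch of the restructuring (again illustrative rather than the post's actual listing): the loads for the first group move ahead of the loop, and each store inside the loop is followed by one of the loads for the next group.

```cpp
// Prologue: loads for the first group of four vec3s happen before the loop.
__m128 v0 = _mm_loadu_ps(src + 0);
__m128 v1 = _mm_loadu_ps(src + 4);
__m128 v2 = _mm_loadu_ps(src + 8);

for (std::size_t i = 0; i < count; i += 4, src += 12, dst += 12)
{
    // ... same transpose / normalize / transpose-back as before, producing r0, r1, r2 ...

    // Stores interleaved with the next group's loads: the loads fill the cycle
    // each store leaves idle, and they get a head start on their latency.
    // (The last iteration needs special handling so we don't read past the end
    // of the source array - omitted here.)
    _mm_storeu_ps(dst + 0, r0);
    v0 = _mm_loadu_ps(src + 12);
    _mm_storeu_ps(dst + 4, r1);
    v1 = _mm_loadu_ps(src + 16);
    _mm_storeu_ps(dst + 8, r2);
    v2 = _mm_loadu_ps(src + 20);
}
```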

The end result:

That’s a pretty good improvement for a fairly simple change.

Shuffle And Move

The next thing we can look at is replacing some of these commands with more appropriate versions. A couple of suspects jump out pretty quickly:

Any time I see patterns like this, I'm suspicious. As it turns out, shuffling with 0101 or 2323 is equivalent to _mm_movelh_ps and _mm_movehl_ps (with the operand order reversed for MOVHLPS). Agner Fog's instruction tables list a latency of 2 and a throughput of 2 for both instructions. If we insert them and try it out, the result is slightly faster. And when I say slightly, I mean it: the new version runs in 99% of the time of the previous version, or about one clock saved per iteration.
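A quick way to convince yourself of the equivalence (note that MOVHLPS wants its operands in the opposite order to produce the same result):

```cpp
#include <xmmintrin.h>

// For a = [a0 a1 a2 a3] and b = [b0 b1 b2 b3], element 0 being the lowest lane:
void ShuffleVsMove(__m128 a, __m128 b, __m128 out[4])
{
    out[0] = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 1, 0)); // a0 a1 b0 b1
    out[1] = _mm_movelh_ps(a, b);                           // a0 a1 b0 b1 - identical
    out[2] = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 3, 2)); // a2 a3 b2 b3
    out[3] = _mm_movehl_ps(b, a);                           // a2 a3 b2 b3 - identical, operands swapped
}
```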

Rather than keep that change, I looked at the code again in the light of using MOVLHPS and MOVHLPS, and it turns out there's a better change to be had, going from this:

To this:

I left in the extraneous rows (e.g., xpose3_1 = xpose2_1) just to improve clarity. The compiler removes them in release builds since they're not really necessary.

This change makes a pretty decent difference:

Packing It In

There are still many repeated instructions whose throughput is 0.5. If we could interleave the different operations, we could save nearly one cycle per instruction. Unfortunately, the operations are almost all dependent on one another, and where they aren't, interleaving would likely cause more harm than good because of latency.

One option we could consider is processing 8 vectors at a time instead of 4. That would let us mix the instructions better, reducing the time we spend stalled on throughput. But there's a problem: we only have 16 SIMD registers. Processing 8 vec3s at a time would need at least 18 registers, and when more registers are needed than are available, values get spilled to memory, which is very slow. I'm not entirely ruling this out yet; I'm sure the code could be rearranged so it doesn't need 18 registers, but for now the lost readability is too much. I'll put this method in my back pocket for later.

The End?

Let’s recap the progression of each version tried:

It's not the fastest it could be, but at some point you have to weigh the gain against the cost. Honestly, if I weren't doing this for fun, I probably would have stopped at "first sse" unless this was a critical loop on a well-known platform. Reducing the time by that much would usually put another offender at the top of the list, changing the optimization target.

Overall, I'm pretty happy with the result. It satisfies the original constraints: it doesn't require or change data alignment, the function signature is identical to the original so no other refactoring is needed, and the change is invisible to most applications. We did lose some precision by moving from scalar float math to SSE, but it's within what I'd consider a tolerable difference.

Here’s the final code:
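The final listing isn't reproduced here, so what follows is a reconstruction that folds in the load/store interleaving from above; the names, shuffle order, and remainder handling are assumptions. Note that this reconstruction's transpose doesn't happen to use the 0101/2323 patterns discussed earlier, so the MOVLHPS/MOVHLPS substitution from the original post isn't reflected in it.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Sketch of the final version: loads hoisted ahead of the loop and the next
// group's loads interleaved with the current group's stores. Assumes count is
// a multiple of 4; remainder handling is omitted for brevity.
void NormalizeBatch(vec3* dest, const vec3* source, std::size_t count)
{
    const std::size_t groups = count / 4;
    if (groups == 0)
        return;

    const float* src = &source->x;
    float* dst = &dest->x;

    // Prologue: loads for the first group.
    __m128 v0 = _mm_loadu_ps(src + 0);
    __m128 v1 = _mm_loadu_ps(src + 4);
    __m128 v2 = _mm_loadu_ps(src + 8);

    for (std::size_t g = 0; g < groups; ++g, src += 12, dst += 12)
    {
        // AoS -> SoA transpose.
        __m128 t0 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(2, 1, 3, 2)); // x2 y2 x3 y3
        __m128 t1 = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(1, 0, 2, 1)); // y0 z0 y1 z1
        __m128 x  = _mm_shuffle_ps(v0, t0, _MM_SHUFFLE(2, 0, 3, 0)); // x0 x1 x2 x3
        __m128 y  = _mm_shuffle_ps(t1, t0, _MM_SHUFFLE(3, 1, 2, 0)); // y0 y1 y2 y3
        __m128 z  = _mm_shuffle_ps(t1, v2, _MM_SHUFFLE(3, 0, 3, 1)); // z0 z1 z2 z3

        // Normalize all four vectors at once.
        __m128 len = _mm_sqrt_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                                       _mm_mul_ps(y, y)),
                                            _mm_mul_ps(z, z)));
        x = _mm_div_ps(x, len);
        y = _mm_div_ps(y, len);
        z = _mm_div_ps(z, len);

        // SoA -> AoS transpose.
        __m128 u0 = _mm_shuffle_ps(x, y, _MM_SHUFFLE(2, 0, 2, 0));   // x0 x2 y0 y2
        __m128 u1 = _mm_shuffle_ps(y, z, _MM_SHUFFLE(3, 1, 3, 1));   // y1 y3 z1 z3
        __m128 u2 = _mm_shuffle_ps(z, x, _MM_SHUFFLE(3, 1, 2, 0));   // z0 z2 x1 x3
        __m128 r0 = _mm_shuffle_ps(u0, u2, _MM_SHUFFLE(2, 0, 2, 0)); // x0 y0 z0 x1
        __m128 r1 = _mm_shuffle_ps(u1, u0, _MM_SHUFFLE(3, 1, 2, 0)); // y1 z1 x2 y2
        __m128 r2 = _mm_shuffle_ps(u2, u1, _MM_SHUFFLE(3, 1, 3, 1)); // z2 x3 y3 z3

        // Stores interleaved with the loads for the next group. The loads are
        // guarded so the last iteration doesn't read past the end of source;
        // a production version would more likely peel the final iteration to
        // keep the loop branch-free.
        const bool last = (g + 1 == groups);
        _mm_storeu_ps(dst + 0, r0);
        if (!last) v0 = _mm_loadu_ps(src + 12);
        _mm_storeu_ps(dst + 4, r1);
        if (!last) v1 = _mm_loadu_ps(src + 16);
        _mm_storeu_ps(dst + 8, r2);
        if (!last) v2 = _mm_loadu_ps(src + 20);
    }
}
```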
