Recently, I wanted to see how fast I could make a function that normalizes every vector in an array. The goal was to take “natural” data and make it as fast as possible without putting any undue restrictions on the caller, such as alignment. While this article specifically uses 128-bit SSE vectors composed of four floats, the same general concept applies to any size vector on any platform.

While I do cover some of the basics of what SIMD is, why it’s faster, and so on, this is by no means meant to be a tutorial on the subject. If you’re looking for one, there are plenty out there; try searching.

SIMD, or Single Instruction Multiple Data, is a way of having one instruction apply to multiple pieces of data. For example, if I have two SIMD vectors and I add them together, it’s one instruction (ADDPS, for example) that applies to every element of each of the vectors:

[Image: vec4_add]
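In SSE terms, that one-instruction add can be sketched with intrinsics (the wrapper name here is my own; `_mm_add_ps` compiles down to ADDPS):

```cpp
#include <xmmintrin.h>

// One ADDPS: adds all four float lanes of a and b at once.
__m128 add4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}
```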

When done with scalar math, each addition is a single instruction, so in theory a perfect scalar-to-SIMD conversion could run at 4x speed. Of course, it never works out this way, but that’s a topic for another discussion…

Vector math libraries typically come in one of two varieties: a 3 element { x, y, z } or a 4 element { x, y, z, w } version. Since the w component of a 4 element vector is usually 1 for a point or 0 for a vector, it’s typically left off. Keeping it costs a fairly significant amount of memory, and whether a value is a point or a vector can usually be assumed from context.

Here’s a SIMD vector4 in memory:
[Image: simd_vector]

Here’s a scalar vector3 in memory:
[Image: scalar_vector]

The colored blocks mark 16 byte boundaries. CPUs are really good at loading aligned data, and a SIMD vector4 always starts its X on an even 16 byte boundary, making it really fast to load. The scalar version aligns on the natural boundary of a float, or 4 bytes. Since that’s rarely an even 16 byte boundary, it is very slow to get into or out of SIMD registers.

Here’s our scalar vec3 definition:
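Something along these lines (the exact name is an assumption):

```cpp
// Three tightly packed floats on natural 4-byte alignment.
struct vec3
{
    float x, y, z;
};
```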

Here’s a function to normalize a single vector:
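A sketch of what it might look like (assuming the vec3 above; the function and variable names are mine):

```cpp
#include <math.h>

struct vec3 { float x, y, z; };  // as defined above

// Divide each component by the vector's length.
void normalize(vec3& v)
{
    float invLen = 1.0f / sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    v.x *= invLen;
    v.y *= invLen;
    v.z *= invLen;
}
```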

And here’s our naïve method of a scalar batch normalize:
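Presumably a plain loop over the input, something like this (names are again my own):

```cpp
#include <math.h>
#include <stddef.h>

struct vec3 { float x, y, z; };

// Naive batch: normalize one vector per iteration, all scalar math.
void batch_normalize(vec3* out, const vec3* in, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        float invLen = 1.0f / sqrtf(in[i].x * in[i].x +
                                    in[i].y * in[i].y +
                                    in[i].z * in[i].z);
        out[i].x = in[i].x * invLen;
        out[i].y = in[i].y * invLen;
        out[i].z = in[i].z * invLen;
    }
}
```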

We could convert this code to SIMD by loading the vec3’s into SIMD registers, but we have a few problems. First, our data isn’t aligned, so it’s going to be slow. Second, you can’t load a vec3 into a SIMD vec4 without reading either the X component of the next vector or invalid memory.

To fix the first issue, we know that the input data will be on the natural 4-byte alignment of a float. Since a vec3 is 12 bytes, a 16 byte aligned vec3 will occur within the first four vectors (12 floats). We’ll call these our “pre-pass” vectors, since these are the ones we’ll handle individually before the real batch processing begins.

To fix the second issue, we could just process the data differently. Rather than handling a single vector at a time, we can do four at a time. This also relates to alignment: if we process four vec3’s (48 bytes) at a time, we’re guaranteed to stay aligned. To determine how many we can do, the math is simply (count - prepass) & ~3. We’ll call this our “mid-pass” count.

Of course, we have to process the trailing vectors as well, but that’s simply whatever’s left over: count - (prepass + midpass).
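Putting the three counts together, the bookkeeping might look like this (my own derivation: a vec3 is 12 bytes, so from any float-aligned start the pre-pass is at most three vectors):

```cpp
#include <stddef.h>
#include <stdint.h>

// Split 'count' vec3's into pre-pass, mid-pass and post-pass counts.
// Assumes 'in' is at least 4-byte (float) aligned.
void split_passes(const void* in, size_t count,
                  size_t& prepass, size_t& midpass, size_t& postpass)
{
    prepass = ((uintptr_t)in & 15) >> 2;       // vec3's until 16-byte alignment
    if (prepass > count)
        prepass = count;
    midpass = (count - prepass) & ~(size_t)3;  // largest multiple of four
    postpass = count - (prepass + midpass);    // whatever's left over
}
```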

Before inserting SIMD, we have this code as a template to work with:
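A sketch of that template, still all scalar (the mid-pass body is where the SIMD will go; the names are mine):

```cpp
#include <math.h>
#include <stddef.h>
#include <stdint.h>

struct vec3 { float x, y, z; };

static void normalize_one(vec3& o, const vec3& v)
{
    float invLen = 1.0f / sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    o.x = v.x * invLen;
    o.y = v.y * invLen;
    o.z = v.z * invLen;
}

// Three passes: scalar pre-pass, four-at-a-time mid-pass, scalar post-pass.
void batch_normalize(vec3* out, const vec3* in, size_t count)
{
    size_t prepass = ((uintptr_t)in & 15) >> 2;
    if (prepass > count)
        prepass = count;
    size_t midpass = (count - prepass) & ~(size_t)3;

    size_t i = 0;
    for (; i < prepass; ++i)               // unaligned leading vectors
        normalize_one(out[i], in[i]);
    for (; i < prepass + midpass; i += 4)  // aligned here; SIMD goes here
        for (size_t j = 0; j < 4; ++j)
            normalize_one(out[i + j], in[i + j]);
    for (; i < count; ++i)                 // trailing vectors
        normalize_one(out[i], in[i]);
}
```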

Processing the vectors four at a time makes things a bit more interesting. We start by loading four vec3’s into three vec4’s:
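As intrinsics, that load step might look like this (names are mine; src must be 16-byte aligned):

```cpp
#include <xmmintrin.h>

// Load four tightly packed vec3's (twelve floats) as three aligned vec4's.
void load_4_vec3(const float* src, __m128& v0, __m128& v1, __m128& v2)
{
    v0 = _mm_load_ps(src + 0); // { x0, y0, z0, x1 }
    v1 = _mm_load_ps(src + 4); // { y1, z1, x2, y2 }
    v2 = _mm_load_ps(src + 8); // { z2, x3, y3, z3 }
}
```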

It looks like this in memory (I’ve colored the scalar vectors to make them easier to see):
[Image: 3vec_vec3]

To normalize a vector, we divide it by its length, which is sqrt(x*x + y*y + z*z). So our first SIMD operation is (x*x + y*y + z*z), but there’s a problem: that sum is per-vec3, and our first vec4 is filled with { x0, y0, z0, x1 } instead.

No worries! By processing them four at a time, we can make use of the w component to do some work for us as well. Rather than simply leaving the w component as zero or one, or worse, taking the time to zero it out, we use that slot for data in the next vector. We’re doing the same calculation later, so we may as well use the slot!

Multiply the vectors by themselves to get the squared values:
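A sketch of that step (one MULPS per register; names mine):

```cpp
#include <xmmintrin.h>

// Square every component in place.
void square3(__m128& v0, __m128& v1, __m128& v2)
{
    v0 = _mm_mul_ps(v0, v0); // { x0*x0, y0*y0, z0*z0, x1*x1 }
    v1 = _mm_mul_ps(v1, v1); // { y1*y1, z1*z1, x2*x2, y2*y2 }
    v2 = _mm_mul_ps(v2, v2); // { z2*z2, x3*x3, y3*y3, z3*z3 }
}
```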

In memory, the result of the multiply looks like this:
[Image: vec_square]

Using the shuffle instruction, we can move the vector around so we can sum the results in one pass. Because of our data organization, we have to do this in two steps:
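SHUFPS can only pull lanes from two source registers at a time, hence the two rounds. The exact shuffle masks below are my reconstruction, not necessarily the author’s:

```cpp
#include <xmmintrin.h>

// Gather the squared x, y and z terms so each output register holds one
// component for all four vectors.
void gather_terms(__m128 sq0, __m128 sq1, __m128 sq2,
                  __m128& xs, __m128& ys, __m128& zs)
{
    // First round: intermediates spanning register boundaries.
    __m128 t0 = _mm_shuffle_ps(sq1, sq2, _MM_SHUFFLE(2, 1, 3, 2)); // squared { x2, y2, x3, y3 }
    __m128 t1 = _mm_shuffle_ps(sq0, sq1, _MM_SHUFFLE(1, 0, 2, 1)); // squared { y0, z0, y1, z1 }

    // Second round: one register per component.
    xs = _mm_shuffle_ps(sq0, t0, _MM_SHUFFLE(2, 0, 3, 0)); // squared { x0, x1, x2, x3 }
    ys = _mm_shuffle_ps(t1, t0, _MM_SHUFFLE(3, 1, 2, 0));  // squared { y0, y1, y2, y3 }
    zs = _mm_shuffle_ps(t1, sq2, _MM_SHUFFLE(3, 0, 3, 1)); // squared { z0, z1, z2, z3 }
}
```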

[Image: vec_transpose_1]

Now we can easily sum all three of the vectors into one final squared length vector and calculate the reciprocal square root:

We now have a vector that contains the scale for the original data, but is, yet again, out of order. To fix it, we reshuffle the resulting vector out to three vectors that we can multiply against the original input:
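These three shuffles are again my reconstruction:

```cpp
#include <xmmintrin.h>

// Fan the four scales { s0, s1, s2, s3 } back out to match the packed
// input layout, so each lane lines up with its own vector's scale.
void spread_scales(__m128 s, __m128& scale0, __m128& scale1, __m128& scale2)
{
    scale0 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 0, 0)); // { s0, s0, s0, s1 }
    scale1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(2, 2, 1, 1)); // { s1, s1, s2, s2 }
    scale2 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(3, 3, 3, 2)); // { s2, s3, s3, s3 }
}
```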

[Image: vec_transpose_2]

Multiply the original input vectors by the scale, creating the normal vectors:
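One MULPS per register again (sketch, names mine):

```cpp
#include <xmmintrin.h>

// Multiply the original loads by the spread-out scales to produce the
// normalized components, still in packed { x, y, z, x } order.
void apply_scales(__m128& v0, __m128& v1, __m128& v2,
                  __m128 s0, __m128 s1, __m128 s2)
{
    v0 = _mm_mul_ps(v0, s0);
    v1 = _mm_mul_ps(v1, s1);
    v2 = _mm_mul_ps(v2, s2);
}
```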

And, finally, store the normals to the output buffer. Since we don’t know the alignment of the output, we have to do unaligned stores, which are very slow:
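Sketched with `_mm_storeu_ps` (MOVUPS), which tolerates any destination alignment:

```cpp
#include <xmmintrin.h>

// Write the twelve normalized floats back out, unaligned.
void store_4_vec3(float* dst, __m128 n0, __m128 n1, __m128 n2)
{
    _mm_storeu_ps(dst + 0, n0);
    _mm_storeu_ps(dst + 4, n1);
    _mm_storeu_ps(dst + 8, n2);
}
```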

Here’s the full code, ready for your copy/paste pleasure:
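Assembled from the steps above, with the same caveats (the names and shuffle masks are my reconstruction, not necessarily the author’s exact code):

```cpp
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

struct vec3 { float x, y, z; };

static void normalize_one(vec3& o, const vec3& v)
{
    float invLen = 1.0f / sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    o.x = v.x * invLen;
    o.y = v.y * invLen;
    o.z = v.z * invLen;
}

void batch_normalize(vec3* out, const vec3* in, size_t count)
{
    // Pre-pass: scalar until the input reaches a 16-byte boundary.
    size_t prepass = ((uintptr_t)in & 15) >> 2;
    if (prepass > count)
        prepass = count;
    size_t midpass = (count - prepass) & ~(size_t)3;

    size_t i = 0;
    for (; i < prepass; ++i)
        normalize_one(out[i], in[i]);

    // Mid-pass: four vec3's per iteration, aligned loads.
    for (; i < prepass + midpass; i += 4)
    {
        const float* src = &in[i].x;
        __m128 v0 = _mm_load_ps(src + 0); // { x0, y0, z0, x1 }
        __m128 v1 = _mm_load_ps(src + 4); // { y1, z1, x2, y2 }
        __m128 v2 = _mm_load_ps(src + 8); // { z2, x3, y3, z3 }

        __m128 sq0 = _mm_mul_ps(v0, v0);
        __m128 sq1 = _mm_mul_ps(v1, v1);
        __m128 sq2 = _mm_mul_ps(v2, v2);

        // Two rounds of shuffles gather x*x, y*y, z*z per lane.
        __m128 t0 = _mm_shuffle_ps(sq1, sq2, _MM_SHUFFLE(2, 1, 3, 2));
        __m128 t1 = _mm_shuffle_ps(sq0, sq1, _MM_SHUFFLE(1, 0, 2, 1));
        __m128 xs = _mm_shuffle_ps(sq0, t0, _MM_SHUFFLE(2, 0, 3, 0));
        __m128 ys = _mm_shuffle_ps(t1, t0, _MM_SHUFFLE(3, 1, 2, 0));
        __m128 zs = _mm_shuffle_ps(t1, sq2, _MM_SHUFFLE(3, 0, 3, 1));

        __m128 invLen = _mm_rsqrt_ps(_mm_add_ps(_mm_add_ps(xs, ys), zs));

        // Spread the four scales back out to the packed layout.
        __m128 s0 = _mm_shuffle_ps(invLen, invLen, _MM_SHUFFLE(1, 0, 0, 0));
        __m128 s1 = _mm_shuffle_ps(invLen, invLen, _MM_SHUFFLE(2, 2, 1, 1));
        __m128 s2 = _mm_shuffle_ps(invLen, invLen, _MM_SHUFFLE(3, 3, 3, 2));

        float* dst = &out[i].x;
        _mm_storeu_ps(dst + 0, _mm_mul_ps(v0, s0));
        _mm_storeu_ps(dst + 4, _mm_mul_ps(v1, s1));
        _mm_storeu_ps(dst + 8, _mm_mul_ps(v2, s2));
    }

    // Post-pass: scalar trailing vectors.
    for (; i < count; ++i)
        normalize_one(out[i], in[i]);
}
```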

As a first pass, that’s a pretty significant improvement. My perf testing in release (MS Visual Studio with /O2) reports these clock cycles per batch normalize (10,000 runs with 1,000 warm-up passes over 4,107 pre-generated vec3’s):

That’s not too shabby of a start! The SIMD version runs in about 18% of the total time of the original.

In fact, here’s the same run with /Ox optimizations turned on with the SIMD version running in about 19% of the scalar time:

And just to cover some more bases, here’s the debug run with optimizations disabled, running in about 34% of the scalar time:

Here’s where it gets really fun, though. The above version is somewhat naïve in that it doesn’t take into account data loads, pipelining, etc. The only pipelining being done is by the compiler and the CPU at runtime. We can definitely do better.

But that’s for the next post!
