Have you read Part 1?

As a recap, this series is about using SIMD (specifically, 128-bit SSE registers holding four floats) to optimize the batch normalization of vectors. While the end result may not be the most super-useful-awesomest-thing-in-the-world, it is a simple target for covering vec3 operations in SIMD and some optimization techniques.

This post is mostly about instruction pipelining, and it leans heavily on vector instruction timing data.

Instruction Pipeline

The super-mini introduction to instruction pipelines is this: each instruction takes time to complete, and you can issue a new instruction before the previous one has finished.

Assume, for a moment, that we have this super awesome computer that can do asynchronous operations. Among the commands we have available is an ADD command that takes two variables, adds them together, and returns the result. ADD takes 1 second to complete, but you can start as many of them as you like in that one second.

x0 = ADD(a, b);
x1 = ADD(c, d);
x2 = ADD(x0, x1);

Here, both ADD(a, b) and ADD(c, d) can be processed at the same time since they don’t depend on each other. The final ADD, however, relies on the outputs of both of the previous two, so it has to wait the full second for x0 and x1 before it can do its computation. This dependency is called a hazard, and the delay it causes is a pipeline stall or bubble.

When discussing pipelines, you’ll hear two terms a lot: Latency and Throughput. Latency means “how long does it take for the instruction to finish” and Throughput means “how many can I issue to the pipeline in one cycle.” One very good analogy uses pipes transporting water to illustrate the difference.
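
To make the two terms concrete, here’s a minimal sketch of my own (not from the series), assuming data is 16-byte aligned and count is a multiple of 8. Summing with a single accumulator forms one long dependency chain, so the loop is bound by latency; splitting the work into two independent accumulators lets the adds overlap, and the loop becomes bound by throughput instead:

#include <stdint.h>
#include <xmmintrin.h>

__m128 sumOneChain(const float * const data, const uint32_t count)
{
  __m128 sum = _mm_setzero_ps();
  for (uint32_t i = 0; i < count; i += 4)
  {
    // each add depends on the previous one, so we pay the full latency per add
    sum = _mm_add_ps(sum, _mm_load_ps(data + i));
  }
  return sum;
}

__m128 sumTwoChains(const float * const data, const uint32_t count)
{
  __m128 sum0 = _mm_setzero_ps();
  __m128 sum1 = _mm_setzero_ps();
  for (uint32_t i = 0; i < count; i += 8)
  {
    // the two chains are independent, so their adds can be in flight together
    sum0 = _mm_add_ps(sum0, _mm_load_ps(data + i));
    sum1 = _mm_add_ps(sum1, _mm_load_ps(data + i + 4));
  }
  return _mm_add_ps(sum0, sum1);
}

With an add that takes several cycles to complete, the first loop pays that full latency for every add, while the second is limited only by how quickly adds can be issued.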

The Code

I’ve slightly reorganized the code since the first post as we’ll be moving instructions around quite a bit.

The vec3 structure:

typedef struct _vec3
{
  float x;
  float y;
  float z;
} vec3;

The scalar normalize function:

inline void vec3Normalize(const vec3 * const in, vec3 * const out)
{
  const float len = sqrt(in->x * in->x + in->y * in->y + in->z * in->z);

  out->x = in->x / len;
  out->y = in->y / len;
  out->z = in->z / len;
}
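
As a quick sanity check of my own (the values aren’t from the post), normalizing the vector (3, 0, 4):

  vec3 v = { 3.0f, 0.0f, 4.0f };
  vec3 n;
  vec3Normalize(&v, &n);
  // len = sqrt(9 + 0 + 16) = 5, so n comes out as { 0.6f, 0.0f, 0.8f }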

The current batch normalization function:

void vec3BatchNormalizeSSE(const vec3 * const in,
                           const uint32_t count,
                           vec3 * const out)
{
  const uint32_t prepass = ((uintptr_t)in & 0xf) / 4;
  const uint32_t midpass = (count - prepass) & ~3;

  // process as many vectors as it takes to get to aligned data
  for (uint32_t i = 0; i < prepass; ++i)
  {
    vec3Normalize(&in[i], &out[i]);
  }

  for (uint32_t i = prepass; i < prepass + midpass; i += 4)
  {
    // load the vectors from the source data
    const float * const src = &in[i].x;
    const __m128 source_0 = _mm_load_ps(src + 0);
    const __m128 source_1 = _mm_load_ps(src + 4);
    const __m128 source_2 = _mm_load_ps(src + 8);

    // compute the square of each vector
    const __m128 square_0 = _mm_mul_ps(source_0, source_0);
    const __m128 square_1 = _mm_mul_ps(source_1, source_1);
    const __m128 square_2 = _mm_mul_ps(source_2, source_2);

    // prepare to add, transposing the data into place

    // first transpose
    //
    // x0, y0, z0, x1    x0, y0, y1, z1
    // y1, z1, x2, y2 => x2, y2, y3, z3
    // z2, x3, y3, z3    z0, x1, z2, x3
    const __m128 xpose1_0 = _mm_shuffle_ps(square_0, square_1, _MM_SHUFFLE(1, 0, 1, 0));
    const __m128 xpose1_1 = _mm_shuffle_ps(square_1, square_2, _MM_SHUFFLE(3, 2, 3, 2));
    const __m128 xpose1_2 = _mm_shuffle_ps(square_0, square_2, _MM_SHUFFLE(1, 0, 3, 2));

    // second transpose
    //
    // x0, y0, y1, z1    x0, y1, z2, x3
    // x2, y2, y3, z3 => y0, z1, y2, z3
    // z0, x1, z2, x3    z0, x1, x2, y3
    const __m128 xpose2_0 = _mm_shuffle_ps(xpose1_0, xpose1_2, _MM_SHUFFLE(3, 2, 2, 0));
    const __m128 xpose2_1 = _mm_shuffle_ps(xpose1_0, xpose1_1, _MM_SHUFFLE(3, 1, 3, 1));
    const __m128 xpose2_2 = _mm_shuffle_ps(xpose1_2, xpose1_1, _MM_SHUFFLE(2, 0, 1, 0));

    // sum the components to get the squared length
    const __m128 sum1 = _mm_add_ps(xpose2_1, xpose2_2);
    const __m128 lensq = _mm_add_ps(xpose2_0, sum1);

    // calculate the scale as a reciprocal sqrt
    const __m128 rcpsqrt = _mm_rsqrt_ps(lensq);

    // to apply it, we have to mangle it around again
    //               s0, s0, s0, s1
    // x, y, z, w => s1, s1, s2, s2
    //               s2, s3, s3, s3
    const __m128 scale_0 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(1, 0, 0, 0));
    const __m128 scale_1 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(2, 2, 1, 1));
    const __m128 scale_2 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(3, 3, 3, 2));

    // multiply the original vector by the scale, completing the normalize
    const __m128 norm_0 = _mm_mul_ps(source_0, scale_0);
    const __m128 norm_1 = _mm_mul_ps(source_1, scale_1);
    const __m128 norm_2 = _mm_mul_ps(source_2, scale_2);

    // store the result into the output array (unaligned data)
    float * const dst = &out[i].x;
    _mm_storeu_ps(dst + 0, norm_0);
    _mm_storeu_ps(dst + 4, norm_1);
    _mm_storeu_ps(dst + 8, norm_2);
  }

  // process any leftover vectors individually
  for (uint32_t i = prepass + midpass; i < count; ++i)
  {
    vec3Normalize(&in[i], &out[i]);
  }
}
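
To make the alignment bookkeeping concrete, here’s a small worked example of my own (the address and count are hypothetical, not from the post):

// Suppose 'in' sits at an address whose low nibble is 0x8 and count == 11:
//
//   prepass  = (0x8 & 0xf) / 4 = 2   // two leading vec3s go through the scalar path
//   2 vec3s  = 24 bytes, and 0x8 + 24 = 0x20, so the pointer is now 16-byte aligned
//   midpass  = (11 - 2) & ~3   = 8   // eight vec3s for the SSE loop
//   leftover = 11 - 2 - 8      = 1   // one vec3 for the scalar cleanup loop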

When optimizing at this low a level, you need to be aware of your target CPU. Just because a change is faster on the CPU you’re developing on doesn’t mean it’ll be faster on others. The CPU I’m optimizing on is an AMD Turion 64 from the AMD K8 series. According to Agner Fog’s instruction tables, these are the timings we need to look at:

Intrinsic      Instruction  Latency  Throughput
_mm_load_ps    MOVAPS       ???      1
_mm_storeu_ps  MOVUPS       ???      0.5
_mm_shuffle_ps SHUFPS       3        0.5
_mm_add_ps     ADDPS        4        0.5
_mm_mul_ps     MULPS        4        0.5
_mm_rsqrt_ps   RSQRTPS      3        0.5

Note that the instructions marked with ‘???’ don’t currently have measurements in the tables. Using the definitions above, ADDPS’s latency of 4 and throughput of 0.5 mean that a chain of dependent adds pays four cycles per add, while even independent adds can only be issued once every two cycles.

Looking For Change

Looking through the code, the first thing that stands out is that the initial three loads are used immediately. This leaves no time for the instructions to finish (a latency problem) but isn’t a problem for how fast we issue them (throughput):

    const float * const src = &in[i].x;
    const __m128 source_0 = _mm_load_ps(src + 0);
    const __m128 source_1 = _mm_load_ps(src + 4);
    const __m128 source_2 = _mm_load_ps(src + 8);

At the end of the loop, however, is where we store the result back out to the destination pointer. These stores, MOVUPS, have a throughput of 0.5, which means only one MOVUPS can be issued every two cycles.

We can alleviate both problems by loading the data for the next iteration at the end of the loop, interleaved with the stores. This gives the loads more time to complete before their results are needed and fills the gaps between the stores with useful work:

  __m128 active_0;
  __m128 active_1;
  __m128 active_2;

  {
    const float * const src = &in[prepass].x;
    active_0 = _mm_load_ps(src + 0);
    active_1 = _mm_load_ps(src + 4);
    active_2 = _mm_load_ps(src + 8);
  }

  for (uint32_t i = prepass; i < prepass + midpass; i += 4)
  {
    // load the vectors from the source data
//    const float * const src = &in[i].x;
//    const __m128 source_0 = _mm_load_ps(src + 0);
//    const __m128 source_1 = _mm_load_ps(src + 4);
//    const __m128 source_2 = _mm_load_ps(src + 8);
    // ... squares, transposes, rsqrt, and scale as before, now reading from
    // active_0/1/2 instead of source_0/1/2 ...

    // store the result into the output array (unaligned data),
    // interleaved with loading the next iteration's source vectors
    const float * const src = &in[i].x;
    float * const dst = &out[i].x;
    _mm_storeu_ps(dst + 0, norm_0);
    active_0 = _mm_load_ps(src + 12);
    _mm_storeu_ps(dst + 4, norm_1);
    active_1 = _mm_load_ps(src + 16);
    _mm_storeu_ps(dst + 8, norm_2);
    active_2 = _mm_load_ps(src + 20);
  }

The end result:

                Name       Total    Shortest     Longest     Average     Percent of Original
           first sse   1,204,622         115     225,868         120      17.65%
         load change   1,108,781         106      64,499         110      16.24%

That’s a pretty good improvement for a fairly simple change.

Shuffle And Move

The next thing we can look at is replacing some of these commands with more appropriate versions. A couple of suspects jump out pretty quickly:

    const __m128 xpose1_0 = _mm_shuffle_ps(square_0, square_1, _MM_SHUFFLE(1, 0, 1, 0));
    const __m128 xpose1_1 = _mm_shuffle_ps(square_1, square_2, _MM_SHUFFLE(3, 2, 3, 2));

Any time I see patterns like this, I’m suspicious. As it turns out, shuffling with 0101 or 2323 is equivalent to _mm_movelh_ps and _mm_movehl_ps, respectively. Agner Fog’s instruction tables list a latency of 2 and a throughput of 2 for both instructions. If we drop them in and try them out, the result is slightly faster. And when I say slightly, I mean it: the new version runs in 99% of the time of the previous one, or about one clock per iteration.
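
Here’s a small sketch of my own (not from the post) showing the equivalence, with a = { a0, a1, a2, a3 } and b = { b0, b1, b2, b3 }; note the swapped argument order on _mm_movehl_ps:

#include <xmmintrin.h>

void shuffleVersusMove(const __m128 a, const __m128 b)
{
  // both produce { a0, a1, b0, b1 }
  const __m128 lo_shuf = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 1, 0));
  const __m128 lo_move = _mm_movelh_ps(a, b);

  // both produce { a2, a3, b2, b3 }
  const __m128 hi_shuf = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 3, 2));
  const __m128 hi_move = _mm_movehl_ps(b, a);

  (void)lo_shuf; (void)lo_move;
  (void)hi_shuf; (void)hi_move;
}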

Rather than just dropping those in, I looked at the whole transpose in the light of using MOVLHPS and MOVHLPS. It turns out there’s a better change to be had, replacing this:

    // prepare to add, transposing the data into place
 
    // first transpose
    //
    // x0, y0, z0, x1    x0, y0, y1, z1
    // y1, z1, x2, y2 => x2, y2, y3, z3
    // z2, x3, y3, z3    z0, x1, z2, x3
    const __m128 xpose1_0 = _mm_shuffle_ps(square_0, square_1, _MM_SHUFFLE(1, 0, 1, 0));
    const __m128 xpose1_1 = _mm_shuffle_ps(square_1, square_2, _MM_SHUFFLE(3, 2, 3, 2));
    const __m128 xpose1_2 = _mm_shuffle_ps(square_0, square_2, _MM_SHUFFLE(1, 0, 3, 2));
 
    // second transpose
    //
    // x0, y0, y1, z1    x0, y1, z2, x3
    // x2, y2, y3, z3 => y0, z1, y2, z3
    // z0, x1, z2, x3    z0, x1, x2, y3
    const __m128 xpose2_0 = _mm_shuffle_ps(xpose1_0, xpose1_2, _MM_SHUFFLE(3, 2, 2, 0));
    const __m128 xpose2_1 = _mm_shuffle_ps(xpose1_0, xpose1_1, _MM_SHUFFLE(3, 1, 3, 1));
    const __m128 xpose2_2 = _mm_shuffle_ps(xpose1_2, xpose1_1, _MM_SHUFFLE(2, 0, 1, 0));
 
    // sum the components to get the squared length
    const __m128 sum1 = _mm_add_ps(xpose2_1, xpose2_2);
    const __m128 lensq = _mm_add_ps(xpose2_0, sum1);

To this:

    // prepare to add, transposing the data into place

    // first transpose
    //
    // 0: x0 y0 z0 x1    x0 y0 y1 z1
    // 1: y1 z1 x2 y2 => z0 x1 x2 y2
    // 2: z2 x3 y3 z3    z2 x3 y3 z3
    const __m128 xpose1_0 = _mm_movelh_ps(square_0, square_1);
    const __m128 xpose1_1 = _mm_movehl_ps(square_1, square_0);
    const __m128 xpose1_2 = square_2;

    // second transpose
    //
    // 0: x0 y0 y1 z1    x0 y0 y1 z1
    // 1: z0 x1 x2 y2 => z0 x1 z2 x3
    // 2: z2 x3 y3 z3    x2 y2 y3 z3
    const __m128 xpose2_0 = xpose1_0;
    const __m128 xpose2_1 = _mm_movelh_ps(xpose1_1, xpose1_2);
    const __m128 xpose2_2 = _mm_movehl_ps(xpose1_2, xpose1_1);

    // third transpose
    // 0: x0 y0 y1 z1    x0 y1 x2 y3
    // 1: z0 x1 z2 x3 => z0 x1 z2 x3
    // 2: x2 y2 y3 z3    y0 z1 y2 z3
    const __m128 xpose3_0 = _mm_shuffle_ps(xpose2_0, xpose2_2, _MM_SHUFFLE(2, 0, 2, 0));
    const __m128 xpose3_1 = xpose2_1;
    const __m128 xpose3_2 = _mm_shuffle_ps(xpose2_0, xpose2_2, _MM_SHUFFLE(3, 1, 3, 1));

    // sum the components to get the squared length
    const __m128 sum1 = _mm_add_ps(xpose3_0, xpose3_1);
    const __m128 lensq = _mm_add_ps(xpose3_2, sum1);

Reading down a single lane confirms the transpose: lane 0 of the three summed rows holds the squares of x0, z0, and y0, which add up to the squared length of the first vector, and likewise for the other lanes. I left in the pass-through rows (e.g., xpose3_1 = xpose2_1) just to improve clarity; a release build compiles them away since they’re not really necessary.

This change makes a pretty decent difference:

                Name       Total    Shortest     Longest     Average     Percent of Original
         load change   1,108,781         106      64,499         110      16.24%
          move hl/lh   1,008,740          97     120,744         100      14.78%

Packing It In

There are still many repeated instructions whose throughput is 0.5. If we could interleave different operations, we could save nearly one cycle per instruction. Unfortunately, the data at each step depends almost entirely on the step before it, and where it doesn’t, interleaving would likely cause more harm than good because of latency.

One option we could consider is processing 8 vectors at a time instead of 4. This would let us mix the instructions better, decreasing the time we spend stalled on throughput. But there’s a problem: we only have 16 SIMD registers. By processing 8 vec3’s at a time, we’d need at least 18 registers. When more registers are needed than are available, values get spilled to memory, which is very slow. I’m not entirely ruling this out yet; I’m sure the code could be rearranged so it doesn’t use 18 registers, but for now the lost readability is too much. I’ll put this method in my back pocket for later.

The End?

Let’s recap the progression of each version tried:

                Name       Total    Shortest     Longest     Average     Percent of Original
              scalar   6,826,228         658     234,622         682     100.00%
           first sse   1,204,622         115     225,868         120      17.65%
         load change   1,108,781         106      64,499         110      16.24%
          move hl/lh   1,008,740          97     120,744         100      14.78%

It’s not the fastest it could be, but at some point you have to weigh the gain against the cost. Honestly, if I weren’t doing this for fun, I probably would have stopped at “first sse” unless this were a critical loop on a well-known platform. Reducing the time by that much would usually put another offender at the top of the profile, changing the optimization target.

Overall, I’m pretty happy with the result. It satisfies the original constraints: it doesn’t require or change data alignment, the function signature is identical to the original so no other code needs refactoring, and the change is invisible to most applications. We did lose some precision by using the approximate reciprocal square root instead of the scalar sqrt and divide, but it’s within what I would consider a tolerable difference.
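
If the approximation ever became a problem, a common fix (not used here) is a single Newton-Raphson refinement of the RSQRTPS estimate, at the cost of a few extra multiplies and a subtract per iteration. A rough sketch:

#include <xmmintrin.h>

// one Newton-Raphson step: y1 = 0.5 * y0 * (3 - lensq * y0 * y0),
// where y0 is the RSQRTPS approximation of 1/sqrt(lensq)
inline __m128 rsqrtRefined(const __m128 lensq)
{
  const __m128 half  = _mm_set1_ps(0.5f);
  const __m128 three = _mm_set1_ps(3.0f);
  const __m128 y0    = _mm_rsqrt_ps(lensq);
  const __m128 y0sq  = _mm_mul_ps(y0, y0);
  return _mm_mul_ps(_mm_mul_ps(half, y0),
                    _mm_sub_ps(three, _mm_mul_ps(lensq, y0sq)));
}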

Here’s the final code:

typedef struct _vec3
{
  float x;
  float y;
  float z;
} vec3;

inline void vec3Normalize(const vec3 * const in, vec3 * const out)
{
  const float len = sqrt(in->x * in->x + in->y * in->y + in->z * in->z);

  out->x = in->x / len;
  out->y = in->y / len;
  out->z = in->z / len;
}

void vec3BatchNormalizeSSE(const vec3 * const in,
                           const uint32_t count,
                           vec3 * const out)
{
  const uint32_t prepass = ((uintptr_t)in & 0xf) / 4;
  const uint32_t midpass = (count - prepass) & ~3;

  // process as many vectors as it takes to get to aligned data
  for (uint32_t i = 0; i < prepass; ++i)
  {
    vec3Normalize(&in[i], &out[i]);
  }

  __m128 active_0;
  __m128 active_1;
  __m128 active_2;

  {
    const float * const src = &in[prepass].x;
    active_0 = _mm_load_ps(src + 0);
    active_1 = _mm_load_ps(src + 4);
    active_2 = _mm_load_ps(src + 8);
  }

  for (uint32_t i = prepass; i < prepass + midpass; i += 4)
  {
    // compute the square of each vector
    const __m128 square_0 = _mm_mul_ps(active_0, active_0);
    const __m128 square_1 = _mm_mul_ps(active_1, active_1);
    const __m128 square_2 = _mm_mul_ps(active_2, active_2);

    // prepare to add, transposing the data into place

    // first transpose
    //
    // 0: x0 y0 z0 x1    x0 y0 y1 z1
    // 1: y1 z1 x2 y2 => z0 x1 x2 y2
    // 2: z2 x3 y3 z3    z2 x3 y3 z3
    const __m128 xpose1_0 = _mm_movelh_ps(square_0, square_1);
    const __m128 xpose1_1 = _mm_movehl_ps(square_1, square_0);

    // second transpose
    //
    // 0: x0 y0 y1 z1    x0 y0 y1 z1
    // 1: z0 x1 x2 y2 => z0 x1 z2 x3
    // 2: z2 x3 y3 z3    x2 y2 y3 z3
    const __m128 xpose2_1 = _mm_movelh_ps(xpose1_1, square_2);
    const __m128 xpose2_2 = _mm_movehl_ps(square_2, xpose1_1);

    // third transpose
    // 0: x0 y0 y1 z1    x0 y1 x2 y3
    // 1: z0 x1 z2 x3 => z0 x1 z2 x3
    // 2: x2 y2 y3 z3    y0 z1 y2 z3
    const __m128 xpose3_0 = _mm_shuffle_ps(xpose1_0, xpose2_2, _MM_SHUFFLE(2, 0, 2, 0));
    const __m128 xpose3_2 = _mm_shuffle_ps(xpose1_0, xpose2_2, _MM_SHUFFLE(3, 1, 3, 1));

    // sum the components to get the squared length
    const __m128 sum1 = _mm_add_ps(xpose3_0, xpose2_1);
    const __m128 lensq = _mm_add_ps(xpose3_2, sum1);

    // calculate the scale as a reciprocal sqrt
    const __m128 rcpsqrt = _mm_rsqrt_ps(lensq);

    // to apply it, we have to mangle it around again
    //               s0, s0, s0, s1
    // x, y, z, w => s1, s1, s2, s2
    //               s2, s3, s3, s3
    const __m128 scale_0 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(1, 0, 0, 0));
    const __m128 scale_1 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(2, 2, 1, 1));
    const __m128 scale_2 = _mm_shuffle_ps(rcpsqrt, rcpsqrt, _MM_SHUFFLE(3, 3, 3, 2));

    // multiply the original vector by the scale, completing the normalize
    const __m128 norm_0 = _mm_mul_ps(active_0, scale_0);
    const __m128 norm_1 = _mm_mul_ps(active_1, scale_1);
    const __m128 norm_2 = _mm_mul_ps(active_2, scale_2);

    // store the result into the output array (unaligned data),
    // interleaved with loading the next iteration's source vectors
    const float * const src = &in[i].x;
    float * const dst = &out[i].x;
    _mm_storeu_ps(dst + 0, norm_0);
    active_0 = _mm_load_ps(src + 12);
    _mm_storeu_ps(dst + 4, norm_1);
    active_1 = _mm_load_ps(src + 16);
    _mm_storeu_ps(dst + 8, norm_2);
    active_2 = _mm_load_ps(src + 20);
  }

  // process any leftover vectors individually
  for (uint32_t i = prepass + midpass; i < count; ++i)
  {
    vec3Normalize(&in[i], &out[i]);
  }
}
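
And, for completeness, a minimal usage sketch of my own (the array size and fill values are made up, not from the post):

#include <stdint.h>
#include <stdlib.h>

int main(void)
{
  const uint32_t count = 1000;
  vec3 * const in  = malloc(count * sizeof(vec3));
  vec3 * const out = malloc(count * sizeof(vec3));

  // fill the input with something non-degenerate
  for (uint32_t i = 0; i < count; ++i)
  {
    in[i].x = (float)i + 1.0f;
    in[i].y = (float)i + 2.0f;
    in[i].z = (float)i + 3.0f;
  }

  vec3BatchNormalizeSSE(in, count, out);

  free(out);
  free(in);
  return 0;
}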