I did a quick test to see what the costs of different methods of function calling are. Lots of people say “Soandso is more expensive” but I rarely see anyone quantify what “more expensive” means. These tests cover only the call itself; any other overhead, such as stack modification for arguments, is ignored.

The test code is in C++ and is compiled with Microsoft Visual C++ 2010 Express using the default console project settings, in release mode with /O2 (Maximize Speed).

In my tests, the code is split into main.cpp, test_impl.h and test_impl.cpp to enforce separation. I found that some functions were inlined even with __declspec(noinline) attached to them. To keep the functions from being optimized away, I have a global ‘int test_value’ that each function simply increments.
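A minimal sketch of that setup might look like the following. Only test_value comes from the description above; NOINLINE, plain_function and run_plain_calls are made-up names for illustration, and everything is shown in one file for brevity even though the real layout keeps the definitions in test_impl.cpp.

```cpp
// Sketch only: every name except test_value is hypothetical.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Global side effect so the optimizer cannot throw the function bodies away.
int test_value = 0;

// In the layout described above this would be defined in test_impl.cpp
// and declared in test_impl.h.
NOINLINE void plain_function() {
    ++test_value;
}

// Calls plain_function n times and returns the resulting counter.
int run_plain_calls(int n) {
    test_value = 0;
    for (int i = 0; i < n; ++i) {
        plain_function();
    }
    return test_value;
}
```

Checking the return value of run_plain_calls is a cheap way to confirm the calls really happened.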

If you’re interested in the actual timings and cost of the call, see Agner Fog’s excellent instruction tables document. I reference them for the AMD K8 processor.

Here’s the summary of the code:

The first test was a single function call and generated the expected assembly, simply calling the mangled function name. Assuming I’m understanding the CALL instruction, it is 16-22 macro ops and 23-32 cycles of latency.

Next is the function pointer, which issues the call on a memory address. This is slightly more expensive at 16-22 macro ops and 24-33 cycles of latency:
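As a rough illustration of the function pointer case, here is a small sketch. The names test_fn, increment and call_through_pointer are hypothetical, and an optimizer is free to turn this into a direct call when it can see the target, which is one reason to keep the definitions in a separate translation unit.

```cpp
// Sketch only: names other than test_value are hypothetical.
int test_value = 0;

void increment() {
    ++test_value;
}

// Pointer-to-function type matching increment().
typedef void (*test_fn)();

// Calls through the pointer n times; each call reads the target
// address from a variable instead of using a fixed address.
int call_through_pointer(test_fn fn, int n) {
    test_value = 0;
    for (int i = 0; i < n; ++i) {
        fn();
    }
    return test_value;
}
```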

Next, the standard class member call, which is identical to the normal function call but with more mangling to identify the class name. It also has to pass the hidden ‘this’ pointer (with MSVC’s default thiscall convention this goes in the ECX register rather than on the stack), so even though the call is the same, the overall cost may not be:
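To make the hidden ‘this’ parameter concrete, here is a sketch that puts a non-virtual member function next to a hand-written free function taking the object explicitly. Tester, tester_increment and run_member_demo are hypothetical names; the point is only that both forms are a direct call plus one extra pointer argument.

```cpp
// Sketch only: names other than test_value are hypothetical.
int test_value = 0;

class Tester {
public:
    void increment();  // non-virtual: the call target is known at compile time
};

void Tester::increment() {
    ++test_value;
}

// Roughly what the member function looks like with 'this' spelled out.
void tester_increment(Tester* self) {
    (void)self;  // unused here, but it still has to be passed
    ++test_value;
}

int run_member_demo() {
    Tester t;
    test_value = 0;
    t.increment();         // direct call, &t passed implicitly
    tester_increment(&t);  // direct call, &t passed explicitly
    return test_value;
}
```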

And now the pure abstract virtual class:

As it turns out, calling a virtual function is significantly more expensive, even assuming there are no cache misses. The reason is that it can’t simply call the function. It has to load the object pointer, load the vtable pointer from the object, load the function address from the vtable, and finally perform the indirect call.
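The steps above can be sketched with a small abstract class. TestInterface, TestImpl and the helper functions are hypothetical names, and the comments describe what a typical compiler does for the call rather than anything the C++ standard guarantees. In a toy like this the optimizer may also devirtualize the call if it can prove the dynamic type.

```cpp
// Sketch only: names other than test_value are hypothetical.
int test_value = 0;

struct TestInterface {
    virtual ~TestInterface() {}
    virtual void call() = 0;  // pure virtual: no body in the interface
};

struct TestImpl : TestInterface {
    virtual void call() { ++test_value; }
};

int call_virtual(TestInterface* obj, int n) {
    test_value = 0;
    for (int i = 0; i < n; ++i) {
        // Typical lowering:
        //   1. load obj (the object pointer)
        //   2. load the vtable pointer stored inside *obj
        //   3. load the function address from the vtable slot
        //   4. indirect call through that address
        obj->call();
    }
    return test_value;
}

int run_virtual_demo(int n) {
    TestImpl impl;
    return call_virtual(&impl, n);
}
```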

Assuming the AMD K8 processor and no cache misses, each of the three mov loads a virtual call needs has 3 cycles of latency. That adds up to roughly 9 extra cycles per call, or about 28% to 39% more time than the 23-32 cycles quoted for the plain call.
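The arithmetic behind that estimate is simple enough to write down. This sketch only reuses the figures quoted in this post (three loads at 3 cycles each, against a 23-32 cycle call); the constant and function names are made up.

```cpp
// Figures are the ones quoted in the text, not new measurements.
const int kLoadsPerVirtualCall = 3;  // object, vtable, function address
const int kMovLatencyCycles = 3;     // per mov, as quoted for the K8

// Extra latency a virtual call adds on top of the call itself.
int extra_cycles() {
    return kLoadsPerVirtualCall * kMovLatencyCycles;
}

// Overhead as a fraction of a given base call latency.
double relative_overhead(int base_call_cycles) {
    return static_cast<double>(extra_cycles()) / base_call_cycles;
}
```

With these inputs, relative_overhead(32) comes out to about 0.28 and relative_overhead(23) to about 0.39.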

The vtable (and its calling cost) can be represented in C as below. The object is a pointer to a struct containing an array of function pointers:
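One possible reading of that layout, written as C-style code that also compiles as C++, is sketched here. The names method_fn, CObject, call_method and the two methods are hypothetical; an integer index picks the slot, which is what forces the load from the array before the call.

```cpp
// Sketch only: names other than test_value are hypothetical.
int test_value = 0;

// Every "method" takes the object as an explicit first argument.
typedef void (*method_fn)(void* self);

static void method_add_one(void*) { test_value += 1; }
static void method_add_two(void*) { test_value += 2; }

// The object is just a struct holding an array of function pointers.
struct CObject {
    method_fn methods[2];
};

// Load the object pointer, load the slot, then call indirectly.
void call_method(CObject* object, int index) {
    object->methods[index](object);
}

int run_array_demo() {
    CObject object = { { method_add_one, method_add_two } };
    test_value = 0;
    call_method(&object, 0);
    call_method(&object, 1);
    return test_value;
}
```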

Calling it with an integer (I reused test_value) to select the function looked like this:

And resulted in this assembly:

An alternative implementation is the traditional C object which ditches the array in favor of named function pointers.
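A sketch of that style, again with hypothetical names (NamedObject, increment, reset), could look like this. Each operation gets its own named member, so no index is involved and the call is an ordinary call through a function pointer.

```cpp
// Sketch only: names other than test_value are hypothetical.
int test_value = 0;

struct NamedObject;
typedef void (*object_fn)(NamedObject* self);

// Named function pointers instead of an indexed array.
struct NamedObject {
    object_fn increment;
    object_fn reset;
};

static void do_increment(NamedObject*) { ++test_value; }
static void do_reset(NamedObject*) { test_value = 0; }

int run_named_demo() {
    NamedObject object = { do_increment, do_reset };
    object.reset(&object);
    object.increment(&object);
    object.increment(&object);
    return test_value;
}
```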

This implementation resulted in function pointer calling:

I’ll keep playing around with it. 🙂
