Optimizing C++ Compilation: The Trouble With Templates

tl;dr: Results!

Intro
Baseline Compile Times
Reducing Include Cost
Identifying Compilation Cost
It’s a Problem of Scale
Forward Declaration
Determining What’s Expensive
Deduplicating Duplicitous Duplication
Enter The Dragon
The Final Code?
Results!
Oddities, Errors, and Errata

Intro

Templates are one of the most ubiquitous features of C++, allowing programmers to write one piece of code that can be used for multiple types.

But, wow, are they slow to compile.

With templates, the header may only be included once, but the implementation is compiled for every template argument combination for every translation unit. Expensive templates can really rack up compilation time.

Below is a small template vector class. It’s short on features, complexity, and testing, but it’s decent for optimizing. If anything, it’s a bit too simple. Most of this is summarized in C++ Compilation: Fixing It, specifically the section on templates.

#ifndef TEMPLATE_VECTOR_H
#define TEMPLATE_VECTOR_H

#include <algorithm>
#include <cassert>

namespace unstd {
    template <typename _TYPE_>
    class vector {
        public:
            vector() = default;
            ~vector();

            vector(const vector & other) = delete;
            vector(vector && other) = delete;

            vector& operator=(const vector &) = delete;
            vector& operator=(vector &&) = delete;

            const _TYPE_ & operator[](const size_t index) const;
            _TYPE_ & operator[](const size_t index);

            void resize(const size_t count);

            void push(const _TYPE_ &);
            void push(_TYPE_ &&);

            bool pop(_TYPE_ * const value);

            void erase(const size_t index);

            const _TYPE_ * begin() const { return m_data; }
            const _TYPE_ * end() const { return m_data + m_size; }

            const _TYPE_ * data() const { return m_data; }

            size_t capacity() const { return m_capacity; }
            size_t size() const { return m_size; }

        private:
            void preallocate(const size_t count);

            size_t m_capacity = 0;
            size_t m_size = 0;
            _TYPE_ * m_data = nullptr;
    };

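    template<typename _TYPE_>
    vector<_TYPE_>::~vector() {
        if (m_data != nullptr) {
            for (size_t i = 0; i < m_size; i++) {
                m_data[i].~_TYPE_();
            }
            free(m_data);
        }
    }

    template<typename _TYPE_>
    const _TYPE_ & vector<_TYPE_>::operator[](const size_t index) const {
        assert(index < m_size);
        return m_data[index];
    }

    template<typename _TYPE_>
    _TYPE_ & vector<_TYPE_>::operator[](const size_t index) {
        if (index >= m_size) {
            resize(index + 1);
        }
        return m_data[index];
    }

    template<typename _TYPE_>
    void vector<_TYPE_>::resize(const size_t count) {
        preallocate(count);
        for (size_t i = m_size; i < count; i++) {
            new (&m_data[i]) _TYPE_();
        }
        m_size = count;
    }

    template<typename _TYPE_>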
    void vector<_TYPE_>::push(const _TYPE_ & value) {
        preallocate(m_size + 1);
        new (&m_data[m_size++]) _TYPE_(value);
    }

    template<typename _TYPE_>
    void vector<_TYPE_>::push(_TYPE_ && value) {
        preallocate(m_size + 1);
        new (&m_data[m_size++]) _TYPE_(std::move(value));
    }

    template<typename _TYPE_>
    bool vector<_TYPE_>::pop(_TYPE_ * const value) {
        if (m_size > 0) {
            const size_t index = --m_size;
            if (value != nullptr) {
                *value = std::move(m_data[index]);
            }
            m_data[index].~_TYPE_();
            return true;
        }
        return false;
    }

    template<typename _TYPE_>
    void vector<_TYPE_>::erase(const size_t index) {
        for ( size_t i = index; i + 1 < m_size; i++ ) {
           m_data[i] = std::move(m_data[i+1]);
        }
        m_data[--m_size].~_TYPE_();
    }

    template<typename _TYPE_>
    void vector<_TYPE_>::preallocate(const size_t count) {
        if (count > m_capacity) {
            _TYPE_ * const data = static_cast<_TYPE_*>(malloc(sizeof(_TYPE_) * count));
            const size_t end = m_size;
            m_size = std::min(m_size, count);
            for (size_t i = 0; i < m_size; i++) {
                new (&data[i]) _TYPE_(std::move(m_data[i]));
            }
            for (size_t i = 0; i < end; i++) {
                m_data[i].~_TYPE_();
            }
            free(m_data);
            m_data = data;
            m_capacity = count;
        }
    }
} // unstd

#endif // TEMPLATE_VECTOR_H

We're going to put an equally small thing into the template:

    class thing {
        public:
        thing() { m_state = "ctor"; }
        ~thing() { m_state = "dtor"; }

        explicit thing( const char * const value ) { m_state = "ctor2"; m_value = value; }

        thing(const thing &) { m_state = "cctor"; }
        thing(thing&& other) { m_state = "mctor-target"; other.m_state = "mctor-victim"; }

        thing& operator=(const thing&) { m_state = "assign"; return *this; }
        thing& operator=(thing&& other) { m_state = "massign-target"; other.m_state="m_assign-victim"; return *this; }

        const char * state() const { return m_state; }

        private:
            const char * m_state = "initialized";
            const char * m_value = "";
    };

Baseline Compile Times

Compile time results were generated using the same testing framework from C++ Compilation: Lies, Damned Lies, and Statistics. Most of this post specifically discusses Microsoft Visual Studio 2017, but Visual Studio 2019 (Preview) and clang were used as well. Since results for Visual Studio 2017 and 2019 were similar, I lumped them all under "Visual Studio." Visual Studio 2015 doesn't support the reportTime flag, so no comparison was done.

The test was done using two pieces of code; one for the singular test, used to gather simple statistics, and one that generates a ton of vector/type combinations.

The template type class is designed to be simple and lightweight. It's in a define so multiple unique versions can be generated:

#define thing_class( num )\
    class thing_##num {\
        public:\
        thing_ ##num() { m_state = "ctor " #num; }\
        ~thing_ ##num() { m_state = "dtor " #num; }\
        explicit thing_ ##num( const char * const value ) { m_state = "ctor2 " #num; m_value = value; }\
        thing_ ##num(const thing_ ##num &) { m_state = "cctor "#num; }\
        thing_ ##num(thing_ ##num && other) { m_state = "mctor-target "#num; other.m_state = "mctor-victim "#num; }\
        thing_ ##num & operator=(const thing_ ##num &) { m_state = "assign "#num; return *this; }\
        thing_ ##num & operator=(thing_ ##num && other) { m_state = "massign-target "#num; other.m_state="m_assign-victim "#num; return *this; }\
        const char * state() const { return m_state; }\
        private:\
            const char * m_state = "initialized "#num;\
            const char * m_value = "";\
    }

The code isn't designed to test accuracy or speed, but rather coverage and compilation times.

#define thing_test( num )\
{\
    typedef class thing_ ##num thing_type;\
    unstd::vector< thing_type > v;\
    size_t sz;\
    thing_type value;\
    bool ok;\
    v.resize(1);\
    sz = v.size();\
    sz = v.capacity();\
    v[0] = thing_type("a");\
    v[1] = std::move(thing_type("a"));\
    v.push(std::move(thing_type("b")));\
    v.push(thing_type("c"));\
    v.push(value);\
    v.erase(1);\
    ok = v.pop(&value);\
    ok = v.pop(&value);\
    ok = v.pop(nullptr);\
    for ( const thing_type * n = v.begin(); n != v.end(); n++ ) {\
        if ( n == v.data() ) {\
            printf( "Hello world!\n" );\
        }\
    }\
}

The test itself is then built via more defines (truncated for space):

thing_class(00);
// ... snip ...
thing_class(59);

extern "C" void vector_test() {
    thing_test(00);
// ... snip ...
    thing_test(59);
}

Finally, the initial test timing, averaged over 128 iterations:

Visual Studio	0.5717545469 seconds
clang	1.354978094 seconds

Reducing Include Cost

With Visual Studio 2017 (and Visual Studio 2019 Preview), /d1reportTime reports the compile time of the unstd::vector header as:

c:\users\random\desktop\compile_timer\00_template_vector.h: 0.233792s

The header include time seems excessive compared to what's actually in it. Checking the include times, the time to include <algorithm> alone is about 0.229242 seconds with <cassert> being an additional 0.000324 seconds, making my code about 0.004226 seconds.

Ideally, we'd want to remove as many includes as we can through forward declares and other means. The <algorithm> include is required for std::move and std::min. While we could replace std::min, std::move is a bit more involved. The <cassert> header can be removed if we remove the assert, but it's an expense I'm willing to pay for.
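To show why replacing std::move is "a bit more involved" than replacing std::min, here is a minimal sketch of a hand-rolled equivalent. It is not something this post's code actually does; the `sketch` namespace, the `tracer` type, and `hand_rolled_move_works` are hypothetical names used only for illustration.

```cpp
#include <cassert>

namespace sketch {
    // Strip reference qualifiers from a type (the job std::remove_reference does).
    template <typename T> struct remove_reference       { typedef T type; };
    template <typename T> struct remove_reference<T &>  { typedef T type; };
    template <typename T> struct remove_reference<T &&> { typedef T type; };

    // Cast the argument to an rvalue reference; this cast is all a "move" is.
    template <typename T>
    constexpr typename remove_reference<T>::type && move(T && value) noexcept {
        return static_cast<typename remove_reference<T>::type &&>(value);
    }
}

// Hypothetical type that records whether its move constructor ran.
struct tracer {
    bool moved_from = false;
    bool move_constructed = false;
    tracer() = default;
    tracer(tracer && other) : move_constructed(true) { other.moved_from = true; }
};

// Returns true when the hand-rolled move selects the move constructor.
inline bool hand_rolled_move_works() {
    tracer source;
    tracer target(sketch::move(source));
    assert(target.move_constructed);
    return source.moved_from && target.move_constructed;
}
```

It works, but it means owning and maintaining a small piece of the standard library, which is the trade-off being weighed here.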

For now, we'll leave the include time alone.

Identifying Compilation Cost

Continuing to use /d1reportTime, I examined the output of the function definitions portions, but only for the unstd::vector code.

1>		unstd::vector::vector: 0.000091s
1>		unstd::vector::__autoclassinit2: 0.000034s
1>		unstd::vector::size: 0.000044s
1>		unstd::vector::capacity: 0.000036s
1>		unstd::vector::data: 0.000034s
1>		unstd::vector::end: 0.000041s
1>		unstd::vector::begin: 0.000035s
1>		unstd::vector::~vector: 0.000164s
1>		unstd::vector::operator []: 0.000086s
1>		unstd::vector::resize: 0.000197s
1>		unstd::vector::push: 0.000209s
1>		unstd::vector::pop: 0.000179s
1>		unstd::vector::erase: 0.000183s
1>		unstd::vector::preallocate: 0.000503s

Those times are tiny. I mean, like, they don't even really matter, right? Pff. Who cares about preallocate taking half of a millisecond. It's not big enough to worry about...

Except...

It's a Problem of Scale

The cost of a template is measured in two phases: 1) time to #include the header and 2) time to compile the instantiation of the template for the specified type.

When a translation unit (the cpp file) is processed, the preprocessor loads every header that is #included by that file and subsequent includes (sort of...). Headers will almost always have include guards to prevent recursion, so each header will usually load only once. Unlike non-templatized code, every single instantiation, reference, and even pointer to a templated class (yes, even private members of other classes) has its used/dereferenced members compiled.

In a large include hierarchy (e.g., headers including headers, uber/mega headers, etc.) this can mean hundreds of unique vector types in a single translation unit. The saving grace here is that it only compiles the things that are actually used, so if unstd::vector::push is never called, it's never compiled.
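The "only what's used gets compiled" rule is easy to see in a small sketch. The `lazy_box` template and `use_lazy_box` function are hypothetical, not part of the test code: the member `never_called` would not compile for `int`, yet the program builds because nothing ever calls it.

```cpp
#include <cassert>

template <typename T>
struct lazy_box {
    T value;

    // Instantiated for lazy_box<int> because use_lazy_box() calls it.
    T get() const { return value; }

    // Never instantiated for lazy_box<int>: nothing calls it, so the compiler
    // never builds the body, even though int has no member no_such_member().
    void never_called() { value.no_such_member(); }
};

// Uses exactly one member function of lazy_box<int>.
inline int use_lazy_box() {
    lazy_box<int> box{42};
    const int result = box.get();
    assert(result == 42);
    return result;
}
```

Adding a call to `box.never_called()` turns it into a compile error, which is the instantiation finally happening.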

Given a header that includes 100 other headers, with proper include guards the #include cost is seen only once per translation unit. If 50 different vector template types are used, then the compilation cost is paid 50 times. And that's just for a single translation unit; the next translation unit gets to do it all over again.
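To put rough numbers on that, here is a back-of-the-envelope sketch using the per-function times and the header time reported earlier. The linear model (one include plus one full set of members per type) is an assumption for illustration, not something measured, and the function names are made up.

```cpp
#include <cstddef>

// Per-function times in seconds, copied from the /d1reportTime listing above.
static const double k_function_times[] = {
    0.000091, 0.000034, 0.000044, 0.000036, 0.000034, 0.000041, 0.000035,
    0.000164, 0.000086, 0.000197, 0.000209, 0.000179, 0.000183, 0.000503,
};

// Header include time in seconds, from the "Reducing Include Cost" section.
static const double k_include_time = 0.233792;

// Cost of instantiating every listed member for a single vector type.
inline double instantiation_cost() {
    const std::size_t count = sizeof(k_function_times) / sizeof(k_function_times[0]);
    double total = 0.0;
    for (std::size_t i = 0; i < count; i++) {
        total += k_function_times[i];
    }
    return total;
}

// Assumed linear model: the include is paid once, the members once per type.
inline double translation_unit_cost(const std::size_t type_count) {
    return k_include_time + static_cast<double>(type_count) * instantiation_cost();
}
```

With these numbers a single type costs about 1.8 milliseconds, so 50 types add roughly 0.09 seconds on top of the include, per translation unit.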

So let's improve it.

Forward Declaration

Use forward declarations wherever possible. They eliminate the cost of the include as well as the cost of building the template for the specific type. Not doing something is the best optimization there is.

Templates can be forward declared if and only if none of the implementation is actually needed. The template must be referenced by pointer or reference only and never dereferenced. Any template parameter type(s) must also be forward declared or fully defined as well.

Forward declaring a template works just like anything else. For the code above, it would be:

namespace unstd {
    template<typename _TYPE_>
    class vector;
}
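As a sketch of what a header can get away with using only that declaration: the `inventory` struct and `has_items` function below are hypothetical, and neither the vector template nor `thing` is ever defined in this snippet.

```cpp
#include <cassert>

namespace unstd {
    template <typename T>
    class vector;                // forward declaration only, no definition
}

class thing;                     // the template argument is forward declared too

// Hypothetical owner type: it stores a pointer and never dereferences it,
// so the compiler never needs the vector definition or instantiates anything.
struct inventory {
    unstd::vector<thing> * items = nullptr;
};

// Comparing the pointer is fine; calling a member through it would not be.
inline bool has_items(const inventory & inv) {
    return inv.items != nullptr;
}
```

The moment some code calls `inv.items->size()`, that file needs the real header, but only that file pays for it.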

Determining What's Expensive

If we have to include and compile it, then we need to figure out what's costing time. There can be no meaningful change without meaningful measurement. Again using cl.exe's /d1reportTime flag, I compile one line at a time (with minimal modifications if required) and capture the time. The totals may not exactly match as there are slight variations from pass to pass. ¯\_(ツ)_/¯

    template<typename _TYPE_>
    void vector<_TYPE_>::preallocate(const size_t count) {
        if (count > m_capacity) {                                                       // 0.000023s
            _TYPE_ * const data = static_cast<_TYPE_*>(malloc(sizeof(_TYPE_) * count)); // 0.000076s
            const size_t end = m_size;                                                  // 0.000028s
            m_size = std::min(m_size, count);                                           // 0.000202s
            for (size_t i = 0; i < m_size; i++) {                                       // 0.000042s
                new (&data[i]) _TYPE_(std::move(m_data[i]));                            // 0.000148s
            }
            for (size_t i = 0; i < end; i++) {                                          // 0.000047s
                m_data[i].~_TYPE_();                                                    // 0.000031s
            }
            free(m_data);                                                               // 0.000028s
            m_data = data;                                                              // 0.000014s
            m_capacity = count;                                                         // 0.000019s
        }
    }

Some things, like std::min, are easy. Since we know the data type (size_t) and know it can't throw or do anything else that std::min protects against, let's just do the min ourselves.

// 0.000027s
if ( count < m_size ) {
    m_size = count;
}

After that is the malloc and casts. Everything is required and nothing is duplicated or individually cost heavy, so splitting the code up won't do any good. Reducing duplication helps, such as by replacing multiple sizeof operations with a const local variable, but there's not a whole lot more we can do to make it faster.

Deduplicating Duplicitous Duplication

But actually, there is quite a bit of duplication going on. That's kind of the whole point, though, which is why it's called a template in the first place. Every time a template is instantiated with a new type, all used members are recompiled, but the only difference is the type. The constructor, destructor, copy constructor, move constructor, and type size may change, but everything else is identical.

If the various constructors and destructor are moved to common functions and called, the template actually becomes more expensive to compile. Why? Because we're adding more templated member functions! Each empty template member function costs about 0.000030 seconds to compile, and that's before it has any code in it. By moving code out of one function and into another, we actually end up adding time.

So, what to do... what to do...

Enter The Dragon

Like Bruce Lee, we're going to infiltrate the template, then kill it from the inside.

namespace unstd {
    class vector_base {
    };

    template <typename _TYPE_>
    class vector : public vector_base {
        // ... snip ...
    };
}

I can hear the screams already. "Nooooo! Not a base class! Aaaaarrgghhhh!"

By giving the vector a base class, all of the code that the template needlessly duplicates can move to one place and only ever be compiled one time.

    class vector_base {
        public:
            vector_base(const size_t stride);
            ~vector_base();

            size_t capacity() const { return m_capacity; }
            size_t size() const { return m_size; }

            void push_move(void * const value);
            void push_copy(const void * const value);

            bool pop(void * const value);

            void erase(const size_t index);

            void resize(const size_t count);

            void preallocate(const size_t count);

        protected:
            void * element(const size_t index) { return reinterpret_cast<char *>(m_data) + index * m_stride; }
            const void * element(const size_t index) const { return reinterpret_cast<const char *>(m_data) + index * m_stride; }

            size_t m_capacity = 0;
            size_t m_size = 0;
            size_t m_stride = 0;
            void * m_data = nullptr;
    };

as the wails of "but mah type safety!" trail off into the distance...

Yes. Our base class uses void pointers. If it's really important to you, make those functions protected and add wrappers to the template to access them in a type safe manner. But that increases compile times, and that's not what this post is about.

¯\_(ツ)_/¯

The base class also steals all of the member variables from the vector class and adds one: the stride of the type it encapsulates. The stride is the number of bytes between elements in an array. Because of how C++ handles structure and class sizes, it's the same as sizeof(type).
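A quick sketch of the stride claim, using a hypothetical padded struct: stepping through an array by sizeof(type) bytes lands exactly on each element, which is what lets a base class find elements without knowing their type. `sample`, `element_at`, and `stride_matches_sizeof` are illustration-only names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type; the compiler pads it, and sizeof includes the padding.
struct sample {
    char tag;
    double weight;
};

// Address of element `index`, given only a base pointer and a stride in bytes.
inline void * element_at(void * const base, const std::size_t index, const std::size_t stride) {
    return static_cast<char *>(base) + index * stride;
}

// Checks that byte-stepping by sizeof(sample) reaches every array element.
inline bool stride_matches_sizeof() {
    sample items[4] = {};
    for (std::size_t i = 0; i < 4; i++) {
        if (element_at(items, i, sizeof(sample)) != static_cast<void *>(&items[i])) {
            return false;
        }
    }
    return true;
}
```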

Now, when our derived unstd::vector class needs to call preallocate, the base version is called instead, and that one only gets compiled once!

But I left out one very important detail... how are the elements of the vector constructed? The type is never passed to the base class, only the stride. Well, if the void pointers made you angry, you should stop reading now. Also, maybe avoid the internet for a while. And never, ever read the comments.

Because we're adding a function pointer.

The function pointer is necessary to be able to construct (move, copy, vanilla) and destruct objects within the list. Any time the base class needs to manipulate the list (insert, remove, append, copy) it has to call back to the type specific function to handle the type specific details.

"But! But performance!" they said. And "It makes the class larger!" they said.

Yes. The code now has a function that it calls any time the vector is modified. So what? Growing the vector involves allocating dynamic memory (thousands of cycles,) copying the old elements to the new memory (unknown number of cycles,) destructing the old objects (unknown number of cycles,) freeing the old memory (thousands of cycles,) and constructing the new element (unknown number of cycles.) If you're worried about a few cycles to call a function you must have the most optimal codebase in the history of ever (in which case, what are you doing here?)
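Before the full listing, here is a stripped-down sketch of the callback idea on its own. It is a variation rather than the code from this post: the callback takes plain void pointers so the function pointer needs no cast, and the names `relocate_fn`, `relocate_one`, `relocate_all`, and `counted` are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// One signature shared by every element type: everything travels as void *.
typedef void (*relocate_fn)(void * dst, void * src);

// The only type-specific code: move-construct into dst, then destroy src.
template <typename T>
void relocate_one(void * const dst, void * const src) {
    T * const from = static_cast<T *>(src);
    new (dst) T(std::move(*from));
    from->~T();
}

// Non-template code, compiled once: walks both buffers by stride in bytes.
inline void relocate_all(void * const dst, void * const src, const std::size_t count,
                         const std::size_t stride, const relocate_fn relocate) {
    char * const to = static_cast<char *>(dst);
    char * const from = static_cast<char *>(src);
    for (std::size_t i = 0; i < count; i++) {
        relocate(to + i * stride, from + i * stride);
    }
}

// Hypothetical element type with a visible move constructor.
struct counted {
    int id;
    explicit counted(const int value) : id(value) {}
    counted(counted && other) : id(other.id) { other.id = -1; }
};

// Builds three elements in one raw buffer, relocates them, and sums their ids.
inline int relocate_demo() {
    alignas(counted) unsigned char old_storage[3 * sizeof(counted)];
    alignas(counted) unsigned char new_storage[3 * sizeof(counted)];
    for (int i = 0; i < 3; i++) {
        new (old_storage + i * sizeof(counted)) counted(i + 1);
    }
    relocate_all(new_storage, old_storage, 3, sizeof(counted), &relocate_one<counted>);
    counted * const moved = reinterpret_cast<counted *>(new_storage);
    int sum = 0;
    for (int i = 0; i < 3; i++) {
        sum += moved[i].id;
        moved[i].~counted();
    }
    assert(sum == 6);
    return sum;
}
```

Only `relocate_one` is stamped out per type; the loop in `relocate_all` is ordinary code that can live in a cpp file and be compiled once.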

The Final Code?

So what does the final code look like? Meh, a bit mangled, but not too terrible.

The header:

#ifndef TEMPLATE_VECTOR_H
#define TEMPLATE_VECTOR_H

#include <algorithm>
#include <cassert>

namespace unstd {
    class vector_base {
        public:
            enum class op_type {
                movector,
                copyctor,
                ctor,
                dtor,
            };

            typedef void (*operation_fn)(const op_type, void * const dst, void * const src, const size_t count);

            vector_base(const size_t stride, operation_fn const);
            ~vector_base();

            size_t capacity() const { return m_capacity; }
            size_t size() const { return m_size; }

            void push_move(void * const value);
            void push_copy(const void * const value);

            bool pop(void * const value);

            void erase(const size_t index);

            void resize(const size_t count);

            void preallocate(const size_t count);

        protected:
            void * element(const size_t index) { return reinterpret_cast<char *>(m_data) + index * m_stride; }
            const void * element(const size_t index) const { return reinterpret_cast<const char *>(m_data) + index * m_stride; }

            size_t m_capacity = 0;
            size_t m_size = 0;
            size_t m_stride = 0;
            void * m_data = nullptr;
            operation_fn m_operation;
    };

    template <typename _TYPE_>
    class vector : public vector_base {
        public:
            vector() : vector_base(sizeof(_TYPE_), (operation_fn)op) {}

            vector(const vector & other) = delete;
            vector(vector && other) = delete;

            vector& operator=(const vector &) = delete;
            vector& operator=(vector &&) = delete;

            const _TYPE_ & operator[](const size_t index) const;
            _TYPE_ & operator[](const size_t index);

            void push(const _TYPE_ &);
            void push(_TYPE_ &&);

            const _TYPE_ * begin() const { return reinterpret_cast<const _TYPE_ *>(element(0)); }
            const _TYPE_ * end() const { return begin() + size(); }

            const _TYPE_ * data() const { return begin(); }

        private:
            static void op( const op_type, _TYPE_ * const dst, _TYPE_ * const src, const size_t count );
    };

    template<typename _TYPE_>
    const _TYPE_ & vector<_TYPE_>::operator[](const size_t index) const {
        assert(index < m_size);
        return *reinterpret_cast<const _TYPE_*>(element(index));
    }

    template<typename _TYPE_>
    _TYPE_ & vector<_TYPE_>::operator[](const size_t index) {
        if (index >= m_size) {
            resize(index + 1);
        }
        return *reinterpret_cast<_TYPE_*>(element(index));
    }

    template<typename _TYPE_>
    void vector<_TYPE_>::push(const _TYPE_ & value) {
        vector_base::push_copy(&value);
    }

    template<typename _TYPE_>
    void vector<_TYPE_>::push(_TYPE_ && value) {
        vector_base::push_move(&value);
    }

    template<typename _TYPE_>
    void vector<_TYPE_>::op( const op_type type, _TYPE_ * const dst, _TYPE_* const src, const size_t count ) {
        for (size_t i = 0; i < count; i++) {
            switch (type) {
                case op_type::movector:
                    new (&dst[i]) _TYPE_(std::move(src[i]));
                    break;

                case op_type::copyctor:
                    new (&dst[i]) _TYPE_(src[i]);
                    break;

                case op_type::ctor:
                    new (&dst[i]) _TYPE_();
                    break;

                case op_type::dtor:
                    dst[i].~_TYPE_();
                    break;
            }
        }
    }
} // unstd

#endif // TEMPLATE_VECTOR_H

The implementation file:

#include "template_vector.h"
#include <cstdlib>

namespace unstd {
    vector_base::vector_base(const size_t stride, operation_fn const op)
        : m_stride(stride)
        , m_operation(op) {
    }

    vector_base::~vector_base() {
        m_operation(op_type::dtor, m_data, nullptr, m_size);
        free(m_data);
    }

    void vector_base::push_move(void * const value) {
        preallocate(m_size + 1);
        m_operation(op_type::movector, element(m_size++), value, 1);
    }

    void vector_base::push_copy(const void * const value) {
        preallocate(m_size + 1);
        m_operation(op_type::copyctor, element(m_size++), const_cast<void *>(value), 1);
    }

    bool vector_base::pop(void * const value) {
        if (m_size > 0) {
            const size_t index = --m_size;
            if (value) {
                m_operation(op_type::movector, value, element(index), 1);
            }
            m_operation(op_type::dtor, element(index), nullptr, 1);
            return true;
        }
        return false;
    }

    void vector_base::erase(const size_t index) {
        m_operation(op_type::movector, element(index), element(index + 1), m_size - index - 1);
        m_operation(op_type::dtor, element(--m_size), nullptr, 1);
    }

    void vector_base::preallocate(const size_t count) {
        if (count > m_capacity) {
            void * const data = malloc(m_stride * count);
            const size_t end = m_size;
            if (count < m_size ) {
                m_size = count;
            }
            m_operation(op_type::movector, data, m_data, m_size);
            m_operation(op_type::dtor, m_data, nullptr, end);
            free(m_data);
            m_data = data;
            m_capacity = count;
        }
    }

    void vector_base::resize(const size_t count) {
        preallocate(count);
        if (count > m_size) {
            m_operation(op_type::ctor, element(m_size), nullptr, count - m_size);
        }
        m_size = count;
    }

} // namespace unstd

Results!

	Original Time (s)	Improved Time (s)	Savings
Visual Studio	0.5717545469	0.498964125	12.7%
clang	1.354978094	1.198699836	11.5%

Really? That's it? That's all that was saved?

Remember in the beginning when I said the vector class was, if anything, a bit too simple? Let's add some perspective and compare std::vector to the unstd::vector results. I compiled LLVM's LLVMCore project with the reportTime flag. Below is a sampling of std::vector compilation times where the contained object was not a pointer. Note that these aren't even "bad" times; some functions took >13ms to compile.

std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >	0.004525s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::push_back	0.001258s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::emplace_back	0.001062s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Emplace_back_with_unused_capacity	0.000512s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::~vector	0.000064s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::vector	0.000557s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Orphan_range	0.000826s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Tidy	0.000245s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Has_unused_capacity	0.000061s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Move_from	0.000374s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Destroy	0.000160s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::capacity	0.000078s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Emplace_reallocate	0.000963s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Xlength	0.000038s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Change_array	0.000255s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Calculate_growth	0.000153s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Umove_if_noexcept	0.000264s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Umove	0.000137s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::max_size	0.000171s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::size	0.000077s
std::vector<class llvm::StringRef,class std::allocator<class llvm::StringRef> >::_Umove_if_noexcept1	0.000102s

Oddities, Errors, and Errata

A few odd or interesting points about the source code:

Q: Is this the end? Is this the only way?
A: Nope, not in the slightest. I spent WAY too much time on this already; it's time to put it out in the public and see what others do to improve (or disprove) the idea.

Q: Why is the for loop in vector<_TYPE_>::op outside the switch statement? Performance Penalty, -1 Internet Kudo Points!
A: Because the repeated for loops inside the switch statements increased the compilation time of that function alone by 0.2ms per instantiation.

Q: Did you fully vet all of the changes you made? Incorrect code that compiles fast is useless!
A: Great question! I... umm... did, but there may be bugs. If I (or you!) find them, I'll fix them as I become aware of them.

Q: But it's ugly!
A: That's not a question. Also, have you read the STL code?

Q: Do you read the comments?
A: Only the ones here.

Q: When is the next post?
A: When I get to it. 😀

Q: How do I tell you about bugs, improvements, or something else?
A: Post a comment here or contact me on twitter @virtuallyrandom.