C++ Compilation: Fixing It – virtuallyrandom

For fixing compile times, there is no magic bullet. It takes work and work takes time. Before you spend time reducing compile (and link) times, evaluate if it actually makes sense. If you spend more on making the changes than you (or your team) will ever make back, it’s not worth the effort.

If you’re interested in compilation and linking optimization, you should also read Aras Pranckevičius’s blog, as he has lots of fantastic info.

For suggestions, changes, corrections, or questions, reach out to me here or @virtuallyrandom (Twitter is faster for me).

Forward Declarations
Precompiled Headers
Include Guards
Include Complexity
Overexposure
Code Analysis
Templates
Unity or Batch Builds
Compiler Diagnostics
Summary

Forward Declarations

Any class, struct, or enum class (C++11 and higher) that is only ever referred to by reference or pointer and never dereferenced (no destructor call, no “thing->soandso”, no “thing.soandso”) may be forward declared. The pointer is the variable. What it points to is irrelevant until it is dereferenced, at which point the compiler has to know what the thing is.

Forward declares are your weapon of choice in headers. Including <future> in a header puts the include in every single file that directly or indirectly includes that header.
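As a rough sketch of what that looks like (the file names and the Renderer, Texture, and bound_texture_id names are made up for illustration), the header below only ever stores and passes a Texture by pointer or reference, so a forward declaration is enough. The full definition is only needed where the pointer is finally dereferenced. The three "files" are pasted into one block so it compiles on its own.

```cpp
// --- renderer.h (hypothetical) ---
class Texture;  // forward declaration instead of #include "texture.h"

class Renderer {
public:
    void bind(const Texture& texture);               // reference: the name is enough
    const Texture* bound() const { return bound_; }  // pointer is copied, never dereferenced
private:
    const Texture* bound_ = nullptr;
};

// --- texture.h (hypothetical) ---
class Texture {
public:
    explicit Texture(int id) : id_(id) {}
    int id() const { return id_; }
private:
    int id_;
};

// --- renderer.cpp (hypothetical): includes both headers ---
void Renderer::bind(const Texture& texture) { bound_ = &texture; }

// Dereferencing requires the complete type, so this lives in the .cpp.
int bound_texture_id(const Renderer& renderer) {
    return renderer.bound() ? renderer.bound()->id() : -1;
}
```

Files that include only renderer.h never pay for texture.h, and edits to texture.h do not force them to rebuild.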

There are tools available that will help with this and writing your own isn’t terribly difficult. For clang, include-what-you-use is pretty decent.

Precompiled Headers

Precompiled headers (PCHs) are good for headers that don’t change very often. System or standard headers are really good candidates, especially given the cost of some of them. PCHs are often seen as either “black magic”/“advanced users only” or “all teh thingz!” They’re easily under- or over-utilized, each of which has consequences.

In broad, hand-wavy terms, a PCH is built by compiling an implementation file (*.cpp) with some compiler specific flags set. When headers in the include hierarchy are included, they’re run through the preprocessor step as normal, then a binary result of it is saved to disk. When that PCH is used by another file, the binary representation of the preprocessed file is loaded, skipping a lot of the work.

Building a PCH is very expensive, often dwarfing the gains realized. Any change to any header included in a PCH causes it to rebuild, so use extreme care as to the hierarchy of includes. One bad include can cause build times to skyrocket due to PCH generation, which is required to finish before anything using it can begin.
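To make that concrete, here is a minimal sketch of what a conservative PCH header might contain. The file names pch.h, pch.cpp, and main.cpp and the count_unique_words function are assumptions for illustration, and the command lines in the comments are the commonly documented forms; check your compiler's documentation before relying on them.

```cpp
// --- pch.h (hypothetical): only expensive headers that rarely change ---
#ifndef PCH_H
#define PCH_H

#include <cstddef>
#include <map>
#include <string>
#include <vector>
// Deliberately NOT here: project headers that are edited often. Any change to
// a header listed in this file rebuilds the PCH and everything that uses it.

#endif // PCH_H

// Typical build steps:
//   MSVC: cl /c /Ycpch.h pch.cpp     (create the PCH; pch.cpp just includes pch.h)
//         cl /c /Yupch.h main.cpp    (use it; main.cpp includes pch.h first)
//   GCC:  g++ -x c++-header pch.h -o pch.h.gch
//         (then #include "pch.h" as usual and the .gch is picked up)

// --- main.cpp (hypothetical) would start with #include "pch.h" ---
std::size_t count_unique_words(const std::vector<std::string>& words) {
    std::map<std::string, int> seen;
    for (const std::string& word : words) {
        ++seen[word];
    }
    return seen.size();
}
```

The point of the sketch is the selection rule, not the specific headers: stable and costly goes in, anything volatile stays out.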

I don’t have any data on it offhand (maybe for a future post?) but clang’s support of PCHs has historically left a bit to be desired. The time spent simply loading a sizeable PCH can be prohibitive.

Include Guards

Include guards are either a preprocessor directive (#pragma once) or a set of defines and checks wrapped around a header. Both prevent the header from being processed more than once in a single translation unit.

I usually do my define guards like this:

#ifndef MY_HEADER_FILE_H
#define MY_HEADER_FILE_H

// your code here

#endif // MY_HEADER_FILE_H

Using pragma once is simpler:

#pragma once

// your code here

For compiler performance, it doesn’t matter if you use defines or pragma once. Every major compiler vendor has optimized for the define method, so the speed difference between the two is negligible. There are, however, some differences…

pragma once:

sees each file exactly once
isn’t standard, although it’s supported by all major compilers
doesn’t work if a header is copied; it’s seen as a unique file

include guards:

sees each file exactly once
doesn’t work if the preprocessor symbol is duplicated or typo’d
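The duplicated-symbol failure is easy to miss because nothing errors out at the point of the mistake. The sketch below simulates it in a single file: the header names, the GEOMETRY_H guard, and the SIZE_H_WAS_PARSED marker are all invented for the demonstration.

```cpp
// --- point.h (hypothetical), with a carelessly generic guard name ---
#ifndef GEOMETRY_H
#define GEOMETRY_H
struct Point { int x; int y; };
#endif // GEOMETRY_H

// --- point.h included a second time: the guard is defined, the body is skipped ---
#ifndef GEOMETRY_H
#define GEOMETRY_H
struct Point { int x; int y; };  // would be a redefinition error without the guard
#endif // GEOMETRY_H

// --- size.h (hypothetical): a different header that reuses the same guard ---
#ifndef GEOMETRY_H
#define GEOMETRY_H
#define SIZE_H_WAS_PARSED 1
struct Size { int w; int h; };   // silently dropped: the guard is already defined
#endif // GEOMETRY_H

// Reports whether the body of size.h was skipped by the clashing guard.
constexpr bool size_header_was_skipped() {
#ifdef SIZE_H_WAS_PARSED
    return false;
#else
    return true;
#endif
}
```

In a real build the symptom would be an "unknown type Size" error far away from the actual cause, which is why guard names are usually derived from the full path or project name.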

Include Complexity

Reduce dumping ground headers. The more headers that are included, the longer the compilation takes. Having a header that includes everything is counterproductive as most of those headers won’t actually be used. Sure, it’s easy to add a file there, but it’s a slow death to compile times.

Some codebases forbid header to header includes, instead preferring to put them in global external includes or a PCH. This bloats the header or PCH, causes more frequent large builds for any change to a header, and becomes a nesting game of “header soandso needs this other thing,” so now both headers have to be included in a specific order. This type of inclusion is fragile, hard to simplify, and nearly impossible to fix.

Overexposure

In C++, anyone can see your private parts; the compiler just makes them slightly more difficult to access. External code needs to know private variables in order to glean the size and alignment of an object.

Often, classes include static private functions and data when they don’t need to. If the data or functions are (or can be) only used in the implementation file, make them static there instead, which has benefits for compile time (less for the compiler to parse in the header, and fewer rebuilds when they change) and link time (fewer symbols in the global tables).
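A small sketch of that move, using a made-up Parser class: the helper that would have been a private static member is instead a file-local static function in the implementation file, so the header never mentions it.

```cpp
// --- parser.h (hypothetical) ---
class Parser {
public:
    // Returns the digit value, or -1 if c is not a decimal digit.
    int parse_digit(char c) const;
    // Note: no "static bool is_digit(char);" in the private section.
};

// --- parser.cpp (hypothetical) ---
// File-local helper: internal linkage, invisible to anything including parser.h.
// An unnamed namespace would have the same effect.
static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

int Parser::parse_digit(char c) const {
    return is_digit(c) ? c - '0' : -1;
}
```

Changing or renaming is_digit now touches one .cpp file instead of every file that includes the header.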

PIMPL, or Pointer to Implementation, helps to privatize data and functionality, which lowers compile times as public header data doesn’t change as much, but it comes at various costs. Having an extra data pointer effectively turns it into a virtual class where the cache miss is on data instead of functions, and it also incurs the overhead of an additional layer of dynamic allocation. There’s a fast pimpl method that bypasses the cache-miss overhead, but it comes at a maintenance cost. Using unique_ptr locks you into the dynamic memory approach and also requires that the destructor be declared in the header and defined in the implementation file, even if only as “= default” (a destructor generated in the header has to know about the implementation bits, which is what we’re trying to avoid!) In all cases, any inline functions that touch the implementation must be moved to the implementation file. Handily, link-time optimization (especially when combined with profile guided optimization) can make them inline again to get runtime performance back! … at the cost of compile time!
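Here is a minimal sketch of the unique_ptr flavor, assuming a made-up Widget class. The header half only names Widget::Impl; the special member functions are declared there but defined in the implementation half, where Impl is a complete type.

```cpp
#include <memory>
#include <string>
#include <utility>

// --- widget.h (hypothetical) ---
class Widget {
public:
    explicit Widget(std::string name);
    ~Widget();                             // declared only; defined in the .cpp
    Widget(Widget&&) noexcept;
    Widget& operator=(Widget&&) noexcept;
    const std::string& name() const;       // not inline: it touches Impl
private:
    struct Impl;                           // incomplete type in the header
    std::unique_ptr<Impl> impl_;
};

// --- widget.cpp (hypothetical) ---
struct Widget::Impl {
    std::string name;
};

Widget::Widget(std::string name) : impl_(new Impl{std::move(name)}) {}
Widget::~Widget() = default;               // Impl is complete here, so this compiles
Widget::Widget(Widget&&) noexcept = default;
Widget& Widget::operator=(Widget&&) noexcept = default;

const std::string& Widget::name() const { return impl_->name; }
```

Adding a member to Impl now rebuilds only widget.cpp. The price is visible in the sketch too: one heap allocation per Widget and an extra pointer hop on every access.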

Code Analysis

This is a terrible suggestion, but if all you’re concerned about is compile times, disable code analysis. Analysis makes builds very, very… v e r y… slow.

But don’t do that. Correct code is happy code. Incorrect code that compiles fast is useless.

Templates

Templates destroy compilation performance. That being said, almost no C++ programmer is going to do away with them (and I’m not advocating that) but there are things you can do to make templates compile faster.

Templates are basically fancy macros where the type is substituted at compilation time, but they can also perform compile time logic. This allows for some amazing functionality, such as enable_if or template metaprogramming, but at the cost of the compiler having to do more work. More work = more time.

Most container classes now are templates, but they actually only use a very limited set of functionality of the object they contain: constructor, copy, move, assign, and destroy. The container code, however, is always copy/pasted in full and for the full type expansion. This makes for a lot of duplicated boilerplate code.

Reduce how far and deep the templates reach by making a base class that operates on opaque data blobs and sizes, calling to helper functions for the create/copy/move/destruct operations. The template then becomes a thin layer of syntactic sugar on top that provides the relevant operations. This also has the added benefit that fewer symbols are generated, which improves link times, and less debug output is created, which improves compile times.
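The following is a deliberately simplified sketch of that layering, with invented names (RawArray, Array). To stay short it assumes trivially copyable, default-constructible element types, so plain byte copies stand in for the create/copy/move/destruct helper functions a general-purpose version would need.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Non-template core: compiled once, knows only the element size.
class RawArray {
public:
    explicit RawArray(std::size_t element_size) : element_size_(element_size) {}

    void push_back_bytes(const void* element) {
        const unsigned char* p = static_cast<const unsigned char*>(element);
        bytes_.insert(bytes_.end(), p, p + element_size_);
    }
    const void* at_bytes(std::size_t index) const {
        return bytes_.data() + index * element_size_;
    }
    std::size_t size() const { return bytes_.size() / element_size_; }

private:
    std::size_t element_size_;
    std::vector<unsigned char> bytes_;  // one instantiation shared by every Array<T>
};

// Thin typed layer: the only code instantiated per element type.
template <typename T>
class Array : private RawArray {
    static_assert(std::is_trivially_copyable<T>::value,
                  "this sketch only handles trivially copyable types");
public:
    Array() : RawArray(sizeof(T)) {}
    void push_back(const T& value) { push_back_bytes(&value); }
    T at(std::size_t index) const {
        T out;
        std::memcpy(&out, at_bytes(index), sizeof(T));  // avoids alignment issues
        return out;
    }
    using RawArray::size;
};
```

Each new Array<T> adds only the few one-line wrappers; the storage and growth logic is emitted once no matter how many element types are used.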

Unity or Batch Builds

Unity (also called Batch) builds improve compile times by merging preprocessor passes together. The general idea is to have one .cpp file that #includes other .cpp files. How many files go into each unity file depends on the size of the included files and on load balancing the compilation workload.

While a lot of programmers see them as the easy-button/magic bullet, there are downsides to them as well. You lose file scope, so any statics that share the same name will become duplicate symbols. The includes in one .cpp file will bleed into all of the other .cpp files. Changing any file causes multiple files to be built. Oh, and you don’t need them when using Visual Studio’s cl.exe.
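To illustrate the lost file scope, here is a sketch with hypothetical file names (unity_0.cpp, audio.cpp, video.cpp). The two source files are pasted inline so the block compiles on its own. Wrapping each file's statics in its own namespace is just one possible workaround, not something the build technique provides for you.

```cpp
// unity_0.cpp (hypothetical). In a real unity build this file is only:
//   #include "audio.cpp"
//   #include "video.cpp"

// --- audio.cpp ---
namespace audio_detail {
static int buffer_count = 2;   // originally a plain file-scope static
}
int audio_buffers() { return audio_detail::buffer_count; }

// --- video.cpp ---
namespace video_detail {
static int buffer_count = 3;   // same name as in audio.cpp; without the
}                              // namespaces this is a redefinition error
int video_buffers() { return video_detail::buffer_count; }
```

Built separately, both files could keep a bare "static int buffer_count" with no conflict. Merged into one translation unit, every such name has to be made unique by hand.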

For Microsoft Visual Studio, building files that have similar settings, includes, etc. on the same compilation command has the same effect as a unity build but without any of the downsides. Yes, that’s right. You can get extreme build speed by “simply” changing how your files are built. But… of course there’s a but… you cannot use the multiprocessor compilation option as that simply launches a bunch of cl.exe’s, which defeats the reason why it works.

It seems that cl.exe doesn’t purge the generated preprocessor output between translation units. This means that if you compile three files that each have the same include, you pay for the include one time, not three. You get to keep file scope. There is no include bleeding. And, best of all, if you only changed the one cpp file, only that one cpp file has to build.

Compiler Diagnostics

Use the compiler to make informed decisions about which pieces of code are slow to compile and why.

For Microsoft Visual Studio, there are three flags that give information of varying usefulness:

/Bt+ reports front and back end compilation time per file. C1XX.dll is the front end compiler. It’s responsible for compiling source code to an intermediate language (IL.) Compile times here tend to be affected by preprocessor time (includes, templates, etc.) C2.dll is the back end compiler. It’s responsible for generating the object files (turning the IL into machine code.)

/d1reportTime reports compiler front end times, only in Visual Studio 2017 Community or higher. (thanks @phyronnaz and @aras_p) See the UECompileTimesVisualizer or Aras’ blog.

/d2cgsummary reports functions that have “anomalous” compile times. It’s useful; experiment.

Combining those three in Visual Studio provides a ton of information about where compilation time is going.

For clang, use -ftime-report. If you’re using clang, definitely check out Aras’s blog post, “time-trace: timeline / flame chart profile for Clang,” as he’s started instrumenting clang in order to figure out what’s compiling slowly.

Summary

For optimizing compilation, that’s all I can think of for right now. I’ll be adding more details, doing some exhaustive testing, and adding some notes for reducing linking times, but this post is already long enough.

If you’ve got additions, suggestions, questions, or corrections, reach out to me @virtuallyrandom.

Thanks!


2 thoughts on “C++ Compilation: Fixing It”

Andreas Haferburg
January 18, 2019 at 11:52 am

Which compiler version do you use? I just tried the /Bt+ flag, and while it does work in 2010, it doesn’t seem to have any effect anymore in 2017. Did they remove it or change the syntax?
- random (Post author)
  January 19, 2019 at 4:04 am
  
  I just tried it on Visual Studio 2015 and Visual Studio 2017 and it worked on both. Here’s my command line for VS2017:
  
  "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\bin\Hostx64\x64\cl.exe" /std:c++latest /I"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\include" /I"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\atlmfc\include" /I"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\VS\include" /I"C:\Program Files (x86)\Windows Kits\10\Include\10.0.16299.0\ucrt" /I"C:\Program Files (x86)\Windows Kits\10\Include\10.0.16299.0\um" /I"C:\Program Files (x86)\Windows Kits\10\Include\10.0.16299.0\shared" /I"C:\Program Files (x86)\Windows Kits\10\Include\10.0.16299.0\winrt" /I"C:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt" /I"Includeum" /I"C:\Program Files (x86)\Windows Kits\8.1\Include\um" /I"C:\Program Files (x86)\Windows Kits\8.1\Include\shared" /I"C:\Program Files (x86)\Windows Kits\8.1\Include\winrt" /nologo /EHsc /c /Bt+ __test_source_0.cpp
  
  and the output:
  __test_source_0.cpp
  
  time(C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\bin\Hostx64\x64\c1xx.dll)=0.03819s < 945478203146 - 945478297524 > BB [C:\Users\random\Desktop\compile_timer\__test_source_0.cpp]