Isabella Muerte

Posted on Mar 23, 2019 • Edited on Mar 29, 2019

Everything You Never Wanted To Know About CMake

#cpp #cmake #buildsystems #opensource

This post was originally posted to my personal site. It has been reposted here for visibility.

Just hearing the word CMake is typically enough to make a shiver run down my spine. I've been deep in the muck of its behavior and tooling the last few weeks as I finish up a CMake library titled IXM. While the various minutiae of how IXM works, why I wrote it, and all the nice little usage details are definitely for another post, the quick summary is that it abstracts away a majority of common needs from CMake users, thus allowing you to write even less CMake (and I think we can all agree that's a good thing). After writing a small ranty post in the YOSPOS subforum on Something Awful about all the gross and disgusting things I've learned about CMake in recent weeks, I decided I'd write up a more in-depth description. Without further ado, let's get into teaching you, the average user, everything you never wanted to know about CMake.

Quick Introduction

Rather than explaining what CMake is, what it does, how it works in extreme
detail, or what have you, I'm going to quickly describe the various steps CMake
takes during the entire build process. Effectively, the workflow of CMake is as
follows:

configure
generate
build

Within these steps we can do the following:

configure
- copy files
- execute processes
- read/write to files
- check host system state
- create, import, and modify build targets
generate
- write to files
- run generator expressions
- ... that's about it
build
- post-build commands
- pre-build commands
- execute processes
- generate files for consumption in later build stages, but only if CMake was able to prove that it could consume the files via the DAG.

One of the many criticisms of CMake is that it is not immediately obvious
which commands will run at which stage. Some commands execute immediately and
then create work to run at the generate step, others run entirely during the
configure step, and still others only execute during the build itself.
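To make the three stages a bit more concrete, here is a minimal sketch of one command from each. The project name, target name, and output file name are made up for illustration.

```cmake
cmake_minimum_required(VERSION 3.14)
project(stages_sketch LANGUAGES NONE)  # hypothetical project name

# Configure step: runs immediately while CMake processes this file.
execute_process(
  COMMAND "${CMAKE_COMMAND}" -E echo "configure: running now")

# Generate step: the file is only written once configuration has finished.
file(GENERATE
  OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/generated.txt"
  CONTENT "written during the generate step\n")

# Build step: runs only when the underlying build tool is invoked.
add_custom_target(stage_demo ALL
  COMMAND "${CMAKE_COMMAND}" -E echo "build: running now")
```

Nothing in the spelling of these three commands tells you when they run, which is exactly the complaint.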

IXM itself is mostly concerned with the configure and generate steps. Because we cannot tell the user which command executes at which stage, we try to hide the generate operations behind configure step command calls. This means that while we rely on the user performing work in the configure stage, they have less work to do as we simply set up generator expressions to execute in the background later. The nice thing about this is that we can figure out at the generation stage if our DAG is actually safe and correct and not just "hope for the best" during the configure stage.

Cursed and Custom Variables

Variables in CMake are just cursed eldritch terrors, lying in wait to scare the absolute piss out of anyone that isn't expecting it. Luckily, I drink a lot of coffee and I take a diuretic so this isn't anything new for me.

Beginning with CMake 3.0, there was a change in the way CMake treats variables. Effectively, an "unquoted" argument can be any character except whitespace or one of (, ), #, \, or ". Yes, this means CMake variables can contain emoji! How's that for a modern programming language?

# In awe at the byte size of this lad.
# What an absolute code unit.
set(🙃 "Why would you do this?")

But there is a caveat. When dereferencing a variable explicitly (i.e., ${🙃}), one must escape any character that is not alphanumeric or one of _, +, -, ., and /. Except, it's not the characters you're escaping, but the bytes themselves! Thus, we can never actually dereference the value stored in 🙃, unless CMake does it for us. This is done in the if(), elseif(), else(), while(), and foreach(IN LISTS) commands.
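As a small sketch of letting CMake do the dereferencing for us, the following relies on foreach(IN LISTS) and if(DEFINED) taking the variable by name. The list contents are made up, and the file is assumed to be saved as UTF-8.

```cmake
cmake_minimum_required(VERSION 3.14)

# The name is stored as raw bytes; CMake never validates it.
set(🙃 "Why;would;you;do;this?")

# foreach(IN LISTS) takes the variable *name*, so no ${} is needed.
foreach (word IN LISTS 🙃)
  message(STATUS "${word}")
endforeach()

# if(DEFINED) also looks the name up on our behalf.
if (DEFINED 🙃)
  message(STATUS "the cursed variable is defined")
endif()
```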

Additionally, because function() and macro() can take an unquoted argument, this means we can also name functions and macros with literally anything. The hard part, in this case, is how we can call a command. The only valid characters in this case are alphanumeric and the character _. Why would CMake let us create functions that can't be called? Hell if I know! 🤣

This brings us to the very last bit of information regarding variables in CMake. You can, with a little bit of magic, create your own variable namespaces. CMake's current set of variables exists in the following dereference "spaces". There is $, which uses the default lookup rules. Additionally, there are $CACHE and $ENV, which look into the CMakeCache.txt file and the system environment variables respectively. This style of variable dereferencing has spread to other parts of CMake. The most obvious example is the ExternalData module, which provides special DATA{} references.
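A quick sketch of the three built-in "spaces" side by side, meant to live in a CMakeLists.txt. GREETING and IXM_DEMO are hypothetical names, and $CACHE{} needs CMake 3.13 or newer.

```cmake
# A cache entry and a normal variable that share one name.
set(GREETING "from the cache" CACHE STRING "Hypothetical cache entry")
set(GREETING "from the current scope")

# An environment variable, visible only to this CMake process.
set(ENV{IXM_DEMO} "from the environment")

message(STATUS "default: ${GREETING}")       # normal variable wins the lookup
message(STATUS "cache:   $CACHE{GREETING}")  # skips straight to the cache
message(STATUS "env:     $ENV{IXM_DEMO}")    # reads the process environment
```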

In IXM's case, we go a step beyond this, and create a custom syntax to permit the fetching of content via various "provider" variable references. In this syntax, one can specify packages much like other build system/package manager combos. As an example, to get the extremely popular Catch2 C++ library, you can specify it via HUB{catchorg/catch2@v2.6.0}. This name can then be passed around, and it will eventually be used to construct parameters to the FetchContent module. Yes, it's painful, terrifying, and I'm not going to show you how to do it because it involves abusing CMake's regex engine, ad-hoc string replacements, and an arrogance not seen since moments before Icarus plummeted to his death.

Basic Dict-ion

CMake's current builtin data type is the string. However, lurking behind this string data type is an actual type system. How else, then, would CMake know what is a library and what is a source file? One major thing it is lacking, however, is a basic key-value system. As it turns out, we can abuse the current library type system to create our own dictionary type. The way this is done is to simply create an INTERFACE IMPORTED library. Then we simply add functions that automatically prepend INTERFACE_ to any keys passed in. Because the interface target is imported, anything that might depend on this library masquerading as a dictionary will not require that the target be exported during the install step. We can also get properties via the $<TARGET_PROPERTY> generator expression, however it is up to us to make sure the INTERFACE_ portion is prepended. I would love to see an alternative to this, but oh well.
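A minimal sketch of the idea, assuming nothing about IXM's real dict() implementation. The my_dict_* helpers and the settings target are hypothetical names invented for this example.

```cmake
cmake_minimum_required(VERSION 3.14)
project(dict_sketch LANGUAGES NONE)  # hypothetical project name

# Hypothetical helpers, not IXM's actual API.
function (my_dict_create name)
  # The "dictionary" is just an imported interface library.
  add_library(${name} INTERFACE IMPORTED)
endfunction()

function (my_dict_set name key)
  # Every key is stored as a property with INTERFACE_ prepended.
  set_property(TARGET ${name} PROPERTY INTERFACE_${key} ${ARGN})
endfunction()

function (my_dict_get name key out)
  get_property(value TARGET ${name} PROPERTY INTERFACE_${key})
  set(${out} "${value}" PARENT_SCOPE)
endfunction()

my_dict_create(settings)
my_dict_set(settings COLORS red green)
my_dict_get(settings COLORS colors)
message(STATUS "COLORS = ${colors}")
```

At generate time the same value should be reachable as $<TARGET_PROPERTY:settings,INTERFACE_COLORS>, with the prefix written out by hand.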

The only downside to this approach is that, due to scoping rules within CMake, dictionaries are faux-global. In other words, they are available to CMake scripts from the directory they were created in and any child directories. They cannot, sadly, be local to function scopes. Perhaps this might change in the future, and we'll get a real honest to god dictionary type, but don't hold your breath. I'd rather see the CMake language go away entirely than get a dictionary type. 🙂

Improper Properties

Remember moments ago when we were talking about "valid" strings and UTF-8? Well, I lied. As it turns out you just have bytes on your system. This means we can make invalid UTF-8 character sequences. One value that will never exist in a valid UTF-8 sequence is the C0 byte. Well, as it happens, we can just dump that into a property.

string(ASCII 192 C0)
set_property(TARGET ${target} PROPERTY INTERFACE_${C0} value)

Behold! CMake gives us the ability to have invalid properties and it doesn't even come close to giving a shit.

Coincidentally, the above property name is used to keep track of keys that were added to a dict() via the IXM interface. This comes into play when we serialize our dictionary types to disk, as knowing the exact keys to save comes in handy. This is especially true because we want to serialize them minus their INTERFACE_ prepended strings.

Additionally, there is a curious approach to handling cache variables in CMake. Cache variables effectively live in their own scope. This is why we can use $CACHE{VAR} to skip the typical variable lookup. In addition to setting the value of a cache variable via set(), we can also set it via set_property(). After all, cache variables have properties, one of which is VALUE, which we can get or set as desired. No need to call set(... FORCE) or pass -DVAR:TYPE=VALUE on the command line.
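As a small sketch of poking at the VALUE property directly, meant to live in a CMakeLists.txt. MY_OPTION is a hypothetical cache entry, and it has to exist before its properties can be set.

```cmake
# Create the cache entry first (hypothetical name and docstring).
set(MY_OPTION "first" CACHE STRING "Hypothetical cache entry")

# Overwrite the stored value without set(... FORCE).
set_property(CACHE MY_OPTION PROPERTY VALUE "second")

# Read it back, both through the property and through $CACHE{}.
get_property(current CACHE MY_OPTION PROPERTY VALUE)
message(STATUS "property: ${current}")
message(STATUS "cache:    $CACHE{MY_OPTION}")
```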

Serialization and Custom File Formats

CMake's "treat everything as a string" approach to scripting means that we have interesting side effects. Specifically, CMake does not (at the time of this writing) have any way of performing IO on binary data. Instead, you must use a pre-existing language or tool to extract binary data. This is, to be quite frank, frustrating as hell. However, we can cheat and make our own file formats. If you recall, ASCII (and by extension Unicode) has what are known as the C0 control codes. While many of these, such as the SOH (Start Of Header) or STX (Start of Text) control codes, have become superfluous thanks to the existence of TCP/IP, we can still use 4 specific control codes for separating our data into hierarchical structures. Specifically, the File Separator, Group Separator, Record Separator, and Unit Separator control codes are easily within our grasp. This means we can split up a fairly extensive amount of data.

CMake treats all strings separated by a ; as a list. This means having lists of lists is difficult. But with the magic of the above separators, we simply have to perform a string(REPLACE) call. The downside is we have to do it at least once per level of depth, but that is simple enough. Effectively, encoding looks like so

string(ASCII 31 unit-separator)
string(REPLACE ";" "${unit-separator}" data "${data}")
list(APPEND output "${data}")

Of course the reverse is to simply switch the location of the ; and ${unit-separator} when extracting from output.
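A round trip of a list of lists might look like the following sketch, which should run as a plain script via cmake -P. The variable names and sample data are made up.

```cmake
string(ASCII 31 unit)

# Two inner lists we want to store inside one outer list.
set(inner_a "a;b;c")
set(inner_b "d;e")

# Encode: hide each inner list's semicolons behind the unit separator.
set(output "")
foreach (inner IN ITEMS "${inner_a}" "${inner_b}")
  string(REPLACE ";" "${unit}" encoded "${inner}")
  list(APPEND output "${encoded}")
endforeach()

# The outer list now has exactly one element per inner list.
list(LENGTH output count)
message(STATUS "outer list holds ${count} elements")

# Decode: swap the separator back when pulling an element out.
foreach (encoded IN LISTS output)
  string(REPLACE "${unit}" ";" decoded "${encoded}")
  list(LENGTH decoded size)
  message(STATUS "${size} items: ${decoded}")
endforeach()
```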

The actual dict(SAVE) function in IXM looks like the following

function (ixm_dict_save name)
  parse(${ARGN} @ARGS=1 INTO)
  if (NOT INTO)
    error("dict(SAVE) missing 'INTO' parameter")
  endif()
  ixm_dict_noop(${name})
  dict(KEYS ${name} keys)
  string(ASCII 29 group)
  string(ASCII 30 record)
  string(ASCII 31 unit)
  foreach (key IN LISTS keys)
    dict(GET ${name} ${key} value)
    if (value)
      string(REPLACE ";" "${unit}" value "${value}")
      list(APPEND output "${key}${record}${value}")
    endif()
  endforeach()
  list(JOIN output "${group}" output)
  ixm_dict_filepath(INTO "${INTO}")
  string(ASCII 1 SOH)
  string(ASCII 2 STX)
  string(ASCII 3 ETX)
  string(ASCII 25 EM)
  string(ASCII 30 RS)
  string(ASCII 31 US)
  file(WRITE ${INTO} "${SOH}IXM${STX}${RS}version${US}1${ETX}${output}${EM}")
endfunction()

Yes, we are writing various STX and EM values to the file as well (even
though technically that's not what they're meant for), however that's to
future proof this file for its versioning, as the actual layout of the file may
change in the future, especially since IXM is currently not even at a stable
alpha version.
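Reading the key-value body back in is the same trick in reverse. The following is only a sketch of the splitting, not IXM's loader: the sample string is made up, and it assumes the same group, record, and unit separators used by the save function above.

```cmake
string(ASCII 29 group)
string(ASCII 30 record)
string(ASCII 31 unit)

# Hypothetical serialized body: two keys, the second holding a list.
set(serialized
  "name${record}demo${group}sources${record}a.cpp${unit}b.cpp")

# Split into one entry per key, then peel each layer off in turn.
string(REPLACE "${group}" ";" entries "${serialized}")
foreach (entry IN LISTS entries)
  string(REPLACE "${record}" ";" pair "${entry}")
  list(GET pair 0 key)
  list(GET pair 1 value)
  # Only now is it safe to turn the unit separators back into a list.
  string(REPLACE "${unit}" ";" value "${value}")
  message(STATUS "${key} = ${value}")
endforeach()
```

The order matters here: each level is expanded only after the level above it has been split, so the semicolons never collide.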

This 'streaming' format from the tape drive days¹ works well for CMake, as we lack fine grained byte access into strings. We cannot simply jump around willy nilly. We either must rely on content being stored in a CMake safe format, regexes, or reading one byte at a time in the CMake language (No thank you! 🙅). By treating CMake content as a "stream" of data, we can stitch the entire serialized format back into a state within CMake, as well as write it back out with little to no issue.

Currently the IXM 'database' format looks something to the effect of the
following:

␁IXM␂␞version␟1␞additional␟key␟value␟pairs␃␜<filename>␝<dict-name>␞key␟value
␟list␞another␟key␟pair␞OPTIONS␟BUILD_TESTING␟OFF[␜<filename>␝<dict-name>...]␙

The above text formatting might be a bit hard to follow, but let's walk through it anyhow. First, we use the start-of-header control code. This is so multiple database files can be concatenated together without issue, or possibly even embedded into another text file altogether. This is then followed by the start-of-text control code, which terminates the start-of-header section. We then treat the record separator and unit separator as a single depth way of setting key-value pairs in the database header. Currently, we just set the version, but in the future additional metadata could possibly be stored.

Next, we store a "file". This representation is technically superfluous. As of right now, we only ever have the one file, unless other files were written to and are being concatenated. Regardless, it's nice to have it be forward compatible. Each group starts with the name of the target, followed by its key-value pairs. These are separated by record and unit separators respectively. A unit separator might not appear if a key holds only a single value. Keys that have no value are never written to disk. dict() instances with no keys are never written to disk either.

So, why do this in the first place? Well, it gives us a bit more flexibility. Instead of polluting the CMake cache for storing previous runs (and then having to sometimes delete the cache just to fix some broken state), we can instead store previous runs for expensive operations. Want to work around the try_compile way of things and increase check() throughput? While not yet implemented, this approach of serializing data in the way people are used to allows us to side step some of CMake's anachronisms.

Events and The Nightmares Held Within

Remember above how I said we can't call commands that don't meet the calling convention? Sorry, but I lied again! There is only one way this works, and that's by way of events. Yes, CMake has events. No, they are not documented, and the order of operations is fickle and can change because some user "did a thing" you weren't expecting. With the exception of one operation. The hidden and not so well known post-configure step.

CMake has a very interesting command, typically meant for debugging. It is called variable_watch and it takes the name of a variable and a command. Because it is a parameter, this command name can be an unquoted argument. The same type of unquoted argument that allows us to have emoji or invalid UTF-8 byte sequences in our function names.

When CMake is finished with its configuration step, the CMAKE_CURRENT_LIST_DIR variable is set to an empty string. This means that, for all intents and purposes, we can execute destructors. Yes. We can force CMake to have RAII. We can even check the current stack to see where we are when executing. This makes the following possible:

function (ixm::exit variable access value current stack)
  #[[Do whatever your heart desires here]]
endfunction()
variable_watch(CMAKE_CURRENT_LIST_DIR ixm::exit)

This is, I should note, extremely useful if you're trying to break CMake to not do its normal thing of crushing your soul every time you want to start a new project.
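Since the watch fires on every access to the variable and not just the last one, a callback likely wants to filter. The following is a sketch that assumes the undocumented behavior described above (an empty value once configuration is finished). The function name is hypothetical.

```cmake
# Hypothetical callback; variable_watch passes these five arguments.
function (on_configure_end variable access value current stack)
  # Assumption: the value is only empty once the configure step is over.
  if ("${value}" STREQUAL "")
    message(STATUS "configure finished, running our 'destructor'")
  endif()
endfunction()

variable_watch(CMAKE_CURRENT_LIST_DIR on_configure_end)
```

Given that none of this is documented, treat it as something that may change between CMake releases.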

file(GENERATE)

Last on our list is the very powerful and very terrifying file(GENERATE) command. This command allows us to feed generator expressions to CMake that will then be used to generate a file at the generate step. These are, in effect, the closest analogy to a post-generate step we can get. What this allows us to do, essentially, is generate any kind of file that can depend on content that was created during the configure step. To save your sanity, I'm not going to be posting all of the massive amounts of code I've written to get the behaviors discussed below. You're more than free to peruse the project itself if you're curious.

For instance, this is how you can generate a response file for your C or C++
compiler based off of a target.

function (ixm_generate_response_file target)
  parse(${ARGN} @ARGS=? LANGUAGE)
  get_property(rsp TARGET ${target} PROPERTY RESPONSE_FILE)
  if (NOT rsp)
    set(output "${CMAKE_CURRENT_BINARY_DIR}/IXM/${target}.rsp")
    set_target_properties(${target}
    PROPERTIES
      RESPONSE_FILE ${output})
  endif()
  # This function generates the actual generator expressions. They're not
  # shown here for brevity.
  ixm_generate_response_file_expressions(${target})
  string(JOIN "\n" content
    ${Default}
    ${Release}
    ${Debug}
    ${INCLUDE_DIRECTORIES}
    ${COMPILE_DEFINITIONS}
    ${COMPILE_OPTIONS}
    ${COMPILE_FLAGS})

  # Quote the content so embedded semicolons and newlines stay one argument.
  file(GENERATE
    OUTPUT $<TARGET_PROPERTY:${target},RESPONSE_FILE>
    CONTENT "${content}")
endfunction()

Essentially, this gives you a way to get all the flags given to a specific target without having to manually track all the possible flags. Sadly, we cannot get the same property granularity at a directory or source file level scope. Regardless, generating a response file means we can do things like generate a precompiled header in a cross platform way. No need for cotire's approach to PCH generation, nor do we have to add a custom language as seen in the CMakePCHCompiler. Even better, we can use file(GENERATE) to conditionally create unity builds on a per-target basis. If we create our library target with add_library(<name> OBJECT), then we've recreated the ability to do per-directory unity builds as found in game engines like Unreal. Combine this with ninja and your build will see a considerable speed up.
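For a much smaller taste of the same idea, here is a sketch that writes a target's interface compile definitions out as -D flags. It is not IXM's implementation. The demo target, its definitions, and the output path are all hypothetical.

```cmake
cmake_minimum_required(VERSION 3.14)
project(generate_sketch LANGUAGES NONE)  # hypothetical project name

# A hypothetical target carrying a couple of definitions.
add_library(demo INTERFACE)
target_compile_definitions(demo INTERFACE DEMO_FEATURE=1 DEMO_LOGGING)

# The generator expressions are only evaluated at the generate step,
# after every target_compile_definitions() call has been seen.
file(GENERATE
  OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/demo.rsp"
  CONTENT "-D$<JOIN:$<TARGET_PROPERTY:demo,INTERFACE_COMPILE_DEFINITIONS>,\n-D>\n")
```

The resulting demo.rsp should contain one -D flag per line, regardless of where in the project the definitions were added.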

Finally, a few things we can do with file(GENERATE) also include, but are not limited to:

- Generating files for services like AppImage, systemd, or launchd without requiring a user to leave CMake, or using configure_file. Want to generate a file to automatically turn your executable into a Windows service as well? You can do that too.
- Writing a CPackConfig.cmake file that is created at generation time, removing the need to include(CPack) after all your calls to install(), and setting various global variables.
- Generating a CTestConfig.cmake file, or ignoring that altogether so you can have a decent unit test runner for once.

Welcome To H*ck

Now that I've shared the dark terrifying secrets that lie within CMake, I hope you, the reader, can internalize the nightmares that are sitting quietly waiting to unleash themselves. Perhaps you might lose this information one day, but you'll always know you once knew it, and that is a fate worse than most. Regardless, one thing is true after I have stared into the depths of CMake:

I͠ K̸no͜w ͝Too̢ ͡Muc͘ḩ

¹ Yes, I am aware that tape drives still exist. Chill.
