hirrolot

Posted on May 25, 2021 • Edited on May 30, 2022

Compiling Algebraic Data Types in Pure C99

#c #functional #programming #computerscience

Lots of people have been constantly asking me how Datatype99 works:

datatype(
    BinaryTree,
    (Leaf, int),
    (Node, BinaryTree *, int, BinaryTree *)
);

int sum(const BinaryTree *tree) {
    match(*tree) {
        of(Leaf, x) return *x;
        of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
    }

    return -1;
}

This library implements algebraic data types (ADT) for pure C99. This allows you to design convenient, type-safe interfaces while sticking to plain old C with zero runtime dependencies. Today I would like to explain what the hell it does under the hood!

If you are not acquainted with the concept of algebraic data types, please read Unleashing Sum Types in Pure C99 first.

To study the formal system on a concrete example, it is helpful to read this post along with the generated output of examples/binary_tree.c.

Data layout

First of all, we want to generate a data layout for a sum type. This is achieved by a tagged union: a union of possible variants data paired with a tag (discriminant). The latter is an integer designating which particular variant is active now.

Provided that multiple parameters can be specified in a single variant definition, we must gather them all into a separate C structure:

typedef struct <datatype-name><variant-name> {
    <type>0 _0;
    ...
    <type>N _N;
} <datatype-name><variant-name>;

Metavariables:

<datatype-name> is a sum type name.
<variant-name> is a variant name.
<type>I is a type of Ith variant parameter.

So far so good. Now, in order to collect all variant data structures into a single entity, we must put them into a union:

typedef union <datatype-name>Variants {
    char dummy;

    <datatype-name><variant-name>0 <variant-name>0;
    ...
    <datatype-name><variant-name>N <variant-name>N;
} <datatype-name>Variants;

char dummy; is very important: consider a situation where a datatype definition is analogous to a plain enum in C. In this case, <datatype-name>Variants would contain only char dummy;. If it was not present, the union would be empty, thus violating the C standard.

In order to allow exhaustive pattern matching (more on this later), a tag is represented as enum:

typedef enum <datatype-name>Tag {
    <variant-name>0Tag, ..., <variant-name>NTag
} <datatype-name>Tag;

At least one variant must be specified in datatype, so this enumeration will never be empty.

Finalising a tagged union definition:

struct <datatype-name> {
    <datatype-name>Tag tag;
    <datatype-name>Variants data;
};

Now it can be used as any other manually written tagged union.

Value constructors

A value constructor is an inline static function which constructs a variant instance. Each variant has its own value constructor.

inline static <datatype-name> <variant-name>(...) {
    <datatype-name> result;
    result.tag = <variant-name>Tag;
    result.data.dummy = '\0';
    {
        result.data.<variant-name>._0 = _0;
        ...
        result.data.<variant-name>._N = _N;
    }
    return result;
}

(We set result.data.dummy to '\0' even if no variant payload is present, so that .dummy is always initialised.)

Contrary to manual initialisation (a.k.a. {.tag = A, .data.A = { ... }}), a value constructor is

More convenient to use.
Safer: it automatically injects a corresponding tag, protecting a user from invalid combinations of tag + data.

Pattern matching

This is the most interesting part. Pattern matching is described by the following form:

match(val) {
    of(<variant-name>0, ...) <stmt>
    ...
    of(<variant-name>N, ...) <stmt>
}

How this can be implemented? The key idea is to use statement prefixes. They were first introduced by Simon Tatham in his Metaprogramming Custom Control Structures in C.

So what is actually a statement prefix? It is any C language construction that must precede a C statement. Here are some examples: if (...), for (...), while (...), if (...) <stmt> else, a label, or a combination thereof. Moreover, a statement prefix together with a statement afterwards results in a single C statement. From my experience, for (...) is the most flexible one so we are going to use it to implement match & of.

Let's start with match(val).

`match`

All of clauses must be able to access the provided value, therefore we must store it somewhere beforehand. We cannot just define it as usual: void *datatype99_priv_matched_val = (void *)&(val); because

The whole match construction needs to be a single C statement, according to Datatype99's grammar.
If two match constructions in the same lexical scope output datatype99_priv_matched_val, compilation will fail.

How can we overcome this? The following pattern will help:

for (void *datatype99_priv_matched_val = ((void *)&(val));
         datatype99_priv_matched_val != (void *)0;
         datatype99_priv_matched_val = (void *)0)

(datatype99_priv_matched_val is named in such a way to respect macro hygiene, i.e. not to conflict with user-defined variables.)

This single-iteration loop is a statement prefix. In order to avoid name clashes, it opens a new lexical scope and introduces datatype99_priv_matched_val into it. After it, we put switch:

switch ((val).tag)

After this switch, a user will specify a number of braced of clauses, thus completing the initial statement prefix for (...).

match(val) is done. Let's move on to of.

`of`

Logically, one of branch must boil down to one case branch, but think about break: it cannot be put after a user-provided statement <stmt> following of because our macro just has not got access to this piece of code. Instead, we put it before <stmt>! Thus, instead of

switch(...) {
    case A: <stmt> break;
    case B: <stmt> break;
    case C: <stmt> break;
}

we generate

switch(...) {
    break; case A: <stmt>
    break; case B: <stmt>
    break; case C: <stmt>
}

The first break does nothing: the control flow will never reach it.
The last break is absent: the control flow will nonetheless fall into the next instruction after switch(...).

Now the bindings need to be generated somehow (i.e. variable names provided to of). All the information we have in of is

the variant name <variant-name>,
a list of bindings,
the matched value datatype99_priv_matched_val.

Each binding must

Boil down to a variable definition.
Be initialised to a corresponding variant argument taken from datatype99_priv_matched_val.

For this to happen, we need to

Deduce the type of each binding.
Deduce the previously erased type of datatype99_priv_matched_val.

Type deduction

Hopefully, all the types are already known to us at the stage of datatype generation. The trick is to typedef them appropriately:

To deduce the type of the Ith binding, we can take the provided <variant-name> and concatenate it with _I, thereby obtaining a type alias to the corresponding parameter type.
To deduce the type of datatype99_priv_matched_val, we can take the provided <variant-name> and concatenate it with SumT, thereby obtaining a type alias to the outer sum type.

To make it formal, for each variant the following type definitions are generated:

typedef struct <datatype-name> <variant-name>SumT;

typedef <type>0 <variant-name>_0;
...
typedef <type>N <variant-name>_N;

Finally, inside of, this is how each binding is generated:

for (
tag_##_##i *x = &((tag_##SumT *)datatype99_priv_matched_val)->data.tag_._##i, *ml99_priv_break = (void *)0;
ml99_priv_break != (void *)1;
ml99_priv_break = (void *)1)

x stands for the binding name.
tag_##_##i stands for the type alias <variant-name>_I. This is the type of the binding.
tag_##SumT stands for the type alias <variant-name>SumT. This is the type of datatype99_priv_matched_val.

As you might have already noticed, this is a single-iteration loop too. But it has quite different structure than the previous one: instead of using x to terminate the cycle, we employ the second variable ml99_priv_break (originated from Metalang99, the underlying metaprogramming library). Why? Because otherwise, if x is leaved unused by a user, a compiler will not emit a warning, which is undesirable in our case.

Why we generate the long chains of statement prefix loops instead of plain variable definitions? Because all case branches in switch share a single lexical scope, meaning that two variables with the same name cannot be defined in two different case branches. In contrast to this, for (...) opens a new lexical scope, thus allowing for the same binding names in two distinct of branches.

So what we have in summary? Consider this code snippet:

datatype(
    BinaryTree,
    (Leaf, int),
    (Node, BinaryTree *, int, BinaryTree *)
);

int sum(const BinaryTree *tree) {
    match(*tree) {
        of(Leaf, x) return *x;
        of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
    }

    return -1;
}

Here, of(Leaf, x) expands to

for (Leaf_0 *x = &((LeafSumT *)datatype99_priv_matched_val)->data.Leaf._0,
            *ml99_priv_break = (void *)0;
     ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)

, while of(Node, lhs, x, rhs) expands to

for (Node_0 *lhs = &((NodeSumT *)datatype99_priv_matched_val)->data.Node._0,
            *ml99_priv_break = (void *)0;
     ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)
    for (Node_1 *x = &((NodeSumT *)datatype99_priv_matched_val)->data.Node._1,
                *ml99_priv_break = (void *)0;
         ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)
        for (Node_2 *rhs = &((NodeSumT *)datatype99_priv_matched_val)->data.Node._2,
                    *ml99_priv_break = (void *)0;
             ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)

After both of them, a return statement follows, thus completing the statement prefixes chain.

Exhaustive pattern matching

Datatype99 has yet another neat feature: a sane compiler will emit a warning if not all variants are handled in match:

match(*tree) {
    of(Leaf, x) return *x;
    // of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
}

playground.c: In function ‘sum’:
playground.c:6:5: warning: enumeration value ‘NodeTag’ not handled in switch [-Wswitch]
    6 |     match(*tree) {
      |     ^~~~~

This is called exhaustive pattern matching. In fact, Datatype99 does nothing to achieve it -- provided that we desugar match into a proper switch, a C compiler can do case analysis for us! For example, the following code compiles with a warning as well:

[godbolt]

typedef enum {
    Foo, Bar
} MyEnum;

const MyEnum foo = Foo;

switch (foo) {
    case Foo:
        break;
}

<source>:8:5: warning: enumeration value 'Bar' not handled in switch [-Wswitch]
    8 |     switch (foo) {
      |     ^~~~~~

Final words

We have not considered some utility macros like ifLet and matches since nothing special lies in them -- they are very similar to what we have already seen. For more details, feel free to investigate the source code of Datatype99. Thanks to the rich functionality of Metalang99, it fits in just ~400 LoC single header file.

It took me almost a year to figure out how to implement Datatype99 properly. Now it is a well-tested and stable v1.x.y project, so you can use it without fear that something will be broken in the next release.

Happy hacking!

Support me on Patreon >>

Try out Datatype99 >>

Reinvent your career. Join DEV.

It takes one minute and is worth it for your career.

Get started

Top comments (4)

Alexander Chichigin • May 28 '21

A sick (in a good sense! 😃) library and a nice post!

Though I have one clarifying question.

A value constructor is an inline static function

If it's static and defined in a .c file it won't be visible outside this file. But putting the whole (generated) function in a .h file is less than ideal. What way out of this would you suggest?

hirrolot • May 28 '21 • Edited

I would suggest to use precompiled headers to make all these functions compiled only once. This would also increase compilation times of datatype(...).

Alexander Chichigin • May 28 '21

You probably mean decrease compilation times? 😄

hirrolot • May 28 '21

Ah yeah, that's a typo :)

DEV Community

Compiling Algebraic Data Types in Pure C99

Data layout

Value constructors

Pattern matching

`match`

`of`

Type deduction

Exhaustive pattern matching

Final words

Top comments (4)

Read next

Harnessing the Power of Object-Oriented and Functional Programming Paradigms in Software Development

Qwik Resumability Explained

Understanding Large Language Models: From Training to Real-World Use

Daily JavaScript Challenge #JS-72: Count the Frequency of Every Unique Element in an Array