DEV Community

hirrolot
hirrolot

Posted on • Edited on

Compiling Algebraic Data Types in Pure C99

Lots of people have been constantly asking me how Datatype99 works:

datatype(
    BinaryTree,
    (Leaf, int),
    (Node, BinaryTree *, int, BinaryTree *)
);

int sum(const BinaryTree *tree) {
    match(*tree) {
        of(Leaf, x) return *x;
        of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
    }

    return -1;
}
Enter fullscreen mode Exit fullscreen mode

This library implements algebraic data types (ADT) for pure C99. This allows you to design convenient, type-safe interfaces while sticking to plain old C with zero runtime dependencies. Today I would like to explain what the hell it does under the hood!

If you are not acquainted with the concept of algebraic data types, please read Unleashing Sum Types in Pure C99 first.

To study the formal system on a concrete example, it is helpful to read this post along with the generated output of examples/binary_tree.c.

Data layout

First of all, we want to generate a data layout for a sum type. This is achieved by a tagged union: a union of possible variants data paired with a tag (discriminant). The latter is an integer designating which particular variant is active now.

Provided that multiple parameters can be specified in a single variant definition, we must gather them all into a separate C structure:

typedef struct <datatype-name><variant-name> {
    <type>0 _0;
    ...
    <type>N _N;
} <datatype-name><variant-name>;
Enter fullscreen mode Exit fullscreen mode

Metavariables:

  • <datatype-name> is a sum type name.
  • <variant-name> is a variant name.
  • <type>I is a type of Ith variant parameter.

So far so good. Now, in order to collect all variant data structures into a single entity, we must put them into a union:

typedef union <datatype-name>Variants {
    char dummy;

    <datatype-name><variant-name>0 <variant-name>0;
    ...
    <datatype-name><variant-name>N <variant-name>N;
} <datatype-name>Variants;
Enter fullscreen mode Exit fullscreen mode

char dummy; is very important: consider a situation where a datatype definition is analogous to a plain enum in C. In this case, <datatype-name>Variants would contain only char dummy;. If it was not present, the union would be empty, thus violating the C standard.

In order to allow exhaustive pattern matching (more on this later), a tag is represented as enum:

typedef enum <datatype-name>Tag {
    <variant-name>0Tag, ..., <variant-name>NTag
} <datatype-name>Tag;
Enter fullscreen mode Exit fullscreen mode

At least one variant must be specified in datatype, so this enumeration will never be empty.

Finalising a tagged union definition:

struct <datatype-name> {
    <datatype-name>Tag tag;
    <datatype-name>Variants data;
};
Enter fullscreen mode Exit fullscreen mode

Now it can be used as any other manually written tagged union.

Value constructors

A value constructor is an inline static function which constructs a variant instance. Each variant has its own value constructor.

inline static <datatype-name> <variant-name>(...) {
    <datatype-name> result;
    result.tag = <variant-name>Tag;
    result.data.dummy = '\0';
    {
        result.data.<variant-name>._0 = _0;
        ...
        result.data.<variant-name>._N = _N;
    }
    return result;
}
Enter fullscreen mode Exit fullscreen mode

(We set result.data.dummy to '\0' even if no variant payload is present, so that .dummy is always initialised.)

Contrary to manual initialisation (a.k.a. {.tag = A, .data.A = { ... }}), a value constructor is

  • More convenient to use.
  • Safer: it automatically injects a corresponding tag, protecting a user from invalid combinations of tag + data.

Pattern matching

This is the most interesting part. Pattern matching is described by the following form:

match(val) {
    of(<variant-name>0, ...) <stmt>
    ...
    of(<variant-name>N, ...) <stmt>
}
Enter fullscreen mode Exit fullscreen mode

How this can be implemented? The key idea is to use statement prefixes. They were first introduced by Simon Tatham in his Metaprogramming Custom Control Structures in C.

So what is actually a statement prefix? It is any C language construction that must precede a C statement. Here are some examples: if (...), for (...), while (...), if (...) <stmt> else, a label, or a combination thereof. Moreover, a statement prefix together with a statement afterwards results in a single C statement. From my experience, for (...) is the most flexible one so we are going to use it to implement match & of.

Let's start with match(val).

match

All of clauses must be able to access the provided value, therefore we must store it somewhere beforehand. We cannot just define it as usual: void *datatype99_priv_matched_val = (void *)&(val); because

  • The whole match construction needs to be a single C statement, according to Datatype99's grammar.
  • If two match constructions in the same lexical scope output datatype99_priv_matched_val, compilation will fail.

How can we overcome this? The following pattern will help:

for (void *datatype99_priv_matched_val = ((void *)&(val));
         datatype99_priv_matched_val != (void *)0;
         datatype99_priv_matched_val = (void *)0)
Enter fullscreen mode Exit fullscreen mode

(datatype99_priv_matched_val is named in such a way to respect macro hygiene, i.e. not to conflict with user-defined variables.)

This single-iteration loop is a statement prefix. In order to avoid name clashes, it opens a new lexical scope and introduces datatype99_priv_matched_val into it. After it, we put switch:

switch ((val).tag)
Enter fullscreen mode Exit fullscreen mode

After this switch, a user will specify a number of braced of clauses, thus completing the initial statement prefix for (...).

match(val) is done. Let's move on to of.

of

Logically, one of branch must boil down to one case branch, but think about break: it cannot be put after a user-provided statement <stmt> following of because our macro just has not got access to this piece of code. Instead, we put it before <stmt>! Thus, instead of

switch(...) {
    case A: <stmt> break;
    case B: <stmt> break;
    case C: <stmt> break;
}
Enter fullscreen mode Exit fullscreen mode

we generate

switch(...) {
    break; case A: <stmt>
    break; case B: <stmt>
    break; case C: <stmt>
}
Enter fullscreen mode Exit fullscreen mode
  • The first break does nothing: the control flow will never reach it.
  • The last break is absent: the control flow will nonetheless fall into the next instruction after switch(...).

Now the bindings need to be generated somehow (i.e. variable names provided to of). All the information we have in of is

  • the variant name <variant-name>,
  • a list of bindings,
  • the matched value datatype99_priv_matched_val.

Each binding must

  1. Boil down to a variable definition.
  2. Be initialised to a corresponding variant argument taken from datatype99_priv_matched_val.

For this to happen, we need to

  1. Deduce the type of each binding.
  2. Deduce the previously erased type of datatype99_priv_matched_val.

Type deduction

Hopefully, all the types are already known to us at the stage of datatype generation. The trick is to typedef them appropriately:

  • To deduce the type of the Ith binding, we can take the provided <variant-name> and concatenate it with _I, thereby obtaining a type alias to the corresponding parameter type.

  • To deduce the type of datatype99_priv_matched_val, we can take the provided <variant-name> and concatenate it with SumT, thereby obtaining a type alias to the outer sum type.

To make it formal, for each variant the following type definitions are generated:

typedef struct <datatype-name> <variant-name>SumT;

typedef <type>0 <variant-name>_0;
...
typedef <type>N <variant-name>_N;
Enter fullscreen mode Exit fullscreen mode

Finally, inside of, this is how each binding is generated:

for (
tag_##_##i *x = &((tag_##SumT *)datatype99_priv_matched_val)->data.tag_._##i, *ml99_priv_break = (void *)0;
ml99_priv_break != (void *)1;
ml99_priv_break = (void *)1)
Enter fullscreen mode Exit fullscreen mode
  • x stands for the binding name.
  • tag_##_##i stands for the type alias <variant-name>_I. This is the type of the binding.
  • tag_##SumT stands for the type alias <variant-name>SumT. This is the type of datatype99_priv_matched_val.

As you might have already noticed, this is a single-iteration loop too. But it has quite different structure than the previous one: instead of using x to terminate the cycle, we employ the second variable ml99_priv_break (originated from Metalang99, the underlying metaprogramming library). Why? Because otherwise, if x is leaved unused by a user, a compiler will not emit a warning, which is undesirable in our case.

Why we generate the long chains of statement prefix loops instead of plain variable definitions? Because all case branches in switch share a single lexical scope, meaning that two variables with the same name cannot be defined in two different case branches. In contrast to this, for (...) opens a new lexical scope, thus allowing for the same binding names in two distinct of branches.

So what we have in summary? Consider this code snippet:

datatype(
    BinaryTree,
    (Leaf, int),
    (Node, BinaryTree *, int, BinaryTree *)
);

int sum(const BinaryTree *tree) {
    match(*tree) {
        of(Leaf, x) return *x;
        of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
    }

    return -1;
}
Enter fullscreen mode Exit fullscreen mode

Here, of(Leaf, x) expands to

for (Leaf_0 *x = &((LeafSumT *)datatype99_priv_matched_val)->data.Leaf._0,
            *ml99_priv_break = (void *)0;
     ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)
Enter fullscreen mode Exit fullscreen mode

, while of(Node, lhs, x, rhs) expands to

for (Node_0 *lhs = &((NodeSumT *)datatype99_priv_matched_val)->data.Node._0,
            *ml99_priv_break = (void *)0;
     ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)
    for (Node_1 *x = &((NodeSumT *)datatype99_priv_matched_val)->data.Node._1,
                *ml99_priv_break = (void *)0;
         ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)
        for (Node_2 *rhs = &((NodeSumT *)datatype99_priv_matched_val)->data.Node._2,
                    *ml99_priv_break = (void *)0;
             ml99_priv_break != (void *)1; ml99_priv_break = (void *)1)
Enter fullscreen mode Exit fullscreen mode

After both of them, a return statement follows, thus completing the statement prefixes chain.

Exhaustive pattern matching

Datatype99 has yet another neat feature: a sane compiler will emit a warning if not all variants are handled in match:

match(*tree) {
    of(Leaf, x) return *x;
    // of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
}
Enter fullscreen mode Exit fullscreen mode
playground.c: In function ‘sum’:
playground.c:6:5: warning: enumeration value ‘NodeTag’ not handled in switch [-Wswitch]
    6 |     match(*tree) {
      |     ^~~~~
Enter fullscreen mode Exit fullscreen mode

This is called exhaustive pattern matching. In fact, Datatype99 does nothing to achieve it -- provided that we desugar match into a proper switch, a C compiler can do case analysis for us! For example, the following code compiles with a warning as well:

[godbolt]

typedef enum {
    Foo, Bar
} MyEnum;

const MyEnum foo = Foo;

switch (foo) {
    case Foo:
        break;
}
Enter fullscreen mode Exit fullscreen mode
<source>:8:5: warning: enumeration value 'Bar' not handled in switch [-Wswitch]
    8 |     switch (foo) {
      |     ^~~~~~
Enter fullscreen mode Exit fullscreen mode

Final words

We have not considered some utility macros like ifLet and matches since nothing special lies in them -- they are very similar to what we have already seen. For more details, feel free to investigate the source code of Datatype99. Thanks to the rich functionality of Metalang99, it fits in just ~400 LoC single header file.

It took me almost a year to figure out how to implement Datatype99 properly. Now it is a well-tested and stable v1.x.y project, so you can use it without fear that something will be broken in the next release.

Happy hacking!

Support me on Patreon >>

Try out Datatype99 >>

Top comments (4)

Collapse
 
gabrielfallen profile image
Alexander Chichigin

A sick (in a good sense! 😃) library and a nice post!

Though I have one clarifying question.

A value constructor is an inline static function

If it's static and defined in a .c file it won't be visible outside this file. But putting the whole (generated) function in a .h file is less than ideal. What way out of this would you suggest?

Collapse
 
hirrolot profile image
hirrolot • Edited

I would suggest to use precompiled headers to make all these functions compiled only once. This would also increase compilation times of datatype(...).

Collapse
 
gabrielfallen profile image
Alexander Chichigin

You probably mean decrease compilation times? 😄

Thread Thread
 
hirrolot profile image
hirrolot

Ah yeah, that's a typo :)