
Paul J. Lucas


include-tidy: A Tool to Enforce Include-What-You-Use

Introduction

Unlike Go, whose compiler strictly enforces the inclusion of modules (i.e., include exactly what you use, no more, no less), neither the C nor the C++ language definition has an equivalent enforcement. This can lead to two problems:

  1. If you include a header that’s not needed (i.e., no symbols from it are referenced), then compilation gets a tiny bit slower. For small codebases, this is negligible; for larger codebases where builds can take the better part of an hour or even longer, it matters.

  2. If you don’t directly include a header that’s needed (i.e., one or more symbols from it are referenced), then if you remove a seemingly unrelated #include, the build can break. Alternatively, your program may build just fine on one platform, but fail on another due to either system headers being slightly different or #ifdefs.

To avoid both problems, you should follow the include-what-you-use principle, namely that a source file:

  1. Directly includes, exactly once, every header file from which it references symbols.

  2. Does not include a header file from which it does not reference symbols.

While you can attempt to adhere to that principle manually, it can be difficult as the number of headers either in or used by a project increases.

You’d think there would be a way to automate checking a source file for violations of the principle — and there is. After just a little bit of searching, I found include-what-you-use.

include-what-you-use

Having found include-what-you-use (iwyu), I tried it out. It sort-of worked, but gave incorrect suggestions. For example, when using getopt_long via Gnulib, it said I needed to include getopt-ext.h. While that’s technically correct (because that’s the file in Gnulib that declares getopt_long), that file is a Gnulib implementation detail that users aren’t supposed to know or care about; users are still only supposed to include getopt.h. Iwyu has a way to fix this via “mapping files,” but it’s annoying that you have to do it manually.

It seemed to me that if you include a standard(ish) header like getopt.h, then any headers that getopt.h includes should be considered implementation details, i.e., getopt.h should automatically be a proxy for headers it includes so that any symbols declared in those other headers should be treated as-if they were declared in getopt.h.

Since the sets of C, C++, and POSIX standard headers are defined, any header included by one of the standard headers should automatically be considered an implementation detail and the standard headers should automatically proxy for those other headers.

Aside from that, I found iwyu less than ideal in a few other ways:

  • Some configuration is done via the aforementioned mapping files while other configuration is done via special IWYU comments in source code.

  • Iwyu has been around since 2011. As of this writing in 2026, that’s 15 years (!) and it’s still only at version 0.26. While it is under active development, it’s apparently progressing very slowly.

  • Looking at its source code repository confirms development is progressing slowly.

  • Looking at its open issues, it’s probably going to be a while before those get fixed.

  • Part of the slowness might be due to iwyu using the C++ LLVM API that’s not intended to be a stable API. (It’s intended to support LLVM and clang themselves.) Hence, any time the API changes, the authors have to spend time updating iwyu.

  • When a file has no violations (adheres to the include-what-you-use principle), there’s no way to have iwyu be silent.

  • It uses underscores for its long command-line options that are more annoying to type than dashes.

include-tidy

Introduction

Given the state of iwyu, I decided to write my own tool to enforce the include-what-you-use principle: include-tidy (Tidy). I optimistically (naively?) thought “How hard can it be?” While the core functionality wasn’t hard (I implemented it in a couple of weeks), it’s invariably the corner cases, the last 10%, that take 90% of the time. Even so, I’ve gotten Tidy to 1.0 in approximately two months.

While it’s non-trivial to write a full C parser, C is (still) a relatively small language, so it’s tractable for one person; but it’s simply too much work to write a full C++ parser. Instead of using the unstable Clang C++ API, I used the stable Libclang C API to do all the parsing.

Use of AI

What helped a lot was using AI (strictly speaking, an LLM), specifically Google’s Gemini (because I’m too cheap to pay for Claude, especially for a personal project that I have no intention of making any money from). While I may write a follow-up blog post describing my experience, I’ll state briefly that AI saved me from having to read a lot of the documentation, read the tutorials, post questions to a mailing list or Stack Overflow, and wait for answers (if any).

My use of AI does not mean include-tidy was written by AI, certainly not copied verbatim (I don’t like its coding style); but the AI certainly pointed me in the right direction.

Configuration

Unlike iwyu, I wanted Tidy to be fully configurable via files. The choices these days are JSON, TOML, XML, and YAML. IMHO, the least bad of these is TOML. Given that choice, the next task was to be able to parse TOML files.

I looked for a TOML library with a C API and found a couple, but it wasn’t obvious which one was better. I also didn’t like the idea of having another dependency in addition to Libclang. So I decided to implement my own TOML library. I optimistically (naively?) thought “How hard can it be?” (Sound familiar?)

Actually, compared to implementing Tidy, implementing a TOML library was much easier. Part of that was because I didn’t have to implement a full TOML parser: Tidy simply doesn’t need things like dates, times, floating-point numbers, arrays of tables, inline tables, multi-line strings, or Unicode.

Tidy also implements configuration file “layering.” There can be a default, system-wide configuration file in /etc/xdg/include-tidy/config.toml (Tidy implements the XDG Base Directory Specification), a user-specific configuration file in ~/.config/include-tidy/config.toml, and a project-specific configuration file in the project’s source directory. “Layering” means that settings given in more local configuration files override the same settings in more system-wide configuration files.
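To make layering concrete, here’s a minimal sketch. It is not Tidy’s actual code: the layer_t type, its has_* flags, and layer_merge() are hypothetical, and only two of the settings (keep and first) are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical per-header settings for one configuration layer. Each
// setting has a flag saying whether that layer's file specified it at all.
typedef struct {
  bool has_keep,  keep;
  bool has_first, first;
} layer_t;

// Overlays the more local layer onto the more system-wide one: any setting
// the local layer specifies wins; everything else is inherited.
static layer_t layer_merge( layer_t const *wide, layer_t const *local ) {
  assert( wide != NULL && local != NULL );
  layer_t merged = *wide;
  if ( local->has_keep ) {
    merged.has_keep = true;
    merged.keep = local->keep;
  }
  if ( local->has_first ) {
    merged.has_first = true;
    merged.first = local->first;
  }
  return merged;
}
```

Merging the system-wide layer with the user layer, then merging that result with the project layer, gives the effective settings.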

Command-Line Parsing

One thing I like about iwyu is that it can be used as a drop-in replacement for either a C or C++ compiler, such as either clang or gcc, in that iwyu accepts all command-line options that the compiler accepts, in particular all -D (macro define) and -I (include path) options. This ensures that the code that iwyu “sees” is exactly the same as the compiler normally sees.

A problem with that is how do you specify iwyu-specific command-line options to iwyu itself? The solution iwyu used is that all such options are preceded by -Xiwyu. I decided to do the same for Tidy except use -Xtidy.

Normally, I’d just use getopt_long() as described here to parse command-line options; but implementing -Xtidy requires that the options be split into Tidy-specific options (with every occurrence of -Xtidy removed) and everything else to be passed as-is to Libclang.
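As a rough illustration of that split (a sketch, not Tidy’s actual code), the hypothetical split_args() below assumes that each -Xtidy marks exactly the next argument as Tidy-specific and that everything else goes to Libclang unchanged:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: splits argv[1..argc-1] into Tidy-specific arguments
// (each one preceded by -Xtidy on the command line) and everything else.
// Both output arrays must have room for at least argc pointers.
static void split_args( int argc, char const *const argv[],
                        char const *tidy_argv[], int *tidy_argc,
                        char const *clang_argv[], int *clang_argc ) {
  assert( argv != NULL );
  *tidy_argc = *clang_argc = 0;
  for ( int i = 1; i < argc; ++i ) {
    if ( strcmp( argv[i], "-Xtidy" ) == 0 ) {
      // Drop -Xtidy itself; the next argument (if any) is for Tidy.
      if ( ++i < argc )
        tidy_argv[ (*tidy_argc)++ ] = argv[i];
    }
    else {
      clang_argv[ (*clang_argc)++ ] = argv[i];   // passed through as-is
    }
  }
}
```

The Tidy-specific array can then be handed to getopt_long() while the other array is given to Libclang.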

Include Paths

Every C or C++ compiler knows what its default include directories are; Libclang needs to be told. The best thing to do is simply to ask clang via:

clang -E -v -xc - </dev/null 2>&1

where -E means run only the preprocessor stage, -v requests verbose output, -xc sets the language to C, and - reads from stdin, in this case from /dev/null. Among other things, clang fortuitously prints output like (on macOS):

#include <...> search starts here:
 /usr/local/include
 /opt/local/libexec/llvm-22/lib/clang/22/include
 /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include
End of search list.

clang can be called with that exact command via popen() and its output can be parsed looking for the #include <...> search starts here: and End of search list. lines. The directories listed between them can then be added as arguments to -isystem command-line options.
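Here’s a minimal sketch of the parsing step only (not Tidy’s actual code). The parse_include_dirs() function and the DIR_MAX limit are hypothetical; the function reads from any FILE*, so the stream could be the one returned by popen().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define DIR_MAX 1024    // assumed maximum line and path length for this sketch

// Hypothetical helper: reads clang's verbose output from `in` and copies
// each directory listed between the two marker lines into dirs[].
// Returns the number of directories found (at most max_dirs).
static size_t parse_include_dirs( FILE *in, char dirs[][DIR_MAX],
                                  size_t max_dirs ) {
  assert( in != NULL );
  char line[ DIR_MAX ];
  bool in_list = false;
  size_t n = 0;

  while ( n < max_dirs && fgets( line, sizeof line, in ) != NULL ) {
    line[ strcspn( line, "\n" ) ] = '\0';           // chop newline
    if ( !in_list ) {                               // still before the list
      in_list = strcmp( line, "#include <...> search starts here:" ) == 0;
      continue;
    }
    if ( strcmp( line, "End of search list." ) == 0 )
      break;
    char const *dir = line + strspn( line, " " );   // skip leading spaces
    if ( *dir != '\0' )
      strcpy( dirs[ n++ ], dir );                   // fits: dir is within line
  }
  return n;
}
```

A real implementation would need more care, e.g., clang on macOS also lists entries with a trailing “(framework directory)” annotation that this sketch doesn’t handle.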

More Command-Line Parsing

To complicate matters, the command-line arguments have to be pre-scanned to look for:

  • The last argument to get its filename extension to know what language, and thus which include directories, we need.

  • An -x option that specifies the language overriding the filename extension.

  • A --clang or -C option to allow the user to override the path to clang.

All this before ever calling getopt_long().
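The language part of that pre-scan might look something like the sketch below. It is not Tidy’s actual code: prescan_lang() is hypothetical, the extension-to-language table is a deliberately tiny assumption, and the last -x seen simply wins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: returns the language to use, e.g., "c" or "c++".
// An explicit -x option (either "-x c++" or "-xc++") overrides the filename
// extension of the last argument. Returns NULL if the language is unknown.
static char const* prescan_lang( int argc, char const *const argv[] ) {
  assert( argv != NULL );
  char const *lang = NULL;
  for ( int i = 1; i < argc; ++i ) {
    if ( strncmp( argv[i], "-x", 2 ) != 0 )
      continue;
    if ( argv[i][2] != '\0' )
      lang = argv[i] + 2;               // joined form: -xc++
    else if ( i + 1 < argc )
      lang = argv[++i];                 // separate form: -x c++
  }
  if ( lang != NULL || argc < 2 )
    return lang;

  // No -x: fall back to the extension of the last argument.
  char const *const ext = strrchr( argv[ argc - 1 ], '.' );
  if ( ext == NULL )
    return NULL;
  if ( strcmp( ext, ".c" ) == 0 || strcmp( ext, ".h" ) == 0 )
    return "c";
  if ( strcmp( ext, ".cpp" ) == 0 || strcmp( ext, ".hpp" ) == 0 )
    return "c++";
  return NULL;
}
```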

Sample include-tidy.toml Configuration File

Not surprisingly, I now eat my own dog food and use include-tidy to tidy itself and my other projects. Here’s include-tidy’s configuration file:

[config.h]
ignore-as-argument = true

[pjl_config.h]
first = true
keep = true
proxy = [
    "attribute.h",
    "config.h",
]

The config.h file is auto-generated by Autotools. Auto-generated files should not be run through Tidy since you’re not responsible for their contents. That’s what ignore-as-argument says; it still allows you to run Tidy easily on each file in *.h without having to make a special case to exclude config.h.

For my projects, I create a pjl_config.h file that wraps config.h, adds its own macros, and also includes attribute.h. This should always be the first file included (hence the first = true), should always be included because it affects other files included from Gnulib (hence the keep = true), and is a proxy for the other two, i.e., if pjl_config.h is included, then it’s as if attribute.h and config.h were included also.

Data Structures

The main data structure is one for an included file:

enum tidy_sort_rank {
  TIDY_SORT_FIRST       = -2, // The very first `#include`.
  TIDY_SORT_ASSOCIATED  = -1, // After first, but before default.
  TIDY_SORT_DEFAULT     =  0  // Default sort rank.
};

typedef struct  tidy_include    tidy_include;
typedef enum    tidy_sort_rank  tidy_sort_rank;

struct tidy_include {
  CXFile         file;              // File included.
  CXFileUniqueID file_id;           // Unique file ID.
  char const    *abs_path;          // Absolute path.
  char const    *rel_path;          // Relative path.
  tidy_include  *includer;          // Include including this.
  tidy_include  *proxy;             // Proxy include, if any.
  unsigned       depth;             // Include depth.
  array_t        lines;             // Line(s) included from.
  tidy_sort_rank sort_rank;         // Sorting rank.
  bool           elide;             // Elide if necessary?
  bool           keep;              // Keep if unnecessary?
  bool           is_local;          // Local include file?
  bool           is_needed;         // Include needed?
  bool           is_proxy_explicit; // Was proxy explicit?
  rb_tree_t      symbol_set;        // Symbols referenced.
};

rb_tree_t tidy_include_set;

where:

  • Anything with a CX prefix is from Libclang. CXFile is just an opaque handle to a file; CXFileUniqueID is an ID Libclang uses for unique file identification. I use it as the key for tidy_include_set (that uses rb_tree, an implementation of a Red-Black tree).

  • lines is an array of line numbers that the file was included from. If the length > 1, it means the file was erroneously included more than once and Tidy reports this.

  • Normally when printing all include files (when requested), Tidy follows proper header file etiquette and puts all local includes first, non-standard, non-local files (such as from a third-party library) second, and standard includes last; within each group, files are sorted alphabetically. The tidy_sort_rank allows files (like pjl_config.h) to be put first followed by a .c file’s associated .h file.

  • symbol_set is the set of symbols referenced from that include file. Tidy will print as many of those as will fit in a comment following the #include (unless told not to).

The rest of the fields should be relatively self-explanatory.
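To illustrate the ordering described above, here’s a sketch of a qsort() comparator. It is not Tidy’s actual code: inc_t and inc_group are simplified, hypothetical stand-ins for tidy_include, and the group is assumed to have been computed already.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical grouping: local first, third-party second, standard last.
typedef enum { GROUP_LOCAL, GROUP_THIRD_PARTY, GROUP_STANDARD } inc_group;

// Simplified, hypothetical stand-in for tidy_include.
typedef struct {
  char const *path;
  int         sort_rank;    // -2 = first, -1 = associated, 0 = default
  inc_group   group;
} inc_t;

// qsort() comparator: by sort rank, then by group, then alphabetically.
static int inc_cmp( void const *a, void const *b ) {
  assert( a != NULL && b != NULL );
  inc_t const *const ia = a, *const ib = b;
  if ( ia->sort_rank != ib->sort_rank )
    return ia->sort_rank < ib->sort_rank ? -1 : 1;
  if ( ia->group != ib->group )
    return ia->group < ib->group ? -1 : 1;
  return strcmp( ia->path, ib->path );
}
```

Because the rank is compared first, a file like pjl_config.h with rank -2 sorts ahead of every default-rank file regardless of group.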

The other main data structure is one for a referenced symbol:

typedef struct tidy_symbol tidy_symbol;

struct tidy_symbol {
  char const *name;                 // Symbol name.
};

static rb_tree_t symbol_set;        // Set of symbols.

Since the structure has only one member, it could be eliminated; but I kept it for future-proofing in case I ever want or need to add additional members.

Algorithm Part 1: Included Files

The first part of the algorithm is to iterate through the abstract syntax tree (AST) of the translation unit (TU) looking for #include directives. Libclang provides a clang_visitChildren() function that you pass a visitor function to.

The visitor function gets the file being included (the “included”) and the file including it (the “includer”), creates a new tidy_include object if the included hasn’t been seen before, and initializes depth (zero means the file was directly included) and the other members.

If the included file has been seen before, it’s possible that it was indirectly included before, but is directly included now. If that’s the case, then reset the included’s depth to zero and includer to NULL.
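The bookkeeping just described can be sketched without Libclang. This is not Tidy’s actual code: the simplified inc_t, the fixed-size array (in place of the Red-Black tree), lookup by path (in place of CXFileUniqueID), and the inc_find() and inc_record() helpers are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct inc inc_t;
struct inc {
  char const *path;
  inc_t      *includer;     // NULL if included directly
  unsigned    depth;        // 0 = included directly
};

#define INC_MAX 64          // arbitrary limit for this sketch
static inc_t  inc_set[ INC_MAX ];
static size_t inc_count;

// Linear search standing in for a tree lookup.
static inc_t* inc_find( char const *path ) {
  for ( size_t i = 0; i < inc_count; ++i )
    if ( strcmp( inc_set[i].path, path ) == 0 )
      return &inc_set[i];
  return NULL;
}

// Records that `includer` (NULL for the file being tidied) includes `path`.
static inc_t* inc_record( char const *path, inc_t *includer ) {
  assert( path != NULL );
  unsigned const depth = includer != NULL ? includer->depth + 1 : 0;
  inc_t *inc = inc_find( path );
  if ( inc == NULL ) {                  // first time seen: create it
    assert( inc_count < INC_MAX );
    inc = &inc_set[ inc_count++ ];
    *inc = (inc_t){ .path = path, .includer = includer, .depth = depth };
  }
  else if ( depth == 0 && inc->depth > 0 ) {
    inc->depth = 0;                     // was indirect, is now direct
    inc->includer = NULL;
  }
  return inc;
}
```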

For this part, there’s one other special case. Consider:

</usr/include/sys/wait.h>
  </usr/include/sys/signal.h>   // #define SIGSTOP 17
...
</usr/include/signal.h>
  </usr/include/sys/signal.h>

That is, a standard header (e.g., sys/wait.h) includes a non-standard header (e.g., sys/signal.h) so sys/signal.h’s includer is set to sys/wait.h.

Later, the standard header that corresponds to the non-standard header (e.g., signal.h) is included, and it also includes the non-standard header (sys/signal.h). At that point, sys/signal.h’s includer should be reset to signal.h. The reason is that, later still, sys/signal.h’s proxy will be set to signal.h rather than sys/wait.h, so signal.h will be considered the header that defines SIGSTOP (which is correct) instead of sys/wait.h (which is incorrect).

Algorithm Part 2: Include File Proxies

A second pass of all #include directives is needed to initialize implicit proxies. An include file p can be a proxy for an included file i only if:

  1. Both i and p are non-local includes, i.e., #include <...>.

  2. i is not a standard header.

  3. i is not directly included (depth > zero).
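Those three conditions translate fairly directly into a predicate. The sketch below is not Tidy’s actual code: inc_t is a simplified, hypothetical stand-in for tidy_include, is_standard is assumed to have been computed elsewhere, and the Gnulib exception is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Simplified, hypothetical stand-in for tidy_include.
typedef struct {
  bool     is_local;      // included via "..." rather than <...>?
  bool     is_standard;   // a C, C++, or POSIX standard header?
  unsigned depth;         // 0 = directly included
} inc_t;

// Returns true only if include file p may be an implicit proxy for
// included file i according to the three conditions above.
static bool can_proxy( inc_t const *p, inc_t const *i ) {
  assert( p != NULL && i != NULL );
  return !i->is_local && !p->is_local   // 1. both are non-local
      && !i->is_standard                // 2. i is not a standard header
      && i->depth > 0;                  // 3. i is not directly included
}
```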

There’s an exception at least when using Gnulib. Consider:

<../lib/stdlib.h>
  </usr/include/stdlib.h>

That is, a local implementation of a standard header (as is done when using Gnulib) eventually does a (non-standard) #include_next to include the real standard one. The local header (even though it has a standard name) should be a proxy for the real one.

Algorithm Part 3: Symbols Referenced

Now that all the include files have been found and proxied, a third pass through the AST is needed, this time looking at every symbol (identifier), figuring out where it’s declared, and checking whether the header file that does so is directly included. This turned out to be harder than I expected. A lot harder. Part of it is due to my naivete (“How hard can it be?”), part is due to quirks of C, and part is due to limitations and quirks of Libclang.

Complete vs. Incomplete Types

Consider the following declarations (borrowed from my cdecl project):

// c_type.h
struct c_type {
  // ...
};

// types.h
typedef struct c_type c_type_t;

// c_ast.h
struct c_ast {
  c_type_t type;
  // ...
};

// dump.h
void c_type_dump( c_type_t const *type, FILE *fout );

Suppose we’re tidying c_ast.h. Which header(s) does it need to include? Clearly, it needs types.h because it declares the typedef for c_type_t. But c_type.h is also needed because the complete type for c_type_t (namely struct c_type) is needed for the member declaration.

In contrast, now suppose we’re tidying dump.h. Which header(s) does it need to include? Only types.h because c_type_t is used only as part of a pointer declaration and C allows incomplete types to be used just fine in such cases. (C++ is the same, but also allows incomplete types for reference declarations.)
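The distinction can be seen in a single self-contained file. This is only an illustration: the id member and the two small functions are hypothetical and not part of cdecl.

```c
#include <assert.h>
#include <stddef.h>

// As in types.h: at this point, struct c_type is an incomplete type.
typedef struct c_type c_type_t;

// OK with an incomplete type: only a pointer to c_type_t is declared.
static int c_type_is_null( c_type_t const *type ) {
  return type == NULL;
}

// Not OK while the type is incomplete; if uncommented here, this would
// be a compile-time error because the member's size isn't known:
//
//      struct c_ast { c_type_t type; };

// As in c_type.h: this completes the type (the member is hypothetical).
struct c_type {
  int id;
};

// As in c_ast.h: now OK because c_type_t is complete.
struct c_ast {
  c_type_t type;
};

static int c_ast_type_id( struct c_ast const *ast ) {
  assert( ast != NULL );
  return ast->type.id;
}
```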

Macros

Preprocessor macros add a whole other dimension of complication. Consider the declaration:

// util.h
#define POINTER_CAST(T,EXPR)    ((T)(uintptr_t)(EXPR))

Because the macro references uintptr_t, util.h should include stdint.h so the user of the macro can treat it like a black box.

In Libclang, macro definitions don’t form part of the AST. Instead, when a definition is encountered, you have to get all the tokens comprising it and iterate over them. But first you have to parse function-like macros’ parameters, create a set of them, and exclude them from the tokens while iterating. You also have to exclude __VA_ARGS__ and __VA_OPT__.
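The exclusion step can be sketched independently of Libclang. This is not Tidy’s actual code: it assumes the identifier tokens of the macro’s body and its parameter names have already been extracted into arrays of strings, and is_excluded() and macro_refs() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Returns true only if `name` is one of the macro's parameters or one of
// the special identifiers that never count as a referenced symbol.
static bool is_excluded( char const *name,
                         char const *const params[], size_t n_params ) {
  assert( name != NULL );
  if ( strcmp( name, "__VA_ARGS__" ) == 0 ||
       strcmp( name, "__VA_OPT__" ) == 0 )
    return true;
  for ( size_t i = 0; i < n_params; ++i )
    if ( strcmp( name, params[i] ) == 0 )
      return true;
  return false;
}

// Copies into refs[] every identifier from the macro's body that is not
// excluded. refs[] must have room for n_idents pointers.
// Returns the number of identifiers copied.
static size_t macro_refs( char const *const idents[], size_t n_idents,
                          char const *const params[], size_t n_params,
                          char const *refs[] ) {
  size_t n = 0;
  for ( size_t i = 0; i < n_idents; ++i )
    if ( !is_excluded( idents[i], params, n_params ) )
      refs[ n++ ] = idents[i];
  return n;
}
```

For POINTER_CAST, the body’s identifiers are T, uintptr_t, and EXPR; with T and EXPR excluded as parameters, only uintptr_t remains as a referenced symbol.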

The Preprocessor’s ## Operator

Another limitation of Libclang is that symbols formed via the preprocessor’s ## (paste) operator aren’t “seen” by Libclang. Consequently, if such a symbol declared in a header is referenced and no other symbols from that header are referenced, Tidy won’t think the header is necessary. For example:

#include <readline/readline.h>

#define RL_PROMPT_IGNORE(SBUF, WHEN) \
   strbuf_putc( (SBUF), RL_PROMPT_ ## WHEN ## _IGNORE )

void prompt_create( strbuf_t *sbuf ) {
  RL_PROMPT_IGNORE( sbuf, START );
  // ...

The expansion of RL_PROMPT_IGNORE will create RL_PROMPT_START_IGNORE that’s declared in readline.h. If no other symbols from it are referenced, Tidy will incorrectly think the header is unnecessary.

As an alternative to avoiding ## here, you can add dummy code seen only by Tidy (and Libclang):

#ifdef __include_tidy__
void explicitly_reference_symbols() {
  (void)RL_PROMPT_START_IGNORE;
}
#endif

That is, Tidy implicitly defines __include_tidy__ when tidying. By explicitly referencing such a symbol directly via dummy code, Libclang will “see” it and Tidy will correctly think the header is necessary. Because such code is never compiled, the code can be anything (but it still has to be legal and should not generate warnings).

Conclusion

Tidy was an interesting (and sometimes frustrating) project to work on. I’m sure there are still bugs, likely more so with C++ code since most of my testing was with C code.
