
Simmypeet

Optimizing Graph Traversal in Query Based Incremental Compiler

Hi, this is my follow-up post on building a query-based incremental compiler. This has been my rabbit hole for the past few months.

I've been tackling one of the biggest challenges in my incremental compiler: the huge graph traversal problem.

Previously, I built an incremental compiler based on the Salsa model, the library that I believe rust-analyzer uses.

Here I'll briefly explain how my previous incremental compiler algorithm works, which is also how I believe the Salsa model works.

1.) The central query database maintains a monotonically increasing version number. This could be just an increasing i32 or whatever. It represents the version the database is in. Every time a modification is made to the database's input (e.g., updating a source file), the global version number is increased by one.

2.) Every time a query is computed and memoized to the database, it stores the result alongside the database's current version number. Technically, it stores two separate version numbers: the version when it was computed and the version when it was last verified.

3.) Every time a query is made to the database, the db checks whether the query has already been memoized. If not, the database just runs the query. If it's already memoized, it follows the next step.

4.) The db checks the version when the query was last verified against the current version of the db. Imagine: if the database has changed today but the query was last used yesterday, the query potentially has a stale result.

Based on this check, if the versions mismatch (stale), it performs reparation. Otherwise, if the versions match (up-to-date), it directly returns the memoized result.

5.) When a query repairs itself, it also has to make sure that its dependencies are up-to-date (their verification version numbers have to match the current database version). You'll notice this is essentially a recursive call over the graph, traversing from the root node down to the leaf nodes.

The pseudocode algorithm looks something like this:

def query(db, node):
    # already up-to-date
    if db[node].last_verified == db.version:
        return db[node].result

    repair(db, node)

    return db[node].result

def repair(db, node):
    # already up-to-date
    if db[node].last_verified == db.version:
        return

    # IMPORTANT: we recursively repair unconditionally 
    for dep in db[node].dependencies:
        repair(db, dep)

    need_recompute = False

    for dep in db[node].dependencies:
        # it means that the dependency has changed since it was last used
        if db[dep].last_computed > db[node].last_verified:
            need_recompute = True

    # this means no further re-computation, mark as verified and return
    if not need_recompute:
        db[node].last_verified = db.version
        return

    old_fingerprint = db[node].fingerprint

    # assume it recomputes and set the result and fingerprint
    recompute(db, node)

    if db[node].fingerprint != old_fingerprint:
        db[node].last_computed = db.version

    db[node].last_verified = db.version

Here's the problem: even though this model reduces O(n) re-computation to O(changes_made) re-computation (where n is the number of nodes in the graph), it still performs O(n) graph traversal even when you make a small change.

Here's an example:

DAG

Let's look at the above DAG (Directed Acyclic Graph), which currently happens to be a 🌳, representing the program that the database maintains. Here, every query was computed at version 0, hence the : v0 label attached to each node. For brevity, we only show when each query was last verified. In the actual implementation, it also stores when the query was computed; in this case, all of the queries were computed at version 0.

Let's say we make a change to query D, incrementing the database version to 1. Now, when we query G again, the db checks whether G's verification version matches the database's version. Here it mismatches, so query G has to be repaired, and in the process it also has to check whether its dependencies' verification versions match the database version. You can imagine this recursive process: it traverses through every node, ending up visiting all 7 nodes.
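To make that traversal cost concrete, here's a small runnable sketch of the timestamp-based algorithm above on the same seven-node DAG. The `Db` class and its method names are my own illustration, not Salsa's actual API; `modify` stands in for recomputing an input at a new version.

```python
# Timestamp-based (Salsa-style) verification: repair recurses into every
# dependency unconditionally, so we count how many nodes get visited.

class Node:
    def __init__(self, deps):
        self.deps = deps          # outgoing edges (dependencies)
        self.last_verified = 0    # version at which this node was last verified
        self.last_computed = 0    # version at which its result last changed

class Db:
    def __init__(self, graph):
        self.version = 0
        self.nodes = {name: Node(deps) for name, deps in graph.items()}
        self.traversed = []       # record of every node visited during repair

    def modify(self, name):
        # an input changed: bump the global version, mark its result as new
        self.version += 1
        self.nodes[name].last_computed = self.version

    def repair(self, name):
        node = self.nodes[name]
        if node.last_verified == self.version:
            return                # already up-to-date this version
        self.traversed.append(name)
        # IMPORTANT: repair every dependency unconditionally
        for dep in node.deps:
            self.repair(dep)
        # recompute only if some dependency changed since we last verified
        if any(self.nodes[d].last_computed > node.last_verified
               for d in node.deps):
            node.last_computed = self.version  # assume the result changed
        node.last_verified = self.version

# G -> {E, F}, E -> {A, B}, F -> {C, D}; A..D are leaves
graph = {"A": [], "B": [], "C": [], "D": [],
         "E": ["A", "B"], "F": ["C", "D"], "G": ["E", "F"]}
db = Db(graph)
db.modify("D")
db.repair("G")
print(db.traversed)  # all 7 nodes: ['G', 'E', 'A', 'B', 'F', 'C', 'D']
```

Even though only D changed, the repair pass walks the whole graph just to re-verify nodes that were never affected.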

To Adapton Style

The previous style is commonly called pull-based or timestamp-based verification. Since I encountered the problems above, I've been digging around the internet for whether anyone else has run into them and how they managed. It seems the rust-analyzer team also acknowledged this problem and came up with a durability system. I also found Adapton and its fine-grained dirty propagation system.

In Adapton, whenever an input changes, the modified node performs dirty propagation on its incoming edges in the graph, transitively. In Adapton, each edge in the graph also has a dirty flag associated with it. When dirty propagation occurs, it flips those flags to true, signaling potential invalidation.

Let's revisit the previous DAG example:

When we modify query D, its incoming edge is marked dirty, and this transitively propagates to the incoming edge of F, resulting in a graph looking something like this.

The graph above shows how it would look after dirty propagation.

Dirty propagation is active: when a query node is modified, dirty propagation happens immediately. This, I believe, is why people often call this the push model.

So, great! How does this dirty propagation help when we repair a query? Unlike the previous model, where a query has to recursively invoke reparation on all of its dependencies, this push model can selectively repair only the dependencies whose edges have the dirty flag set to true.

This makes total sense: E doesn't depend on any query that was modified, whereas F depends on D, which has been modified. Therefore, G skipping reparation on E is totally safe.

With this model, the only nodes traversed during reparation are G -> F -> D.
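Here's a minimal runnable sketch of this push-then-pull scheme on the same seven-node DAG. The `Graph` class and its method names are my own illustration, not Adapton's actual API: each edge carries a dirty flag, `modify` eagerly propagates dirt along reverse edges, and `repair` only recurses through dirty edges.

```python
# Adapton-style dirty propagation: push dirt eagerly on modification,
# then lazily repair by following only the dirty edges.

class Graph:
    def __init__(self, deps):
        self.deps = deps  # node -> list of dependencies (outgoing edges)
        # reverse edges so propagation can walk from a node to its dependents
        self.rdeps = {n: [] for n in deps}
        for n, ds in deps.items():
            for d in ds:
                self.rdeps[d].append(n)
        # one dirty flag per edge (dependent, dependency)
        self.dirty = {(n, d): False for n, ds in deps.items() for d in ds}
        self.traversed = []

    def modify(self, name):
        # push phase: transitively dirty every incoming edge of `name`
        for dependent in self.rdeps[name]:
            edge = (dependent, name)
            if not self.dirty[edge]:
                self.dirty[edge] = True
                self.modify(dependent)

    def repair(self, name):
        # pull phase: only follow edges that were marked dirty
        self.traversed.append(name)
        for d in self.deps[name]:
            if self.dirty[(name, d)]:
                self.repair(d)
                self.dirty[(name, d)] = False

graph = Graph({"A": [], "B": [], "C": [], "D": [],
               "E": ["A", "B"], "F": ["C", "D"], "G": ["E", "F"]})
graph.modify("D")        # dirties the edge F -> D, then G -> F
graph.repair("G")
print(graph.traversed)   # ['G', 'F', 'D']: E's subtree is never touched
```

Compared to the timestamp model, the repair pass visits 3 nodes instead of 7, at the cost of the eager backward walk and the reverse edges it requires.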

Therefore, Adapton has two distinct phases:

  • Dirty propagation: active; it happens immediately after a node is modified.

  • Repair: lazy; it happens when the user queries a node that has dirty outgoing edges.

Now, I want to quickly discuss the trade-offs between the two models.

  • Laziness: Salsa is lazier in the sense that when an input is modified, it just increments a version number on the DB. Adapton, in contrast, has to traverse the graph backward and perform dirty propagation.

  • Memory Usage: For Adapton's dirty propagation to traverse the graph backward efficiently, each node also has to maintain backward edges, resulting in higher memory usage.

  • Graph Traversal: In Adapton, the reparation phase usually needs less graph traversal than the Salsa model, thanks to finer-grained dirty tracking.

Super Node: Where Dirty Propagation Falls Apart

When I discovered Adapton, I thought my problem was solved. But before implementing my own incremental computation engine inspired by Adapton, I realized the graph traversal problem would come back, even with the help of dirty propagation, if my graph has a super node.

Let me show you my current setup. I'm building an incremental compiler, and my dependency graph currently looks something like this.

At the bottom, we have "src A" and "src B", representing the source files that the compiler uses as inputs.

One layer up, we have a huge query labelled "symbol table" that aggregates all the symbols defined in each source file into one big HashMap<ID, Symbol>.

Then, we have queries like "A" and "B" that extract a specific symbol's information from the symbol table's HashMap<ID, Symbol>. These queries are essentially just a hash-map lookup on the "symbol table" to extract the information of the symbol of interest.

Furthermore, we have some unnamed queries that depend on "A", "B", etc., representing computations that further use symbol information, such as type checking.

Finally, the "start" node acts as a root that starts the whole compilation system.

Let's say I modify one single symbol in "src A" (a very small change); dirty propagation kicks in. The incoming edge of the "src A" node is dirtied, and this transitively propagates to all the incoming edges of "symbol table", which is essentially every edge above the "symbol table". We're left with the dirty flag set to true almost everywhere.

Here, I colored the edges red to mark them as dirty.

The problem is that "symbol table" is a node with many incoming edges, and the "symbol table" query is very coarse-grained (very big).

Now, one could say this is a problem of poor dependency structure, and that I should restructure the dependencies to avoid this kind of problem, which is a very fair point.

Continuing with this, I'll say that the "symbol table" query is a super node and that queries like "A" and "B" are projection queries.

The "super node" refers to a node that have lots of outgoing edges. Where as, projection query is a query that reads a small portion of its dependency. For example, the query "A" just extracts one symbol from "symbol table".

I'll call this problem "over-dirty propagation".

In fact, Adapton has a feature designed specifically to combat this issue, called a firewall.

With that, I took inspiration from Adapton and adapted it to my engine. Here, I want to introduce an algorithm inspired by Adapton's firewall that combats this issue.

I'll introduce two new kinds of query. Normally, every query is "just a query"; now we differentiate queries into "Normal" (the default), "Firewall" (new), and "Projection" (new).

Firewall Query

We can mark a query as a "Firewall". A "Firewall" query's main responsibility is stopping dirty propagation from crossing it.

Here's how it works. Imagine the same compiler graph setup: if we modify "src A", the incoming edge of "src A" would be dirtied and then propagated to all of the incoming edges of "symbol table".

However, if we mark "symbol table" as a firewall, dirty propagation stops there, meaning none of the incoming edges of "symbol table" are dirtied.

Here, the outgoing edge from "symbol table" to "src A" is dirtied, but none of the incoming edges of "symbol table" are dirtied at all, because the firewall stops dirty propagation from crossing it.
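A small runnable sketch of how this propagation stop could look. The graph shape mirrors the compiler setup above; the `Graph` class, node names like `"table"` and `"ty_A"`, and the stopping rule are my own illustration of the idea, not the engine's real implementation.

```python
# Dirty propagation with a firewall: the edge into the firewall is dirtied,
# but the backward walk does not continue past the firewall node.

class Graph:
    def __init__(self, deps, firewalls):
        self.deps = deps                  # node -> list of dependencies
        self.firewalls = set(firewalls)
        self.rdeps = {n: [] for n in deps}
        for n, ds in deps.items():
            for d in ds:
                self.rdeps[d].append(n)
        self.dirty = {(n, d): False for n, ds in deps.items() for d in ds}

    def modify(self, name):
        for dependent in self.rdeps[name]:
            edge = (dependent, name)
            if not self.dirty[edge]:
                self.dirty[edge] = True
                # stop propagating once we have crossed into a firewall
                if dependent not in self.firewalls:
                    self.modify(dependent)

# compiler-shaped graph: "table" aggregates both sources, projections A/B
# read it, ty_A/ty_B use them (think: type checking), "start" is the root
deps = {"src A": [], "src B": [],
        "table": ["src A", "src B"],
        "A": ["table"], "B": ["table"],
        "ty_A": ["A"], "ty_B": ["B"],
        "start": ["ty_A", "ty_B"]}
g = Graph(deps, firewalls=["table"])
g.modify("src A")
dirty_edges = [e for e, flag in g.dirty.items() if flag]
print(dirty_edges)  # [('table', 'src A')]: nothing above the firewall
```

Without the firewall check in `modify`, the same change would dirty every edge above "table", which is exactly the over-dirty propagation problem.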

The Algorithm

Continuing from the previous graph with the firewall example: when we query "start" (at the top), it first sees that "symbol table" has a dirty outgoing edge, so it attempts to repair the firewall ("symbol table") before repairing the "start" query itself.

When the firewall "symbol table" is repaired, its result can either stay the same or change.

If the firewall "symbol table" is repaired with no changes, the firewall reparation is done. The reparation flow goes back to the "start" query, and since "start" has no dirty outgoing edges, the "start" query itself is considered up-to-date; the result of "start" is reused. Done!

In this happy path, where the firewall "symbol table" is recomputed and the result stays the same, there's only one dirty edge in total and only 3 nodes are traversed!

However, if the firewall "symbol table" is recomputed and the result changes, then dirty propagation starts at "symbol table"; in this case, the dirty flags eventually reach the "start" query's outgoing edges.

The image above shows the steps up to the point where the firewall "symbol table" is recomputed and the result changes.

The image above shows that, after the result change is detected, dirty propagation starts at the firewall.

Finally, once dirty propagation is done, the firewall reparation is considered finished. Reparation returns to the "start" query, which will see that its outgoing edges are dirtied.

To summarize, we modify the reparation algorithm to first repair the firewalls that a query depends on (directly or indirectly) before doing the actual reparation work on the query node itself.

Note that for the "start" node to detect that the firewall "symbol table" needs to be repaired first, it has to maintain an additional edge pointing directly to the firewall it depends on. This means every node in the graph carries additional edges pointing to the firewalls it directly or indirectly depends on.
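Here's a minimal runnable sketch of this two-step reparation on the happy path, where the firewall recomputes to the same result. All names here (`Firewall`, `repair_start`, etc.) are my own illustration of the idea, not the engine's real API.

```python
# Two-step reparation: "start" repairs the firewall it depends on first;
# dirty propagation resumes only if the firewall's result actually changed.

class Firewall:
    def __init__(self, compute):
        self.compute = compute
        self.result = compute()
        self.dirty = False            # does it have a dirty outgoing edge?

    def repair(self):
        """Recompute; return True only if the result actually changed."""
        self.dirty = False
        old, self.result = self.result, self.compute()
        return self.result != old

def repair_start(firewall, dirty_edges, trace):
    trace.append("start")
    # step 1: repair the firewall this node depends on (via its extra edge)
    if firewall.dirty:
        trace.append("symbol table")
        if firewall.repair():
            dirty_edges.add("start")  # propagation would resume from here
    # step 2: ordinary repair; with no dirty edges, reuse the memoized result
    if "start" not in dirty_edges:
        return "memoized result"
    return "recomputed result"

# e.g. a comment was tweaked in "src A": re-parsing yields the same symbols
firewall = Firewall(lambda: {"main": "fn main()"})
firewall.dirty = True                 # src A's edge into the firewall is dirty
trace = []
result = repair_start(firewall, set(), trace)
print(result)  # memoized result
print(trace)   # ['start', 'symbol table']
```

Because the firewall recomputed to an identical result, no dirt ever reached "start", and everything above the firewall was skipped.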

Projection Query

You can see that we haven't quite solved all the issues yet. In the previous example, when the firewall "symbol table" is repaired and the result changes, dirty propagation still happens and lots of nodes still get dirtied, which means the graph traversal problem isn't quite solved yet.

In some cases, only some entries in the "symbol table" actually change.

For example, if we modify symbol "A", located in source file "src A", then when we repair the firewall "symbol table", only the entry table["A"] changes, while entries like table["B"], table["C"], and so on stay the same.

Wouldn't it be nice to dirty only the incoming edge from "A" to "symbol table", like in the following image?

Yes, we could achieve this finer-grained dirty propagation!

Here's the twist: what if, when the firewall "symbol table" is repaired and the result changes, instead of starting dirty propagation at "symbol table" right away, we re-run the projection queries (like "A", "B", etc.)?

Then, based on those re-runs, dirty propagation starts from whichever projection queries' results changed, instead of from the firewall "symbol table".

Here's the step-by-step explanation, using the same example: modifying symbol "A", located in "src A":

The steps are basically the same as before, except:

  • All of the projection queries are re-run.

  • Only "A"'s result changed.

  • Dirty propagation starts at "A".
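The projection step can be sketched as a small runnable example. Here, `repair_firewall`, the stand-in symbol values, and the callback shape are my own illustration under the assumption that a projection is a cheap lookup into the firewall's result, not the engine's real API.

```python
# Projection queries: after the firewall's result changes, re-run each
# projection and start dirty propagation only from the ones that changed.

def repair_firewall(table, projections, propagate_dirty):
    """Re-run every projection; propagate dirt only from the changed ones."""
    changed = []
    for name, (project, old_result) in projections.items():
        new_result = project(table)       # e.g. a cheap hash-map lookup
        if new_result != old_result:
            projections[name] = (project, new_result)
            changed.append(name)
            propagate_dirty(name)         # start here, not at the firewall
    return changed

# symbol "A" changed in "src A"; symbol "B" did not
table = {"A": "fn A() -> i64", "B": "fn B() -> bool"}
projections = {
    "A": (lambda t: t["A"], "fn A() -> i32"),   # cached result is now stale
    "B": (lambda t: t["B"], "fn B() -> bool"),  # cached result still matches
}
dirtied = []
changed = repair_firewall(table, projections, dirtied.append)
print(changed, dirtied)  # ['A'] ['A']: only A's dependents get dirtied
```

Only the subtree above "A" sees any dirty flags; everything depending solely on "B" stays clean and is never traversed during the next reparation.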

Finally, we've achieved very fine-grained dirty propagation, and can thus keep graph traversal minimal during the reparation phase.

Trade-offs

Let's talk about the trade-offs, nothing comes for free.

"Everything is a trade-off" — Software Design and Architecture Prof.

  • Memory Usage: With this setup, every node has to maintain additional edges that point directly to the firewalls it depends on (directly or indirectly). If we're not careful, memory usage can grow to O(nodes × firewalls).

  • Unnecessary Recomputation: Recall that when the firewall's result changes during reparation, it unconditionally re-executes the incoming projection queries.

However, these are very fair trade-offs, considering that:

  • The number of firewall queries in the system is very small, so the memory overhead is acceptable.
  • The projection queries that get re-executed unconditionally are very fast: e.g., reading a variable, indexing an array, or a hash-map lookup.

Conclusion

We've explored ways to avoid large graph traversals in a query-based incremental compiler, using techniques like dirty propagation, stopping dirty propagation at firewall queries, and finer-grained dirty propagation through projection queries, all heavily inspired by Adapton. We trade some increased memory usage and potentially some unnecessary recomputation for avoiding massive graph traversal.

These concepts can also be extended further into things like chains of firewalls, projection queries that depend on two firewalls, and more.

I've also packaged this into a separate library, https://github.com/Simmypeet/qbice, if anyone is interested 😉.

NOTE that I haven't yet taken the time to properly write documentation for it. Most of the documentation is AI-generated; however, I've tried my best to check it for correctness.

Thank you for taking the time and interest to read this far 😘.
