Your PostgreSQL Already Has a Graph Engine — You Just Have to Build It

We Replaced Recursive CTEs with a C Traversal Framework and Got a ×207 Speedup

TL;DR: We built pg_igraph — a graph engine inside PostgreSQL as a C extension. The first working version used recursive CTEs. It took 47 seconds to traverse a 335K-node tree. The final version uses an in-memory adjacency list with pure-C BFS. The same query takes 227ms. Here's why CTEs are the wrong tool for this, and what the right tool looks like.


The Setup

We had graph-shaped data — users, relationships, hierarchies — and it lived in PostgreSQL. The standard answer is "use Neo4j," but that means a second database to deploy, back up, and keep in sync. For a graph that fits on one server, that felt like unnecessary complexity.

So we built pg_igraph: a PostgreSQL C extension with BFS traversal, bidirectional shortest path, and a small query language. The SQL API looks like this:

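-- Argument order, as used throughout this post (the exact signature is
-- inferred from the C-level parameters shown later):
--   graph_traverse(start_id, rel_type, outgoing, max_depth)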
SELECT * FROM graph_traverse(42, 'FOLLOWS', true, 5);
SELECT graph_shortest_path(42, 999, 'FOLLOWS');
SELECT igraph_query('MATCH (n:User)-[:FOLLOWS*1..3]->(m) RETURN m');

The data model is two partitioned tables:

nodes (id BIGSERIAL, label SMALLINT)
edges (from_id BIGINT, to_id BIGINT, rel_type SMALLINT, direction BOOL)
      PARTITION BY HASH(from_id)

Both forward and reverse edges are stored explicitly — direction = true for outgoing, direction = false for incoming. Traversal in both directions uses identical query plans.
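
For concreteness, here is a minimal DDL sketch. The post only pins down the columns and the HASH(from_id) partitioning; the 16-partition count matches the plans discussed later, and everything else (names, constraints) is illustrative:

CREATE TABLE nodes (
    id    BIGSERIAL PRIMARY KEY,
    label SMALLINT NOT NULL
);

CREATE TABLE edges (
    from_id   BIGINT   NOT NULL,
    to_id     BIGINT   NOT NULL,
    rel_type  SMALLINT NOT NULL,
    direction BOOL     NOT NULL  -- true = outgoing, false = incoming
) PARTITION BY HASH (from_id);

CREATE TABLE edges_p0 PARTITION OF edges
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ...and likewise edges_p1 through edges_p15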

Getting the schema right was the easy part. Getting the traversal fast took several attempts.


Attempt 1: Recursive CTE

The obvious first approach. PostgreSQL has built-in support for recursive queries:

WITH RECURSIVE bfs(node_id, d) AS (
    -- Base case: start node
    SELECT $1::bigint, 0
    UNION ALL
    -- Recursive step: expand one level
    SELECT e.to_id, b.d + 1
    FROM bfs b
    JOIN edges e ON e.from_id = b.node_id
        AND e.rel_type = $2::smallint
        AND e.direction = $3::bool
    WHERE b.d < $4::int
)
SELECT DISTINCT node_id FROM bfs;

This is clean and readable. It also took 47 seconds on a 335K-node tree.

Why? PostgreSQL's recursive executor works like this:

  1. Evaluate the base case, materialize results into a working table
  2. Join the working table with the recursive term, produce new rows
  3. Materialize those rows, repeat
  4. At the end: apply DISTINCT to collapse duplicates

The key problem is step 4. UNION ALL (required for depth tracking) produces duplicates — the same node can appear at multiple depths. PostgreSQL has no way to maintain a "visited" set across iterations, so every node at every depth flows through the pipeline and gets collapsed at the end. For a 335K-node tree with branching factor 6, the intermediate materialization at depth 7 alone contains 6^7 = 279,936 rows — many of them duplicates.
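
You can watch the blow-up directly by grouping the CTE's output by depth instead of applying DISTINCT. A diagnostic sketch, with literal values standing in for the parameters:

WITH RECURSIVE bfs(node_id, d) AS (
    SELECT 42::bigint, 0
    UNION ALL
    SELECT e.to_id, b.d + 1
    FROM bfs b
    JOIN edges e ON e.from_id = b.node_id
        AND e.rel_type = 1::smallint
        AND e.direction = true
    WHERE b.d < 7
)
SELECT d, count(*) AS rows_materialized
FROM bfs
GROUP BY d
ORDER BY d;

Every one of those rows is materialized into the working table, whether or not the node was already reached at a shallower depth.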

There is no optimization path out of this. The recursive CTE model fundamentally cannot do what a traversal framework does: maintain state across levels, skip already-visited nodes, and stop early.


Attempt 2: Per-Node SPI (First C Implementation)

We moved the BFS logic into a C extension using PostgreSQL's Server Programming Interface (SPI). The idea: maintain the BFS queue in C, issue one SQL query per node to fetch neighbors, track visited nodes in a C hash table.

// Pseudocode of the first C implementation
while (queue not empty) {
    cur_id = dequeue();

    SPI_execute_plan(plan_get_neighbors, [cur_id, rel_id, direction]);
    // process results, enqueue unvisited neighbors
}

This version correctly tracks visited nodes and never revisits one. The 47-second CTE query became... 19,531 SPI calls for a 19,531-node tree: one per node.

Each SPI round-trip costs ~0.04ms on our hardware (context switching, prepared plan execution, result materialization). That's 19,531 × 0.04ms ≈ 800ms for the small-scale tree. Better than 47 seconds, but still O(N) in SPI overhead.

For a 335K-node tree: 335,923 SPI calls → 47 seconds. The same total time as the CTE, for a different reason.


Attempt 3: Batch Per Level with ANY($1)

Instead of one SPI call per node, fetch neighbors for an entire BFS frontier level in one query:

SELECT from_id, to_id FROM edges
WHERE from_id = ANY($1::bigint[]) AND rel_type = $2 AND direction = $3

Pass the entire frontier as a bigint array. One SPI call per depth level instead of one per node. For a 335K-node tree with 7 depth levels: 7 SPI calls.
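
Building that array on the C side is a few lines with construct_array. A minimal sketch, assuming the frontier lives in a plain int64 buffer (the function and variable names here are illustrative, not pg_igraph's actual code):

#include "postgres.h"
#include "catalog/pg_type.h"   /* INT8OID */
#include "utils/array.h"       /* construct_array */
#include "executor/spi.h"

/* Run the per-level batch query for one BFS frontier. */
static void
expand_level(SPIPlanPtr batch_plan, int64 *frontier, int n,
             int16 rel_id, bool direction)
{
    Datum *elems = palloc(n * sizeof(Datum));

    for (int i = 0; i < n; i++)
        elems[i] = Int64GetDatum(frontier[i]);

    /* int8: 8 bytes, pass-by-value on 64-bit builds, 'd' alignment */
    ArrayType *arr = construct_array(elems, n, INT8OID, 8,
                                     FLOAT8PASSBYVAL, 'd');

    Datum args[3] = { PointerGetDatum(arr),
                      Int16GetDatum(rel_id),
                      BoolGetDatum(direction) };

    SPI_execute_plan(batch_plan, args, NULL, true, 0);
    /* SPI_tuptable now holds (from_id, to_id) pairs for the whole level */
}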

Results:

  • Tree 335K: 47s → 2.7s
  • Chain 1K: 42ms → 155ms ✗ (×3.7 regression)

The regression on chains revealed a fundamental issue with ANY($1) on HASH-partitioned tables.

The partition pruning trap. The edges table is partitioned by HASH(from_id). For a point lookup WHERE from_id = $1, PostgreSQL can compute HASH($1) at planning time and target exactly one partition. For WHERE from_id = ANY($1::bigint[]), it cannot — the array contents are unknown at plan time, so it generates a plan that scans all 16 partitions and filters.

On a chain traversal with frontier size 1, this means: 1,000 depth levels × 16 partition scans × (full partition read + filter) = significant wasted I/O.
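
You can reproduce the trap interactively. A diagnostic sketch (literal values are placeholders; force_generic_plan stands in for the generic plan a long-lived prepared SPI statement converges on):

-- Point lookup: the planner hashes the literal and prunes to one partition
EXPLAIN SELECT to_id FROM edges
WHERE from_id = 42 AND rel_type = 1 AND direction = true;

-- Parameterized array: the generic plan has to cover every possible array
SET plan_cache_mode = force_generic_plan;
PREPARE probe(bigint[]) AS
    SELECT to_id FROM edges
    WHERE from_id = ANY($1) AND rel_type = 1 AND direction = true;
EXPLAIN EXECUTE probe('{42}');
-- compare how many edges_p* partitions appear in each plan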


Attempt 4: LATERAL unnest — Restoring Partition Pruning

The fix: use LATERAL with unnest() instead of ANY.

SELECT f.id, e.to_id
FROM unnest($1::bigint[]) AS f(id)
JOIN LATERAL (
    SELECT to_id FROM edges
    WHERE from_id = f.id AND rel_type = $2 AND direction = $3
) AS e ON true

LATERAL forces a Nested Loop plan. For each element from unnest(), PostgreSQL executes the inner query independently — with full HASH partition pruning on from_id = f.id.

EXPLAIN output:
Nested Loop
  -> Function Scan on unnest
  -> Index Only Scan on edges_pN
       Index Cond: (from_id = f.id) AND (rel_type = ...)

Results:

  • Tree 335K: 47s → 46ms ✓ (×1020 improvement)
  • Chain 1K: 42ms → 435ms ✗ (×10 regression)

LATERAL unnest has per-call overhead: the setup cost of unnest() and the nested loop dwarfs a single index seek when the frontier is one element. 1,000 depth levels × (unnest overhead) = a visible regression on chains.

The insight: neither ANY nor LATERAL is universally better. The right tool depends on frontier size.


Attempt 5: Hybrid Dispatch

if (frontier_size == 1) {
    // Simple prepared statement: single index seek, no array overhead
    SPI_execute_plan(plan_get_neighbors, [cur_id, rel_id, direction]);
} else {
    // LATERAL unnest: one round-trip for the whole frontier
    SPI_execute_plan(plan_get_neighbors_batch, [array, rel_id, direction]);
}

This recovered chain performance while keeping the tree improvement. But the fundamental issue remained: for any graph where frontier eventually explodes (trees, random dense graphs), you're still paying SPI overhead per level, plus the cost of building and passing progressively larger arrays.


The Right Approach: Load Once, Traverse in C

The insight from all the failed attempts: as long as BFS is driven by SQL, you're fighting the impedance mismatch between a set-based query engine and an iterative graph algorithm.

pgRouting, the PostgreSQL routing extension, solved this the right way years ago: load the graph into memory, route in C. We needed to do the same.

One SPI call loads all edges for the requested rel_type into a C-level adjacency hash map:

// Build adjacency list: from_id → int64[] neighbors
static AdjList *
build_adj_list(int16 rel_id, bool direction, MemoryContext ctx)
{
    Datum args[] = { Int16GetDatum(rel_id), BoolGetDatum(direction) };
    SPI_execute_plan(plan_load_all_edges, args, NULL, true, 0);
    // Copy (from_id, to_id) pairs out of SPI tuptable
    // Build HTAB: from_id → {neighbors[], n, cap}
    // Return AdjList in ctx
}
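Fleshed out a little, this is ordinary dynahash plumbing. A sketch under the same assumptions as the pseudocode above (the AdjNode layout and growth policy are illustrative):

#include "postgres.h"
#include "executor/spi.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"

typedef struct AdjNode
{
    int64   from_id;        /* hash key: must be the first field */
    int64  *neighbors;
    int     n;
    int     cap;
} AdjNode;

/* Consume SPI_tuptable (from_id, to_id) rows into an adjacency HTAB. */
static HTAB *
adj_htab_from_spi(MemoryContext ctx)
{
    HASHCTL ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(int64);
    ctl.entrysize = sizeof(AdjNode);
    ctl.hcxt = ctx;

    HTAB *htab = hash_create("pg_igraph adjacency", 65536, &ctl,
                             HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

    for (uint64 i = 0; i < SPI_processed; i++)
    {
        HeapTuple tup = SPI_tuptable->vals[i];
        bool      isnull;
        int64 from = DatumGetInt64(SPI_getbinval(tup, SPI_tuptable->tupdesc, 1, &isnull));
        int64 to   = DatumGetInt64(SPI_getbinval(tup, SPI_tuptable->tupdesc, 2, &isnull));
        bool  found;

        AdjNode *node = hash_search(htab, &from, HASH_ENTER, &found);
        if (!found)
        {
            node->n = 0;
            node->cap = 4;
            node->neighbors = MemoryContextAlloc(ctx, node->cap * sizeof(int64));
        }
        else if (node->n == node->cap)
        {
            node->cap *= 2;
            node->neighbors = repalloc(node->neighbors, node->cap * sizeof(int64));
        }
        node->neighbors[node->n++] = to;
    }
    return htab;
}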

Then BFS runs entirely in C over the hash map:

// After build_adj_list, SPI is closed. Zero SQL during traversal.
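// (the start node is enqueued, marked visited, and appended to the
//  result array before this loop begins)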
while (queue not empty) {
    cur_id = dequeue();

    AdjNode *node = hash_search(adj->htab, &cur_id, HASH_FIND, &found);
    if (!found) continue;

    for (int i = 0; i < node->n; i++) {
        int64 nbr = node->neighbors[i];
        hash_search(visited, &nbr, HASH_ENTER, &found);
        if (found) continue;
        enqueue(nbr);
        result[res_size++] = nbr;
    }
}

For the 335K-node tree: 2 SPI calls total (rel_id lookup + edge load), then 335K hash table lookups in C. No SQL during traversal.

Results vs CTE baseline:

                      Recursive CTE   Load-all + C BFS
  Tree 335K, full         47,000 ms             227 ms
  Chain 10K, full           ~400 ms              72 ms

But Load-All Has a Fixed Cost

Loading 335K edges takes ~50-80ms on HDD even with a good index. For a query that only needs 8 results (find ancestors of a leaf node, depth=6), this is wasteful.

The signal isn't query depth — it's frontier size. A chain of depth 10,000 has frontier=1 at every level. A 6-branch tree hits frontier=7,776 by level 5. When frontier is small, per-level SPI is cheaper. When frontier explodes, load-all pays off immediately.

The final implementation starts per-level and switches at runtime:

#define ADAPTIVE_FRONTIER_THRESHOLD 200

while (depth < max_depth && frontier_size > 0) {
    expand_frontier_with_spi();  // per-level batch or single
    depth++;

    if (frontier_size > ADAPTIVE_FRONTIER_THRESHOLD) {
        // Frontier is growing fast — load-all will be net cheaper
        adj = build_adj_list(rel_id, direction, work_ctx);
        SPI_finish();  // no more SQL
        switched = true;
        break;
    }
}

if (switched)
    bfs_pure_c(adj, visited, frontier, depth, max_depth);

The visited HTAB is shared between phases. Nodes discovered in Phase 1 are already marked when Phase 2 starts.

Final benchmark — medium scale, HDD server:

  Test                             Recursive CTE   Final (v10)   Factor
  BFS tree 335K, full traversal        47,000 ms        227 ms   ×207
  Shortest path, 10K chain                618 ms         49 ms   ×12
  BFS chain, depth=100                     14 ms         16 ms   ≈ same
  BFS tree, depth=3                        12 ms        3.6 ms   ×3
  Reverse BFS, 8 ancestors                  6 ms          6 ms   ≈ same
  Query language MATCH, depth=10            3 ms          3 ms   ≈ same

One More Thing: The Covering Index

build_adj_list runs:

SELECT from_id, to_id FROM edges WHERE rel_type = $1 AND direction = $2

The table is partitioned by HASH(from_id). rel_type and direction have no relationship to the partition key — without a dedicated index, this scans all 16 partitions and all rel_types, then filters. With 1.1M total edges across three rel_types, that reads everything to return 200K rows.

The fix: one covering index per partition:

CREATE INDEX ON edges_pN (rel_type, direction) INCLUDE (from_id, to_id);

An index-only scan that reads only the requested rel_type. graph_shortest_path on a 10K chain dropped from 618 ms to 49 ms on the strength of this single index.
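
A quick way to confirm the index is doing its job (literal values illustrative):

EXPLAIN (ANALYZE, BUFFERS)
SELECT from_id, to_id FROM edges
WHERE rel_type = 1::smallint AND direction = true;
-- expect an Index Only Scan per partition, and "Heap Fetches: 0"
-- on a freshly VACUUMed table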


Conclusion

The recursive CTE approach is appealing — it's SQL, it's readable, it feels like it should work. But PostgreSQL's recursive executor is fundamentally an iterative set processor, not a traversal framework. It cannot maintain visited state across iterations, cannot skip already-explored nodes, and collapses duplicates only at the end. For graphs of any meaningful size, this becomes unacceptable.

Moving BFS into C and loading the subgraph once (the same approach pgRouting has used for years) resolves the impedance mismatch. The traversal is just pointer arithmetic in a hash table.

The adaptive frontier-based switching gives good performance across all workload shapes: narrow traversals (chains, ancestor lookups) pay no preload cost; wide traversals (trees, dense random graphs) preload once and traverse in C.


pg_igraph is open source under Apache 2.0:
👉 https://github.com/ineron/pg_igraph

The companion pg_ilib binary property serialization extension is in a separate repository.


Tags: postgres cprogramming database graphs performance opensource
