Prajwal Gaonkar

How to Store Code Intelligence Graphs Efficiently

There are generally three practical approaches people consider:

  • Store dependencies as adjacency lists (custom structure)
  • Use a relational database (e.g. PostgreSQL)
  • Use a graph database

This post breaks down how each approach behaves across:

  • Creation
  • Updates
  • Deletion
  • Maintenance
  • Storage internals
  • Time complexity

The Problem: Code is a Graph

Example: Function Structure in a Real Repo

Consider a simple codebase:

// auth.js
export function validateUser() {
  checkAuth();
  logAccess();
}

function checkAuth() {
  dbQuery();
}

// db.js
export function dbQuery() {
  connectDB();
}

function connectDB() {}

function logAccess() {}

This creates a dependency graph:

validateUser → checkAuth → dbQuery → connectDB
validateUser → logAccess

How Each Approach Stores This

Adjacency List

{
  "validateUser": ["checkAuth", "logAccess"],
  "checkAuth": ["dbQuery"],
  "dbQuery": ["connectDB"],
  "connectDB": [],
  "logAccess": []
}

Graph Database

(validateUser)-[:CALLS]->(checkAuth)
(validateUser)-[:CALLS]->(logAccess)
(checkAuth)-[:CALLS]->(dbQuery)
(dbQuery)-[:CALLS]->(connectDB)

Approach 1: Adjacency List

Creation

adj["validateUser"] = ["checkAuth", "logAccess"];
adj["checkAuth"] = ["dbQuery"];
adj["dbQuery"] = ["connectDB"];
adj["connectDB"] = [];
adj["logAccess"] = [];

Updating

// Add new dependency
adj["validateUser"].push("newFunc");

// If you need reverse lookup, you must maintain an extra structure —
// and remember to initialise the entry before pushing:
(reverse["newFunc"] ||= []).push("validateUser");

Deletion

adj["validateUser"] = adj["validateUser"].filter(f => f !== "logAccess");

// Also clean the reverse map (guarding against a missing entry)
reverse["logAccess"] = (reverse["logAccess"] || []).filter(f => f !== "validateUser");

Maintenance Problems

  • Need to maintain forward + reverse maps
  • Easy to get inconsistent
  • Hard to track multi-hop dependencies
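The consistency problem above can be mitigated by routing every mutation through a pair of helpers. A minimal sketch (the helper names `addEdge`/`removeEdge` are illustrative, not from the post):

```javascript
// Keep forward and reverse maps in sync by mutating both in one place.
const adj = {};
const reverse = {};

function addEdge(from, to) {
  (adj[from] ||= []).push(to);      // forward: who `from` calls
  (reverse[to] ||= []).push(from);  // reverse: who calls `to`
}

function removeEdge(from, to) {
  adj[from] = (adj[from] || []).filter((f) => f !== to);
  reverse[to] = (reverse[to] || []).filter((f) => f !== from);
}

addEdge("validateUser", "checkAuth");
addEdge("validateUser", "logAccess");
removeEdge("validateUser", "logAccess");
```

This doesn't remove the bookkeeping — it just concentrates it, which is exactly the kind of graph logic a graph database handles for you.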

Storage Internals

  • Stored as plain objects / JSON
  • Relationships = strings
  • No indexing on relationships

Traversal = manual DFS/BFS every time
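That manual traversal can be sketched as a plain DFS over the adjacency list from the example above (a minimal sketch, assuming the same function names):

```javascript
const adj = {
  validateUser: ["checkAuth", "logAccess"],
  checkAuth: ["dbQuery"],
  dbQuery: ["connectDB"],
  connectDB: [],
  logAccess: [],
};

// Collect every function reachable from `start` — its transitive dependencies.
function transitiveDeps(graph, start) {
  const seen = new Set();
  const stack = [...(graph[start] || [])];
  while (stack.length > 0) {
    const fn = stack.pop();
    if (seen.has(fn)) continue;
    seen.add(fn);
    stack.push(...(graph[fn] || []));
  }
  return [...seen];
}

console.log(transitiveDeps(adj, "validateUser").sort());
// [ 'checkAuth', 'connectDB', 'dbQuery', 'logAccess' ]
```

You write (and maintain) this loop yourself for every traversal pattern — forward, reverse, depth-limited, and so on.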


Approach 2: PostgreSQL (Relational Model)

Representation

functions(id, name)
calls(from_id, to_id)

Creation

INSERT INTO functions (id, name) VALUES (1, 'validateUser');
INSERT INTO functions (id, name) VALUES (2, 'checkAuth');
INSERT INTO calls (from_id, to_id) VALUES (1, 2);

Updating

INSERT INTO calls (from_id, to_id) VALUES (1, 6);

Deletion

DELETE FROM calls WHERE from_id = 1 AND to_id = 5;

How Traversal Works (Joins)

In PostgreSQL, related rows are not physically linked. Each relationship is reconstructed at query time using JOIN operations.

To traverse:

A → B → C → D

PostgreSQL performs multiple JOINs:

SELECT f1.name, f2.name, f3.name
FROM functions f1
JOIN calls c1 ON f1.id = c1.from_id
JOIN functions f2 ON f2.id = c1.to_id
JOIN calls c2 ON f2.id = c2.from_id
JOIN functions f3 ON f3.id = c2.to_id;

What a JOIN Does Here

  • Combines rows from multiple tables
  • Matches foreign keys (from_id → to_id)
  • Builds intermediate result sets step by step

Each JOIN is essentially reconstructing the graph edge manually.


Why Joins Become Expensive (in this use case)

  • Each hop requires another JOIN
  • Intermediate results grow at each step
  • Work increases with graph depth

So cost becomes:

proportional to number of joins × intermediate result size
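A rough way to see this cost model: treat `calls` as a flat table and simulate a k-hop query where every hop re-scans the table, the way an unindexed join would. This is a deliberately naive sketch (real PostgreSQL uses indexes and smarter join strategies), just to make the "work per hop" visible:

```javascript
// Edges stored as flat rows, like the `calls(from_id, to_id)` table.
const calls = [
  { from: "validateUser", to: "checkAuth" },
  { from: "validateUser", to: "logAccess" },
  { from: "checkAuth", to: "dbQuery" },
  { from: "dbQuery", to: "connectDB" },
];

// One simulated JOIN: match every frontier value against every edge row.
function hop(frontier) {
  return calls.filter((e) => frontier.includes(e.from)).map((e) => e.to);
}

let frontier = ["validateUser"];
let rowsScanned = 0;
for (let k = 0; k < 3; k++) {
  rowsScanned += calls.length; // each hop scans the whole calls table
  frontier = hop(frontier);
}
console.log(frontier, rowsScanned); // [ 'connectDB' ] 12
```

Three hops over four edges already scans twelve rows; the same pattern on a large codebase is where join cost bites.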


Storage Internals

  • Stored as rows in tables
  • Relationships represented via foreign keys
  • No direct pointer traversal

Traversal = repeated JOIN operations


Time Complexity

  • Single hop (with index): O(log N)
  • Multi-hop traversal: O(k × join_cost)

Where:

  • k = number of hops
  • join_cost grows with data size

Approach 3: Graph Database

Representation

(validateUser)-[:CALLS]->(checkAuth)
(validateUser)-[:CALLS]->(logAccess)
(checkAuth)-[:CALLS]->(dbQuery)
(dbQuery)-[:CALLS]->(connectDB)

Creation

CREATE (a:Function {name: "validateUser"})
CREATE (b:Function {name: "checkAuth"})
CREATE (a)-[:CALLS]->(b)

Updating

MATCH (a:Function {name: "validateUser"})
CREATE (a)-[:CALLS]->(:Function {name: "newFunc"})

Deletion

MATCH (a:Function {name: "validateUser"})-[r:CALLS]->(b:Function {name: "logAccess"})
DELETE r

Maintenance

  • No reverse mapping needed
  • Relationships are first-class
  • Consistency handled by DB

Storage Internals

  • Nodes stored separately
  • Relationships stored as pointers (node → relationship → node)
  • Index-free adjacency

Traversal = follow pointers directly (no joins, no scans)
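The pointer-chasing idea can be modelled in plain JavaScript: each node object holds direct references to its callees, so a hop is a single dereference rather than a key lookup or a join. A toy model of index-free adjacency (the `walk` helper is illustrative):

```javascript
// Each node carries pointers (object references) to the nodes it calls.
const connectDB = { name: "connectDB", calls: [] };
const dbQuery = { name: "dbQuery", calls: [connectDB] };
const logAccess = { name: "logAccess", calls: [] };
const checkAuth = { name: "checkAuth", calls: [dbQuery] };
const validateUser = { name: "validateUser", calls: [checkAuth, logAccess] };

// Follow the first outgoing CALLS edge up to k times — O(1) per hop.
function walk(node, k) {
  let current = node;
  for (let i = 0; i < k && current.calls.length > 0; i++) {
    current = current.calls[0];
  }
  return current.name;
}

console.log(walk(validateUser, 3)); // "connectDB"
```

No map lookup, no table scan: each hop is just `current.calls[0]`, which is the essence of what a graph database's storage layer does on disk.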


Why Graph Databases Win: How Traversal Works

In adjacency lists, traversal looks like:

A → B → C → D

But internally:

  • You look up A in the map
  • Then look up B by key
  • Then look up C by key
  • Repeat for every hop

Each hop is a separate key lookup, and every traversal pattern is a hand-written DFS/BFS.


In graph databases:

A → (pointer) → B → (pointer) → C → (pointer) → D

Traversal becomes:

follow pointer → follow pointer → follow pointer

No searching. No joins.


What This Means

  • Each hop is constant time
  • Traversal cost depends on path length, not total data
  • Even large graphs remain efficient

Time Complexity

Adjacency List

  • Direct lookup: O(1)
  • Traversal (DFS/BFS): O(V + E)
  • Reverse lookup: O(V + E) (unless extra structure maintained)

Graph Database

  • Starting node lookup: O(log N)
  • Each hop traversal: O(1)

Total traversal:

O(k)
(where k = number of relationships followed)


Key Insight

  • Adjacency List → O(V + E) for a full traversal, plus manual bookkeeping for reverse lookups
  • Relational DB → O(k × join_cost) for a k-hop path
  • Graph DB → O(k) for a k-hop path

This is why graph databases scale much better for deep, connected data.


Final Takeaway

At small scale, all approaches work.

At large scale:

Either you maintain graph logic yourself, or you use a system built for graphs.

Graph databases like Neo4j are best suited for problems where relationships and multi-hop traversals are core to the system.

They remove the complexity of managing relationships manually and make traversal efficient and natural.
