WonderLab

Posted on Jun 19

Open Source Project of the Day (#99): codebase-memory-mcp — A Knowledge Graph That Gives AI Agents Structural Memory of Your Codebase

#ai #opensource #mcp #codegraph

Introduction

"AI agents explore codebases by reading every file — consuming 412,000 tokens. A knowledge graph query answers the same question in 3,400 tokens."

This is article #99 in the Open Source Project of the Day series. Today's project is codebase-memory-mcp — a pure-C code intelligence MCP server that builds a persistent knowledge graph from your codebase.

When you use Claude Code on a medium-to-large project, how does the agent understand the code structure? Typically: read the directory tree, read key files, follow references, read more files. Each step consumes tokens. Each new session starts over. Large codebases hit context limits before the agent has a useful picture.

codebase-memory-mcp takes a different approach: extract the codebase's structural information into a persistent knowledge graph stored in SQLite, then let agents query the graph rather than reading files. The roughly 120× token difference (412,000 vs. 3,400 in the example above) follows from that design change: structural queries are concise; file contents are not.
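
As a quick check on that ratio, here is the arithmetic using only the two token counts quoted at the top of this article. Nothing here is measured; the variable names are just for illustration.

```python
# Token counts quoted in the article's opening line.
FILE_READ_TOKENS = 412_000
GRAPH_QUERY_TOKENS = 3_400

# How many times more tokens the file-reading approach uses.
ratio = FILE_READ_TOKENS / GRAPH_QUERY_TOKENS

# Absolute number of tokens saved for this one question.
saved = FILE_READ_TOKENS - GRAPH_QUERY_TOKENS

print(f"ratio: {ratio:.1f}x")
print(f"tokens saved: {saved:,}")
```

The ratio comes out slightly above 120, which is where the rounded "120×" figure comes from.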

What You'll Learn

The knowledge graph data model: what node types and edge types represent
The two-layer parsing architecture: Tree-sitter syntactic layer + Hybrid LSP semantic layer
All 14 MCP tools and their functions
Performance numbers: how long indexing takes at Linux-kernel scale
Team workflow: sharing a compressed graph artifact via git
Security design: SLSA Level 3, Sigstore signing, VirusTotal scans

Prerequisites

Experience with Claude Code or another MCP-enabled AI coding tool
Familiarity with codebase structure concepts (functions, classes, call graphs)
Basic understanding of the MCP protocol

Project Background

What Is codebase-memory-mcp?

codebase-memory-mcp is a code intelligence MCP server that builds a persistent knowledge graph from codebase structure, enabling AI agents to understand code through structured queries rather than file reads.

"Knowledge graph" is the precise term here. Nodes are code structure elements — files, classes, functions, routes, resources. Edges are structural relationships — calls, inheritance, imports, HTTP calls, data flows. The entire graph lives in a SQLite database, queryable through a Cypher-style graph query language.

The project has academic backing (arXiv:2603.27277) and is among the earliest high-quality MCP servers built after Anthropic released the protocol.
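
To make the "nodes and edges in SQLite" idea concrete, here is a minimal sketch of how such a graph could be laid out in two tables. The table names, column names, and the `callers_of` helper are assumptions made for illustration; they are not the project's actual schema.

```python
import sqlite3

# Hypothetical two-table layout: NOT the project's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL,   -- e.g. 'File', 'Class', 'Function', 'Route'
    name  TEXT NOT NULL
);
CREATE TABLE edges (
    src  INTEGER NOT NULL REFERENCES nodes(id),
    dst  INTEGER NOT NULL REFERENCES nodes(id),
    type TEXT NOT NULL     -- e.g. 'CALLS', 'IMPORTS', 'INHERITS'
);
CREATE INDEX edges_by_dst ON edges (dst, type);
""")

conn.executemany("INSERT INTO nodes (id, label, name) VALUES (?, ?, ?)", [
    (1, "Function", "main"),
    (2, "Function", "load_config"),
    (3, "Function", "run_server"),
])
conn.executemany("INSERT INTO edges (src, dst, type) VALUES (?, ?, ?)", [
    (1, 2, "CALLS"),
    (1, 3, "CALLS"),
    (3, 2, "CALLS"),
])

def callers_of(name):
    """Return the names of all functions with a CALLS edge into `name`."""
    rows = conn.execute("""
        SELECT caller.name
        FROM edges
        JOIN nodes AS caller ON caller.id = edges.src
        JOIN nodes AS callee ON callee.id = edges.dst
        WHERE edges.type = 'CALLS' AND callee.name = ?
        ORDER BY caller.name
    """, (name,)).fetchall()
    return [row[0] for row in rows]

print(callers_of("load_config"))
```

A "who calls this function?" question becomes one indexed join that returns a handful of names, which is the kind of compact answer the token numbers above rely on.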

Author / Team

Organization: DeusData
Language: Pure C (zero runtime dependencies)
License: MIT
Latest version: v0.8.1
Test count: 5,604

Project Stats

⭐ GitHub Stars: 5,400+
🍴 Forks: 491+
📄 License: MIT
🔬 Paper: arXiv:2603.27277

Core Features

What It Does

Traditional (file-by-file reading):
AI Agent → read file1.py → read file2.py → read file3.py → ...
              ↓
           ~412,000 tokens, repeats every session, hits context limit

Knowledge graph approach:
AI Agent → query_graph("MATCH (f:Function)-[:CALLS]->(g)...")
              ↓
           ~3,400 tokens, results from persistent graph, sub-millisecond

Use Cases

Large codebase onboarding: Understand a new codebase quickly through graph queries, without reading every file
Refactoring: Find every caller of a function with trace_path, then confirm the scope of a change before making it
Dead code detection: Identify isolated functions that no call chain reaches
Architecture analysis: Use Leiden community detection to automatically identify module boundaries
Cross-repository analysis: CROSS_* edge types link nodes across multiple indexed repositories for service dependency analysis
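
The dead code use case boils down to reachability over CALLS edges. The sketch below shows that idea on a toy call graph; the `find_unreachable` function, the dictionary format, and the choice of entry points are assumptions for illustration, not how the tool implements find_dead_code.

```python
from collections import deque

def find_unreachable(calls, entry_points):
    """Return functions that no call chain from the entry points reaches.

    `calls` maps a function name to the names it calls (a hypothetical
    in-memory stand-in for CALLS edges).
    """
    reachable = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in calls.get(fn, ()):
            if callee not in reachable:
                reachable.add(callee)
                queue.append(callee)
    # Every function that appears as a caller or a callee.
    all_functions = set(calls) | {c for cs in calls.values() for c in cs}
    return sorted(all_functions - reachable)

calls = {
    "main": ["parse_args", "run"],
    "run": ["handle_request"],
    "handle_request": [],
    "parse_args": [],
    "legacy_export": ["format_csv"],  # nothing calls this
    "format_csv": [],                 # only called by legacy_export
}

dead = find_unreachable(calls, ["main"])
print(dead)
```

Note that `format_csv` does have a caller, yet it is still reported: its only caller is itself unreachable. That is why a reachability walk catches more than a simple "zero incoming edges" check.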

Quick Start

Install:

# One-line install script
curl -fsSL https://raw.githubusercontent.com/DeusData/codebase-memory-mcp/main/install.sh | bash

# npm
npm install -g codebase-memory-mcp

# PyPI
pip install codebase-memory-mcp

# Homebrew (macOS)
brew install deusdata/tap/codebase-memory-mcp

Configure for Claude Code (auto-configures 11 agents):

codebase-memory-mcp setup claude-code

Manual ~/.claude/mcp.json:

{
  "mcpServers": {
    "codebase-memory": {
      "command": "codebase-memory-mcp",
      "args": ["serve"]
    }
  }
}
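
If you already have other MCP servers configured, you will want to add this entry without overwriting them. Below is a small sketch of that merge on plain dictionaries; the `add_server` helper and the `other-server` entry are hypothetical, and only the `codebase-memory` entry mirrors the JSON above.

```python
import json

def add_server(config, name, command, args):
    """Return a copy of `config` with one MCP server entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged

# Hypothetical existing configuration with an unrelated server.
existing = {"mcpServers": {"other-server": {"command": "other", "args": []}}}

updated = add_server(existing, "codebase-memory", "codebase-memory-mcp", ["serve"])
print(json.dumps(updated, indent=2))
```

To apply it for real you would load the file with `json.load`, call the helper, and write the result back with `json.dump`.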

Using in Claude Code:

# Tell the agent to index the project
"Index this project"
# The agent calls index_repository; the graph builds in seconds to minutes

# All code exploration now goes through the graph, not file reads
"Find all functions that call the authentication handler"
"What does the payment flow look like from API to database?"
"Are there any functions that are never called?"

CLI direct queries:

# Search for functions matching a pattern
codebase-memory-mcp cli search_graph \
  '{"name_pattern": ".*Handler.*", "label": "Function"}'

# Trace a function's call path in both directions
codebase-memory-mcp cli trace_path \
  '{"function_name": "processPayment", "direction": "both"}'

# Cypher graph query
codebase-memory-mcp cli query_graph \
  '{"query": "MATCH (f:Function)-[:CALLS]->(g:Function) WHERE f.name = \"main\" RETURN g.name"}'
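
The nested quoting in that last command is easy to get wrong by hand. One way around it, sketched here, is to build the JSON argument with `json.dumps` and pass the command as an argument list. The `build_cli_command` helper is hypothetical; the command name, `cli` subcommand, tool name, and `query` parameter are taken from the examples above.

```python
import json
import shlex

def build_cli_command(tool, params):
    """Build the argv for a CLI tool call, with params serialized as JSON."""
    return ["codebase-memory-mcp", "cli", tool, json.dumps(params)]

argv = build_cli_command("query_graph", {
    "query": 'MATCH (f:Function)-[:CALLS]->(g:Function) '
             'WHERE f.name = "main" RETURN g.name',
})

# Show the shell-safe form of the command without running it.
print(shlex.join(argv))

# To actually run it (requires the binary on PATH):
#   import subprocess
#   subprocess.run(argv, capture_output=True, text=True, check=True)
```

Passing a list to `subprocess.run` skips the shell entirely, so the inner double quotes around "main" need no escaping.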

All 14 MCP Tools

Tool	Function
`index_repository`	Index a codebase, build or update the knowledge graph
`search_graph`	Search nodes by name pattern and/or label
`search_code`	Four-phase hybrid code search (grep speed + graph intelligence)
`semantic_query`	Vector embedding semantic search (Nomic nomic-embed-code, 768d)
`trace_path`	Trace function call chains (configurable direction and depth)
`query_graph`	Native Cypher graph queries
`find_dead_code`	Detect unreachable isolated code
`analyze_architecture`	Leiden algorithm module boundary detection
`get_node`	Get full details for a single node
`list_routes`	List all HTTP routes (REST API analysis)
`get_dependencies`	Get package/module dependency relationships
`get_graph_stats`	Graph statistics (node count, edge count, coverage)
`watch_repository`	Start background git-aware auto-sync
`get_index_status`	Check index status and progress
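
To give a feel for what a direction-and-depth trace involves, here is a breadth-first sketch over a toy call graph. The `trace` function, its parameter names, and the "outbound"/"inbound" values are assumptions for illustration (only "both" appears in the CLI example above); this is not the tool's implementation of trace_path.

```python
from collections import deque

def trace(calls, start, direction="both", max_depth=3):
    """Walk call edges from `start`, returning {name: depth} per direction."""
    # Build the reverse adjacency so callers can be walked the same way.
    callers = {}
    for src, dsts in calls.items():
        for dst in dsts:
            callers.setdefault(dst, []).append(src)

    def walk(adjacency):
        depth_of = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if depth_of[node] == max_depth:
                continue  # depth limit reached, do not expand further
            for nxt in adjacency.get(node, ()):
                if nxt not in depth_of:
                    depth_of[nxt] = depth_of[node] + 1
                    queue.append(nxt)
        del depth_of[start]
        return depth_of

    result = {}
    if direction in ("outbound", "both"):
        result["callees"] = walk(calls)
    if direction in ("inbound", "both"):
        result["callers"] = walk(callers)
    return result

calls = {
    "api_checkout": ["processPayment"],
    "processPayment": ["charge_card", "save_order"],
    "charge_card": ["log_event"],
}

shallow = trace(calls, "processPayment", direction="both", max_depth=1)
deeper = trace(calls, "processPayment", direction="outbound", max_depth=2)
print(shallow)
print(deeper)
```

The depth limit is what keeps the answer small: an agent can ask for one hop first and widen the search only if it needs to.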

Deep Dive

The Knowledge Graph Data Model

The graph captures the full structural semantics of a codebase:

Node types (partial):

Project     ← Repository root
Package     ← Package/module
File        ← Source file
Class       ← Class definition
Function    ← Standalone function
Method      ← Class method
Route       ← HTTP endpoint
Resource    ← Infrastructure resource (K8s, Docker)

Edge types (partial):

CALLS           ← Function/method call relationship
IMPORTS         ← Module import relationship
INHERITS        ← Class inheritance
HTTP_CALLS      ← Cross-service HTTP calls
EMITS           ← Event emission (message queues)
LISTENS_ON      ← Event subscription
DATA_FLOWS      ← Data flow direction
SIMILAR_TO      ← MinHash near-duplicate code detection
CROSS_*         ← Cross-repository dependency edges

This data model goes beyond what most IDE symbol indexes provide. DATA_FLOWS and HTTP_CALLS edges describe how the system behaves at runtime (which service calls which, where data moves), and extracting them takes more than matching syntax.
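
SIMILAR_TO edges are described as MinHash-based, so here is a compact sketch of the MinHash idea itself: hash every token shingle under several salts, keep the minimum per salt, and compare signatures position by position. The shingle size, the number of hashes, the use of BLAKE2b, and the sample snippets are all assumptions for illustration; the project's actual parameters are not stated in this article.

```python
import hashlib

def shingles(tokens, k=3):
    """Set of overlapping k-token windows (assumes len(tokens) >= k)."""
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a_tokens = ("def total ( items ) : result = 0 ; for item in items : "
            "result += item ; return result").split()
b_tokens = a_tokens[:-1] + ["total_value"]   # near-duplicate: one token changed
c_tokens = "class Config : host = localhost port = 8080 debug = False".split()

sig_a = minhash_signature(shingles(a_tokens))
sig_b = minhash_signature(shingles(b_tokens))
sig_c = minhash_signature(shingles(c_tokens))

near = estimated_jaccard(sig_a, sig_b)
far = estimated_jaccard(sig_a, sig_c)
print(f"near-duplicate: {near:.2f}, unrelated: {far:.2f}")
```

The appeal for an index is that signatures are fixed-size, so near-duplicate candidates can be compared without keeping the full token sets around.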

Two-Layer Parsing Architecture

Parsing pipeline
    ↓
Layer 1: Tree-sitter
    ├── Syntactic analysis for 158 languages
    ├── Extracts: function/class/method definitions, call relationships, imports
    └── Fast, but syntax-layer only
         (can't tell which generic instantiation is in use, can't resolve cross-module types)
    ↓
Layer 2: Hybrid LSP (9 languages)
    ├── Python, TypeScript/JS, PHP, C#
    ├── Go, C/C++, Java, Kotlin, Rust
    └── Type-aware analysis:
        ├── Cross-module call resolution (which foo() does this call?)
        ├── Generic instantiation
        ├── Inheritance chain resolution
        └── Type inference

Key: Hybrid LSP doesn't spawn a language server process — type resolution runs in-process

After Hybrid LSP was introduced in v0.7.0, TypeScript compiler project indexing dropped from ~5,100 seconds to ~50 seconds — a 100× improvement. The tradeoff: only the 9 mainstream languages get semantic resolution; the remaining 149 languages have Tree-sitter syntax-layer coverage only.
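
The gap between the two layers is easiest to see on a tiny example. The sketch below uses Python's built-in `ast` module as a stand-in for a syntactic parser (it is not Tree-sitter, and this is not the project's code) to show what a syntax-only pass can and cannot say about a method call.

```python
import ast

SOURCE = '''
class FileStore:
    def save(self, item): ...

class DbStore:
    def save(self, item): ...

def persist(store, item):
    store.save(item)
'''

tree = ast.parse(SOURCE)

# Syntactic pass 1: collect method definitions as "Class.method".
definitions = []
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        for child in node.body:
            if isinstance(child, ast.FunctionDef):
                definitions.append(f"{node.name}.{child.name}")

# Syntactic pass 2: collect the bare names used at call sites.
call_names = []
for node in ast.walk(tree):
    if isinstance(node, ast.Call):
        func = node.func
        if isinstance(func, ast.Attribute):
            call_names.append(func.attr)
        elif isinstance(func, ast.Name):
            call_names.append(func.id)

# With names alone, every definition called "save" is a candidate target.
candidates = sorted(d for d in definitions if d.split(".")[1] in call_names)
print(call_names)
print(candidates)
```

The syntax layer sees a call to something named `save` and two definitions with that name. Picking the right one requires knowing the type of `store`, which is the kind of question the semantic layer exists to answer.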

Cypher Queries

The graph supports a Neo4j Cypher-compatible query syntax:

// Find functions with more than 5 callers (high-coupling nodes)
MATCH (g:Function)<-[:CALLS]-(f:Function)
WITH g, count(f) AS caller_count
WHERE caller_count > 5
RETURN g.name, caller_count
ORDER BY caller_count DESC

// Trace call chains of up to 5 hops from any route to an authentication function
MATCH path = (api:Route)-[:CALLS*..5]->(auth:Function)
WHERE auth.name CONTAINS "authenticate"
RETURN path

// Detect direct circular imports (two packages that import each other)
MATCH (a:Package)-[:IMPORTS]->(b:Package)-[:IMPORTS]->(a)
RETURN a.name, b.name

Query latency is under 1 ms: SQLite runs in WAL mode, and graph traversal and filtering execute in C.
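
The circular-dependency pattern above only matches two packages that import each other. Longer cycles (A imports B, B imports C, C imports A) need a different technique; one common choice is Tarjan's strongly connected components algorithm, sketched here on a plain dictionary. The function names and the sample import graph are hypothetical, and this runs outside the tool on data you would first export from the graph.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; `graph` maps a node to the nodes it points to."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt not in index:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index[nxt])
        if lowlink[node] == index[node]:
            # `node` is the root of a component: pop its members.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(sorted(component))

    for node in graph:
        if node not in index:
            visit(node)
    return components

def import_cycles(graph):
    """Components with more than one member, or a single self-importing node."""
    return sorted(
        c for c in strongly_connected_components(graph)
        if len(c) > 1 or c[0] in graph.get(c[0], ())
    )

imports = {
    "api": ["auth"],
    "auth": ["db"],
    "db": ["api"],          # closes a three-package cycle
    "utils": [],
    "cli": ["api", "utils"],
}

cycles = import_cycles(imports)
print(cycles)
```

The recursive form is fine for package-level graphs; for very deep graphs an iterative version avoids Python's recursion limit.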

Performance Benchmarks

On Apple M3 Pro:

Operation	Time
Linux kernel full index (28M LOC, 75K files)	~3 minutes
Django full index (~100K lines)	~6 seconds
Average-size repository	Milliseconds
Cypher query	< 1ms
Call path trace (depth 5)	< 10ms
Dead code detection	~150ms

Pure C is the performance foundation: no GC pauses, no JVM warmup, no Python interpreter overhead. The entire indexing pipeline runs as native compiled code.

Team Workflow: Shared Graph Artifact

This is a design detail worth separate attention:

# Commit the compressed graph to git
git add .codebase-memory/graph.db.zst
git commit -m "update codebase knowledge graph"
git push

# Teammates clone and use immediately — no re-indexing
git clone ...
codebase-memory-mcp serve  # graph is already in .codebase-memory/

graph.db.zst is a Zstandard-compressed SQLite database. For large codebases, having every developer re-index independently wastes time. CI generates and commits the graph; everyone else uses it directly.

Security Design

Distributing a single prebuilt binary carries supply chain risk, because users run code they did not build. This project addresses that risk with several layers of verification:

SLSA Level 3 build provenance: Every release has verifiable build origin documentation
Sigstore cosign keyless signing: No GPG key management; signatures verified through the Sigstore transparency log
VirusTotal scanning: v0.8.1 binary scanned by 72 engines — 0/72 detections
CodeQL SAST: Static security analysis gates every release
Local-only processing: All code processing happens on-device; no data sent to external services
HTTP bound to 127.0.0.1: The built-in visualization interface accepts only localhost connections; v0.8.1 explicitly removed all non-localhost access paths
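
The last point is worth a short illustration, because the difference between a local-only service and an exposed one is a single bind address. This is a generic socket sketch, not the project's HTTP server; the `open_local_listener` helper is hypothetical.

```python
import socket

def open_local_listener(port=0):
    """Open a TCP listener reachable only from this machine.

    Binding to 127.0.0.1 (rather than 0.0.0.0) means the OS will not
    accept connections arriving on any external network interface.
    Port 0 asks the OS to pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock

listener = open_local_listener()
host, port = listener.getsockname()
print(f"listening on {host}:{port}")
listener.close()
```

Restricting the bind address is enforced by the operating system's network stack, so it does not depend on any request-level filtering in the server.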

Version Highlights

Version	Key Changes
v0.5.6	search_code rewrite (4-phase pipeline), Kubernetes/Kustomize indexing
v0.5.7	Critical DB concurrency fix; soak test suite (10-min / 4-hour) as release gate
v0.6.0	Semantic search with Nomic embeddings, BM25 FTS, SIMILAR_TO edges, EMITS/LISTENS_ON
v0.6.1	66 → 158 languages, cross-repo CROSS_* edges, team-shareable graph artifact
v0.7.0	Hybrid LSP for 6 languages; TypeScript indexing 100× faster
v0.8.0	Hybrid LSP adds Java/Kotlin/Rust; Leiden community detection; Helm/HCL support
v0.8.1	Custom in-house HTTP server; localhost-only by construction; 5,604 tests

Links and Resources

Official Resources

🌟 GitHub: DeusData/codebase-memory-mcp
📄 Paper: arXiv:2603.27277

Distribution

npm, PyPI, Homebrew, Scoop, Winget, AUR, Chocolatey, official MCP Registry

Conclusion

codebase-memory-mcp provides an engineering answer to a systemic problem: AI agents explore codebases inefficiently, re-reading files every session and, by the figures quoted above, consuming roughly 120× more tokens than a structural query requires.

The knowledge graph approach for codebases is well-established in IDE tooling; the gap was a high-quality MCP server implementation that connects it to AI coding agents. Pure C with zero runtime dependencies keeps deployment portable and performance predictable. 158-language syntactic coverage, plus Hybrid LSP semantic resolution for 9 of those languages, makes it useful on multi-language codebases. The 14-tool MCP interface lets agents express precise structural questions.

For developers working long-term on the same codebase, or using Claude Code on projects over 50K lines, this MCP server is worth installing.
