Introduction
"AI agents explore codebases by reading every file — consuming 412,000 tokens. A knowledge graph query answers the same question in 3,400 tokens."
This is article #99 in the Open Source Project of the Day series. Today's project is codebase-memory-mcp — a pure-C code intelligence MCP server that builds a persistent knowledge graph from your codebase.
When you use Claude Code on a medium-to-large project, how does the agent understand the code structure? Typically: read the directory tree, read key files, follow references, read more files. Each step consumes tokens. Each new session starts over. Large codebases hit context limits before the agent has a useful picture.
codebase-memory-mcp takes a different approach: extract the codebase's structural information into a persistent knowledge graph stored in SQLite, then let agents query the graph rather than reading files. The 120× token difference follows from that design change — structural queries are concise; file contents are not.
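The "120×" figure can be checked directly from the two token counts quoted above. A quick sketch of the arithmetic (the variable names are only illustrative):

```python
# Token counts quoted in the article's opening comparison.
file_read_tokens = 412_000   # exploring by reading files
graph_query_tokens = 3_400   # answering the same question through the graph

ratio = file_read_tokens / graph_query_tokens
savings = 1 - graph_query_tokens / file_read_tokens

print(f"ratio: {ratio:.0f}x")
print(f"tokens saved: {savings:.1%}")
```

The ratio works out to roughly 121, which the article rounds to 120×.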
What You'll Learn
- The knowledge graph data model: what node types and edge types represent
- The two-layer parsing architecture: Tree-sitter syntactic layer + Hybrid LSP semantic layer
- All 14 MCP tools and their functions
- Performance numbers: how long indexing takes at Linux-kernel scale
- Team workflow: sharing a compressed graph artifact via git
- Security design: SLSA Level 3, Sigstore signing, VirusTotal scans
Prerequisites
- Experience with Claude Code or another MCP-enabled AI coding tool
- Familiarity with codebase structure concepts (functions, classes, call graphs)
- Basic understanding of the MCP protocol
Project Background
What Is codebase-memory-mcp?
codebase-memory-mcp is a code intelligence MCP server that builds a persistent knowledge graph from codebase structure, enabling AI agents to understand code through structured queries rather than file reads.
"Knowledge graph" is the precise term here. Nodes are code structure elements — files, classes, functions, routes, resources. Edges are structural relationships — calls, inheritance, imports, HTTP calls, data flows. The entire graph lives in a SQLite database, queryable through a Cypher-style graph query language.
The project has academic backing (arXiv:2603.27277) and is among the earliest high-quality MCP servers built after Anthropic released the protocol.
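To make the "graph in SQLite" idea concrete, here is a minimal sketch using Python's standard sqlite3 module. The table and column names (nodes, edges, src, dst) are assumptions chosen for illustration; the project's actual schema is not described in this article and may look quite different.

```python
import sqlite3

# Hypothetical schema: the project's real tables and columns may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT, name TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER, type TEXT);
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    (1, "Function", "main"),
    (2, "Function", "load_config"),
    (3, "Function", "process_payment"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    (1, 2, "CALLS"),
    (1, 3, "CALLS"),
])

def callees(conn, name):
    """Names of functions directly called by `name`, sorted alphabetically."""
    rows = conn.execute(
        """
        SELECT g.name
        FROM nodes AS f
        JOIN edges AS e ON e.src = f.id AND e.type = 'CALLS'
        JOIN nodes AS g ON g.id = e.dst
        WHERE f.name = ?
        ORDER BY g.name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]

print(callees(conn, "main"))
```

The point of the sketch is that a "who does main call?" question becomes two joins over small tables rather than a scan of source files.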
Author / Team
- Organization: DeusData
- Language: Pure C (zero runtime dependencies)
- License: MIT
- Latest version: v0.8.1
- Test count: 5,604
Project Stats
- ⭐ GitHub Stars: 5,400+
- 🍴 Forks: 491+
- 📄 License: MIT
- 🔬 Paper: arXiv:2603.27277
Core Features
What It Does
Traditional (file-by-file reading):
AI Agent → read file1.py → read file2.py → read file3.py → ...
↓
~412,000 tokens, repeats every session, hits context limit
Knowledge graph approach:
AI Agent → query_graph("MATCH (f:Function)-[:CALLS]->(g)...")
↓
~3,400 tokens, results from persistent graph, sub-millisecond
Use Cases
- Large codebase onboarding: Understand a new codebase quickly through graph queries, without reading every file
- Refactoring: Find every caller of a function with trace_path, and confirm the scope of a change before making it
- Dead code detection: Identify isolated functions that no call chain reaches
- Architecture analysis: Use Leiden community detection to automatically identify module boundaries
- Cross-repository analysis: CROSS_* edge types link nodes across multiple indexed repositories for service dependency analysis
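The dead code use case boils down to a reachability question over CALLS edges. The following is a rough sketch of that idea, not the project's implementation: the call graph and the choice of main as the only entry point are made up for the example.

```python
# Hypothetical call graph: function -> functions it calls.
CALL_GRAPH = {
    "main": ["parse_args", "run"],
    "run": ["load_config"],
    "parse_args": [],
    "load_config": [],
    "legacy_export": ["format_row"],  # nothing calls legacy_export
    "format_row": [],
}

def unreachable(graph, entry_points):
    """Functions that no call chain starting at an entry point reaches."""
    seen = set()
    stack = list(entry_points)
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(graph.get(fn, []))
    return sorted(set(graph) - seen)

print(unreachable(CALL_GRAPH, ["main"]))
```

Note that format_row does have a caller, but it is still reported because its only caller is itself unreachable. A plain "zero callers" check would miss it.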
Quick Start
Install:
# One-line install script
curl -fsSL https://raw.githubusercontent.com/DeusData/codebase-memory-mcp/main/install.sh | bash
# npm
npm install -g codebase-memory-mcp
# PyPI
pip install codebase-memory-mcp
# Homebrew (macOS)
brew install deusdata/tap/codebase-memory-mcp
Configure for Claude Code (auto-configures 11 agents):
codebase-memory-mcp setup claude-code
Manual ~/.claude/mcp.json:
{
  "mcpServers": {
    "codebase-memory": {
      "command": "codebase-memory-mcp",
      "args": ["serve"]
    }
  }
}
Using in Claude Code:
# Tell the agent to index the project
"Index this project"
# The agent calls index_repository; the graph builds in seconds to minutes
# All code exploration now goes through the graph, not file reads
"Find all functions that call the authentication handler"
"What does the payment flow look like from API to database?"
"Are there any functions that are never called?"
CLI direct queries:
# Search for functions matching a pattern
codebase-memory-mcp cli search_graph \
'{"name_pattern": ".*Handler.*", "label": "Function"}'
# Trace a function's call path in both directions
codebase-memory-mcp cli trace_path \
'{"function_name": "processPayment", "direction": "both"}'
# Cypher graph query
codebase-memory-mcp cli query_graph \
'{"query": "MATCH (f:Function)-[:CALLS]->(g:Function) WHERE f.name = \"main\" RETURN g.name"}'
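Hand-escaping quotes inside a JSON argument, as in the last command, is easy to get wrong. One option is to let a script build the argument. The helper below is a hypothetical convenience wrapper, not part of the project; it only assembles the command string shown above and does not run anything.

```python
import json
import shlex

def cli_command(tool, args):
    """Return a shell-safe command line for `codebase-memory-mcp cli <tool>`."""
    payload = json.dumps(args)  # handles the inner double quotes
    return " ".join(["codebase-memory-mcp", "cli", tool, shlex.quote(payload)])

cypher = 'MATCH (f:Function)-[:CALLS]->(g:Function) WHERE f.name = "main" RETURN g.name'
print(cli_command("query_graph", {"query": cypher}))
```

json.dumps takes care of escaping inside the JSON string, and shlex.quote wraps the whole payload so the shell passes it through as a single argument.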
All 14 MCP Tools
| Tool | Function |
|---|---|
| index_repository | Index a codebase, build or update the knowledge graph |
| search_graph | Search nodes by name pattern and/or label |
| search_code | Four-phase hybrid code search (grep speed + graph intelligence) |
| semantic_query | Vector embedding semantic search (Nomic nomic-embed-code, 768d) |
| trace_path | Trace function call chains (configurable direction and depth) |
| query_graph | Native Cypher graph queries |
| find_dead_code | Detect unreachable isolated code |
| analyze_architecture | Leiden algorithm module boundary detection |
| get_node | Get full details for a single node |
| list_routes | List all HTTP routes (REST API analysis) |
| get_dependencies | Get package/module dependency relationships |
| get_graph_stats | Graph statistics (node count, edge count, coverage) |
| watch_repository | Start background git-aware auto-sync |
| get_index_status | Check index status and progress |
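As an illustration of what "configurable direction and depth" means for trace_path, here is a small breadth-first sketch over a made-up set of call edges. The direction value "both" appears in the CLI example earlier; the names "inbound" and "outbound", the max_depth parameter, and the result shape are assumptions for this sketch only.

```python
from collections import deque

# Hypothetical call edges (caller, callee); not taken from a real index.
CALLS = [
    ("handle_request", "process_payment"),
    ("process_payment", "charge_card"),
    ("charge_card", "log_event"),
    ("retry_job", "process_payment"),
]

def _walk(start, adjacency, max_depth):
    """Breadth-first walk that records the hop count to each function."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand past the depth limit
        for nxt in adjacency.get(node, []):
            if nxt not in depths:
                depths[nxt] = depths[node] + 1
                queue.append(nxt)
    del depths[start]
    return depths

def trace(start, direction="both", max_depth=3):
    """Callees and/or callers of `start`, each mapped to its distance in hops."""
    forward, backward = {}, {}
    for caller, callee in CALLS:
        forward.setdefault(caller, []).append(callee)
        backward.setdefault(callee, []).append(caller)
    result = {}
    if direction in ("outbound", "both"):
        result["callees"] = _walk(start, forward, max_depth)
    if direction in ("inbound", "both"):
        result["callers"] = _walk(start, backward, max_depth)
    return result

print(trace("process_payment"))
```

Walking the two directions separately keeps callers and callees apart, which is usually what a refactoring question needs: who breaks if this changes, and what does this depend on.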
Deep Dive
The Knowledge Graph Data Model
The graph captures the full structural semantics of a codebase:
Node types (partial):
Project ← Repository root
Package ← Package/module
File ← Source file
Class ← Class definition
Function ← Standalone function
Method ← Class method
Route ← HTTP endpoint
Resource ← Infrastructure resource (K8s, Docker)
Edge types (partial):
CALLS ← Function/method call relationship
IMPORTS ← Module import relationship
INHERITS ← Class inheritance
HTTP_CALLS ← Cross-service HTTP calls
EMITS ← Event emission (message queues)
LISTENS_ON ← Event subscription
DATA_FLOWS ← Data flow direction
SIMILAR_TO ← MinHash near-duplicate code detection
CROSS_* ← Cross-repository dependency edges
This data model goes beyond what most IDE symbol indexes provide. DATA_FLOWS and HTTP_CALLS edges require analyzing what the code does with its values and network calls, not just where symbols are defined.
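The SIMILAR_TO edges are described as MinHash-based. For readers unfamiliar with the technique, here is a generic MinHash sketch using only the standard library. The tokenization, shingle size, and number of hash functions are arbitrary choices for illustration and say nothing about how the project configures it.

```python
import hashlib

def shingles(code, k=3):
    """Set of k-token shingles from whitespace-tokenized source text."""
    tokens = code.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash(items, num_hashes=64):
    """MinHash signature: for each salted hash function, keep the minimum value."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature

def similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

original = "def total(xs): result = 0 for x in xs: result += x return result"
near_copy = "def total(xs): result = 0 for x in xs: result += x return acc"
unrelated = "class Parser: pass # nothing in common with the summing helper above"

sig = minhash(shingles(original))
print(similarity(sig, minhash(shingles(near_copy))))
print(similarity(sig, minhash(shingles(unrelated))))
```

Signatures are short fixed-length lists, so comparing two functions costs the same regardless of how long the functions are. That is what makes near-duplicate detection feasible across a whole codebase.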
Two-Layer Parsing Architecture
Parsing pipeline
↓
Layer 1: Tree-sitter
├── Syntactic analysis for 158 languages
├── Extracts: function/class/method definitions, call relationships, imports
└── Fast, but syntax-layer only
(doesn't know which generic instantiation, can't resolve cross-module types)
↓
Layer 2: Hybrid LSP (9 languages)
├── Python, TypeScript/JS, PHP, C#
├── Go, C/C++, Java, Kotlin, Rust
└── Type-aware analysis:
├── Cross-module call resolution (which foo() does this call?)
├── Generic instantiation
├── Inheritance chain resolution
└── Type inference
Key: Hybrid LSP doesn't spawn a language server process — type resolution runs in-process
After Hybrid LSP was introduced in v0.7.0, TypeScript compiler project indexing dropped from ~5,100 seconds to ~50 seconds — a 100× improvement. The tradeoff: only the 9 mainstream languages get semantic resolution; the remaining 149 languages have Tree-sitter syntax-layer coverage only.
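To see why a syntax-only layer is fast but limited, the sketch below uses Python's built-in ast module as a stand-in for Tree-sitter (the project itself uses Tree-sitter, not ast). It collects definitions, imports, and call names from one file. The modules billing and audit are invented for the example.

```python
import ast

SOURCE = '''
from billing import charge
import audit

def checkout(cart):
    charge(cart.total)
    audit.record("checkout")
'''

def syntactic_facts(source):
    """Collect what a syntax-only pass can see: definitions, imports, call names."""
    tree = ast.parse(source)
    facts = {"defines": [], "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            facts["defines"].append(node.name)
        elif isinstance(node, ast.ImportFrom):
            facts["imports"] += [f"{node.module}.{alias.name}" for alias in node.names]
        elif isinstance(node, ast.Import):
            facts["imports"] += [alias.name for alias in node.names]
        elif isinstance(node, ast.Call):
            facts["calls"].append(ast.unparse(node.func))
    return facts

print(syntactic_facts(SOURCE))
```

The pass knows that checkout calls something written as charge, but linking that name to a specific definition in another file, or picking the right overload, requires the type-aware resolution that the second layer adds.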
Cypher Queries
The graph supports a Neo4j Cypher-compatible query syntax:
// Find functions with more than 5 callers (high-coupling nodes)
MATCH (g:Function)<-[:CALLS]-(f:Function)
WITH g, count(f) AS caller_count
WHERE caller_count > 5
RETURN g.name, caller_count
ORDER BY caller_count DESC
// Trace call chains of up to 5 hops from a route to an authentication function
MATCH path = (api:Route)-[:CALLS*..5]->(auth:Function)
WHERE auth.name CONTAINS "authenticate"
RETURN path
// Detect direct circular imports between two packages
MATCH (a:Package)-[:IMPORTS]->(b:Package)-[:IMPORTS]->(a)
RETURN a.name, b.name
Query latency is under 1 ms: SQLite runs in WAL mode, and graph traversal and filtering execute in C.
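A variable-length pattern such as [:CALLS*..5] has a natural counterpart in SQLite: a recursive common table expression with a depth bound. The sketch below shows that correspondence on a made-up calls table. It is an assumption about how such a query could be expressed in SQL, not a description of how the project translates Cypher.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (caller TEXT, callee TEXT)")  # hypothetical table
conn.executemany("INSERT INTO calls VALUES (?, ?)", [
    ("login_route", "check_session"),
    ("check_session", "authenticate_user"),
    ("authenticate_user", "hash_password"),
])

def reachable(conn, start, max_depth=5):
    """(function, hops) pairs reachable from `start` within `max_depth` calls."""
    return conn.execute(
        """
        WITH RECURSIVE walk(name, depth) AS (
            SELECT callee, 1 FROM calls WHERE caller = :start
            UNION
            SELECT c.callee, w.depth + 1
            FROM walk AS w JOIN calls AS c ON c.caller = w.name
            WHERE w.depth < :max_depth
        )
        SELECT name, MIN(depth) FROM walk GROUP BY name ORDER BY MIN(depth), name
        """,
        {"start": start, "max_depth": max_depth},
    ).fetchall()

print(reachable(conn, "login_route"))
```

The depth bound does double duty: it mirrors the ..5 limit in the Cypher pattern, and it guarantees termination even if the call graph contains cycles.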
Performance Benchmarks
On Apple M3 Pro:
| Operation | Time |
|---|---|
| Linux kernel full index (28M LOC, 75K files) | ~3 minutes |
| Django full index (~100K lines) | ~6 seconds |
| Average-size repository | Milliseconds |
| Cypher query | < 1ms |
| Call path trace (depth 5) | < 10ms |
| Dead code detection | ~150ms |
Pure C is the performance foundation: no GC pauses, no JVM warmup, no Python interpreter overhead. The entire indexing pipeline runs at C layer speed.
Team Workflow: Shared Graph Artifact
One design detail deserves separate attention: the indexed graph can be committed to git and shared with the team.
# Commit the compressed graph to git
git add .codebase-memory/graph.db.zst
git commit -m "update codebase knowledge graph"
git push
# Teammates clone and use immediately — no re-indexing
git clone ...
codebase-memory-mcp serve # graph is already in .codebase-memory/
graph.db.zst is a Zstandard-compressed SQLite database. For large codebases, having every developer re-index independently wastes time. CI generates and commits the graph; everyone else uses it directly.
Security Design
A single binary distribution model carries supply chain risk. This project's security measures are more thorough than most comparable tools:
- SLSA Level 3 build provenance: Every release has verifiable build origin documentation
- Sigstore cosign keyless signing: No GPG key management; signatures verified through the Sigstore transparency log
- VirusTotal scanning: v0.8.1 binary scanned by 72 engines — 0/72 detections
- CodeQL SAST: Static security analysis gates every release
- Local-only processing: All code processing happens on-device; no data sent to external services
- HTTP bound to 127.0.0.1: The built-in visualization interface accepts only localhost connections; v0.8.1 explicitly removed all non-localhost access paths
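The localhost-only point is worth a concrete picture. The project ships its own HTTP server written in C; the Python sketch below only illustrates the general principle that binding the listening socket to 127.0.0.1 instead of 0.0.0.0 makes the service unreachable from other machines. The handler and function names are invented for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Hello(BaseHTTPRequestHandler):
    """Tiny handler that answers every GET with a plain-text body."""
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_local_server(port=0):
    """Bind to the loopback interface only; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), Hello)

server = make_local_server()
host, port = server.server_address
print(f"listening on http://{host}:{port}")
server.server_close()  # a real server would call server.serve_forever() instead
```

Because the restriction lives in the bind address rather than in request filtering, there is no code path that could accidentally accept a remote connection.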
Version Highlights
| Version | Key Changes |
|---|---|
| v0.5.6 | search_code rewrite (4-phase pipeline), Kubernetes/Kustomize indexing |
| v0.5.7 | Critical DB concurrency fix; soak test suite (10-min / 4-hour) as release gate |
| v0.6.0 | Semantic search with Nomic embeddings, BM25 FTS, SIMILAR_TO edges, EMITS/LISTENS_ON |
| v0.6.1 | 66 → 158 languages, cross-repo CROSS_* edges, team-shareable graph artifact |
| v0.7.0 | Hybrid LSP for 6 languages; TypeScript indexing 100× faster |
| v0.8.0 | Hybrid LSP adds Java/Kotlin/Rust; Leiden community detection; Helm/HCL support |
| v0.8.1 | Custom in-house HTTP server; localhost-only by construction; 5,604 tests |
Links and Resources
Official Resources
- 🌟 GitHub: DeusData/codebase-memory-mcp
- 📄 Paper: arXiv:2603.27277
Distribution
npm, PyPI, Homebrew, Scoop, Winget, AUR, Chocolatey, official MCP Registry
Conclusion
codebase-memory-mcp provides an engineering answer to a systemic problem: AI agents explore codebases inefficiently, re-reading files every session, consuming 120× more tokens than a structural query requires.
The knowledge graph approach for codebases is well-established in IDE tooling; the gap was a high-quality MCP server implementation that connects it to AI coding agents. Pure C with zero runtime dependencies makes for a portable deployment with predictable performance. 158-language coverage and Hybrid LSP semantic resolution make it genuinely useful on multi-language codebases. The 14-tool MCP interface lets agents express precise structural questions.
For developers working long-term on the same codebase, or using Claude Code on projects over 50K lines, this MCP server is worth installing.