WonderLab

Posted on May 21

One Open Source Project a Day (No. 71): CodeGraph — Pre-Index Your Codebase for AI Agents, Save 35% Cost and 70% Tool Calls

#opensource #claude #mcp #sqlite

Introduction

"~35% cheaper · ~70% fewer tool calls · 100% local"

This is article No. 71 in the "One Open Source Project a Day" series. Today's project is CodeGraph.

Start with a scenario: you ask Claude Code "How is AuthService being called?" Without assistance, Claude glob-scans directories, runs several greps, and reads a handful of files before it can answer. The whole process can trigger 10–15 tool calls and consume hundreds of thousands of tokens.

CodeGraph's insight is to front-load this work: before you start, it has already parsed your codebase with tree-sitter into a semantic graph stored in a local SQLite database, then exposes 8 query tools to AI agents via MCP. When the agent needs to understand code, a single codegraph_context call returns entry points, related symbols, and code snippets — no file reading required.

The repository has 9.6k stars and 588 forks. Benchmarks across 7 real open-source projects show, on average, 35% lower cost, 70% fewer tool calls, and 49% faster responses. On VS Code's large TypeScript repository, one architecture question dropped from 1.4M tokens to 393k, and its cost from $0.64 to $0.42.

What You Will Learn

CodeGraph's four-stage pipeline: Extract → Store → Resolve → Auto-Sync
The 8 MCP tools and when to use each
A detailed breakdown of benchmark results across 7 projects: which codebases benefit most, and why?
How 19-language support and 13-framework route recognition work
Complete setup walkthrough from installation to Claude Code integration
codegraph affected: using dependency tracing for smart CI test selection

Prerequisites

Familiarity with Claude Code, Cursor, or similar AI coding tools
Basic understanding of MCP (Model Context Protocol)
Node.js experience

Project Background

Project Introduction

CodeGraph is a local semantic code knowledge graph tool designed specifically to improve AI coding agent efficiency. Its core insight:

AI agents spend a large share of their tokens and time on the "discovery phase" (scanning directories, searching for symbols, reading files) rather than on actual reasoning and generation.

CodeGraph's solution is to move the discovery phase into a pre-built index: the index is ready before you start working, so AI agents pull structured code knowledge directly instead of exploring the file system from scratch.

The technology choices are pragmatic: tree-sitter for AST parsing (mature, multi-language, high-performance), SQLite FTS5 for full-text search (zero external dependencies, fully local), and native OS file events for live sync (FSEvents/inotify/ReadDirectoryChangesW).

Author/Team

Author: Colby McHenry (GitHub: colbymchenry)
Repository: colbymchenry/codegraph
Distribution: npm package @colbymchenry/codegraph

Project Stats

⭐ GitHub Stars: 9,600+
🍴 Forks: 588
📦 npm package: @colbymchenry/codegraph
🔧 Runtime: Node.js 20–24
💻 Platforms: Windows, macOS, Linux
📄 License: MIT
🌐 Repository: colbymchenry/codegraph

Main Features

Core Utility

CodeGraph inserts a pre-built index layer between AI agents and codebases:

Codebase (TypeScript / Python / Go / ...)
        ↓ tree-sitter parsing
  Semantic graph (symbols + relationships + call chains)
        ↓ stored in SQLite FTS5
  Local knowledge base
        ↓ exposed via MCP
  AI coding agents (Claude Code / Cursor / Codex CLI / OpenCode)

Without CodeGraph:

User: "How is AuthService being called?"
→ Agent: glob("src/**/*.ts")         # Tool call 1
→ Agent: grep("AuthService")         # Tool call 2
→ Agent: read("auth.service.ts")     # Tool call 3
→ Agent: grep("import.*Auth")        # Tool call 4
→ Agent: read("user.controller.ts")  # Tool call 5
→ Agent: read("app.module.ts")       # Tool call 6
... 10–15 total tool calls, massive token consumption

With CodeGraph:

User: "How is AuthService being called?"
→ Agent: codegraph_callers("AuthService")   # Tool call 1
→ Returns: full caller list + call sites + code snippets
→ Agent answers directly, no file reading needed

Quick Start

One-command install (recommended):

# Run the interactive installer — auto-detects installed AI agents and configures them
npx @colbymchenry/codegraph

# Initialize in your project (-i for interactive)
cd your-project
codegraph init -i

Non-interactive install (CI environments):

# Auto-detect all installed agents, global install
codegraph install --yes

# Target specific agents
codegraph install --target=cursor,claude --yes

# Project-local install
codegraph install --target=auto --location=local

Manual Claude Code configuration:

npm install -g @colbymchenry/codegraph

Add to ~/.claude.json (or project-level .claude.json):

{
  "mcpServers": {
    "codegraph": {
      "type": "stdio",
      "command": "codegraph",
      "args": ["serve", "--mcp"]
    }
  }
}

Verify installation:

codegraph status          # Check index status and stats
codegraph query "UserService"  # Test symbol search

The 8 MCP Tools

The complete toolset CodeGraph exposes to AI agents:

| Tool | Purpose | Typical Invocation |
| --- | --- | --- |
| `codegraph_search` | Find symbols by name | "Find all functions called authenticate" |
| `codegraph_context` | Build code context for a task | "What code is relevant to the login flow?" |
| `codegraph_callers` | Find what calls a function | "What calls AuthService?" |
| `codegraph_callees` | Find what a function calls | "What does processPayment call internally?" |
| `codegraph_impact` | Analyze change impact radius | "What breaks if I change this function?" |
| `codegraph_node` | Get details about a specific symbol | "Show me UserController's full signature" |
| `codegraph_files` | Get indexed file structure | "What is the overall project structure?" |
| `codegraph_status` | Check index health and stats | "How many symbols are indexed? Last sync?" |

codegraph_context is the most important tool — it doesn't just return search results; it intelligently assembles a comprehensive context package for a given task, including entry points, related symbols, and code snippets:

# Command-line equivalent
codegraph context "fix user login bug"
# → Automatically finds login-related functions, call chains, and relevant files
#   packaged into context Claude can consume directly
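
For readers who have not looked at MCP traffic before: tool invocations travel as JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds such a request for codegraph_context. The argument name `task` is an assumption made for illustration; the real input schema is whatever the server advertises through `tools/list`.

```typescript
// Shape of an MCP tool invocation (JSON-RPC 2.0, method "tools/call").
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Build a request for a named tool. The argument keys are NOT taken from
// CodeGraph's documentation; check the schema returned by "tools/list".
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Hypothetical argument name "task".
const request = buildToolCall(1, "codegraph_context", { task: "fix user login bug" });
console.log(JSON.stringify(request));
```

In practice the agent's MCP client writes this message for you; the point is only that one structured request replaces a chain of glob, grep, and read calls.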

Project Advantages

| Dimension | CodeGraph | Native AI Agent (no assist) | Other code indexers |
| --- | --- | --- | --- |
| Tool call count | ~70% fewer | High (re-scans each task) | Partial reduction |
| Token usage | ~59% fewer | High | Partial reduction |
| Data privacy | 100% local | Depends on agent | Most require uploads |
| Real-time sync | Native OS file events | N/A | Usually polling or manual |
| Language support | 19+ languages | Depends on agent | Usually 3–5 |
| Framework route detection | 13 frameworks | None | Rare |
| Installation complexity | One npx command | N/A | Usually requires server |

Detailed Analysis

1. The Four-Stage Pipeline

Stage 1: Extraction

tree-sitter parses source files into ASTs, extracting:

Symbols: functions, classes, methods, interfaces, variable definitions
Relationships: function calls, module imports, class inheritance, interface implementations

tree-sitter's key advantage: it is a fault-tolerant parser — it can extract partial structure even when code has syntax errors. This is critical for indexing files that are actively being edited.

Stage 2: Storage

All data lands in a local SQLite database using the FTS5 (Full-Text Search 5) extension:

-- Symbols table (simplified)
CREATE VIRTUAL TABLE symbols USING fts5(
  name,          -- Symbol name
  kind,          -- function/class/method/...
  file_path,     -- Source file
  line_start,    -- Starting line
  signature,     -- Function signature
  docstring,     -- Documentation comment
  code_snippet   -- Code excerpt
);

-- Relationships table
CREATE TABLE edges (
  from_id  INTEGER,  -- Caller symbol ID
  to_id    INTEGER,  -- Callee symbol ID
  kind     TEXT,     -- calls/imports/inherits/implements
  file     TEXT,
  line     INTEGER
);
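
To make the two tables concrete, here is a minimal in-memory sketch of the kind of lookup a "who calls this?" query could perform over them. The rows are made up, and the assumption that edge IDs refer to the symbols table's rowid is mine, not something the simplified schema states.

```typescript
// In-memory stand-ins for the simplified tables above (illustrative rows only).
interface SymbolRow { id: number; name: string; filePath: string }
interface EdgeRow {
  fromId: number;
  toId: number;
  kind: "calls" | "imports" | "inherits" | "implements";
  line: number;
}

const symbols: SymbolRow[] = [
  { id: 1, name: "AuthService.login", filePath: "auth.service.ts" },
  { id: 2, name: "UserController.login", filePath: "user.controller.ts" },
  { id: 3, name: "SessionGuard.check", filePath: "session.guard.ts" },
];

const edges: EdgeRow[] = [
  { fromId: 2, toId: 1, kind: "calls", line: 42 },
  { fromId: 3, toId: 1, kind: "calls", line: 17 },
  { fromId: 2, toId: 3, kind: "imports", line: 3 },
];

// Roughly equivalent to:
//   SELECT s.name, s.file_path, e.line
//   FROM edges e JOIN symbols s ON s.rowid = e.from_id
//   WHERE e.to_id = ? AND e.kind = 'calls'
function findCallers(targetId: number): { caller: string; file: string; line: number }[] {
  const byId = new Map<number, SymbolRow>();
  for (const s of symbols) byId.set(s.id, s);

  const result: { caller: string; file: string; line: number }[] = [];
  for (const e of edges) {
    if (e.toId !== targetId || e.kind !== "calls") continue;
    const s = byId.get(e.fromId);
    if (s) result.push({ caller: s.name, file: s.filePath, line: e.line });
  }
  return result;
}

console.log(findCallers(1));
```

Because the relationships are stored as rows, the answer is a filter and a join rather than a fresh scan of the source tree.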

Stage 3: Resolution

The critical step: resolving an abstract "called something named X" reference into a concrete "calls the definition in file Y at line Z" edge.

Source code: import { AuthService } from './auth.service'
             ...
             this.authService.login(user)
            ↓ resolution
Graph edges: UserController.login → AuthService.login (calls)
             UserController → AuthService (imports)
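
A minimal sketch of the idea, assuming relative imports between TypeScript files and a definitions map keyed by file and exported name. The map layout, helper names, and line number are hypothetical and are not CodeGraph's actual resolver.

```typescript
// A definition known to the index.
interface Definition { symbol: string; file: string; line: number }

// Hypothetical index: definitions keyed by "<file>#<exported name>".
const definitions = new Map<string, Definition>([
  [
    "src/auth.service.ts#AuthService",
    { symbol: "AuthService", file: "src/auth.service.ts", line: 8 },
  ],
]);

// Resolve a relative import specifier against the importing file's directory.
function resolveModule(fromFile: string, specifier: string): string {
  const parts = fromFile.split("/").slice(0, -1); // directory of the importing file
  for (const seg of specifier.split("/")) {
    if (seg === "." || seg === "") continue;
    if (seg === "..") parts.pop();
    else parts.push(seg);
  }
  return parts.join("/") + ".ts"; // assumption: .ts sources, no index files or aliases
}

// Turn "name X imported from specifier S in file F" into a concrete definition.
function resolveImport(
  fromFile: string,
  name: string,
  specifier: string,
): Definition | undefined {
  return definitions.get(`${resolveModule(fromFile, specifier)}#${name}`);
}

const target = resolveImport("src/user.controller.ts", "AuthService", "./auth.service");
console.log(target);
```

A real resolver also has to handle path aliases, re-exports, and dynamic dispatch, which is where most of the difficulty lies.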

Stage 4: Auto-Sync

Uses native OS file events (not polling!) to detect changes:

macOS: FSEvents
Linux: inotify
Windows: ReadDirectoryChangesW

A 2-second debounce prevents triggering mass rebuilds when files change rapidly — it waits for changes to settle before doing incremental updates.
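
The effect of the debounce can be sketched as a pure function over timestamped events: changes that arrive within the quiet period of each other collapse into one batch. The event shape and function name are illustrative assumptions, not CodeGraph's internals.

```typescript
// A file-change event with a timestamp in milliseconds.
interface ChangeEvent { path: string; at: number }

// Group events into batches. A batch closes once no event arrives for `quietMs`.
// Each batch lists the distinct paths to re-index in one incremental update.
function batchByQuietPeriod(events: ChangeEvent[], quietMs: number): string[][] {
  const sorted = events.slice().sort((a, b) => a.at - b.at);
  const batches: string[][] = [];
  let current = new Set<string>();
  let lastAt: number | undefined;

  for (const ev of sorted) {
    if (lastAt !== undefined && ev.at - lastAt >= quietMs) {
      batches.push(Array.from(current));
      current = new Set<string>();
    }
    current.add(ev.path);
    lastAt = ev.at;
  }
  if (current.size > 0) batches.push(Array.from(current));
  return batches;
}

const batches = batchByQuietPeriod(
  [
    { path: "src/auth.ts", at: 0 },
    { path: "src/auth.ts", at: 500 },
    { path: "src/user.ts", at: 1200 },
    { path: "src/app.ts", at: 5000 },
  ],
  2000,
);
console.log(batches); // two batches: the first three saves collapse into one update
```

A live watcher would implement the same rule with a timer that is reset on every event and fires once the quiet period elapses.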

2. Benchmark Deep Dive

Test conditions: Claude Code (headless, Opus 4.7) answering architecture questions. Each result is the median of 4 runs on the same question, across 7 real open-source repositories.

Project        Language       Size            Cost ↓  Token ↓  Speed ↑  Tool Calls ↓
──────────────────────────────────────────────────────────────────────────────────────
VS Code        TypeScript     ~10k files      35%     73%      41%      72%
Excalidraw     TypeScript     ~600 files      47%     73%      60%      86%
Django         Python         ~2.7k files     34%     64%      59%      81%
Tokio          Rust           ~700 files      52%     81%      63%      89%
OkHttp         Java           ~640 files      17%     41%      36%      64%
Gin            Go             ~150 files      22%     23%      34%      19%
Alamofire      Swift          ~100 files      38%     59%      51%      77%
──────────────────────────────────────────────────────────────────────────────────────
Average                                       35%     59%      49%      70%

Patterns worth noting:

Tokio (Rust, 700 files) sees the biggest gains (81% token reduction, 89% fewer tool calls): Rust's type system is complex — agents originally needed extensive file exploration to understand trait implementations and generic relationships. CodeGraph's pre-built relationships make this dramatically cheaper.
Gin (Go, 150 files) sees the smallest gains (23% token reduction, 19% fewer tool calls): Small Go projects have simple file structures. Agents can already navigate them efficiently, so CodeGraph's marginal value is lower.
VS Code's absolute numbers are the most striking: the same question costs $0.64 (1.4M tokens) without CodeGraph, $0.42 (393k tokens) with it. A single task saves $0.22.

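Takeaway: in these benchmarks the gains track dependency complexity and the richness of the language's type system more closely than raw file count. Tokio (~700 files) benefits more than VS Code (~10k files), while the small, simple Gin codebase benefits least. For developers using Claude Code heavily on large or intricate projects, the ROI is clear.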
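
As a sanity check, the averages in the table and the VS Code figures can be recomputed from the numbers quoted above. This is plain arithmetic over the published values, with nothing measured independently.

```typescript
// Per-project reductions copied from the benchmark table (percent).
const rows = [
  { project: "VS Code", cost: 35, tokens: 73, speed: 41, toolCalls: 72 },
  { project: "Excalidraw", cost: 47, tokens: 73, speed: 60, toolCalls: 86 },
  { project: "Django", cost: 34, tokens: 64, speed: 59, toolCalls: 81 },
  { project: "Tokio", cost: 52, tokens: 81, speed: 63, toolCalls: 89 },
  { project: "OkHttp", cost: 17, tokens: 41, speed: 36, toolCalls: 64 },
  { project: "Gin", cost: 22, tokens: 23, speed: 34, toolCalls: 19 },
  { project: "Alamofire", cost: 38, tokens: 59, speed: 51, toolCalls: 77 },
];

type Metric = "cost" | "tokens" | "speed" | "toolCalls";

// Arithmetic mean of one column, rounded to a whole percent.
function mean(metric: Metric): number {
  const total = rows.reduce((sum, r) => sum + r[metric], 0);
  return Math.round(total / rows.length);
}

const averages = {
  cost: mean("cost"),
  tokens: mean("tokens"),
  speed: mean("speed"),
  toolCalls: mean("toolCalls"),
};
console.log(averages); // matches the table's Average row: 35, 59, 49, 70

// VS Code example, using the rounded absolute figures quoted in the text.
const savedUsd = Math.round((0.64 - 0.42) * 100) / 100;
const tokenDropPct = Math.round((1 - 393000 / 1400000) * 100);
console.log(savedUsd, tokenDropPct);
```

The token drop recomputed from the rounded 1.4M and 393k figures comes out at about 72%, one point off the table's 73%, which is consistent with the inputs having been rounded.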

3. 19 Languages + 13 Framework Route Detection

Language support (via tree-sitter grammars):

TypeScript, JavaScript, Python, Go, Rust, Java, C#, PHP, Ruby, C, C++, Swift, Kotlin, Dart, Svelte, Vue, Liquid, Pascal/Delphi, Scala

Framework route detection is a differentiating feature — CodeGraph doesn't just recognize symbols, it understands the mapping between URL routes and their handler functions:

# Django
urlpatterns = [
    path('users/<int:pk>/', UserDetailView.as_view()),
]
# → CodeGraph knows GET /users/{id}/ maps to UserDetailView

# FastAPI
@app.get("/items/{item_id}")
async def read_item(item_id: int):
    ...
# → CodeGraph knows GET /items/{id} maps to read_item()

The 13 supported frameworks: Django, Flask, FastAPI, Express, NestJS, Laravel, Rails, Spring, Gin/chi/gorilla/mux, Axum/actix/Rocket, ASP.NET, Vapor, React Router/SvelteKit.

This means AI agents can ask "Where is the handler for /api/users/:id?" and get a precise answer, without needing to scan routing config files.
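
Once routes are indexed, answering that question is a table lookup. The sketch below shows one way such a lookup could work. The route table, the handler name `UserController.findOne`, and the matching rules are assumptions for illustration, not CodeGraph's implementation.

```typescript
interface Route { method: string; pattern: string; handler: string }

// Hypothetical route table, as an index might record it after scanning framework code.
const routes: Route[] = [
  { method: "GET", pattern: "/items/{item_id}", handler: "read_item" },
  { method: "GET", pattern: "/api/users/:id", handler: "UserController.findOne" },
];

// A segment is a parameter if written as ":name" or "{name}".
function isParam(segment: string): boolean {
  return segment.startsWith(":") || (segment.startsWith("{") && segment.endsWith("}"));
}

function segmentsOf(path: string): string[] {
  return path.split("/").filter((s) => s.length > 0);
}

// Find the handler for a concrete path ("/api/users/42") or a pattern ("/api/users/:id").
function findHandler(method: string, path: string): string | undefined {
  const wanted = segmentsOf(path);
  for (const route of routes) {
    if (route.method !== method) continue;
    const have = segmentsOf(route.pattern);
    if (have.length !== wanted.length) continue;
    const matches = have.every((seg, i) => {
      const w = wanted[i] ?? "";
      return isParam(seg) || isParam(w) || seg === w;
    });
    if (matches) return route.handler;
  }
  return undefined;
}

console.log(findHandler("GET", "/api/users/:id"));
console.log(findHandler("GET", "/items/7"));
```

Normalizing `:id` and `{item_id}` to the same notion of "parameter segment" is what lets one lookup serve several frameworks' route syntaxes.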

4. `codegraph affected` — Smart CI Test Selection

An underappreciated feature: by tracing import dependencies, it identifies which test files are actually affected by changed source files.

# CI scenario: only run tests affected by this change
git diff --name-only | codegraph affected --stdin

# Manually specify changed files
codegraph affected src/auth.ts

# With filter (only e2e tests)
codegraph affected src/auth.ts --filter "e2e/*"

How it works:

Changed: src/auth.ts
  ↓ CodeGraph queries the dependency graph
  Direct importers: user.service.ts, auth.controller.ts
  Indirect importers: app.module.ts, integration.test.ts
  ↓ Filter to test files only
  Affected tests: auth.spec.ts, user.service.spec.ts, integration.test.ts
  ↓ Output
  [these files] ← run only these, not the full test suite
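
The flow above amounts to a breadth-first walk over reverse import edges. Here is a minimal sketch under stated assumptions: the import graph is hand-written, and the rule that test files end in `.spec.ts` or `.test.ts` is mine, not necessarily how codegraph affected classifies tests.

```typescript
// Hypothetical import graph: file -> files it imports.
const importsOf: Record<string, string[]> = {
  "src/user.service.ts": ["src/auth.ts"],
  "src/auth.controller.ts": ["src/auth.ts"],
  "src/app.module.ts": ["src/auth.controller.ts", "src/user.service.ts"],
  "test/auth.spec.ts": ["src/auth.ts"],
  "test/user.service.spec.ts": ["src/user.service.ts"],
  "test/integration.test.ts": ["src/app.module.ts"],
  "test/billing.spec.ts": ["src/billing.ts"],
};

// Invert the graph so each file maps to its direct importers.
function buildImporters(graph: Record<string, string[]>): Map<string, string[]> {
  const importers = new Map<string, string[]>();
  for (const file of Object.keys(graph)) {
    for (const dep of graph[file] ?? []) {
      const list = importers.get(dep) ?? [];
      list.push(file);
      importers.set(dep, list);
    }
  }
  return importers;
}

// Breadth-first walk over importers, keeping only files that look like tests.
function affectedTests(changed: string[], graph: Record<string, string[]>): string[] {
  const importers = buildImporters(graph);
  const seen = new Set<string>(changed);
  const queue = changed.slice();
  while (queue.length > 0) {
    const file = queue.shift() as string;
    for (const importer of importers.get(file) ?? []) {
      if (!seen.has(importer)) {
        seen.add(importer);
        queue.push(importer);
      }
    }
  }
  return Array.from(seen).filter((f) => /\.(spec|test)\.ts$/.test(f)).sort();
}

console.log(affectedTests(["src/auth.ts"], importsOf));
```

Here `test/billing.spec.ts` is skipped because nothing on its import path touches `src/auth.ts`, which is the whole saving.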

On large projects, this can compress CI test time from tens of minutes to a few minutes.

5. Configuration and Performance Notes

Project config file (.codegraph/config.json):

{
  "version": 1,
  "languages": ["typescript", "javascript"],
  "exclude": ["node_modules/**", "dist/**", "build/**", "*.min.js"],
  "maxFileSize": 1048576,
  "extractDocstrings": true,
  "trackCallSites": true
}

SQLite backend selection:

CodeGraph ships with two SQLite backends:

Native better-sqlite3 (default, recommended): High performance, supports concurrent reads
WASM fallback: Better compatibility, but 5–10x slower than native, and concurrent operations may produce "database is locked" errors

If you encounter performance issues or lock errors:

# Rebuild the native module
npm rebuild better-sqlite3

# Check which backend is active
codegraph status

CLI Reference

codegraph                        # Run interactive installer
codegraph init [path]            # Initialize in a project
codegraph uninit [path]          # Remove CodeGraph from a project
codegraph index [path]           # Full index (--force to rebuild)
codegraph sync [path]            # Incremental update
codegraph status [path]          # Show statistics
codegraph query <search>         # Search symbols
codegraph files [path]           # Show file structure
codegraph context <task>         # Build AI context for a task
codegraph affected [files]       # Find affected test files
codegraph serve --mcp            # Start MCP server

Library API (embed CodeGraph in your own tools):

import CodeGraph from '@colbymchenry/codegraph';

const cg = await CodeGraph.init('/path/to/project');

// Full index with progress callbacks
await cg.indexAll({
  onProgress: (p) => console.log(`${p.phase}: ${p.current}/${p.total}`)
});

// Search symbols
const results = cg.searchNodes('UserService');

// Get call chain
const callers = cg.getCallers(results[0].node.id);

// Build AI context
const context = await cg.buildContext('fix login bug', {
  maxNodes: 20,
  includeCode: true
});

// Impact radius analysis (depth 2)
const impact = cg.getImpactRadius(results[0].node.id, 2);

cg.watch();   // Start file watching for auto-sync
cg.close();   // Clean up resources

Project Links & Resources

Official Resources

🌟 GitHub: https://github.com/colbymchenry/codegraph
📦 npm: @colbymchenry/codegraph
⚡ Quick install: npx @colbymchenry/codegraph

Target Audience

Heavy Claude Code / Cursor users: Working on large projects and looking to reduce cost and improve response speed
Large TypeScript/Rust/Python project developers: Codebases large enough that AI agent file-scanning overhead is noticeable
CI/CD engineers: Using codegraph affected for smart test selection to eliminate unnecessary full test runs
Toolchain developers: Embedding code semantic analysis into their own tools via the Library API

Summary

Key Takeaways

Core value: Inserts a pre-built semantic index between AI agents and codebases — average 35% cost savings, 70% fewer tool calls, 49% speed improvement
Technology choices: tree-sitter (AST parsing) + SQLite FTS5 (full-text search) + native OS file events (live sync) — zero external service dependencies
8 MCP tools: codegraph_context is the most critical — one call returns a complete context package for the task at hand
19 languages + 13 framework route detection covering mainstream development stacks
codegraph affected: dependency-traced smart test selection, a CI acceleration tool
Gains vary by codebase: Tokio (Rust, ~700 files) reaches 89% fewer tool calls, while the small Gin project (Go, ~150 files) sees 19%

One-Line Review

CodeGraph does something deceptively simple yet extremely practical: it converts the code discovery work that AI agents redo on every task into a reusable local index — not a feature addition, but a workflow architecture optimization.

