WonderLab

One Open Source Project a Day (No. 71): CodeGraph — Pre-Index Your Codebase for AI Agents, Save 35% Cost and 70% Tool Calls

Introduction

"~35% cheaper · ~70% fewer tool calls · 100% local"

This is article No. 71 in the "One Open Source Project a Day" series. Today's project is CodeGraph.

Start with a scenario: you ask Claude Code "How is AuthService being called?" Without any assistance, Claude's approach is: glob-scan directories, run multiple greps, read several files — then finally answer. The whole process might trigger 10–15 tool calls and consume hundreds of thousands of tokens.

CodeGraph's insight is to front-load this work: before you start, it has already parsed your codebase with tree-sitter into a semantic graph stored in a local SQLite database, then exposes 8 query tools to AI agents via MCP. When the agent needs to understand code, a single codegraph_context call returns entry points, related symbols, and code snippets — no file reading required.

The project has 9.6k stars and 588 forks. Benchmarks across 7 real open-source projects show, on average, 35% lower cost, 70% fewer tool calls, and 49% faster answers. On VS Code's large TypeScript repository, one architecture question dropped from 1.4M tokens to 393k, and its cost from $0.64 to $0.42.

What You Will Learn

  • CodeGraph's four-stage pipeline: Extract → Store → Resolve → Auto-Sync
  • The 8 MCP tools and when to use each
  • A detailed breakdown of benchmark results across 7 projects: which codebases benefit most, and why?
  • How 19-language support and 13-framework route recognition work
  • Complete setup walkthrough from installation to Claude Code integration
  • codegraph affected: using dependency tracing for smart CI test selection

Prerequisites

  • Familiarity with Claude Code, Cursor, or similar AI coding tools
  • Basic understanding of MCP (Model Context Protocol)
  • Node.js experience

Project Background

Project Introduction

CodeGraph is a local semantic code knowledge graph tool designed specifically to improve AI coding agent efficiency. Its core insight:

AI agents spend a large share of their tokens and time in the "discovery phase" (scanning directories, searching for symbols, reading files) rather than on actual reasoning and generation.

CodeGraph's solution is to hand the discovery phase to a pre-built index: the index exists before you start working, so AI agents can pull structured code knowledge directly instead of exploring the file system from scratch.

The technology choices are pragmatic: tree-sitter for AST parsing (mature, multi-language, high-performance), SQLite FTS5 for full-text search (zero external dependencies, fully local), and native OS file events for live sync (FSEvents/inotify/ReadDirectoryChangesW).

Author/Team

  • Author: Colby McHenry (GitHub: colbymchenry)
  • Repository: colbymchenry/codegraph
  • Distribution: npm package @colbymchenry/codegraph

Project Stats

  • ⭐ GitHub Stars: 9,600+
  • 🍴 Forks: 588
  • 📦 npm package: @colbymchenry/codegraph
  • 🔧 Runtime: Node.js 20–24
  • 💻 Platforms: Windows, macOS, Linux
  • 📄 License: MIT
  • 🌐 Repository: colbymchenry/codegraph

Main Features

Core Utility

CodeGraph inserts a pre-built index layer between AI agents and codebases:

Codebase (TypeScript / Python / Go / ...)
        ↓ tree-sitter parsing
  Semantic graph (symbols + relationships + call chains)
        ↓ stored in SQLite FTS5
  Local knowledge base
        ↓ exposed via MCP
  AI coding agents (Claude Code / Cursor / Codex CLI / OpenCode)

Without CodeGraph:

User: "How is AuthService being called?"
→ Agent: glob("src/**/*.ts")         # Tool call 1
→ Agent: grep("AuthService")         # Tool call 2
→ Agent: read("auth.service.ts")     # Tool call 3
→ Agent: grep("import.*Auth")        # Tool call 4
→ Agent: read("user.controller.ts")  # Tool call 5
→ Agent: read("app.module.ts")       # Tool call 6
... 10–15 total tool calls, massive token consumption

With CodeGraph:

User: "How is AuthService being called?"
→ Agent: codegraph_callers("AuthService")   # Tool call 1
→ Returns: full caller list + call sites + code snippets
→ Agent answers directly, no file reading needed

Quick Start

One-command install (recommended):

# Run the interactive installer — auto-detects installed AI agents and configures them
npx @colbymchenry/codegraph

# Initialize in your project (-i for interactive)
cd your-project
codegraph init -i

Non-interactive install (CI environments):

# Auto-detect all installed agents, global install
codegraph install --yes

# Target specific agents
codegraph install --target=cursor,claude --yes

# Project-local install
codegraph install --target=auto --location=local

Manual Claude Code configuration:

npm install -g @colbymchenry/codegraph

Add to ~/.claude.json (user scope) or to a project-level .mcp.json:

{
  "mcpServers": {
    "codegraph": {
      "type": "stdio",
      "command": "codegraph",
      "args": ["serve", "--mcp"]
    }
  }
}

Verify installation:

codegraph status          # Check index status and stats
codegraph query "UserService"  # Test symbol search

The 8 MCP Tools

The complete toolset CodeGraph exposes to AI agents:

Tool               Purpose                              Typical Invocation
──────────────────────────────────────────────────────────────────────────────────────────────────
codegraph_search   Find symbols by name                 "Find all functions called authenticate"
codegraph_context  Build code context for a task        "What code is relevant to the login flow?"
codegraph_callers  Find what calls a function           "What calls AuthService?"
codegraph_callees  Find what a function calls           "What does processPayment call internally?"
codegraph_impact   Analyze change impact radius         "What breaks if I change this function?"
codegraph_node     Get details about a specific symbol  "Show me UserController's full signature"
codegraph_files    Get indexed file structure           "What is the overall project structure?"
codegraph_status   Check index health and stats         "How many symbols are indexed? Last sync?"

codegraph_context is the most important tool. Rather than returning raw search results, it assembles a context package for a given task: entry points, related symbols, and code snippets.

# Command-line equivalent
codegraph context "fix user login bug"
# → Automatically finds login-related functions, call chains, and relevant files
#   packaged into context Claude can consume directly
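For readers new to MCP, the sketch below shows roughly what an agent sends when it invokes one of these tools: a JSON-RPC 2.0 `tools/call` request. The envelope follows the MCP specification; the argument names (`symbol`, `task`) are assumptions for illustration, not CodeGraph's documented input schema.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# The argument name "symbol" is hypothetical; the real names come from
# the tool's input schema, which a client can fetch with `tools/list`.
message = build_tool_call("codegraph_callers", {"symbol": "AuthService"})
print(message)
```

In practice the agent's MCP client builds this message for you; the point is that one such request replaces the chain of glob, grep, and read calls shown earlier.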

Project Advantages

Dimension                  CodeGraph              Native AI Agent (no assist)  Other code indexers
──────────────────────────────────────────────────────────────────────────────────────────────────────
Tool call count            ~70% fewer             High (re-scans each task)    Partial reduction
Token usage                ~59% fewer             High                         Partial reduction
Data privacy               100% local             Depends on agent             Most require uploads
Real-time sync             Native OS file events  N/A                          Usually polling or manual
Language support           19+ languages          Depends on agent             Usually 3–5
Framework route detection  13 frameworks          None                         Rare
Installation complexity    One npx command        N/A                          Usually requires server

Detailed Analysis

1. The Four-Stage Pipeline

Stage 1: Extraction

tree-sitter parses source files into ASTs, extracting:

  • Symbols: functions, classes, methods, interfaces, variable definitions
  • Relationships: function calls, module imports, class inheritance, interface implementations

tree-sitter's key advantage: it is a fault-tolerant parser — it can extract partial structure even when code has syntax errors. This is critical for indexing files that are actively being edited.
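To make the extraction stage concrete, here is a minimal sketch that pulls symbols and call relationships out of a syntax tree. It uses Python's built-in `ast` module as a stand-in for tree-sitter, so unlike tree-sitter it is not fault-tolerant and only handles Python; the `extract` function and its output shape are illustrative assumptions, not CodeGraph's implementation.

```python
import ast

SOURCE = '''
class AuthService:
    def login(self, user):
        return check_password(user)

def check_password(user):
    return True
'''


def extract(source: str):
    """Return (symbols, calls) found in a Python source string."""
    symbols = []  # (kind, qualified name, line)
    calls = []    # (caller qualified name, callee name as written)

    def visit(node, scope):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "class" if isinstance(child, ast.ClassDef) else "function"
                symbols.append((kind, ".".join(scope + [child.name]), child.lineno))
                visit(child, scope + [child.name])
                continue
            if isinstance(child, ast.Call) and scope:
                func = child.func
                # Plain call `f(x)` or attribute call `obj.f(x)`
                callee = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
                if callee:
                    calls.append((".".join(scope), callee))
            visit(child, scope)

    visit(ast.parse(source), [])
    return symbols, calls


symbols, calls = extract(SOURCE)
print(symbols)
print(calls)
```

Note that the callee is still just a name at this point. Turning it into a specific definition is the job of the resolution stage.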

Stage 2: Storage

All data lands in a local SQLite database using the FTS5 (Full-Text Search 5) extension:

-- Symbols table (simplified)
CREATE VIRTUAL TABLE symbols USING fts5(
  name,          -- Symbol name
  kind,          -- function/class/method/...
  file_path,     -- Source file
  line_start,    -- Starting line
  signature,     -- Function signature
  docstring,     -- Documentation comment
  code_snippet   -- Code excerpt
);

-- Relationships table
CREATE TABLE edges (
  from_id  INTEGER,  -- Caller symbol ID
  to_id    INTEGER,  -- Callee symbol ID
  kind     TEXT,     -- calls/imports/inherits/implements
  file     TEXT,
  line     INTEGER
);
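As a rough illustration of why an edges table makes "who calls X?" cheap, the sketch below loads a few rows into an in-memory SQLite database and answers the question with one join. The schema is a simplified variant of the one above (a plain `symbols` table instead of FTS5, so it runs on any SQLite build), and the rows and the `callers_of` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symbols (id INTEGER PRIMARY KEY, name TEXT, kind TEXT,
                      file_path TEXT, line_start INTEGER);
CREATE TABLE edges (from_id INTEGER, to_id INTEGER, kind TEXT,
                    file TEXT, line INTEGER);
CREATE INDEX edges_by_target ON edges(to_id, kind);
""")
conn.executemany("INSERT INTO symbols VALUES (?, ?, ?, ?, ?)", [
    (1, "AuthService.login", "method", "auth.service.ts", 12),
    (2, "UserController.login", "method", "user.controller.ts", 30),
    (3, "AdminController.impersonate", "method", "admin.controller.ts", 8),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?)", [
    (2, 1, "calls", "user.controller.ts", 34),
    (3, 1, "calls", "admin.controller.ts", 15),
])


def callers_of(name: str):
    """Return (caller name, file, line) for every call edge into `name`."""
    rows = conn.execute("""
        SELECT caller.name, e.file, e.line
        FROM edges AS e
        JOIN symbols AS callee ON callee.id = e.to_id
        JOIN symbols AS caller ON caller.id = e.from_id
        WHERE callee.name = ? AND e.kind = 'calls'
        ORDER BY e.file
    """, (name,))
    return rows.fetchall()


print(callers_of("AuthService.login"))
```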

Stage 3: Resolution

The critical step: resolving abstract "called something named X" into concrete "called the definition in file Y at line Z."

Source code: import { AuthService } from './auth.service'
             ...
             this.authService.login(user)
                    ↓ resolution
Graph edges: UserController.login → AuthService.login (calls)
             UserController → AuthService (imports)
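The resolution step can be sketched as a lookup through the caller file's import table. This is a toy model under strong assumptions: the receiver's type (`AuthService`) is taken as already known, and the `DEFINITIONS` and `IMPORTS` tables, file paths, and line numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Definition:
    name: str
    file: str
    line: int


# Definitions found during extraction, keyed by (file, qualified name)
DEFINITIONS = {
    ("src/auth.service.ts", "AuthService"):
        Definition("AuthService", "src/auth.service.ts", 5),
    ("src/auth.service.ts", "AuthService.login"):
        Definition("AuthService.login", "src/auth.service.ts", 9),
}

# Per-file import table: local name -> file that exports it
IMPORTS = {
    "src/user.controller.ts": {"AuthService": "src/auth.service.ts"},
}


def resolve_call(caller_file: str, type_name: str, method: str) -> Optional[Definition]:
    """Map "called something named X" to a concrete definition, if known."""
    # Follow the import if there is one; otherwise look in the caller's own file.
    target_file = IMPORTS.get(caller_file, {}).get(type_name, caller_file)
    return DEFINITIONS.get((target_file, f"{type_name}.{method}"))


print(resolve_call("src/user.controller.ts", "AuthService", "login"))
```

A real resolver also has to handle re-exports, aliases, and inferred receiver types, which is where most of the complexity lives.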

Stage 4: Auto-Sync

Uses native OS file events (not polling!) to detect changes:

  • macOS: FSEvents
  • Linux: inotify
  • Windows: ReadDirectoryChangesW

A 2-second debounce prevents triggering mass rebuilds when files change rapidly — it waits for changes to settle before doing incremental updates.
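The debounce idea can be sketched without any file-watching library. The class below collects changed paths and releases them as one batch only after the configured quiet period has passed; the clock is passed in explicitly so the behavior is easy to follow. The `Debouncer` name and its methods are illustrative assumptions, not CodeGraph's code.

```python
class Debouncer:
    """Batch file-change events until no new event arrives for `delay` seconds."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self.pending = set()      # paths changed since the last flush
        self.last_event = None    # timestamp of the most recent event

    def notify(self, path, now):
        """Record a change; every new event restarts the quiet period."""
        self.pending.add(path)
        self.last_event = now

    def poll(self, now):
        """Return the batch to re-index, or [] if changes have not settled."""
        if self.last_event is None or now - self.last_event < self.delay:
            return []
        batch = sorted(self.pending)
        self.pending.clear()
        self.last_event = None
        return batch


debouncer = Debouncer(delay=2.0)
debouncer.notify("src/a.ts", now=0.0)
debouncer.notify("src/b.ts", now=1.5)   # arrives before the first change settled
print(debouncer.poll(now=3.0))          # still inside the quiet period
print(debouncer.poll(now=3.5))          # quiet for 2 seconds: flush both files
```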


2. Benchmark Deep Dive

Test conditions: Claude Code (headless, Opus 4.7) answering architecture questions. Each result is the median of 4 runs on the same question, across 7 real open-source repositories.

Project        Language       Size            Cost ↓  Token ↓  Speed ↑  Tool Calls ↓
──────────────────────────────────────────────────────────────────────────────────────
VS Code        TypeScript     ~10k files      35%     73%      41%      72%
Excalidraw     TypeScript     ~600 files      47%     73%      60%      86%
Django         Python         ~2.7k files     34%     64%      59%      81%
Tokio          Rust           ~700 files      52%     81%      63%      89%
OkHttp         Java           ~640 files      17%     41%      36%      64%
Gin            Go             ~150 files      22%     23%      34%      19%
Alamofire      Swift          ~100 files      38%     59%      51%      77%
──────────────────────────────────────────────────────────────────────────────────────
Average                                       35%     59%      49%      70%

Patterns worth noting:

  1. Tokio (Rust, 700 files) sees the biggest gains (81% token reduction, 89% fewer tool calls): Rust's type system is complex — agents originally needed extensive file exploration to understand trait implementations and generic relationships. CodeGraph's pre-built relationships make this dramatically cheaper.

  2. Gin (Go, 150 files) sees the smallest gains (23% token reduction, 19% fewer tool calls): Small Go projects have simple file structures. Agents can already navigate them efficiently, so CodeGraph's marginal value is lower.

  3. VS Code's absolute numbers are the most striking: the same question costs $0.64 (1.4M tokens) without CodeGraph, $0.42 (393k tokens) with it. A single task saves $0.22.

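Takeaway: In these results, the benefit tracks how much discovery work the agent would otherwise do. Complex dependencies and a rich type system (Tokio) matter at least as much as raw file count: Alamofire (~100 files) still sees 77% fewer tool calls, while OkHttp (~640 files) sees 64% fewer and only 17% lower cost. For developers using Claude Code heavily on large or intricate projects, the ROI is clear.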
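The averages in the table can be checked with a few lines of arithmetic. The rows below are copied from the benchmark table; the VS Code check uses the rounded headline figures (1.4M tokens, 393k tokens, $0.64, $0.42), so it may differ from the table's per-project percentages by a point.

```python
from statistics import mean

# (project, cost, tokens, speed, tool calls) as percentages from the table
ROWS = [
    ("VS Code", 35, 73, 41, 72),
    ("Excalidraw", 47, 73, 60, 86),
    ("Django", 34, 64, 59, 81),
    ("Tokio", 52, 81, 63, 89),
    ("OkHttp", 17, 41, 36, 64),
    ("Gin", 22, 23, 34, 19),
    ("Alamofire", 38, 59, 51, 77),
]

# Column-wise averages, rounded to whole percentages
averages = [round(mean(column)) for column in zip(*(row[1:] for row in ROWS))]
print(averages)

# VS Code example, recomputed from the rounded headline numbers
tokens_saved = 1 - 393_000 / 1_400_000
cost_saved = 1 - 0.42 / 0.64
print(f"tokens: {tokens_saved:.0%}, cost: {cost_saved:.0%}")
```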


3. 19 Languages + 13 Framework Route Detection

Language support (via tree-sitter grammars):

TypeScript, JavaScript, Python, Go, Rust, Java, C#, PHP, Ruby, C, C++, Swift, Kotlin, Dart, Svelte, Vue, Liquid, Pascal/Delphi, Scala

Framework route detection is a differentiating feature — CodeGraph doesn't just recognize symbols, it understands the mapping between URL routes and their handler functions:

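# Django (urls.py)
from django.urls import path
from .views import UserDetailView  # hypothetical view module

urlpatterns = [
    path('users/<int:pk>/', UserDetailView.as_view()),
]
# → CodeGraph knows GET /users/{id}/ maps to UserDetailView

# FastAPI (main.py)
from fastapi import FastAPI

app = FastAPI()

@app.get("/items/{item_id}")
async def read_item(item_id: int):
    ...
# → CodeGraph knows GET /items/{id} maps to read_item()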

The 13 supported frameworks: Django, Flask, FastAPI, Express, NestJS, Laravel, Rails, Spring, Gin/chi/gorilla/mux, Axum/actix/Rocket, ASP.NET, Vapor, React Router/SvelteKit.

This means AI agents can ask "Where is the handler for /api/users/:id?" and get a precise answer, without needing to scan routing config files.
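One way to picture route detection is as a pass over decorators in the syntax tree. The sketch below does this for FastAPI-style decorators using Python's `ast` module rather than tree-sitter; `extract_routes` and the sample source are assumptions for illustration, and real framework support needs per-framework rules (Django's `urlpatterns`, Express's `app.get(...)` calls, and so on).

```python
import ast

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}

SOURCE = '''
@app.get("/items/{item_id}")
async def read_item(item_id: int):
    ...

@app.post("/items")
def create_item():
    ...

def helper():
    ...
'''


def extract_routes(source: str) -> dict:
    """Map (HTTP method, path) to the handler function's name."""
    routes = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for deco in node.decorator_list:
            # Match decorators shaped like `@<obj>.<method>("<path>", ...)`
            if (isinstance(deco, ast.Call)
                    and isinstance(deco.func, ast.Attribute)
                    and deco.func.attr in HTTP_METHODS
                    and deco.args
                    and isinstance(deco.args[0], ast.Constant)
                    and isinstance(deco.args[0].value, str)):
                routes[(deco.func.attr.upper(), deco.args[0].value)] = node.name
    return routes


print(extract_routes(SOURCE))
```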


4. codegraph affected — Smart CI Test Selection

An underappreciated feature: by tracing import dependencies, it identifies which test files are actually affected by changed source files.

# CI scenario: only run tests affected by this change
git diff --name-only | codegraph affected --stdin

# Manually specify changed files
codegraph affected src/auth.ts

# With filter (only e2e tests)
codegraph affected src/auth.ts --filter "e2e/*"

How it works:

Changed: src/auth.ts
  ↓ CodeGraph queries the dependency graph
  Direct importers: user.service.ts, auth.controller.ts, auth.spec.ts
  Indirect importers: app.module.ts, user.service.spec.ts, integration.test.ts
  ↓ Filter to test files only
  Affected tests: auth.spec.ts, user.service.spec.ts, integration.test.ts
  ↓ Output
  [these files] ← run only these, not the full test suite

On large projects, this can compress CI test time from tens of minutes to a few minutes.
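The tracing behind test selection amounts to a breadth-first walk over reverse import edges. The sketch below shows that idea on a hypothetical import graph; the `affected_tests` function, the file names, and the test-file globs are assumptions, not the CLI's actual logic.

```python
from collections import deque
from fnmatch import fnmatch

# file -> files it imports (hypothetical project)
IMPORTS = {
    "src/user.service.ts": ["src/auth.ts"],
    "src/auth.controller.ts": ["src/auth.ts"],
    "src/auth.spec.ts": ["src/auth.ts"],
    "src/user.service.spec.ts": ["src/user.service.ts"],
    "src/app.module.ts": ["src/auth.controller.ts"],
    "e2e/integration.test.ts": ["src/app.module.ts"],
    "src/billing.spec.ts": ["src/billing.ts"],
}


def affected_tests(changed, imports, test_globs=("*.spec.ts", "*.test.ts")):
    """Return test files that directly or transitively import a changed file."""
    # Invert the graph: file -> set of files that import it
    importers = {}
    for importer, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(importer)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for importer in importers.get(current, ()):
            if importer not in seen:
                seen.add(importer)
                queue.append(importer)

    return sorted(f for f in seen if any(fnmatch(f, g) for g in test_globs))


print(affected_tests(["src/auth.ts"], IMPORTS))
```

A selection like this is only as good as the import graph: dependencies that do not show up as imports (dynamic requires, fixtures, config files) will not be traced.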


5. Configuration and Performance Notes

Project config file (.codegraph/config.json):

{
  "version": 1,
  "languages": ["typescript", "javascript"],
  "exclude": ["node_modules/**", "dist/**", "build/**", "*.min.js"],
  "maxFileSize": 1048576,
  "extractDocstrings": true,
  "trackCallSites": true
}
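To show how `exclude` and `maxFileSize` might interact, here is a small sketch that decides whether a file should be indexed. It matches patterns with Python's `fnmatch`, whose `*` also matches path separators; CodeGraph's real glob semantics may differ, and `should_index` is a hypothetical helper.

```python
import json
from fnmatch import fnmatch

CONFIG = json.loads("""
{
  "version": 1,
  "languages": ["typescript", "javascript"],
  "exclude": ["node_modules/**", "dist/**", "build/**", "*.min.js"],
  "maxFileSize": 1048576,
  "extractDocstrings": true,
  "trackCallSites": true
}
""")


def should_index(path: str, size_bytes: int, config: dict) -> bool:
    """Skip files that are too large or match any exclude pattern."""
    if size_bytes > config["maxFileSize"]:   # 1048576 bytes = 1 MiB
        return False
    return not any(fnmatch(path, pattern) for pattern in config["exclude"])


for path, size in [
    ("src/app.ts", 2_048),
    ("node_modules/lodash/index.js", 2_048),
    ("src/generated/schema.ts", 2_000_000),
    ("vendor/jquery.min.js", 90_000),
]:
    print(path, should_index(path, size, CONFIG))
```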

SQLite backend selection:

CodeGraph ships with two SQLite backends:

  • Native better-sqlite3 (default, recommended): High performance, supports concurrent reads
  • WASM fallback: Better compatibility, but 5–10x slower than native, and concurrent operations may produce "database is locked" errors

If you encounter performance issues or lock errors:

# Rebuild the native module
npm rebuild better-sqlite3

# Check which backend is active
codegraph status

CLI Reference

codegraph                        # Run interactive installer
codegraph init [path]            # Initialize in a project
codegraph uninit [path]          # Remove CodeGraph from a project
codegraph index [path]           # Full index (--force to rebuild)
codegraph sync [path]            # Incremental update
codegraph status [path]          # Show statistics
codegraph query <search>         # Search symbols
codegraph files [path]           # Show file structure
codegraph context <task>         # Build AI context for a task
codegraph affected [files]       # Find affected test files
codegraph serve --mcp            # Start MCP server

Library API (embed CodeGraph in your own tools):

import CodeGraph from '@colbymchenry/codegraph';

const cg = await CodeGraph.init('/path/to/project');

// Full index with progress callbacks
await cg.indexAll({
  onProgress: (p) => console.log(`${p.phase}: ${p.current}/${p.total}`)
});

// Search symbols
const results = cg.searchNodes('UserService');

// Get call chain
const callers = cg.getCallers(results[0].node.id);

// Build AI context
const context = await cg.buildContext('fix login bug', {
  maxNodes: 20,
  includeCode: true
});

// Impact radius analysis (depth 2)
const impact = cg.getImpactRadius(results[0].node.id, 2);

cg.watch();   // Start file watching for auto-sync

// Later, when your tool shuts down:
cg.close();   // Clean up resources

Target Audience

  • Heavy Claude Code / Cursor users: Working on large projects and looking to reduce cost and improve response speed
  • Large TypeScript/Rust/Python project developers: Codebases large enough that AI agent file-scanning overhead is noticeable
  • CI/CD engineers: Using codegraph affected for smart test selection to eliminate unnecessary full test runs
  • Toolchain developers: Embedding code semantic analysis into their own tools via the Library API

Summary

Key Takeaways

  1. Core value: Inserts a pre-built semantic index between AI agents and codebases — average 35% cost savings, 70% fewer tool calls, 49% speed improvement
  2. Technology choices: tree-sitter (AST parsing) + SQLite FTS5 (full-text search) + native OS file events (live sync) — zero external service dependencies
  3. 8 MCP tools: codegraph_context is the most critical — one call returns a complete context package for the task at hand
  4. 19 languages + 13 framework route detection covering mainstream development stacks
  5. codegraph affected: dependency-traced smart test selection, a CI acceleration tool
  6. Gains vary by project: Tokio (Rust, ~700 files) reaches 89% fewer tool calls, while Gin (Go, ~150 files) sees 19%

One-Line Review

CodeGraph does something deceptively simple yet extremely practical: it converts the code discovery work that AI agents redo on every task into a reusable local index — not a feature addition, but a workflow architecture optimization.


