Linghua Jin

Posted on Dec 15, 2025

Build a Real-Time Codebase Index in 5 Minutes with CocoIndex (Rust + Tree-sitter)

#rust #ai #python #opensource

Why Another Codebase Indexing Tool?

Let's be honest: managing code context for AI agents is a nightmare.

Your AI coding assistant needs to understand your entire codebase—not just one file at a time. Whether you're building RAG systems for Claude, context for Cursor, or semantic code search, you need:

✅ Fast, incremental updates (not rebuilding everything)
✅ Proper code parsing (not just text chunking)
✅ Vector embeddings for semantic search
✅ Real-time sync when your code changes

That's exactly what CocoIndex delivers.

What Makes CocoIndex Special?

Built-in Tree-sitter Support: Unlike generic text splitters, CocoIndex uses Tree-sitter to parse your code semantically. It understands functions, classes, and code structure—not just lines of text.

Incremental Processing: Only reprocess what changed. No more waiting 10 minutes every time you update a single file.

Native Vector Search: Built-in support for embedding generation and vector search with PostgreSQL + pgvector.

MCP Compatible: Works seamlessly with AI editors like Cursor, Windsurf, and Claude.
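To make the "parsing, not text chunking" point concrete, here is a minimal sketch of the difference. It is not CocoIndex's splitter: it swaps in Python's standard-library `ast` module as a stand-in for Tree-sitter so it runs with no dependencies, and `naive_chunks`, `syntax_chunks`, and `SOURCE` are hypothetical names made up for this illustration.

```python
import ast

# A tiny sample "file" to chunk (hypothetical content).
SOURCE = '''\
import os

def load(path):
    with open(path) as f:
        return f.read()

class Indexer:
    def run(self):
        return "ok"
'''

def naive_chunks(source: str, lines_per_chunk: int) -> list[str]:
    """Split on a fixed line count, ignoring code structure."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

def syntax_chunks(source: str) -> list[str]:
    """Emit one chunk per top-level function or class, using the parse tree."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks

if __name__ == "__main__":
    # The fixed-size split cuts load() in half; the syntax-aware split keeps it whole.
    print(naive_chunks(SOURCE, 4)[0])
    print("---")
    print(syntax_chunks(SOURCE)[0])
```

With a four-line window, the first naive chunk ends in the middle of `load()`, while the syntax-aware version returns the function and the class as two intact chunks. A Tree-sitter based splitter applies the same idea across many languages.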

Real-World Use Cases

🤖 AI Coding Agents: Give Claude, Codex, or Gemini the right code context

🔍 Semantic Code Search: Find code by meaning, not keywords

📝 Auto Documentation: Keep design docs synced with actual code

🔧 Code Review Automation: AI-powered PR analysis

🚨 SRE Workflows: Index infrastructure-as-code for incident response

Tutorial: Build Your Codebase Index

Let me show you how ridiculously simple this is.

Step 1: Install

pip install -U cocoindex

You'll also need PostgreSQL with the pgvector extension. Installation guide here.

Step 2: Define Your Flow

Create a flow that reads your codebase, chunks it with Tree-sitter, and generates embeddings:

import os
import cocoindex

@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(
    flow_builder: cocoindex.FlowBuilder,
    data_scope: cocoindex.DataScope
):
    # Load your codebase
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path=os.path.join('..', '..'),
            included_patterns=["*.py", "*.rs", "*.toml"],
            excluded_patterns=[".*", "target", "**/node_modules"]
        )
    )

    code_embeddings = data_scope.add_collector()
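If the include/exclude patterns feel abstract, the sketch below approximates the filtering idea with Python's `fnmatch`. This is an assumption-heavy illustration, not CocoIndex's actual matcher: `is_selected` is a hypothetical helper, the exclude check is applied to each path component, and `**/node_modules` is simplified to `node_modules`.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

INCLUDED = ["*.py", "*.rs", "*.toml"]
# Simplified: "**/node_modules" from the flow is approximated as "node_modules".
EXCLUDED = [".*", "target", "node_modules"]

def is_selected(path: str) -> bool:
    """Approximate check: would this relative path be picked up by the source?"""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    # Skip the file if any directory or file name on the path matches an exclude pattern.
    if any(fnmatch(part, pattern) for part in parts for pattern in EXCLUDED):
        return False
    # Otherwise keep it only if the file name matches an include pattern.
    return any(fnmatch(parts[-1], pattern) for pattern in INCLUDED)

if __name__ == "__main__":
    for candidate in ["src/main.py", "target/debug/build.rs", ".venv/lib/site.py", "README.md"]:
        print(candidate, is_selected(candidate))
```

Under these assumptions, `src/main.py` is indexed, while anything under `target`, a dot-directory, or `node_modules` is skipped, as is any file whose extension is not in the include list.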

Step 3: Extract Language & Chunk Code

@cocoindex.op.function()
def extract_extension(filename: str) -> str:
    return os.path.splitext(filename)[1]

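# The following goes inside code_embedding_flow, after add_collector().
with data_scope["files"].row() as file:
    # Extract the extension so the splitter can pick a Tree-sitter grammar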
    file["extension"] = file["filename"].transform(extract_extension)

    # Chunk code semantically
    file["chunks"] = file["content"].transform(
        cocoindex.functions.SplitRecursively(),
        language=file["extension"],
        chunk_size=1000,
        chunk_overlap=300
    )
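To get a feel for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch: a fixed-width sliding window over characters. `SplitRecursively` itself prefers syntax boundaries, so its chunks are not fixed-width like these. `sliding_chunks` is a hypothetical helper, and measuring sizes in characters is an assumption made for simplicity.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

if __name__ == "__main__":
    print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
    # ['abcd', 'cdef', 'efgh', 'ghij']
```

In this simplified model, `chunk_size=1000` with `chunk_overlap=300` means each chunk starts 700 characters after the previous one, so code near a chunk boundary appears in both neighbors and keeps some surrounding context.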

Step 4: Embed & Index

@cocoindex.transform_flow()
def code_to_embedding(
    text: cocoindex.DataSlice[str]
) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

# Still inside code_embedding_flow, nested within the file row block from Step 3.
with file["chunks"].row() as chunk:
    chunk["embedding"] = chunk["text"].call(code_to_embedding)

    code_embeddings.collect(
        filename=file["filename"],
        location=chunk["location"],
        code=chunk["text"],
        embedding=chunk["embedding"]
    )

# Inside code_embedding_flow, after the row blocks: export to PostgreSQL with a vector index
code_embeddings.export(
    "code_embeddings",
    cocoindex.storages.Postgres(),
    primary_key_fields=["filename", "location"],
    vector_indexes=[
        cocoindex.VectorIndex(
            "embedding",
            cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY
        )
    ]
)
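The vector index above uses cosine similarity as its metric. As a rough illustration of what that measures, here is a plain-Python sketch. The helper names are hypothetical, and a real index computes this inside the database rather than in Python.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form of the same metric: smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)

if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Because only the direction of the vectors matters, two chunks can score as similar even if their embeddings differ in magnitude. pgvector's `<=>` operator returns the distance form, which is why the query in Step 6 converts it back with `1.0 - distance`.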

Step 5: Run It

cocoindex update main

Boom. Your codebase is now indexed with semantic embeddings.
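Running the update again should only touch what changed. The sketch below shows one common way to get that behavior, content fingerprinting. It illustrates the general idea only and is not a description of CocoIndex's internals; `fingerprint` and `plan_update` are hypothetical names.

```python
import hashlib

def fingerprint(content: str) -> str:
    """Stable hash of a file's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def plan_update(previous: dict[str, str], current_files: dict[str, str]):
    """Compare stored fingerprints with the current file contents.

    previous:      path -> fingerprint recorded by the last run
    current_files: path -> file content right now
    Returns (paths to (re)process, paths to delete from the index, new state).
    """
    new_state = {path: fingerprint(text) for path, text in current_files.items()}
    to_process = sorted(p for p, h in new_state.items() if previous.get(p) != h)
    to_delete = sorted(p for p in previous if p not in new_state)
    return to_process, to_delete, new_state

if __name__ == "__main__":
    files = {"a.py": "print(1)", "b.py": "print(2)"}
    changed, removed, state = plan_update({}, files)      # first run: everything is new
    files["b.py"] = "print(3)"                            # edit one file
    changed, removed, state = plan_update(state, files)
    print(changed, removed)                               # ['b.py'] []
```

Only `b.py` needs to be chunked and embedded again on the second run; the unchanged file is skipped entirely.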

Step 6: Query Your Index

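# Assumes psycopg_pool's ConnectionPool, which matches the pool.connection() usage below
from psycopg_pool import ConnectionPool

def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Name of the table created by the "code_embeddings" export target
    table_name = cocoindex.utils.get_target_storage_default_name(
        code_embedding_flow, "code_embeddings"
    )

    # Embed the query with the same transform flow used at indexing time
    query_vector = code_to_embedding.eval(query)

    with pool.connection() as conn:
        with conn.cursor() as cur:
            # <=> is pgvector's cosine distance operator (smaller = more similar)
            cur.execute(f"""
                SELECT filename, code, embedding <=> %s::vector AS distance
                FROM {table_name}
                ORDER BY distance LIMIT %s
            """, (query_vector, top_k))

            # Convert cosine distance back to a similarity score
            return [{
                "filename": row[0],
                "code": row[1],
                "score": 1.0 - row[2]
            } for row in cur.fetchall()]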

Now you can search your codebase semantically:

python main.py
# Enter: "authentication middleware"
# Returns relevant auth code across your entire codebase
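The prompt in that session comes from a small interactive loop in `main.py`, which this post does not show. Here is a minimal sketch of what such a loop might look like. `run_search_loop` and `format_result` are hypothetical helpers, and the search function is passed in as a parameter so the loop runs without a database. CocoIndex initialization and connection-pool setup are left out, since they depend on your environment.

```python
from typing import Callable

def format_result(result: dict) -> str:
    """Render one search hit as '[score] filename' followed by the code."""
    return f"[{result['score']:.3f}] {result['filename']}\n    {result['code']}"

def run_search_loop(
    search_fn: Callable[[str], list[dict]],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> None:
    """Prompt for queries until an empty line is entered."""
    while True:
        query = read("Enter search query (or Enter to quit): ")
        if not query.strip():
            break
        results = search_fn(query)
        if not results:
            write("No results.")
        for result in results:
            write(format_result(result))

if __name__ == "__main__":
    # Stand-in for the real search; swap in something like: lambda q: search(pool, q)
    def fake_search(query: str) -> list[dict]:
        return [{"filename": "auth.py", "code": "def login(): ...", "score": 0.75}]

    run_search_loop(fake_search)
```

Keeping the search function as a parameter also makes the loop easy to test with canned results before wiring it to PostgreSQL.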

Language Support

CocoIndex supports all major languages via Tree-sitter:

Python, JavaScript, TypeScript
Rust, Go, C, C++, Java
Ruby, PHP, Swift, Kotlin
And 30+ more

Full language list here.

Visualize with CocoInsight

Want to debug your indexing flow visually?

cocoindex server -ci main

This starts a local CocoIndex server. Open CocoInsight at https://cocoindex.io/cocoinsight to inspect your data flow step by step.

Why You Should Care

AI coding tools are only as good as the context you give them. If you're building:

AI agents that need code awareness
Semantic code search engines
Automated documentation generators
Code review automation

...you need a proper codebase index.

CocoIndex makes it stupidly simple.

Try It Yourself

⭐ Star the repo: github.com/cocoindex-io/cocoindex

📖 Read the docs: cocoindex.io/docs

🎥 Watch the tutorial: YouTube guide

💬 Join Discord: discord.com/invite/zpA9S2DR7s

What are you building with AI code tools? Drop a comment below—I'd love to hear your use case!

If you found this useful, give CocoIndex a star on GitHub. It's open source and built by developers who actually understand the pain of managing code context for AI. 🚀

DEV Community