Linghua Jin

Stop Grepping Your Monorepo: Real-Time Codebase Indexing with CocoIndex

Real-time codebase indexing with CocoIndex lets you turn a messy, evolving repo into a live semantic API that your AI tools, editors, and SRE workflows can query in milliseconds.

Why codebase indexing matters

Most AI coding agents and RAG stacks fall apart on real-world code because they rely on brittle regex search, static embeddings, or manual sync scripts that constantly drift out of date. A proper index solves three hard problems at once: semantic chunking (what to embed), incremental updates (what to reprocess), and fast similarity search (how to query). CocoIndex packages these into a declarative flow: define sources, transforms, and storage once, then keep your index fresh with a single CLI command.

What you can build with it

Once your repo is indexed, you get a universal "code context service" that many tools can plug into. Some examples:

  • AI coding agents (Claude, Gemini CLI, etc.) that can pull precise, up-to-date snippets across the whole monorepo instead of just the open file.
  • MCP-style backends for editors like Cursor, Windsurf, and VS Code that answer "where is this configured?" or "who calls this function?" with semantic search, not grep.
  • Code review and refactoring assistants that reason across multiple services, configs, and docs for large migrations or safety checks.
  • SRE workflows that index infra-as-code, deployment scripts, and configs so you can ask questions like "what changes touched this service's timeout in the last month?"
  • Auto-generated design docs that stay in sync with the actual implementation by querying indexed code instead of stale wiki pages.

Architecture: a flow, not a script

CocoIndex is not "yet another Python script around an embedding model." It gives you a flow definition that describes how data moves from raw files to vector storage, and it tracks enough metadata to support incremental recomputation. For a codebase index, the high-level flow looks like this:

  1. Read files from the local filesystem via the LocalFile source.
  2. Derive language info from filenames so Tree-sitter can parse correctly.
  3. Split code into semantic chunks using SplitRecursively instead of naive fixed-size windows.
  4. Compute embeddings for each chunk with a SentenceTransformer model.
  5. Store everything into Postgres as a vector table with an index on the embedding column.

This flow is declared once in Python with @cocoindex.flow_def, and CocoIndex turns it into a reproducible pipeline that can be updated with cocoindex update main whenever your repo changes.
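The five steps above can be sketched as a flow definition. This is a condensed sketch modeled on CocoIndex's published code-embedding example; the flow name and path are placeholders, and the remaining steps are fleshed out section by section in the rest of this post.

```python
import cocoindex

# Sketch of the flow described above (name and path are illustrative).
# The decorator registers the flow; the body declares how data moves
# from source files toward the vector store.
@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
):
    # Step 1: read files from the local filesystem.
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path=".")
    )
    # Steps 2-5 (language detection, chunking, embeddings, Postgres
    # export) attach to data_scope["files"] inside this same body.
```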

Adding your repo as a source

The first step is teaching the flow where your code lives and which files to care about. Using the LocalFile source, you can:

  • Include extensions that matter for your stack (for example: .py, .rs, .toml, .md, .mdx).
  • Exclude noise like dotfiles, build artifacts (target), or dependency trees such as **/node_modules.

flow_builder.add_source materializes this as a table with at least filename and content columns, which becomes the foundation for all downstream steps.
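Concretely, the add_source call might look like the following fragment, which lives inside the flow body; the path and patterns shown are examples for a Python/Rust stack, not a complete list.

```python
# Fragment from inside the flow body: register the repo as a source
# table, filtered by include/exclude patterns (examples only).
data_scope["files"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(
        path=".",
        included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
        excluded_patterns=[".*", "target", "**/node_modules"],
    )
)
# Each row of data_scope["files"] now carries at least a `filename`
# and a `content` field for the downstream steps.
```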

Tree-sitter powered chunking

Most RAG examples still split code by character count or line count, which cuts functions and classes in half and destroys structure. CocoIndex leans on Tree-sitter plus its SplitRecursively function to chunk code along syntactic boundaries, so each chunk is a coherent unit like a function, method, or small logical block.

Getting the language right

Tree-sitter needs to know what language it is parsing. The flow defines a tiny extract_extension function that takes a filename and returns its extension, and then stores this as a new extension field for each file. That extension is then passed into SplitRecursively as the language parameter, which lets CocoIndex pick the right parser for each file type.
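A minimal version of that helper is a couple of lines of standard-library Python; it is shown here as a plain function, while in the flow it is registered as a CocoIndex op and applied row by row to the filename field.

```python
import os

def extract_extension(filename: str) -> str:
    """Return the file extension including the dot, e.g. ".py"."""
    return os.path.splitext(filename)[1]
```

For example, extract_extension("src/lib.rs") yields ".rs", which SplitRecursively can then map to the Rust parser.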

Semantic chunks with overlap

Within each file row, the flow calls SplitRecursively on the content column to produce a chunks collection that contains both the text and a location field telling you where in the file the chunk came from. You can configure chunk_size (for example, around 1000 tokens/characters) and chunk_overlap (for example, 300) so neighboring chunks have context continuity, which improves retrieval quality when your query touches code that lies on the boundary.
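In code, the chunking step is a single transform on the file's content. This is a fragment from inside the flow body; the size and overlap values mirror the examples above.

```python
# Fragment from the flow body: split each file's content into
# syntax-aware chunks. `file` is one row of data_scope["files"].
with data_scope["files"].row() as file:
    file["chunks"] = file["content"].transform(
        cocoindex.functions.SplitRecursively(),
        language=file["extension"],  # set earlier by extract_extension
        chunk_size=1000,             # target size per chunk
        chunk_overlap=300,           # shared context between neighbors
    )
```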

Embeddings and vector storage

Once you have clean, language-aware chunks, the next step is to embed them into a vector space. CocoIndex uses a @cocoindex.transform_flow called code_to_embedding which applies SentenceTransformerEmbed with a model like sentence-transformers/all-MiniLM-L6-v2 to each chunk's text. Because the same transform flow can be evaluated at query time, you get consistent embeddings for both indexing and querying, which is critical for similarity search.
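That transform flow can be written as follows, a sketch following the CocoIndex docs, with the model name being the one mentioned above:

```python
import cocoindex

# Shared embedding logic: the flow calls this per chunk at index time,
# and query code can call code_to_embedding.eval(...) at search time,
# guaranteeing both sides live in the same vector space.
@cocoindex.transform_flow()
def code_to_embedding(
    text: cocoindex.DataSlice[str],
) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )
```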

Collecting and exporting

Within the flow, each chunk row:

  • Calls code_to_embedding on chunk["text"].
  • Collects a record into code_embeddings containing filename, location, code, and the computed embedding.

At the end, code_embeddings.export writes this to Postgres via cocoindex.storages.Postgres, defines filename and location as a composite primary key, and configures a vector index on the embedding field using cosine similarity. That gives you a queryable code_embeddings table that plays nicely with SQL and existing infra.
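Put together, the collect-and-export stage looks roughly like this fragment from the flow body, assuming the code_to_embedding transform above and a chunks field on each file row:

```python
# Fragment from the flow body: one record per chunk, then export.
code_embeddings = data_scope.add_collector()
with data_scope["files"].row() as file:
    with file["chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].call(code_to_embedding)
        code_embeddings.collect(
            filename=file["filename"],
            location=chunk["location"],
            code=chunk["text"],
            embedding=chunk["embedding"],
        )

# Write to Postgres with a composite primary key and a cosine
# vector index on the embedding column.
code_embeddings.export(
    "code_embeddings",
    cocoindex.storages.Postgres(),
    primary_key_fields=["filename", "location"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
```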

Querying your code like an API

With the index built, you can turn natural language into search results using a tiny search helper. The idea:

  1. Use cocoindex.utils.get_target_storage_default_name to get the actual table name linked to your export.
  2. Call code_to_embedding.eval(query) to turn the user's text into a query vector using the exact same embedding pipeline as indexing.
  3. Run a SQL query with the Postgres <=> operator against the vector column, ordering by distance and returning the top-k matches plus their scores.

Each result carries filename, the raw code snippet, and a score derived from cosine similarity, which you can surface directly in a CLI, editor extension, or API response.
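A search helper along those lines might look like this sketch, using psycopg's ConnectionPool; code_embedding_flow and code_to_embedding are assumed to be the flow and transform defined earlier.

```python
from psycopg_pool import ConnectionPool
import cocoindex

def search(pool: ConnectionPool, query: str, top_k: int = 5) -> list[dict]:
    # Resolve the Postgres table name behind the "code_embeddings" export.
    table_name = cocoindex.utils.get_target_storage_default_name(
        code_embedding_flow, "code_embeddings"
    )
    # Same embedding pipeline as indexing, evaluated on the query text.
    query_vector = code_to_embedding.eval(query)
    with pool.connection() as conn, conn.cursor() as cur:
        cur.execute(
            f"SELECT filename, code, embedding <=> %s::vector AS distance "
            f"FROM {table_name} ORDER BY distance LIMIT %s",
            (query_vector, top_k),
        )
        # Convert cosine distance into a similarity-style score.
        return [
            {"filename": f, "code": c, "score": 1.0 - d}
            for f, c, d in cur.fetchall()
        ]
```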

A tiny REPL to demo it

To make the example tangible, the docs define a main() function that:

  • Creates a database ConnectionPool using COCOINDEX_DATABASE_URL.
  • Loops over user input queries from the terminal.
  • Prints out a ranked list of matching files and code snippets for each query.

Run python main.py, type something like "retry policy for HTTP client" or "feature flag for checkout A/B test," and you'll see the most relevant chunks from across your repo, without any manual curation.
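A minimal version of that loop, assuming the search helper described above:

```python
import os
from dotenv import load_dotenv
from psycopg_pool import ConnectionPool
import cocoindex

def main() -> None:
    # Pool against the same database the flow exports to.
    pool = ConnectionPool(os.environ["COCOINDEX_DATABASE_URL"])
    while True:
        query = input("Search (empty line to quit): ")
        if not query:
            break
        for result in search(pool, query):  # helper from the previous section
            print(f"[{result['score']:.3f}] {result['filename']}")
            print(result["code"])
            print("---")

if __name__ == "__main__":
    load_dotenv(override=True)
    cocoindex.init()  # initialize CocoIndex before evaluating transforms
    main()
```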

Keeping the index fresh

The entire point of using CocoIndex over a one-off ingest script is incremental, near real-time updates. After installing dependencies with pip install -e ., a single cocoindex update main command walks the repo, detects changes, and only reprocesses what is necessary according to the flow definition.
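As a shell session, the refresh workflow boils down to a few commands; the database URL shown is a placeholder for your own Postgres instance.

```shell
# One-time: install the example project and its dependencies.
pip install -e .

# Point CocoIndex at Postgres (value shown is a placeholder).
export COCOINDEX_DATABASE_URL="postgres://user:pass@localhost:5432/cocoindex"

# Walk the repo, detect changes, reprocess only what's stale.
cocoindex update main
```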

Observability with CocoInsight

If you want to debug or iterate on your flow, you can start CocoInsight with:

cocoindex server -ci main

Then open the URL from the terminal to inspect how your data moves through each step, understand chunking behavior, and refine filters or parameters until the index matches how your team thinks about the repo. SplitRecursively supports a wide range of programming languages via Tree-sitter, so you can grow from a single service to a polyglot monorepo without rethinking the design.


Next steps

If you're running AI agents over your codebase, managing a monorepo, or building tools that need semantic code search, give CocoIndex a try. The docs are at cocoindex.io, and the flow-based model means you can adapt the index to your exact stack: polyglot codebases, multiple repos, custom chunking strategies, or entirely different embedding models.

Stop grepping. Start indexing.
