Why Another Codebase Indexing Tool?
Let's be honest: managing code context for AI agents is a nightmare.
Your AI coding assistant needs to understand your entire codebase—not just one file at a time. Whether you're building RAG systems for Claude, context for Cursor, or semantic code search, you need:
✅ Fast, incremental updates (not rebuilding everything)
✅ Proper code parsing (not just text chunking)
✅ Vector embeddings for semantic search
✅ Real-time sync when your code changes
That's exactly what CocoIndex delivers.
What Makes CocoIndex Special?
Built-in Tree-sitter Support: Unlike generic text splitters, CocoIndex uses Tree-sitter to parse your code semantically. It understands functions, classes, and code structure—not just lines of text.
Incremental Processing: Only reprocess what changed. No more waiting 10 minutes every time you update a single file.
Native Vector Search: Built-in support for embedding generation and vector search with PostgreSQL + pgvector.
MCP Compatible: Works seamlessly with AI editors like Cursor, Windsurf, and Claude.
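To make the Tree-sitter point concrete, here is a minimal sketch of structure-aware splitting. It uses Python's standard-library `ast` module as a stand-in for Tree-sitter (an assumption made so the example runs without extra dependencies; CocoIndex itself uses Tree-sitter), and the helper name `split_by_definition` is hypothetical.

```python
import ast

SOURCE = '''\
import os

def load(path):
    with open(path) as f:
        return f.read()

class Indexer:
    def run(self):
        return "done"
'''


def split_by_definition(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


chunks = split_by_definition(SOURCE)
for chunk in chunks:
    print(chunk)
    print("-" * 20)
```

Each chunk here is a whole function or class, so a definition is never cut in half the way a fixed-size text window might cut it.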
Real-World Use Cases
🤖 AI Coding Agents: Give Claude, Codex, or Gemini the right code context
🔍 Semantic Code Search: Find code by meaning, not keywords
📝 Auto Documentation: Keep design docs synced with actual code
🔧 Code Review Automation: AI-powered PR analysis
🚨 SRE Workflows: Index infrastructure-as-code for incident response
Tutorial: Build Your Codebase Index
Let me show you how ridiculously simple this is.
Step 1: Install
pip install -U cocoindex
You'll also need PostgreSQL with the pgvector extension. See the installation guide in the CocoIndex docs (cocoindex.io/docs).
Step 2: Define Your Flow
Create a flow that reads your codebase, chunks it with Tree-sitter, and generates embeddings:
import os
import cocoindex

@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(
    flow_builder: cocoindex.FlowBuilder,
    data_scope: cocoindex.DataScope
):
    # Load your codebase
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path=os.path.join('..', '..'),
            included_patterns=["*.py", "*.rs", "*.toml"],
            excluded_patterns=[".*", "target", "**/node_modules"]
        )
    )
    code_embeddings = data_scope.add_collector()
Step 3: Extract Language & Chunk Code
# Defined at module level, above code_embedding_flow
@cocoindex.op.function()
def extract_extension(filename: str) -> str:
    """Return the file extension (e.g. ".py") so the splitter can pick a grammar."""
    return os.path.splitext(filename)[1]

# The rest continues inside code_embedding_flow, after add_collector():
    with data_scope["files"].row() as file:
        # Extract extension for Tree-sitter
        file["extension"] = file["filename"].transform(extract_extension)
        # Chunk code semantically
        file["chunks"] = file["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language=file["extension"],
            chunk_size=1000,
            chunk_overlap=300
        )
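The `chunk_size` and `chunk_overlap` parameters are easier to reason about with a toy version. The sketch below is a plain character-window splitter, not `SplitRecursively` (which also takes the code's syntax into account); `sliding_chunks` is a hypothetical helper used only to show how an overlap makes neighboring chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a statement that straddles a boundary still shows up whole in at least one chunk. In this toy model the units are characters; check the CocoIndex docs for the units `SplitRecursively` uses.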
Step 4: Embed & Index
# Defined at module level so the same embedding logic can be reused at query time
@cocoindex.transform_flow()
def code_to_embedding(
    text: cocoindex.DataSlice[str]
) -> cocoindex.DataSlice[list[float]]:
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

# Continuing inside code_embedding_flow, within the `with ... as file:` block:
        with file["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].call(code_to_embedding)
            code_embeddings.collect(
                filename=file["filename"],
                location=chunk["location"],
                code=chunk["text"],
                embedding=chunk["embedding"]
            )

    # Export to PostgreSQL with vector index (back at the function's top level)
    code_embeddings.export(
        "code_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndex(
                "embedding",
                cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY
            )
        ]
    )
Step 5: Run It
cocoindex update main
Boom. Your codebase is now indexed with semantic embeddings.
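Re-running the update only needs to touch files that changed, which is the "Incremental Processing" point from earlier. The sketch below illustrates the general idea with content hashes. It is not CocoIndex's internal mechanism, and `content_hash` and `files_to_reprocess` are hypothetical names.

```python
import hashlib


def content_hash(content: str) -> str:
    """Fingerprint a file's content so changes can be detected cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def files_to_reprocess(files: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return names of new or modified files and record their hashes in `seen`."""
    changed = []
    for name, content in files.items():
        digest = content_hash(content)
        if seen.get(name) != digest:
            changed.append(name)
            seen[name] = digest
    return changed


seen_hashes: dict[str, str] = {}
repo = {"a.py": "print('a')", "b.py": "print('b')"}

first_run = files_to_reprocess(repo, seen_hashes)   # everything is new
repo["b.py"] = "print('changed')"
second_run = files_to_reprocess(repo, seen_hashes)  # only b.py changed

print(first_run)   # ['a.py', 'b.py']
print(second_run)  # ['b.py']
```

This toy version ignores deleted files and does not track which chunks or embeddings depend on which file; a real incremental indexer has to handle both.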
Step 6: Query Your Index
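# Assumption: ConnectionPool comes from psycopg_pool (psycopg 3)
from psycopg_pool import ConnectionPool

def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Name of the table created by the export in code_embedding_flow
    table_name = cocoindex.utils.get_target_storage_default_name(
        code_embedding_flow, "code_embeddings"
    )
    # Embed the query with the same transform flow used for indexing
    query_vector = code_to_embedding.eval(query)
    with pool.connection() as conn:
        with conn.cursor() as cur:
            # <=> is pgvector's cosine distance operator
            cur.execute(f"""
                SELECT filename, code, embedding <=> %s::vector AS distance
                FROM {table_name}
                ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            # Convert cosine distance into a similarity score
            return [{
                "filename": row[0],
                "code": row[1],
                "score": 1.0 - row[2]
            } for row in cur.fetchall()]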
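In the query above, pgvector's `<=>` operator returns cosine distance, which is why the code reports `1.0 - distance` as the score. The sketch below reproduces that arithmetic in plain Python on made-up three-dimensional vectors (real embeddings from all-MiniLM-L6-v2 have 384 dimensions); the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity the SQL query orders by: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]  # same meaning, different magnitude
orthogonal = [0.0, 3.0, 0.0]      # unrelated direction

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    distance = cosine_distance(query, vec)
    print(name, "distance:", distance, "score:", 1.0 - distance)
```

Ordering by distance ascending puts the closest matches first, and vector length does not affect the ranking because only the angle between vectors matters.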
Now you can search your codebase semantically:
python main.py
# Enter: "authentication middleware"
# Returns relevant auth code across your entire codebase
Language Support
CocoIndex supports all major languages via Tree-sitter:
- Python, JavaScript, TypeScript
- Rust, Go, C, C++, Java
- Ruby, PHP, Swift, Kotlin
- And 30+ more
Visualize with CocoInsight
Want to debug your indexing flow visually?
cocoindex server -ci main
This spins up CocoInsight at https://cocoindex.io/cocoinsight where you can inspect your data flow step-by-step.
Why You Should Care
AI coding tools are only as good as the context you give them. If you're building:
- AI agents that need code awareness
- Semantic code search engines
- Automated documentation generators
- Code review automation
...you need a proper codebase index.
CocoIndex makes it stupidly simple.
Try It Yourself
⭐ Star the repo: github.com/cocoindex-io/cocoindex
📖 Read the docs: cocoindex.io/docs
🎥 Watch the tutorial: YouTube guide
💬 Join Discord: discord.com/invite/zpA9S2DR7s
What are you building with AI code tools? Drop a comment below—I'd love to hear your use case!
If you found this useful, give CocoIndex a star on GitHub. It's open source and built by developers who actually understand the pain of managing code context for AI. 🚀