Every few months I'd open an old repo and ask the same question:
"Who wrote this, and why were they so angry?"
Function names that lie, classes with no explanation, a main.py that secretly runs the whole company. The code worked, but the documentation never kept up. It was always "I'll document this later," and later never came.
So I did the only reasonable thing: I stopped trying.
Instead, I wired up a pipeline that reads my projects, calls an LLM, and spits out a one-page wiki for each codebase—automatically, every time the code changes.
In this post, I'll walk through how to build that: a multi-codebase summarization pipeline using CocoIndex that:
- Scans multiple projects in one go.
- Uses structured LLM outputs (Pydantic + Instructor) to extract functions, classes, and relationships.
- Aggregates everything into a project-level summary.
- Generates Markdown docs with Mermaid diagrams so you actually understand the architecture.
- Only re-runs the minimum work when files change, thanks to incremental processing and memoization.
If you've ever wished your monorepo came with a live-updating architecture wiki, this is for you.
The Idea: A Self-Updating Wiki For Every Project
Let's say you have a projects/ folder like this:
projects/
├── my_project_1/
│   ├── main.py
│   └── utils.py
├── my_project_2/
│   └── app.py
└── ...
What we want:
- For each subdirectory, generate a Markdown file like output/my_project_1.md.
- That Markdown includes:
  - An overview of the project's purpose.
  - Key public classes and functions, with human-readable summaries.
  - Mermaid diagrams showing how components connect.
  - Optional per-file details if the project spans multiple files.
And whenever you:
- Add a new project,
- Modify a file, or
- Change the LLM extraction logic,
you just run:
cocoindex update main.py
CocoIndex figures out what changed and recomputes only what's necessary.
Step 1: Setup – Point the Pipeline at Your Code
First, install CocoIndex and the supporting libraries:
pip install --pre "cocoindex>=1.0.0a6" instructor litellm pydantic
Create a project folder and enter it:
mkdir multi-codebase-summarization
cd multi-codebase-summarization
Set up your LLM configuration:
export GEMINI_API_KEY="your-api-key"
export LLM_MODEL="gemini/gemini-2.5-flash"
Tell CocoIndex where to keep its state:
echo "COCOINDEX_DB=./cocoindex.db" > .env
Then create your projects/ directory and drop in any Python projects you want summarized:
mkdir projects
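If you don't have a project handy, drop in a toy one so the pipeline has something to chew on. This file is purely illustrative:

# projects/my_project_1/main.py -- throwaway example input
def greet(name: str) -> str:
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("world"))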
Step 2: The App – Treat Each Directory as a Project
CocoIndex has the notion of an App: a top-level unit that defines how data flows from sources to outputs.
from __future__ import annotations
import os
import asyncio
import pathlib
from typing import Collection
import instructor
from litellm import acompletion
from pydantic import BaseModel, Field
import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
LLM_MODEL = os.environ.get("LLM_MODEL", "gemini/gemini-2.5-flash")
app = coco.App(
    "MultiCodebaseSummarization",
    app_main,
    root_dir=pathlib.Path("./projects"),
    output_dir=pathlib.Path("./output"),
)
The app_main function scans subdirectories and mounts a processing component per project. (In your actual file, define app_main before the coco.App call above, so the name exists when the app is constructed.)
@coco.function
def app_main(root_dir: pathlib.Path, output_dir: pathlib.Path) -> None:
    for entry in root_dir.resolve().iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        project_name = entry.name
        files = list(
            localfs.walk_dir(
                entry,
                recursive=True,
                path_matcher=PatternFilePathMatcher(
                    included_patterns=["*.py"],
                    excluded_patterns=[".*", "__pycache__"],
                ),
            )
        )
        if files:
            coco.mount(
                coco.component_subpath("project", project_name),
                process_project,
                project_name,
                files,
                output_dir,
            )
Step 3: Extract Structured File Info with an LLM
Define Pydantic models that describe exactly what we want:
class FunctionInfo(BaseModel):
    name: str = Field(description="Function name")
    signature: str = Field(description="Function signature")
    is_coco_function: bool = Field(description="Whether decorated with @coco.function")
    summary: str = Field(description="Brief summary")

class ClassInfo(BaseModel):
    name: str = Field(description="Class name")
    summary: str = Field(description="Brief summary")

class CodebaseInfo(BaseModel):
    name: str = Field(description="File path or project name")
    summary: str = Field(description="Brief summary")
    public_classes: list[ClassInfo] = Field(default_factory=list)
    public_functions: list[FunctionInfo] = Field(default_factory=list)
    mermaid_graphs: list[str] = Field(default_factory=list)
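To make the schema concrete, here is roughly what a filled-in CodebaseInfo looks like for a small file (hand-written values, not real model output):

# Illustrative only -- the values below are made up by hand
example = CodebaseInfo(
    name="my_project_1/main.py",
    summary="CLI entry point that prints a greeting.",
    public_functions=[
        FunctionInfo(
            name="greet",
            signature="greet(name: str) -> str",
            is_coco_function=False,
            summary="Builds a greeting string for a name.",
        )
    ],
)
print(example.model_dump_json(indent=2))  # the JSON shape the LLM is asked to produce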
With Instructor wrapping LiteLLM, we tell the model to fill in this schema:
_instructor_client = instructor.from_litellm(acompletion, mode=instructor.Mode.JSON)

@coco.function(memo=True)
async def extract_file_info(file: FileLike) -> CodebaseInfo:
    content = file.read_text()
    prompt = f"Analyze the following Python file...\n{content}"
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return CodebaseInfo.model_validate(result.model_dump())  # round-trip into a plain CodebaseInfo
Key point: memo=True caches results by file content—unchanged files skip the LLM entirely.
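The prompt above is abbreviated with "...". One way you might flesh it out (my wording, not the example's actual prompt) is to spell out everything the schema expects:

# A more explicit prompt -- hypothetical wording, adapt to taste
def build_prompt(path: str, content: str) -> str:
    return (
        f"Analyze the following Python file ({path}).\n"
        "Summarize its purpose, list public classes and functions with "
        "one-line summaries, note which functions are decorated with "
        "@coco.function, and produce Mermaid graph(s) showing how the "
        "pieces connect.\n\n"
        f"{content}"
    )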
Step 4: Aggregate Files into a Project-Level Summary
@coco.function
async def aggregate_project_info(
    project_name: str,
    file_infos: list[CodebaseInfo],
) -> CodebaseInfo:
    if len(file_infos) == 1:
        return file_infos[0]  # Single file: reuse its summary directly
    # Multi-file: let the LLM synthesize a project-level summary
    files_text = "\n".join(f"{i.name}: {i.summary}" for i in file_infos)
    prompt = f"Aggregate these files into one summary:\n{files_text}"
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return result
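For a two-file project, the files_text handed to the prompt is just a name-and-summary listing. A quick illustration with hand-written values:

# What files_text looks like for a two-file project (illustrative values)
file_infos = [
    CodebaseInfo(name="app.py", summary="Flask routes and startup."),
    CodebaseInfo(name="db.py", summary="SQLite helpers."),
]
print("\n".join(f"{i.name}: {i.summary}" for i in file_infos))
# app.py: Flask routes and startup.
# db.py: SQLite helpers.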
Step 5: Generate Markdown (With Mermaid Diagrams)
@coco.function
def generate_markdown(project_name: str, info: CodebaseInfo) -> str:
    lines = [f"# {project_name}", "", "## Overview", info.summary, ""]
    if info.public_classes:
        lines.append("**Classes:**")
        for cls in info.public_classes:
            lines.append(f"- `{cls.name}`: {cls.summary}")
    if info.public_functions:
        lines.append("**Functions:**")
        for fn in info.public_functions:
            lines.append(f"- `{fn.signature}`: {fn.summary}")
    if info.mermaid_graphs:
        lines.extend(["## Pipeline", "```mermaid"])
        lines.extend(info.mermaid_graphs)
        lines.append("```")
    return "\n".join(lines)
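To sanity-check the renderer without touching an LLM, you can feed it a hand-built CodebaseInfo. (This assumes functions decorated with @coco.function remain directly callable; check your CocoIndex version.)

# Quick local check with made-up data
info = CodebaseInfo(
    name="my_project_1",
    summary="Prints a greeting.",
    mermaid_graphs=["graph TD; main --> greet"],
)
print(generate_markdown("my_project_1", info))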
Step 6: Wire It All Together
@coco.function(memo=True)
async def process_project(
    project_name: str,
    files: Collection[localfs.File],
    output_dir: pathlib.Path,
) -> None:
    # Extract info from each file concurrently
    file_infos = await asyncio.gather(
        *[extract_file_info(f) for f in files]
    )
    # Aggregate into a project-level summary
    project_info = await aggregate_project_info(project_name, file_infos)
    # Generate Markdown and declare the output file
    markdown = generate_markdown(project_name, project_info)
    localfs.declare_file(
        output_dir / f"{project_name}.md",
        markdown,
        create_parent_dirs=True,
    )
Run it:
cocoindex update main.py
ls output/
# my_project_1.md my_project_2.md ...
Why This Pattern Is Powerful
This pipeline demonstrates a reusable pattern:
- Structured LLM outputs: LLMs become typed, predictable components via Pydantic + Instructor.
- Memoized LLM calls: you stop paying for the same prompt twice.
- Async concurrency: the LLM becomes a parallel compute resource.
- Hierarchical aggregation: file → project, page → document, message → conversation.
- Incremental processing: "live" documentation without nightly rebuilds.
Anywhere you have "items that need LLM enrichment, plus a rolled-up view," this pattern applies.
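Stripped of the CocoIndex specifics, the skeleton is small enough to sketch in plain asyncio (illustrative names, not a drop-in implementation):

import asyncio

async def enrich(item: str) -> str:
    # Per-item LLM call in the real pipeline; memoize this step.
    return f"summary of {item}"

async def rollup(summaries: list[str]) -> str:
    # Hierarchical aggregation: many summaries -> one rolled-up view.
    return " / ".join(summaries)

async def pipeline(items: list[str]) -> str:
    summaries = await asyncio.gather(*(enrich(i) for i in items))  # concurrent fan-out
    return await rollup(summaries)

print(asyncio.run(pipeline(["a.py", "b.py"])))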
Try It Yourself
- Full example: Multi-Codebase Summarization
- CocoIndex repo: github.com/cocoindex/cocoindex
- Step by Step Tutorial
If you end up generating auto-wikis for your own monorepo, drop a link in the comments—I'd love to see what your "self-documenting" codebase looks like.
Found this useful? Give CocoIndex a ⭐ on GitHub!
