Every few months I'd open an old repo and ask the same question:
"Who wrote this, and why were they so angry?"
Function names that lie, classes with no explanation, a main.py that secretly runs the whole company. The code worked, but the documentation never kept up. It was always "I'll document this later," and later never came.
So I did the only reasonable thing: I stopped trying.
Instead, I wired up a pipeline that reads my projects, calls an LLM, and spits out a one-page wiki for each codebase—automatically, every time the code changes.
In this post, I'll walk through how to build that: a multi-codebase summarization pipeline using CocoIndex that:
- Scans multiple projects in one go.
- Uses structured LLM outputs (Pydantic + Instructor) to extract functions, classes, and relationships.
- Aggregates everything into a project-level summary.
- Generates Markdown docs with Mermaid diagrams so you actually understand the architecture.
- Only re-runs the minimum work when files change, thanks to incremental processing and memoization.
If you've ever wished your monorepo came with a live-updating architecture wiki, this is for you.
The Idea: A Self-Updating Wiki For Every Project
Let's say you have a projects/ folder like this:
projects/
├── my_project_1/
│   ├── main.py
│   └── utils.py
├── my_project_2/
│   └── app.py
└── ...
What we want:
- For each subdirectory, generate a Markdown file like output/my_project_1.md.
- That Markdown includes:
  - An overview of the project's purpose.
  - Key public classes and functions, with human-readable summaries.
  - Mermaid diagrams showing how components connect.
  - Optional per-file details if the project spans multiple files.
And whenever you:
- Add a new project,
- Modify a file, or
- Change the LLM extraction logic,
you just run:
cocoindex update main.py
CocoIndex figures out what changed and recomputes only what's necessary.
Step 1: Setup – Point the Pipeline at Your Code
First, install CocoIndex and the supporting libraries:
pip install --pre "cocoindex>=1.0.0a6" instructor litellm pydantic
Create a project folder and enter it:
mkdir multi-codebase-summarization
cd multi-codebase-summarization
Set up your LLM configuration:
export GEMINI_API_KEY="your-api-key"
export LLM_MODEL="gemini/gemini-2.5-flash"
Tell CocoIndex where to keep its state:
echo "COCOINDEX_DB=./cocoindex.db" > .env
Then create your projects/ directory and drop in any Python projects you want summarized:
mkdir projects
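If you don't have a project handy, drop in a toy one so the pipeline has something to chew on. This file is purely illustrative:

# projects/my_project_1/main.py -- throwaway example input
def greet(name: str) -> str:
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("world"))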
Step 2: The App – Treat Each Directory as a Project
CocoIndex has the notion of an App: a top-level unit that defines how data flows from sources to outputs.
from __future__ import annotations
import os
import asyncio
import pathlib
from typing import Collection
import instructor
from litellm import acompletion
from pydantic import BaseModel, Field
import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
LLM_MODEL = os.environ.get("LLM_MODEL", "gemini/gemini-2.5-flash")
app = coco.App(
    "MultiCodebaseSummarization",
    app_main,
    root_dir=pathlib.Path("./projects"),
    output_dir=pathlib.Path("./output"),
)
The app_main function scans subdirectories and mounts a processing component per project. (In your actual file, define app_main before the coco.App call above, so the name exists when the app is constructed.)
@coco.function
def app_main(root_dir: pathlib.Path, output_dir: pathlib.Path) -> None:
    for entry in root_dir.resolve().iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        project_name = entry.name
        files = list(
            localfs.walk_dir(
                entry,
                recursive=True,
                path_matcher=PatternFilePathMatcher(
                    included_patterns=["*.py"],
                    excluded_patterns=[".*", "__pycache__"],
                ),
            )
        )
        if files:
            coco.mount(
                coco.component_subpath("project", project_name),
                process_project,
                project_name,
                files,
                output_dir,
            )
Step 3: Extract Structured File Info with an LLM
Define Pydantic models that describe exactly what we want:
class FunctionInfo(BaseModel):
    name: str = Field(description="Function name")
    signature: str = Field(description="Function signature")
    is_coco_function: bool = Field(description="Whether decorated with @coco.function")
    summary: str = Field(description="Brief summary")

class ClassInfo(BaseModel):
    name: str = Field(description="Class name")
    summary: str = Field(description="Brief summary")

class CodebaseInfo(BaseModel):
    name: str = Field(description="File path or project name")
    summary: str = Field(description="Brief summary")
    public_classes: list[ClassInfo] = Field(default_factory=list)
    public_functions: list[FunctionInfo] = Field(default_factory=list)
    mermaid_graphs: list[str] = Field(default_factory=list)
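To make the schema concrete, here is roughly what a filled-in CodebaseInfo looks like for a small file (hand-written values, not real model output):

# Illustrative only -- the values below are made up by hand
example = CodebaseInfo(
    name="my_project_1/main.py",
    summary="CLI entry point that prints a greeting.",
    public_functions=[
        FunctionInfo(
            name="greet",
            signature="greet(name: str) -> str",
            is_coco_function=False,
            summary="Builds a greeting string for a name.",
        )
    ],
)
print(example.model_dump_json(indent=2))  # the JSON shape the LLM is asked to produce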
With Instructor wrapping LiteLLM, we tell the model to fill in this schema:
_instructor_client = instructor.from_litellm(acompletion, mode=instructor.Mode.JSON)

@coco.function(memo=True)
async def extract_file_info(file: FileLike) -> CodebaseInfo:
    content = file.read_text()
    prompt = f"Analyze the following Python file...\n{content}"
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return CodebaseInfo.model_validate(result.model_dump())  # round-trip into a plain CodebaseInfo
Key point: memo=True caches results by file content—unchanged files skip the LLM entirely.
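The prompt above is abbreviated with "...". One way you might flesh it out (my wording, not the example's actual prompt) is to spell out everything the schema expects:

# A more explicit prompt -- hypothetical wording, adapt to taste
def build_prompt(path: str, content: str) -> str:
    return (
        f"Analyze the following Python file ({path}).\n"
        "Summarize its purpose, list public classes and functions with "
        "one-line summaries, note which functions are decorated with "
        "@coco.function, and produce Mermaid graph(s) showing how the "
        "pieces connect.\n\n"
        f"{content}"
    )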
Step 4: Aggregate Files into a Project-Level Summary
@coco.function
async def aggregate_project_info(
    project_name: str,
    file_infos: list[CodebaseInfo],
) -> CodebaseInfo:
    if len(file_infos) == 1:
        return file_infos[0]  # Single file: reuse its summary directly
    # Multi-file: let the LLM synthesize a project-level summary
    files_text = "\n".join(f"{i.name}: {i.summary}" for i in file_infos)
    prompt = f"Aggregate these files into one summary:\n{files_text}"
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return result
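For a two-file project, the files_text handed to the prompt is just a name-and-summary listing. A quick illustration with hand-written values:

# What files_text looks like for a two-file project (illustrative values)
file_infos = [
    CodebaseInfo(name="app.py", summary="Flask routes and startup."),
    CodebaseInfo(name="db.py", summary="SQLite helpers."),
]
print("\n".join(f"{i.name}: {i.summary}" for i in file_infos))
# app.py: Flask routes and startup.
# db.py: SQLite helpers.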
Step 5: Generate Markdown (With Mermaid Diagrams)
@coco.function
def generate_markdown(project_name: str, info: CodebaseInfo) -> str:
    lines = [f"# {project_name}", "", "## Overview", info.summary, ""]
    if info.public_classes:
        lines.append("**Classes:**")
        for cls in info.public_classes:
            lines.append(f"- `{cls.name}`: {cls.summary}")
    if info.public_functions:
        lines.append("**Functions:**")
        for fn in info.public_functions:
            lines.append(f"- `{fn.signature}`: {fn.summary}")
    if info.mermaid_graphs:
        lines.extend(["## Pipeline", "```mermaid"])
        lines.extend(info.mermaid_graphs)
        lines.append("```")
    return "\n".join(lines)
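To sanity-check the renderer without touching an LLM, you can feed it a hand-built CodebaseInfo. (This assumes functions decorated with @coco.function remain directly callable; check your CocoIndex version.)

# Quick local check with made-up data
info = CodebaseInfo(
    name="my_project_1",
    summary="Prints a greeting.",
    mermaid_graphs=["graph TD; main --> greet"],
)
print(generate_markdown("my_project_1", info))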
Step 6: Wire It All Together
@coco.function(memo=True)
async def process_project(
    project_name: str,
    files: Collection[localfs.File],
    output_dir: pathlib.Path,
) -> None:
    # Extract info from each file concurrently
    file_infos = await asyncio.gather(
        *[extract_file_info(f) for f in files]
    )
    # Aggregate into a project-level summary
    project_info = await aggregate_project_info(project_name, file_infos)
    # Generate Markdown and declare the output file
    markdown = generate_markdown(project_name, project_info)
    localfs.declare_file(
        output_dir / f"{project_name}.md",
        markdown,
        create_parent_dirs=True,
    )
Run it:
cocoindex update main.py
ls output/
# my_project_1.md my_project_2.md ...
Why This Pattern Is Powerful
This pipeline demonstrates a reusable pattern:
- Structured LLM outputs: LLMs become typed, predictable components via Pydantic + Instructor.
- Memoized LLM calls: you stop paying for the same prompt twice.
- Async concurrency: the LLM becomes a parallel compute resource.
- Hierarchical aggregation: file → project, page → document, message → conversation.
- Incremental processing: "live" documentation without nightly rebuilds.
Anywhere you have "items that need LLM enrichment, plus a rolled-up view," this pattern applies.
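Stripped of the CocoIndex specifics, the skeleton is small enough to sketch in plain asyncio (illustrative names, not a drop-in implementation):

import asyncio

async def enrich(item: str) -> str:
    # Per-item LLM call in the real pipeline; memoize this step.
    return f"summary of {item}"

async def rollup(summaries: list[str]) -> str:
    # Hierarchical aggregation: many summaries -> one rolled-up view.
    return " / ".join(summaries)

async def pipeline(items: list[str]) -> str:
    summaries = await asyncio.gather(*(enrich(i) for i in items))  # concurrent fan-out
    return await rollup(summaries)

print(asyncio.run(pipeline(["a.py", "b.py"])))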
Try It Yourself
- Full example: Multi-Codebase Summarization
- CocoIndex repo: github.com/cocoindex/cocoindex
- Step by Step Tutorial
If you end up generating auto-wikis for your own monorepo, drop a link in the comments—I'd love to see what your "self-documenting" codebase looks like.
Found this useful? Give CocoIndex a ⭐ on GitHub!
