Linghua Jin

I Stopped Manually Documenting My Repos: Now They Generate Their Own Wikis

Every few months I'd open an old repo and ask the same question:

"Who wrote this, and why were they so angry?"

Function names that lie, classes with no explanation, a main.py that secretly runs the whole company. The code worked, but the documentation never kept up. It was always "I'll document this later," and later never came.

So I did the only reasonable thing: I stopped trying.

Instead, I wired up a pipeline that reads my projects, calls an LLM, and spits out a one-pager wiki for each codebase—automatically, every time the code changes.

In this post, I'll walk through how to build that: a multi-codebase summarization pipeline using CocoIndex that:

  • Scans multiple projects in one go.
  • Uses structured LLM outputs (Pydantic + Instructor) to extract functions, classes, and relationships.
  • Aggregates everything into a project-level summary.
  • Generates Markdown docs with Mermaid diagrams so you actually understand the architecture.
  • Only re-runs the minimum work when files change, thanks to incremental processing and memoization.

If you've ever wished your monorepo came with a live-updating architecture wiki, this is for you.

The Idea: A Self-Updating Wiki For Every Project

Let's say you have a projects/ folder like this:

projects/
  ├── my_project_1/
  │   ├── main.py
  │   └── utils.py
  ├── my_project_2/
  │   └── app.py
  └── ...

What we want:

  • For each subdirectory, generate a Markdown file like output/my_project_1.md.
  • That Markdown includes:
    • An overview of the project's purpose.
    • Key public classes and functions, with human-readable summaries.
    • Mermaid diagrams showing how components connect.
    • Optional per-file details if the project spans multiple files.
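
To make that concrete, the generated output/my_project_1.md might read roughly like this (illustrative only; the actual text depends on your code and the model):

# my_project_1

## Overview
A small ETL script that loads CSV files, cleans them, and writes Parquet.

**Classes:**
- `CsvLoader`: Reads and validates input CSV files.

**Functions:**
- `def run(config: Config) -> None`: Entry point that wires loading, cleaning, and export together.

## Pipeline
```mermaid
flowchart LR
  CsvLoader --> clean --> write_parquet
```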

And whenever you:

  • Add a new project,
  • Modify a file, or
  • Change the LLM extraction logic,

you just run:

cocoindex update main.py

CocoIndex figures out what changed and recomputes only what's necessary.


Step 1: Setup – Point the Pipeline at Your Code

First, install CocoIndex and the supporting libraries:

pip install --pre "cocoindex>=1.0.0a6" instructor litellm pydantic

Create a project folder and enter it:

mkdir multi-codebase-summarization
cd multi-codebase-summarization

Set up your LLM configuration:

export GEMINI_API_KEY="your-api-key"
export LLM_MODEL="gemini/gemini-2.5-flash"
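
The model string goes straight to LiteLLM, so any provider it supports should work here. For example, switching to OpenAI would look something like this (assuming you have an OpenAI key; the exact model name is up to you):

export OPENAI_API_KEY="your-api-key"
export LLM_MODEL="openai/gpt-4o-mini"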

Tell CocoIndex where to keep its state:

echo "COCOINDEX_DB=./cocoindex.db" > .env

Then create your projects/ directory and drop in any Python projects you want summarized:

mkdir projects
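
If you don't have a suitable codebase lying around, a tiny throwaway project is enough to watch the pipeline work (the file below is just a placeholder):

mkdir -p projects/my_project_1
cat > projects/my_project_1/main.py << 'EOF'
def greet(name: str) -> str:
    """Return a friendly greeting."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("world"))
EOF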

Step 2: The App – Treat Each Directory as a Project

CocoIndex has the notion of an App: a top-level unit that defines how data flows from sources to outputs.

from __future__ import annotations
import os
import asyncio
import pathlib
from typing import Collection

import instructor
from litellm import acompletion
from pydantic import BaseModel, Field

import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher

LLM_MODEL = os.environ.get("LLM_MODEL", "gemini/gemini-2.5-flash")

app = coco.App(
    "MultiCodebaseSummarization",
    app_main,
    root_dir=pathlib.Path("./projects"),
    output_dir=pathlib.Path("./output"),
)

The app_main function scans subdirectories and mounts a processing component per project:

@coco.function
def app_main(root_dir: pathlib.Path, output_dir: pathlib.Path) -> None:
    for entry in root_dir.resolve().iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        project_name = entry.name
        files = list(
            localfs.walk_dir(
                entry,
                recursive=True,
                path_matcher=PatternFilePathMatcher(
                    included_patterns=["*.py"],
                    excluded_patterns=[".*", "__pycache__"],
                ),
            )
        )
        if files:
            coco.mount(
                coco.component_subpath("project", project_name),
                process_project,
                project_name,
                files,
                output_dir,
            )

Step 3: Extract Structured File Info with an LLM

Define Pydantic models that describe exactly what we want:

class FunctionInfo(BaseModel):
    name: str = Field(description="Function name")
    signature: str = Field(description="Function signature")
    is_coco_function: bool = Field(description="Whether decorated with @coco.function")
    summary: str = Field(description="Brief summary")

class ClassInfo(BaseModel):
    name: str = Field(description="Class name")
    summary: str = Field(description="Brief summary")

class CodebaseInfo(BaseModel):
    name: str = Field(description="File path or project name")
    summary: str = Field(description="Brief summary")
    public_classes: list[ClassInfo] = Field(default_factory=list)
    public_functions: list[FunctionInfo] = Field(default_factory=list)
    mermaid_graphs: list[str] = Field(default_factory=list)
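
To make the target shape concrete, here's a hand-written instance of the schema (this is what we're asking the LLM to produce, not actual model output):

example = CodebaseInfo(
    name="my_project_1/main.py",
    summary="CLI entry point that parses arguments and runs the pipeline.",
    public_classes=[ClassInfo(name="Config", summary="Holds runtime settings.")],
    public_functions=[
        FunctionInfo(
            name="run",
            signature="def run(config: Config) -> None",
            is_coco_function=False,
            summary="Runs the full pipeline for one configuration.",
        )
    ],
    mermaid_graphs=["flowchart LR\n  Config --> run"],
)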

With Instructor wrapping LiteLLM, we tell the model to fill in this schema:

_instructor_client = instructor.from_litellm(acompletion, mode=instructor.Mode.JSON)

@coco.function(memo=True)
async def extract_file_info(file: FileLike) -> CodebaseInfo:
    content = file.read_text()
    prompt = f"Analyze the following Python file...\n{content}"
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return CodebaseInfo.model_validate(result.model_dump())

Key point: memo=True caches results by file content—unchanged files skip the LLM entirely.
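
Conceptually, that's content-keyed caching. A stripped-down sketch of the idea in plain Python (not CocoIndex's actual implementation, and call_llm here is hypothetical):

import hashlib
import json
import pathlib

CACHE_DIR = pathlib.Path(".cache")

def memoized_extract(path: pathlib.Path) -> dict:
    content = path.read_bytes()
    key = hashlib.sha256(content).hexdigest()    # cache key derived from file content
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():                      # unchanged file: reuse the cached result
        return json.loads(cache_file.read_text())
    result = call_llm(content.decode())          # hypothetical LLM call
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result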


Step 4: Aggregate Files into a Project-Level Summary

@coco.function
async def aggregate_project_info(
    project_name: str,
    file_infos: list[CodebaseInfo],
) -> CodebaseInfo:
    if len(file_infos) == 1:
        return file_infos[0]  # Single file, reuse

    # Multi-file: LLM synthesizes
    files_text = "\n".join(f"{i.name}: {i.summary}" for i in file_infos)
    prompt = f"Aggregate these files into one summary:\n{files_text}"
    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return result

Step 5: Generate Markdown (With Mermaid Diagrams)

@coco.function
def generate_markdown(project_name: str, info: CodebaseInfo) -> str:
    lines = [f"# {project_name}", "", "## Overview", info.summary, ""]

    if info.public_classes:
        lines.append("**Classes:**")
        for cls in info.public_classes:
            lines.append(f"- `{cls.name}`: {cls.summary}")

    if info.public_functions:
        lines.append("**Functions:**")
        for fn in info.public_functions:
            lines.append(f"- `{fn.signature}`: {fn.summary}")

    if info.mermaid_graphs:
        lines.extend(["## Pipeline", "```

mermaid"])
        lines.extend(info.mermaid_graphs)
        lines.append("

```")

    return "\n".join(lines)

Step 6: Wire It All Together

@coco.function(memo=True)
async def process_project(
    project_name: str,
    files: Collection[localfs.File],
    output_dir: pathlib.Path,
) -> None:
    # Extract info from each file concurrently
    file_infos = await asyncio.gather(
        *[extract_file_info(f) for f in files]
    )

    # Aggregate into project-level summary
    project_info = await aggregate_project_info(project_name, file_infos)

    # Generate and output markdown
    markdown = generate_markdown(project_name, project_info)

    localfs.declare_file(
        output_dir / f"{project_name}.md",
        markdown,
        create_parent_dirs=True,
    )
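
One practical note: asyncio.gather above fires one LLM request per file all at once. If your provider rate-limits you, a semaphore is a simple way to cap concurrency (a sketch; depending on how CocoIndex schedules its functions, you may need to adapt it):

_llm_semaphore = asyncio.Semaphore(5)  # cap at 5 concurrent LLM calls (arbitrary limit)

async def extract_with_limit(file: FileLike) -> CodebaseInfo:
    async with _llm_semaphore:
        return await extract_file_info(file)

# then, inside process_project:
# file_infos = await asyncio.gather(*[extract_with_limit(f) for f in files])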

Run it:

cocoindex update main.py
ls output/
# my_project_1.md  my_project_2.md  ...
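
To make this genuinely "every time the code changes", hang the same command off whatever automation you already have. A cron entry is the lowest-effort option (the path is a placeholder):

*/30 * * * * cd /path/to/multi-codebase-summarization && cocoindex update main.py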

Why This Pattern Is Powerful

This pipeline demonstrates a reusable pattern:

  • Structured LLM outputs: LLMs become typed, predictable components via Pydantic + Instructor
  • Memoized LLM calls: You stop paying for the same prompt multiple times
  • Async concurrency: the LLM becomes a parallel compute resource, with per-file calls running side by side
  • Hierarchical aggregation: File → project, page → document, message → conversation
  • Incremental processing: "Live" documentation without nightly rebuilds

Anywhere you have "items that need LLM enrichment, plus a rolled-up view," this pattern applies.


Try It Yourself

If you end up generating auto-wikis for your own monorepo, drop a link in the comments—I'd love to see what your "self-documenting" codebase looks like.


Found this useful? Give CocoIndex a ⭐ on GitHub!
