DEV Community

HarmanPreet-Singh-XYT


How CodiLay Reads a Codebase the Way a Detective Reads a Crime Scene

Most documentation tools ask you to write the docs yourself, or they generate something so shallow it barely survives contact with the actual codebase. CodiLay takes a different approach. It reads the code the way an investigator reads evidence — tracing connections, holding open questions, resolving them when the right file comes along, and building a picture that gets more accurate as it goes.

Here's how it actually works under the hood.


The Wire Model

The central abstraction in CodiLay is the wire. A wire represents an unresolved reference — a file imports something, calls something, or depends on something that hasn't been documented yet. The agent opens a wire when it sees the reference. That wire stays alive in the agent's active state, carried forward through subsequent files, until a later file explains the other end. At that point, the wire closes, the connection gets recorded, and it retires from active context permanently.

This is deliberate. Closed wires are never re-injected into future LLM calls. As the codebase grows, the active context stays lean — only what's genuinely unresolved travels forward.

Wires carry type information too: import, call, model, config, event. A routes file importing a service opens an import wire. That service calling into a payment processor opens a call wire. Wires that reach external packages or reference deleted files stay permanently open and surface in the final output as Unresolved References — which is useful information, not a failure.
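The lifecycle above can be sketched in a few lines. This is a minimal model with illustrative field and class names (Wire, WireState, kind, source, target, summary), not CodiLay's actual internals:

```python
from dataclasses import dataclass

@dataclass
class Wire:
    kind: str            # "import", "call", "model", "config", or "event"
    source: str          # file that opened the wire
    target: str          # unresolved reference it points at
    closed: bool = False
    summary: str = ""    # recorded when the other end is explained

class WireState:
    def __init__(self):
        self.open_wires: list[Wire] = []
        self.closed: list[Wire] = []   # retired: never re-injected into prompts

    def open(self, kind: str, source: str, target: str) -> Wire:
        wire = Wire(kind, source, target)
        self.open_wires.append(wire)
        return wire

    def close_for(self, file_path: str, summary: str) -> list[Wire]:
        """Close every open wire whose target this file explains."""
        resolved = [w for w in self.open_wires if w.target == file_path]
        for w in resolved:
            w.closed, w.summary = True, summary
            self.open_wires.remove(w)
            self.closed.append(w)
        return resolved
```

The important property is in the two lists: only `open_wires` ever travels forward into a prompt, while `closed` exists purely for the final connection record.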


The Agent Loop, Phase by Phase

Bootstrap strips the project down to what matters: parse .gitignore, merge any additional ignore patterns from config, walk the directory tree, preload existing markdown files.

Triage is a single LLM call that sees only filenames and paths, not file content. It categorizes every file as core (document fully), skim (extract key metadata), or skip (ignore entirely). For a Flutter project this means ios/, android/, build/, and all generated .g.dart and .freezed.dart files disappear before the planner ever sees them. For a Next.js project, .next/ and out/ vanish. The triage phase is autonomous — no user confirmation step.

Planning makes one LLM call against the curated (post-triage) file list. The planner outputs an ordered queue, a list of parked files (too ambiguous to process yet), and a suggested document skeleton including section names and structure.

Processing runs file by file through that queue. For each file, the agent counts tokens with tiktoken, decides whether it goes through single-call or large-file handling, loads the relevant doc chunks, builds a prompt with current wire state and section index, calls the LLM, applies the JSON diff to the docstore, and updates the wire state — closing resolved wires, opening new ones, checking whether any parked files can now be unparked given the new context.

Finalization runs a sequential sweep with full wire context. Parked files get documented with whatever context exists. The Unresolved References and Dependency Graph sections are assembled. CODEBASE.md and links.json are written out. The current HEAD commit hash is saved to state for the next re-run.


Large File Handling

Files are measured in tokens, not lines. A 500-line TypeScript generics file costs more context than a 1,000-line config file. The default threshold is 6,000 tokens, configurable per project.

Files over the threshold go through a skeleton pass first. The agent reads imports, function signatures, class definitions, and docstrings — no function bodies. This builds the section in the docstore, opens all detectable wires early, and marks the section as detail_pending.

Then the file splits along natural boundaries — class definitions, top-level functions, component edges. If no clean boundaries exist, it splits by token budget with 10–15% overlap between chunks. The overlap is the key detail. Without it, a function that starts at the bottom of chunk N and ends at the top of chunk N+1 gets half-documented or missed entirely. Overlap means the model always has trailing context from the previous chunk.
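The fallback split can be sketched over an abstract token list. The 12.5% default below sits inside the 10–15% range described above; the exact step arithmetic is an illustrative assumption:

```python
def chunk_with_overlap(tokens: list, budget: int, overlap_frac: float = 0.125) -> list:
    """Split tokens into ~budget-sized chunks, each repeating the
    tail of the previous chunk as trailing context."""
    step = int(budget * (1 - overlap_frac))   # advance less than a full budget
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + budget])
        if start + budget >= len(tokens):     # final chunk reached the end
            break
        step = max(step, 1)
        start += step
    return chunks
```

With a budget of 40 and 12.5% overlap, each chunk begins with the last 5 tokens of its predecessor, so a construct straddling a boundary is always seen whole at least once.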

The skeleton-first order matters for a reason beyond chunking. By the time the detail passes run, other files in the queue may already have been processed. The LLM reading a function body already knows what calls it and what it returns to — it reads with full context rather than in isolation.


Docstore Architecture

The docstore manages the document as independently addressable sections rather than a flat string. Each section carries invisible metadata:

<!-- section:id=auth-middleware tags=auth,jwt,middleware deps=routes/users.js,routes/orders.js -->
## Auth Middleware
...content...
<!-- /section -->

These markers are stripped from the final output. The section index — a lightweight JSON object always in context — holds only metadata: title, file, tags, deps, which wires it closed, whether it's detail_pending.

When processing a new file, relevant sections are loaded by priority: sections whose deps list includes the current file path, sections whose tags overlap with imports found in the current file, sections flagged by open wires pointing to this file, and the top-level Overview (always loaded). Each LLM call stays bounded regardless of total document size.
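That priority order can be sketched against the metadata fields shown in the marker above. The index entries, id values, and flat first-match selection are illustrative assumptions:

```python
def sections_to_load(index: list, current_file: str,
                     current_imports: set, open_wire_targets: set) -> list:
    """Pick the doc sections worth injecting for one file's LLM call."""
    picked = []
    for sec in index:
        if sec["id"] == "overview":                        # always loaded
            picked.append(sec)
        elif current_file in sec.get("deps", []):          # deps name this file
            picked.append(sec)
        elif current_imports & set(sec.get("tags", [])):   # tag/import overlap
            picked.append(sec)
        elif sec.get("file") in open_wire_targets:         # flagged by an open wire
            picked.append(sec)
    return picked
```

Because selection runs over the lightweight index rather than section bodies, the cost of choosing context stays flat as the document grows.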


Structured Parallelism

Sequential processing of a 50-file codebase means 50 serial LLM calls. Naive parallelism breaks the wire model — two workers processing related files simultaneously produce inconsistent context and miss connections.

CodiLay solves this with a dependency tier model. Before the loop starts, the planner builds a lightweight DAG from structural inference (folder hierarchy, import patterns visible in skim files). Files are assigned to tiers by their depth in the DAG. Tier 0 is entry points. Tier 1 is files directly imported by tier 0. And so on.

Processing happens tier by tier. Within a tier, all files run in parallel. Between tiers, a hard sync point waits for all workers to finish and reconciles the central wire bus before the next tier begins.
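Tier assignment is essentially a depth computation over the import DAG. A minimal sketch, assuming an acyclic graph and a plain dict of import edges (the traversal strategy here is my own, not necessarily the planner's):

```python
def assign_tiers(imports: dict) -> dict:
    """imports maps each file to the files it imports; assumes a DAG.
    Returns {file: tier}, where tier 0 holds the entry points."""
    imported = {dep for deps in imports.values() for dep in deps}
    tiers = {f: 0 for f in imports if f not in imported}   # imported by nothing
    frontier = list(tiers)
    while frontier:
        nxt = []
        for f in frontier:
            for dep in imports.get(f, []):
                depth = tiers[f] + 1
                if tiers.get(dep, -1) < depth:   # deepest path determines the tier
                    tiers[dep] = depth
                    nxt.append(dep)
        frontier = nxt
    return tiers
```

Taking the deepest path (rather than the shallowest) guarantees a file is never processed before something that imports it within the same run.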

The wire bus is the shared state that makes safe parallelism possible. All workers read from and write to it through atomic operations: open(wire), close(wire_id, summary), peek(file_path), mark_pending(wire_id). Individual doc sections write independently per worker — only wire state is shared and locked.

Every worker takes a frozen snapshot of the wire bus at job start. It processes its file against that snapshot, not a live-updating view. A wire that closes mid-call by another worker doesn't affect the current worker's LLM call. The finalize pass reconciles everything afterward.
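The lock-plus-snapshot pattern can be sketched directly. Method names echo the operations listed above (peek and mark_pending are omitted for brevity); the internal dict layout is an assumption:

```python
import copy
import threading

class WireBus:
    """Shared wire state; only this is locked — doc sections are per-worker."""
    def __init__(self):
        self._lock = threading.Lock()
        self._wires: dict = {}

    def open(self, wire_id: str, wire: dict) -> None:
        with self._lock:
            self._wires[wire_id] = {**wire, "status": "open"}

    def close(self, wire_id: str, summary: str) -> None:
        with self._lock:
            self._wires[wire_id].update(status="closed", summary=summary)

    def snapshot(self) -> dict:
        """Frozen copy taken at worker job start; later writes don't leak in."""
        with self._lock:
            return copy.deepcopy(self._wires)
```

The deepcopy is what makes the snapshot frozen: a worker holding it never observes a wire closed mid-call by a sibling.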

Sections generated during parallel processing carry a confidence tag: partial if there were pending wires at generation time. The finalize pass always re-reviews partial sections. Speedup expectations range from 1.5x on deep call-chain monoliths to 5–8x on flat utility repos.


Git Integration

The current HEAD commit hash is saved to state after every run. On the next run, git diff <last_commit> HEAD --name-status returns a typed change list — M for modified, A for added, D for deleted, R for renamed.
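Routing that typed list is a small parsing step. One detail worth noting: git appends a similarity score to rename entries (R100 for an exact rename), so the status field is matched by prefix. A sketch, not CodiLay's parser:

```python
def parse_name_status(diff_output: str) -> dict:
    """Split `git diff --name-status` output into typed change buckets."""
    changes = {"modified": [], "added": [], "deleted": [], "renamed": []}
    for line in diff_output.strip().splitlines():
        parts = line.split("\t")
        status = parts[0]
        if status == "M":
            changes["modified"].append(parts[1])
        elif status == "A":
            changes["added"].append(parts[1])
        elif status == "D":
            changes["deleted"].append(parts[1])
        elif status.startswith("R"):          # e.g. "R100<TAB>old<TAB>new"
            changes["renamed"].append((parts[1], parts[2]))
    return changes
```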

Modified files re-enter the queue. Any wires that originated from or pointed to them re-open. Their doc sections are invalidated. Added files run through a single-file triage call before queuing. Deleted files leave their wires permanently open with a note referencing the commit hash. Renamed files get all their wire from/to fields updated along with section index deps entries, then re-process at the new path.

When git isn't available — no git history, no git binary — the fallback compares file mtimes against timestamps in the state file. Files newer than last_run are treated as modified. Files in processed but missing from disk are treated as deleted.
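The fallback check is a straightforward pass over the previously processed paths. In this sketch, exists and mtime are injectable only to keep the example testable without touching disk — the real check would use the os.path functions directly:

```python
import os

def detect_changes(processed: set, last_run: float,
                   exists=os.path.exists, mtime=os.path.getmtime):
    """mtime fallback: compare disk timestamps to the state file's last_run."""
    modified, deleted = set(), set()
    for path in processed:
        if not exists(path):
            deleted.add(path)            # processed before, now gone from disk
        elif mtime(path) > last_run:
            modified.add(path)           # newer than the last run
    return modified, deleted
```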


Parallelism Safety: Three Failure Modes Eliminated

Missed wire — Worker B documents a file without knowing Worker A opened a wire pointing to it. The wire never closes. The connection disappears from the final doc. The frozen-context snapshot eliminates this: every worker reads the wire bus before starting, and the finalize pass catches anything that slipped.

Partial read — Worker B reads shared wire state mid-update by Worker A, gets half-formed context, and produces inconsistent output. The locked atomic operations on the wire bus eliminate this.

Confident wrong — Worker B has no open wire pointing to its file, assumes self-containment, and hallucinates relationships. Confidence tagging catches this: if pending wires existed at generation time, the section is marked partial and finalize re-reviews it.


Resumption and Cost Protection

The state file rotates through four copies: current, .bak.1, .bak.2, .bak.3. On corruption, automatic fallback cascades through the backups. State files run 50–200KB — negligible disk footprint.

The LLM response cache is the deepest layer of cost protection. Before any LLM call, the system checks for a cached response keyed on hash(file_content) + hash(prompt_template) + hash(wire_context_snapshot). A crash between the API call and the docstore write doesn't lose the spend — the response is in cache and replays on resume at zero cost.
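The key construction can be written out directly. The choice of sha256 and simple concatenation is an assumption — the text only specifies the three hash inputs:

```python
import hashlib
import json

def cache_key(file_content: str, prompt_template: str, wire_snapshot: dict) -> str:
    """Three concatenated hashes; a change to any input is a cache miss."""
    def h(s: str) -> str:
        return hashlib.sha256(s.encode()).hexdigest()
    snap = json.dumps(wire_snapshot, sort_keys=True)   # stable serialization
    return h(file_content) + h(prompt_template) + h(snap)
```

Including the wire snapshot in the key matters: the same file processed under different open-wire context legitimately produces a different response, so it must not hit the old cache entry.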

The two-phase docstore write ensures this: write LLM response to cache (atomic), write section to temp file, atomically move to final location via os.replace(), update state, rotate backups. A crash at any step after the cache write means the response survives.
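The temp-file-plus-os.replace() step looks roughly like this. A minimal sketch of the atomic move, with the cache write, state update, and backup rotation happening around it in the real sequence:

```python
import os
import tempfile

def write_section_atomically(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then swap into place,
    so readers never observe a half-written section."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())   # bytes on disk before the swap
        os.replace(tmp, path)      # atomic rename on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The temp file must live in the same directory as the target: os.replace() is only atomic within a filesystem, and a cross-device rename would fall back to a non-atomic copy.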

Failure routing distinguishes retryable from user-actionable errors. Rate limits get exponential backoff with jitter, up to 5 retries. Timeouts retry once with a longer timeout, then park. Auth errors pause the run and prompt the user to fix the API key — then resume. Disk full pauses and waits. The distinction matters: retrying an auth error silently accomplishes nothing.


The Web UI Architecture

The server is a FastAPI application built with a factory pattern — create_app(target_path, output_dir) scopes the entire app to a specific project via closures. All state is project-local.

Four items are cached with mtime invalidation: agent state, wire links, CODEBASE.md, and the TF-IDF retriever. When codilay run or codilay watch updates files on disk, the server picks up changes without a restart.

All synchronous operations — LLM calls, file I/O, search indexing — are wrapped in asyncio.to_thread to keep the async event loop responsive under concurrent requests. Feature modules are lazy-imported inside their endpoint handlers rather than at module level, which keeps startup fast and isolates import errors.

The chat system runs three layers. Layer 1 renders the static output with an interactive dependency graph from links.json. Layer 2 is a chatbot that answers questions from doc context only. Layer 3, the deep agent, activates when confidence drops below threshold — it reads actual source files, answers with precision, then patches the doc with what it found. The next time the same question arrives, Layer 2 handles it without escalation.


Code Annotation

The annotation system writes wire knowledge back into source files as comments and docstrings. It's the one feature that modifies actual code, which drives the guard design.

requireGitClean blocks annotation if the working tree has uncommitted changes. git checkout . always works as a clean rollback. requireDryRunFirst forces a dry run on the first annotation of any project — nothing writes until the user has seen the unified diff preview.

Annotations are inserted sorted by line number descending — bottom of file first. This prevents line-offset drift: inserting a block comment at line 10 shifts every subsequent target line number down by the comment's length. Inserting from the bottom up means each insertion doesn't affect anything above it.

Before writing any file, syntax validation runs. Python goes through ast.parse(). Other languages get structural checks. If validation fails, the original file stays untouched and the annotation moves to the review queue.
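Both the bottom-up ordering and the ast.parse() guard fit in one short sketch. The function name and the (line_number, text) annotation shape are illustrative:

```python
import ast

def insert_annotations(source: str, annotations: list) -> str:
    """annotations: (line_number, text) pairs, 1-indexed. Inserting
    bottom-up keeps every remaining target line number valid."""
    lines = source.splitlines()
    for line_no, text in sorted(annotations, reverse=True):   # bottom of file first
        lines.insert(line_no - 1, text)
    result = "\n".join(lines) + "\n"
    ast.parse(result)   # SyntaxError here -> caller keeps the original file
    return result
```

Because ast.parse() runs before anything is written back, a bad insertion can never leave a broken file on disk — the exception propagates and the original stays untouched.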

The wire connection block in an annotation is the output that no other doc generator produces:

def process_payment(order_id, retry_count=0):
    """
    Charges the customer for a pending order via Stripe.

    Wire connections:
      <- Called by: routes/orders.py (checkout), scheduler/retry_jobs.py
      -> Calls:     stripe.charge.create, notify_fulfillment (Celery task)
      -> Reads:     Order model, Customer.stripe_id

    Retry logic: up to 3 attempts with exponential backoff.
    """

The cross-file relationship shows up directly in the code, not in a separate document that drifts out of sync.


Conversation Search

The search module is a custom TF-IDF implementation with no external dependencies. It tokenizes with a regex that extracts words and code identifiers (get_user, handleClick) while filtering stop words. Augmented TF — 0.5 + 0.5 * (term_count / max_term_in_doc) — prevents long documents from dominating by raw occurrence count. IDF uses log((1 + N) / (1 + df)) + 1 smoothing so terms appearing in all documents don't score to zero.
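The two formulas translate directly, assuming pre-tokenized documents. A minimal sketch, not the module's actual code:

```python
import math
from collections import Counter

def augmented_tf(term: str, doc_tokens: list) -> float:
    """0.5 + 0.5 * (term_count / max_term_in_doc): raw length can't dominate."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return 0.5 + 0.5 * (counts[term] / max(counts.values()))

def smoothed_idf(term: str, docs: list) -> float:
    """log((1 + N) / (1 + df)) + 1: a term in every doc still scores > 0."""
    df = sum(1 for d in docs if term in d)       # document frequency
    return math.log((1 + len(docs)) / (1 + df)) + 1
```

Note the floor behaviors: augmented TF never drops below 0.5 for a present term, and the +1 in the IDF keeps ubiquitous terms from zeroing out entire query matches.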

Snippets are extracted by scanning for the 120-character window with the highest query-term density, which ensures the most relevant part of a message is shown rather than just the opening.

The index is saved to codilay/chat/search_index.json. It's a non-critical cache — if missing or corrupt, it auto-rebuilds from conversation files. The first search after a fresh install is slower; everything after uses the cached index.


The Audit System

Audits run against the wire graph and doc context with a specific analytical lens. The wire model already knows where auth happens, what touches it, where data enters and exits, what depends on what, where secrets get referenced. An audit agent reads this existing knowledge and does targeted deep dives into the files relevant to the audit type.

Every finding includes the wire path that shows how the vulnerable code gets reached:

FINDING: Unsanitized user input in search handler
Severity:   HIGH
File:       src/routes/search.js  line 47
Wire path:  routes/search.js -> utils/query_builder.js (call)
Evidence:   req.query.term passed directly to buildQuery() with no sanitization
Impact:     buildQuery() constructs raw SQL — potential injection
Fix:        Sanitize req.query.term before passing to buildQuery()

The checklist system matters here. Each audit type loads a specific checklist into the system prompt. The LLM works through it item by item rather than free-associating. Nothing gets skipped, and every finding ties to a specific checklist item.

Audit types split into three tiers. Tier 1 is pure code analysis — security, architecture, performance, dependency supply chain, license audit. CodiLay reads the code and produces findings. Tier 2 adds config file analysis — Dockerfiles, CI pipelines, IAM definitions, cloud resource config. Tier 3 generates specifications for external tooling — OWASP ZAP target lists from route wires, load test targets from high-traffic route analysis, chaos engineering blast radius maps from the dependency graph.

Multiple audit types share file reads in one pass. A security and architecture audit running together surfaces things neither finds alone — a service boundary violation that also creates a security exposure.


CodiLay sits at ~30k lines across 25+ source files, 500+ passing tests, and 30+ CLI commands. The wire model is the idea everything else builds on. Everything from parallelism to audit reports to annotation safety to cost protection traces back to the same core abstraction: track what you know, track what you don't, resolve unknowns as you go, and never carry dead weight into the next call.
