DEV Community: Pramoda Sahu

How I Built a Compounding Markdown Wiki for Research: When Standard RAG Stops Being Enough

Pramoda Sahu — Thu, 25 Jun 2026 10:42:17 +0000

Most AI knowledge systems have a hidden failure mode: they keep rediscovering what they already know.

That is fine for lightweight lookup. It becomes a problem in long-horizon, multi-source research. Once you move beyond a handful of documents, the usual pattern of “retrieve some chunks, ask the model, hope it synthesizes correctly” starts to break down. Answers get fuzzier. Citations blur together. Contradictions slip through. And every new document increases cost without necessarily increasing understanding.

I ran into this while working with large research corpora: papers, transcripts, articles, and project-specific source material that needed to stay usable over time. RAG solved one problem — fitting large corpora into a model workflow — but it did not solve the deeper one. It gave me search, not cumulative knowledge.

So I built something different: a compounding markdown wiki inspired by Recursive Language Model (RLM) principles, where ingestion happens once, structure accumulates over time, and the 100th document makes the first 99 more useful. This is not a literal implementation of the RLM paper; it is an adaptation of the core idea that useful state should live outside the model context window and be updated incrementally.

In this post, I’ll break down the failure mode I was seeing, why standard chunk-retrieval RAG was not enough for this kind of work, and how this architecture produces more stable, cited, human-readable research outputs at scale.

The Real Problem: Context Rot

If you’ve used AI agents on research material, you’ve probably seen some version of this:

You feed one paper into ChatGPT. It answers your question well.
You feed a second paper. It still works.
You ask about both papers together. The synthesis gets weaker.
You add a tenth paper. Hallucinations start creeping in.
You add a fiftieth. Earlier sources start getting dropped without warning.

This is context rot.

The issue is not that the model “forgets” in a clean, binary way. It is that as the context window fills up, attention gets diluted across more and more tokens. Earlier material does not vanish; it just becomes harder for the model to use precisely. The degradation is gradual and messy:

Answers become more generic
Citations get mixed up
Contradictions go undetected
Earlier evidence stops influencing later responses reliably

That matters in research workflows, where the hard part usually is not finding a sentence that mentions a term. It is maintaining a consistent understanding across many sources.

What context rot looks like in practice

A typical agent loop looks something like this:

1. Build system prompt (persona, instructions, tool schemas)
2. Add conversation history (growing every turn)
3. Add retrieved context (RAG chunks, file contents)
4. Send to model
5. Get response
6. Append response to history
7. Repeat

By turn 10, you may already be around 30K tokens. By turn 30, you can be well past 100K. Even if your model technically supports a large context window, it does not attend equally well to everything inside it.

The symptoms are familiar:

The answer shifts from specific claims to vague summaries
The model contradicts something it said a few turns earlier
It attributes one source’s claim to the wrong person
It refuses to answer because it detects inconsistency but cannot resolve it

This is the hidden tax on long-running AI research sessions. And it is what led me to rethink the default architecture.

Why Standard RAG Solves Scale but Not Cumulative Research

The standard answer to context rot is Retrieval-Augmented Generation.

Instead of stuffing everything into the prompt, you embed documents, retrieve the most relevant chunks for the current question, and pass only those chunks to the model. That absolutely helps with scale. It keeps prompts smaller and avoids dumping entire corpora into context.

But basic top-K chunk-retrieval RAG pipelines have a different limitation: they often treat each query as a mostly fresh synthesis problem.

For a query like “What did Smith say about the merger?”, a standard RAG pipeline typically does this:

Embed the question
Search for similar chunks across all documents
Return the top-K matches
Ask the model to synthesize an answer

That works well for retrieval. It works much less well for cumulative understanding.

Production systems can add memory layers, rerankers, metadata filters, cached summaries, or knowledge graphs. But once you add those pieces, you are already moving beyond plain chunk retrieval and toward persistent state — which is the shift this post is about.

Where standard RAG falls short for research

No persistent cross-source memory by default

If Document A mentions Smith’s stance on regulation, and Document B records Jones pushing the opposite position, basic RAG does not persist that relationship. It may retrieve both chunks if your query is good enough, but it does not maintain an explicit “Smith vs. Jones” structure over time.

That means comparisons are synthesized from scratch on demand, not built into the knowledge system.

No compounding

Adding Document C does not automatically make the system smarter about Documents A and B. The vector index just gets larger.

In practice, that often means more possible matches, more noise, and more opportunity for irrelevant chunks to crowd out the useful ones. As the corpus grows, retrieval quality can degrade unless you keep tuning chunking, ranking, metadata filters, and prompts.

Weak provenance

RAG can usually tell you which file a chunk came from. It is much worse at maintaining fine-grained provenance over time:

Which paragraph supported this claim?
Is the claim directly stated or inferred?
Do two sources disagree on the same point?
Was this information updated later by a more reliable source?

Those questions matter in research, and standard chunk retrieval does not answer them by itself.

Query-time cost keeps accumulating

As more documents enter the system, you are not only searching more material. You are also often retrieving more candidates, performing more ranking, and building prompts that grow in complexity. Latency and token costs rise with corpus size, and the burden shifts to query time.

That is acceptable for search. It is inefficient for ongoing research programs with hundreds of documents per project.

The Shift: Knowledge Should Compound

What I wanted was not a better search engine. I wanted a knowledge base.

More specifically, I wanted a system where:

The 50th document is processed in the context of what the first 49 already taught the system
Query cost depends more on the selected wiki pages than on the full raw corpus
Every claim traces back to a specific source paragraph
Contradictions are flagged for review instead of being silently overwritten
The resulting knowledge is human-readable markdown, not an opaque index

That is the real distinction here:

A search engine finds relevant documents. A knowledge base preserves relationships between them.

The RLM Pattern Behind the Design

The deeper design pattern I built on is Recursive Language Models (arXiv:2512.24601). Again, I am using the paper as inspiration rather than claiming a direct implementation.

The core idea I borrowed is simple:

keep persistent external state outside the model’s context window
use the model for bounded operations on that state
avoid making the model reconstruct everything from raw text on every query

In my case, that persistent state is a markdown wiki.

Instead of asking the LLM to rediscover structure from raw documents over and over, I let it:

extract entities from a chunk
update a specific page
merge structured facts
synthesize from pre-resolved references

That shifts the expensive reasoning from “every query forever” to “once at ingestion, then maintained incrementally.”

The Architecture: A Compounding Markdown Wiki

The system is organized as a markdown wiki with three layers:

wiki/
├── SCHEMA.md           # Domain conventions, tag taxonomy
├── index.md            # Content catalog
├── log.md              # Audit trail
├── raw/                # Layer 1: Immutable source files
│   ├── articles/
│   ├── papers/
│   └── transcripts/
├── entities/           # Layer 2: People, orgs, products
├── concepts/           # Layer 2: Topics, ideas
├── comparisons/        # Layer 2: Side-by-side analyses
└── queries/            # Layer 2: Preserved Q&A

This structure is deliberately plain. No proprietary store. No hidden schema behind an API. The wiki itself is the database.

Layer 1: Raw sources

The raw/ directory contains immutable source material. The system can read these files, but it does not rewrite them.

Each source gets a SHA-256 hash, which makes drift detection straightforward. If I ingest the same source again and the content hash has not changed, the pipeline can safely skip expensive work.

Layer 2: The wiki

This is the agent-owned knowledge layer.

Every major entity gets its own page:

people
organizations
products

Every major concept gets its own page:

topics
definitions
theories
techniques

Pages are linked using [[wikilinks]], which makes relationships explicit and navigable. This is where the compounding happens: as new sources arrive, they enrich existing pages rather than living as isolated chunks.

Layer 3: The schema

SCHEMA.md defines domain conventions, taxonomy, and page structure. This matters because consistency is what allows incremental updates to stay clean over time. Without a schema, you do not get a knowledge base — you get a pile of markdown.

Why markdown matters

One of the most practical design decisions here is also one of the least flashy: everything is stored as files.

That gives you several advantages immediately:

You can inspect the knowledge base without custom tooling
You can open it directly in Obsidian
You can version it with Git
You can edit pages by hand when needed
You are not locked into a vector database or vendor-specific format

That last point matters for long-lived research systems. If your knowledge base outlives your current model stack, it should still be readable.

How Ingestion Works

The ingestion pipeline is where the system earns its keep. Instead of postponing structure until query time, it resolves as much as possible up front.

When I ingest a source — whether it is a URL, PDF, or transcript — the pipeline does the following:

Fetch and normalize the content into clean markdown
Write to raw/ with metadata and a SHA-256 hash
Check drift and skip unchanged sources
Chunk semantically instead of by arbitrary token count alone
Extract per chunk using a smaller, cheaper model call
Aggregate and deduplicate the extracted structures
Cross-reference against existing wiki pages
Generate or update pages via LLM-assisted editing
Update the index and append to the audit log

The key difference from RAG is this:

Extraction happens once at ingest time, not repeatedly at query time.

What “semantic chunking” means here

In this system, chunking is heuristic rather than magical. I usually split by the strongest available content boundary first:

document headings and subheadings for articles or papers
paragraph groups for prose-heavy text
speaker turns for transcripts
sentence windows only as a fallback when no stronger boundary exists

The goal is to keep each chunk as a coherent unit of thought. That improves:

entity resolution
fact extraction
citation accuracy
contradiction checks

What gets extracted per chunk

Each chunk is processed into structured JSON containing:

Entities: people, organizations, products
Concepts: topics and definitions
Facts: claims with certainty markers and provenance

Here is the kind of structure this stage is designed to produce:

{
  "entities": [
    {
      "name": "John Smith",
      "type": "person",
      "aliases": ["J. Smith"]
    }
  ],
  "concepts": [
    {
      "name": "merger regulation",
      "definition": "Regulatory constraints affecting merger approval."
    }
  ],
  "facts": [
    {
      "claim": "Smith argued that the merger would face regulatory friction.",
      "certainty": "stated",
      "source_paragraph": "p12",
      "source_id": "q3-call-2026-04"
    }
  ]
}

The exact schema can vary by domain, but the point is the same: ingestion produces structured evidence, not just searchable text.

A Compact End-to-End Example

This is the shortest example I have found that shows the full idea.

1) Source excerpt

[Source: raw/transcripts/q3-call-2026-04.md]

Paragraph p12:
"John Smith said the merger would likely face regulatory friction in the EU,
but still described the deal as strategically necessary."

2) Extraction output

{
  "entities": [
    { "name": "John Smith", "type": "person" },
    { "name": "EU", "type": "organization" }
  ],
  "concepts": [
    {
      "name": "merger regulation",
      "definition": "Regulatory constraints that may affect merger approval."
    }
  ],
  "facts": [
    {
      "claim": "John Smith said the merger would likely face regulatory friction in the EU.",
      "certainty": "stated",
      "source_id": "q3-call-2026-04",
      "source_paragraph": "p12"
    },
    {
      "claim": "John Smith described the deal as strategically necessary.",
      "certainty": "stated",
      "source_id": "q3-call-2026-04",
      "source_paragraph": "p12"
    }
  ]
}

3) Wiki page update

# John Smith

tags: [person, executive]

## Positions

- Smith said the merger would likely face regulatory friction in the EU.  
  Source: [[q3-call-2026-04#p12]]

- Smith described the deal as strategically necessary.  
  Source: [[q3-call-2026-04#p12]]

## Related

- [[merger regulation]]
- [[EU]]

4) Query response

If I later ask:

What did Smith say about the merger in the Q3 call?

The system can answer from the wiki first:

John Smith said the merger would likely face regulatory friction in the EU,
while still describing the deal as strategically necessary.

Citations:
- [[John Smith]]
- [[q3-call-2026-04#p12]]

That is the compounding behavior in miniature: source text becomes structured evidence, evidence updates a persistent page, and later questions hit the page instead of forcing the model to rediscover the same relationship from raw text.

How Contradiction Detection Works

This is the part I had to be careful not to overstate.

I do not treat contradiction detection as perfect. In my implementation, it is a review-oriented mechanism, not a guarantee of truth.

At ingestion time, new facts are compared against existing page facts with a mix of:

entity matching
claim-type matching from the schema
provenance checks
an LLM judgment step that labels a new fact as supports, updates, conflicts with, or unclear

When the system sees a likely conflict, it does not auto-resolve it into one canonical statement. It adds both claims to the relevant page and marks the conflict for review.

A simplified page fragment might look like this:

## Open conflicts

- Smith said the merger was strategically necessary.  
  Source: [[q3-call-2026-04#p12]]

- Smith said the merger should be delayed until regulatory conditions improve.  
  Source: [[investor-interview-2026-05#p4]]

Status: needs review

So when I say contradictions are “flagged,” I mean candidate contradictions are surfaced and preserved instead of being silently smoothed over.

How Querying Works

Once the wiki exists, queries become much lighter.

Instead of searching raw chunks across the entire corpus, the system searches the structured wiki first. In many cases, that means keyword, title, and tag-based search, with no embeddings required for the first pass.

The query flow looks like this:

Search the wiki
Select a strategy using a fast model call
Execute the strategy with the appropriate model
Return an answer with [[Page Name]] or source-anchor citations

The strategy selection matters because not every question needs the same level of work.

Query strategies

Lookup

For a single fact from one page.

Synthesis

For compare/contrast tasks across multiple pages or sources.

Deep-dive

For more extensive analysis across a small set of selected pages plus raw-source verification if needed.

In practice, query cost is much closer to wiki size and selected pages than to total raw document count. It is not literally constant, but it scales more gently than repeatedly searching and prompting over the full corpus.

Where raw retrieval still fits

This is not an anti-RAG purity test. I still use raw-document retrieval in edge cases:

cold-start questions before enough wiki structure exists
verification against the underlying source text
cases where a page summary looks incomplete or stale
exploratory search across new material before ingestion rules are tuned

For me, the best framing is not “replace RAG everywhere.” It is “stop making raw retrieval do all the memory work.”

Why This Beats Plain RAG for Research

Here is the comparison that mattered in practice:

Dimension	Plain chunk-retrieval RAG	Compounding Wiki
Knowledge growth	Re-synthesizes per query	Accumulates across sources
Cross-referencing	Implicit (similar chunks)	Explicit (entity pages + wikilinks)
Query cost trend	Usually grows with retrieval/ranking complexity	Tied more to selected pages than total corpus
Provenance	Often file/chunk-level	Page and paragraph-anchor level
Contradictions	Usually handled inside answer generation	Surfaced as reviewable conflicts
Human-readable	Often opaque to humans	Plain markdown
Edit maintenance	Often requires re-indexing	Often localized to page updates

A few of these deserve emphasis.

Provenance becomes operational, not cosmetic

In many RAG systems, “citation” means attaching a filename or chunk excerpt to an answer. That is useful, but it is not enough for research-quality traceability.

In the wiki model, claims can point back to specific paragraphs, and disagreements can remain visible instead of being collapsed into one model-generated summary.

Contradictions stop disappearing

One of the worst behaviors in AI research pipelines is silent reconciliation. If two sources disagree, the model often smooths them into a single answer.

By maintaining persistent pages and structured fact updates, likely conflicts can be surfaced and flagged instead of overwritten. That changes the system from “helpful summarizer” into something closer to a research assistant.

Edits become cheaper

In a vector-centric workflow, changing a source often means re-embedding and re-indexing. In this setup, a targeted edit can mean updating one source, recomputing only the affected structures, and revising the relevant pages.

That is not free, but it is usually more localized.

Practical Performance Notes

The original draft of this post was too absolute here, so let me be precise.

The exact latency and cost depend on:

the model provider
whether extraction uses a small local model or an API model
corpus size and document shape
how aggressively you verify against raw sources
the size of the wiki pages being synthesized at query time

In my own testing, the main gain was not “zero scaling cost.” It was moving a large share of the work to ingestion and keeping many queries small because they hit pre-structured pages instead of raw corpora.

So when I say the system is more stable at scale, I mean:

qualitatively more consistent on repeated research questions
operationally easier to trace and debug
economically better when the same corpus gets queried many times

If you publish hard benchmark numbers, include your corpus, models, prompt strategy, and what counts as an “active project.” Without that context, qualitative claims are more honest.

What This Enables

The compounding wiki model changes what kinds of workflows are practical.

Multi-source research

You can ask questions across large document sets and get answers tied back to the exact paragraph that supports them.

If you ask:

What did Smith say about the merger in the Q3 call?

the goal is not to retrieve a vaguely relevant transcript chunk. The goal is to resolve “Smith,” locate the relevant page or source anchor, and return the supporting paragraph.

Project-specific memory for teams and writers

Each new interview, paper, or article enriches what the system already knows.

If the third transcript creates a [[John Smith]] page, the fiftieth transcript should update that page, not create one more disconnected chunk.

Better structure as the corpus matures

As the graph gets denser, the wiki tends to become more useful:

more entity resolution
better cross-referencing
richer concept pages
stronger comparison pages

That is the compounding effect plain RAG does not naturally provide on its own.

Practical Packaging and Tooling

The system is open source here:

https://github.com/pksw4u/rlm-wiki

At the time of writing, I would describe it as a practical, evolving project rather than a finished universal framework.

It works in several modes:

As a Hermes Agent skill by dropping SKILL.md into your skills directory
As a standalone Python library via pip install rlm-wiki
As a native Obsidian vault because the wiki is plain markdown
Alongside llm-wiki, using the same format so they can coexist

If you are not familiar with those names:

Hermes Agent is the agent runtime this project was designed to plug into.
llm-wiki is a related markdown-wiki workflow that shares a compatible file format.

Because the wiki is just files, the data is not trapped inside a database that only one runtime understands.

Minimal usage example

pip install rlm-wiki
rlm-wiki ingest ./docs/q3-call.pdf --project ./wiki
rlm-wiki ask "What did Smith say about the merger?" --project ./wiki

Example entity page format

# John Smith

tags: [person]
aliases: [J. Smith]

## Summary
Executive involved in merger discussions.

## Claims
- Smith argued the merger would face regulatory friction in the EU.  
  Source: [[q3-call-2026-04#p12]]

## Related
- [[merger regulation]]
- [[EU]]

Here is the operational point I care about most:

The wiki is just markdown files. No required database, no vendor lock-in, and no mandatory embedding index to keep the core state readable.

That means your knowledge base remains usable even if your model stack changes later.

The Deeper Point: Search Is Not Understanding

A lot of AI tooling still treats every question as a fresh event:

search
retrieve
answer
forget

That is a useful pattern for document access. It is a poor pattern for research.

Research is cumulative. If you are trying to understand Smith’s position, the answer usually does not live in one chunk. It emerges from multiple interviews, repeated claims, revisions over time, and contradictions with other people’s statements.

A system that compounds can do things a query-time retrieval stack struggles with:

resolve some relationships once instead of rediscovering them repeatedly
preserve explicit links across sources
keep provenance attached to claims
improve existing pages as new evidence appears

That is not just a better implementation detail. It is a different idea of what an AI knowledge system should be.

Key Takeaways

RAG is good for retrieval, but plain chunk-retrieval RAG is weak for cumulative research because it treats many questions as fresh synthesis tasks.
Context rot is a real scaling problem in long-running AI research workflows, not just a context-window marketing issue.
A markdown wiki can serve as persistent external state, replacing vector-only memory with readable, editable files.
Ingestion-time extraction changes the economics by doing the hard structuring work once instead of on every query.
RLM-style bounded operations keep the working set small even as document collections grow.
Hybrid systems are often the practical answer: persistent wiki first, raw retrieval when needed.

Conclusion

What I ended up building was not “RAG, but better.” It was a different architecture with a different goal.

RAG helps models find relevant text. A compounding wiki helps a system build and preserve relationships, provenance, and working understanding over time. That distinction starts to matter once your corpus gets large, your questions become comparative, and your tolerance for vague synthesis drops.

If you work with research-heavy AI workflows, this is the assumption I would revisit: maybe the model should not be responsible for reconstructing knowledge from scratch on every question. Maybe the system should already know how your sources relate before the question is even asked.

That is the promise of a compounding knowledge base. The more you feed it, the more useful its prior structure becomes — not because the model got bigger, but because the system got better at remembering.

Designing a Self-Prompting Agent Harness with Per-Task Prompt, Tool, and Strategy Synthesis

Pramoda Sahu — Fri, 19 Jun 2026 08:25:59 +0000

Most agent stacks have matured in roughly the same direction: we version the code, test the tools, constrain the runtime, and instrument the loop. But one part of the system still often lives as an unversioned artifact copied between docs, chats, and notebooks: the prompt.

That mismatch gets harder to ignore once you start treating the agent harness as the real product. If the harness is what determines reliability, cost, safety, and task success, it is strange that the prompt is often the least engineered part of the stack.

That question led me to build SynthAgent: a small framework that generates a task-specific prompt, tool plan, and runtime strategy at task time instead of relying on one fixed prompt and one fixed loop for every task.

This post is best read as an architecture exploration, not a benchmark report. I have not yet run the A/B test that would justify a strong performance claim over a fixed-prompt baseline. What I do have is a working harness, a clear design thesis, and a set of implementation lessons that were useful enough to write down.

I’ll walk through the architecture, show what the synthesized artifacts actually look like, explain the tradeoffs behind the design, and point out where the current version is still weak.

The Core Idea: Generate the Harness at Task Time

Here’s the one-line version of the project:

SynthAgent takes (task, success_criteria) and, instead of running a fixed prompt through a fixed loop, generates a custom Prompt P, Tool Plan T, and Strategy S for that specific task. It then runs them through a Plan-Execute-Verify loop, scores the result, reflects on the failure, and tries again with revised components.

That is the core thesis: prompts should not be static artifacts when the rest of the harness is dynamic.

This setup is inspired by recent work on agent harnesses and automated agent design. In particular, it resembles the plan-execute-verify style control loop discussed in Code as Agent Harness (arXiv:2605.18747) and the meta-level search perspective in ADAS / Meta Agent Search (Hu et al., ICLR 2025). The difference is scope: instead of trying to invent entirely new agents, SynthAgent tries to invent a per-task harness.

SynthAgent Architecture

At a high level, the system looks like this:

                            ┌──────────────────────────┐
   TASK (free-form)  ───►   │  META-PROMPT GENERATOR   │   ◄── the only
   + success criteria       │  (MPG)                   │       hand-written
                            └────────────┬─────────────┘       instruction
                                         │ synthesizes         (about how
                                         ▼                     to write
                            ┌──────────────────────────┐       prompts,
                            │  TASK PROMPT P           │       not the
                            │  + TOOL PLAN T           │       task itself)
                            │  + STRATEGY S            │
                            └────────────┬─────────────┘
                                         │ runs inside
                                         ▼
        ┌──────────────────────────────────────────────────────────────┐
        │                       THE PEV LOOP                           │
        │                                                              │
        │    ┌─────────┐    plan     ┌─────────┐   tool calls   ┌─────┐ │
        │    │ PLANNER │ ──────────► │EXECUTOR │ ─────────────► │ENV  │ │
        │    └────┬────┘             └────┬────┘                └──┬──┘ │
        │         ▲                       │ observations           │    │
        │         │ replan                ▼                        │    │
        │         │                 ┌─────────┐                    │    │
        │         └──────────────── │MEMORY / │ ◄──────────────────┘    │
        │                           │STATE    │                          │
        │                           └────┬────┘                          │
        │                                │ trajectory                    │
        │                                ▼                               │
        │                         ┌─────────────┐                        │
        │                         │  VERIFIER V │                        │
        │                         └──────┬──────┘                        │
        │                                │ score + critique              │
        │                                ▼                               │
        │                         ┌─────────────┐                        │
        │                         │ REFLECTOR R │ ──► revise P/T/S       │
        │                         └─────────────┘                        │
        └──────────────────────────────────────────────────────────────┘
                                         │
                            loop until V=Pass or budget exhausted

There are six components, each in its own file and each intentionally small. The point is not abstraction for its own sake. It is to make the harness legible to a human debugger so that when something fails, you can inspect the trajectory, the prompt, the plan, and the verifier output directly.

A Concrete Example of Synthesized `P/T/S`

Before going component by component, it helps to see the core artifact.

Here is a trimmed, representative example of the JSON object the Meta-Prompt Generator emits for a simple task like:

Task: Summarize the repository's Python modules and write the summary to SUMMARY.md.

Success criteria: Cover each top-level .py file once, do not invent files, and produce a concise markdown summary.

{
  "prompt_p": "You are the planner for a repository-analysis task. First inspect the directory, then read each relevant Python file, extract its purpose conservatively, and only summarize files you actually observed. Before finishing, verify that every top-level .py file has been covered exactly once.",
  "tool_plan_t": [
    "list_directory",
    "read_file",
    "write_file"
  ],
  "strategy_s": {
    "mode": "plan_execute",
    "notes": [
      "Enumerate files before summarizing",
      "Do not infer missing modules",
      "Validate file coverage before final write"
    ]
  }
}

That example is not meant as benchmark evidence. It is there to make the mechanism concrete: the harness is generating a task-specific instruction, an intended tool sequence, and a strategy hint for the runtime.

A representative failure-and-retry sequence for that kind of task looks like this:

Attempt 1 lists the directory and reads several files, but misses one module.
The Verifier fails the run because the output does not satisfy the “cover each top-level .py file once” criterion.
The Reflector revises prompt_p to explicitly require a checklist of discovered files before writing.
Attempt 2 re-runs with the stricter instruction, covers the missing file, and passes.

That does not prove per-task synthesis is better than a fixed prompt. It does show the shape of the feedback loop and the kind of failure it is designed to repair.

The Meta-Prompt Generator Is the Only Hand-Written Prompt

`mpg.py`

The Meta-Prompt Generator (MPG) is the only place in the system that contains a human-authored prompt.

That prompt is not a task instruction. It is a meta-prompt: an instruction for how to write task-specific prompts. Given a task description, success criteria, an available tool catalog, and any prior lessons retrieved from memory, the MPG emits a JSON object with three fields:

prompt_p — a detailed task-specific instruction for the Planner
tool_plan_t — which tools are relevant and the intended order of use
strategy_s — which loop pattern to prefer, such as ReAct, Plan-and-Execute, or Decompose-and-Solve

This recursive structure is deliberate: the system is generating instructions about how to solve the task, not solving the task directly in the meta-prompt itself. That is part of what attracted me to the approach described in Meta Prompting for AI Systems, which also discusses recursive meta-prompting (Zhang et al.).

I chose this route over prompt optimization frameworks like DSPy for one reason: I wanted the harness behavior to stay inspectable. If a run goes wrong, I want to read the synthesized prompt and the execution trace. I do not want the optimization process hidden behind a compiled graph or framework abstraction that makes the final behavior harder to audit.

That transparency shaped most of the rest of the system too.

The PEV Loop Turns Prompt Synthesis Into Runtime Behavior

`harness.py`

Once the MPG emits P, T, and S, those artifacts are fed into the inner Plan-Execute-Verify loop.

The loop works like this:

Planner decides the next action given the current prompt, tool plan, strategy, and trajectory.
Executor calls the selected tool.
Memory persists the step immediately.
The cycle repeats until the Planner finishes, the Verifier passes the result, or the attempt budget runs out.

This is the operational center of the system. A synthesized prompt is only interesting if it changes downstream behavior in a controlled way. The PEV loop is what gives that prompt something to steer.

Two implementation details mattered more than I expected.

Structured planner output matters more than planner creativity

Where provider support allows it, the Planner output is requested as a JSON object using a response format like:

{ "type": "json_object" }

That is less glamorous than tuning reasoning quality, but it matters more. In an agent harness, the Planner is not writing prose for a human reader. It is producing a decision interface that the Executor must parse reliably.

In practice, this is not uniformly standardized across all OpenAI-compatible providers. Tool calling, response_format, streaming, and provider-specific parameters still vary. For the subset of plain chat-completion behavior used in this project, though, structured JSON output was stable enough to be worth enforcing where supported and recovering heuristically where it was not.

The system therefore prioritizes structure over style. Reliability beats expressiveness here.

Tool argument normalization prevents embarrassingly common failures

The Executor is intentionally thin, but it does one very practical thing: it resolves argument-name aliases like:

filepath
file_path
filename

Models get these wrong all the time, even when the tool schema is explicit. Rather than pretending this will not happen, the harness normalizes common variants before dispatch. That small tolerance layer ended up being more useful than a more elaborate execution abstraction.

Hard step limits are non-negotiable

The default is:

max_steps_per_attempt = 25

This is enforced unconditionally in the current design.

The Planner can return action: "finish" when it believes the task is done, but if it gets stuck in a loop, the harness terminates the attempt. Self-prompting agents are not immune to looping. If anything, giving them the ability to rewrite their own instructions can make unmanaged loops more likely.

A hard cap is therefore part of the product, not just a debugging safeguard.

Persisting every step to disk changes failure recovery

After every step, the trajectory is written to disk at:

agent_memory/runs/<uuid>.json

That persistence layer matters for more than observability. If the process crashes mid-task, the run history still exists. That history becomes the source of truth for verification, reflection, debugging, and future lessons.

In other words, the trace is not just logging. It is part of the system state.

How Much Does `strategy_s` Change Runtime Behavior Today?

One gap in many “dynamic strategy” writeups is that the strategy field sounds more powerful than it really is.

That is worth being explicit about here.

Today, strategy_s is partly operative and partly advisory.

It is operative in the sense that it changes the planner context and nudges the loop toward a different decomposition style.
It is not yet a fully separate runtime policy engine with deeply different execution branches for each strategy family.

So if the MPG emits something like “ReAct” versus “Plan-and-Execute,” the current implementation mostly changes how the Planner is instructed to proceed, not an entirely different harness implementation under the hood.

That still matters, but it is narrower than “the runtime swaps in a wholly different agent architecture.” If I extend the project, strategy branching is one of the first places I would make more explicit.

The Verifier Is the Weakest Part of the Current Design

`verifier.py`

Right now, the Verifier is LLM-as-judge.

It evaluates the final output against the provided success criteria and emits:

a score from 0.0 to 1.0
a critique
suggested fixes

The current pass threshold is:

0.85

This works well enough to support iterative refinement, but it is also the most obvious weakness in the architecture.

When the agent and the verifier come from the same model family, the system can end up hill-climbing on the verifier's blind spots. A prompt can get “better” according to the judge while the actual task result stays wrong in ways the judge fails to detect.

One mitigation in SynthAgent is that the Verifier gets the full trajectory, not just the final answer. That gives it a better chance of spotting failure modes like:

sloppy execution after a good initial plan
bad tool usage followed by confident synthesis
premature commitment to an answer without validation

But that is still a mitigation, not a solution.

The real fix is to support deterministic, pluggable verifiers wherever the task class allows it. For code, that might be pytest. For SQL, it might be execution against a test database. For structured output, it might be jsonschema.validate. In some environments, the environment itself can serve as the oracle.

That lesson also shows up in recent harness work. The AutoHarness authors report that a smaller model with a stronger synthesized harness can outperform a larger model in constrained game-like environments because the environment itself supplies a deterministic feedback signal. That is the part I find most important—not a specific leaderboard result, but the fact that the verifier is external, cheap, and hard to game.

Design your verifier first. Everything else is downstream.

That is the biggest architectural lesson in the repo.

Reflection Works Best at the Prompt Level, Not the Response Level

`reflector.py`

When verification fails, the Reflector (R) is invoked.

It receives:

the original task
the current P/T/S
the full trajectory
the verifier score
the verifier critique

Its job is to identify the likely failure mode and revise the harness accordingly.

That raises an important design question: what exactly should reflection operate on?

There are three obvious levels:

Response-level reflection — edit the final answer
Prompt-level reflection — revise the task prompt P
Harness-level reflection — re-run synthesis and regenerate P/T/S from scratch

These are not equivalent.

Response-level reflection is usually too shallow

Editing the answer after the fact is the weakest form of reflection. For example, if the agent wrote a flawed final summary, response-level reflection would just rewrite that summary.

It can improve phrasing or patch a local omission, but it does not fix the process that produced the failure. This is close to the Self-Refine pattern, and for a system like SynthAgent it is not where the real leverage is.

Prompt-level reflection is the current sweet spot

Right now, SynthAgent revises the prompt-level artifacts. In the repository-summary example above, that might mean changing the Planner instruction from “summarize the repo” to “enumerate files first, maintain a checklist, and do not finish until every discovered .py file has been covered.”

That is a useful middle ground because it lets the system improve behavior without paying the full cost of fresh synthesis every time.

Full re-synthesis is stronger, but more expensive and riskier

The strongest option would be to re-call the MPG and regenerate all of P/T/S from scratch. In the same example, that might swap not just the prompt wording but the tool sequence and decomposition strategy too.

That is the next experiment I would try, but not the current default.

The reason is practical: full re-synthesis is more expensive, and in early testing it often overfit to the most recent failure rather than learning a stable improvement. Prompt-level revisions turned out to be the better default tradeoff.

Memory Stores Both Raw Runs and Reusable Lessons

`memory.py`

The memory layer has two stores:

Runs — runs/<uuid>.json
Lessons — lessons.json

The runs store captures the full trajectory, prompt/tool/strategy history, verifier scores, and reflections for each attempt. It is a forensic record of what happened.

The lessons store is different. It contains distilled takeaways from completed runs and indexes them for retrieval using embeddings. When a new task arrives, the MPG gets the top-5 most similar lessons as additional context.

By default, the embedding model is NVIDIA's:

nv-embed-v1

But the implementation is flexible enough to work with other OpenAI-style embedding endpoints too.

This is one area where I am still not convinced the architecture is earning its complexity. For a small lesson corpus, cosine similarity over lesson text is doing a lot of work. It may be that a simpler baseline like “last 5 lessons” performs just as well. I would not claim embedding-based retrieval here is a decisive win without a proper A/B test.

That uncertainty matters. Not every component that sounds architecturally elegant turns out to matter in practice.

The Tool Surface Defines the Agent's Legal Action Space

`tools.py`

SynthAgent ships with five tools:

tavily_search — web search via Tavily
read_file
write_file
list_directory
execute_python — sandboxed Python via subprocess with a 30-second timeout

These tools are deliberately boring.

That is by design. The aim was not to build a flashy tool ecosystem. It was to make the action space constrained, legible, and safe enough to reason about.

A few implementation details matter:

file tools enforce a project-root sandbox
execute_python runs in a subprocess
the subprocess has a hard 30s timeout
stdout and stderr are returned
sandboxed Python has no network access by default

This is where the harness perspective becomes concrete: the tool surface is the legal action space. If a tool is not in the registry, the agent cannot call it. That is not just a convenience; it is a safety boundary.

It also prevents a whole class of failures that appear in less constrained frameworks, where the model “discovers” capabilities the runtime should never have exposed in the first place.

Model Selection Uses Any OpenAI-Compatible Endpoint

`llm.py`

The LLM client began with NVIDIA NIM hardcoded, but that quickly felt too limiting. I wanted to test:

Groq for latency
Ollama for privacy-sensitive local runs
OpenRouter for model flexibility
standard OpenAI-compatible providers without rewriting client logic

So llm.py became a thin wrapper around any OpenAI-compatible /v1/chat/completions endpoint.

Here is the basic configuration shape:

# OpenAI
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini

# Groq
LLM_BASE_URL=https://api.groq.com/openai/v1
LLM_MODEL=llama-3.3-70b-versatile

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL=anthropic/claude-3.5-sonnet

# Local Ollama
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=llama3.1

In practice, there were only two provider quirks that deserved special handling.

NVIDIA embedding requests need `input_type`

NVIDIA's embedding endpoint expects an input_type field. Other providers reject that same field. The client detects the NVIDIA case and only sends it there.

Chat completions are similar enough for this project's core flow

The more encouraging result is that, for the subset of non-streaming chat-completion calls used in this project, the interface was similar enough across providers that one small client wrapper worked with only minor conditionals.

That is a narrower claim than “all OpenAI-compatible APIs are standardized.” They are not. Tool calling, structured output modes, reasoning-token controls, and provider-specific parameters still vary. But for the core request/response flow used here, treating model choice as configuration rather than architecture worked well.

In my testing, the entire llm.py implementation stayed small and worked across the providers above with only lightweight provider-specific handling.

What's in the Repository

The repo is intentionally small and direct:

main.py — CLI entry point with onboarding wizard
config.py — environment loader with backward-compatibility aliases
llm.py — OpenAI-compatible client with retry and robust JSON extraction
tools.py — tool implementations and tool registry schema
mpg.py — Meta-Prompt Generator
verifier.py — LLM-as-judge verifier
reflector.py — failure diagnosis and P/T/S revision
harness.py — PEV loop and outer attempt loop
memory.py — runs and lessons persistence

Dependencies are minimal:

requests
Python 3.8+

There is no LangChain, no LlamaIndex, no DSPy, and no heavyweight framework hidden underneath. That constraint was part of the point. I wanted every line of behavior to be readable in an afternoon.

What I Learned Building SynthAgent

The architecture is the headline, but the implementation taught me a few things that mattered more than the conceptual framing.

JSON parsing is harder than model selection

The single biggest source of early failures was malformed JSON from the Planner step.

The failure patterns were painfully familiar:

markdown code fences around JSON
prose before the object
raw newlines inside strings
trailing commas
partial object emission

To make the system resilient, llm.generate_json() now uses a three-stage recovery pipeline:

direct parse
strip markdown
brace-matching with control-character sanitization

A simplified sketch of the idea looks like this:

def recover_json(text: str):
    # 1. Try direct parse
    try:
        return json.loads(text)
    except Exception:
        pass

    # 2. Remove markdown fences
    cleaned = strip_markdown_fences(text)
    try:
        return json.loads(cleaned)
    except Exception:
        pass

    # 3. Extract brace-matched object and sanitize control chars
    candidate = extract_first_brace_matched_json(cleaned)
    candidate = sanitize_control_chars(candidate)
    return json.loads(candidate)

Those recovery paths are not elegant, but they are extremely practical. In a harness like this, robust JSON extraction matters more than almost any model-level tuning.

Reflection needs the trajectory, not just the score

If you tell the Reflector only that the verifier score was 0.4, you are not giving it enough to improve anything meaningful.

If you show it the actual trace and make the failure concrete—for example, that it called tavily_search with the wrong query argument on step 3 and then committed to an unsupported answer on step 5—it has something operational to fix.

That is why the full trace gets passed through the system. Reflection without trajectory is mostly guesswork.

The smallest prompts are often the highest-leverage components

The most valuable hand-written parts of the system are also the smallest:

the MPG prompt in mpg.py is about 20 lines
the Verifier prompt is about 15 lines
the Reflector prompt is about 20 lines

Those prompts earn their place because they define behavior at the right level of abstraction. They are not trying to solve the task directly. They are specifying how the system should generate, evaluate, and revise task-solving behavior.

Everything else is composed around those few instructions.

Self-prompting without a real verifier is hallucination with extra steps

This was the most sobering lesson.

I watched the loop improve prompts across multiple attempts, tighten wording, raise the verifier score from 0.7 to 0.85, and still produce a confidently wrong answer. The system looked like it was learning, but it was really optimizing against an imperfect judge.

That does not make self-prompting useless. It just means it cannot rescue a broken oracle.

A better harness cannot compensate for a verifier that does not track reality.

If I had to summarize the whole project in one cautionary line, that would be it.

Where I'd Take the Architecture Next

The current version works, but a few next steps feel much more important than adding more tools or more prompt tricks.

Pluggable verifiers

This is the most obvious gap.

Right now, verification is LLM-as-judge only. The next version should expose a verifier interface where task-specific checks can be plugged in, such as:

pytest
sql_execute
jsonschema.validate
custom rubric evaluators

The current structure already supports this direction. verifier.py mainly needs to be refactored into a strategy-style interface.

Better lesson-store retrieval

Lessons are currently appended and retrieved by naive cosine similarity. That may be enough for now, but I suspect even a small reranker—or a baseline like BM25—could outperform raw embedding similarity on a small corpus.

This is another place where I would want to measure before claiming the design is sound.

Cost ceilings

Termination currently depends on either pass/fail outcome or attempt limit. That is useful, but not sufficient for unattended runs.

A token-cost ceiling per task would make the system much safer to leave running in the background, especially when testing across providers with different latency and pricing profiles.

A/B testing whether MPG actually helps

This is the experiment the project still owes itself.

I have not yet run a clean A/B test comparing:

a fixed, hand-written prompt with the same harness
the synthesized P/T/S approach

That comparison is overdue. It would either validate the central thesis or force me to narrow it. Either outcome would be useful.

How to Try SynthAgent

If you want to run it locally:

git clone https://github.com/pksw4u/synthagent
cd synthagent
pip install -r requirements.txt
python main.py

The onboarding flow prompts for an LLM endpoint. Most OpenAI-compatible endpoints should work for the core chat flow used here, with occasional provider-specific adjustments. The default points to NVIDIA NIM, but you can switch to OpenAI, Groq, OpenRouter, Ollama, or a local llama.cpp-style server by setting two environment variables:

LLM_BASE_URL
LLM_MODEL

That portability was one of the more satisfying parts of the project. It lets the harness stay stable while the model backend remains easy to swap.

Key Takeaways

SynthAgent treats prompt synthesis as an architectural primitive, not a static authoring step.
The system generates a task-specific Prompt P, Tool Plan T, and Strategy S, then runs them through a Plan-Execute-Verify loop with reflection.
The Meta-Prompt Generator is the only human-authored prompt in the system, which keeps harness logic transparent and inspectable.
Structured planner output and robust JSON recovery mattered more in practice than most model-selection debates.
The current LLM-as-judge verifier is the weakest link; deterministic, task-specific verifiers are the most important next step.
The tool registry defines the legal action space, which is both a capability boundary and a safety boundary.
The architecture is deliberately lightweight: Python 3.8+, requests, and a small set of readable modules.

Conclusion

SynthAgent started from a simple discomfort: if we already accept that the harness matters more than the model in many real-world agent systems, it makes little sense to keep the prompt frozen as a manually maintained artifact while everything else evolves around it.

Building the project made that intuition sharper, but it also exposed the limits of the idea. Prompt synthesis is useful. Reflection is useful. Per-task harness generation is promising. But none of those can substitute for a verifier that actually tracks correctness. If the oracle is weak, the loop just gets better at fooling itself.

That is why I think the most important question for agent design is shifting. It is no longer just “what model should I use?” or even “what tools should I expose?” Increasingly, it is: what parts of the harness should be generated, what parts should be fixed, and how do we verify the result without getting gamed?

If that question interests you, take a look at the repo, try it on a task class you care about, and compare it against your own fixed-prompt baseline. The central claim only matters if it pays for itself in practice.

Repo: github.com/pksw4u/synthagent

Understanding the Agent Loop: How Tool-Using LLM Systems Actually Work

Pramoda Sahu — Thu, 18 Jun 2026 10:21:25 +0000

If you are building with tool-calling models, the most important design decision is often not the prompt. It is the loop around the model.

An LLM can decide it wants to use a tool, but it cannot execute that tool by itself. The surrounding application or SDK has to assemble context, inspect the model response, run tools, append results, and continue until a final answer is produced. That runtime cycle is the agent loop.

This article explains what the agent loop actually is, where the model stops and the harness begins, how tool calling works step by step, and which engineering tradeoffs show up once you move beyond demos.

TL;DR

An agent loop is the execution cycle that lets a model inspect context, request tools, observe results, and continue until it reaches a final answer.
The model is only one part of the system. The harness or SDK owns orchestration: prompt assembly, tool execution, retries, approvals, and termination.
State management matters as much as prompting. If you lose prior tool outputs or conversation continuity, the agent will behave like it forgot what just happened.
Performance depends heavily on prompt growth control, stable prompt prefixes, caching, and bounded tool output.
Safe agent design requires validation, approval gates for side effects, and clear rules for concurrency and history propagation.

The Agent Loop Is the System, Not Just the Model

The core problem is simple: a one-shot model call cannot inspect the world, act on it, and adapt to the result unless something outside the model manages that cycle.

That is the harness's job.

OpenAI's Codex architecture describes a user interaction as a turn, but a single turn may contain multiple internal iterations of model inference and tool execution. The OpenAI Agents SDK describes the same idea directly: invoke the agent, check whether there is final output, handle handoffs if needed, otherwise execute tool calls and re-run.

A practical mental model looks like this:

Build the input state.
Call the model.
Inspect the response.
If the model requested tools, validate and execute them.
Append tool results back into context.
Call the model again.
Stop only when the model returns a final answer.

That means the harness, not the model alone, is responsible for:

Prompt assembly
Message history management
Tool schema registration
Tool execution
Validation and error handling
Retry logic
Approval workflows
State persistence
Loop termination

This is why two systems using the same model can behave very differently. Their harnesses may make different decisions about context, tool ordering, truncation, approvals, and continuation.

What Goes Into a Single Turn

Before the loop can run, the system needs to define what the model sees.

The input state

A typical turn includes:

System or developer instructions
Tool definitions or schemas
Previous messages
Previous tool-call results
The current user request
Sometimes environment state, session metadata, or hidden runtime instructions

This matters because follow-up reasoning depends on prior observations being present. If the model requested a tool in one iteration and the result is not added back correctly, the next iteration cannot build on that work.

Inner loop vs outer loop

There are really two loops to think about:

Inner loop: model inference and tool execution inside a single user turn
Outer loop: the broader multi-turn conversation across user follow-ups

This distinction shows up clearly in Codex-style architectures. A user asks for something once, but the agent may internally perform several tool steps before replying. Then the next user message arrives, and the entire conversation thread continues from that accumulated state.

That is why state continuity is not optional. Without it, the outer loop breaks and the inner loop starts reasoning from an incomplete view of reality.

How the Model Decides Between Text and a Tool Call

Once the harness provides the current turn state, the model has a decision boundary: answer directly, or request one or more tools.

Tool calling works because the model is given structured tool definitions. Instead of producing only natural language, it can emit a structured request indicating which tool it wants and which arguments it wants to pass.

At that point, the model is effectively yielding control back to the application.

With custom tools, the client harness must take over, run the tool, and return the result. With hosted tools, more of that orchestration can happen inside the API itself.

This is an important architectural choice:

Tool type	Who orchestrates execution?	Main tradeoff
Hosted tool	API/runtime handles more of the loop	Simpler orchestration, less direct control
Custom function tool	Client harness executes it	More flexibility, more operational responsibility
MCP tool	Depends on integration and discovery flow	Adds discovery and caching concerns

The advantage of client-side orchestration is control. The cost is that you now own the failure modes.

Tool Execution Mechanics in Practice

Once the model emits a tool request, the harness needs to do more than just run it.

Validate before execution

A safe harness should validate:

Tool name
Argument structure
Argument types
Permission rules
Whether the tool is read-only or mutating

This is not just a security concern. It is also a quality concern. If the model asks for a tool with invalid arguments, returning an explicit tool error often gives it enough signal to self-correct on the next loop iteration.

Return the observation in the right format

The model needs a structured observation that closes the action-observation cycle.

A minimal pattern looks like this:

response = client.responses.create(
    input=initial_question,
    **MODEL_DEFAULTS,
)

while True:
    function_responses = invoke_functions_from_response(response)

    if len(function_responses) == 0:
        print(response.output_text)
        break

    print("More reasoning required, continuing...")
    response = client.responses.create(
        input=function_responses,
        previous_response_id=response.id,
        **MODEL_DEFAULTS,
    )

The key detail is not just the loop itself. It is that the next request continues from the previous response and includes the tool outputs produced by the harness.

A more explicit observation payload looks like this:

context.append({
    "type": "function_call_output",
    "call_id": tool_call.call_id,
    "output": str(result),
})

response_2 = client.responses.create(
    model="o3",
    input=context,
    tools=tools,
    store=False,
    include=["reasoning.encrypted_content"],
)

print(response_2.output_text)

That function_call_output item is the observation that lets the model continue reasoning with the tool result now available in context.

State Management Patterns: Where Many Agents Fail

One of the easiest ways to break an agent is to lose state continuity.

Common state strategies

There are several patterns in current OpenAI tooling:

Full history replay managed by the client
previous_response_id for server-managed continuation
conversation_id for conversation continuity
SDK-managed session persistence

Each approach has tradeoffs.

Full replay vs server-managed continuation

With full replay, the client sends all prior messages and tool results every time. This is simple to reason about, but payload size grows quickly.

With server-managed continuation, the client can send the new input along with a continuation identifier such as previous_response_id. That reduces payload size and offloads some history management.

This example from the Agents SDK shows response chaining:

from agents import Agent, Runner

async def main():
    agent = Agent(name="Assistant", instructions="Reply very concisely.")
    previous_response_id = None

    while True:
        user_input = input("You: ")

        # Setting auto_previous_response_id=True enables response chaining
        # automatically for the first turn, even when there is no actual
        # previous response ID yet.
        result = await Runner.run(
            agent,
            user_input,
            previous_response_id=previous_response_id,
            auto_previous_response_id=True,
        )

        previous_response_id = result.last_response_id
        print(f"Assistant: {result.final_output}")

This is convenient, but you still need to choose a consistent state strategy.

Do not mix incompatible modes

The Agents SDK documentation explicitly warns against combining session persistence with conversation_id, previous_response_id, or auto_previous_response_id in the same run path.

That is a practical design rule: pick one continuity model per call flow.

If you mix them, debugging becomes much harder because it is no longer obvious which state the model is actually seeing.

Prompt Growth, Caching, and Why Stable Prefixes Matter

As the loop continues, context grows.

Every new model call may include prior instructions, tool schemas, user messages, and tool outputs. If you simply keep appending everything forever, the number of bytes sent over the lifetime of a conversation can grow quickly.

Why Codex emphasizes prompt prefixes

The Codex architecture discussion highlights a useful principle: keep old prompt content as an exact prefix of the new prompt whenever possible. That improves prompt-cache reuse.

In practical terms, stable ordering matters for:

System instructions
Tool definitions
Environment metadata
Prior messages

If these move around between calls, cacheability drops. The same issue affects reproducibility. Even tool-definition ordering bugs can introduce cache misses and inconsistent behavior.

Compaction strategies

A production harness usually needs some combination of:

Truncating verbose tool output
Summarizing old history
Keeping static instructions stable and early
Bounding shell or retrieval output
Preserving only the most relevant observations verbatim

This matters even more for shell, retrieval, or computer-use tasks, where output can become noisy very quickly.

The goal is not just lower cost. It is maintaining a usable reasoning substrate for the model.

Safety and Control in the Loop

The more powerful the tools, the more important the harness becomes.

Approval gates and side effects

Read-only tool calls are different from side-effectful operations.

For example:

Fetching documentation is relatively low risk
Sending an email, editing a file, or executing a deployment is high risk

Mutating actions should often be:

Serialized instead of run concurrently
Approval-gated
Sandboxed when possible
Logged with enough metadata for auditability

This is one reason agent frameworks expose concurrency settings and approval workflows.

Validate arguments, not intentions

You cannot safely assume that a tool request is correct just because it came from the model. Validate the arguments before execution, and return structured error feedback when something is wrong.

That gives the loop a chance to recover without silently doing the wrong thing.

Do not over-prompt reasoning models

OpenAI's function-calling guidance for reasoning models notes that you should not force extra "think more before every function call" prompting. Reasoning models already perform internal reasoning, and excessive prompting can degrade performance.

That is a useful reminder that harness quality is often more important than prompt verbosity.

Multi-Agent Extensions and Their Tradeoffs

Once a single-agent loop works, teams often add handoffs or agent-as-tool patterns.

Conceptually, the loop stays the same:

Invoke one agent.
Detect whether it produced final output, a tool request, or a handoff.
Route execution accordingly.
Continue until termination.

The Agents SDK summarizes the semantics clearly:

The agent will run in a loop until a final output is generated. The loop runs like so:

1. The agent is invoked with the given input.
2. If there is a final output (i.e. the agent produces something of type `agent.output_type`), the loop terminates.
3. If there's a handoff, we run the loop again, with the new agent.
4. Else, we run tool calls (if any), and re-run the loop.

The tricky part is not the idea of handoffs. It is history propagation.

Recent community discussions show that when one agent is exposed as a tool to another, developers are often unsure how much history is forwarded automatically. In practice, this means you should not assume that all relevant context follows the handoff unless your framework explicitly guarantees it.

For multi-agent systems, explicit context composition is often safer than implicit inheritance.

Common Failure Modes and Debugging Strategies

Most agent bugs look obvious in hindsight.

Failure mode 1: losing continuity

Symptoms:

The agent repeats itself
It forgets prior tool results
MCP tool discovery keeps happening again

Check whether you are correctly passing previous_response_id, conversation_id, or full message history.

Failure mode 2: context flooding

Symptoms:

Long, low-quality responses
Poor tool selection
The model misses relevant facts

Check whether tool output is too verbose. Cap output size, summarize logs, and keep only useful observations.

Failure mode 3: unstable prompt construction

Symptoms:

Cache misses
Inconsistent behavior across similar runs
Higher token usage than expected

Check the ordering of instructions, tool schemas, and environment metadata.

Failure mode 4: unsafe tool execution

Symptoms:

Invalid API calls
Accidental side effects
Hard-to-reproduce failures

Validate tool names and arguments before execution. Treat tool requests as proposals, not commands.

Failure mode 5: incorrect concurrency

Symptoms:

Race conditions
Conflicting writes
Non-deterministic outcomes

Run read-only operations concurrently only when safe. Serialize or approval-gate mutating operations.

Practical Architecture Takeaways

The recent OpenAI ecosystem changes make one thing clear: the important boundary is no longer just model prompting. It is orchestration design.

The Responses API, Agents SDK, MCP integrations, and Codex harness examples all point to the same execution model:

The model chooses actions
The harness controls reality
State continuity determines coherence
Prompt discipline determines scalability
Safety controls determine whether the system is usable in practice

If you are building an agent today, the fastest path to a better system is often not a new prompt. It is a better loop.

Key Takeaways

The agent loop is the action-observation cycle that makes tool-using LLM systems possible.
The harness owns orchestration: context assembly, tool execution, validation, retries, approvals, and termination.
State continuity is critical. Losing prior responses or tool outputs breaks reasoning quality quickly.
Server-managed continuation can simplify history handling, but you should choose one state strategy consistently.
Prompt growth is an engineering problem. Stable prefixes, truncation, compaction, and bounded tool output all matter.
Hosted tools and custom tools shift the orchestration boundary in different ways.
Multi-agent patterns introduce history propagation and control-flow complexity that should be designed explicitly.
Safe execution requires argument validation, side-effect controls, and careful concurrency handling.

Conclusion

An agent loop is not a small implementation detail. It is the core runtime pattern that turns a model into a working system.

Once you see the loop clearly, many design decisions make more sense: why history management matters, why tool output must be bounded, why prompt ordering affects cacheability, and why side effects need approval and validation.

If you are building with tool-calling models, make the loop explicit first. Define how state is carried forward, how tools are validated, how observations are appended, and how the run terminates. In practice, that foundation will usually improve reliability more than any prompt tweak.

Understanding Pi Coding Agent: A Minimal, Extensible Architecture for Terminal-First AI Coding Workflow

Pramoda Sahu — Thu, 18 Jun 2026 09:17:18 +0000

TL;DR

Pi Coding Agent is built as a layered TypeScript toolkit, not a sealed coding assistant product.
Its architecture separates provider access, agent runtime, coding workflow, and terminal UI into distinct packages.
Context engineering is a first-class feature through AGENTS.md, SYSTEM.md, APPEND_SYSTEM.md, skills, and extension hooks.
Pi can run interactively, headlessly over JSONL RPC, or be embedded through its SDK using the same underlying runtime.
The flexibility comes with tradeoffs: no built-in sandbox, strict RPC framing rules, and extension authors need to understand trust and compaction behavior.

Introduction

Most coding agents present themselves as finished products: you install them, learn their commands, and work within the boundaries the authors chose. That can be fine if the built-in workflow matches your needs. It becomes limiting when you want to change how prompts are assembled, how tools are registered, how sessions are summarized, or how the agent is embedded inside your own application.

Pi Coding Agent takes a different path.

Based on the official Pi homepage, documentation, and repository, Pi from Earendil Works is better understood as a minimal agent harness with a coding-oriented runtime than as a fixed end-user product. It ships with useful defaults, but its architecture assumes users may want to replace or extend large parts of the workflow. The project explicitly positions advanced behavior such as plan-like workflows, extra commands, and other higher-level capabilities as things that can live in extensions or packages instead of being hardcoded into the core.

That design choice matters for engineers building AI tooling. It affects maintainability, portability, and how easily the system can adapt to terminals, IDE wrappers, automation pipelines, or internal developer platforms.

In this article, we will look at how Pi is structured, why its layering matters, how its context pipeline works, and what tradeoffs appear once you start using extensions, RPC mode, or SDK embedding.

The problem Pi is trying to solve

A coding agent has to do several jobs at once:

Talk to one or more model providers.
Maintain an agent loop with tool calls and state.
Manage coding-specific concerns such as filesystem access, shell execution, session history, and context limits.
Provide a user interface or integration surface.

Many tools solve all four inside one tightly coupled application. That can make the initial experience simple, but it often makes customization expensive. If you want to change prompt composition or session summarization, you may end up forking the project or working against internal assumptions.

Pi’s architecture addresses this by splitting responsibilities into layers.

The Pi stack: four layers instead of one monolith

According to the repository README, Pi is organized as a monorepo with distinct packages:

@earendil-works/pi-ai
@earendil-works/pi-agent-core
@earendil-works/pi-coding-agent
@earendil-works/pi-tui

This package split is the clearest way to understand the system.

Layer 1: `pi-ai`

This is the provider abstraction layer. Its role is to present a unified interface across multiple model providers.

Why this layer exists:

The agent loop should not depend directly on one provider SDK.
Provider switching should not require rewriting the coding runtime.
Frontends and extension systems should remain provider-agnostic where possible.

This is a standard but important decision. If provider-specific details leak into higher layers, the whole system becomes harder to test and evolve.

Layer 2: `pi-agent-core`

This is the runtime layer for core agent behavior, including tool calling and state management.

Why this matters:

Tool execution is a runtime concern, not a terminal UI concern.
State transitions in the loop should be reusable in both CLI and embedded modes.
A headless integration should get the same agent behavior as the interactive one.

Architecturally, this is the part that keeps Pi from being “just a CLI.”

Layer 3: `pi-coding-agent`

This is where Pi becomes a coding agent rather than a generic agent harness.

This layer includes:

coding workflow behavior
sessions and persistence
built-in file and shell tools
compaction and summarization
extensions
skills
mode-specific runtime assembly

This package is the operational center of the project. It contains the logic that most users think of as “Pi,” while still remaining separable from the lower-level runtime and the higher-level UI.

Layer 4: `pi-tui`

This is the terminal UI layer.

Its presence as a distinct package is important because it suggests the user interface is not the agent itself. The same runtime can support different frontends.

That leads directly to one of Pi’s strongest architectural decisions: frontend/runtime separation.

One runtime, multiple modes

The official docs describe four major usage modes:

interactive
print/JSON
RPC
SDK embedding

That means Pi is not tied to its terminal interface, even if the terminal is the primary experience.

Interactive mode

This is the user-facing CLI workflow most people will start with. It combines the runtime with the terminal UI and built-in commands.

Print and JSON modes

These modes are useful for automation or simple scripting where you want structured output without a long-lived interactive session.

RPC mode

RPC mode exposes Pi through a JSONL protocol over stdin/stdout. This is the mode that makes IDE integrations, editor plugins, and service wrappers plausible without reimplementing the core runtime.

For example:

pi --mode rpc [options]

{"id": "req-1", "type": "prompt", "message": "Hello, world!"}

This is a strong design choice because subprocess embedding is often the easiest integration path for tools written in another language or running in another environment.

SDK mode

For Node.js and TypeScript applications, Pi can be embedded in-process through its SDK.

import {
  type CreateAgentSessionRuntimeFactory,
  createAgentSessionFromServices,
  createAgentSessionRuntime,
  createAgentSessionServices,
  getAgentDir,
  runRpcMode,
  SessionManager,
} from "@earendil-works/pi-coding-agent";

const createRuntime: CreateAgentSessionRuntimeFactory = async ({ cwd, sessionManager, sessionStartEvent }) => {
  const services = await createAgentSessionServices({ cwd });
  return {
    ...(await createAgentSessionFromServices({
      services,
      sessionManager,
      sessionStartEvent,
    })),
    services,
    diagnostics: services.diagnostics,
  };
};

const runtime = await createAgentSessionRuntime(createRuntime, {
  cwd: process.cwd(),
  agentDir: getAgentDir(),
  sessionManager: SessionManager.create(process.cwd()),
});

await runRpcMode(runtime);

This snippet shows the decomposition clearly: services, session manager, runtime creation, then a mode runner on top.

Core runtime flow: prompt, tools, persistence, compaction

For AI agents, architecture is really about workflow under constraints. Pi’s runtime appears to follow a loop like this:

Load startup context and trust-sensitive configuration.
Assemble the system prompt and working context.
Run extension hooks before the model call.
Send the provider request.
Receive model output, including possible tool calls.
Execute tool calls and attach results.
Repeat until the assistant completes.
Persist session entries.
Compact older context when token pressure increases.

The interesting part is that this pipeline is not fully hardcoded. The extension system lets you intercept multiple stages.

Extension hooks make the loop observable and adjustable

The extension docs describe lifecycle events around startup, provider requests, tool calls, compaction, tree navigation, and shutdown. Examples mentioned in the source material include:

session_start
before_agent_start
tool_call
before_provider_request
after_provider_response
session_before_compact
session_compact
session_before_tree
session_tree
session_shutdown

That event model suggests a publish/subscribe architecture around the core loop instead of a single monolithic pipeline. This is one of the biggest reasons Pi feels more like a toolkit than a product.

Context engineering is built into the architecture

A lot of agent systems treat prompt engineering as text pasted into a config file. Pi treats it as infrastructure.

According to the docs and homepage, Pi can load:

AGENTS.md and CLAUDE.md from user/global and project directories
SYSTEM.md to replace the default system prompt
APPEND_SYSTEM.md to append to it
skills loaded on demand
prompt templates
extension-provided prompt modifications
project trust state

This is not a minor convenience feature. It changes how the system is operated.

Why on-demand skills matter

Skills are loaded only when needed instead of always being included in the prompt. That helps avoid bloating context windows and prompt caches.

This is a practical tradeoff:

Always-loaded instructions are simpler.
On-demand loading is more efficient and gives finer control.

Pi chooses the second option, which fits its broader design: minimal default core, dynamic behavior at runtime.

Prompt customization through extensions

Pi also allows extensions to modify the assembled system prompt before model execution.

export default function promptCustomizer(pi: ExtensionAPI) {
  pi.on("before_agent_start", async (event) => {
    const { systemPrompt, systemPromptOptions } = event;
    const customPrompt = addToolGuidance(systemPromptOptions, systemPrompt);
    const appendSection = mergeWithUserAppend(systemPromptOptions);

    return {
      systemPrompt: `${customPrompt}${appendSection}`,
    };
  });
}

This is a strong example of Pi’s philosophy. Prompt composition is not just a file-loading step; it is part of the runtime and open to modification.

Sessions, JSONL persistence, and branching

Pi stores sessions in JSONL and supports commands such as /resume, /new, /tree, /fork, and /clone.

That combination implies that the session model is not a flat transcript. It supports branching workflows where a user can explore alternate paths.

Why JSONL is a sensible choice

JSONL is a practical format for agent session storage because it is:

append-friendly
easy to inspect
easy to process line by line
convenient for event-like histories

For terminal-first tools, that is often a better fit than requiring a heavier database.

Branching changes the context story

The source material notes that branch summarization is used when switching branches so that context from the abandoned branch can be injected into the new branch’s working context.

That matters because branching is not just a UI feature. It affects memory and continuity.

Pi also distinguishes between full history and in-memory working context. Compaction affects the latter, not the underlying stored session history. That is an important operational detail if you are debugging behavior or writing extensions that depend on prior entries.

Compaction is not just token trimming

Most agent systems eventually need summarization because context windows are finite. Pi exposes compaction as a visible architectural feature rather than hiding it as internal bookkeeping.

The docs describe two summarization mechanisms:

auto/manual compaction
branch summarization

They also define cut-point rules. For example, tool results must remain attached to their tool calls, so valid compaction boundaries are restricted.

That is exactly the kind of implementation detail extension authors need to know. If your extension assumes history can be split anywhere, you may break tool-call coherence.

Pi even allows custom compaction logic through hooks.

pi.on("session_before_compact", async (event, ctx) => {
  const { preparation, branchEntries, customInstructions, signal } = event;

  // Cancel:
  return { cancel: true };

  // Custom summary:
  return {
    compaction: {
      summary: "...",
      firstKeptEntryId: preparation.firstKeptEntryId,
      tokensBefore: preparation.tokensBefore,
    },
  };
});

This makes compaction a policy surface, not just an implementation detail.

Tradeoffs of customizable compaction

The flexibility is useful, but it increases the burden on extension authors.

You need to understand:

firstKeptEntryId
tokensBefore
serialized and truncated tool outputs
valid cut points
how repeated compactions relate to earlier kept boundaries

If you ignore those details, summaries may be technically valid but operationally misleading.

Extensions are the real center of Pi’s design

Pi’s homepage explicitly says it skips some built-in features and expects users to add them through extensions or packages. That is one of the most unusual and important aspects of the project.

Dynamic tool registration

Tools are not fixed at compile time. An extension can register them during session startup.

import type { ExtensionAPI } from "@earendil-works/pi-coding-agent";
import { Type } from "typebox";

const ECHO_PARAMS = Type.Object({
  message: Type.String({ description: "Message to echo" }),
});

export default function dynamicToolsExtension(pi: ExtensionAPI) {
  const registeredToolNames = new Set<string>();

  const registerEchoTool = (
    name: string,
    label: string,
    prefix: string,
  ): boolean => {
    if (registeredToolNames.has(name)) {
      return false;
    }

    registeredToolNames.add(name);

    pi.registerTool({
      name,
      label,
      description: `Echo a message with prefix: ${prefix}`,
      promptSnippet: `Echo back user-provided text with ${prefix.trim()} prefix`,
      promptGuidelines: [
        "Use echo_session when the user asks for exact echo output.",
      ],
      parameters: ECHO_PARAMS,
      async execute(_toolCallId, params) {
        return {
          content: [{ type: "text", text: `${prefix}${params.message}` }],
          details: { tool: name, prefix },
        };
      },
    });

    return true;
  };

  pi.on("session_start", (_event, ctx) => {
    registerEchoTool("echo_session", "Echo Session", "[session] ");
    ctx.ui.notify("Registered dynamic tool: echo_session", "info");
  });
}

This is a clear signal that Pi’s workflow surface is intended to be extended, not merely configured.

What extensions can change

Based on the provided material, extensions can influence:

commands
tools
provider request/response handling
prompt assembly
compaction behavior
tree navigation behavior
UI interactions
workflow logic around session lifecycle

That is unusually broad. It also explains why Pi can remain small at the core while still supporting highly specialized workflows.

Headless integrations: RPC mode and its sharp edges

RPC mode is one of Pi’s most practical features for teams building wrappers or custom frontends. But the protocol details matter.

The docs specify strict JSONL semantics with LF as the record delimiter.

The source material calls out a concrete gotcha: Node’s readline is not protocol-compliant for this use case because it can split on Unicode line separators such as U+2028 and U+2029, which are valid inside JSON strings.

That means a robust client should:

split records on \n only
accept optional \r\n by stripping the trailing \r
avoid generic line readers that reinterpret other Unicode characters as line boundaries

This is a good example of a small but important systems detail. If you are embedding Pi inside an editor extension or orchestrator, protocol correctness matters more than convenience.

Security and operational concerns

Pi’s flexibility does not remove operational risk.

No built-in sandbox

The repository README states that Pi does not provide a built-in permission system for filesystem, process, network, or credential access. It runs with the launching user’s permissions.

That has an obvious implication: if you need stronger isolation, you should containerize or otherwise sandbox it externally.

Trust model affects what loads

Before trust is granted, Pi loads only a subset of context and extension sources. According to the docs, project-local extensions, package-managed project extensions, and project settings are loaded only after trust resolution.

In non-interactive modes, trust prompts are not shown, so automation behavior depends on defaults or explicit CLI overrides.

If you are building tooling around Pi, document this clearly. Otherwise, a project may behave differently in interactive use versus CI-like or subprocess-driven environments.

Extension lifecycle resets on fork and clone

After /fork or /clone, Pi emits session_shutdown for the old extension instance, reloads and rebinds extensions, and then emits session_start for the new session.

That means in-memory extension state is not automatically preserved. If state matters, persist it into session entries or rebuild it during startup.

Why this architecture matters in practice

Pi’s design is especially useful when you need one of the following:

a terminal-first agent that is still scriptable
a reusable runtime for editor or service integration
custom prompt assembly without forking the core project
organization-specific commands, tools, or policies through extensions
session storage that is inspectable and easy to process

In other words, Pi is less about delivering one ideal workflow and more about providing a stable substrate for many workflows.

That is the real architectural difference.

Key takeaways

Pi is best understood as a layered toolkit for coding agents, not a fixed assistant product.
The package split separates providers, agent runtime, coding workflow, and terminal UI in a clean way.
Context engineering is deeply integrated through files, skills, prompt templates, and hooks.
Sessions are durable and branch-aware through JSONL persistence and summarization mechanisms.
Extensions are central to the design and can reshape tools, prompts, compaction, and workflow behavior.
RPC and SDK modes make the same runtime usable in terminals, subprocess integrations, and custom applications.
Operational safety is your responsibility: sandboxing, trust configuration, and extension-state handling all need deliberate design.

Conclusion

Pi Coding Agent stands out because it treats extensibility as the default architecture rather than an afterthought. The minimal core is not a limitation by accident; it is the mechanism that keeps the system adaptable.

That makes Pi especially interesting for engineers who want more than a terminal chatbot. If you need a coding agent that can be embedded, wrapped, or reshaped without forking the entire application, Pi’s layered design is worth studying.

The practical next step is to evaluate it in the mode closest to your real use case:

If you want a terminal workflow, start with interactive mode.
If you want editor or service integration, inspect RPC framing carefully.
If you want deep control over behavior, study the extension lifecycle and compaction hooks before writing custom logic.

In Pi, the architecture is the product.

DEV Community: Pramoda Sahu

How I Built a Compounding Markdown Wiki for Research: When Standard RAG Stops Being Enough

The Real Problem: Context Rot

What context rot looks like in practice

Why Standard RAG Solves Scale but Not Cumulative Research

Where standard RAG falls short for research

No persistent cross-source memory by default

No compounding

Weak provenance

Query-time cost keeps accumulating

The Shift: Knowledge Should Compound

The RLM Pattern Behind the Design

The Architecture: A Compounding Markdown Wiki

Layer 1: Raw sources

Layer 2: The wiki

Layer 3: The schema

Why markdown matters

How Ingestion Works

What “semantic chunking” means here

What gets extracted per chunk

A Compact End-to-End Example

1) Source excerpt

2) Extraction output

3) Wiki page update

4) Query response

How Contradiction Detection Works

How Querying Works

Query strategies

Lookup

Synthesis

Deep-dive

Where raw retrieval still fits

Why This Beats Plain RAG for Research

Provenance becomes operational, not cosmetic

Contradictions stop disappearing

Edits become cheaper

Practical Performance Notes

What This Enables

Multi-source research

Project-specific memory for teams and writers

Better structure as the corpus matures

Practical Packaging and Tooling

Minimal usage example

Example entity page format

The Deeper Point: Search Is Not Understanding

Key Takeaways

Conclusion

Designing a Self-Prompting Agent Harness with Per-Task Prompt, Tool, and Strategy Synthesis

The Core Idea: Generate the Harness at Task Time

SynthAgent Architecture

A Concrete Example of Synthesized P/T/S

The Meta-Prompt Generator Is the Only Hand-Written Prompt

mpg.py

The PEV Loop Turns Prompt Synthesis Into Runtime Behavior

harness.py

Structured planner output matters more than planner creativity

Tool argument normalization prevents embarrassingly common failures

Hard step limits are non-negotiable

Persisting every step to disk changes failure recovery

How Much Does strategy_s Change Runtime Behavior Today?

The Verifier Is the Weakest Part of the Current Design

verifier.py

Reflection Works Best at the Prompt Level, Not the Response Level

reflector.py

Response-level reflection is usually too shallow

Prompt-level reflection is the current sweet spot

Full re-synthesis is stronger, but more expensive and riskier

Memory Stores Both Raw Runs and Reusable Lessons

memory.py

The Tool Surface Defines the Agent's Legal Action Space

tools.py

Model Selection Uses Any OpenAI-Compatible Endpoint

llm.py

NVIDIA embedding requests need input_type

Chat completions are similar enough for this project's core flow

What's in the Repository

What I Learned Building SynthAgent

JSON parsing is harder than model selection

Reflection needs the trajectory, not just the score

The smallest prompts are often the highest-leverage components

A Concrete Example of Synthesized `P/T/S`

`mpg.py`

`harness.py`

How Much Does `strategy_s` Change Runtime Behavior Today?

`verifier.py`

`reflector.py`

`memory.py`

`tools.py`

`llm.py`

NVIDIA embedding requests need `input_type`

Layer 1: `pi-ai`

Layer 2: `pi-agent-core`

Layer 3: `pi-coding-agent`

Layer 4: `pi-tui`