A synthesis of hard-won lessons from Claude Code, OpenHands, SWE-agent, GoClaw, Nanobot, PicoClaw, and the emerging discipline of harness engineering. This is the guide we wish existed when we started building agents.
The goal: an AI agent that is fast, scalable, capable, reliable, efficient, and secure – not by accident, but by design.
How to read this guide
- Read top-to-bottom for the full mental model. Each section builds on the previous one.
- Skim the boxes if you only want the takeaways – every section ends with an Actionable rules box.
- Jump to Part 14 – The Build-Your-Own Roadmap if you already know the theory and want a sequenced plan.
- Bookmark Part 15 – Anti-Patterns for design reviews.
Table of Contents
- Part 0 – The Core Equation
- Part 1 – Mental Model: What an AI Agent Actually Is
- Part 2 – The Agent Loop (the Kernel)
- Part 3 – Tools: The Agent's Hands
- Part 4 – Context Engineering
- Part 5 – Memory (Long-Term Knowledge)
- Part 6 – Concurrency & Multi-Agent Patterns
- Part 7 – Reliability: Error Recovery, Stuck Detection, Autosubmit
- Part 8 – Security: Defense-in-Depth
- Part 9 – Multi-Tenancy from Day One
- Part 10 – Performance & Efficiency
- Part 11 – Provider Abstraction & Resilience
- Part 12 – Channels & Integration Surface
- Part 13 – Observability & Evaluation
- Part 14 – The Build-Your-Own Roadmap
- Part 15 – Anti-Patterns to Avoid
- Part 16 – Closing: The Harness Mindset
Part 0 – The Core Equation: Agent = Model + Harness
The single most important insight in agent engineering, framed by Vivek Trivedy of LangChain: "If you're not the model, you're the harness."
Reliability ≈ Model capability × Harness quality
                  (fixed)           (your job)
The model is fixed for the duration of your project. The harness – system prompts, tools, sandboxes, memory, hooks, orchestration, observability – is where 80% of agent quality comes from. After hundreds of production agent sessions across many teams, the pattern is consistent:
It's almost never a model problem. It's a configuration problem.
Teams who blame weak results on the model are usually wrong. Teams who treat harness engineering as a discipline – equal in rigor to model selection – ship reliably.
This guide is about the harness.
What the harness contains
| Layer | Concrete artifacts |
|---|---|
| System prompts | identity, methodology, tool docs, safety rules, persona |
| Tools / skills / MCP | the agent's allowed actions on the world |
| Infrastructure | sandboxes, filesystems, browsers, persistent shells |
| Orchestration | loops, sub-agents, handoffs, routing, retries |
| Hooks / middleware | linters, validators, approval gates, scrubbers |
| Memory & state | files, git, knowledge graphs, sessions, traces |
| Observability | spans, costs, evals, replays |
Every Part below covers one of these layers in depth.
Part 1 – Mental Model: What an AI Agent Actually Is
Before any code: hold the right picture in your head.
The one-paragraph definition
An AI agent is a streaming, cancellable, recursive state machine that emits typed actions against an environment, observes the consequences, and runs in a loop until a typed terminal state. The model is a function from history to next action. Everything else – memory, sub-agents, security, hooks – is a small module hanging off that single loop.
The canonical loop
loop while not done:
    state = compress(state)                   # context fits the budget
    response = await model(state)             # streaming
    yield response                            # to UI / caller
    if no tool_calls: return completed
    batches = partition(response.tool_calls)  # concurrent vs serial
    for batch in batches:
        results = run(batch)                  # parallel where safe
        yield results
        state += results
That's it: ten lines. Every agent in this guide – Claude Code, OpenHands, SWE-agent, GoClaw, Nanobot, PicoClaw – implements that exact shape, with variations in how each piece is realized.
The five universal abstractions
Every production agent reduces to these. Implement them as first-class modules, not helpers attached to a god object.
| # | Abstraction | Responsibility |
|---|---|---|
| 1 | Loop | Drives iterations, classifies LLM responses, tracks terminal state |
| 2 | Tools | Self-describing actions with schema, permissions, concurrency safety |
| 3 | State | Append-only event log + a reactive view layer |
| 4 | Memory | File-tier persistence read at session start; LLM picks what to load |
| 5 | Hooks | Lifecycle interceptors at well-defined points |
The four design principles to lift wholesale (from OpenHands V1)
- Optional isolation, not mandatory sandboxing. The agent runs in-process by default; swap `LocalWorkspace` → `DockerWorkspace` for isolation without changing agent code. Don't make sandboxing a build-time decision.
- Stateless components, single source of truth. Agent, Tool, LLM, and Condenser are immutable models. The only mutable thing is `ConversationState`. State changes happen by appending events, never by mutating objects.
- Strict separation of concerns. The SDK never imports applications. The CLI, GUI, and your custom integration all consume the SDK as a library.
- Two-layer composability. Compose at package level (swap workspaces, swap servers) and at component level (swap tools, prompts, condensers, LLMs).
Five rules carry 80% of the design (from Claude Code's source)
- The loop is an async generator. Backpressure, cancellation, and typed terminal states fall out for free.
- Every tool is self-describing (schema, permissions, concurrency safety). The loop never special-cases tools.
- Safety is per-invocation, not per-tool. `Bash("ls")` ≠ `Bash("rm -rf")`.
- Prompt cache is architecture, not optimization. Every design either preserves cache hits or busts them.
- Memory is files. A small LLM picks which to load. No vector DB, no embeddings (yet). Trust through transparency.
Actionable rule: Build those five things well and you have ~80% of a production agent. Everything else is layering and polish.
Part 2 – The Agent Loop (the Kernel)
The loop is the only place control flow lives. Everything else – tools, hooks, sub-agents – yields through it.
2.1 Why an async generator (not a callback web)
Three concrete reasons:
- Backpressure for free. A generator yields only when the consumer calls `.next()`. The UI naturally pauses if it can't render fast enough.
- Typed terminal states. The generator's `return` is a discriminated union of why execution stopped: `completed`, `max_turns`, `error`, `aborted_streaming`, `aborted_tools`, `prompt_too_long`, `image_error`, `model_error`, `stop_hook_prevented`, `hook_stopped`. The compiler enforces exhaustive handling.
- Composability. Inner generators delegate via `yield*`. No callback nesting.
In Python, use `async def` + `async for`. In Go, use channels or a method that returns a `(StepOutput, error)` tuple per call.
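A minimal Python sketch of that generator shape, with a stub model and an illustrative `Terminal` class standing in for the typed terminal union. One Python wrinkle: async generators cannot `return` a value, so the terminal state is yielded as the final event instead.

```python
import asyncio

class Terminal:
    """Illustrative typed terminal state: why the loop stopped."""
    def __init__(self, reason):
        self.reason = reason  # "completed" | "max_turns" | ...

async def agent_loop(model, state, max_turns=20):
    # Python async generators can't `return` a value, so the typed
    # terminal state is yielded as the final event instead.
    for _ in range(max_turns):
        response = await model(state)
        yield response                 # consumer pull = backpressure
        if not response.get("tool_calls"):
            yield Terminal("completed")
            return
        state.append(response)
    yield Terminal("max_turns")

async def demo():
    async def stub_model(state):       # stub: one tool call, then done
        return {"tool_calls": []} if state else {"tool_calls": ["Read"]}
    return [event async for event in agent_loop(stub_model, [])]

events = asyncio.run(demo())
```

The consumer drives the loop one event at a time, which is exactly where the backpressure and cancellation properties come from.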
2.2 Phases worth memorizing (the 5-phase shape)
Borrowed from OpenHands' `Agent.step()` and SWE-agent's `forward_with_handling`:
def step(self):
    # 1. Drain confirmed actions waiting to execute (confirmation flow).
    if pending := drain_pending(state):
        return execute(pending)
    # 2. Honor user-input or stop hooks that could block this turn.
    if blocked := check_hooks(state):
        return finished(reason=blocked)
    # 3. Build the LLM prompt – may return a Condensation event instead.
    msgs = prepare_llm_messages(state.events, condenser, llm)
    if isinstance(msgs, Condensation):
        return msgs
    # 4. Call the LLM, with explicit context-window handling.
    try:
        response = call_llm(msgs, tools, on_token=stream)
    except ContextWindowExceeded:
        return CondensationRequest()  # next step compresses
    # 5. Classify and dispatch.
    match classify(response):
        case TOOL_CALLS: return execute_tools(response)
        case CONTENT: return emit_message(response)
        case EMPTY: return retry_with_nudge()
The hardest bug in agent loops is "the LLM responded but my code didn't know what to do with it." Explicit response classification kills that bug. Branch on a tagged union, not on a string match.
2.3 Continue-states (don't return, just continue)
Some failures should re-enter the loop, not terminate it:
| Trigger | First step | Second step | Third step |
|---|---|---|---|
| `prompt_too_long` (413) | drain staged collapse summaries | reactive compact | surface to user |
| `max_output_tokens` | escalate cap 8K → 64K | multi-turn recovery (≤3 attempts) | surface |
| `format_error` (parse failed) | re-prompt with format hint | (max_requeries=3) | surface |
| bash syntax error | re-prompt with `bash -n` output | (counted) | surface |
Name each transition (`collapse_drain_retry`, `reactive_compact_retry`, `max_output_tokens_escalate`). Every test asserts which transition fired.
Hard rule: Guards prevent infinite loops. `hasAttemptedReactiveCompact` one-shot flags. Hard caps on recovery attempts. Never run stop-hooks on an error response – that creates "error → hook blocks → retry → error" spirals.
2.4 The tool_use / tool_result invariant
This is the single most common bug source.
Every `tool_use` your agent emits MUST have a paired `tool_result` in message history before the next API call.
Anthropic and OpenAI APIs both reject assistant messages that contain a `tool_use` block without a matching `tool_result` in the next user message.
On any cancellation path – Esc, timeout, error – drain queued tools by emitting synthetic `tool_result` blocks:
"Cancelled: parallel tool call Bash(mkdir build) errored"
signal.reason distinguishes hard aborts from "submit interrupts" (a new user message), so you skip redundant interruption stubs in the latter case.
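A sketch of that drain, assuming Anthropic-style content blocks; `drain_cancelled_tools` and the exact message shapes are illustrative, not any harness's actual code:

```python
def drain_cancelled_tools(assistant_msg, completed_ids, reason="user abort"):
    """Synthesize tool_result blocks for every tool_use that never ran,
    so the next API call sees a well-formed tool_use/tool_result pairing."""
    stubs = []
    for block in assistant_msg["content"]:
        if block.get("type") != "tool_use" or block["id"] in completed_ids:
            continue  # not a tool call, or a real result already exists
        stubs.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "is_error": True,
            "content": f"Cancelled: {block['name']}({block['input']}): {reason}",
        })
    return {"role": "user", "content": stubs}

assistant = {"role": "assistant", "content": [
    {"type": "tool_use", "id": "t1", "name": "Bash", "input": "mkdir build"},
    {"type": "tool_use", "id": "t2", "name": "Read", "input": "a.ts"},
]}
drained = drain_cancelled_tools(assistant, completed_ids={"t1"})
```

Append the returned user message to history before the next call and the invariant holds even on abort paths.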
2.5 Loop hygiene every iteration
Before each LLM call, OpenHands' `Runner.run` enforces these invariants:
- Drop orphan tool_results: if the LLM forgot to emit `tool_use` for a `tool_result`, strip it.
- Backfill missing tool_results: if the LLM emitted `tool_use` with no matching result, synthesize an error placeholder so the trace is well-formed.
- Microcompact: fast pre-call truncation of large blobs.
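The first two repairs can be sketched as one pass over the history (the `repair_history` helper and message shapes are illustrative, not OpenHands' actual code):

```python
def repair_history(messages):
    """Drop orphan tool_results; backfill missing ones with error stubs."""
    tool_use_ids = {b["id"] for m in messages if m["role"] == "assistant"
                    for b in m["content"] if b.get("type") == "tool_use"}
    result_ids = {b["tool_use_id"] for m in messages if m["role"] == "user"
                  for b in m["content"] if b.get("type") == "tool_result"}
    repaired = []
    for m in messages:
        if m["role"] == "user":
            # Invariant 1: strip tool_results with no matching tool_use.
            m = {**m, "content": [b for b in m["content"]
                                  if b.get("type") != "tool_result"
                                  or b["tool_use_id"] in tool_use_ids]}
        repaired.append(m)
        if m["role"] == "assistant":
            # Invariant 2: synthesize placeholders for unanswered tool_uses.
            missing = [b for b in m["content"] if b.get("type") == "tool_use"
                       and b["id"] not in result_ids]
            if missing:
                repaired.append({"role": "user", "content": [
                    {"type": "tool_result", "tool_use_id": b["id"],
                     "is_error": True, "content": "Result lost; synthesized."}
                    for b in missing]})
    return repaired

history = [
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "x", "name": "Bash", "input": "ls"}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "z", "content": "orphan"}]},
]
fixed = repair_history(history)
```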
2.6 Concrete budgets
These defaults are battle-tested across the open-source agent ecosystem:
| Budget | Default | Why |
|---|---|---|
| Max iterations per turn | 20–25 | Prevents runaway loops |
| Max requeries on parse fail | 3 | Three strikes is enough |
| Per-instance cost cap | $2–$3 | Cost beats step count as a stop signal – step counts vary 5× across models |
| Per-command timeout | 60s | Most legitimate ops finish; longer needs explicit declaration |
| Total execution timeout | per task | Full task wallclock budget |
| Max consecutive timeouts | 5 | Anomalous environment, give up |
| Max observation length | 100K chars | Above this, persist to disk and return a preview |
| Idle watchdog | 90s (warn at 45s) | Stalled stream → abort + non-streaming retry |
Cost is the primary stop signal, not steps. Step counts vary wildly across models – Claude uses few long steps, GPT-5 uses many short ones. A $3-per-task cap normalizes this.
Actionable rules
- The loop is one async generator with five named phases.
- Branch on a typed union of LLM response classifications.
- Pair every `tool_use` with a `tool_result` before the next API call.
- Cap iterations and cost; cost is the primary signal.
- Errors that should retry use continue states; errors that shouldn't return terminal states.
- Never run stop-hooks on error responses.
Part 3 – Tools: The Agent's Hands
If the loop is the kernel, tools are the kernel's syscalls.
3.1 The tool interface
A tool is parameterized by three types: Input, Output, Progress. Five members are critical:
interface Tool<I, O, P> {
name: string
description: string
inputSchema: Schema // Zod / Pydantic / JSON Schema
isConcurrencySafe(input): boolean // PER INVOCATION, not per type
checkPermissions(input, ctx): allow | deny | ask | passthrough
validateInput(input, ctx): ok | error
call(input, ctx): Result<O>
}
3.2 The `buildTool()` factory pattern (fail-closed)
Never construct a tool literal directly. Wrap it in a factory that fills in safe defaults:
const SAFE_DEFAULTS = {
isEnabled: () => true,
isParallelSafe: () => false, // serial unless proven otherwise
isReadOnly: () => false, // assume writes
isDestructive: () => false,
checkPermissions: (input) => ({ behavior: 'allow', updatedInput: input }),
}
If a tool author forgets `isConcurrencySafe`, they get serial execution – slow, but never corrupting. The opposite default would silently produce race conditions.
3.3 Safety is per-invocation, not per-type
`Bash("ls -la")` is concurrency-safe. `Bash("rm -rf build/")` is not. Same tool, different inputs, different verdicts.
Pass the parsed input to the safety check. If parsing fails or the check throws, default to serial execution. Always fail closed.
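A fail-closed sketch for a Bash tool. The read-only allowlist and helper name are illustrative; real classifiers are considerably richer:

```python
import shlex

# Illustrative allowlist of read-only commands; a real classifier is richer.
READ_ONLY_CMDS = {"ls", "cat", "grep", "head", "tail", "pwd", "wc", "find"}

def bash_is_concurrency_safe(command: str) -> bool:
    """Per-invocation verdict for a Bash tool. Fail closed: any parse
    error or unknown command means a serial (unsafe) verdict."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False                  # unparseable input -> serial
    if not tokens:
        return False
    # Split pipelines/compound commands; every segment must be read-only.
    segments, current = [], []
    for tok in tokens:
        if tok in ("|", "&&", "||", ";"):
            segments.append(current)
            current = []
        else:
            current.append(tok)
    segments.append(current)
    return all(seg and seg[0] in READ_ONLY_CMDS for seg in segments)
```

Note both failure modes (unterminated quote, unknown command) fall through to `False`: the worst case is a slower serial run, never a race.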
3.4 The 14-step tool execution pipeline
Every tool call in Claude Code goes through this single pipeline:
| # | Step | Why |
|---|---|---|
| 1 | Tool lookup with alias map | Old transcripts may reference renamed tools |
| 2 | Abort check | Don't waste compute on cancelled queued calls |
| 3 | Schema validation | Catch type errors early |
| 4 | Semantic validation | Reject no-op edits, etc. |
| 5 | Speculative classifier (parallel) | Auto-mode permission classifier for Bash |
| 6 | Input backfill | Expand `~/foo` → absolute paths for hooks/permissions, but keep originals for transcript stability |
| 7 | `PreToolUse` hooks | Hooks decide / modify / block |
| 8 | Permission resolution chain | Rule match → tool method → mode default → prompt → classifier |
| 9 | Permission denied path | Build error, fire `PermissionDenied` hook |
| 10 | Execute `call()` | The actual work |
| 11 | Result budgeting | Persist oversized output; return preview |
| 12 | `PostToolUse` hooks | Modify output, possibly block continuation |
| 13 | Append `newMessages` | Sub-agent transcripts, system reminders |
| 14 | Error classification | Telemetry, OTel events |
Implement this as one function (checkPermissionsAndCallTool). Skipping any step will hurt later.
3.5 Concurrency partitioning
def partition(calls):
    batches = []
    current = ConcurrentBatch()
    for call in calls:
        tool = lookup(call.name)
        parsed = tool.schema.safe_parse(call.input)
        # Fail closed: unparseable input is never concurrency-safe.
        safe = parsed.success and tool.is_concurrency_safe(parsed.data)
        if safe:
            current.push(call)
        else:
            if current: batches.push(current)
            batches.push(SerialBatch([call]))
            current = ConcurrentBatch()
    if current: batches.push(current)
    return batches
Example: `[Read, Read, Grep, Edit, Read]` → `[concurrent[Read, Read, Grep], serial[Edit], concurrent[Read]]`.
Yield results in submission order, not completion order – even if `c.ts` finishes before `a.ts`, the conversation history must remain a, b, c.
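In Python's asyncio that ordering comes for free: `asyncio.gather` returns results in the order the awaitables were submitted, regardless of which finished first. A small sketch (tool names and delays are invented):

```python
import asyncio

async def run_tool(name, delay):
    await asyncio.sleep(delay)          # simulate tool latency
    return f"{name}: done"

async def run_concurrent_batch(calls):
    # gather preserves submission order even when later tasks finish first
    return await asyncio.gather(*(run_tool(n, d) for n, d in calls))

results = asyncio.run(run_concurrent_batch(
    [("a.ts", 0.03), ("b.ts", 0.02), ("c.ts", 0.01)]))   # c.ts finishes first
```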
3.6 Speculative streaming execution
Watch the model stream. The moment a `tool_use` block is fully parsed (often seconds before the response finishes), start that tool – provided admission rules allow.
Admission rule: a tool can start executing iff no tool is currently running, or both the new tool and all currently-running tools are concurrency-safe.
Sequential timeline: stream 2.5s + 3 serial tools = 3.1s.
Speculative: stream 2.5s overlapped with tools 1โ2 = 2.6s.
3.7 Three concurrency flags every tool needs
| Flag | Meaning |
|---|---|
| `read_only` | Safe to ignore for state checkpointing |
| `concurrency_safe` | Can be batched in `gather()` |
| `exclusive` | Must be the only tool in its batch |
The runner partitions a turn's tool calls honoring these. That's how you get fast parallel reads without races on writes – without a real scheduler.
3.8 Result budgeting: per-tool size caps
| Tool | maxResultSizeChars | Rationale |
|---|---|---|
| Bash | 30,000 | Most useful output fits |
| Edit | 100,000 | Diffs need room |
| Grep | 100,000 | Search results accumulate |
| Read | – | Self-bounded by token limit |
Above the cap, write the full content to a <persisted-output> file and return a preview pointing to it. An aggregate ContentReplacementState tracks per-conversation budgets.
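A sketch of that budgeting step; the `<persisted-output>` preview format, spool paths, and helper name are illustrative:

```python
import hashlib
import tempfile
from pathlib import Path

MAX_RESULT_CHARS = {"Bash": 30_000, "Edit": 100_000, "Grep": 100_000}

def budget_result(tool: str, output: str, spool_dir: Path) -> str:
    """Cap a tool result: persist oversized output to disk and return a
    preview that points at the persisted file."""
    cap = MAX_RESULT_CHARS.get(tool)
    if cap is None or len(output) <= cap:
        return output                      # under budget: pass through
    digest = hashlib.sha256(output.encode()).hexdigest()[:12]
    path = spool_dir / f"{tool.lower()}-{digest}.txt"
    path.write_text(output)
    preview = output[: cap // 2]           # keep the head as the preview
    return (preview + f'\n<persisted-output path="{path}" '
            f'total_chars="{len(output)}"/>')

spool = Path(tempfile.mkdtemp())
big_output = "x" * 50_000
preview = budget_result("Bash", big_output, spool)
```

The agent can always `Read` the persisted file later if it actually needs the full output.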
3.9 The Agent-Computer Interface (ACI) – the SWE-agent thesis
The single most-imitated idea in coding agents:
"Language models are a new kind of end user, and they need software interfaces designed for them, not for humans."
The empirical claim: same model, same problem, roughly 2× the SWE-Bench score with ACI tools versus raw bash. The agent isn't smarter; it's better-housed.
Four ACI design rules
- Concise, bounded output. No `cat`-the-whole-file. No unbounded `grep -rn`. Every command's output has a maximum size and a structured shape.
- Persistent state across turns. The runtime owns the cursor (`CURRENT_FILE`, `FIRST_LINE`). The agent never has to reconstruct "where am I" from history.
- Guard rails on destructive actions. Edits run through a linter; failures auto-revert. Bash is `-n`-checked before execution.
- Predictable, tiny argument grammar. Each command has 1–2 positional args max. Multi-line bodies bracketed by sentinels (`end_of_edit`).
The four flagship ACI tools
- Windowed file viewer (100 lines, 2-line overlap on scroll; `goto N` puts N about 1/6 down the window so the agent gets ~17 lines of preceding context).
- Bounded search (filenames + counts only, hard cap at 100 files, explicit start/end markers).
- Line-targeted `edit` with flake8 + auto-rollback. This is the most carefully engineered piece. On lint failure, show the bad state and the kept state side-by-side, with line numbers, both bracketed; revert automatically; tell the agent "DO NOT re-run the same failed edit command."
- Submit sentinel. A unique string (`<<SWE_AGENT_SUBMISSION>>`) means any tool can declare completion without a separate channel.
CodeAct: code as the universal action (OpenHands)
A different direction reaches the same conclusion: instead of giving the LLM 20 bespoke tools each with its own JSON schema, give it bash + Python + a file editor + a browser and let it write code. Empirically this generalizes far better and dramatically reduces parsing errors.
Trade-off: relies on the model being a strong code generator. With Claude / GPT-5, "give it a shell" is the strongest baseline. With weaker models you may need narrower, more guided tools.
Pick your stance: narrow ACI tools (SWE-agent) or broad code-as-action (OpenHands). Both work. Don't do both at once – the prompt cost compounds.
3.10 Tool registry assembly: cache-aware ordering
final = sort(builtins, alpha) ++ sort(mcpTools, alpha)
Sort within each partition, then concatenate. A flat sort across all tools would interleave MCP tools into built-in positions, busting the prompt cache whenever MCP servers change.
3.11 Deferred tool loading
Tools marked `shouldDefer: true` send only `{ name, description, defer_loading: true }` to the API. The model has to call `ToolSearch` to load full schemas. Three benefits:
- Smaller initial prompt
- Adding/removing a deferred tool changes the prompt by a few tokens, not hundreds – prompt cache stays warm
- Less tool-soup confusion for the model
Actionable rules
- Wrap every tool in a `buildTool()` factory with safe-by-default flags.
- Concurrency safety is per-invocation, with `read_only` / `concurrency_safe` / `exclusive` metadata.
- Funnel every tool call through one 14-step pipeline.
- Pick narrow ACI tools or broad code-as-action – don't blend both.
- Cap tool output sizes; persist overflow to disk; return previews.
- Sort tools within partitions before concatenating; defer tools that aren't always needed.
Part 4 – Context Engineering (the Brain's Working Memory)
The model's context window is your most expensive, most fragile resource. Treat it that way.
4.1 The dumb zone
LLMs degrade as context fills up. The relationship is non-linear: the last 20% of the window is dramatically lower-quality than the first 60%. Every token you waste in the first 60% pushes signal into the dumb zone.
Three forces fight for context:
- System prompt (identity, tools, conventions) – static-ish
- Memory (CLAUDE.md, AGENTS.md, project context) – semi-static
- Working history (turns, tool results, observations) – pure churn
Keep #1 and #2 lean and stable. Compress #3 aggressively.
4.2 The 4-layer compression pipeline
Run before every API call, in this strict order:
| Layer | What it does | Cost |
|---|---|---|
| 0. Tool result budget | Enforce per-message size caps | Trivial |
| 1. Snip compact | Physically remove old messages; emit UI boundary marker | Cheap |
| 2. Microcompact | Drop tool results by `tool_use_id` once unneeded | Cheap |
| 3. Context collapse | Replace conversation spans with summaries | Medium |
| 4. Auto-compact | Fork an entire conversation to summarize history | Heavy |
Why ordering matters: if collapse alone gets tokens below the auto-compact threshold, auto-compact never runs – so you keep fine-grained recent history. Cheap layers first.
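The ordering rule can be sketched as a cheapest-first pipeline that stops as soon as the budget is met; the layer implementations here are toy stand-ins:

```python
def size_of(messages):
    """Toy size metric: total characters across messages."""
    return sum(len(m) for m in messages)

def compress(messages, budget, layers):
    """Run layers cheapest-first; stop as soon as the budget is met, so
    heavier layers only run when cheap ones weren't enough."""
    applied = []
    for name, layer in layers:
        if size_of(messages) <= budget:
            break
        messages = layer(messages)
        applied.append(name)
    return messages, applied

# Toy stand-ins for the real layers.
layers = [
    ("tool_result_budget", lambda ms: [m[:10] for m in ms]),   # cap each message
    ("snip_compact",       lambda ms: ms[len(ms) // 2:]),      # drop oldest half
]
msgs = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]

tight, applied_tight = compress(msgs, budget=30, layers=layers)
loose, applied_loose = compress(msgs, budget=60, layers=layers)
```

With the looser budget, only the cheap layer runs and the heavy ones never fire, which is exactly the behavior the ordering buys.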
4.3 Budget thresholds
- Auto-compact triggers at `effectiveContextWindow − 13,000` tokens.
- Hard blocking limit at `effectiveContextWindow − 3,000`.
- Mid-loop compaction triggers at 75% of context window during iteration. Summarize the first ~70% of in-memory messages, keep the last ~30%.
- Post-run compaction at 50 messages OR 75% context window.
Token counting blends authoritative API usage numbers with rough estimates for messages added since the last response – biased conservative so compaction fires slightly early.
Instrument both estimated and authoritative token counts; log the delta. When the delta drifts, your estimator is broken and your safety margins are wrong.
4.4 Last-N-observations: the brutal default
A history processor that drops all but the most recent N observations from the messages array, keeping actions and thoughts in place but blanking out stdout of older steps. For a 50-turn trajectory, this is the difference between context overflow and a clean run.
SWE-agent's default is n=5. Aggressive but works; the agent's own thoughts plus recent state-command output give it enough to keep going.
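A sketch of such a history processor, assuming a simple step shape with `kind` and `stdout` fields (illustrative, not SWE-agent's actual classes):

```python
ELIDED = "<output elided to save context>"

def keep_last_n_observations(steps, n=5):
    """Blank the stdout of all but the n most recent observations,
    leaving actions and thoughts untouched."""
    obs_indices = [i for i, s in enumerate(steps) if s["kind"] == "observation"]
    to_blank = set(obs_indices[:-n] if n else obs_indices)
    return [{**s, "stdout": ELIDED} if i in to_blank else s
            for i, s in enumerate(steps)]

steps = ([{"kind": "action", "cmd": f"cmd{i}"} for i in range(3)]
         + [{"kind": "observation", "stdout": f"out{i}"} for i in range(8)])
trimmed = keep_last_n_observations(steps, n=5)
```

Because the processor rewrites the rendered history rather than the stored trace, the full observations stay available on disk for replay.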
4.5 The condenser pattern (OpenHands)
def maybe_condense(events, summarizer, max_size=80, keep_first=4):
    if len(events) <= max_size:
        return events
    head = events[:keep_first]              # system prompt + task
    tail = events[-(max_size // 2):]        # recent work
    middle = events[keep_first:-(max_size // 2)]
    summary = summarizer.complete(
        "Summarize concisely, preserving decisions, findings, current state:\n"
        + dump(middle))
    return head + [SummaryEvent(text=summary)] + tail
Two trigger paths:
- Proactive: check size on each step.
- Reactive: when the LLM raises `LLMContextWindowExceedError`, emit a `CondensationRequest` event and try again next step.
Don't summarize on every step – only when over a threshold. Cache aggressively. The cheapest thing you can do is just truncate with a small head + recent tail; LLM summarization is the upgrade.
4.6 Progressive disclosure: skills and tools
A separate compression strategy: don't load it until you need it.
Two-phase skill loading (the killer pattern):
- Phase 1 (startup): parse YAML frontmatter only – `name`, `description`, `when_to_use`. Inject as a directory.
- Phase 2 (invocation): load full markdown body, substitute `$ARGUMENTS`, execute inline shell commands, prepend as a user message.
You pay the token cost only when the skill actually runs. 50+ skills cost 50 lines, not 50 docs.
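A sketch of the two phases, with a minimal hand-rolled frontmatter parser so the example needs no YAML dependency (the skill content itself is invented):

```python
def parse_frontmatter(text):
    """Phase 1: read only the YAML frontmatter of a skill file.
    Minimal key: value parser, enough for flat frontmatter."""
    lines = text.splitlines()
    assert lines[0] == "---", "expected frontmatter"
    meta, i = {}, 1
    while lines[i] != "---":
        key, _, value = lines[i].partition(":")
        meta[key.strip()] = value.strip()
        i += 1
    return meta, i + 1          # metadata, index where the body starts

SKILL = """---
name: release-notes
description: Draft release notes from merged PRs
when_to_use: user asks for a changelog or release summary
---
Full instructions live here and are loaded only on invocation.
$ARGUMENTS
"""

meta, body_start = parse_frontmatter(SKILL)
directory_line = f"{meta['name']}: {meta['description']}"   # phase 1 injection
body = "\n".join(SKILL.splitlines()[body_start:])           # phase 2, on demand
invocation = body.replace("$ARGUMENTS", "v2.1")
```

The directory line is all the prompt carries at startup; the body (and its token cost) appears only in the turn that invokes the skill.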
4.7 Sub-agents as context firewalls
The single biggest qualitative win in context engineering. Instead of letting research/grep output pollute the main conversation, delegate to a sub-agent with isolated context. The sub-agent returns a condensed answer with citations (e.g. auth.py:142-158).
Use sub-agents for:
- Codebase exploration
- Research and grep operations
- Long Q&A with simple final answers
- Anything that produces 200+ lines of intermediate output you don't need to keep
The parent never sees the 200 lines of grep output – only the sub-agent's 3-line summary.
4.8 The system prompt is 19+ composable sections
Not one giant blob โ a list of small sections assembled at request time. Order matters. Build it like this (Claude Code / GoClaw):
- Identity (channel-aware)
- First-run bootstrap notice
- Persona (SOUL.md, IDENTITY.md) – early "primacy zone"
- Tooling (filtered + sandbox-aware)
- Credentialed CLI context
- Safety preamble + identity anchoring
- Self-evolution rules (if applicable)
- Skills inline (≤ 15) OR via `skill_search`
- MCP tools inline OR via `mcp_tool_search`
- Workspace info
- Team workspace (if team agent)
- Sandbox container info
- User identity
- Time (UTC)
- Channel formatting hints
- Extra context (`<extra_context>` tags)
- Project/bootstrap context files
- Sub-agent spawning rules
- Runtime info (agent ID, model, pricing)
- Persona reminder – late "recency zone" – fights "lost in the middle"
- Memory reminders ("run `memory_search` first")
Two reinforcement zones (primacy + recency) are the cheapest reliability win in agent prompting. Put critical rules at both ends.
4.9 The "one big AGENTS.md" anti-pattern
A monolithic AGENTS.md crowds out task and code context. Too much guidance becomes non-guidance. It rots without maintenance.
Better: Treat AGENTS.md / CLAUDE.md as a table of contents (~60–100 lines max) pointing to deeper docs/ sources. Include build commands, test commands, key conventions, structure pointers. Exclude directory listings, conditional rules, auto-generated content.
Actionable rules
- Run a 4-layer compression pipeline before every LLM call.
- Cap last-N observations (default N=5) – keep actions/thoughts, blank stdout.
- Sub-agents are context firewalls, not just parallelism.
- Skills load in two phases – frontmatter at startup, body on invocation.
- The system prompt is a list of small sections assembled at request time.
- AGENTS.md is a table of contents under 100 lines, not an encyclopedia.
- Instrument estimated vs authoritative token counts and alert on drift.
Part 5 – Memory (Long-Term Knowledge)
Memory is where most agents either over-engineer (vector DB on day one) or under-engineer (everything fits in context, surely?).
5.1 Why files, not a database (Claude Code / Nanobot)
- Transparency – users open `.md` files and see exactly what the agent remembers. Trust through observability, not capability.
- Modification time is a built-in epistemological signal: "when was this observation recorded?"
- Git is free – diffable, recoverable, human-editable, auditable.
- Zero infrastructure – no schema migrations, no indexes, no backups.
Add embeddings + a vector store later, when you have measured the failure modes that demand them. Most agents don't need them.
5.2 The file layout
~/.claude/projects/<sanitized-git-root>/memory/
    MEMORY.md                    # always loaded; index only; ≤200 lines
    user_role.md                 # one memory per file
    feedback_testing.md
    project_migration_q2.md
    team/                        # shared via symlink
    logs/YYYY/MM/YYYY-MM-DD.md   # append-only daily logs
5.3 The four-type taxonomy
| Type | Purpose |
|---|---|
| user | Role, expertise, preferences |
| feedback | Corrections + validated approaches (rule + Why: + How to apply: lines) |
| project | Active work context with absolute dates (always convert "Thursday" → 2026-03-05) |
| reference | Pointers to external systems (Linear, Slack channels) |
Derivability test: if git log / git blame / the code itself can answer it, don't memorize it. No code patterns, no architecture, no debug fix recipes.
5.4 Frontmatter contract
---
name: <title>
description: <one-line summary used by recall LLM>
type: user | feedback | project | reference
---
<body – for feedback/project, structure as: rule → **Why:** → **How to apply:**>
The description field carries the most weight โ it's the LLM-recall index.
5.5 Two-tier retrieval
- Tier 1 (always loaded): `MEMORY.md` index (~3,000 tokens for ~150 entries). Lines after 200 are truncated.
- Tier 2 (on-demand): an async side-query LLM gets the manifest (type, name, date, description), the user's current query, and recent tool history. Returns up to 5 filenames. Validated against the file list to catch hallucination.
This trades a few hundred ms of latency for semantic precision keyword-matching cannot achieve – especially for negation ("do NOT use mocks").
5.6 The 3-tier memory model (GoClaw, when you grow up)
Once a file system isn't enough, GoClaw's tiering is the right next step:
L0 – Working Memory         L1 – Episodic             L2 – Semantic
+--------------------+      +-------------------+     +------------------+
| Current session    |      | Session summaries |     | Knowledge Graph  |
| messages           | ---> | + L0 abstracts    | --> | entities +       |
| Threshold-based    |      | (~50 tokens ea.)  |     | relations        |
| compaction         |      | + embeddings      |     | + temporal       |
|                    |      | 90-day retention  |     | validity         |
+--------------------+      +-------------------+     +------------------+
        ^                          ^                        ^
   auto-inject               memory_search             memory_expand
(free, no tool call)            (top-K)                 (full doc)
Hybrid search formula:
combined_score = vector_score * 0.7 + fts_score * 0.3
With BM25 / tsvector for FTS and pgvector for embeddings. Per-user 1.2× boost. Dedup: per-user wins over global.
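The formula and boost compose like this (scores are invented; the 1.2× boost is applied to the combined score here, which is one reasonable reading of the rule):

```python
def combined_score(vector_score, fts_score, same_user=False):
    """Hybrid rank: 0.7 x vector + 0.3 x full-text, then a 1.2x boost
    for the querying user's own memories (boost placement is assumed)."""
    score = vector_score * 0.7 + fts_score * 0.3
    return score * 1.2 if same_user else score

hits = [
    {"id": "global-note", "vec": 0.80, "fts": 0.90, "mine": False},
    {"id": "my-note",     "vec": 0.78, "fts": 0.70, "mine": True},
]
ranked = sorted(hits, reverse=True,
                key=lambda h: combined_score(h["vec"], h["fts"], h["mine"]))
```

Note the boost is enough to flip the ranking even when the global note scores higher on both raw signals.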
5.7 Event-driven consolidation
run.completed
    |
    v
EpisodicWorker – extract summary + abstract via LLM
    |
    v  episodic.created
SemanticWorker – extract entities/relations into KG
    |
    v  entity.upserted
DedupWorker – embedding-similarity merge

(separately, debounced 10m)
DreamingWorker – batch unpromoted summaries scored by:
    0.30 * frequency + 0.35 * relevance +
    0.20 * recency   + 0.15 * freshness
    -> LLM synthesis -> write to long-term memory
The dream phase (Nanobot / GoClaw) is the agent literally writing notes to itself, then `git commit`-ing them. Auditable, diffable, recoverable.
5.8 Staleness policy: annotate, don't expire
Don't expire memories. Annotate with age. Today/yesterday: no caveat. Older: a human-readable warning ("This memory is 47 days old – code claims may be outdated"). Models reason better about "47 days ago" than ISO timestamps.
Per-line age suffixes (≈ 30d) computed from git blame let the LLM naturally deprioritize stale entries.
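A sketch of the annotation policy; the 30-day threshold for the stronger "code claims" warning is my assumption, not a documented default:

```python
from datetime import date, timedelta

def staleness_caveat(recorded: date, today: date) -> str:
    """Annotate, don't expire: render age as human-readable days."""
    age = (today - recorded).days
    if age <= 1:
        return ""                    # today/yesterday: no caveat
    caveat = f"This memory is {age} days old"
    if age >= 30:                    # 30-day threshold is illustrative
        caveat += " -- code claims may be outdated"
    return caveat

today = date(2026, 3, 5)
```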
Actionable rules
- Start with files + git. Add a vector store only when measured failures demand it.
- Always-loaded `MEMORY.md` index → on-demand LLM recall side-query for tier 2.
- Four memory types: user, feedback, project, reference. Derivability test rules out the rest.
- Hybrid search: 0.7×vector + 0.3×FTS. Per-user boost. Dedup.
- Consolidation is event-driven (workers subscribed to `run.completed`).
- Annotate stale entries; don't expire them.
Part 6 – Concurrency & Multi-Agent Patterns
6.1 Per-session serial, cross-session concurrent (Nanobot)
The simplest correct concurrency model for multi-tenant chat agents:
- Each session gets a lock.
- Within a session: strictly serial (no race on history).
- Across sessions: concurrent (one user's slow tool call doesn't block another's).
No threads, no actors, no Redis.
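The whole model fits in a few lines of asyncio (session names and timings are illustrative):

```python
import asyncio
from collections import defaultdict

session_locks = defaultdict(asyncio.Lock)   # one lock per session
order = []

async def handle(session_id, msg, work_s):
    # Serial within a session, concurrent across sessions.
    async with session_locks[session_id]:
        await asyncio.sleep(work_s)         # simulated slow tool call
        order.append((session_id, msg))

async def main():
    await asyncio.gather(
        handle("alice", 1, 0.05),   # alice's slow turn...
        handle("alice", 2, 0.0),    # ...blocks only alice's next message
        handle("bob",   1, 0.0),    # bob proceeds immediately
    )

asyncio.run(main())
```

Bob's message completes while Alice's slow turn is still holding only her own lock, and Alice's two messages still land in order.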
6.2 The pending queue: mid-turn message injection
Users send follow-ups while the agent is still working. Don't ignore them; queue them.
# Top-level dispatcher
async def run(self):
    while running:
        msg = await bus.consume_inbound(timeout=1.0)
        key = effective_session_key(msg)
        if key in pending_queues:
            # Session is mid-turn: enqueue, don't start a 2nd task.
            pending_queues[key].put_nowait(msg)
            continue
        pending_queues[key] = asyncio.Queue()  # mark the session busy
        task = asyncio.create_task(self._dispatch(msg))
When the current turn ends, drain the queue back into inbound for the next turn. Per-session lock + pending queue is the entire multi-turn concurrency model. You will be tempted to use a state machine. Don't.
6.3 Steering: mid-loop user correction (PicoClaw)
"The user can correct the agent at any moment. Make that a first-class concern."
Per-session FIFO queue polled at four checkpoints:
- Loop initialization
- After each tool completes
- After each non-tool LLM response
- Before turn finalization
If a queued message exists at any of those points:
- Remaining tool calls in the current LLM response are skipped, each receiving the synthetic result `"Skipped due to queued user message."` so the model still understands what did/didn't run.
- The queued message is appended as a new user turn.
- The loop re-enters the LLM stage.
Why this matters:
- Side-effect safety: a user yelling "don't send that email" actually stops the email.
- Compute savings: a planned batch of three 4s tool calls = 12s avoided.
- Model awareness: skipping is announced via tool-result so the model adapts.
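The checkpoint poll can be sketched as a plain per-session queue drain. A minimal sketch, assuming history is a list of role-tagged dicts; `run_tool_round`, `drain_steering`, and the message shape are illustrative names, not PicoClaw's actual API:

```python
import queue

SKIP_NOTE = "Skipped due to queued user message."

def drain_steering(q):
    """Non-blocking poll of the per-session FIFO."""
    try:
        return q.get_nowait()
    except queue.Empty:
        return None

def run_tool_round(tool_calls, steering_q, history, execute):
    """Execute tool calls, checking the steering queue before each one.

    On a queued message: this call and all remaining calls get a synthetic
    'skipped' result so the model knows what did and didn't run, then the
    message is appended as a new user turn.
    """
    for i, call in enumerate(tool_calls):
        msg = drain_steering(steering_q)
        if msg is not None:
            for skipped in tool_calls[i:]:
                history.append({"role": "tool", "name": skipped, "content": SKIP_NOTE})
            history.append({"role": "user", "content": msg})
            return True  # re-enter the LLM stage
        history.append({"role": "tool", "name": call, "content": execute(call)})
    return False

# Usage: a correction queued mid-turn stops the batch before it runs.
q = queue.Queue()
hist = []
q.put("don't send that email")
steered = run_tool_round(["draft", "send_email"], q, hist, execute=lambda c: f"{c}: ok")
```

The synthetic results keep the transcript well-formed: every planned tool call gets some result, so the next LLM turn sees exactly what ran.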
6.4 Sub-agents: the recursive primitive
Single-agent capability has a hard ceiling. Spawn child agents that are the same loop with isolated state.
| Aspect | Sync sub-agent | Async sub-agent |
|---|---|---|
| Parent waits | yes | no |
| Permission mode default | bubble | dontAsk |
| Abort controller | shares parent's | independent |
| App state | shared | isolated |
| File state | own cache | own cache |
Hard limits to enforce:
| Limit | Default |
|---|---|
| Max concurrent (system-wide) | 8 |
| Max spawn depth | 1 |
| Max children per parent | 5 |
| Auto-archive after | 60 min |
| Max iterations per subagent | 20 |
Recursive fork prevention (defense in depth):
- Primary: tag the child's context with a `querySource` flag. AgentTool checks this before allowing a fork.
- Fallback: scan message history for the boilerplate XML tag if the flag was lost in transit.
6.5 Fork agents: cache-driven subprocess design
The point of a fork is byte-identical request prefix to the parent, so children pay ~10% input-token cost (90%+ savings on prompt cache).
Three mechanisms make this work:
- System prompt threading – pass the parent's already-rendered bytes via `override.systemPrompt`. Don't regenerate; feature flags or the session date may have changed.
- Exact tool passthrough – `useExactTools: true`. No filtering, no reordering, no re-serialization.
- Placeholder tool results – clone the parent's last assistant message. For each `tool_use`, insert a constant placeholder string ("Fork started -- processing in background"). Same string for every child → same bytes.
Result: [...shared_history, assistant(all_tool_uses), user(placeholders..., directive)]. Only the final directive differs. With a 48,500-token shared prefix and 5 children, savings exceed 90% on input tokens for children 2–5.
6.6 Three coordination patterns
Pick the right shape for the work:
| Scenario | Pattern |
|---|---|
| Single bg task | Delegation – fire-and-forget |
| Multi-file refactor with research phase | Coordinator – manager-worker |
| Long-running collaborative dev | Swarm – peer-to-peer |
Coordinator mode rules
- Three tools only: `Agent` (spawn), `SendMessage` (talk), `TaskStop` (kill). By design.
- "The coordinator's job is to think, plan, decompose, and synthesize. Workers do the work."
- Critical principle: never delegate understanding. Coordinators must give workers exact file paths, exact line numbers, exact change descriptions – not "based on the research, fix the bug."
Workflow phases: research → synthesis (coordinator!) → implementation → verification.
Swarm rules (when peers collaborate)
- File-based mailboxes for inter-peer messages.
- Messages delivered between tool rounds, never mid-execution. No race conditions.
- Three interruption levels: abort current work / shutdown request / kill.
- Hard cap on teammate state. A real production incident reached 36.8 GB across 292 agents. Cap message history; budget unbounded fan-out from day one.
6.7 Teams: SQL-claimed task boards (GoClaw)
When agents collaborate on a structured workflow, give them a board:
```sql
-- Atomic, race-safe claim – no distributed lock needed
UPDATE team_tasks
SET status = 'in_progress', owner_agent_id = $1
WHERE id = $2 AND status = 'pending' AND owner_agent_id IS NULL;
-- 1 row updated = claimed; 0 rows = someone else got it
```
Task states: pending / in_progress / in_review / completed / failed / cancelled / blocked / stale. Dependencies via blocked_by UUID[] array; completing a task auto-unblocks dependents.
6.8 SubTurn: hierarchical sub-agents (PicoClaw)
| Property | Value |
|---|---|
| Max nesting depth | 3 |
| Max concurrent per parent | 5 (semaphore-guarded) |
| Default timeout | 5 min (parent + child have independent timeouts) |
| Message buffer | 50 per sub-turn (does not contaminate parent history) |
| Result delivery | async via pendingResults channel (16-message buffer) |
| Critical: true | survives parent completion → runs in background |
Why the child's context derives from `context.Background()`, not the parent's ctx: a child with an independent timeout shouldn't be cancelled out from under it when the parent finishes early. Cascading cancellation is opt-in.
Actionable rules
- Per-session lock + pending queue is the entire multi-turn concurrency model.
- Steering: poll a per-session FIFO at 4 checkpoints; skip remaining tools with explicit "Skipped" results.
- Sub-agent limits: depth 1, ≤5 children, ≤8 concurrent. Hard-enforce.
- Forks share bytes with the parent for 90%+ prompt cache savings.
- Coordinator: 3 tools only. Workers get exact paths, never "based on research."
- Cap teammate state. Bound every fan-out queue.
- Atomic SQL claim beats distributed locks.
Part 7 – Reliability: Error Recovery, Stuck Detection, Autosubmit
7.1 The error recovery ladder (not a fallback)
Order matters. From least to most aggressive:
| Trigger | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| prompt_too_long (413) | drain staged collapse summaries | reactive compact | surface to user |
| max_output_tokens | escalate cap 8K → 64K | multi-turn recovery (≤3 attempts) | surface |
| media_size_error | reactive compact | – | surface |
| format_error | re-prompt with format hint | (×3) | surface |
| bash_syntax_error | re-prompt with `bash -n` output | (counted) | surface |
| command_timeout | re-prompt with timeout note; increment counter | (after 5 in a row) raise | autosubmit |
Each layer catches a different failure mode. Together they make the agent feel "self-correcting" even though the LLM is just being asked to try again with more context.
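The ladder can be modeled as an ordered list of rungs per error kind. A minimal sketch; the step names and the `attempt_step` callback are hypothetical stand-ins for the real recovery routines:

```python
# Error kind → ordered recovery rungs, mirroring the table above.
# Step names are illustrative stand-ins for the real routines.
LADDER = {
    "prompt_too_long": ["drain_collapse_summaries", "reactive_compact", "surface"],
    "max_output_tokens": ["escalate_output_cap", "multi_turn_recovery", "surface"],
    "format_error": ["reprompt_with_format_hint"] * 3 + ["surface"],
}

def recover(kind, attempt_step):
    """Try each rung in order; 'surface' is the terminal rung."""
    for step in LADDER.get(kind, ["surface"]):
        if step == "surface":
            return "surface"
        if attempt_step(step):  # rung fixed the problem
            return step
    return "surface"

# A format error that resolves on the second re-prompt:
calls = []
def fake_attempt(step):
    calls.append(step)
    return len(calls) == 2

result = recover("format_error", fake_attempt)
```

Because the ladder is data, each rung (and the order) is unit-testable in isolation.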
7.2 The autosubmit pattern (SWE-agent's killer move)
Every error path ends in autosubmit, not crash. A 30-step trajectory that hits a cost limit at step 31 has probably made some real progress. A git diff of the work-in-progress is a partial patch that may pass some tests. Throwing it away for an exception is a bug; autosubmitting is the feature.
```python
def handle_error_with_autosubmission(self, exit_status, message):
    try:
        # Run submit one more time, capture whatever git diff exists
        observation = self._env.communicate("submit", timeout=10)
        submission = self.extract_submission(observation)
    except Exception:
        submission = ""  # even submit failed, ship empty
    return StepOutput(
        done=True,
        exit_status=exit_status,  # "exit_cost", "exit_context", etc.
        submission=submission,
        observation=message,
    )
```
Failure modes turn into degraded successes. Cost overrun, context overflow, total timeout, consecutive errors – all autosubmit, none crash.
7.3 Three guardrails that compound (SWE-agent)
Each layer catches a different failure mode. Together they make the agent feel self-correcting:
- `bash -n` syntax check before execution. Malformed commands never run; the agent sees the error and retries.
- flake8 + auto-revert on edits. Broken code is never persisted. Side-by-side before/after diff in the error message.
- Format-error requery. Up to 3 re-prompts when parse / blocklist / bash-syntax fails.
7.4 Stuck detection (OpenHands)
Without this, agents burn money in loops. Run a StuckDetector on the event log every step. It flags five patterns:
| Pattern | Threshold |
|---|---|
| Repeating action โ observation pairs | 4+ identical |
| Repeating action โ error pairs | 3+ identical |
| Agent monologue (no tool calls, no progress) | 3+ consecutive |
| Alternating actionโobservation ping-pong | 6+ cycles |
| Repeated context-window errors | (any) |
Comparison is semantic, not object identity: actions matched by tool name + content (timestamps and metrics ignored). When stuck, the agent transitions to ERROR or emits a LoopRecoveryAction.
This 100-LOC detector saves more money than any other optimization.
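A minimal sketch of the first pattern (identical action → observation pairs), assuming events are already normalized to (action, observation) content tuples; OpenHands' real detector covers all five patterns:

```python
def is_stuck(events, repeat_threshold=4):
    """Flag 'repeating action → observation pairs': the last N pairs are
    identical by content (real detectors strip timestamps/metrics first)."""
    if len(events) < repeat_threshold:
        return False
    return len(set(events[-repeat_threshold:])) == 1

# Four identical (action, observation) pairs in a row → stuck.
log = [("run_tests", "2 failed")] * 4
```

Comparing normalized tuples rather than raw event objects is what makes the check semantic instead of identity-based.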
7.5 The SWE-agent caveat: don't add semantic guardrails unless 100% precision
SWE-agent has no semantic stuck detection. The team tried semantic guardrails and abandoned them: the false-positive rate was too high.
Their stance: don't add a guardrail unless its false-positive rate is low. SWE-agent's existing guardrails (cost, syntax check, lint+revert) are all 100%-precision: they only fire when something is definitely wrong.
Reconciling with OpenHands: stuck detection works when the false-positive cost is low (you abort/notify, not crash). Tune the thresholds for your tolerance.
7.6 Verification is the difference between hallucinated "done" and real done
The single most important prompting lesson from OpenHands' CodeActAgent:
Make your agent re-run the test suite as the last action before `finish`.
The four-phase methodology baked into the system prompt:
- Exploration – read the repo, find relevant files, understand the surface area before doing anything.
- Analysis – form a hypothesis about what to change and why. The `ThinkTool` exists for this slot.
- Implementation – make the smallest change that addresses the analysis.
- Verification – re-run the tests, lints, build. Only call `finish` when verification passes.
Without verification, the agent declares victory on broken code. With it, "ran for 30 minutes and didn't break anything" becomes a real story.
7.7 Confirmation policy + risk analyzer (OpenHands)
Two layers:
- Risk analyzer. Every Action gets a `SecurityRisk ∈ {LOW, MEDIUM, HIGH, UNKNOWN}` score. The default `LLMSecurityAnalyzer` adds a `security_risk` field to every tool's JSON schema, so the LLM scores its own action inline – no extra call.
- Confirmation policy. `AlwaysConfirm`, `NeverConfirm`, or `ConfirmRisky` (threshold=HIGH). With `ConfirmRisky`, low-risk actions auto-execute; risky ones pause until approved.
Headless mode hard-disables confirmation (it is always `NeverConfirm`). That means headless mode's blast radius is whatever the workspace allows – which is exactly why headless mode wants Docker.
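The two layers compose into a few lines. A sketch only: the enum ordering and the fail-closed treatment of UNKNOWN are assumptions, not OpenHands' exact semantics:

```python
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    UNKNOWN = 3  # unscored: treat as risky (fail closed)

def needs_confirmation(policy, risk, threshold=Risk.HIGH):
    """Gate an action on the session's confirmation policy."""
    if policy == "AlwaysConfirm":
        return True
    if policy == "NeverConfirm":  # headless mode hard-pins this
        return False
    return risk >= threshold      # ConfirmRisky
```

Ordering UNKNOWN above HIGH means an unscored action pauses for approval rather than slipping through.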
7.8 Cost ceilings: the day-one minimum
| Ceiling | Default |
|---|---|
| MAX_ITERATIONS | ~100 |
| LLM_NUM_RETRIES | 8 |
| Hard accumulated-cost cutoff | per-task budget |
| Per-tool timeout | 60s |
| Total wallclock | per-task budget |
Don't ship a headless agent without all of these.
7.9 Session checkpoint on every iteration
Cheap to write, makes /stop and crashes feel free instead of catastrophic. Before each turn the loop saves a runtime checkpoint of intermediate tool messages. On /stop or crash the next turn restores them so a half-finished tool sequence isn't lost.
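A checkpoint can be as simple as an atomic JSON write per turn. A minimal sketch; the file layout and message shape are assumptions:

```python
import json
import pathlib
import tempfile

def save_checkpoint(path, pending_tool_msgs):
    """Atomic write before each turn: a crash leaves the old file intact."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(pending_tool_msgs))
    tmp.replace(path)  # atomic rename on POSIX

def restore_checkpoint(path):
    """On /stop or crash, the next turn picks up the half-finished sequence."""
    if not path.exists():
        return []
    return json.loads(path.read_text())

# Usage: save mid-turn tool results, restore after a simulated crash.
ckpt = pathlib.Path(tempfile.mkdtemp()) / "session.ckpt"
save_checkpoint(ckpt, [{"tool": "bash", "result": "partial"}])
```

The write-to-temp-then-rename dance is what makes the checkpoint crash-safe: a reader never sees a half-written file.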
Actionable rules
- Error recovery is a ladder, not a fallback. Each layer is testable.
- Autosubmit, never crash. Every error path produces a degraded success.
- Three compounding guardrails: bash -n, lint+revert, format requery.
- Add stuck detection โ but tune for low-false-positive cost.
- Verification is the last action before "done." No exceptions.
- Day-one ceilings: max iterations, max retries, max cost, max wallclock.
- Checkpoint every iteration. Restore on resume.
Part 8 – Security: Defense-in-Depth
Each layer is independent – even if one is bypassed, the others still protect.
Layer 1 – Transport
- CORS allow-list validation
- WebSocket message size limit: 512 KB
- HTTP body limit: `MaxBytesReader` 1 MB
- Timing-safe token comparison (`crypto/subtle.ConstantTimeCompare`)
- Ping/pong every 30s; read deadline 60s; write deadline 10s
Layer 2 – Input Validation (InputGuard)
Six regex patterns scan every user message. Catches:
| Pattern | Catches |
|---|---|
| ignore_instructions | "Ignore all previous instructions" |
| role_override | "You are now a different assistant" |
| system_tags | `<\|`-style system tags |
| instruction_injection | "New instructions:", "override:" |
| null_bytes | \x00 |
| delimiter_escape | `</instructions>`, "end of system" |
Four action modes: off / log / warn (default) / block.
Layer 3 – Tool Execution
- Shell deny groups – 15 classes denied by default: `destructive_ops`, `data_exfiltration`, `reverse_shell`, `code_injection`, `privilege_escalation`, `dangerous_paths`, `env_injection`, `container_escape`, `crypto_mining`, `filter_bypass`, `network_recon`, `package_install`, `persistence`, `process_control`, `env_dump`. Live-reloadable via pub/sub.
- Path traversal prevention – `resolvePath()` runs `filepath.Clean()` and verifies the result starts with the workspace prefix on every filesystem op.
- SSRF guards – block `127.0.0.1` / `localhost` / RFC1918 ranges for provider base URLs and `web_fetch`.
- Credentialed CLI gate – when calling registered binaries (`gh`, `gcloud`, `aws`, `kubectl`, `terraform`), inject encrypted env vars directly into the child process (no shell), unwrap `sh -c` wrappers up to depth 3 to prevent bypass, and fail closed on DB error.
- Domain allow/block – `web_fetch` honors per-tenant allow/block lists.
For Bash, parse the command via a real bash AST parser, split on `&& || ; |`, and classify each subcommand. If the parser fails, fail closed: assume any command it can't parse is unsafe.
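A toy version of that fail-closed shape. Note this uses regex splitting, which misses quoting and subshells – production should use a real bash AST parser as described – and the deny set is a stand-in for the deny groups above:

```python
import re

DENIED = {"rm", "curl", "nc", "mkfs"}  # stand-in for the deny groups above

def classify_command(cmd):
    """Split on && || ; | and classify each subcommand.
    Anything empty or unparseable is 'unsafe' – fail closed."""
    try:
        parts = [p.strip() for p in re.split(r"&&|\|\||;|\|", cmd) if p.strip()]
        if not parts:
            return "unsafe"
        for part in parts:
            if part.split()[0] in DENIED:
                return "denied"
        return "allowed"
    except Exception:
        return "unsafe"
```

The important property is the default: a classifier that returns "allowed" on parse failure is a bypass waiting to happen.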
Layer 4 – Output Sanitization
- Credential scrubber – static regex patterns (OpenAI `sk-`, Anthropic `sk-ant-`, GitHub `ghp_`, AWS `AKIA`, generic 64-char hex) + dynamic registry of runtime values. Replaces with `[REDACTED]`. Always-on.
- Output sanitizer – 7 steps applied to LLM output before delivery:
  1. Strip garbled tool XML (`<tool_call>`, etc. from broken models)
  2. Strip downgraded text-format tool calls (`[Tool Call: ...]`)
  3. Strip thinking tags (`<think>`, `<thinking>`, `<antThinking>`)
  4. Strip final wrapper tags (preserve inner content)
  5. Strip echoed `[System Message]` blocks
  6. Collapse consecutive duplicate paragraphs (model stuttering)
  7. Strip leading blank lines
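The static + dynamic split fits in a few lines. The key patterns below are illustrative shapes, not the production regexes:

```python
import re

# Illustrative key shapes – not the production regexes.
STATIC_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_\-]+"),  # Anthropic-style
    re.compile(r"ghp_[A-Za-z0-9]{36}"),     # GitHub-style
    re.compile(r"AKIA[A-Z0-9]{16}"),        # AWS-style
]
dynamic_registry = set()  # runtime values registered as they're created

def scrub(text):
    """Replace known secret shapes and registered runtime values."""
    for pat in STATIC_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    for secret in dynamic_registry:
        text = text.replace(secret, "[REDACTED]")
    return text

# Any value the harness mints at runtime goes into the dynamic registry.
dynamic_registry.add("s3cr3t-runtime-token")
```

The dynamic registry is what catches secrets no static pattern can know about: session tokens, freshly decrypted provider keys, pairing codes.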
Layer 5 – Isolation
- Per-user workspace – `base + "/" + sanitize(userID)`, injected via `WithToolWorkspace(ctx)`.
- Docker / bwrap sandbox – read-only root, dropped capabilities, scoped per-session.
- Subagent depth limit โ max depth 1, max children 5/parent, max concurrent 8 system-wide.
- MCP servers spawned in isolated processes. Treat MCP servers as untrusted child processes. One buggy server must not crash the agent.
Hard security boundaries to set early
| Boundary | Why |
|---|---|
| MCP skills NEVER execute inline shell commands. External MCP servers are content-only. Every other extension surface (user skills, project skills) can run shell; MCP cannot. | The single most important MCP rule and the one you will be tempted to break |
| `allowFrom` empty = deny all. | Many "personal" agents accidentally ship open |
| API keys split into `.security.yml`. | Different file permissions; easier to scrub from bug reports |
| AES-256-GCM for at-rest secrets, with `aes-gcm:` prefix + nonce + ciphertext + tag, base64'd. | Database dumps leak; insider access widens blast radius |
| API keys: 16 random bytes, SHA-256 hash, constant-time compare. | Plaintext or non-constant-time = trivially broken |
| Headless = always-approve = blast radius is whatever the workspace allows. | Always use Docker in headless. Don't mount more than the working directory |
The secret registry (OpenHands)
A separate vault that:
- Stores secrets per-session, late-bound (resolved only at exec time).
- Masks them in stdout/stderr (`<secret-hidden>`).
- Encrypts at rest, supports rotation, supports callable resolvers.
- The shell tool scans commands for known secret keys, exports them as env vars, and replaces matches in output.
Actionable rules
- Five layers, all independent. Don't pick one and stop.
- 15 shell deny groups by default. Live-reloadable.
- Every filesystem op runs through `resolvePath()`. Path traversal dies.
- Output sanitizer is a 7-step pipeline. Always-on.
- Credential scrubber: static + dynamic. Mask in real-time.
- MCP servers run in isolated processes. MCP skills never execute shell.
- `.security.yml`: separate file, separate perms. AES-256-GCM at rest. SHA-256 + constant-time for auth.
Part 9 – Multi-Tenancy from Day One
This is the single most consequential design decision – and the one most projects skip until it's painful.
Three rules, never broken
- Every isolatable table has `tenant_id NOT NULL`. 40+ tables in GoClaw enforce this.
- Every query includes `WHERE tenant_id = $N`. No exceptions. Fail closed.
- Tenant flows through `context.Context`. Resolved at the gateway, propagated everywhere, never taken from client headers (which can be spoofed).
Tenant resolution at the gateway
| Credential | How tenant is resolved |
|---|---|
| Tenant-bound API key | Auto from api_keys.tenant_id (the recommended path) |
| System-level API key + X-Tenant-Id header | From header (UUID or slug); only system keys can do this |
| Channel webhook (Telegram, Discord, …) | Baked into channel_instances.tenant_id at registration |
| No credentials | Master tenant only (dev mode) |
Per-tenant overrides
Each tenant gets its own:
- LLM provider configs and API keys
- Tool settings (web_search providers, TTS voice, etc.)
- Skills enabled/disabled
- MCP servers + per-user credentials
- Channel instances
Storage hardening
- API keys: SHA-256 at rest, constant-time compare for validation.
- Provider/MCP/custom-tool secrets: AES-256-GCM with `aes-gcm:` prefix + 12-byte nonce + ciphertext + tag, base64'd.
- Master scope guard: writes to global tables (`builtin_tools`, `config.*`) require `IsMasterScope(ctx)`; otherwise tenant admin only.
Identity propagation
You don't authenticate end-users. The upstream service (your SaaS backend, your auth proxy) provides user_id, opaque, max 255 chars. Convention for multi-tenant: tenant.{tenantId}.user.{userId}.
Actionable rule: Retrofitting multi-tenancy is one of the most painful migrations in software. Make `tenant_id` a column on day one, even if you only have one tenant. The migration cost from "single-tenant + tenant_id column" to "multi-tenant" is hours. The migration cost from "single-tenant" to "multi-tenant" is months.
Part 10 – Performance & Efficiency
10.1 Prompt caching is architecture, not optimization
Every design decision either preserves cache hits or busts them. Anthropic's prompt cache gives ~90% input-token discount on cached prefixes. With long conversations, this dominates economics.
Cache scopes:
| Scope | TTL | What |
|---|---|---|
| Global | Long | Static prompt prefix shared across users |
| 1-hour | 60 min | Eligible users' extended cache |
| Ephemeral (default) | ~5 min | Per-session |
The dynamic boundary – a literal marker in your system prompt:
- Above (cacheScope: global): identity, system rules, task guidance, tool usage, tone.
- Below (per-session): session guidance, project memory, env info, language, MCP instructions, output style.
Rule: Every runtime `if` above the boundary doubles the cache key space. 3 conditionals = 8 prefixes; 5 = 32. Compile-time feature flags are fine; runtime checks must live below the boundary.
Global scope is disabled when MCP tools are present – user-specific tool definitions would fragment the global cache into millions of unique prefixes.
10.2 Sticky latches
Five session-scoped boolean flags (null | true) that, once set, can't be unset for the rest of the session. They control beta/feature headers. Reason: mid-session toggles change the server-side cache key – flipping a flag would bust 50–70K tokens of cached context.
```typescript
type Latch = boolean | null // null = "not yet evaluated"

function shouldSendBetaHeader(active: boolean): boolean {
  const latched = getAfkLatch()
  if (latched === true) return true
  if (active) { setAfkLatch(true); return true }
  return false
}
```
Use Once(value) semantics for any cache-influencing config.
10.3 Output token slot reservation
Production p99 output ≈ 4,911 tokens. Default SDK reservation = 32K–64K. Over-reservation = 8–16×.
Strategy: cap default max_tokens at 8K. On the rare truncation (<1% of requests), retry with 64K. Recovers 12–28% of the context window for free.
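The escalation strategy in one function, assuming the provider reports truncation via a `finish_reason == "max_tokens"` field (field names vary by provider); `call_llm` is a stand-in for your provider wrapper:

```python
def complete_with_escalation(call_llm, prompt, default_cap=8_000, escalated_cap=64_000):
    """Reserve a small output slot by default; only the rare truncated
    response pays for the 64K retry."""
    resp = call_llm(prompt, max_tokens=default_cap)
    if resp["finish_reason"] == "max_tokens":  # truncated: escalate once
        resp = call_llm(prompt, max_tokens=escalated_cap)
    return resp

# Fake provider: truncates under a 10K cap, succeeds above it.
def fake_llm(prompt, max_tokens):
    if max_tokens < 10_000:
        return {"finish_reason": "max_tokens", "text": "partial"}
    return {"finish_reason": "stop", "text": "full answer"}
```

The 99%+ of requests that fit in 8K never pay the reservation cost; the rest pay one extra round-trip.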
10.4 Streaming: skip the SDK helper
The SDK's BetaMessageStream calls partialParse() on every input_json_delta – repeatedly re-parsing growing JSON from scratch (O(n²)). Use raw stream events and accumulate tool-input strings yourself.
10.5 Watchdog and fallback
- Idle watchdog: `setTimeout(90s)` reset on every chunk. At 45s, warn. At 90s, abort and retry non-streaming.
- Non-streaming fallback activates when streaming dies mid-response (network, stall, truncation, proxies returning 200 with non-SSE bodies).
- Disable the fallback when streaming tool execution is active – duplicate tool runs would corrupt state.
10.6 API preconnection at boot
Fire a HEAD request to your LLM API during init. The TCP+TLS handshake (100–200 ms) overlaps with setup. The connection is warm by the time the user submits.
10.7 Cheap-first model routing (PicoClaw)
A rule-based classifier scores each turn 0..1 on five language-agnostic features:
| Feature | Weight |
|---|---|
| Has attachments | 1.00 |
| Code block present | 0.40 |
| Tokens > 200 | 0.35 |
| Recent tool calls > 3 | 0.25 |
| Tokens > 50 | 0.15 |
| Recent tool calls 1–3 | 0.10 |
| Conversation depth > 10 | 0.10 |
With threshold 0.35, trivial chat stays on a cheap "light" model (Gemini Flash, Haiku); code/attachments/tool-active goes to the heavy model. This alone cuts API spend dramatically for chatty workloads.
Each agent has both Candidates and LightCandidates – primary and cheap fallback chains. Routing only picks the chain; fallback logic inside the chain is generic.
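The table translates directly into a weighted score. A sketch: the combination rule (sum the active weights, clamp to 1) is an assumption – the source gives only the weights and the threshold:

```python
# Weights from the table; threshold 0.35 routes to the heavy chain.
WEIGHTS = {
    "has_attachments": 1.00,
    "code_block": 0.40,
    "tokens_gt_200": 0.35,
    "recent_tools_gt_3": 0.25,
    "tokens_gt_50": 0.15,
    "recent_tools_1_3": 0.10,
    "depth_gt_10": 0.10,
}
THRESHOLD = 0.35

def route(features):
    """Sum the weights of active features, clamp to 1, compare to threshold."""
    score = min(sum(w for name, w in WEIGHTS.items() if features.get(name)), 1.0)
    return "heavy" if score >= THRESHOLD else "light"
```

A rule-based classifier like this costs zero tokens per turn, which is the point: the router must be far cheaper than the decision it saves.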
10.8 Lean runtime tricks (PicoClaw, Go)
- Static linking: no shared-library footprint.
- `-ldflags="-s -w"` strips the symbol table and DWARF info (~30% size reduction).
- `-trimpath` removes file system paths from binaries.
- Bounded queues everywhere – turns "memory bug" into "rejected request" you can monitor and tune.
- Lazy initialization – channels, hooks, skill registries init only when enabled.
- `membench` regression gate in CI – peak RSS measured per PR.
10.9 Concrete cost economics
Two anchor points:
- CodeActAgent v1.8 on Claude 3.5 Sonnet: 26% on SWE-Bench Lite at $1.10 per instance.
- OpenHands V1 default condenser: cuts API spend by ~2× on long sessions.
Order-of-magnitude expectations:
- Trivial task (few file edits, no tests): $0.05–$0.30 per run on a frontier model.
- SWE-Bench-style real fix: $0.50–$3 per task.
- Multi-hour autonomous run: $5–$30, easily more without a condenser.
10.10 Bootstrap targets
For interactive agents:
- Boot time < 300 ms.
- First token streamed < 1 second.
- Fast-path dispatch: `--version` / `--help` → dynamic-import only that handler, exit. Don't load React, telemetry, MCP.
- 50+ profiling checkpoints sampled at 100% of internal users / 0.5% of external. Without instrumentation you can't tell what to optimize.
Actionable rules
- Treat the prompt cache as architecture. Static-then-dynamic boundary, sticky latches, byte-identical fork prefixes.
- Cap `max_tokens` at 8K, escalate on truncation. Recovers 12–28% of context.
- Cheap-first routing: 5-feature classifier, threshold 0.35, light/heavy model chains.
- Bounded queues, lazy init, regression gates.
- Boot < 300ms, first token < 1s. Profile from day one.
Part 11 – Provider Abstraction & Resilience
11.1 The Provider interface
Tiny. Everything that's hard about LLMs lives inside this seam.
```go
type Provider interface {
    Name() string
    DefaultModel() string
    Chat(ctx, req) (Response, error)
    ChatStream(ctx, req, onChunk) (Response, error)
}
```
Every backend – Anthropic native HTTP+SSE, OpenAI-compatible (Groq, DeepSeek, Gemini, Mistral, OpenRouter, vLLM, LM Studio, Ollama all speak the same wire), Claude CLI subprocess, Bedrock, Azure, Vertex AI, ACP JSON-RPC, DashScope – implements this interface. The agent loop never knows which one it's talking to.
Use LiteLLM (Python) or equivalent – 100+ providers for free.
11.2 The reliability stack (the part most projects miss)
When a provider call fails, the wrapper consults:
- Error classifier – Auth? Rate limit? Network blip? 5xx? 9 canonical reasons.
- Cooldown – if Auth/Quota, mark this provider unavailable for N minutes.
- Rate limiter – token bucket to keep us under contractual TPM/RPM.
- Fallback – try the next candidate in the chain (heavy → light, or primary → secondary key).
- Retry – exponential backoff with jitter; honors `Retry-After`; retries 5xx + network only (not 4xx).
- Cache middleware – caches identical prompts within a TTL.
- Service tier – picks `priority` / `flex` / `auto` per request.
The agent never sees this – it sees one logical "send" that either returns a response or gives up after the chain is exhausted.
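The retry rung in isolation, as one provider-agnostic policy; `attempt` and `classify` are stand-ins for the adapter call and the error classifier:

```python
import random

def send_with_retry(attempt, classify, max_retries=5, base=0.5, sleep=lambda s: None):
    """Retry server/network errors with exponential backoff + jitter;
    honor Retry-After; never retry client errors.
    Pass sleep=time.sleep in real use."""
    for i in range(max_retries + 1):
        ok, result, meta = attempt()
        if ok:
            return result
        if classify(meta) == "client_error":  # 4xx: retrying won't help
            raise RuntimeError(result)
        if i == max_retries:
            break
        # Server told us when to come back? Obey. Otherwise back off with jitter.
        delay = meta.get("retry_after") or base * (2 ** i) * (1 + random.random())
        sleep(delay)
    raise RuntimeError("provider chain exhausted")

# Fake provider: overloaded twice (529), then succeeds.
state = {"n": 0}
def attempt():
    state["n"] += 1
    if state["n"] < 3:
        return False, "overloaded", {"status": 529, "retry_after": 0.01}
    return True, "response", {}
```

Injecting `sleep` keeps the policy testable without real waits, which is exactly the "one-page, provider-agnostic policy" the rich error fields below enable.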
11.3 Wire-format quirks live in adapters, not the loop
Examples:
- Anthropic uses `x-api-key`; OpenAI-compat uses `Bearer`; Codex uses OAuth + token refresh.
- DashScope wraps Qwen with custom thinking-budget mapping.
- Some providers return "200 with error body" (Slack-style). Normalize at the adapter.
Force every quirk through one interface and you keep the agent loop boringly simple.
11.4 Reasoning/thinking blocks are first-class
Anthropic extended thinking → ThinkingBlocks; OpenAI o-series reasoning → ReasoningItemModel. Persist these on ActionEvent / MessageEvent so they're replayable and can be fed back to the model on the next turn – required to maintain reasoning continuity for o-series and Sonnet thinking mode.
11.5 Non-native function calling
For models without native tool support, serialize tools into a structured prompt and parse responses with regex. This lets even small open-source models drive the agent loop. Pattern: detect, then either call native function calling or fall back to prompt-and-parse – same agent code path.
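A sketch of the prompt-and-parse fallback; the `<tool>` tag format here is an arbitrary choice for illustration, not a standard:

```python
import json
import re

def render_tools_prompt(tools):
    """Serialize tool descriptions into the prompt for models without
    native tool support."""
    lines = [f"- {name}: {desc}" for name, desc in tools.items()]
    return ('Reply with <tool>{"name": ..., "args": {...}}</tool> to call a tool.\n'
            "Available tools:\n" + "\n".join(lines))

TOOL_RE = re.compile(r"<tool>(\{.*?\})</tool>", re.DOTALL)

def parse_tool_call(reply):
    """Regex-parse the model's reply; None means plain text, no tool call."""
    m = TOOL_RE.search(reply)
    return json.loads(m.group(1)) if m else None
```

Because the parser returns the same `{name, args}` shape a native function-calling API would, the agent loop downstream stays identical.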
11.6 Routers all the way down
RouterLLM – abstract base; subclass with select_llm(messages) → str. Real example: route image-containing messages to a vision model and text-only to a cheap model. Composes recursively (a router can route to a router), so you can build cost-optimization trees.
11.7 Rich LLMResponse error fields (Nanobot's hard-won detail)
```python
class LLMResponse:
    content: str | None
    tool_calls: list[ToolCallRequest]
    finish_reason: str
    usage: dict[str, int]
    reasoning_content: str | None
    thinking_blocks: list[dict] | None
    error_status_code: int | None
    error_should_retry: bool | None
    retry_after: float | None
```
Capture rich error metadata in the response object itself. The retry layer becomes a one-page, provider-agnostic policy instead of a forest of except clauses.
Actionable rules
- Tiny `Provider` interface. Every quirk lives behind it.
- 7-layer reliability stack: classify → cooldown → rate-limit → fallback → retry → cache → tier.
- Persist reasoning/thinking blocks for continuity.
- Pattern-detect native vs non-native function calling – same agent code path.
- Rich error fields on the response object. One retry policy, not N.
Part 12 – Channels & Integration Surface
12.1 Channels as pluggable adapters
Each external messaging platform is an adapter that:
- Translates platform events to a unified `InboundMessage`.
- Translates unified `OutboundMessage` to platform replies.
Two functions: Listen() and Send(). Keep the agent loop ignorant of platform specifics.
12.2 First-class fields beat metadata bags
Don't bury chatId, senderId, messageId inside generic metadata maps. Hoist them to typed fields:
```go
type InboundMessage struct {
    Peer       Peer       // platform + chat + topic
    MessageID  string
    Sender     SenderInfo // canonical identity ("telegram:42")
    Body       string
    Media      []MediaRef
    ReceivedAt time.Time
}
```
This is the contract that session allocation, routing, and hooks rely on. Put it in your design from day one – retrofitting is painful.
12.3 Capability-based polymorphism (PicoClaw)
Every platform sub-package embeds BaseChannel and implements the minimum interface. Optional capabilities are separate interfaces:
```go
type MediaSender interface{ SendMedia(...) error }
type TypingCapable interface{ ShowTyping(...) error }
type ReactionCapable interface{ React(...) error }
type PlaceholderCapable interface{ SendPlaceholder(...) (id string, err error) }
type MessageEditor interface{ Edit(...) error }
type WebhookHandler interface{ HandleWebhook(...) }
type HealthChecker interface{ Check(ctx) error }
```
The manager probes channels with if c, ok := ch.(MediaSender); ok { ... }. Adding VoiceCapable to one platform doesn't change anyone else.
12.4 The manager owns retries, not the channel
Centralize in the manager:
- Worker queue with rate limit per channel.
- Outbound message splitting – long replies broken at sentence/word boundaries below the platform's per-message limit.
- Retries with backoff on transient errors, classified by sentinel error types.
- Typing/reaction indicators as transparent decorations of long turns.
Platforms only know how to send a single chunk. Everything fancy happens above them. 30 channels × N retry strategies = 0 duplication.
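A word-boundary splitter sketch (production versions prefer sentence boundaries first and handle code blocks specially; words longer than the limit are hard-cut here):

```python
def split_message(text, limit):
    """Greedy word-boundary split; each chunk stays under the platform limit."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word[:limit]  # single word over the limit: hard cut
    if current:
        chunks.append(current)
    return chunks
```

Because splitting lives in the manager, every channel gets it for free with its own `limit` (Telegram 4096, Discord 2000, SMS 160, and so on).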
12.5 Self-registration via blank imports
```go
// channels/telegram/telegram.go
func init() {
    channels.Register("telegram", New)
}

// main.go
import _ "yourapp/channels/telegram" // side-effect registration
```
The main binary just imports for side effects. The channel becomes available in the registry. No registry plumbing.
12.6 The bus is the universal IPC (Nanobot)
```python
class MessageBus:
    inbound: asyncio.Queue[InboundMessage]
    outbound: asyncio.Queue[OutboundMessage]
```
That's the entire seam. Every channel just needs to (a) translate platform events into InboundMessage and publish_inbound, and (b) listen to consume_outbound for messages addressed to its channel name and translate back.
Side effect: cron jobs, sub-agent results, heartbeat triggers, and inter-agent messages all use the same bus – they're just synthetic InboundMessage events with channel="system". Uniformity = small code.
12.7 Session keys: structured identity
```
agent:{agentId}:{channel}:direct:{peerId}   → DM
agent:{agentId}:{channel}:group:{groupId}   → Group
agent:{agentId}:subagent:{label}            → Subagent
agent:{agentId}:cron:{jobId}:run:{runId}    → Cron run
```
Or content-addressed: sk_v1_<sha256> (PicoClaw) – stable, opaque, the source of truth, with legacy aliases resolved transparently.
12.8 Pairing flow (DM access policies)
Three policies:
| Policy | Behavior |
|---|---|
| pairing | 8-character code, 60-min validity |
| allowlist | explicit user IDs only |
| open | anyone (use sparingly) |
allowFrom empty = deny all. The right default for personal agents.
Actionable rules
- Channels: two methods (`Listen`, `Send`). Everything else lives in the manager.
- First-class typed fields on `InboundMessage`. Never metadata bags.
- Capability interfaces beat optional methods on a god-interface.
- Manager owns retries, splitting, rate-limit. Channels send one chunk.
- The bus carries inbound, outbound, system events. Uniformity = small code.
- DM access: `pairing` / `allowlist` / `open`. Empty = deny all.
Part 13 – Observability & Evaluation
13.1 Trace everything
Three span types: agent, llm_call, tool_call. Wrap every LLM call in a span. Wrap every tool call in a span. Trace tree mirrors the run shape.
| Detail | Value |
|---|---|
| Batch size | 100 spans |
| On batch failure | retry individually |
| Verbose mode | full input/output truncated at 50 KB |
| Span exporters | OpenTelemetry compatible |
13.2 Cost tracking from step 1
Every API response runs through a cost accumulator:
- Per-model usage in bootstrap state.
- Reports to OpenTelemetry.
- Recursively processes nested model calls (sub-agents, recall queries).
- Persists to project config on process exit.
- Restores on next session if persisted session ID matches.
Histograms use reservoir sampling (Algorithm R) with 1,024 entries to compute p50/p95/p99. Averages hide tail latency, and tail latency is what users feel.
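Algorithm R in its standard form, sized at 1,024 entries as above; the percentile read is a simple sorted-index lookup:

```python
import random

class Reservoir:
    """Algorithm R: a uniform sample of the stream in O(k) memory –
    enough for p50/p95/p99 without storing every latency."""
    def __init__(self, k=1024, rng=None):
        self.k, self.n, self.samples = k, 0, []
        self.rng = rng or random.Random()

    def add(self, value):
        self.n += 1
        if len(self.samples) < self.k:
            self.samples.append(value)
        else:
            j = self.rng.randrange(self.n)  # keep with probability k/n
            if j < self.k:
                self.samples[j] = value

    def percentile(self, p):
        s = sorted(self.samples)
        return s[min(int(p / 100 * len(s)), len(s) - 1)]
```

Every element of the stream ends up in the reservoir with equal probability k/n, so percentiles over the sample approximate percentiles over the full stream.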
Even in v0, instrument cost and latency. You cannot decide what to optimize from feel.
13.3 Replayable trajectories
Every step() writes a .traj JSON file containing history, model output, observations, costs. SWE-agent's run-replay re-executes any old run. The append-only event log is the source of truth.
Worth it just for debugging. When the agent does something weird at minute 47, you can rewind to any event and try a different model or prompt.
13.4 ๐งช Eval taxonomies
Three layers of evaluation:
| Eval | What it measures |
|---|---|
| Single-step | Does one tool call work correctly? |
| Full-run | Does the complete task get solved? |
| Multi-turn | Does the agent handle evolving goals? |
13.5 ๐ Trace grading
Grade agent traces directly โ especially helpful for multi-step tasks where final output alone doesn't reveal process quality. Use a separate LLM as a judge with a clear rubric.
13.6 ๐ฏ Skill-level evals
Measure whether a specific skill actually helps using:
- Bounded tasks → reproducible inputs.
- Deterministic verifiers → automated pass/fail.
- No-skill baseline → does the skill move the needle?
- Trace review → human-spotcheck the failures.
13.7 ๐ก Infrastructure noise
Runtime configuration can move coding benchmark scores by more than many leaderboard gaps.
Infrastructure choices may matter more than model intelligence. The same model with a better harness, better tools, better verification, lands a higher score.
13.8 ๐ Activity log for every admin action
Every admin write to global tables (settings, permissions, tool config) appends to an audit log: { tenant_id, actor_id, action, target, timestamp, ip }. Cheap to write, invaluable when "who changed X?" comes up.
โ Actionable rules
- Spans on every LLM and tool call. Trace tree mirrors the run.
- Cost + reservoir-sampled latency from day one.
- Append-only event log = replayable trajectories.
- Eval at three layers: single-step, full-run, multi-turn. Trace-grade.
- Skill-level evals with no-skill baselines. If it doesn't move the needle, drop it.
- Audit log for every admin action.
๐บ๏ธ Part 14 โ The Build-Your-Own Roadmap
A pragmatic order to implement everything above. Each step compiles and runs on its own.
๐ฑ Milestone 0 โ Foundation (1โ2 days)
- Pick the language. Go for small/portable; Python for ML/research/speed.
- Pick the DB: PostgreSQL + pgvector if you ever want vector search.
- Skeleton: `cmd/`, `internal/`, `pkg/`, `migrations/`, `docs/`, `Makefile`, `docker-compose.yml`.
- Define the `Provider` interface (4 methods).
- Implement one provider; start with OpenAI-compatible (covers Groq, DeepSeek, Together for free).
- `cmd/serve` loads config, makes one HTTP request, prints the response.
๐ Milestone 1 โ Minimum Viable Agent Loop (1 week)
- Define the `Tool` interface: `name`, `description`, `schema`, `execute(ctx, args)`.
- Implement 3 tools: `read_file`, `write_file`, `list_files`, all workspace-scoped, with a `resolvePath()` traversal guard.
- Build the loop: `for i := 0; i < 20; i++ { think; if no tools break; act; observe }`.
- Persist sessions: `SessionStore` interface + in-memory implementation.
- Emit events via callback. Three only: `run.started`, `tool.call`, `run.completed`.
- HTTP endpoint `/v1/chat/completions` (OpenAI-compatible). One agent. No streaming yet.
You now have an LLM that can read/write files in a workspace.
๐งฉ Milestone 2 โ System Prompt Architecture (3โ4 days)
- Bootstrap files: `agent_context_files` (agent-level) + `user_context_files` (per-user). 6 known files: SOUL, IDENTITY, AGENTS, TOOLS, BOOTSTRAP, USER.
- `ContextFileInterceptor`: when a tool reads/writes a known name, route to DB instead of disk.
- System prompt builder: assemble from sections. Persona early, persona reminder late.
- Two modes: `PromptFull` and `PromptMinimal`.
- Per-user file seeding on first chat.
๐ข Milestone 3 โ Multi-Tenancy from the Start (3โ4 days)
- `tenants` and `api_keys` tables. UUID v7 PKs.
- `tenant_id NOT NULL` on every table that holds tenant data.
- `WithTenantID(ctx)` / `TenantIDFromContext(ctx)` helpers.
- Resolve API key → SHA-256 lookup → set tenant on ctx at the gateway.
- Update every store query to add `WHERE tenant_id = $N`. Audit the diff.
- Master tenant for legacy/single-user data; master scope guard for global writes.
๐ง Milestone 4 โ Pipeline Refactor (1 week)
Once your loop has > 3 conditional branches, split it:
- Define the `Stage` interface, `StageResult` enum, and `RunState` struct.
- Implement `ContextStage`, `ThinkStage`, `ToolStage`, `ObserveStage`, `CheckpointStage`, `FinalizeStage`. Add `PruneStage` later.
- `Pipeline.Run` orchestrates: setup → iteration loop → finalize.
- Feature flag (`pipeline_enabled`) so V2 (monolithic) and V3 (pipeline) coexist during migration.
๐พ Milestone 5 โ Memory & Search (1โ2 weeks)
- `memory_documents` + `memory_chunks` tables. `tsvector` (FTS) + `vector(1536)` columns.
- `MemoryInterceptor` auto-chunks + embeds on `.md` writes inside `memory/*`.
- Hybrid search: `0.7 * vector + 0.3 * fts`. Per-user 1.2× boost. Dedup.
- `memory_search` and `memory_get` tools.
- Later: `episodic_summaries` + `EpisodicWorker` subscribed to `run.completed`.
- Later: `kg_entities` + `kg_relations` with temporal validity for L2.
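The hybrid blend is simple to sketch. The weights (0.7/0.3) and the 1.2× per-user boost are the ones quoted above; the `Candidate` shape and upstream score normalization are assumptions:

```go
package main

import "sort"

// Candidate is one memory chunk scored by both retrieval paths.
// Scores are assumed normalized to [0,1] upstream.
type Candidate struct {
	ChunkID   string
	VecScore  float64 // cosine similarity
	FTSScore  float64 // full-text rank
	OwnedByMe bool    // chunk belongs to the requesting user
}

func hybridScore(c Candidate) float64 {
	s := 0.7*c.VecScore + 0.3*c.FTSScore
	if c.OwnedByMe {
		s *= 1.2 // per-user boost
	}
	return s
}

// HybridRank blends the two scores and sorts best-first.
// Dedup across documents would run after this step.
func HybridRank(cands []Candidate) []Candidate {
	scored := append([]Candidate(nil), cands...)
	sort.SliceStable(scored, func(i, j int) bool {
		return hybridScore(scored[i]) > hybridScore(scored[j])
	})
	return scored
}
```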
๐ก๏ธ Milestone 6 โ Tool Registry Hardening (1 week)
- Funnel every tool call through `Registry.ExecuteWithContext`.
- Token-bucket rate limiting per session key (defaults: 60/min, burst 5).
- Credential scrubber: start with 5–10 high-value patterns.
- Policy engine: profiles (`full` / `coding` / `messaging` / `minimal`), groups, allow/deny lists.
- Shell deny groups (start with: `destructive_ops`, `reverse_shell`, `dangerous_paths`, `package_install`).
- Capability metadata on every tool.
๐ก Milestone 7 โ Channels (per channel, ~2 days each)
- Define the `Channel` interface: `Listen(ctx, onMessage)`, `Send(ctx, OutboundMessage) error`.
- Telegram first (simplest, long-polling).
- `channel_instances` table with `tenant_id` baked in.
- Outbound dispatcher routes by `channel_instance_id`.
- Pairing flow: 8-char code, 60-min TTL.
- Then: Discord, Slack, WhatsApp, Feishu, Zalo.
๐ Milestone 8 โ Observability (3โ4 days)
- `traces` and `spans` tables. Three span types.
- Wrap every LLM call and tool call in a span.
- `BatchCreateSpans` in batches of 100; on failure, retry individually.
- Verbose mode (`TRACE_VERBOSE=1`) for full input/output, truncated at 50 KB.
- Optional: OpenTelemetry exporter.
๐ Milestone 9 โ Resilience (3โ4 days)
- Wrap providers with retry middleware.
- Per-model cooldown.
- Failover chain.
- Mid-loop compaction at 75%; post-run at 50 messages or 75%.
- Per-session `TryLock` for the compaction goroutine.
- Stuck detector (5 patterns, semantic comparison).
- Autosubmit on every fatal error path.
๐ค Milestone 10 โ Multi-Agent (1โ2 weeks)
- `subagent` table. Limits: depth 1, max 5 children, max 8 concurrent.
- `spawn` tool (async return), `delegate` tool (sync with timeout).
- `agent_links` table for delegation eligibility.
- When ready: `teams`, `agent_team_members`, `team_tasks`, `team_messages`.
- Atomic task claim: `UPDATE … WHERE status = 'pending' AND owner_agent_id IS NULL`.
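Spelled out, the atomic claim looks like the statement below. Table and column names beyond those quoted above are hypothetical:

```sql
-- Hypothetical schema: team_tasks(id, status, owner_agent_id, claimed_at).
-- The WHERE clause is the lock: only one concurrent UPDATE can match a
-- row that is still pending and unowned, so exactly one agent wins.
UPDATE team_tasks
SET status = 'claimed',
    owner_agent_id = $1,
    claimed_at = now()
WHERE id = $2
  AND status = 'pending'
  AND owner_agent_id IS NULL;
```

The caller checks rows-affected: 1 means you own the task, 0 means another agent claimed it first. No Redis, no lock service, no race.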
๐ Milestone 11 โ Production Hardening (ongoing)
- Add the remaining 4 security layers (input guard, output sanitizer, isolation).
- AES-256-GCM encryption for all at-rest secrets. `aes-gcm:` prefix convention.
- API keys: 16 random bytes, SHA-256 hash, constant-time compare.
- Activity log for every admin action.
- Hourly snapshot aggregations.
- Per-tenant config UI.
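The API-key scheme above fits in a few lines of standard-library Go. The `sk-` prefix and function names are assumptions:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
)

// NewAPIKey returns the plaintext key (shown to the user exactly once)
// and the SHA-256 digest, which is the only thing stored server-side.
func NewAPIKey() (plaintext string, digest [32]byte, err error) {
	raw := make([]byte, 16) // 16 random bytes, per the milestone above
	if _, err = rand.Read(raw); err != nil {
		return "", digest, err
	}
	plaintext = "sk-" + hex.EncodeToString(raw) // prefix is an assumption
	digest = sha256.Sum256([]byte(plaintext))
	return plaintext, digest, nil
}

// Verify hashes the presented key and compares in constant time,
// so response timing cannot leak digest bytes.
func Verify(presented string, stored [32]byte) bool {
	d := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(d[:], stored[:]) == 1
}
```

Because only the hash is stored, a database dump does not yield usable keys, which pairs with the encrypted-secrets rule above.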
๐งฉ Milestone 12 โ Optional Surface Area
- Knowledge Vault with wikilinks (`[[target]]`).
- MCP bridge (stdio + SSE + streamable-http transports, per-agent + per-user grants).
- Custom shell tools (DB-stored, hot-reloaded).
- Cron jobs.
- Browser automation (headless Chrome).
๐ฐ๏ธ Save for last (don't build until milestone 12)
- Fork agents (cache-driven sub-agents)
- Swarm teams
- Remote tasks across machines
- KAIROS continuous-mode logs
- Auto-mode permission classifier
- Renderer optimization (cell-diffing, BSU/ESU)
- Bitmap search index for huge filesystems
โ ๏ธ Part 15 โ Anti-Patterns to Avoid
Each row is a trap that's burned multiple production teams.
๐ Loop / control flow
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Callbacks or event emitters for the agent loop | You'll re-invent backpressure poorly | `async function*` (or channels) |
| A single `error` terminal state | Loses information about why | Encode 10+ specific reasons in a discriminated union |
| Stop-hooks on error responses | Creates error → hook blocks → retry → error infinite loops | Skip them on errors |
| Forgetting to pair `tool_use` with `tool_result` on abort | API rejects the next message | Drain queued tools with synthetic results on every cancel path |
| Trusting the model's tool-call format | Models hallucinate `<tool_call>` XML, `[Tool Call: ...]` text | 7-step output sanitizer strips them all |
| One giant `runLoop()` function | 2k-line functions become untestable | 8-stage pipeline; each stage isolated |
๐ ๏ธ Tools
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Constructor literal instead of factory | Defaults will be unsafe | Always go through `buildTool()` |
| Per-tool-type concurrency safety | `Bash` is sometimes safe, sometimes not | Pass parsed input |
| Concatenating built-ins and MCP tools then sorting flat | Cache breakpoint dies | Sort within partition, then concat |
| Returning huge raw output | Context blows up | Cap with `maxResultSizeChars`; persist to disk + return preview |
| Using the SDK's `BetaMessageStream` | O(n²) JSON re-parsing | Read raw stream events |
| Bypassing the tool registry "just for this one call" | Loses scrubbing, rate-limit, RBAC | Every tool call through the registry, no exceptions |
| Reusing the human shell (`cat`, `grep -rn`) | Bad agent tools: too much output, no error story | Build agent-shaped commands with bounded output |
| Free-form `sed -i` edits | Frequent syntactic collapses | Line-range edit with lint + auto-rollback |
๐ Permissions
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Scattering `if mode === ...` checks throughout tool code | Untestable, drift | Centralize in modes + resolution chain |
| Trusting a partial bash parse | Bypassable | If `parseForSecurity()` fails, treat as unsafe |
| Sub-agent default = `default` mode | Needs a UI to prompt; background agents have none | Default to `bubble` (sync) or `dontAsk` (async) |
โก Caching / API
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Runtime conditionals in the static prompt prefix | Each one doubles the cache key space | Move below the dynamic boundary |
| Mid-session feature toggles that change request headers | Bust the cache | Use sticky latches |
| Reserving 64K output tokens by default | Over-reserves 8–16× | Cap at 8K, escalate on demand |
| Regenerating the system prompt for fork children | Feature flags or the session date may have moved | Pass the parent's bytes |
| Filtering tools per child agent in fork mode | Different array → different cache key | `useExactTools: true` and runtime guards |
๐พ Memory
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Storing what `git log` can answer | Useless duplication that goes stale | Derivability test: if git/code can answer it, don't memorize |
| Embedding-only retrieval | Misses negation ("do NOT mock") | LLM recall over a manifest, hybrid with FTS |
| Hard expiration | Stale memories are still data | Annotate with age; let the model decide |
| Letting `MEMORY.md` grow past 200 lines | Truncated silently | Treat the index as a budget |
๐ค Multi-agent
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Coordinators with the full tool set | They'll do the work themselves | Restrict to `Agent`, `SendMessage`, `TaskStop` |
| Workers asked to "based on the research, implement X" | Re-derive context, miss specifics, hallucinate paths | Synthesis is the coordinator's job; give exact paths/lines |
| Mid-tool-execution message delivery | Race conditions | Queue at tool-round boundaries |
| Unbounded teammate state | 36.8 GB / 292 agents was a real incident | Cap message history |
| General-purpose agents that can spawn `Agent` | Exponential fan-out | Block recursive spawning at the schema level |
๐ข Multi-tenancy
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Single-tenant first, "we'll add it later" | Migration is brutal: every query, test, cache key | `tenant_id NOT NULL` on day one |
| Trusting a client-supplied `tenant_id` header | Spoofable; cross-tenant leakage | Tenant resolved from the API key at the gateway |
๐ช Bootstrap / hooks
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Loading the world for `--version` | Slow startup | Fast-path dispatch first |
| Hook config that updates live mid-session | Lets a malicious repo redefine permissions after the trust dialog | Snapshot at startup; update only via an explicit user channel |
| Treating MCP skills like local skills | They are content-only | Never execute their inline shell commands |
๐ Provider / API
| Anti-pattern | Why it's a trap | What to do instead |
|---|---|---|
| Hard-coding one LLM provider | You'll need 5 within a year | `Provider` interface + adapters |
| Storing secrets unencrypted because "it's the same DB" | Database dumps leak; an insider widens the blast radius | AES-256-GCM with the `aes-gcm:` prefix |
| `time.Sleep` between LLM retries | Wastes time + cost; thundering herd | Exponential backoff with jitter; honor `Retry-After` |
| Distributed lock for "claim this task" | Adds Redis/ZooKeeper; race conditions still possible | Atomic SQL `UPDATE` with `WHERE status = 'pending'` |
| Loading the full agent config on every request | Slow; chatty | Router cache with TTL + pub/sub invalidation |
| Synchronous summarization on the request path | User waits 10+ seconds | Synchronous flush, asynchronous summarize |
| Letting the agent self-modify its prompts unguarded | One bad cycle and quality craters | Suggestion engine + admin approval + `rollback_on_drop_pct` |
๐ฏ Part 16 โ Closing: The Harness Mindset
Three closing observations distilled from every source.
1. ๐ฏ Push complexity to the boundaries
Permission resolution, protocol translation, state reconciliation, tool I/O: these are the messy edges. Concentrate the mess there. Keep the loop, the tool composition, the memory recall, and the streaming logic clean and exhaustively typed.
2. ๐ The agent is a function from event history to next event, run in a loop
Everything else is a hook into that one loop:
- "Function" → the stateless `Agent`.
- "Event history" → the append-only `EventLog`.
- "Next event" → an `Action`, executed by the `Workspace`, producing an `Observation`.
- "Run in a loop" → the `Conversation`, until `Finish` or stuck.
There is no big design. There is one tight kernel and a lot of small components hanging off it.
3. ๐ง Iterate on failures
The single most important cultural practice from the harness-engineering discipline:
Anytime an agent makes a mistake, engineer a solution so it never makes that mistake again.
Ship first. Add configuration reactively. Throw away what doesn't help. Distribute battle-tested configurations. Treat technical debt as a high-interest loan.
After many production incidents the pattern is the same:
- "GPT-6 will fix it": almost always wrong.
- "It's a configuration problem": almost always right.
The fix is in your harness: context management, tool selection, verification loops, handoff artifacts, prompt reinforcement zones, hook ordering, error ladders.
๐ณ The shortest possible recipe
If you only build six things well, you have a great agent:
- An async-generator loop with typed terminal states and a continue-state ladder for recovery.
- A self-describing tool registry with per-invocation safety, the 14-step pipeline, and bounded output.
- A 4-layer context compression pipeline preserving the prompt cache architecture.
- File-based memory with always-loaded index + LLM recall side-query.
- Defense-in-depth security with five independent layers.
- Multi-tenancy on day one: `tenant_id NOT NULL` everywhere.
Build those, and you've shipped a real agent. The rest of this guide is layering and polish.
๐ Appendix โ Source Map
| Source | The lessons learned |
|---|---|
| Claude Code (from-source guide) | Async-generator loop, prompt cache as architecture, fork agents, file-based memory, hooks, 4-layer compression |
| OpenHands | CodeAct (code as universal action), append-only event log, Workspace abstraction, Skills/microagents, stuck detection, risk-aware confirmation |
| SWE-agent | The Agent-Computer Interface thesis, line-bounded edit + lint + rollback, autosubmit on error, cost-budget |
| GoClaw | Multi-tenancy from day one, 8-stage pipeline, 3-tier memory, 5-layer security, channel adapters, provider resilience stack |
| Nanobot | Bus-based decoupling, per-session lock + pending queue, files + git for memory, progressive skill loading |
| PicoClaw | Lean Go runtime, capability-based polymorphism, JSONL persistence with sidecar metadata, 64-shard mutex, cheap-first routing, JSON-RPC stdio hooks |
| Harness Engineering | Agent = Model + Harness; feedforward + feedback control; sensors/guides; sub-agents as context firewalls; iterate on failures |
"It's not a model problem. It's a configuration problem." โ every team, after enough incidents.
If you found this helpful, let me know by leaving a like or a comment! And if you think this post could help someone, feel free to share it. Thank you very much!