Truong Phung

🙌 OpenHands — Deep Dive & Build-Your-Own Guide 📚

A practical, technical walkthrough of how OpenHands (formerly OpenDevin) actually works, what makes it highly autonomous, and how you can build a similar agent from first principles.

Written April 2026. Based on the V1 SDK paper (arXiv 2511.03690), the original OpenDevin paper (arXiv 2407.16741), the OpenHands docs, and the source of All-Hands-AI/OpenHands and OpenHands/software-agent-sdk.



💡 TL;DR — what OpenHands is, in one paragraph

OpenHands is an open-source autonomous software-engineering agent. It scores ~77% on SWE-Bench Verified with Claude Sonnet 4.5, opens GitHub PRs without supervision, and ships under MIT. Architecturally it is a tiny core: a stateless Agent that emits Actions, a Conversation that runs the loop and stores an append-only EventLog, a Workspace (local process or Docker container) that executes Actions and returns Observations, and an LLM wrapped by LiteLLM for provider portability. Everything else — memory compression, microagent knowledge, sub-agent delegation, security review, stuck detection — is a small auxiliary service hanging off the event stream. The result is a system where a few thousand lines of Python turn an LLM into something that can run for hours, recover from its own errors, and finish real engineering tasks.

If you only remember three slogans:

  1. 💻 Code is the universal action. Don't design 20 bespoke tools. Give the agent bash + Python + a file editor + a browser, then let it write code.
  2. 📦 State lives in one place. All components are immutable Pydantic models. The only mutable thing is ConversationState. This makes the system replayable, debuggable, and safe to parallelize.
  3. 🔄 Observations close the loop. Every error, stderr, exit code, and HTTP response goes back into the next prompt. Self-correction is not a feature — it's a side effect of letting the LLM see its own consequences.

1. 🧠 The mental model — Agent / Conversation / Workspace / Event Stream

+--------------+      +-----------------+      +---------------+
|    Agent     |<-----|  Conversation   |----->|  Workspace    |
| (stateless,  |      |  (loop runner,  |      | (Local/Docker |
|  Pydantic)   |      | state, EventLog)|      |  /Remote)     |
+------^-------+      +--------^--------+      +-------^-------+
       |                       |                       |
       | uses                  | persists / streams    | executes
       v                       v                       v
   +-------+              +----------+         +---------------+
   |  LLM  |              | EventLog |         | bash, python  |
   |+Cond. |              +----------+         | jupyter,      |
   +-------+                                   | browser, FS   |
                                               +---------------+

Four roles, sharp boundaries:

  • Agent — pure function from history → next Action. No state of its own. Configured by LLM, a list of Tools, a Condenser, optional MCP config, and system_prompt_kwargs.
  • Conversation — owns ConversationState, drives the loop, persists the EventLog, and is the only mutable thing in the system.
  • Workspace — knows how to execute commands and shuttle files. Three implementations: in-process (LocalWorkspace), container (DockerWorkspace), or HTTP (RemoteAPIWorkspace). Same agent code; just swap the workspace.
  • Event — every interaction is an event: MessageEvent, ActionEvent, ObservationEvent, AgentErrorEvent, Condensation, etc. The event log is append-only and the single source of truth — replaying it reconstructs the entire conversation.

This is the V1 architecture (Nov 2025 onward). The original V0 had an explicit AgentController class plus a Runtime abstraction; V1 collapsed that into Conversation + Workspace because the controller didn't earn its keep.

Why this shape works

| Property | How OpenHands gets it |
| --- | --- |
| Replayable | Append-only event log + immutable config = deterministic replay. |
| Swappable execution | Same Agent works against Local / Docker / Remote workspaces. |
| Debuggable | Every prompt/response/tool call is a typed event you can inspect. |
| Composable | Tools, condensers, security analyzers, and LLM routers are independent components. |
| Autonomous | The loop runs until FinishAction, stuck detection, or budget — no human required. |

The four V1 design principles (steal these)

The V1 paper opens with four explicit principles. They explain why the architecture looks the way it does, and they're worth lifting wholesale into your own design doc:

  1. Optional isolation, not mandatory sandboxing. The agent runs in-process by default; you swap LocalWorkspace → DockerWorkspace for isolation without changing any agent code. Don't make sandboxing a build-time decision.
  2. Stateless components, single source of truth. Agent, Tool, LLM, and Condenser are immutable Pydantic models. The only mutable thing in the entire system is ConversationState. State changes happen by appending events — never by mutating objects.
  3. Strict separation of concerns. The SDK never imports applications. The CLI, GUI, GitHub resolver, and your custom integration all consume the SDK as a library. This sounds obvious; it is not what V0 did.
  4. Two-layer composability. Compose at package level (swap workspaces, swap servers) and at component level (swap tools, prompts, condensers, LLMs). Both layers exist intentionally.

The cautionary tale: V0 reportedly had 140+ config fields, 15 config classes, and 2.8K LOC just for configuration before the V1 rewrite. If your config grows faster than your features, that's a smell.


2. ⚙️ The agent loop — the canonical 30 lines

Here is the actual Agent.step() from the V1 SDK (openhands-sdk/openhands/sdk/agent/agent.py), distilled. This is the single most important function in the project. Read it twice.

def step(self, conversation, on_event, on_token=None) -> None:
    state = conversation.state

    # 1. Drain confirmed actions waiting to execute.
    pending = ConversationState.get_unmatched_actions(state.events)
    if pending:
        self._execute_actions(conversation, pending, on_event)
        return

    # 2. Honor any UserPromptSubmit hook that wants to block the message.
    if state.last_user_message_id is not None:
        reason = state.pop_blocked_message(state.last_user_message_id)
        if reason is not None:
            state.execution_status = ConversationExecutionStatus.FINISHED
            return

    # 3. Build the LLM prompt β€” may return a Condensation event instead.
    msgs_or_cond = prepare_llm_messages(
        state.events, condenser=self.condenser, llm=self.llm)
    if isinstance(msgs_or_cond, Condensation):
        on_event(msgs_or_cond); return

    # 4. Call the LLM with retry.
    try:
        response = make_llm_completion(
            self.llm, msgs_or_cond,
            tools=list(self.tools_map.values()), on_token=on_token)
    except LLMContextWindowExceedError:
        if self.condenser and self.condenser.handles_condensation_requests():
            on_event(CondensationRequest()); return
        raise

    # 5. Classify and dispatch.
    match classify_response(response.message):
        case LLMResponseType.TOOL_CALLS:
            self._handle_tool_calls(...)
        case LLMResponseType.CONTENT:
            self._handle_content_response(...)
        case LLMResponseType.REASONING_ONLY | LLMResponseType.EMPTY:
            self._handle_no_content_response(...)

That's it. The Conversation calls step() in a while not finished: loop. Everything else — memory compression, microagent injection, security review — happens inside one of those 5 phases as a hook or as another event being emitted.

Phases worth memorizing:

  1. Drain pending actions (confirmation flow).
  2. Block if a hook rejected the user message.
  3. Prepare prompt — condenser may decide to summarize first.
  4. Call the LLM, with explicit handling for context-window overflow.
  5. Dispatch the response: tool call → execute, plain text → emit message, empty → ask the LLM to try again.

Build-your-own: when you write your own version, copy this 5-phase shape exactly. The hardest bug in agent loops is "the LLM responded but my code didn't know what to do with it" — explicit response classification kills that bug.
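To make this concrete, here is a minimal sketch of explicit response classification. The cases mirror the ones in the loop above, but the enum and the attribute names are illustrative, not the SDK's real API:

from enum import Enum, auto

class ResponseType(Enum):
    TOOL_CALLS = auto()
    CONTENT = auto()
    REASONING_ONLY = auto()
    EMPTY = auto()

def classify_response(message) -> ResponseType:
    # Tool calls win: they get executed before anything else.
    if getattr(message, "tool_calls", None):
        return ResponseType.TOOL_CALLS
    # Plain assistant text becomes a message for the user.
    if getattr(message, "content", None):
        return ResponseType.CONTENT
    # Some reasoning models emit thinking with no visible output.
    if getattr(message, "reasoning_content", None):
        return ResponseType.REASONING_ONLY
    # Nothing usable: the caller should ask the model to try again.
    return ResponseType.EMPTY

Every branch of your loop then dispatches on the enum, never on "did the string look like JSON".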

Worked example: one task end-to-end through the loop

A concrete trace makes the abstraction click. Imagine the user says: "Find the failing test in this repo and fix it." What happens, event by event:

| Step | Event appended | Source | Notes |
| --- | --- | --- | --- |
| 0 | SystemPromptEvent(prompt=..., tools=[...]) | agent | Built once at start; includes always-on skills (e.g. AGENTS.md). |
| 1 | MessageEvent(content="Find the failing test...") | user | The task. |
| 2 | ActionEvent(CmdRunAction(command="pytest -x")) | agent | LLM picks bash. |
| 3 | ObservationEvent(CmdOutputObservation(stdout="...FAILED tests/test_auth.py::test_login...", exit_code=1)) | environment | The agent now sees the failure. |
| 4 | ActionEvent(CmdRunAction(command="cat tests/test_auth.py")) | agent | Reading the failing test. |
| 5 | ObservationEvent(CmdOutputObservation(stdout="...assert user.token == 'abc'...")) | environment | |
| 6 | ActionEvent(FileEditAction(path="src/auth.py", str_replace=...)) | agent | Patching. |
| 7 | ObservationEvent(FileEditObservation(diff="...")) | environment | |
| 8 | ActionEvent(CmdRunAction(command="pytest -x")) | agent | Verifying. |
| 9 | ObservationEvent(CmdOutputObservation(stdout="...1 passed", exit_code=0)) | environment | Green. |
| 10 | ActionEvent(AgentFinishAction(final_thought="Fixed token comparison in auth.py", outputs={...})) | agent | Loop terminates. |

Three things to notice:

  • The LLM never "remembers" what it did — it sees the entire event log every step. That's why the EventLog has to be cheap to materialize.
  • The error in step 3 is what triggered the next action. The agent didn't have to be told "if a test fails, read it" — that came from the LLM reasoning over the observation.
  • Steps 8–9 are verification, not optimism. A well-prompted agent re-runs tests after editing. That's what stops it from declaring victory on broken code (more on this in §3).

3. 🔄 Actions and Observations — the universal protocol

Every interaction with the world is either an Action (something the agent decided to do) or an Observation (what happened as a result). Both are typed Pydantic models. The list is short and stable:

| Action | Observation | Purpose |
| --- | --- | --- |
| CmdRunAction(command, cwd, blocking) | CmdOutputObservation(stdout, exit_code) | Run shell |
| IPythonRunCellAction(code) | IPythonRunCellObservation | Persistent Python kernel |
| FileReadAction / FileWriteAction / FileEditAction(path, str_replace=...) | FileReadObservation / FileEditObservation | File ops, with str_replace_editor semantics |
| BrowseURLAction(url) / BrowseInteractiveAction(code) | BrowserOutputObservation | Headless Chromium |
| MessageAction(content, wait_for_response) | (none — it's a chat turn) | Talk to the user |
| AgentThinkAction(thought) | (none) | Reasoning slot |
| AgentFinishAction(final_thought, outputs) | (terminates loop) | "I'm done" |
| AgentDelegateAction(agent, inputs) | AgentDelegateObservation | Spawn a sub-agent |
| RecallAction(query) | RecallObservation(microagent_knowledge=...) | Pull knowledge in |
| CondensationAction(forgotten_event_ids, summary) | (rewrites history) | Memory compression |
| MCPAction | MCPObservation | External MCP tool |

CodeAct — why "code" is the action language

The flagship CodeActAgent (openhands/agenthub/codeact_agent/) is built around one observation from the original paper: instead of giving the LLM 20 bespoke tools each with their own JSON schema, give it bash, Python, and a browser DSL, and let it express anything as code. Empirically this generalizes far better and dramatically reduces parsing errors.

The trade-off: a giant unified action space relies on the LLM being a strong code generator. With weaker models you may need narrower, more guided tools. With Claude Sonnet 4.5 / GPT-5, "give it a shell" is the strongest baseline.

The four-phase methodology baked into the system prompt

The CodeActAgent prompt is doing more than just listing tools — it imposes a methodology that drives autonomous behavior. Roughly:

  1. Exploration — read the repo, find relevant files, understand the surface area before doing anything. (grep, find, cat, ls.)
  2. Analysis — form a hypothesis about what to change and why. The ThinkTool exists specifically for this — it produces no observation, it just gives the model a slot to reason without committing to an action.
  3. Implementation — make the smallest change that addresses the analysis. Prefer editing existing files over creating duplicates. Don't write README.md unless asked. Don't commit secrets.
  4. Verification — re-run the tests, lints, build. Loop back to analysis if it fails. Only call finish when verification passes.

The system prompt also encodes etiquette: configure git user.name=openhands and user.email=openhands@all-hands.dev if missing, prefer str_replace_editor over rewriting whole files, ask the user (MessageAction(wait_for_response=True)) if truly blocked instead of guessing.

Why this matters for autonomy: the verification loop is the difference between an agent that hallucinates "done" and one that actually finishes. If you take only one prompting lesson from OpenHands, take this: make your agent re-run the test suite as the last action before finish. The whole "ran for 30 minutes and didn't break anything" story falls apart without it.

The actual prompt template lives in openhands/agenthub/codeact_agent/prompts/system_prompt.j2. Read it directly when designing your own — it's a cheat sheet for what works.

Build-your-own action set

If you're building your own agent, you can ship a useful prototype with three actions:

  1. RunCommand(command: str) — bash via subprocess or docker exec.
  2. EditFile(path: str, old: str, new: str) — string-replace editor (much more reliable for LLMs than full-file rewrites).
  3. Finish(summary: str) — terminate.

Add Browse and RunPython later. Keep observations boringly literal: stdout + stderr + exit code, file diff, page text. Don't pre-summarize — let the LLM see the raw world.
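A minimal sketch of that three-action protocol, assuming Pydantic for the action models and a plain subprocess call for bash; every name here is mine, not an OpenHands class:

import subprocess
from pydantic import BaseModel

class RunCommand(BaseModel):
    command: str

class EditFile(BaseModel):
    path: str
    old: str
    new: str

class Finish(BaseModel):
    summary: str

def run_command(action: RunCommand, cwd: str) -> dict:
    # Keep the observation boringly literal: stdout, stderr, exit code.
    proc = subprocess.run(action.command, shell=True, cwd=cwd,
                          capture_output=True, text=True, timeout=300)
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode}

An executor for EditFile with the exact-match rule is sketched in §13.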


4. 📑 The Event Stream — single source of truth

Every Action and Observation is wrapped as an Event with id, source ∈ {agent, user, environment}, and timestamp, then appended to the EventLog. The log is:

  • Append-only — events are never edited, only superseded by a Condensation event that marks ranges as "forgotten."
  • Persisted incrementally — each event is one JSON file; full state is rebuildable from disk.
  • Pub/sub for V0 (EventStream.subscribe(...)) or read by auxiliary services for V1.

The auxiliary services in V1 — Persistence, Stuck Detection, Visualization, Secret Registry — all read from the event log and never mutate state directly. State mutation only happens by appending a new event. That single rule is what makes the system replayable and gives you free time-travel debugging.

Build-your-own: write your event log as events.jsonl plus a state.json for cached materialized state. Don't get fancy — it's a list of dicts. The discipline of "all state changes are events" pays for itself the first time you have to debug why an agent did something weird at minute 47.
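A sketch of that file layout, assuming one JSON object per line and replay-on-load; the class and field names are illustrative:

import json
import time
from pathlib import Path

class EventLog:
    """Append-only log: one JSON object per line, state rebuilt by replay."""

    def __init__(self, path="events.jsonl"):
        self.path = Path(path)
        self.events = []
        if self.path.exists():
            # Replaying the file reconstructs the full conversation.
            self.events = [json.loads(line)
                           for line in self.path.read_text().splitlines()]

    def append(self, source: str, kind: str, payload: dict) -> dict:
        event = {"id": len(self.events), "timestamp": time.time(),
                 "source": source, "kind": kind, **payload}
        self.events.append(event)
        with self.path.open("a") as f:
            f.write(json.dumps(event) + "\n")
        return event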


5. 🐳 Sandboxing — Workspace + Action Execution Server

The Workspace is where Actions get executed. Three implementations:

| Workspace | Isolation | Use case |
| --- | --- | --- |
| LocalWorkspace | Process, host filesystem | Dev, unit tests |
| DockerWorkspace | Container | Production, untrusted code |
| RemoteAPIWorkspace | Network → Agent Server | Cloud, multi-tenant |

The clever part: the Docker container runs a small FastAPI server inside it, the Action Execution Server. The agent process on the host sends actions to it as REST POST /execute_action, and the server runs them against:

  • a persistent tmux bash session (so cd and shell history survive across actions),
  • a persistent IPython kernel (%pip install once, use forever),
  • a Playwright Chromium browser,
  • a str-replace file editor with undo.

Plus it ships VSCode Server on a sibling port so a human can attach. The agent talks to the box exactly the way a remote developer would.

# Drop in DockerWorkspace, no code change anywhere else.
from openhands.workspace import DockerWorkspace

with DockerWorkspace(host_port=8010, extra_ports=True) as ws:
    conversation = Conversation(agent=agent, workspace=ws)
    conversation.send_message("Refactor the auth module.")
    conversation.run()

Build-your-own sandbox

Minimum viable plan:

  1. Build a Docker image with bash, python, your project deps, and a tiny FastAPI server.
  2. The server has one endpoint: POST /exec taking {kind: "bash"|"python", body: "..."}.
  3. Use tmux for bash persistence (or just keep a subprocess.Popen open and write into its stdin).
  4. Mount the workspace dir as a volume.
  5. Stream output back chunked so the agent can show progress.
  6. Add a watchdog: kill anything running over N seconds.

That's ~300 LOC and gives you 80% of what DockerRuntime does. Don't try to be clever about networking/cgroups until you need to.
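Here's a sketch of step 2, the single-endpoint server, assuming FastAPI is installed in the image and the repo is mounted at /workspace. It runs a fresh subprocess per call instead of tmux or a persistent kernel, so treat it as the starting point, not the finished thing:

import subprocess
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ExecRequest(BaseModel):
    kind: str   # "bash" or "python"
    body: str

@app.post("/exec")
def exec_action(req: ExecRequest):
    if req.kind == "bash":
        cmd = req.body
    else:
        # Run Python through the interpreter; swap in a persistent kernel later.
        cmd = ["python", "-c", req.body]
    # The timeout doubles as the step-6 watchdog.
    proc = subprocess.run(cmd, shell=(req.kind == "bash"), cwd="/workspace",
                          capture_output=True, text=True, timeout=120)
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode}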


6. 🗜️ Memory — the Condenser

LLM context windows are finite. Long agent runs blow through them. OpenHands handles this with condensers, plug-in objects that decide whether to compress history before each LLM call.

Default policy (get_default_condenser):

LLMSummarizingCondenser(llm=summarizer_llm, max_size=80, keep_first=4)

Translation: when the visible event count exceeds 80, ask a (cheap) LLM to summarize all events except the first 4 (which usually contain the system prompt and original task) and the last few (recent work). Replace the middle with that summary.

Two trigger paths:

  1. Proactive β€” View.from_events() checks size on each step.
  2. Reactive β€” when the LLM raises LLMContextWindowExceedError, the agent emits a CondensationRequest event and tries again next step.

The V1 paper claims this reduces API spend by ~2× with no quality loss on benchmarks; in practice it depends heavily on your task length, but it's the difference between "agent stops at hour 1 with a context error" and "agent runs for 8 hours."

Build-your-own condenser

def maybe_condense(events, summarizer, max_size=80, keep_first=4):
    # Below the threshold: keep full history untouched.
    if len(events) <= max_size:
        return events
    # Keep the first few events (system prompt + original task) and a recent tail.
    head = events[:keep_first]
    tail = events[-(max_size // 2):]
    middle = events[keep_first:-(max_size // 2)]
    # dump() and SummaryEvent are your own serializer and event type.
    summary = summarizer.complete(
        "Summarize the following agent history concisely, preserving "
        "decisions, findings, and current state:\n" + dump(middle))
    return head + [SummaryEvent(text=summary)] + tail

Don't summarize on every step — only when over a threshold. Cache aggressively. The cheapest thing you can do is just truncate with a small head + recent tail; LLM summarization is the upgrade.


7. 🔌 Microagents / Skills — knowledge that auto-loads

This is one of the biggest autonomy multipliers and the easiest to underrate.

The problem: the system prompt is finite. You can't cram every framework's conventions, every project's quirks, every secret-handling rule, into one giant blob — it would burn tokens and confuse the model.

The solution: Skills (formerly "microagents"). Markdown files with YAML frontmatter, organized by trigger:

---
name: kubernetes
trigger:
  type: keyword
  keywords: ["kubernetes", "k8s", "kubectl"]
---

# Kubernetes guidance
- Always use `kubectl --context=<ctx>` explicitly.
- Current cluster: !`kubectl config current-context`
- Common namespaces: !`kubectl get ns -o name | head -10`

Three trigger types:

| Trigger | Activates when | Example |
| --- | --- | --- |
| None (always-on) | Every step | AGENTS.md, CLAUDE.md, .openhands/microagents/repo.md |
| KeywordTrigger | User mentions keyword | "deploy" → CD pipeline rules |
| TaskTrigger | User wants a specific workflow | "fix bug X" → bug-fix skill with structured inputs |

Plus magic features:

  • !`shell command` — run a command at activation time and inline the output (e.g. !`git branch --show-current` to inject the current branch into the prompt).
  • mcp_tools: block in the YAML — spin up an MCP server when this skill activates, register its tools dynamically.
  • Repository skills auto-discover AGENTS.md / CLAUDE.md / GEMINI.md in the repo root.

This is how an OpenHands agent dropped into your repo "knows" your conventions without you doing anything: the always-on repo skills get glued onto the system prompt at conversation start.

Skill + MCP — the dynamic-tool pattern

This is one of the more under-discussed power moves. A skill can ship its own MCP server and tools, only when activated:

---
name: postgres-readonly
trigger:
  type: keyword
  keywords: ["database", "query", "sql", "postgres"]
mcp_tools:
  mcpServers:
    pg:
      command: "uvx"
      args: ["mcp-server-postgres", "--readonly"]
      env:
        DATABASE_URL: "$DATABASE_URL"
---

# Postgres read-only access

You have read-only DB access via the pg MCP server. Schema:
!`psql -c "\dt" $DATABASE_URL`

When the user mentions "database", the skill activates: the MCP server is spawned, its tools (pg.query, pg.describe) are registered into agent.tools_map, and the rendered schema is injected into the system prompt. No tools at all when the skill isn't active — token-cheap, attack-surface-light, and self-documenting. Build this pattern and you stop bloating your global tool list.

Build-your-own skills

def load_skills(repo_path, latest_user_message):
    skills = []
    # always-on
    for f in (repo_path / ".agents" / "skills").glob("*.md"):
        meta, body = parse_frontmatter(f)
        if not meta.get("trigger"):
            skills.append(render(body))
    # keyword-triggered
    for f in (repo_path / ".agents" / "skills").glob("*.md"):
        meta, body = parse_frontmatter(f)
        kw = (meta.get("trigger") or {}).get("keywords", [])
        if any(k.lower() in latest_user_message.lower() for k in kw):
            skills.append(render(body))
    return "\n\n".join(skills)

render() does the !`...` substitution. Cap output at 50KB to prevent prompt-injection-via-huge-files.
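A sketch of what render() can look like, assuming the !`command` syntax shown above and a hard byte cap; the helper is mine, not part of OpenHands:

import re
import subprocess

MAX_SKILL_BYTES = 50_000  # cap to blunt prompt-injection-via-huge-files

def render(body: str, cwd: str = ".") -> str:
    def run(match):
        out = subprocess.run(match.group(1), shell=True, cwd=cwd,
                             capture_output=True, text=True, timeout=10)
        return out.stdout.strip()
    # Replace every !`command` with the command's output at activation time.
    rendered = re.sub(r"!`([^`]+)`", run, body)
    return rendered[:MAX_SKILL_BYTES]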


8. 🤖 Sub-agent delegation — parallel agents on a shared workspace

OpenHands V1 treats delegation as just another tool, not a special core mechanism. The delegation tool offers:

  • Spawn — register sub-agents by name, optionally with custom prompts and tool subsets. Each sub-agent inherits the parent's LLM and shares the workspace, but has its own independent EventLog.
  • Delegate — dispatch one or more named sub-agents in parallel threads. Block until all complete. Return a consolidated result.

agent.register_subagent("bash", custom_prompt="...")
agent.register_subagent("explore", tools=[GlobTool, GrepTool, FileReadTool])

Then the LLM can call the delegate tool with multiple targets:

delegate(targets=[
    {"agent": "explore", "task": "Find all usages of authMiddleware"},
    {"agent": "bash", "task": "Run the failing test and capture stderr"},
])

Why this matters for autonomy: parallel exploration kills the latency tax on long tasks. While the parent agent is reasoning, two sub-agents are simultaneously grepping and running tests. The parent gets back a summary, not an essay of grep output.

Independent context is the second insight: sub-agents don't pollute the parent's window. The parent never sees the 200 lines of grep output — only the sub-agent's distilled answer.

Build-your-own delegation

This is just concurrent.futures.ThreadPoolExecutor with a tool that takes a list of {agent_name, task} dicts. Each thread instantiates a child Conversation against the same workspace, runs to completion, returns its AgentFinishAction.outputs. Aggregate, return as one observation.

The main rule: sub-agents share the workspace but not the conversation. Critical for keeping context clean.
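A sketch under those rules, assuming the Conversation class from §13 and a final_outputs() helper; both are illustrative:

from concurrent.futures import ThreadPoolExecutor

def delegate(targets, subagents, workspace):
    """targets: list of {"agent": name, "task": str}. Returns one observation."""
    def run_one(target):
        child_agent = subagents[target["agent"]]
        # Shared workspace, fresh Conversation: the child keeps its own
        # EventLog and never pollutes the parent's context.
        conversation = Conversation(agent=child_agent, workspace=workspace)
        conversation.send_message(target["task"])
        conversation.run()
        return {"agent": target["agent"], "result": conversation.final_outputs()}

    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        results = list(pool.map(run_one, targets))
    return {"delegated_results": results}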


9. 🚨 Stuck detection — the "agent has lost the plot" alarm

Without this, agents burn money in loops. OpenHands runs a StuckDetector (docs) on the event log every step. It flags five patterns:

| Pattern | Threshold |
| --- | --- |
| Repeating action ↔ observation pairs | 4+ identical |
| Repeating action ↔ error pairs | 3+ identical |
| Agent monologue (no tool calls, no progress) | 3+ consecutive |
| Alternating action–observation ping-pong | 6+ cycles |
| Repeated context-window errors | (any) |

Comparison is semantic, not object identity: actions are matched by tool name + content (timestamps and metrics ignored). When stuck, the agent transitions to ERROR or emits a LoopRecoveryAction for the user to handle.

Build-your-own: trivial. Maintain a sliding window of the last N events. Hash (action.tool, action.body, observation.body) tuples and count repeats. When count exceeds threshold, abort or notify.
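A sketch of that sliding-window check, assuming events are plain dicts with kind/tool/body fields (the field names are illustrative):

import hashlib

def is_stuck(events, window=12, threshold=4):
    recent = events[-window:]
    counts = {}
    for action, observation in zip(recent, recent[1:]):
        if action.get("kind") != "action":
            continue
        # Compare semantically: tool + body, ignoring timestamps and ids.
        key = hashlib.sha256(
            (action.get("tool", "") + action.get("body", "")
             + observation.get("body", "")).encode()).hexdigest()
        counts[key] = counts.get(key, 0) + 1
    return any(count >= threshold for count in counts.values())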

This single 100-LOC detector saves more money than any other optimization.


10. 🔒 Security — confirmation policy + risk analyzer

OpenHands has two layers:

  1. Risk analyzer (openhands.sdk.security) — every Action gets a SecurityRisk ∈ {LOW, MEDIUM, HIGH, UNKNOWN} score. The default LLMSecurityAnalyzer adds a security_risk field to every tool's JSON schema, so the LLM scores its own action inline with no extra call. The MCP tool annotations (readOnlyHint, destructiveHint, etc.) feed in.
  2. Confirmation policy — AlwaysConfirm, NeverConfirm, or ConfirmRisky(threshold=HIGH). With ConfirmRisky, low-risk actions auto-execute; risky ones pause the conversation in WAITING_FOR_CONFIRMATION until the user approves.

Plus a Secret Registry that:

  • Stores secrets per-session, late-bound (resolved only at exec time).
  • Masks them in stdout/stderr (<secret-hidden>).
  • Encrypts at rest, supports rotation, supports callable resolvers (refresh tokens, etc.).
  • The TerminalTool scans commands for known secret keys, exports them as env vars, and replaces matches in output.

Headless mode hard-disables confirmation (it's NeverConfirm always). That means headless mode's blast radius is whatever the workspace allows — which is exactly why headless mode wants Docker.
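For your own build, the confirmation gate itself is a few lines. A sketch, with the policy names mirroring the ones above but the function being mine:

from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2

def should_pause(action_risk: Risk, policy: str, threshold: Risk = Risk.HIGH) -> bool:
    if policy == "always_confirm":
        return True
    if policy == "never_confirm":   # headless mode hard-codes this
        return False
    # confirm_risky: routine work auto-executes, destructive ops wait for a human.
    return action_risk >= threshold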


11. 🌐 The LLM layer — LiteLLM, Router, and prompt caching

OpenHands wraps everything through LiteLLM so users get 100+ providers (OpenAI, Anthropic, Bedrock, Azure, Google, local Ollama) for free.

Notable layer features:

  • Two completion modes: classic Chat Completions (function calling) and OpenAI's Responses API (for GPT-5 reasoning models). Auto-detected per-model from a model_features.py registry.
  • Reasoning/thinking blocks are first-class. Anthropic extended thinking is captured as ThinkingBlocks; OpenAI reasoning items as ReasoningItemModel. The agent persists these on ActionEvent/MessageEvent (reasoning_content, thinking_blocks) so they're replayable and can be fed back to the model on the next turn — required to maintain reasoning continuity for o-series and Sonnet thinking-mode.
  • NonNativeToolCallingMixin — for models without native function calling, it serializes tools into a structured prompt and parses responses with regex. Lets even small open-source models drive the agent loop. The pattern: detect, then either call native function-calling or fall back to prompt-and-parse — same agent code path.
  • RouterLLM — abstract base; subclass with select_llm(messages) -> str. Real example: route image-containing messages to a vision model and text-only to a cheap model. Composes recursively (a router can route to a router), so you can build cost-optimization trees. (See the sketch after this list.)
  • Prompt caching — Anthropic cache_control breakpoints inserted at stable prefix points (system prompt, tool defs, condensed history). Big savings on long conversations. (V0 had a known caching bug; V1 fixed it — verify in your own implementation that your cache hit rate is what you expect.)
  • Telemetry — every call records tokens in/out, computed cost, latency, error counts. Cost shows up at conversation.state.stats.accumulated_cost.
  • Retries with exponential backoff baked in.
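A sketch of the router idea, assuming RouterLLM is imported from the SDK, that you registered models under the keys "vision" and "cheap", and that messages arrive in OpenAI-style chat format. The subclass contract (select_llm returning a key) follows the description above; everything else is illustrative:

class VisionRouter(RouterLLM):
    # Route image-bearing messages to a vision model, everything else to a cheap one.
    def select_llm(self, messages) -> str:
        for message in messages:
            content = message.get("content")
            if isinstance(content, list) and any(
                    part.get("type") == "image_url" for part in content):
                return "vision"
        return "cheap"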

Concrete cost economics

Two data points worth knowing:

  • Original OpenDevin paper: CodeActAgent v1.8 on Claude 3.5 Sonnet hit 26% on SWE-Bench Lite at $1.10 per instance. That's the cost-per-task baseline for a previous-generation model on a hard benchmark.
  • V1 paper: the default condenser claims to cut API spend by ~2× on long sessions with no measurable quality loss.

For your own builds, expect order-of-magnitude:

  • Trivial task (few file edits, no tests): $0.05–$0.30 per run on a frontier model.
  • SWE-Bench-style real fix (explore + analyze + edit + verify): $0.50–$3 per task.
  • Multi-hour autonomous run (resolver mode on a complex issue): $5–$30, easily more without a condenser.

Cost ceilings to set on day one: MAX_ITERATIONS (default ~100 in OpenHands), LLM_NUM_RETRIES (default 8), and a hard accumulated-cost cutoff that aborts the conversation. Don't ship a headless agent without all three.

Build-your-own

Don't write the LLM client yourself — depend on LiteLLM. Add three things on top:

  1. A retry wrapper.
  2. A cost tracker (pull prompt_tokens/completion_tokens off the response, multiply by your rate card; if the model returns reasoning tokens, account for them separately — they're often billed differently).
  3. A response classifier: did the model call a tool, return text, return reasoning only, or return nothing? Branch explicitly.
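A sketch of items 1 and 2 built directly on litellm.completion; the rate-card numbers are placeholders, and a real version would look prices up per model instead of hard-coding them:

import time
import litellm

RATE_USD_PER_1K = {"prompt": 0.003, "completion": 0.015}  # placeholder pricing

def complete_with_retry(model, messages, tools=None, max_retries=8):
    for attempt in range(max_retries):
        try:
            response = litellm.completion(model=model, messages=messages, tools=tools)
            usage = response.usage
            cost_usd = ((usage.prompt_tokens / 1000) * RATE_USD_PER_1K["prompt"]
                        + (usage.completion_tokens / 1000) * RATE_USD_PER_1K["completion"])
            return response, cost_usd
        except Exception:
            # Exponential backoff before retrying transient failures.
            time.sleep(2 ** attempt)
    raise RuntimeError("LLM call failed after retries")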

12. 🚀 What makes OpenHands highly autonomous — synthesis

Twelve concrete mechanisms. If you want your agent to be autonomous, you need most of these:

  1. A loop with no human in the middle by default. Conversation.run() doesn't ask permission — it runs until FinishAction, stuck detection, budget exhaustion, or explicit pause. Headless mode hard-codes this.
  2. Self-correction via observations. Every error becomes an Observation in the next prompt. The LLM literally sees its own stderr and adjusts.
  3. Long-horizon memory. The condenser lets sessions exceed the context window indefinitely. A persistent EventLog means full replay even after compression.
  4. Tool diversity = "anything a developer can do." Bash + Python + browser + file edit + MCP. The agent isn't shoehorned into 5 narrow operations.
  5. Microagents/Skills. Conventions and project knowledge load automatically when triggered. The agent "knows" your project the moment it lands in the repo.
  6. Sub-agent delegation. Parallel exploration with isolated contexts. Big tasks decompose without the parent context blowing up.
  7. Stuck detection. Five semantic patterns, every step. Pathological loops die early.
  8. Budget controls. Max iterations, max retries, accumulated-cost tracking. Hard ceilings on runaway spend.
  9. Risk-aware confirmation policy. ConfirmRisky lets the agent fly through routine work and pause only on destructive ops.
  10. Replayable event log. When the agent screws up, you can rewind to any event and try a different model or prompt. Debug loops are short.
  11. Same code, multiple isolation levels. LocalWorkspace for dev, DockerWorkspace for prod. No code changes — meaning you actually use isolation in prod, instead of half-disabling it for "dev convenience."
  12. End-to-end resolver mode. The GitHub Action wires GitHub issue → sandbox → CodeActAgent → PR. No human in the loop. This is the maximum-autonomy configuration in production.

The pattern: autonomy is not a single feature. It's the union of "can keep going" (memory, budget), "can recover" (observations, stuck detection), "knows what to do" (skills), and "won't blow up the world" (sandbox, confirmation policy). Skip any of these and the agent is fragile.


13. 🏗️ Building your own — minimum viable autonomous agent

Here's a concrete, achievable plan to build a clone with the same shape. Roughly 2,000 LOC of Python.

Skeleton

your_agent/
    agent.py          # Agent class with .step()
    conversation.py   # Conversation runner + EventLog
    events.py         # Event/Action/Observation Pydantic models
    tools/
        bash.py
        edit.py
        finish.py
    workspace/
        local.py
        docker.py
    llm.py            # LiteLLM wrapper + RouterLLM
    condenser.py      # Threshold-based summarizer
    skills.py         # Markdown skill loader
    stuck.py          # Sliding-window detector

Build order (each step takes ~half a day)

  1. Events + EventLog. Define Event, ActionEvent, ObservationEvent, MessageEvent. Append-only list, JSON-serializable, persistable.
  2. Tools: bash + finish. Two Pydantic action models, two executor classes. Use subprocess.Popen for bash; keep stdin open for persistence.
  3. LLM wrapper. LiteLLM call + retry + cost tracking. Function-calling tool format.
  4. Agent.step(). Build messages from events, call LLM, classify response, dispatch. Copy the 5-phase shape from §2.
  5. Conversation.run(). while not finished: agent.step().
  6. LocalWorkspace. Dirt simple: cwd plus a tool registry.
  7. First end-to-end test. Give it "create a hello world Python script and run it". It should succeed.
  8. File edit tool. str_replace_editor semantics — read, then replace exact string. Refuse if the string doesn't appear or appears more than once. (See the sketch below.)
  9. DockerWorkspace. Build a small image. Run a FastAPI server inside that exposes POST /exec. Forward bash and file-edit actions over HTTP.
  10. Condenser. Threshold check + LLM-summarize-the-middle. Cache summaries.
  11. Skills loader. Parse .agents/skills/*.md, evaluate triggers, inject into system prompt.
  12. Stuck detector. Sliding window + hash compare. Halt on repeats.
  13. Security gate. Add a risk field to tool schemas; pause when risk ≥ HIGH unless the user pre-approved.
  14. Sub-agent delegation tool. ThreadPoolExecutor over child Conversations sharing the workspace.

Stop when you have steps 1–7 working end-to-end on a real task. That's already a usable agent. Steps 8–14 are the "make it autonomous for hours" upgrades.
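For step 8, a sketch of the string-replace editor with the "exactly once" rule; the names are illustrative:

from pathlib import Path

def edit_file(path: str, old: str, new: str, workspace: str) -> dict:
    # str_replace semantics: the old string must occur exactly once.
    target = Path(workspace) / path
    text = target.read_text()
    count = text.count(old)
    if count == 0:
        return {"error": "old string not found; re-read the file and retry"}
    if count > 1:
        return {"error": f"old string occurs {count} times; add more surrounding context"}
    target.write_text(text.replace(old, new, 1))
    return {"edited": str(target), "replaced": old[:80]}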

Build-order rules of thumb

  • Don't build a UI first. A CLI that prints events as they happen is enough to develop against. Headless mode is the actual product anyway.
  • Don't write a controller class. It's just while not done: agent.step(). Adding ceremony hurts.
  • Don't build a mock workspace. Use the real one (subprocess + cwd) from day 1. Mocks lie.
  • Do log every prompt and response to disk. When the agent does something weird, you'll need to see exactly what it saw.
  • Do use Pydantic for every event. Schemas catch 80% of bugs at the boundary.
  • Do measure tokens and cost from step 3. Otherwise you'll get a $400 bill the first time you run an overnight loop.

What to skip in v1

  • Browser automation (real browsing is hard; cover 90% of cases with curl/Python requests from bash).
  • MCP integration (YAGNI until your tools outgrow the built-ins).
  • Multi-agent delegation (single agent + good condenser handles surprisingly long tasks).
  • Streaming. (Token streaming is a UI feature, not an autonomy feature.)

14. 🏭 Production features OpenHands ships that you'll want eventually

| Feature | What it gives you | When to add |
| --- | --- | --- |
| Headless mode (--headless -t "task") | Run from CI, no UI | First prod use |
| Resolver / GitHub Action | Issue → PR autonomously | When you trust the agent |
| GUI server (FastAPI + WebSocket + React) | Multi-user web UI | When non-devs use it |
| Cloud / Kubernetes deploy | Multi-tenant, RBAC, integrations | When you have users |
| Evaluation harness (openhands/core/main.py) | SWE-Bench, GAIA runs | Before claiming numbers |
| VSCode-in-the-sandbox | Human can take over mid-task | When agent stalls on hard tasks |
| Microagent registry | Share skills across teams | When you have >5 projects |
| Prompt caching | 30–80% cost reduction | When bills hurt |

A reasonable path: ship CLI + headless first, add GUI when the team complains, add the resolver-style "one-shot from a ticket" mode last because it requires high trust.


15. ⚠️ Honest pitfalls and gotchas

A guide that only lists strengths is a brochure. Things to know:

  • The default condenser is aggressive (max_size=80, keep_first=4). Long sessions trigger it often. Real cost savings vary by workload.
  • Stuck detection thresholds are conservative (4+ identical pairs). An agent can burn meaningful tokens in a near-loop before being killed. Tune thresholds for your tolerance.
  • Headless mode = always-approve = blast radius is whatever the workspace allows. Always use Docker in headless. Don't mount more than the working directory.
  • 77% on SWE-Bench Verified is Claude Sonnet 4.5 dependent. Cheaper models drop hard (Qwen3 Coder 480B was 65%; smaller models do worse). The architecture isn't magic.
  • The V0 → V1 split (Nov 2025) means a lot of public material describes a different codebase. The original arXiv paper describes V0; the new V1 paper and the SDK repo describe the architecture this guide focuses on. When you read OpenHands content, check the date.
  • MCP integration is powerful but adds attack surface. External MCP servers run with the agent's privileges. Treat them like dependencies β€” pin and audit.
  • Browsing is the flakiest tool. Site changes, JS-heavy pages, and bot detection make it unreliable. Reach for curl or library-level integrations whenever you can.

16. 📚 Reading list (curated, in order)

  1. OpenHands V1 SDK paper (arXiv 2511.03690) — the canonical architecture writeup. Read this first.
  2. Original OpenDevin paper (arXiv 2407.16741) — context on CodeAct and the original design.
  3. Agent.step() source in openhands-sdk/openhands/sdk/agent/agent.py — the loop, in 100 lines.
  4. Architecture overview docs — how the pieces fit.
  5. Skill format — the autonomy multiplier.
  6. Stuck detector guide — the loop-prevention patterns.
  7. Sub-agent delegation guide — parallel agents.
  8. CodeActAgent system prompt — actual prompt text.
  9. Headless mode docs — the autonomous configuration.
  10. GitHub Action / Resolver — issue → PR pipeline.

17. 🎯 Closing — the mental model that makes everything click

The whole project rests on one idea: an autonomous agent is a function from event history to next event, run in a loop. Every architectural choice in OpenHands is downstream of that:

  • "Function" → stateless Agent.
  • "Event history" → append-only EventLog.
  • "Next event" → Action, executed by Workspace, producing Observation.
  • "Run in a loop" → Conversation, until Finish or stuck.

Everything else — condensers, skills, sub-agents, security analyzers — is a hook into that one loop. There is no big design. There is one tight kernel and a lot of small components hanging off it.

Build the kernel first. Make sure it actually closes the loop on observations. Then earn each of the 12 autonomy features by removing a class of failure you observed in practice. That's the path.


If you found this helpful, let me know by leaving a 👍 or a comment! And if you think this post could help someone, feel free to share it. Thank you very much! 😃
