<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Evan Green</title>
    <description>The latest articles on DEV Community by Evan Green (@tacshade).</description>
    <link>https://dev.to/tacshade</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3785580%2Fe6dcafbb-c385-4e04-9e48-23bf709f4b5e.png</url>
      <title>DEV Community: Evan Green</title>
      <link>https://dev.to/tacshade</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tacshade"/>
    <language>en</language>
    <item>
      <title>Why AI Agents Fall Apart on Real Work</title>
      <dc:creator>Evan Green</dc:creator>
      <pubDate>Wed, 18 Mar 2026 12:36:35 +0000</pubDate>
      <link>https://dev.to/tacshade/why-ai-agents-fall-apart-on-real-work-457p</link>
      <guid>https://dev.to/tacshade/why-ai-agents-fall-apart-on-real-work-457p</guid>
      <description>&lt;p&gt;I've been learning the hard way that building real autonomous AI systems has very little to do with writing better prompts.&lt;/p&gt;

&lt;p&gt;I am building &lt;a href="https://github.com/frumu-ai/tandem" rel="noopener noreferrer"&gt;Tandem&lt;/a&gt;, an open-source autonomous execution engine designed for long-running work. The goal is simple: take on a mission, move through structured tasks, and only advance when the work is actually complete and verified.&lt;/p&gt;

&lt;p&gt;That sounds simple. In practice, it immediately exposes the gap between what AI demos suggest and what real autonomous execution actually looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  The promise
&lt;/h2&gt;

&lt;p&gt;At a high level, the vision is straightforward. Give an LLM a problem, let it break the work into tasks, execute those tasks in order, retry when something fails, and keep going until the result is done and verified.&lt;/p&gt;

&lt;p&gt;For a research workflow inside Tandem, that means discovering relevant files, reading concrete source material, gathering external evidence from the web, writing an artifact grounded in what was actually found, validating that the output meets coverage requirements, and retrying with targeted guidance if it does not. This is the kind of behavior people imagine when they talk about autonomous agents.&lt;/p&gt;

&lt;p&gt;The problem is that LLMs do not naturally behave like dependable operators.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually happens
&lt;/h2&gt;

&lt;p&gt;Once you start running long task chains, the cracks show fast.&lt;/p&gt;

&lt;p&gt;A research node in Tandem was offered four tools: &lt;code&gt;glob&lt;/code&gt;, &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;websearch&lt;/code&gt;, and &lt;code&gt;write&lt;/code&gt;. It executed two of them. Then it produced a blocked handoff artifact saying, in effect, that it did not have access to the discovery and reading tools.&lt;/p&gt;

&lt;p&gt;The telemetry for that same run showed the tools were offered and that &lt;code&gt;glob&lt;/code&gt; had successfully executed. The model used &lt;code&gt;glob&lt;/code&gt;, found nothing worth following up on from its own perspective, and wrote a blocked brief rather than doing the required reading and web research.&lt;/p&gt;

&lt;p&gt;That is not a loud failure.&lt;/p&gt;

&lt;p&gt;The artifact exists. The file is written. It looks like work was done. The system moved on.&lt;/p&gt;

&lt;p&gt;That is the most dangerous failure mode: output that looks like completion but is not actually usable. The model skipped required tools, claimed they were unavailable when they were not, and produced something plausible enough to pass casual inspection. It took the cheapest compliant-looking exit rather than doing the real work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo79dalyv9c44bm1rn139.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo79dalyv9c44bm1rn139.png" alt="Failure flow"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The first bad assumption
&lt;/h2&gt;

&lt;p&gt;The first instinct is to prompt harder. Add more instructions, be more explicit about required tools, repeat the rules, tighten the format.&lt;/p&gt;

&lt;p&gt;That helps a little. It does not solve the real problem.&lt;/p&gt;

&lt;p&gt;The model is a probabilistic system predicting the next useful action. You can improve compliance with better wording, but you cannot build reliability on prompt wording alone. The model will still find cheaper paths through the task that satisfy the letter of the prompt without doing the actual work. That was one of the biggest lessons for me building Tandem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the real work moved
&lt;/h2&gt;

&lt;p&gt;Once I stopped treating the prompt as the main control surface, the design got clearer. Tandem's runtime had to own what actually matters: what task is active, what tools are required, what evidence is needed before output is accepted. And on failure: what counts as a valid result versus a premature exit, what a retry should look like when required behavior was skipped, and when the system is allowed to move on.&lt;/p&gt;

&lt;p&gt;That means building discipline into engine state rather than leaving it inside the model's temporary reasoning. Tandem treats autonomous execution as a distributed systems problem, not a chat problem. Once you do that, the runtime becomes less like a wrapper around a chatbot and more like an execution system that happens to use an LLM inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why state matters so much
&lt;/h2&gt;

&lt;p&gt;This is where a lot of agent systems start to get fragile.&lt;/p&gt;

&lt;p&gt;If the conversation is the main source of truth, long-running work becomes unstable quickly. Context grows, summaries get lossy, retries become fuzzy. It becomes difficult to know what is still pending, what was already attempted, what failed, and what is safe to retry.&lt;/p&gt;

&lt;p&gt;Tandem's engine holds the durable truth: run status, per-node validation outcomes, what tools were offered versus actually executed, what evidence was gathered, repair attempt counters, and replayable event history. Nodes move through explicit states (&lt;code&gt;passed&lt;/code&gt;, &lt;code&gt;needs_repair&lt;/code&gt;, &lt;code&gt;blocked&lt;/code&gt;) rather than just "done" or "failed." The &lt;code&gt;needs_repair&lt;/code&gt; state means the node can still succeed. &lt;code&gt;blocked&lt;/code&gt; means repair budget is exhausted or the failure class is terminal. That three-way distinction changes what the runtime can do when something goes wrong.&lt;/p&gt;
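
&lt;p&gt;A minimal sketch of that three-way distinction (the state names mirror the ones above, but the code itself is illustrative, not Tandem's actual engine):&lt;/p&gt;

```python
# Illustrative only: a three-way node outcome instead of pass/fail.
# Names mirror the states described in the post; the logic is a sketch.
from enum import Enum

class NodeState(Enum):
    PASSED = "passed"
    NEEDS_REPAIR = "needs_repair"
    BLOCKED = "blocked"

def classify(validation_ok, terminal_failure, repair_attempts_left):
    """Map a validation outcome onto the three-way state."""
    if validation_ok:
        return NodeState.PASSED
    # A failed validation is only terminal when the repair budget is
    # exhausted or the failure class cannot be fixed by retrying.
    if terminal_failure or repair_attempts_left == 0:
        return NodeState.BLOCKED
    return NodeState.NEEDS_REPAIR
```

&lt;p&gt;The point of the middle state is that the runtime keeps the option to act: a &lt;code&gt;needs_repair&lt;/code&gt; node can be rerun with targeted guidance instead of being written off.&lt;/p&gt;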

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5m81wye1vxempda9coq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5m81wye1vxempda9coq.png" alt="Tandem state machine"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once that state exists outside the model, the system becomes much easier to reason about and audit.&lt;/p&gt;




&lt;h2&gt;
  
  
  You cannot fix what you cannot see
&lt;/h2&gt;

&lt;p&gt;I would never have identified the specific failure described above without observability. And I want to be precise about what that means here, because this is not about adding a debug panel to a frontend.&lt;/p&gt;

&lt;p&gt;Tandem has a structured observability layer built into the engine itself. Every significant event emits a typed JSONL record to a dedicated &lt;code&gt;tandem.obs&lt;/code&gt; tracing target, carrying a correlation ID, session ID, run ID, component name, event type, and status. The engine does not just log free text. It emits structured, queryable, component-tagged events as a first-class architectural concern, with a redaction policy to ensure sensitive content never leaks into traces.&lt;/p&gt;
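
&lt;p&gt;As a rough sketch, one such record might be assembled as follows. The field names follow the description above, but the function and the redaction policy are assumptions for illustration, not Tandem's real &lt;code&gt;tandem.obs&lt;/code&gt; schema:&lt;/p&gt;

```python
# Hypothetical sketch of building one structured JSONL observability
# record with a correlation ID and a simple redaction policy.
import json
import time
import uuid

REDACTED_KEYS = {"prompt", "api_key"}  # example policy, not Tandem's

def obs_event(component, event_type, status, run_id, session_id, **fields):
    """Return one JSONL line describing a runtime event."""
    # Redact sensitive payload fields before they reach the trace.
    safe = {k: ("[redacted]" if k in REDACTED_KEYS else v)
            for k, v in fields.items()}
    record = {
        "ts": time.time(),
        "correlation_id": str(uuid.uuid4()),
        "session_id": session_id,
        "run_id": run_id,
        "component": component,
        "event": event_type,
        "status": status,
        **safe,
    }
    return json.dumps(record)
```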

&lt;p&gt;That foundation is what makes everything else possible. The per-node state tracking, the tools-offered versus tools-executed comparison, the validator reason, the blocking classification, the repair attempt counter — none of that could be surfaced anywhere if it had not been deliberately captured inside the engine first as durable, typed state. The frontend is just the last step in that chain. The hard work is in making the engine know and record these things at all.&lt;/p&gt;

&lt;p&gt;Without it, I would have seen "research node failed" and started guessing. Maybe the prompt was wrong. Maybe the model needed more context. Maybe it was a configuration issue. There would have been no way to know.&lt;/p&gt;

&lt;p&gt;With it, I could say with precision that the model was offered &lt;code&gt;glob&lt;/code&gt;, &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;websearch&lt;/code&gt;, and &lt;code&gt;write&lt;/code&gt;, used only two of them, and then produced an artifact claiming the others were unavailable. The telemetry and the self-report were directly contradicting each other, and I could see both in the same view.&lt;/p&gt;

&lt;p&gt;Here is what that mismatch actually looks like in the Tandem runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node blocked: research-brief
research completed without concrete file reads or required source coverage

offered tools: glob, read, websearch, write
executed tools: glob, write

unmet requirements:
  no_concrete_reads
  citations_missing
  files_reviewed_not_backed_by_read
  web_sources_reviewed_missing
  missing_successful_web_research

web research was not used

blocking classification: tool_available_but_not_used
failure kind: research_missing_reads
repair attempts left: 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model's own output began with "Blocked: I do not have access in this run to the required discovery and reading tools." The telemetry shows all four tools were offered and two were successfully executed. The model chose not to use &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;websearch&lt;/code&gt;, then reported them as unavailable. Without structured per-node state capturing both sides, there would be no way to distinguish a genuine tool failure from a model that simply chose the cheapest exit.&lt;/p&gt;

&lt;p&gt;The honest lesson is that observability is not a debugging convenience. It is what makes diagnosis possible at all. Every failure looks the same from the outside. The detailed per-node state in Tandem is what turned "the agent gave up" into "the model ignored available tools and the runtime accepted it." Those are very different problems with very different solutions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What guardrails really are
&lt;/h2&gt;

&lt;p&gt;Guardrails are often described like they are just safety prompts or refusal rules. That is not how I think about them in Tandem anymore.&lt;/p&gt;

&lt;p&gt;In a serious autonomous system, guardrails are operational controls. They determine whether a task may proceed, whether the model must use a specific tool before writing output, whether an output is incomplete relative to what was required, and how many repair attempts are allowed before a node is terminal.&lt;/p&gt;

&lt;p&gt;The most important check in Tandem's research validator is whether the output claims tool unavailability that contradicts the telemetry. When a model writes "I did not have access to the required tools" but the run shows the tools were offered and partially used, that is not an acceptable terminal state. The runtime has to treat it as a repair case, not a valid blocked output.&lt;/p&gt;
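
&lt;p&gt;At its core, that check is a set comparison between the telemetry and the model's self-report. This is an illustrative sketch, not the validator's actual implementation:&lt;/p&gt;

```python
# Hypothetical check: compare run telemetry against the model's
# self-reported tool unavailability. Names are illustrative.
def contradicts_telemetry(offered, executed, claimed_unavailable):
    """Return the tools the model claims were unavailable even though
    telemetry shows they were offered (and therefore usable)."""
    return set(claimed_unavailable).intersection(set(offered))

# The run from the post:
offered = {"glob", "read", "websearch", "write"}
executed = {"glob", "write"}
claimed = {"read", "websearch"}  # from the model's blocked artifact
# A non-empty result means the blocked output is a repair case,
# not an acceptable terminal state.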




&lt;h2&gt;
  
  
  Verification changes everything
&lt;/h2&gt;

&lt;p&gt;One of the most important shifts is moving from "did the model respond?" to "did the system verify the result?" Those are not the same thing.&lt;/p&gt;

&lt;p&gt;Tandem has to care whether required tool use actually occurred, not just whether tool calls were made. It has to check whether the output is grounded in gathered evidence, whether source coverage requirements were met, and whether the model's self-report matches the actual run telemetry.&lt;/p&gt;

&lt;p&gt;This is the honest assessment of where I am right now. The observability is much better than it was. I can say with precision what failed and why. But the engine still allows the model to reach a bad terminal state too early. Verification happens after the artifact is written, rather than preventing the artifact from being written prematurely. That is the remaining gap, and it is significant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why retries are not enough by themselves
&lt;/h2&gt;

&lt;p&gt;Retries help, but only if the runtime understands what failed and forces a meaningfully different attempt.&lt;/p&gt;

&lt;p&gt;Tandem's current retry mechanism injects a runtime-owned repair brief into the next attempt. That brief summarizes the previous validator reason, the specific unmet requirements, the blocking classification, required next tool actions, a comparison of tools offered versus executed, files that were discovered but not read, and repair budget remaining. That is substantially better than blindly rerunning the same prompt.&lt;/p&gt;
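
&lt;p&gt;A hedged sketch of what assembling such a brief could look like. The field names are assumptions for illustration, not Tandem's actual format:&lt;/p&gt;

```python
# Illustrative: a runtime-owned repair brief assembled from validator
# output, injected into the next attempt instead of a blind rerun.
def build_repair_brief(validator_reason, unmet, classification,
                       offered, executed, unread_files, attempts_left):
    # Tools that were offered but never executed become explicit
    # required actions for the retry.
    missing_tools = sorted(set(offered) - set(executed))
    return {
        "previous_failure": validator_reason,
        "unmet_requirements": list(unmet),
        "blocking_classification": classification,
        "required_next_actions": [f"use tool: {t}" for t in missing_tools],
        "files_discovered_not_read": list(unread_files),
        "repair_attempts_left": attempts_left,
    }
```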

&lt;p&gt;But I have seen the model still take the same cheap exit path even with that guidance injected. That is the key lesson: retry quality depends on how much the runtime can constrain the second attempt, not just how much information it provides. A well-described repair brief tells the model what to do. It does not prevent the model from choosing not to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uc2768pv73oz4oafzo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uc2768pv73oz4oafzo3.png" alt="Tandem Repair Loop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step in Tandem is a stronger pre-finalization gate. If required tools were offered, were not used, and no actual tool failure occurred, the node cannot produce a terminal result yet. It must be rerun on a forced repair path with those tools required, not just suggested.&lt;/p&gt;




&lt;h2&gt;
  
  
  The generalization gap
&lt;/h2&gt;

&lt;p&gt;As I added more enforcement to Tandem's research workflow, a second problem emerged: the repair runtime is becoming genuinely generic, but the enforcement logic is not.&lt;/p&gt;

&lt;p&gt;Things like &lt;code&gt;needs_repair&lt;/code&gt; state, retry metadata, repair guidance format, context-run task projection, and API repair summaries are all reusable across workflow types. But the actual behavioral rules (must use &lt;code&gt;read&lt;/code&gt; before writing, must include citations, must use &lt;code&gt;websearch&lt;/code&gt;) are still embedded directly in the engine as research-specific knowledge. New workflow types in Tandem do not automatically get the same strong runtime behavior unless they happen to align with the engine's built-in validator patterns.&lt;/p&gt;

&lt;p&gt;The next architectural step is moving workflow-specific success and repair rules out of ad hoc engine code and into declarative node contracts, where each node declares its required tool classes, evidence classes, retryable failure classes, and pre-finalization gates, and Tandem enforces those generically. I built repair visibility faster than I built workflow semantics. That is the gap that needs to close.&lt;/p&gt;
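
&lt;p&gt;To make that concrete, a node contract might look something like the following. This is speculative: it sketches the direction described above, not shipped Tandem code:&lt;/p&gt;

```python
# Speculative sketch of a declarative node contract plus a generic
# pre-finalization gate that enforces it for any workflow type.
from dataclasses import dataclass, field

@dataclass
class NodeContract:
    required_tools: set
    required_evidence: set
    retryable_failures: set = field(default_factory=set)

def gate(contract, executed_tools, evidence):
    """Generic gate: list everything still blocking finalization."""
    problems = []
    for tool in sorted(contract.required_tools - set(executed_tools)):
        problems.append(f"required tool not used: {tool}")
    for ev in sorted(contract.required_evidence - set(evidence)):
        problems.append(f"missing evidence: {ev}")
    return problems  # an empty list means the node may finalize
```

&lt;p&gt;The behavioral rules then live in the contract, not in engine code, and every workflow type gets the same enforcement for free.&lt;/p&gt;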




&lt;h2&gt;
  
  
  The bigger lesson
&lt;/h2&gt;

&lt;p&gt;The deeper I get into building Tandem, the less I think the future of autonomous systems is about smarter prompting. It is about building runtimes that can make model behavior usable: explicit per-node state, controlled execution with required evidence gates, validations with classified outcomes rather than just pass/fail, and retries with structured repair context rather than reruns.&lt;/p&gt;

&lt;p&gt;Better models reduce friction. But better models do not remove the need for structure. If anything, stronger models make it more tempting to trust output that still needs to be verified. A confident model producing a well-formatted blocked artifact still failed the mission. Tandem has to know that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where this leads
&lt;/h2&gt;

&lt;p&gt;I still want the same end state I started with: an engine that can take on long-running work, manage its own task list, recover from failure, and finish what it starts. That is what Tandem is being built toward.&lt;/p&gt;

&lt;p&gt;But I no longer think that comes from giving the model enough instructions and hoping it behaves. It comes from building the surrounding runtime carefully enough that the model can only succeed inside a system that knows what success actually means, and that refuses to accept convincing-looking failure as a terminal result.&lt;/p&gt;

&lt;p&gt;That is a very different mindset from most of the agent hype. And I think it is the only one that will hold up when these systems move from demos into real work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The hardest part of autonomous AI is not getting the model to sound intelligent. The hardest part is building a runtime that can keep a non-deterministic model inside reliable execution boundaries and tell the difference between a model that genuinely could not complete the work and a model that simply chose not to try.&lt;/p&gt;

&lt;p&gt;That distinction is the whole game. And the more time I spend on it, the more convinced I am that the future of agent systems belongs to teams that treat autonomous execution as a systems problem, not a prompting problem.&lt;/p&gt;

&lt;p&gt;I fell into most of the pitfalls described here before I understood what was actually happening. If this saves someone else from the same detours, that matters as much to me as shipping the engine itself.&lt;/p&gt;




&lt;p&gt;If you want to follow along as I build Tandem into a genuinely autonomous execution engine, the project is open source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/frumu-ai/tandem" rel="noopener noreferrer"&gt;github.com/frumu-ai/tandem&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Was Bored and Built an AI Social Network. The Results Were Hilarious.</title>
      <dc:creator>Evan Green</dc:creator>
      <pubDate>Thu, 12 Mar 2026 12:43:36 +0000</pubDate>
      <link>https://dev.to/tacshade/i-was-bored-and-built-an-ai-social-network-the-results-were-hilarious-39dd</link>
      <guid>https://dev.to/tacshade/i-was-bored-and-built-an-ai-social-network-the-results-were-hilarious-39dd</guid>
      <description>&lt;p&gt;Most AI demos are one good prompt wearing a fake mustache.&lt;/p&gt;

&lt;p&gt;They look convincing right up until you ask them to do anything annoying, repetitive, stateful, or public.&lt;/p&gt;

&lt;p&gt;So instead of making another neat little demo, I built &lt;strong&gt;bots.frumu.ai&lt;/strong&gt;: an AI social network where bot personalities post to a live feed, run recurring shows, argue with each other, and generate replayable content on a schedule.&lt;/p&gt;

&lt;p&gt;If this project makes no sense yet, this is the fastest explanation:&lt;/p&gt;

&lt;p&gt;
  &lt;iframe src="https://www.youtube.com/embed/8MsepYw2NZw"&gt;&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The original goal was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Stop talking about orchestration quality and put it somewhere people can actually watch it succeed or fail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That made the project useful almost immediately.&lt;/p&gt;

&lt;p&gt;It also made it funny almost immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea
&lt;/h2&gt;

&lt;p&gt;I wanted a proof-of-work project for &lt;strong&gt;Tandem&lt;/strong&gt;, my orchestration engine.&lt;/p&gt;

&lt;p&gt;Not a benchmark. Not a diagram. Not a "look, it can call tools" demo.&lt;/p&gt;

&lt;p&gt;A real system with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recurring jobs&lt;/li&gt;
&lt;li&gt;overlapping workflows&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;recoverable failures&lt;/li&gt;
&lt;li&gt;structured outputs&lt;/li&gt;
&lt;li&gt;public artifacts&lt;/li&gt;
&lt;li&gt;enough chaos that bad orchestration becomes obvious fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An AI social network is weirdly good at this.&lt;/p&gt;

&lt;p&gt;If the system gets repetitive, you see it.&lt;/p&gt;

&lt;p&gt;If timing breaks, you see it.&lt;/p&gt;

&lt;p&gt;If two bots accidentally converge on the same personality, you definitely see it.&lt;/p&gt;

&lt;p&gt;And if a "serious" debate between AI characters turns into nonsense, that is both a product bug and unexpectedly good content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1w6l9kfve10id7envb9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1w6l9kfve10id7envb9r.png" alt="Tandem Social"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built this instead of a normal agent demo
&lt;/h2&gt;

&lt;p&gt;A lot of agent demos quietly stop right before the hard part.&lt;/p&gt;

&lt;p&gt;They show one successful interaction, maybe one tool call, maybe a nice streamed response, and then everybody goes home pretending the system is production-ready.&lt;/p&gt;

&lt;p&gt;The hard part starts when the thing has to keep working.&lt;/p&gt;

&lt;p&gt;Can it run every day?&lt;/p&gt;

&lt;p&gt;Can multiple workflows overlap without stepping on each other?&lt;/p&gt;

&lt;p&gt;Can it recover when one step fails halfway through?&lt;/p&gt;

&lt;p&gt;Can it produce artifacts that still make sense in public, not just text that looked fine in a terminal once?&lt;/p&gt;

&lt;p&gt;That is the kind of pressure I wanted.&lt;/p&gt;

&lt;p&gt;Because if Tandem is actually useful, it should survive more than one pretty demo prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Tandem is actually doing here
&lt;/h2&gt;

&lt;p&gt;Tandem is not the social product. It is the runtime underneath it.&lt;/p&gt;

&lt;p&gt;The social layer owns things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;personas&lt;/li&gt;
&lt;li&gt;prompts&lt;/li&gt;
&lt;li&gt;formats&lt;/li&gt;
&lt;li&gt;channels&lt;/li&gt;
&lt;li&gt;content rules&lt;/li&gt;
&lt;li&gt;media generation&lt;/li&gt;
&lt;li&gt;playback and presentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tandem owns the orchestration layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;execution state&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;recovery&lt;/li&gt;
&lt;li&gt;replay&lt;/li&gt;
&lt;li&gt;coordination across workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation matters a lot.&lt;/p&gt;

&lt;p&gt;I did not want every new content format to turn into another pile of app-specific cron jobs, queues, retry logic, and glue code held together by hope.&lt;/p&gt;

&lt;h2&gt;
  
  
  The funniest part: the bots are public
&lt;/h2&gt;

&lt;p&gt;This is where the project stopped being a dry infrastructure exercise and started becoming entertaining.&lt;/p&gt;

&lt;p&gt;Private systems can hide a lot.&lt;/p&gt;

&lt;p&gt;A public feed cannot.&lt;/p&gt;

&lt;p&gt;If a bot starts repeating itself, everyone can see it.&lt;/p&gt;

&lt;p&gt;If a debate goes off the rails, everyone can see it.&lt;/p&gt;

&lt;p&gt;If one character suddenly sounds suspiciously like another character, everyone can see that too.&lt;/p&gt;

&lt;p&gt;It turns out "public AI weirdness" is a pretty effective testing strategy.&lt;/p&gt;

&lt;p&gt;It is also a decent content strategy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuu9v5f61icn3yp5kwbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuu9v5f61icn3yp5kwbk.png" alt="AI Shows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What building it taught me
&lt;/h2&gt;

&lt;p&gt;The biggest lesson was this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A prompt working once tells you almost nothing. A system working repeatedly tells you everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The failures that mattered were usually not "the model is bad."&lt;/p&gt;

&lt;p&gt;They were things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;two workflows reaching almost the same intent from different paths&lt;/li&gt;
&lt;li&gt;outputs that were individually fine but late relative to the live context&lt;/li&gt;
&lt;li&gt;retries that were technically correct but operationally annoying&lt;/li&gt;
&lt;li&gt;long-running media flows turning small faults into larger messes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly why I like this project as a proof surface.&lt;/p&gt;

&lt;p&gt;It makes orchestration quality observable.&lt;/p&gt;

&lt;p&gt;Not theoretical. Not hidden in architecture slides. Observable.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is proof of work, not just a joke
&lt;/h2&gt;

&lt;p&gt;The site is funny because the bots are weird.&lt;/p&gt;

&lt;p&gt;But it is useful because the system is real.&lt;/p&gt;

&lt;p&gt;It runs on schedules.&lt;/p&gt;

&lt;p&gt;It produces public artifacts.&lt;/p&gt;

&lt;p&gt;It forces me to care about failure recovery, state, timing, consistency, and operator visibility.&lt;/p&gt;

&lt;p&gt;And because it is a social product, the output is legible even to people who do not care about orchestration internals.&lt;/p&gt;

&lt;p&gt;They do not need to understand the runtime design to understand whether it is working.&lt;/p&gt;

&lt;p&gt;They can just look at the feed.&lt;/p&gt;

&lt;p&gt;That is a much harsher and more honest demo surface than a polished single-turn interaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real point
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;bots.frumu.ai&lt;/strong&gt; because I wanted a public test for whether &lt;strong&gt;Tandem&lt;/strong&gt; actually holds up under recurring, messy, visible work.&lt;/p&gt;

&lt;p&gt;It turns out that making bots post, debate, and generate media in public is a very effective way to find orchestration problems.&lt;/p&gt;

&lt;p&gt;It also turns out it is much more fun than another sterile AI demo.&lt;/p&gt;

&lt;p&gt;That is the project in one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I was bored, built an AI social network, and accidentally ended up with both a runtime stress test and a comedy machine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bots.frumu.ai" rel="noopener noreferrer"&gt;bots.frumu.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tandem.frumu.ai" rel="noopener noreferrer"&gt;tandem.frumu.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tandem.docs.frumu.ai" rel="noopener noreferrer"&gt;tandem.docs.frumu.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/frumu-ai/tandem" rel="noopener noreferrer"&gt;github.com/frumu-ai/tandem&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>buildinpublic</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Most AI Agent Frameworks Treat Chat as a Runtime. That’s the Problem</title>
      <dc:creator>Evan Green</dc:creator>
      <pubDate>Mon, 09 Mar 2026 20:25:23 +0000</pubDate>
      <link>https://dev.to/tacshade/most-ai-agent-frameworks-treat-chat-as-a-runtime-thats-the-problem-1do</link>
      <guid>https://dev.to/tacshade/most-ai-agent-frameworks-treat-chat-as-a-runtime-thats-the-problem-1do</guid>
      <description>&lt;p&gt;Most AI agent frameworks are chat wrappers with a loop bolted on.&lt;/p&gt;

&lt;p&gt;They look capable in demos. They can feel impressive in short-lived workflows. But once you add retries, parallel workers, approval gates, failures, long-running tasks, and operator oversight, the whole thing starts to collapse into improvisation.&lt;/p&gt;

&lt;p&gt;The reason is structural: many of these systems treat the conversation transcript as the coordination layer.&lt;/p&gt;

&lt;p&gt;That is not a runtime. It is a liability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tandem is built around a different premise.&lt;/strong&gt; The engine should own orchestration, state, approvals, artifacts, scheduling, replay, checkpoints, and memory access. Once workflows become parallel and durable, that shift stops being a design preference and starts becoming a requirement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chat is a good interface, but a weak coordination layer
&lt;/h2&gt;

&lt;p&gt;There is nothing wrong with chat as a surface. It is intuitive, flexible, and useful for directing work.&lt;/p&gt;

&lt;p&gt;The problem begins when chat becomes the authoritative system of record for execution.&lt;/p&gt;

&lt;p&gt;When a transcript is the source of truth, concurrency becomes guesswork. There is no reliable way to let multiple agents work in parallel because no agent knows, in a structured way, what the others have claimed. Failure handling turns into re-prompting and hoping. Debugging means re-reading threads. Replay is impossible. Operator visibility becomes a log scrape.&lt;/p&gt;

&lt;p&gt;These are not unusual edge cases. They are normal conditions for any workflow that runs longer than a few moments or involves more than one worker.&lt;/p&gt;

&lt;p&gt;That is the dividing line between a clever assistant and a serious execution platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Tandem actually is
&lt;/h2&gt;

&lt;p&gt;Tandem is an &lt;strong&gt;engine-owned workflow runtime&lt;/strong&gt; for coordinated autonomous work.&lt;/p&gt;

&lt;p&gt;That means the engine, not the UI, owns truth about execution. However you access the system (desktop, terminal, web, or API) you are talking to the same engine running the same execution model. There is no surface that holds state the others cannot see.&lt;/p&gt;

&lt;p&gt;The engine owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orchestration&lt;/li&gt;
&lt;li&gt;task state&lt;/li&gt;
&lt;li&gt;approvals&lt;/li&gt;
&lt;li&gt;artifacts&lt;/li&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;replay&lt;/li&gt;
&lt;li&gt;checkpoints&lt;/li&gt;
&lt;li&gt;memory access&lt;/li&gt;
&lt;li&gt;policy enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because once workflows are parallel and long-running, you need infrastructure that can survive failure, coordinate work deterministically, and expose the same state consistently across every surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  One runtime, multiple surfaces
&lt;/h2&gt;

&lt;p&gt;A lot of agent systems end up splitting behavior across interfaces. One surface has one model, another has a different one, and the logic gradually fragments.&lt;/p&gt;

&lt;p&gt;Tandem is being built around the opposite idea: one runtime, multiple clients.&lt;/p&gt;

&lt;p&gt;The same engine powers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a desktop app for daily workflows and supervised approvals&lt;/li&gt;
&lt;li&gt;a TUI for terminal-native operation&lt;/li&gt;
&lt;li&gt;a web control panel for operations, automations, packs, and live oversight&lt;/li&gt;
&lt;li&gt;a headless HTTP + SSE runtime for API clients and server deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means you are not rewriting the operating model for each surface. You are interacting with the same execution substrate from different environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhbz7fj6c86om6q976hd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhbz7fj6c86om6q976hd.png" alt="Tandem desktop app" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The engine also exposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP + SSE APIs for sessions, runs, cancellation, and event streaming&lt;/li&gt;
&lt;li&gt;a TypeScript SDK: &lt;code&gt;@frumu/tandem-client&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;a Python SDK: &lt;code&gt;tandem-client&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;headless runtime support for server deployments, internal apps, and channel integrations&lt;/li&gt;
&lt;/ul&gt;
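
&lt;p&gt;The SSE side of that API follows the standard Server-Sent Events wire format, so any client can consume it with a few lines of parsing. Below is a minimal SSE parser in Python; the event names in the sample stream are hypothetical, not Tandem's actual event schema.&lt;/p&gt;

```python
import json

def parse_sse(stream_text):
    """Parse a Server-Sent Events stream into (event, data) pairs.

    Follows the SSE wire format: fields are 'event:' and 'data:',
    and a blank line terminates each event.
    """
    events = []
    name, data_lines = "message", []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            name = line[6:].strip()
        elif line.startswith("data:"):
            data_lines.append(line[5:].strip())
        elif line == "" and data_lines:
            events.append((name, "\n".join(data_lines)))
            name, data_lines = "message", []
    return events

# Hypothetical run events; real Tandem event names may differ.
raw = (
    "event: task.claimed\n"
    'data: {"task": 9, "agent": "builder-1"}\n'
    "\n"
    "event: task.completed\n"
    'data: {"task": 9}\n'
    "\n"
)
for name, payload in parse_sse(raw):
    print(name, json.loads(payload)["task"])
```

&lt;p&gt;The same parser works for any SSE endpoint, which is the point of building on a standard wire format rather than a bespoke one.&lt;/p&gt;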

&lt;h2&gt;
  
  
  Blackboard first, not transcript first
&lt;/h2&gt;

&lt;p&gt;At the center of Tandem is the idea that agents should coordinate through durable shared state, not just messages.&lt;/p&gt;

&lt;p&gt;That shared state lives in a &lt;strong&gt;blackboard&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A blackboard is the engine’s shared execution map. It holds the structured state of the job: what exists, what changed, what is blocked, what is runnable, what failed, what artifacts were produced, and what decisions were made.&lt;/p&gt;

&lt;p&gt;It is not a conversation history. It is runtime state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Blackboard execution map
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vkkmcfoe7tcy95b2ngz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vkkmcfoe7tcy95b2ngz.png" alt="Blackboard execution map" width="784" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what allows the system to answer operational questions directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which tasks are blocked?&lt;/li&gt;
&lt;li&gt;Which tasks are runnable?&lt;/li&gt;
&lt;li&gt;Which tasks are already claimed?&lt;/li&gt;
&lt;li&gt;Which tasks require approval?&lt;/li&gt;
&lt;li&gt;Which tasks failed and should be retried?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a blackboard, those answers usually have to be inferred from logs or reconstructed from chat. That does not scale.&lt;/p&gt;
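
&lt;p&gt;To make that concrete, here is a toy projection over a handful of task records. The field names are illustrative, not Tandem's actual schema, but they show why a structured blackboard can answer each of those questions with a simple query instead of log archaeology.&lt;/p&gt;

```python
# Toy blackboard projection. Field names are illustrative,
# not Tandem's actual schema.
board = {
    1: {"state": "done",    "deps": [],  "claimed_by": None, "approval": False},
    2: {"state": "pending", "deps": [1], "claimed_by": None, "approval": False},
    3: {"state": "pending", "deps": [1], "claimed_by": None, "approval": True},
    4: {"state": "running", "deps": [1], "claimed_by": "a1", "approval": False},
    5: {"state": "failed",  "deps": [1], "claimed_by": None, "approval": False},
    6: {"state": "pending", "deps": [5], "claimed_by": None, "approval": False},
}

def query(board):
    done = {tid for tid, t in board.items() if t["state"] == "done"}
    def pending(t):
        return t["state"] == "pending"
    blocked  = [tid for tid, t in board.items()
                if pending(t) and not set(t["deps"]).issubset(done)]
    runnable = [tid for tid, t in board.items()
                if pending(t) and set(t["deps"]).issubset(done)
                and not t["approval"]]
    claimed  = [tid for tid, t in board.items() if t["claimed_by"] is not None]
    needs_ok = [tid for tid, t in board.items() if pending(t) and t["approval"]]
    retry    = [tid for tid, t in board.items() if t["state"] == "failed"]
    return blocked, runnable, claimed, needs_ok, retry

blocked, runnable, claimed, needs_ok, retry = query(board)
print("blocked:", blocked, "runnable:", runnable, "claimed:", claimed)
```

&lt;p&gt;Every one of the operational questions above becomes a cheap, deterministic read over structured state.&lt;/p&gt;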

&lt;h2&gt;
  
  
  Workboards are execution, not just UI
&lt;/h2&gt;

&lt;p&gt;On top of the blackboard, Tandem uses a &lt;strong&gt;workboard&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the blackboard is the shared state layer, the workboard is the execution layer agents coordinate against. It is not just a Kanban view. It is the engine-owned task model that tracks state, ownership, dependencies, decisions, retries, artifacts, and reliability signals.&lt;/p&gt;

&lt;p&gt;Agents do not ask each other in chat who wants the next task.&lt;/p&gt;

&lt;p&gt;The board already knows.&lt;/p&gt;

&lt;p&gt;At any moment, the board needs to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which tasks are blocked and waiting on prerequisites&lt;/li&gt;
&lt;li&gt;which tasks are runnable and eligible for claiming&lt;/li&gt;
&lt;li&gt;which tasks are already claimed, with lease metadata&lt;/li&gt;
&lt;li&gt;which tasks require a specific role or gate&lt;/li&gt;
&lt;li&gt;which tasks failed and are eligible for retry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the baseline for safe concurrency. Without it, parallel agents trampling the same job is not a matter of if. It is a matter of when.&lt;/p&gt;

&lt;h2&gt;
  
  
  How claims and task transitions work
&lt;/h2&gt;

&lt;p&gt;When a task becomes runnable, an agent claims it. That claim writes ownership and a lease into shared state. No other worker can take that same task unless the lease expires, the claim is released, or policy transitions it.&lt;/p&gt;

&lt;p&gt;This is optimistic concurrency applied to workflow execution.&lt;/p&gt;

&lt;p&gt;Role-aware routing goes further. Tasks can express an intent such as &lt;code&gt;memory&lt;/code&gt;, &lt;code&gt;builder&lt;/code&gt;, &lt;code&gt;review&lt;/code&gt;, or &lt;code&gt;test-gate&lt;/code&gt;, and the runtime routes them to the appropriate agent class. Not every worker is interchangeable, and the board should encode that rather than leaving it buried inside a prompt.&lt;/p&gt;

&lt;p&gt;When a task fails, it moves into a &lt;code&gt;failed&lt;/code&gt; or &lt;code&gt;rework&lt;/code&gt; state, increments a retry counter, and becomes available for clean pickup later. The board records what happened, which then feeds into auditability, replay, and memory.&lt;/p&gt;
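
&lt;p&gt;A minimal sketch of that claim/lease/transition cycle, using an integer tick clock in place of wall time. The states and field names are illustrative, not Tandem's actual implementation.&lt;/p&gt;

```python
# Minimal claim/lease/transition sketch. An integer "tick" clock stands
# in for wall time; states and field names are illustrative.
LEASE_TICKS = 5

def claim(board, tid, agent, now):
    t = board[tid]
    lease_expired = t["lease"] is not None and now >= t["lease"]
    if t["owner"] is None or lease_expired:
        t["owner"], t["lease"], t["state"] = agent, now + LEASE_TICKS, "running"
        return True
    return False  # someone else holds a live lease

def fail(board, tid):
    t = board[tid]
    t["state"] = "failed"
    t["retries"] += 1
    t["owner"], t["lease"] = None, None  # released for clean pickup later

board = {9: {"state": "runnable", "owner": None, "lease": None, "retries": 0}}

assert claim(board, 9, "builder-1", now=0)      # first claim wins
assert not claim(board, 9, "builder-2", now=2)  # lease still live, rejected
fail(board, 9)                                  # failure releases the task
assert claim(board, 9, "builder-2", now=3)      # clean re-claim with retry count
print(board[9]["retries"], board[9]["owner"])
```

&lt;p&gt;The important property is that ownership lives in shared state, not in any agent's context window: a second worker is rejected deterministically rather than by convention.&lt;/p&gt;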

&lt;h3&gt;
  
  
  Claim and transition lifecycle
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zzt3dc1ba458emkdzpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zzt3dc1ba458emkdzpv.png" alt="Claim and transition lifecycle" width="784" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A fifty-task board in practice
&lt;/h2&gt;

&lt;p&gt;Imagine a coding mission with fifty tasks.&lt;/p&gt;

&lt;p&gt;Some are runnable immediately. Some are blocked by prerequisites. Some require approval. Some are intended for specific roles like memory retrieval, triage, implementation, review, or testing.&lt;/p&gt;

&lt;p&gt;Ten agents can work on that board at once, but only tasks that are actually runnable should be claimable.&lt;/p&gt;

&lt;p&gt;As tasks complete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;blocked dependents are re-evaluated&lt;/li&gt;
&lt;li&gt;newly unblocked work becomes runnable&lt;/li&gt;
&lt;li&gt;free agents claim new tasks&lt;/li&gt;
&lt;li&gt;retries are tracked cleanly&lt;/li&gt;
&lt;li&gt;the board keeps a revisioned record of what happened&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is parallel execution with structure.&lt;/p&gt;

&lt;p&gt;A simplified coding flow might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Waits on&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Inspect issue&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Retrieve memory&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Inspect repo structure&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Reproduce bug&lt;/td&gt;
&lt;td&gt;1, 2, 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Identify duplicate issues&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Review prior fixes&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Draft triage summary&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Write failing test&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Propose patch&lt;/td&gt;
&lt;td&gt;4, 5, 6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Validate patch&lt;/td&gt;
&lt;td&gt;8, 9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Prepare final review artifact&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At runtime, triage can claim task 1 first. Once that completes, memory retrieval, repo inspection, duplicate checking, prior-fix review, and triage summary can branch from it. Reproduction cannot start until the required upstream tasks are done. Patch work cannot even become runnable until reproduction and prior analysis are complete.&lt;/p&gt;

&lt;p&gt;No one has to ask in chat who owns task 9. The board already knows.&lt;/p&gt;
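
&lt;p&gt;That dependency graph can be replayed as waves of runnable work. The sketch below simulates the table above, assuming every claimed task completes:&lt;/p&gt;

```python
# The dependency table above, replayed as waves of runnable work.
# Each wave is the set of tasks whose prerequisites are all complete.
deps = {1: [], 2: [1], 3: [1], 4: [1, 2, 3], 5: [1], 6: [1],
        7: [1], 8: [4], 9: [4, 5, 6], 10: [8, 9], 11: [10]}

done, waves = set(), []
while len(done) != len(deps):
    wave = sorted(t for t, d in deps.items()
                  if t not in done and set(d).issubset(done))
    waves.append(wave)
    done.update(wave)  # assume every task in the wave completes

for wave in waves:
    print(wave)
```

&lt;p&gt;The run fans out after task 1, narrows back to task 4, fans out again for test and patch work, and converges on validation and the final artifact. That shape is computed from the board, never negotiated in chat.&lt;/p&gt;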

&lt;h2&gt;
  
  
  Engine truth, not transcript inference
&lt;/h2&gt;

&lt;p&gt;For production coordination, the engine must own truth.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;structured event history&lt;/strong&gt; instead of log scraping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;materialized run state&lt;/strong&gt; instead of state inferred from a thread&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;checkpoints and replay&lt;/strong&gt; so you can rewind execution rather than restart blindly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;decision lineage&lt;/strong&gt; so you know why something was routed or blocked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;deterministic task-state projection&lt;/strong&gt; so every client sees the same reality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without those primitives, debugging multi-agent workflows becomes archaeology.&lt;/p&gt;

&lt;p&gt;With them, the system becomes operable.&lt;/p&gt;
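
&lt;p&gt;In miniature, that model is event sourcing: run state is a fold over a structured event log, so a checkpoint is just a snapshot plus the tail of the log. The event shapes below are illustrative, not Tandem's actual schema.&lt;/p&gt;

```python
# Run state as a fold over a structured event log. Replaying the log
# (or a checkpoint plus the tail) always rebuilds the same state.
def apply(state, event):
    state = dict(state)
    state[event["task"]] = event["kind"]  # last-write-wins per task
    return state

log = [
    {"kind": "claimed",   "task": 1},
    {"kind": "completed", "task": 1},
    {"kind": "claimed",   "task": 2},
    {"kind": "failed",    "task": 2},
]

def replay(events, start=None):
    state = dict(start or {})
    for e in events:
        state = apply(state, e)
    return state

full = replay(log)
checkpoint = replay(log[:2])           # snapshot after two events
resumed = replay(log[2:], checkpoint)  # checkpoint plus the tail
print(full == resumed, full)
```

&lt;p&gt;Because state is derived rather than accumulated ad hoc, rewinding to a checkpoint and replaying forward gives the same reality on every client.&lt;/p&gt;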

&lt;h2&gt;
  
  
  Concurrency that actually holds together
&lt;/h2&gt;

&lt;p&gt;A lot of conversational delegation systems suffer from concurrency blindness. They can delegate in theory, but once multiple workers are active, the coordination model becomes fuzzy.&lt;/p&gt;

&lt;p&gt;Tandem is built around the opposite approach:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycopl64kwdyv3lf2xq94.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycopl64kwdyv3lf2xq94.png" alt="Concurrency" width="784" height="787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explicit task claims&lt;/li&gt;
&lt;li&gt;revisioned shared state&lt;/li&gt;
&lt;li&gt;blackboard patch streams&lt;/li&gt;
&lt;li&gt;deterministic task transitions&lt;/li&gt;
&lt;li&gt;isolated execution paths&lt;/li&gt;
&lt;li&gt;replayable run history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not just implementation details. They are what make concurrent autonomous work tractable.&lt;/p&gt;
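
&lt;p&gt;The revisioned-state piece is essentially compare-and-set. A toy version (not Tandem's actual API) shows why a stale writer is rejected instead of silently merged:&lt;/p&gt;

```python
# Revisioned shared state in miniature: writers must present the
# revision they read, and stale writes are rejected instead of merged.
class Blackboard:
    def __init__(self):
        self.revision = 0
        self.entries = {}

    def read(self):
        return self.revision, dict(self.entries)

    def compare_and_set(self, expected_revision, patch):
        if expected_revision != self.revision:
            return False  # stale: the caller must re-read and retry
        self.entries.update(patch)
        self.revision += 1
        return True

bb = Blackboard()
rev_a, _ = bb.read()  # agent A reads at revision 0
rev_b, _ = bb.read()  # agent B reads at revision 0
assert bb.compare_and_set(rev_a, {"task-9": "claimed by A"})
assert not bb.compare_and_set(rev_b, {"task-9": "claimed by B"})  # stale
print(bb.revision, bb.entries)
```

&lt;p&gt;The losing writer gets a clean, observable conflict it can retry from, which is exactly what a transcript-based system cannot give you.&lt;/p&gt;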

&lt;h2&gt;
  
  
  Browser automation belongs inside the runtime
&lt;/h2&gt;

&lt;p&gt;Serious workflows often need the web.&lt;/p&gt;

&lt;p&gt;That is why Tandem includes browser automation as part of the same engine-owned model as local files, APIs, and operator actions. It is not bolted on, and it does not live in a separate lane. It participates in the same coordination model, artifact flow, and runtime state as everything else.&lt;/p&gt;

&lt;p&gt;Tandem also includes a custom web fetch tool that converts raw HTML into Markdown before handing it to the LLM. That makes web content easier for the model to work with and strips out a large amount of unnecessary markup and noise. In practice, that can cut token usage dramatically, in some cases by as much as 80%.&lt;/p&gt;

&lt;p&gt;This works wherever the engine runs. On a headless Linux server with Chromium installed, such as a VPS, CI runner, or Docker container, browser tasks can execute with no display server or GUI required. The same workflows you build locally can run the same way remotely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tandem Coder is where this becomes especially obvious
&lt;/h2&gt;

&lt;p&gt;One of the clearest places this model matters is coding workflows.&lt;/p&gt;

&lt;p&gt;The next major slice is &lt;strong&gt;Tandem Coder&lt;/strong&gt;: a memory-aware coding agent that runs inside the engine rather than as a frontend-owned feature layer.&lt;/p&gt;

&lt;p&gt;It is being built on top of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context runs&lt;/li&gt;
&lt;li&gt;blackboard state&lt;/li&gt;
&lt;li&gt;artifacts&lt;/li&gt;
&lt;li&gt;approvals&lt;/li&gt;
&lt;li&gt;GitHub MCP&lt;/li&gt;
&lt;li&gt;engine memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The near-term roadmap includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coder run contracts and artifact taxonomy&lt;/li&gt;
&lt;li&gt;deterministic memory retrieval for issue triage&lt;/li&gt;
&lt;li&gt;memory-aware issue triage workflows&lt;/li&gt;
&lt;li&gt;failure-fingerprint memory candidates and duplicate detection&lt;/li&gt;
&lt;li&gt;a developer-mode run viewer with kanban projection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is for coder workflows to learn from prior failures, fixes, and reviews by reusing engine memory, not by re-reading chat history and pretending that is durable context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Tandem fits relative to assistant-first systems
&lt;/h2&gt;

&lt;p&gt;Assistant-first systems are usually optimized for speed of setup, chat interaction, and personal productivity.&lt;/p&gt;

&lt;p&gt;That is a valid design center, and it solves real problems.&lt;/p&gt;

&lt;p&gt;Tandem is aimed at a different layer. It is not trying to be a better personal assistant. It is being built as orchestration infrastructure for cases where you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable shared task state&lt;/li&gt;
&lt;li&gt;parallel execution&lt;/li&gt;
&lt;li&gt;replay and checkpoints&lt;/li&gt;
&lt;li&gt;engine-owned truth&lt;/li&gt;
&lt;li&gt;structured artifacts&lt;/li&gt;
&lt;li&gt;approvals and policy gates&lt;/li&gt;
&lt;li&gt;memory-aware workflows&lt;/li&gt;
&lt;li&gt;multiple clients on the same runtime&lt;/li&gt;
&lt;li&gt;a headless platform that other tools can build on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a different problem, and it leads to different architecture choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this category needs better foundations
&lt;/h2&gt;

&lt;p&gt;Too many AI agent systems still rely on improvisation once real complexity shows up.&lt;/p&gt;

&lt;p&gt;They look smart on the happy path, then become fragile when you add concurrency, failures, approvals, long-lived workflows, or operator oversight.&lt;/p&gt;

&lt;p&gt;If autonomous systems are going to do serious work, they need stronger primitives underneath them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;blackboards and workboards&lt;/li&gt;
&lt;li&gt;task claiming&lt;/li&gt;
&lt;li&gt;optimistic concurrency&lt;/li&gt;
&lt;li&gt;checkpoints and replay&lt;/li&gt;
&lt;li&gt;engine-owned state&lt;/li&gt;
&lt;li&gt;reusable memory&lt;/li&gt;
&lt;li&gt;structured artifacts&lt;/li&gt;
&lt;li&gt;deterministic workflow control&lt;/li&gt;
&lt;li&gt;policy-gated mutation paths&lt;/li&gt;
&lt;li&gt;operator-grade visibility across clients&lt;/li&gt;
&lt;li&gt;stable platform APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what Tandem is being built around.&lt;/p&gt;

&lt;p&gt;Not as a better chatbot.&lt;/p&gt;

&lt;p&gt;As infrastructure for coordinated autonomous work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Desktop app:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://tandem.frumu.ai/" rel="noopener noreferrer"&gt;https://tandem.frumu.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web control panel:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @frumu/tandem-panel
tandem-control-panel &lt;span class="nt"&gt;--init&lt;/span&gt;
tandem-control-panel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://127.0.0.1:39732&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Engine and TUI (WIP):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @frumu/tandem @frumu/tandem-tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TUI is still a work in progress. Start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tandem-tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it does not attach or bootstrap cleanly in your environment, run the engine manually and retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tandem-engine serve &lt;span class="nt"&gt;--hostname&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 39731
tandem-tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If engine API token auth is enabled, set the same token in your environment before launching the TUI.&lt;/p&gt;

&lt;p&gt;For setup and troubleshooting help, use the Tandem docs: &lt;code&gt;https://tandem.docs.frumu.ai/&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>orchestration</category>
    </item>
    <item>
      <title>I Built an Agentic Ferrari in Rust… and Nobody’s Driving It</title>
      <dc:creator>Evan Green</dc:creator>
      <pubDate>Mon, 23 Feb 2026 15:00:49 +0000</pubDate>
      <link>https://dev.to/tacshade/i-built-an-agentic-ferrari-in-rust-and-nobodys-driving-it-57fe</link>
      <guid>https://dev.to/tacshade/i-built-an-agentic-ferrari-in-rust-and-nobodys-driving-it-57fe</guid>
      <description>&lt;p&gt;I built an agentic Ferrari in Rust.&lt;/p&gt;

&lt;p&gt;It’s fast, ridiculously low overhead, and honestly a little absurd: multi-agent orchestration, tool routing, local memory, event streaming, safety gates—the whole thing.&lt;/p&gt;

&lt;p&gt;And right now… almost nobody’s driving it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/frumu-ai/tandem" rel="noopener noreferrer"&gt;https://github.com/frumu-ai/tandem&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://tandem.frumu.ai/" rel="noopener noreferrer"&gt;https://tandem.frumu.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://tandem.frumu.ai/docs/" rel="noopener noreferrer"&gt;https://tandem.frumu.ai/docs/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwsgoc45z9g3w7ebykte.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwsgoc45z9g3w7ebykte.gif" alt="Tandem Orchestrator" width="600" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Developers get workflows. Everyone else gets chatboxes. Tandem brings workflows to everyone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The brutally honest origin story
&lt;/h2&gt;

&lt;p&gt;This started because I wanted Anthropic’s Cowork on Windows. I thought the idea was that good.&lt;/p&gt;

&lt;p&gt;The bigger motivation is the gap: developers get real AI workflows (CLI/IDEs), while everyone else gets a chat box. I can use the dev tools — the problem is most people can’t. Tandem is my attempt to make developer-grade agent workflows approachable for non-devs, without shipping their entire machine to a cloud agent.&lt;/p&gt;

&lt;p&gt;At first I used OpenCode to move fast. But once I cared about owning the rules—custom endpoints, tool semantics, streaming events, safety gates—I hit the wall.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the runtime controls how prompts, tools, and state flow… and I need control over that… I may as well build the runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s when “a Tauri app” turned into a full local agent engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tandem isn’t a desktop app. It’s an engine with clients.
&lt;/h2&gt;

&lt;p&gt;Tandem is a headless agent runtime written in Rust. The UI is just a client.&lt;/p&gt;

&lt;p&gt;Today it ships with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;tandem-engine (Rust):&lt;/strong&gt; orchestration, tools, memory, event streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop app (Tauri + React):&lt;/strong&gt; Plan Mode, visual diffs, approve-to-execute UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TUI:&lt;/strong&gt; a terminal cockpit for the same engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headless VPS Portal:&lt;/strong&gt; 9 working React examples (Deep Research, Swarms, Incident Triage) to show how easy it is to build custom clients&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guides&lt;/strong&gt; to build your own GUI/clients on top of the engine API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ferrari translation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engine = &lt;code&gt;tandem-engine&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Dashboard = Desktop UI&lt;/li&gt;
&lt;li&gt;Track cockpit = TUI&lt;/li&gt;
&lt;li&gt;Telemetry = event stream&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Rust (the real reason)
&lt;/h2&gt;

&lt;p&gt;I wanted to push horsepower downward.&lt;/p&gt;

&lt;p&gt;Agent systems aren’t one-off calls. They’re loops: &lt;em&gt;plan → act → observe → revise&lt;/em&gt;. That means the runtime ends up doing a lot of “boring but heavy” work constantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;managing orchestration state (like my supervised "Planner → Builder → Validator" sub-agent loop)&lt;/li&gt;
&lt;li&gt;streaming structured events&lt;/li&gt;
&lt;li&gt;tool routing + retries + budgets + sandboxed Python venv execution&lt;/li&gt;
&lt;li&gt;indexing and memory reads/writes&lt;/li&gt;
&lt;li&gt;parsing noisy web pages or extracting text from PDFs/DOCXs into something models can actually use&lt;/li&gt;
&lt;/ul&gt;
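
&lt;p&gt;Stripped of everything engine-specific, that loop fits in a few lines. Python here is only sketch notation (in Tandem the loop lives in the Rust engine), and the stub tasks are invented for illustration:&lt;/p&gt;

```python
# The plan / act / observe / revise loop, stripped of everything
# engine-specific. Sketch notation only; the real loop is Rust.
def run_mission(goal, plan, act, observe, revise, max_steps=10):
    tasks = plan(goal)
    for _ in range(max_steps):
        if not tasks:
            return "done"
        result = act(tasks[0])
        feedback = observe(result)
        tasks = revise(tasks, feedback)  # retry, reorder, or pop
    return "budget exhausted"

# Tiny stub mission: two tasks, the first fails once before passing.
attempts = {"count": 0}
def plan(goal):
    return ["compile", "test"]
def act(task):
    if task == "compile" and attempts["count"] == 0:
        attempts["count"] += 1
        return "error"
    return "ok"
def observe(result):
    return result
def revise(tasks, feedback):
    return tasks if feedback == "error" else tasks[1:]

print(run_mission("ship it", plan, act, observe, revise))
```

&lt;p&gt;Every pass through that loop touches state, events, tools, and memory, which is why the hot path belongs in the runtime rather than the frontend.&lt;/p&gt;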

&lt;p&gt;Rust is where I want that work to live.&lt;/p&gt;

&lt;p&gt;Then frontends can do what they’re best at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;look polished (React + motion)&lt;/li&gt;
&lt;li&gt;stay responsive&lt;/li&gt;
&lt;li&gt;render plans, diffs, timelines, logs&lt;/li&gt;
&lt;li&gt;avoid becoming a spaghetti bowl of orchestration logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This separation is the whole point: engine = responsibility, clients = experience.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Provider freedom (because vendor lock-in sucks)
&lt;/h2&gt;

&lt;p&gt;Tandem isn't tied to a specific model. Bring your own API key (OpenRouter, Anthropic, OpenAI) or go completely offline with local models via Ollama. The engine doesn't care; it simply routes and executes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trust problem: Choosing your level of control
&lt;/h2&gt;

&lt;p&gt;Letting an LLM blindly write files “live” in the background is a quick way to lose trust forever. &lt;/p&gt;

&lt;p&gt;Instead of a one-size-fits-all approach, Tandem lets you dial the trust level up or down depending on what you're doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In regular chat sessions, you pick the exact mode per prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ask:&lt;/strong&gt; Q&amp;amp;A without making any file changes at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore:&lt;/strong&gt; Analyze and explore the codebase safely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate:&lt;/strong&gt; Execute changes directly for quick, low-risk edits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan (Zero Trust):&lt;/strong&gt; The agent proposes a staged execution plan, the UI shows visual diffs, and a human explicitly clicks Execute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coder:&lt;/strong&gt; Focused specifically on code generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrate:&lt;/strong&gt; An AI plans and executes multi-step workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For serious architectural work, you enter the Command Center:&lt;/strong&gt;&lt;br&gt;
A dedicated cockpit for launching orchestrator missions and managing manual swarm interventions. You give the objective, and the engine coordinates Planner/Builder/Validator sub-agents while giving you live telemetry on tokens, runtime, and tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For background work, there's Agent Automation (WIP):&lt;/strong&gt;&lt;br&gt;
A separate hub for Scheduled Bots (like Daily Research or Issue Triage) and MCP Connector operations where you set explicit bounds and let it run.&lt;/p&gt;

&lt;p&gt;Plan Mode is slower than “just do it,” but having that safety net is how you make local agents feel safe enough to keep installed. &lt;em&gt;(I also encrypt your API keys locally using AES-256-GCM, because I mean it when I say "developer-grade".)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y2cynz71m65d0xsk067.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y2cynz71m65d0xsk067.gif" alt="Command Center" width="720" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Local memory without the usual “install a database” tax
&lt;/h2&gt;

&lt;p&gt;Most “local-first” tools quietly stop being local-first the moment you add RAG, because retrieval usually drags in external services.&lt;/p&gt;

&lt;p&gt;I wanted long-term memory without asking users to run Postgres/pgvector/Pinecone/etc.&lt;/p&gt;

&lt;p&gt;So memory lives locally (SQLite + embedded vector search via &lt;code&gt;sqlite-vec&lt;/code&gt;). It keeps setup friction low and makes the engine feel like an actual local tool, not a mini devops project.&lt;/p&gt;
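
&lt;p&gt;The idea in miniature: embeddings live in an ordinary SQLite file, and recall is a similarity search over stored vectors. &lt;code&gt;sqlite-vec&lt;/code&gt; does this natively inside SQLite; the Python sketch below needs no extension at all, and the memories and vectors are made up for illustration.&lt;/p&gt;

```python
# Local memory in miniature: vectors stored in a plain SQLite table,
# nearest-neighbor search done in the host language. sqlite-vec does
# this natively inside SQLite; this sketch needs no extension.
import json
import math
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (text TEXT, embedding TEXT)")
rows = [("fixed auth retry bug", [0.9, 0.1, 0.0]),
        ("added dark mode",      [0.0, 0.2, 0.9]),
        ("auth token refresh",   [0.8, 0.3, 0.1])]
db.executemany("INSERT INTO memory VALUES (?, ?)",
               [(t, json.dumps(v)) for t, v in rows])

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def recall(query_vec, k=1):
    stored = [(t, json.loads(v)) for t, v in
              db.execute("SELECT text, embedding FROM memory")]
    ranked = sorted(stored, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [t for t, _ in ranked[:k]]

print(recall([1.0, 0.2, 0.0]))  # query near the auth-related memories
```

&lt;p&gt;Everything stays in one file on disk, which is the whole appeal: backup is a file copy, and there is no service to babysit.&lt;/p&gt;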

&lt;h2&gt;
  
  
  So why is the Ferrari parked?
&lt;/h2&gt;

&lt;p&gt;Because capability isn’t adoption.&lt;/p&gt;

&lt;p&gt;I built the engine. I shipped two clients. I wrote docs. It works.&lt;/p&gt;

&lt;p&gt;But most people don’t wake up looking for “an agent runtime.” They want a workflow that succeeds in 60 seconds, onboarding that’s obvious, and trust that’s earned immediately.&lt;/p&gt;

&lt;p&gt;A Ferrari is useless if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nobody knows where the keys are, or&lt;/li&gt;
&lt;li&gt;they’re scared it’ll go through the garage door.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Help me find drivers (or build a better steering wheel)
&lt;/h2&gt;

&lt;p&gt;If you try Tandem and bounce in the first 5 minutes, I want to know where. That feedback is worth more than compliments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/frumu-ai/tandem" rel="noopener noreferrer"&gt;https://github.com/frumu-ai/tandem&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://tandem.frumu.ai/" rel="noopener noreferrer"&gt;https://tandem.frumu.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://tandem.frumu.ai/docs/" rel="noopener noreferrer"&gt;https://tandem.frumu.ai/docs/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And if you build a client on top of the engine: I’ll happily link it in the docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-engineered?&lt;/strong&gt; Probably. 😄&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Necessary?&lt;/strong&gt; Also probably. Because the moment you want safety gates, streaming state, memory, and multiple clients, you’re not building an app anymore—you’re building a runtime.&lt;/p&gt;

&lt;p&gt;And that’s the real punchline: Tandem isn’t “one UI.” It’s a local engine that can serve &lt;strong&gt;dozens or even hundreds of clients&lt;/strong&gt; on the same machine—Desktop, TUI, tiny custom dashboards, scripts, automations, whatever you want—without rebuilding the core. &lt;em&gt;(I just shipped 9 example dashboards in my &lt;code&gt;vps-web-portal&lt;/code&gt; to prove it).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Ferrari isn’t the dashboard. It’s the engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgowzud8a1629ggblbfm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgowzud8a1629ggblbfm.gif" alt="Using Local Models" width="720" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agentic</category>
      <category>opensource</category>
      <category>rust</category>
      <category>tauri</category>
    </item>
  </channel>
</rss>
