I’ve been trying to make local AI workflows reliable for real day-to-day use: coding tasks, browser tasks, repeatable evals, and auditable tool execution.
I first tried adding trust/approval controls around existing agent CLIs. That approach hit a hard limit quickly: when tool execution is deeply native to the host app, external wrappers can’t reliably enforce policy boundaries.
So I built my own runtime: LocalAgent.
GitHub: https://github.com/CalvinSturm/LocalAgent
## Why I built this
I kept seeing the same failure pattern with local 20–30B models:
- brittle tool behavior
- occasional non-answers
- inconsistent step execution
- hard-to-debug failures without replayable state
The answer wasn’t just “pick a better model.”
The answer was to harden the runtime process:
- explicit safety gates
- deterministic artifacts
- policy + approvals
- eval + baseline comparisons
- replay + verification
## What LocalAgent is
LocalAgent is a local-first agent runtime CLI focused on control and reliability.
It supports:
- local providers: LM Studio, Ollama, llama.cpp server
- tool calling with hard gates
- trust workflows (policy, approvals, audit)
- replayable run artifacts
- MCP stdio tool sources (including Playwright MCP)
- deterministic eval harnesses
- TUI chat mode
## Safety defaults (important)
Defaults are intentionally restrictive:
- trust is off
- shell is disabled
- write tools are not exposed
- file write execution is disabled
You have to explicitly enable risky capabilities.
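To make the restrictive defaults concrete, here is a hypothetical config sketch. The key names are my invention for illustration only, not LocalAgent's actual settings schema; check the repo's docs for the real format.

```toml
# Hypothetical settings sketch -- key names are illustrative,
# not LocalAgent's actual configuration schema.
[trust]
enabled = false              # trust is off by default

[tools]
shell = false                # shell execution disabled
expose_write_tools = false   # write tools not exposed
file_write = false           # file write execution disabled
```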
## Architecture (high level)
At a high level, each run does:
- Build runtime context (provider/model/workdir/state/settings)
- Prepare prompt messages (session/task memory/instructions if enabled)
- Apply compaction (if configured)
- Call model (streaming or non-streaming)
- If tool calls are returned:
  - run TrustGate decision first
  - execute only if allowed
  - normalize tool result envelope
  - feed tool result back to model
- Repeat until final output or exit condition
- Write artifacts/events best-effort for replay/debug
This design keeps side effects behind explicit gates and makes failures inspectable.
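The steps above can be sketched as a run loop. This is a minimal, illustrative Rust sketch with a scripted stand-in for the model call; none of these types or names are LocalAgent's actual internals.

```rust
// Illustrative run-loop sketch -- names are mine, not LocalAgent's API.

#[derive(Debug, PartialEq)]
enum GateDecision { Allow, Deny }

struct ToolCall { name: String, args: String }

// A scripted stand-in for successive model responses.
enum ModelTurn { Final(String), Tools(Vec<ToolCall>) }

// Trust gate: evaluated before any side effect. Restrictive by
// default -- only an allow-listed tool passes.
fn trust_gate(call: &ToolCall) -> GateDecision {
    if call.name == "read_file" { GateDecision::Allow } else { GateDecision::Deny }
}

fn execute_tool(call: &ToolCall) -> String {
    format!("ok: {}({})", call.name, call.args)
}

fn run(turns: Vec<ModelTurn>) -> String {
    let mut transcript = Vec::new();
    for turn in turns {
        match turn {
            // Final output ends the loop.
            ModelTurn::Final(text) => { transcript.push(text); break; }
            ModelTurn::Tools(calls) => {
                for call in calls {
                    // Gate first; execute only if allowed.
                    let result = match trust_gate(&call) {
                        GateDecision::Allow => execute_tool(&call),
                        GateDecision::Deny => format!("denied: {}", call.name),
                    };
                    // Normalized result is fed back to the next model turn.
                    transcript.push(result);
                }
            }
        }
    }
    transcript.join("\n")
}

fn main() {
    let turns = vec![
        ModelTurn::Tools(vec![
            ToolCall { name: "read_file".into(), args: "README.md".into() },
            ToolCall { name: "shell".into(), args: "rm -rf /".into() },
        ]),
        ModelTurn::Final("done".into()),
    ];
    println!("{}", run(turns));
}
```

The point of the shape, as in the real runtime, is that no tool executes without first passing through the gate, and every result lands in one inspectable transcript.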
## Why this is better than wrapper-only trust
External wrappers are useful, but they’re limited when tool execution happens inside another runtime you don’t control.
With LocalAgent:
- tool identity/args are first-class internal data
- policy and approvals are evaluated before side effects
- event/audit/run artifacts are generated in one execution graph
- replay and verification use the same runtime semantics
In short: security and reliability controls are part of the execution model, not bolted on.
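Because tool identity and args are plain internal data, the policy decision and the audit record can both be produced before anything runs. A minimal sketch of that idea in Rust (the types and the three-way decision are illustrative assumptions, not LocalAgent's internals):

```rust
// Illustrative sketch of "policy and approvals are evaluated before
// side effects" -- types and names are mine, not LocalAgent's.

#[derive(Debug, PartialEq)]
enum Decision { Allow, Deny, NeedsApproval }

struct Policy { allow: Vec<&'static str>, ask: Vec<&'static str> }

// The decision (and its audit entry) is computed from the tool
// identity alone, before any execution happens.
fn decide(policy: &Policy, tool: &str, audit: &mut Vec<String>) -> Decision {
    let d = if policy.allow.iter().any(|t| *t == tool) { Decision::Allow }
        else if policy.ask.iter().any(|t| *t == tool) { Decision::NeedsApproval }
        else { Decision::Deny };
    audit.push(format!("{tool} -> {d:?}"));
    d
}

fn main() {
    let policy = Policy { allow: vec!["read_file"], ask: vec!["write_file"] };
    let mut audit = Vec::new();
    assert_eq!(decide(&policy, "read_file", &mut audit), Decision::Allow);
    assert_eq!(decide(&policy, "write_file", &mut audit), Decision::NeedsApproval);
    assert_eq!(decide(&policy, "shell", &mut audit), Decision::Deny);
    // The audit trail records denied calls too.
    assert_eq!(audit.len(), 3);
    println!("{}", audit.join("\n"));
}
```

A wrapper around someone else's runtime can only observe effects after the fact; evaluating policy on first-class tool data is what makes the deny-before-execute ordering enforceable.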
## Quickstart

```bash
cargo install --path . --force
localagent init
localagent doctor --provider lmstudio
localagent --provider lmstudio --model <model> chat --tui
```
One-shot run:

```bash
localagent --provider ollama --model qwen3:8b --prompt "Summarize README.md" run
```
## Slow hardware notes

On slow CPUs or setups with a long time to first token, automatic retries can resend the prompt before the first attempt finishes, which makes for a bad experience. While debugging, raise the timeouts and disable retries:

```bash
localagent --provider llamacpp \
  --base-url http://localhost:5001/v1 \
  --model default \
  --http-timeout-ms 300000 \
  --http-stream-idle-timeout-ms 120000 \
  --http-max-retries 0 \
  --prompt "..." run
```
## What I’ve learned so far
The biggest reliability gains came from process constraints, not model hype:
- bounded tasks
- strict output expectations
- pre-exec arg validation
- deterministic evals + baselines
- replayable artifacts for root-cause debugging
For high-ambiguity reasoning, I still route to stronger hosted models.
For a lot of productivity helper work, local models are viable when the runtime is disciplined.
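As one concrete example of "pre-exec arg validation": rejecting malformed tool arguments before execution cheaply catches a whole class of local-model failures. An illustrative Rust sketch (my example, not the project's actual validator):

```rust
// Illustrative pre-execution argument validation: reject a bad
// path argument before any side effect. Not LocalAgent's code.

fn validate_path_arg(path: &str) -> Result<(), String> {
    if path.is_empty() {
        return Err("empty path".into());
    }
    // Keep the tool inside the working directory: no absolute
    // paths, no parent-directory traversal components.
    if path.starts_with('/') || path.split('/').any(|c| c == "..") {
        return Err(format!("path escapes workdir: {path}"));
    }
    Ok(())
}

fn main() {
    assert!(validate_path_arg("src/main.rs").is_ok());
    assert!(validate_path_arg("../etc/passwd").is_err());
    assert!(validate_path_arg("/etc/passwd").is_err());
    println!("validation checks passed");
}
```

The validator runs before the gate ever approves execution, so a model that hallucinates an out-of-tree path gets a structured error back instead of a side effect.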
## Current docs
- README: project overview + workflows
- CLI reference: complete command/flag map
- Provider setup guide (LM Studio/Ollama/llama.cpp)
- Templates, policy docs, and eval docs
Repo: https://github.com/CalvinSturm/LocalAgent
## Feedback I’d love
- What local model + runtime combos are most stable for tool-calling?
- Which prompt/output constraints improved reliability most for you?
- What would make local-first coding workflows feel “production-ready”?
If this is useful, I can write a follow-up with concrete eval/baseline workflows and model routing strategy.