I’ve been trying to make local AI workflows reliable for real day-to-day use: coding tasks, browser tasks, repeatable evals, and auditable tool execution.
I first tried adding trust/approval controls around existing agent CLIs. That approach hit a hard limit quickly: when tool execution is deeply native to the host app, external wrappers can’t reliably enforce policy boundaries.
So I built my own runtime: LocalAgent.
GitHub: https://github.com/CalvinSturm/LocalAgent
## Why I built this
I kept seeing the same failure pattern with local 20–30B models:
- brittle tool behavior
- occasional non-answers
- inconsistent step execution
- hard-to-debug failures without replayable state
The answer wasn’t just “pick a better model.”
The answer was to harden the runtime process:
- explicit safety gates
- deterministic artifacts
- policy + approvals
- eval + baseline comparisons
- replay + verification
## What LocalAgent is
LocalAgent is a local-first agent runtime CLI focused on control and reliability.
It supports:
- local providers: LM Studio, Ollama, llama.cpp server
- tool calling with hard gates
- trust workflows (policy, approvals, audit)
- replayable run artifacts
- MCP stdio tool sources (including Playwright MCP)
- deterministic eval harnesses
- TUI chat mode
## Safety defaults (important)
Defaults are intentionally restrictive:
- trust is off
- shell is disabled
- write tools are not exposed
- file write execution is disabled
You have to explicitly enable risky capabilities.
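To make the restrictive defaults concrete, here is a hypothetical config sketch. The key names are my invention for illustration only, not LocalAgent's actual settings schema; check the repo's docs for the real format.

```toml
# Hypothetical settings sketch -- key names are illustrative,
# not LocalAgent's actual configuration schema.
[trust]
enabled = false              # trust is off by default

[tools]
shell = false                # shell execution disabled
expose_write_tools = false   # write tools not exposed
file_write = false           # file write execution disabled
```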
## Architecture (high level)
At a high level, each run does:
- Build runtime context (provider/model/workdir/state/settings)
- Prepare prompt messages (session/task memory/instructions if enabled)
- Apply compaction (if configured)
- Call model (streaming or non-streaming)
- If tool calls are returned:
  - run TrustGate decision first
  - execute only if allowed
  - normalize tool result envelope
  - feed tool result back to model
- Repeat until final output or exit condition
- Write artifacts/events best-effort for replay/debug
This design keeps side effects behind explicit gates and makes failures inspectable.
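The steps above can be sketched as a run loop. This is a minimal, illustrative Rust sketch with a scripted stand-in for the model call; none of these types or names are LocalAgent's actual internals.

```rust
// Illustrative run-loop sketch -- names are mine, not LocalAgent's API.

#[derive(Debug, PartialEq)]
enum GateDecision { Allow, Deny }

struct ToolCall { name: String, args: String }

// A scripted stand-in for successive model responses.
enum ModelTurn { Final(String), Tools(Vec<ToolCall>) }

// Trust gate: evaluated before any side effect. Restrictive by
// default -- only an allow-listed tool passes.
fn trust_gate(call: &ToolCall) -> GateDecision {
    if call.name == "read_file" { GateDecision::Allow } else { GateDecision::Deny }
}

fn execute_tool(call: &ToolCall) -> String {
    format!("ok: {}({})", call.name, call.args)
}

fn run(turns: Vec<ModelTurn>) -> String {
    let mut transcript = Vec::new();
    for turn in turns {
        match turn {
            // Final output ends the loop.
            ModelTurn::Final(text) => { transcript.push(text); break; }
            ModelTurn::Tools(calls) => {
                for call in calls {
                    // Gate first; execute only if allowed.
                    let result = match trust_gate(&call) {
                        GateDecision::Allow => execute_tool(&call),
                        GateDecision::Deny => format!("denied: {}", call.name),
                    };
                    // Normalized result is fed back to the next model turn.
                    transcript.push(result);
                }
            }
        }
    }
    transcript.join("\n")
}

fn main() {
    let turns = vec![
        ModelTurn::Tools(vec![
            ToolCall { name: "read_file".into(), args: "README.md".into() },
            ToolCall { name: "shell".into(), args: "rm -rf /".into() },
        ]),
        ModelTurn::Final("done".into()),
    ];
    println!("{}", run(turns));
}
```

The point of the shape, as in the real runtime, is that no tool executes without first passing through the gate, and every result lands in one inspectable transcript.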
## Why this is better than wrapper-only trust
External wrappers are useful, but they’re limited when tool execution happens inside another runtime you don’t control.
With LocalAgent:
- tool identity/args are first-class internal data
- policy and approvals are evaluated before side effects
- event/audit/run artifacts are generated in one execution graph
- replay and verification use the same runtime semantics
In short: security and reliability controls are part of the execution model, not bolted on.
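Because tool identity and args are plain internal data, the policy decision and the audit record can both be produced before anything runs. A minimal sketch of that idea in Rust (the types and the three-way decision are illustrative assumptions, not LocalAgent's internals):

```rust
// Illustrative sketch of "policy and approvals are evaluated before
// side effects" -- types and names are mine, not LocalAgent's.

#[derive(Debug, PartialEq)]
enum Decision { Allow, Deny, NeedsApproval }

struct Policy { allow: Vec<&'static str>, ask: Vec<&'static str> }

// The decision (and its audit entry) is computed from the tool
// identity alone, before any execution happens.
fn decide(policy: &Policy, tool: &str, audit: &mut Vec<String>) -> Decision {
    let d = if policy.allow.iter().any(|t| *t == tool) { Decision::Allow }
        else if policy.ask.iter().any(|t| *t == tool) { Decision::NeedsApproval }
        else { Decision::Deny };
    audit.push(format!("{tool} -> {d:?}"));
    d
}

fn main() {
    let policy = Policy { allow: vec!["read_file"], ask: vec!["write_file"] };
    let mut audit = Vec::new();
    assert_eq!(decide(&policy, "read_file", &mut audit), Decision::Allow);
    assert_eq!(decide(&policy, "write_file", &mut audit), Decision::NeedsApproval);
    assert_eq!(decide(&policy, "shell", &mut audit), Decision::Deny);
    // The audit trail records denied calls too.
    assert_eq!(audit.len(), 3);
    println!("{}", audit.join("\n"));
}
```

A wrapper around someone else's runtime can only observe effects after the fact; evaluating policy on first-class tool data is what makes the deny-before-execute ordering enforceable.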
## Quickstart

```bash
cargo install --path . --force
localagent init
localagent doctor --provider lmstudio
localagent --provider lmstudio --model <model> chat --tui
```
One-shot run:

```bash
localagent --provider ollama --model qwen3:8b --prompt "Summarize README.md" run
```
## Slow hardware notes

On slow CPUs or setups with a long time to first token, automatic retries can resend the prompt before the first attempt finishes, which makes for a bad experience. While debugging, raise the timeouts and disable retries:

```bash
localagent --provider llamacpp \
  --base-url http://localhost:5001/v1 \
  --model default \
  --http-timeout-ms 300000 \
  --http-stream-idle-timeout-ms 120000 \
  --http-max-retries 0 \
  --prompt "..." run
```
## What I’ve learned so far
The biggest reliability gains came from process constraints, not model hype:
- bounded tasks
- strict output expectations
- pre-exec arg validation
- deterministic evals + baselines
- replayable artifacts for root-cause debugging
For high-ambiguity reasoning, I still route to stronger hosted models.
For a lot of productivity helper work, local models are viable when the runtime is disciplined.
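As one concrete example of "pre-exec arg validation": rejecting malformed tool arguments before execution cheaply catches a whole class of local-model failures. An illustrative Rust sketch (my example, not the project's actual validator):

```rust
// Illustrative pre-execution argument validation: reject a bad
// path argument before any side effect. Not LocalAgent's code.

fn validate_path_arg(path: &str) -> Result<(), String> {
    if path.is_empty() {
        return Err("empty path".into());
    }
    // Keep the tool inside the working directory: no absolute
    // paths, no parent-directory traversal components.
    if path.starts_with('/') || path.split('/').any(|c| c == "..") {
        return Err(format!("path escapes workdir: {path}"));
    }
    Ok(())
}

fn main() {
    assert!(validate_path_arg("src/main.rs").is_ok());
    assert!(validate_path_arg("../etc/passwd").is_err());
    assert!(validate_path_arg("/etc/passwd").is_err());
    println!("validation checks passed");
}
```

The validator runs before the gate ever approves execution, so a model that hallucinates an out-of-tree path gets a structured error back instead of a side effect.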
## Current docs
- README: project overview + workflows
- CLI reference: complete command/flag map
- Provider setup guide (LM Studio/Ollama/llama.cpp)
- Templates, policy docs, and eval docs
Repo: https://github.com/CalvinSturm/LocalAgent
## Feedback I’d love
- What local model + runtime combos are most stable for tool-calling?
- Which prompt/output constraints improved reliability most for you?
- What would make local-first coding workflows feel “production-ready”?
If this is useful, I can write a follow-up with concrete eval/baseline workflows and model routing strategy.