Everyone’s building AI agents these days, and I’ve been building them myself for several years. At some point, I got tired of the same friction: vendor playgrounds that don’t support variables or multi-provider comparisons, cloud SaaS tools that are great at one thing (logging, evals, tracing) but force you to stitch four of them together, and all of it routing your prompts and API keys through someone else’s servers.
So I (with the help of AI and other humans) built Reticle — a local desktop app for designing, running, and evaluating LLM scenarios and agents.
Yes, you could do all of this with code. Just like you can test APIs with curl. But Postman exists for a reason — the right GUI collapses the feedback loop, makes iteration faster, and catches issues you'd only find after shipping. That's what Reticle is trying to be for AI development.
Here’s how I built it, and the key decisions behind each part.
The stack: Tauri + React + Rust + SQLite
Reticle is a Tauri app — a React frontend with a Rust backend, packaged as a native desktop binary. Everything lives locally: all scenarios, agents, test cases, run history, and API keys are stored in a SQLite database on your machine. No account, no sync, no cloud.
The local-first approach was the core idea behind this app. Your keys never leave your machine. Your prompts and traces are yours alone. There’s no subscription standing between you and your own development environment, and no vendor lock-in on your iteration history.
There’s also a practical performance angle: a local SQLite database with no network round-trips is fast in a way that cloud-backed tools simply can’t match for tight iteration loops. When you’re running evals across 50 test cases, that difference is noticeable.
This wasn’t just a privacy checkbox for me — it was a first-class design constraint. Everything else in Reticle’s architecture follows from it.
Scenarios and Agents: the two first-class citizens
Reticle is built around two core primitives. Scenarios are single-shot LLM calls — a system prompt, conversation history, model config, and variables. Agents are ReAct loops where the model reasons, calls tools, gets results, and iterates until it reaches a final answer.
Scenarios are where you work out what the model should say and how it should say it. The {{variable}} syntax lets you define a prompt template once and fill in values per test run, because in production, prompts are always templates, and testing them with hardcoded strings doesn't reflect real behavior. The same scenario can be run across OpenAI, Anthropic, and Google in one click, with outputs, latency, and cost landing side by side for a direct comparison.
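To make the templating idea concrete, here’s a minimal sketch of `{{variable}}` substitution in TypeScript. The function name and behavior are illustrative assumptions, not Reticle’s actual API:

```typescript
// Hypothetical sketch of {{variable}} substitution; fillTemplate is an
// illustrative name, not Reticle's internal function.
type Vars = Record<string, string>;

function fillTemplate(template: string, vars: Vars): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match, name: string) => {
    if (!(name in vars)) {
      // Fail loudly instead of silently shipping "{{name}}" to the model.
      throw new Error(`Missing value for template variable "${name}"`);
    }
    return vars[name];
  });
}

const prompt = fillTemplate(
  "You are a support agent for {{product}}. Greet {{user}} politely.",
  { product: "Reticle", user: "Ada" }
);
// prompt === "You are a support agent for Reticle. Greet Ada politely."
```

Failing fast on a missing variable is the point: a hardcoded string test would never catch the case where production fills the template with an empty or absent value.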
Agents are where things get more complex and more interesting. The hard part of building agents isn’t getting them to work; it’s debugging them when they don’t. Every agent run streams a structured event log in real time: loop iterations, LLM requests with exact messages, tool calls with arguments and results, token usage and latency per step. When the model passes a wrong argument to a tool or gets stuck in a loop, you see exactly where and why.
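A structured event log makes this kind of debugging mechanical. As an illustration (this is an assumed shape, not Reticle’s real event schema), here’s how a typed log makes a runaway loop trivially detectable:

```typescript
// Illustrative event shapes for an agent run log; not Reticle's actual schema.
type AgentEvent =
  | { kind: "llm_request"; iteration: number; messages: string[] }
  | { kind: "tool_call"; iteration: number; tool: string; args: unknown }
  | { kind: "tool_result"; iteration: number; tool: string; result: unknown }
  | { kind: "final_answer"; iteration: number; text: string };

// Flag a stuck loop: the same tool called with identical arguments
// `threshold` or more times in one run.
function detectStuckLoop(events: AgentEvent[], threshold = 3): string | null {
  const counts = new Map<string, number>();
  for (const e of events) {
    if (e.kind !== "tool_call") continue;
    const key = `${e.tool}:${JSON.stringify(e.args)}`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n >= threshold) return key;
  }
  return null;
}
```

With unstructured text logs you’d be eyeballing scrollback for this; with typed events it’s a ten-line function.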
Runs history and built-in usage tracking
Every run also records token usage and calculates an estimated cost based on each model’s published per-token pricing. It’s an approximation, not your exact cloud invoice. But it’s accurate enough to answer the questions that actually matter during development: which model is cheapest for this use case, how much does a full eval suite cost to run, and why did that last agent run cost 10× the previous one. That last question usually has an answer — a runaway loop, an unexpectedly long context, a model being used where a smaller one would’ve been fine. Token-level visibility makes it findable instead of mysterious.
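The arithmetic behind the estimate is simple. Here’s a sketch with placeholder prices (the numbers are made up for illustration, not current vendor rates):

```typescript
// Cost estimate from token usage and per-token pricing. Prices are
// placeholders, not any vendor's actual rates.
interface Usage { inputTokens: number; outputTokens: number; }
interface Pricing { inputPerMTok: number; outputPerMTok: number; } // USD per 1M tokens

function estimateCost(usage: Usage, pricing: Pricing): number {
  return (
    (usage.inputTokens / 1_000_000) * pricing.inputPerMTok +
    (usage.outputTokens / 1_000_000) * pricing.outputPerMTok
  );
}

// Example: 12,000 input + 800 output tokens at $3 / $15 per 1M tokens
// comes to roughly $0.048 ($0.036 input + $0.012 output).
const cost = estimateCost(
  { inputTokens: 12_000, outputTokens: 800 },
  { inputPerMTok: 3, outputPerMTok: 15 }
);
```

Two lines of math, but having it computed per run and per step is what turns “why was that expensive?” into a lookup instead of a spreadsheet session.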
All runs are stored and fully inspectable after the fact. You can go back to any previous execution, see exactly what the model received and returned, and compare behavior across runs. This matters more than it sounds: the number of times I’ve made a change, gotten a worse result, and had nothing to compare against used to be embarrassing. Now the history is just there.
Evals: text matching, schema validation, and LLM-as-judge
The eval system is what helps an engineer sleep at night. Prompt changes, model upgrades, and tool updates are all silent regressions waiting to happen unless you have assertions in place.
Reticle supports five assertion types: contains/equals/not_contains for text, json_schema for structured output validation via AJV, tool_called/tool_sequence for verifying agent behavior, and llm_judge for anything subjective.
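The text assertions are the simplest of the five. A minimal sketch of how they check an output (the types and function here are illustrative, not Reticle’s internal API):

```typescript
// Illustrative check for the three text assertion types; not Reticle's
// actual implementation.
type TextAssertion =
  | { type: "contains"; value: string }
  | { type: "equals"; value: string }
  | { type: "not_contains"; value: string };

function checkText(output: string, a: TextAssertion): boolean {
  switch (a.type) {
    case "contains":     return output.includes(a.value);
    case "equals":       return output === a.value;
    case "not_contains": return !output.includes(a.value);
  }
}
```

The schema and tool assertions follow the same pattern, just matching against parsed JSON (via AJV) or the recorded tool-call events instead of raw text.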
The llm_judge type is the most useful for non-deterministic outputs. You write a criteria statement in plain English — "the response should be empathetic and avoid technical jargon" — and delegate evaluation to a configurable model (default: gpt-4o-mini) at temperature 0. It returns PASS or FAIL with a reason. For a huge category of real-world outputs that can't be rule-checked, this makes testing practical.
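The judge flow boils down to two pure functions around the LLM call: build a deterministic prompt, then parse the verdict. The prompt wording below is my assumption of the general shape, not Reticle’s exact template:

```typescript
// Hedged sketch of the LLM-as-judge flow. The prompt wording and verdict
// format are assumptions, not Reticle's exact template.
function buildJudgePrompt(criteria: string, output: string): string {
  return [
    "You are an evaluator. Judge the RESPONSE against the CRITERIA.",
    `CRITERIA: ${criteria}`,
    `RESPONSE: ${output}`,
    'Answer with exactly "PASS: <reason>" or "FAIL: <reason>".',
  ].join("\n");
}

function parseVerdict(reply: string): { pass: boolean; reason: string } {
  const m = reply.trim().match(/^(PASS|FAIL):\s*(.*)$/s);
  if (!m) throw new Error(`Unparseable judge reply: ${reply}`);
  return { pass: m[1] === "PASS", reason: m[2] };
}
```

Constraining the judge to a rigid output format (and temperature 0) is what makes an inherently fuzzy evaluation reproducible enough to put in a test suite.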
Try it / tear it apart
Reticle is open source and in public beta. If you’re building agents and the current tooling landscape feels as fragmented to you as it did to me, give it a try.
Download: reticle.run
Source: github.com/fwdai/reticle
I’d genuinely love feedback — what’s missing, what’s wrong, what you’d prioritize. And I’m curious: how are you building and testing your agents today? What does your workflow look like? Drop it in the comments — I’m always looking to learn how others are approaching this.
If you made it to the end, a ⭐ for the project on GitHub helps a lot!

