We’ve spent a lot of time internally running evals for our own agents. If you care about reliability in agentic systems, you know why this matters — models drift, prompts change, third party MCP tools get updated. A small change in one place can cause unexpected behavior somewhere else.
That’s why we’re excited to share something we’ve been using ourselves for months: SteelThread, our evaluation framework built on top of Portia Cloud.
You can try it for free on Portia!
While building our own automations on top of Portia, we realised it was an absolute joy to run evals with, thanks to two of its core features.
First, every agent run is captured in a structured state object called a PlanRunState — steps, tool calls, arguments, outputs. That makes very targeted evaluators trivial to write, whether deterministic or LLM-as-judge: you can count plan steps, validate the behaviour of a specific tool, review the tone of the final summary, and so on.

Second, we use Portia Cloud to store our agent runs. Whenever a multi-agent plan produces an outcome that is desirable (or undesirable), e.g. during agent development, we can take the inputs and outputs of that run (query, plan, plan run) and instantly turn them into an Eval dataset. Since we built SteelThread, we haven't needed to manually curate eval datasets from scratch at all.
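To make that concrete, here's a rough sketch of the kind of targeted, deterministic evaluator this state object enables. The class and field names below are illustrative placeholders, not the exact SteelThread or Portia SDK API:

```python
# Illustrative sketch only: class and field names are placeholders,
# not the exact SteelThread / Portia SDK API.
from dataclasses import dataclass


@dataclass
class EvalMetric:
    name: str
    score: float
    explanation: str


class MaxPlanStepsEvaluator:
    """Deterministic check: flag plan runs that blew past a step budget."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps

    def evaluate(self, plan_run_state) -> EvalMetric:
        # plan_run_state is assumed to expose the steps recorded for the run,
        # each with its tool calls, arguments and outputs.
        step_count = len(plan_run_state.steps)
        passed = step_count <= self.max_steps
        return EvalMetric(
            name="max_plan_steps",
            score=1.0 if passed else 0.0,
            explanation=f"{step_count} steps (budget: {self.max_steps})",
        )
```

Because the run is already structured, the evaluator doesn't need to parse logs or scrape transcripts; it just reads fields off the state object.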
Before SteelThread, we still felt the pain that many teams do. Creating and maintaining curated datasets was tedious. Balancing deterministic checks with LLM-as-judge evals was tricky. And running evals against real APIs often meant dealing with authentication, rate limits, or unintended side effects — so we’d spend hours stubbing tools just to test safely.
SteelThread wraps all of this into a single workflow inside Portia Cloud. It gives you two ways to keep your agents in check: Streams, which spot changes in behaviour in real time, and Evals, which let you run regression tests against a ground-truth dataset. Both Streams and Evals allow you to combine deterministic and LLM-as-judge evaluators. You can write your own evaluators, but SteelThread also ships a generous helping of off-the-shelf ones.
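As a minimal sketch of how the two kinds of evaluator can sit side by side — again with placeholder names standing in for SteelThread's real off-the-shelf evaluators and runner:

```python
# Illustrative sketch only: names are placeholders, not SteelThread's real
# off-the-shelf evaluators or runner API.

class FinalSummaryToneJudge:
    """LLM-as-judge check: ask a model to score the tone of the final summary."""

    def __init__(self, llm, rubric: str = "polite, concise, actionable"):
        self.llm = llm
        self.rubric = rubric

    def evaluate(self, plan_run_state) -> float:
        prompt = (
            "Rate the tone of this summary from 0 to 1 against the rubric "
            f"'{self.rubric}'. Reply with the number only.\n\n"
            f"{plan_run_state.final_output}"
        )
        return float(self.llm.complete(prompt))  # hypothetical LLM client call


def score_run(plan_run_state, evaluators) -> dict:
    """Apply a mix of deterministic and LLM-as-judge evaluators to one run."""
    return {type(e).__name__: e.evaluate(plan_run_state) for e in evaluators}
```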
Here is an example flow where we add a production agent run to an Eval dataset.
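In code, that flow might look roughly like this; the client methods shown are illustrative placeholders rather than the documented SteelThread endpoints:

```python
# Illustrative sketch only: method names are placeholders, not the
# documented Portia Cloud / SteelThread client API.

def promote_run_to_eval(client, plan_run_id: str, dataset_name: str) -> None:
    """Turn a stored production run into a ground-truth Eval case."""
    # Fetch the stored inputs and outputs for the run (query, plan, plan run).
    plan_run = client.get_plan_run(plan_run_id)

    # Append it to the named Eval dataset so future regressions replay it.
    client.add_eval_case(
        dataset=dataset_name,
        query=plan_run.query,
        plan=plan_run.plan,
        expected_outputs=plan_run.outputs,
    )
```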
Observability and evals are essential for building reliable agentic systems, and SteelThread just makes them easier. Paired with the Portia development SDK, it’s a powerful combo: build structured, debuggable agents, monitor them in production, and turn any incident into a regression test instantly.
If you want to try it, head over to Portia Dashboard or check out our GitHub repo!