Ian Parent

Posted on May 2 • Edited on Jul 7 • Originally published at iris-eval.com

What changed in Iris v0.4.0

#mcp #aiagents #observability #opensource

Iris v0.4.0 ships today. It's the release where protocol-native eval crosses from "deterministic rules" into "semantic scoring" — without giving up any of what made the deterministic layer work.

Three headline features plus a lot of infrastructure work that quietly compounds. I'll go through each, why it matters, and how it fits the thesis.

1 — LLM-as-Judge, as a real MCP tool

Heuristic rules catch a lot: length, keyword overlap, PII patterns, prompt-injection signatures, hallucination markers. They don't catch semantic quality. "Did the output actually answer the user's question?" is not a regex.

v0.4.0 adds a dedicated tool for that: evaluate_with_llm_judge.

Five templates — accuracy (hallucination detection), helpfulness (does it address the ask), safety (harm potential beyond regex PII), correctness (vs a reference answer), faithfulness (RAG grounding vs provided sources). Each returns a 0..1 score plus a 1-3 sentence rationale plus a per-dimension breakdown.

Two design decisions worth calling out:

Cost-capped, pessimistically. Before every call, Iris estimates worst-case cost (all max_output_tokens billable) and refuses if it would exceed IRIS_LLM_JUDGE_MAX_COST_USD_PER_EVAL. A guard that only triggers after the money is spent is not a guard.

Keys read at call time, not startup. A missing IRIS_ANTHROPIC_API_KEY or IRIS_OPENAI_API_KEY only fails the specific tool invocation that needs it. The rest of Iris keeps working. Configuration is progressive.
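To make the call-time lookup concrete, here is a minimal sketch. It assumes the environment is passed in as a plain object (in Node that would be process.env); the function and type names are hypothetical, and only the two variable names come from this post.

```typescript
// Hypothetical sketch: provider keys are looked up per tool call, not at startup.
type Provider = "anthropic" | "openai";
type Env = Record<string, string | undefined>;

// Variable names come from the post; everything else here is illustrative.
const KEY_VARS: Record<Provider, string> = {
  anthropic: "IRIS_ANTHROPIC_API_KEY",
  openai: "IRIS_OPENAI_API_KEY",
};

type KeyLookup =
  | { status: "ok"; key: string }
  | { status: "missing"; error: string };

// Called from inside the tool handler, so a missing key never blocks startup.
function resolveApiKey(provider: Provider, env: Env): KeyLookup {
  const variable = KEY_VARS[provider];
  const key = (env[variable] ?? "").trim();
  if (key === "") {
    return {
      status: "missing",
      error: `${variable} is not set; ${provider} judge calls are unavailable`,
    };
  }
  return { status: "ok", key };
}

// Only the invocation that needs the key reports the problem.
function handleJudgeCall(provider: Provider, env: Env): string {
  const lookup = resolveApiKey(provider, env);
  if (lookup.status === "missing") {
    return `tool error: ${lookup.error}`;
  }
  return `judge call would run against ${provider}`;
}
```

Because the lookup returns a result instead of throwing at import time, tools that never touch an LLM are unaffected by a missing key.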

Seven supported models across Anthropic and OpenAI. Full pricing table shipped in the repo. Unknown model IDs fail upfront — the cap can't be enforced without pricing data.
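The pessimistic cap is simple arithmetic. A minimal sketch, assuming per-million-token pricing: the model IDs and prices below are placeholders rather than the table shipped in the repo, and the function names are hypothetical.

```typescript
// USD per one million tokens.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Placeholder entries for illustration only, not real model IDs or prices.
const PRICES = new Map<string, ModelPrice>([
  ["example-small", { inputPerMTok: 0.15, outputPerMTok: 0.6 }],
  ["example-large", { inputPerMTok: 15, outputPerMTok: 75 }],
]);

// Worst case: every one of maxOutputTokens is assumed to be generated and billed.
function worstCaseCostUsd(model: string, inputTokens: number, maxOutputTokens: number): number {
  const price = PRICES.get(model);
  if (price === undefined) {
    // Without pricing data the cap cannot be enforced, so fail before any call is made.
    throw new Error(`Unknown model "${model}": no pricing data`);
  }
  return (inputTokens * price.inputPerMTok + maxOutputTokens * price.outputPerMTok) / 1e6;
}

// Refuses up front; returns the estimate when the call is allowed to proceed.
function assertWithinCap(
  model: string,
  inputTokens: number,
  maxOutputTokens: number,
  capUsd: number,
): number {
  const estimate = worstCaseCostUsd(model, inputTokens, maxOutputTokens);
  if (estimate > capUsd) {
    throw new Error(`Estimated worst case $${estimate.toFixed(5)} exceeds cap $${capUsd}`);
  }
  return estimate;
}
```

With these placeholder prices, 1,000 input tokens plus a 500-token output ceiling on example-small estimates to about $0.00045, far under the $0.25 default cap, while 2,000 input tokens plus a 4,000-token ceiling on example-large estimates to $0.33 and would be refused.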

Real measured cost. I ran a 5-sample smoke through gpt-4o-mini on ship day — accuracy template against both correct and hallucinated facts, helpfulness template on direct vs vague answers, safety template on an appropriate refusal. Every verdict matched what a human evaluator would call it (fabricated Stanford study flagged as accuracy failure with a supporting-passage-free rationale; vague non-answer scored 0.10; crisis-line refusal scored 1.00). Total spend across 5 calls: $0.00047 — an average of $0.00009 per eval. On claude-haiku it's about $0.0003; on opus it's $0.015-$0.025. The cap-per-eval is $0.25 by default; most teams will want to lower it.

Why this matters for the category: LLM-as-judge was the main differentiator competitors pointed to when Iris shipped with only heuristic rules. v0.4 closes that gap while keeping the MCP-native and runtime advantages. The comparison stops being "deterministic vs semantic" and starts being "MCP-runtime + both vs notebook + semantic only".

2 — Semantic citation verification

When an agent emits "A 2019 Stanford study found 73% of users prefer dark mode [1]", you want two things: (a) did it cite anything at all, and (b) does the cited source actually support the claim.

The v0.3.1 no_hallucination_markers rule handles (a) with a fabricated-citation heuristic — fires when numbered citations co-occur with expert markers without real source resolution. It's fast and free, and it catches the worst offenders.

v0.4.0 adds a dedicated tool for (b): verify_citations.

The pipeline has three phases. First, extract four citation kinds (numbered [N], parenthetical (Author, Year), bare URLs, DOIs) from the output. Second, resolve URL and DOI citations through an SSRF-guarded fetcher with eight defense layers, top to bottom: scheme allowlist, private/localhost/RFC-1918/link-local/cloud-metadata block, optional domain allowlist (IRIS_CITATION_DOMAINS), manual redirect chase with per-hop SSRF re-check, 4xx/5xx reject, non-text reject, 5MB cap with truncation, 10s timeout. Third, ask the judge for a per-claim verdict: "does this source actually support this claim?"
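The extraction phase can be sketched with four regular expressions. These patterns are illustrative assumptions, deliberately simple, and not the ones Iris ships; the type and function names are hypothetical.

```typescript
type CitationKind = "numbered" | "parenthetical" | "url" | "doi";

interface Citation {
  kind: CitationKind;
  raw: string;   // the text exactly as matched
  value: string; // the normalized identifier
}

// Illustrative patterns only. A DOI embedded in a URL would be reported twice
// by this sketch; a real extractor would need to de-duplicate.
const PATTERNS: Array<{ kind: CitationKind; pattern: RegExp; group: number }> = [
  { kind: "numbered", pattern: /\[(\d+)\]/, group: 1 },
  { kind: "parenthetical", pattern: /\([A-Z][A-Za-z'-]+(?: et al\.)?,? \d{4}\)/, group: 0 },
  { kind: "url", pattern: /https?:\/\/[^\s)\]"<>]+/, group: 0 },
  { kind: "doi", pattern: /\b10\.\d{4,9}\/[^\s"<>]+/, group: 0 },
];

function extractCitations(text: string): Citation[] {
  const found: Citation[] = [];
  for (const { kind, pattern, group } of PATTERNS) {
    // A fresh global regex per pass, so lastIndex state is never shared.
    const re = new RegExp(pattern.source, "g");
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      // Drop sentence punctuation that trails a URL or DOI.
      const value = (m[group] ?? m[0]).replace(/[.,;:!?]+$/, "");
      found.push({ kind, raw: m[0], value });
    }
  }
  return found;
}
```

Results come back grouped by kind in the order the patterns are listed, not in document order.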

Outbound HTTP is opt-in. allow_fetch=true on the tool call or IRIS_CITATION_ALLOW_FETCH=1 in the environment. Iris refuses to reach out to arbitrary URLs generated by an LLM unless the operator has said yes, then narrows further with a domain allowlist for production deployments.
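A rough sketch of the first few guard layers (opt-in, scheme allowlist, private-address block, domain allowlist) might look like the following. It is an assumption-laden illustration, not Iris's fetcher: the names are hypothetical, it only inspects the literal hostname, and a production guard also has to check the addresses DNS actually resolves to and re-check on every redirect hop.

```typescript
interface FetchPolicy {
  allowFetch: boolean;       // allow_fetch=true or IRIS_CITATION_ALLOW_FETCH=1
  allowedDomains?: string[]; // e.g. parsed from IRIS_CITATION_DOMAINS
}

type FetchDecision =
  | { status: "allowed"; url: string }
  | { status: "refused"; reason: string };

// Loopback, RFC 1918, link-local (which includes the 169.254.169.254 metadata address).
function isBlockedIpv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (m === null) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

function checkFetchAllowed(rawUrl: string, policy: FetchPolicy): FetchDecision {
  if (!policy.allowFetch) {
    return { status: "refused", reason: "outbound fetch is not enabled" };
  }
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return { status: "refused", reason: "unparseable URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { status: "refused", reason: `scheme ${url.protocol} is not allowed` };
  }
  const host = url.hostname.toLowerCase();
  // Conservative choice for this sketch: every IPv6 literal ("[...]") is refused.
  if (host === "localhost" || host.endsWith(".localhost") || host.startsWith("[") || isBlockedIpv4(host)) {
    return { status: "refused", reason: `host ${host} is private or local` };
  }
  const domains = policy.allowedDomains ?? [];
  if (domains.length > 0 && !domains.some((d) => host === d || host.endsWith("." + d))) {
    return { status: "refused", reason: `host ${host} is not on the domain allowlist` };
  }
  return { status: "allowed", url: url.toString() };
}
```

The default answer is a refusal: nothing is fetched unless allowFetch is set, and the allowlist narrows it further when present.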

The tool returns an overall support ratio (supported / resolved), per-citation verdicts with rationale and confidence, and the total cost across all judge calls, which is also capped via max_cost_usd_total. When there are no resolvable citations, score is null and passed is true: the tool degrades gracefully rather than failing the run.
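The scoring rule, including the no-citations case, fits in a few lines. In this sketch the field names and the pass threshold are assumptions for illustration; the post only specifies the ratio itself and the null-score behavior.

```typescript
type Verdict = "supported" | "unsupported" | "unresolved";

interface CitationVerdict {
  citation: string;
  verdict: Verdict;
  rationale: string;
  confidence: number; // 0..1
}

interface CitationSummary {
  score: number | null; // supported / resolved, or null when nothing resolved
  passed: boolean;
  supported: number;
  resolved: number;
}

// passThreshold is a hypothetical parameter; the post does not state one.
function summarize(verdicts: CitationVerdict[], passThreshold = 0.5): CitationSummary {
  const resolved = verdicts.filter((v) => v.verdict !== "unresolved");
  if (resolved.length === 0) {
    // Nothing to verify: degrade gracefully instead of failing the run.
    return { score: null, passed: true, supported: 0, resolved: 0 };
  }
  const supported = resolved.filter((v) => v.verdict === "supported").length;
  const score = supported / resolved.length;
  return { score, passed: score >= passThreshold, supported, resolved: resolved.length };
}
```

Unresolved citations are excluded from the denominator, so a dead link lowers coverage but does not count as an unsupported claim.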

This is a direction no competitor is positioned for. The category's next move isn't more rules — it's grounding the rules in real sources.

3 — OpenTelemetry trace export

Sentry, Grafana, Datadog, Tempo, Jaeger, Honeycomb, New Relic — every observability backend an enterprise already runs accepts OTLP/HTTP. As of v0.4, Iris speaks it.

Setting IRIS_OTEL_ENDPOINT turns on best-effort async export from every log_trace call. Iris still writes the trace locally first — the OTel export is a side effect, not a dependency. If the collector is down, the trace is still stored; if the collector is fast, you see it in Grafana within seconds.
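The "side effect, not a dependency" ordering can be sketched as follows. This is an illustration under assumptions: the store is a plain array standing in for local storage, and the exporter is an injected function, so none of these names are Iris's.

```typescript
interface Trace {
  id: string;
  name: string;
}

// Any async sender, for example a function that POSTs to the configured collector.
type Exporter = (trace: Trace) => Promise<void>;

function logTrace(trace: Trace, store: Trace[], exporter?: Exporter): void {
  // 1. The local write happens first and is the only step that can fail the call.
  store.push(trace);

  // 2. Export is fire-and-forget: scheduled asynchronously, never awaited, and any
  //    failure (rejection or synchronous throw) is swallowed.
  if (exporter === undefined) return;
  Promise.resolve()
    .then(() => exporter(trace))
    .catch(() => {
      /* best-effort: a down collector must not affect the caller */
    });
}
```

A real implementation would probably log or count the swallowed failures; the point here is only that the caller's result never depends on the collector.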

Implementation detail worth flagging: Iris carries zero @opentelemetry/* dependencies. The OTLP/HTTP JSON wire format is a frozen spec; we use native fetch against it. This is the same pattern as the LLM client and the citation resolver — hand-rolled against documented wire formats. Supply-chain surface stays small; the artifact stays auditable. Teams evaluating eval infrastructure care about this.
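For a sense of how little is needed to speak OTLP/HTTP JSON without an SDK, here is a sketch of building a single-span payload. The input shape, the helper names, and the service name are assumptions; the envelope (resourceSpans, scopeSpans, hex IDs, nanosecond timestamps as strings, POST to /v1/traces) follows the OTLP specification. Treating IRIS_OTEL_ENDPOINT as a base URL is also an assumption.

```typescript
interface OtlpAttribute {
  key: string;
  value: { stringValue: string };
}

interface OtlpSpan {
  traceId: string; // 32 hex characters
  spanId: string;  // 16 hex characters
  name: string;
  kind: number;    // 1 = SPAN_KIND_INTERNAL
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes: OtlpAttribute[];
}

interface OtlpTracePayload {
  resourceSpans: Array<{
    resource: { attributes: OtlpAttribute[] };
    scopeSpans: Array<{ scope: { name: string }; spans: OtlpSpan[] }>;
  }>;
}

// Hypothetical input shape for this sketch.
interface SpanInput {
  traceId: string;
  spanId: string;
  name: string;
  startMs: number;
  endMs: number;
  attributes: Record<string, string>;
}

// OTLP JSON encodes 64-bit integers as strings; appending six zeros turns
// whole milliseconds into nanoseconds without losing precision.
function msToUnixNano(ms: number): string {
  return `${Math.floor(ms)}000000`;
}

function buildOtlpPayload(span: SpanInput, serviceName: string): OtlpTracePayload {
  const attributes = Object.keys(span.attributes).map((key) => ({
    key,
    value: { stringValue: span.attributes[key] },
  }));
  return {
    resourceSpans: [{
      resource: { attributes: [{ key: "service.name", value: { stringValue: serviceName } }] },
      scopeSpans: [{
        scope: { name: serviceName },
        spans: [{
          traceId: span.traceId,
          spanId: span.spanId,
          name: span.name,
          kind: 1,
          startTimeUnixNano: msToUnixNano(span.startMs),
          endTimeUnixNano: msToUnixNano(span.endMs),
          attributes,
        }],
      }],
    }],
  };
}

// Assumes the endpoint is a collector base URL such as http://localhost:4318.
function otlpTracesUrl(endpoint: string): string {
  return endpoint.replace(/\/+$/, "") + "/v1/traces";
}

// Sending is then a single native call, for example:
//   fetch(otlpTracesUrl(endpoint), {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(payload),
//   })
```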

gRPC transport isn't in v0.4. Teams on gRPC-only collectors should front them with an OTel Collector accepting HTTP and forwarding gRPC — that's the standard pattern anyway.

What else shipped

Less headline-grade, but each one earns its line:

MCP tool surface expanded 3 → 9. Along with evaluate_with_llm_judge and verify_citations, v0.4 adds lifecycle management: list_rules, deploy_rule, delete_rule, delete_trace. Agents can now deploy a new rule when they see a failure pattern and clean up when the rule is obsolete — all via MCP. No dashboard trip.
Tenant isolation scaffolding. Every storage method now takes a TenantId. OSS deployments see only 'local'; Cloud tier (v0.5) gets workspace isolation without a future data migration. Four defense layers: branded type, runtime assert, composite indexes, tenant-scoped queries. 132 existing production traces migrated cleanly in our v0.3→v0.4 smoke test.
Supply-chain integrity. Every release artifact now ships with an SBOM, cosign keyless signatures, and SLSA build-provenance attestations. cosign verify ghcr.io/iris-eval/mcp-server:v0.4.0 works out of the box. This is the bar we think every MCP server should meet.
Playwright E2E in CI on Chromium + Firefox (not WebKit-on-Linux — that isn't Safari and the assurance is weak). Storybook 10 catalog, Lighthouse CI with realistic floors, bundle-size budget gate, axe a11y tests for every chart state.
5/5 on the Glama Tool Definition Quality Score. Every one of the 9 tools carries MCP annotations (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) plus a 5-section description (Behavior / Output shape / Use when / Don't use when / Error modes). Integration tests assert that the annotations survive a round-trip and that each description contains every section, so regressions fail CI.
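As an illustration of the last item, a tool definition with annotations and a sectioned description might look like this, along with the kind of check a CI test could run. The four annotation fields are the ones named in the MCP specification; the description wording, the annotation values chosen for delete_trace, and the "Section: text" line layout are assumptions made for this sketch.

```typescript
interface ToolAnnotations {
  readOnlyHint?: boolean;
  destructiveHint?: boolean;
  idempotentHint?: boolean;
  openWorldHint?: boolean;
}

interface ToolDefinition {
  name: string;
  description: string;
  annotations: ToolAnnotations;
}

const REQUIRED_SECTIONS = ["Behavior", "Output shape", "Use when", "Don't use when", "Error modes"];

// Hypothetical definition; the real description text and hint values may differ.
const deleteTrace: ToolDefinition = {
  name: "delete_trace",
  description: [
    "Behavior: Permanently removes one stored trace by id.",
    "Output shape: An object reporting whether a trace was deleted.",
    "Use when: A trace was logged in error or must be purged.",
    "Don't use when: You only need to hide a trace from a view.",
    "Error modes: Fails if the id is malformed.",
  ].join("\n"),
  annotations: {
    readOnlyHint: false,
    destructiveHint: true,
    idempotentHint: true,
    openWorldHint: false,
  },
};

// Returns the sections that do not appear as a "Section:" line in the description.
function missingSections(tool: ToolDefinition): string[] {
  const lines = tool.description.split("\n").map((line) => line.trim());
  return REQUIRED_SECTIONS.filter(
    (section) => !lines.some((line) => line.startsWith(section + ":")),
  );
}
```

A CI test can then fail on any tool for which missingSections returns a non-empty list.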

Breaking changes

One, and it's worth naming explicitly:

IStorageAdapter — every method now takes tenantId: TenantId as the first parameter. Migration 004 backfills existing data to LOCAL_TENANT. Custom storage adapters need to update their signatures.

If you're running upstream Iris this doesn't affect you. If you've written a custom adapter: it's one added parameter per method, and the tenant type is exported.
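For custom adapter authors, the shape of the change looks roughly like this. It is a sketch under assumptions: the two methods and the in-memory adapter are hypothetical stand-ins for the real IStorageAdapter surface, and the validation pattern is made up. Only the tenantId-first convention, the TenantId and LOCAL_TENANT names, and the 'local' value come from this post.

```typescript
// Branded type: a plain string cannot be passed where a TenantId is expected.
type TenantId = string & { readonly __brand: "TenantId" };

const LOCAL_TENANT = "local" as TenantId;

// Runtime assert at the boundary; the allowed pattern is an assumption.
function assertTenantId(value: string): TenantId {
  if (!/^[a-z0-9][a-z0-9_-]{0,62}$/.test(value)) {
    throw new Error(`Invalid tenant id: ${JSON.stringify(value)}`);
  }
  return value as TenantId;
}

interface TraceRecord {
  id: string;
  payload: string;
}

// Hypothetical two-method slice of an adapter.
//   before: saveTrace(trace)            getTrace(id)
//   after:  saveTrace(tenantId, trace)  getTrace(tenantId, id)
interface StorageAdapterSketch {
  saveTrace(tenantId: TenantId, trace: TraceRecord): void;
  getTrace(tenantId: TenantId, id: string): TraceRecord | undefined;
}

class InMemoryAdapter implements StorageAdapterSketch {
  private rows = new Map<string, TraceRecord>();

  saveTrace(tenantId: TenantId, trace: TraceRecord): void {
    // Composite key mirrors a (tenant_id, id) index; ":" cannot occur in a tenant id.
    this.rows.set(`${tenantId}:${trace.id}`, trace);
  }

  getTrace(tenantId: TenantId, id: string): TraceRecord | undefined {
    // Every read is tenant-scoped, so one tenant cannot see another's rows.
    return this.rows.get(`${tenantId}:${id}`);
  }
}
```

OSS deployments would pass LOCAL_TENANT everywhere, so the extra argument is mechanical until multi-tenant storage is actually in play.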

Why it hangs together

The v0.4 feature set isn't a collection of unrelated wins. It's five paths serving the same thesis.

Deterministic rules are the fast path — free, reproducible, millisecond latency, catches the obvious failures. LLM-as-judge is the slow path — semantic, paid, seconds, catches the subtle failures. Citation verification is the grounded path — actually checking agent claims against sources, which is where accuracy-at-scale has to land.

OpenTelemetry is the integration path — Iris participates in the enterprise observability stack you already have.

Tenant isolation and supply-chain integrity are the production path — the things an enterprise buyer audits before they let something run next to their agents.

The release ships all five together because you need all five. One without the others is a weaker product.

What's queued for v0.5

Cloud Tier. Managed Iris. Multi-tenant with real workspaces. PostgreSQL adapter. Team dashboards. Alerting. The items moved from v0.4 into v0.5 are the ones that only make sense alongside the hosted offering.

For now, v0.4 is where the substrate pays off. The next six months of product are about running it at real customer volume.

npm install -g @iris-eval/mcp-server@0.4.0
iris-mcp --dashboard

Ship notes are cut. The dashboard is deployed. See you on the changelog.
