Mukunda Rao Katta

Posted on May 25

A six-concern production harness for Nemotron agents on Crusoe Managed Inference

#hermeschallenge #ai #llm #agents

The first time I shipped a Nemotron-backed agent on a managed inference endpoint, I had an answer for "does it work?" and no good answer for any of the questions that came after.

How much did this run cost?
Did it try to call hosts I never approved?
Did it call tools with bad arguments?
Did the output drift from a saved baseline?
What was the p95 latency?
Did it stay under the budget cap I funded?

Each of those is a small, tractable problem. The trouble is that they show up at the same time, on the same day, when you are mid-deploy and the agent is already serving traffic. You do not want to be wiring six things in parallel.

crusoe-nemotron-harness is what I built so that I never have to do that wiring again. It is a production harness for Nemotron agents on Crusoe Cloud Managed Inference that turns those six concerns into one context-managed block and one RunReport.

The problem

A Nemotron provider on Crusoe gives you fast, cheap inference. It does not give you observability. The provider returns tokens. It does not return cost-in-USD, an egress audit, a tool-arg verdict, a snapshot diff, a latency percentile, or a budget-cap decision.

Every team that ships a Nemotron agent ends up writing those six things. They write them inline, scattered across the agent code, with no consistent shape. The next person to touch the agent re-derives half of it because they cannot find the rest.

The harness is the seam where those six concerns live in one place, behind one API, with one shared report shape.

The shape of the fix

The whole flow is one context manager around your agent's work.

from crusoe_nemotron_harness import (
    FakeNemotronProvider,
    NemotronHarness,
    ToolSpec,
)

harness = NemotronHarness(
    provider=FakeNemotronProvider(seed=3),
    max_tokens_per_run=200_000,
    max_usd_per_run=1.00,
    allowed_hosts=["api.example.com"],
    tools={"search": ToolSpec("search", required=("query",), types={"query": "str"})},
)

with harness.run() as ctx:
    result = ctx.complete("Summarize Nemotron in one paragraph.", max_tokens=200)
    ctx.call_tool("search", {"query": "nemotron benchmarks"}, latency_ms=15)
    ctx.fetch_url("https://api.example.com/v1/data")
    report = ctx.report()

print(report.as_dict())
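To make the `ToolSpec("search", required=("query",), types={"query": "str"})` line concrete, here is a minimal sketch of what a tool-argument check of that shape could look like. `ToolSpecSketch`, `vet_args`, and the type-name mapping are hypothetical stand-ins written for this illustration; the harness's real `ToolSpec` and its vetting logic may differ.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the harness's ToolSpec; the real class may differ.
@dataclass
class ToolSpecSketch:
    name: str
    required: tuple[str, ...] = ()
    types: dict[str, str] = field(default_factory=dict)

# Assumed mapping from the type names used in a spec to Python types.
_TYPE_NAMES = {"str": str, "int": int, "float": float, "bool": bool}

def vet_args(spec: ToolSpecSketch, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    # Every required argument must be present.
    for key in spec.required:
        if key not in args:
            problems.append(f"{spec.name}: missing required argument '{key}'")
    # Every argument with a declared type must match it when present.
    for key, type_name in spec.types.items():
        expected = _TYPE_NAMES.get(type_name)
        if key in args and expected is not None and not isinstance(args[key], expected):
            problems.append(
                f"{spec.name}: '{key}' should be {type_name}, "
                f"got {type(args[key]).__name__}"
            )
    return problems

search = ToolSpecSketch("search", required=("query",), types={"query": "str"})
print(vet_args(search, {"query": "nemotron benchmarks"}))  # valid call
print(vet_args(search, {"query": 42}))                     # wrong type
print(vet_args(search, {}))                                # missing argument
```

Returning a list of problems rather than raising keeps the check easy to fold into a report: a run can record every bad call instead of stopping at the first one.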

You can use the FakeNemotronProvider with no API keys to see the full shape of the output. When you want real traffic, swap in CrusoeNemotronProvider:

import os
import requests
from crusoe_nemotron_harness import CrusoeNemotronProvider, NemotronHarness


def requests_transport(url, headers, body):
    response = requests.post(url, headers=headers, data=body, timeout=60)
    response.raise_for_status()
    return response.json()


provider = CrusoeNemotronProvider(
    model="nemotron-70b-instruct",
    api_key=os.environ["CRUSOE_API_KEY"],
    url=os.environ["CRUSOE_INFERENCE_URL"],
    transport=requests_transport,
)
harness = NemotronHarness(
    provider=provider,
    max_tokens_per_run=200_000,
    max_usd_per_run=10.00,
    allowed_hosts=["api.crusoe.example.com"],
)

The agent code does not change. The harness sits around it.
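Both constructor calls pass `allowed_hosts`. As a rough illustration of what an egress allowlist check involves, the sketch below compares a URL's parsed hostname against the approved list. `host_allowed` is a hypothetical helper, and exact-match comparison is an assumption; the harness may support wildcards or ports that this sketch does not.

```python
from urllib.parse import urlsplit

def host_allowed(url: str, allowed_hosts: list[str]) -> bool:
    """True only when the URL's hostname exactly matches an approved host."""
    # urlsplit lowercases the hostname and strips any port or credentials.
    host = urlsplit(url).hostname
    if host is None:
        return False
    return host in {h.lower() for h in allowed_hosts}

allowed = ["api.example.com"]
print(host_allowed("https://api.example.com/v1/data", allowed))          # True
print(host_allowed("https://api.example.com.evil.test/v1", allowed))     # False
print(host_allowed("https://evil.test/?next=api.example.com", allowed))  # False
```

Comparing the parsed hostname, rather than searching the URL string for an approved name, is what keeps the second and third URLs from slipping through.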

What it does NOT do

It is not a router. It does not pick between Nemotron variants or fall back to a different provider on failure.
It is not a queue. It does not retry, schedule, or persist runs.
It is not a UI. The RunReport is a dict and a pretty-printed string. Plug it into whatever dashboard or log sink you already use.
It is not a generic agent framework. It wraps a run and reports on it, nothing else.
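Since the report is just a dict, sending it to an existing log sink can be as small as one JSON line per run. The sketch below assumes only that you have the dict from `report.as_dict()`; `log_report`, the logger name, and the sample values are made up for illustration, with field names borrowed from the `RunReport` shape.

```python
import json
import logging

logger = logging.getLogger("agent.runs")  # hypothetical logger name

def log_report(report_dict: dict) -> str:
    """Serialize a run report as one JSON line and send it to the logger."""
    # sort_keys keeps lines stable across runs; default=str covers values
    # json cannot encode natively.
    line = json.dumps(report_dict, sort_keys=True, default=str)
    logger.info(line)
    return line

# Sample values are invented; in real use this would be report.as_dict().
sample = {
    "llm_calls": 1,
    "tool_calls": 1,
    "tokens_used": 412,
    "usd_used": 0.0004,
    "aborted": False,
    "abort_reason": None,
}

logging.basicConfig(level=logging.INFO)
log_report(sample)
```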

Inside the lib (one design choice worth showing)

Each concern maps to a local module and to one or more single-purpose sibling libraries that already exist in the agent-stack:

Concern	Module	Sibling library
`total_cost_usd`	`cost.py`	claude-cost, bedrock-cost, bedrock-kit
`allowed_hosts`	`egress.py`	agentguard, agentguard-rs, birddog
`tool_failures`	`vet.py`	agentvet, agentvet-rs
`snapshot_events`	`snap.py`	agentsnap, agentsnap-rs
`p50/p95_latency_ms`	`trace.py`	agenttrace, agenttrace-rs
`tokens_used`	`budget.py`	token-budget-pool, token-budget-py, llm-budget-window

The harness is the seam. The internals are small and self-contained because the real work is done by the sibling library in each row, ported into the harness as a local module so it stays zero-dep.
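To give a sense of how small one of these modules can be, here is a sketch of the latency-percentile piece. It uses the nearest-rank method, which is an assumption; the post does not say which percentile definition `trace.py` uses, and `percentile` is a hypothetical helper.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

latencies_ms = [12, 15, 18, 22, 30, 45, 60, 90, 120, 400]
print(percentile(latencies_ms, 50))  # 30
print(percentile(latencies_ms, 95))  # 400
```

With only ten samples, p95 is simply the slowest call. That is worth remembering when reading a p95 from a short run.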

# what every metric collapses to
class RunReport:
    total_calls: int
    llm_calls: int
    tool_calls: int
    total_input_tokens: int
    total_output_tokens: int
    p50_latency_ms: int
    p95_latency_ms: int
    tokens_used: int
    usd_used: float
    aborted: bool
    abort_reason: str | None

Treating the six concerns as one report shape is what makes the harness fit into a single context manager. If they each had their own callback, their own object, and their own lifecycle, the API would balloon and the agent code would stop being readable.
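The one-context-manager, one-report idea can be shown in miniature. The sketch below tracks a single concern (a token cap) and is not the harness's implementation: `MiniReport`, `MiniContext`, `BudgetExceeded`, and `mini_run` are hypothetical names. It shows how a context manager can turn a cap violation into `aborted` and `abort_reason` fields instead of an exception that escapes to the caller.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional

@dataclass
class MiniReport:
    llm_calls: int = 0
    tokens_used: int = 0
    aborted: bool = False
    abort_reason: Optional[str] = None

class BudgetExceeded(Exception):
    """Raised inside the run when a call would push usage past the cap."""

class MiniContext:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.report = MiniReport()

    def record_llm_call(self, tokens: int) -> None:
        # Check before counting, so the report never exceeds the cap.
        if self.report.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap of {self.max_tokens} exceeded")
        self.report.llm_calls += 1
        self.report.tokens_used += tokens

@contextmanager
def mini_run(max_tokens: int):
    ctx = MiniContext(max_tokens)
    try:
        yield ctx
    except BudgetExceeded as exc:
        # Swallow the exception and record it on the report instead.
        ctx.report.aborted = True
        ctx.report.abort_reason = str(exc)

with mini_run(max_tokens=100) as ctx:
    ctx.record_llm_call(60)
    ctx.record_llm_call(60)  # would exceed the cap, so the run stops here
    ctx.record_llm_call(10)  # never reached

print(ctx.report)
```

Every other concern would add fields to the same report and checks to the same context object, which is why the API stays at one `with` block.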

When this is useful

You are running Nemotron on Crusoe Managed Inference and want production observability without rewriting your agent.
You need a single artifact you can drop into a Slack post-mortem after a run goes sideways.
You are running a leaderboard demo for a hackathon and want a deterministic "before vs. after" output to show what the harness adds.
You want budget caps to abort cleanly with a typed reason instead of crashing on a 429 from the provider.
You want every run to come with a snapshot you can replay later in CI.
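For the snapshot case, a CI check can be as simple as comparing a run's recorded events against a saved baseline file. The sketch below is one possible shape under stated assumptions: `check_snapshot`, the JSON file layout, and the event dicts are hypothetical, and the harness's own `snap.py` format may look different.

```python
import json
import tempfile
from pathlib import Path

def check_snapshot(path: Path, events: list[dict], update: bool = False) -> list[str]:
    """Compare events to the saved baseline and return a list of differences.
    The first run (or update=True) records the baseline and reports no diffs."""
    if update or not path.exists():
        path.write_text(json.dumps(events, indent=2, sort_keys=True))
        return []
    baseline = json.loads(path.read_text())
    diffs = []
    if len(baseline) != len(events):
        diffs.append(f"event count changed: {len(baseline)} -> {len(events)}")
    for i, (old, new) in enumerate(zip(baseline, events)):
        if old != new:
            diffs.append(f"event {i} changed: {old} -> {new}")
    return diffs

with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "run.snap.json"
    first = [{"kind": "tool", "name": "search", "args": {"query": "nemotron benchmarks"}}]
    drifted = [{"kind": "tool", "name": "search", "args": {"query": "llama benchmarks"}}]
    recorded = check_snapshot(snap, first)   # writes the baseline
    same = check_snapshot(snap, first)       # matches the baseline
    changed = check_snapshot(snap, drifted)  # differs from the baseline

print(recorded, same, changed)
```

A check like this only makes sense when the events are deterministic, which is where a seeded provider such as `FakeNemotronProvider(seed=3)` fits.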

When this is NOT what you want

You are on OpenAI or Anthropic directly and want vendor-specific features. The harness is currently Crusoe + Nemotron shaped. The pattern generalizes; the wiring does not, yet.
You want a multi-agent orchestration layer (AutoGen, CrewAI). The harness wraps one agent run, not a swarm.
You want a hosted observability product with dashboards. Use Phoenix, Langfuse, or Helicone and feed them the RunReport.

Install

git clone https://github.com/MukundaKatta/crusoe-nemotron-harness.git
cd crusoe-nemotron-harness
python3 -m venv .venv && .venv/bin/pip install -e ".[dev]"
.venv/bin/pytest -q

Repo: https://github.com/MukundaKatta/crusoe-nemotron-harness

Sibling libraries

Library	Role
claude-cost	Cache-aware cost calc for Anthropic
bedrock-cost	Cross-vendor Bedrock cost calc
agentguard	Egress allowlist for tool calls
agentsnap	Snapshot tests for agent runs
agenttrace-rs	Run-level cost + latency aggregation

Each is small. Each does one thing. The harness ports the relevant pieces into local modules, one per concern, for one specific deployment shape.

What's next

The next harness I want to build is the same shape for AWS Bedrock with Anthropic and Llama on it. The six concerns are identical. The provider shape, pricing data, and request body change. If you have a managed-inference endpoint you want a one-import harness for, send a repo link.
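The "pricing data changes" part is the easiest to picture. A cost function that takes its price table as an argument stays the same across providers, and only the table is swapped. The sketch below is an illustration under assumptions: `Price`, `PRICING`, and `cost_usd` are hypothetical names, and the prices are placeholders, not real Crusoe or Bedrock rates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    input_per_million_usd: float
    output_per_million_usd: float

# Placeholder prices for illustration only; not real provider rates.
PRICING = {
    "example-model-a": Price(input_per_million_usd=1.00, output_per_million_usd=2.00),
    "example-model-b": Price(input_per_million_usd=0.50, output_per_million_usd=1.50),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int,
             pricing: dict[str, Price] = PRICING) -> float:
    """Cost of one call in USD, priced per million input and output tokens."""
    price = pricing[model]  # KeyError for an unknown model is deliberate
    total = (input_tokens * price.input_per_million_usd
             + output_tokens * price.output_per_million_usd)
    return total / 1_000_000

print(cost_usd("example-model-a", 1_000_000, 500_000))  # 2.0
```

Letting an unknown model raise, rather than silently pricing it at zero, keeps a missing table entry from showing up as a suspiciously cheap run.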
