DEV Community: Akash Hadagali Persetti

My eval leaderboard published a negative retry count

Akash Hadagali Persetti — Mon, 20 Jul 2026 20:45:33 +0000

I was looking at the EvalBench dashboard a few weeks ago and one cell stopped me:

Retries to valid
0.125
95% CI -0.035 – 0.285
n=40

A negative number of retries. You cannot retry a request negative-point-oh-three-five times. The mean was fine. The interval was nonsense, and it was nonsense in a way that pointed straight at how I was computing every number on the page.

EvalBench is a multi-provider LLM eval platform I built. Three suites run against OpenAI and Anthropic models across five domains: structured-output reliability, latency and cost with judge scoring, and RAG retrieval quality. Every suite emits the same MetricRecord shape, and the runner aggregates those records into leaderboard rows.

The interesting part was not the arithmetic. It was that a frontend formatting string had been quietly making a statistical decision for months.

Why I stopped publishing bare means

Here is the structured-output suite, 40 samples per model:

Model	Schema valid	First-attempt valid
anthropic/claude-sonnet-4-5	85%	80%
openai/gpt-4o	72.5%	72.5%

A 12.5 point gap. If I publish just those two numbers, every reader concludes Sonnet is meaningfully better at schema adherence.

Now the same table with intervals:

Model	Schema valid	95% CI
anthropic/claude-sonnet-4-5	85%	70.9% – 92.9%
openai/gpt-4o	72.5%	57.2% – 83.9%

The intervals overlap across roughly eleven points. At n=40 I cannot tell these models apart on this metric. The honest reading is "no detectable difference," not "Sonnet wins by 12.5."

That is the whole reason the interval math exists. A mean with no sample size and no interval is a claim disguised as a measurement.

Three metric types, three different intervals

The mistake I see in eval tooling is applying one interval formula to every column. Proportions, unbounded counts, and tail latencies have different distributions, and a single formula gets at least two of them wrong.

Proportions get Wilson, not normal approximation

Schema validity is a proportion. The textbook normal approximation, p ± 1.96 * sqrt(p(1-p)/n), breaks badly near 0 and 1. At p=1.0 it produces a zero-width interval, which claims perfect certainty from 40 samples.

Wilson does not do that:

def wilson_interval(values: Sequence[float]) -> Estimate:
    """Return the 95% Wilson interval for binary or fractional successes."""
    n = len(values)
    if n == 0:
        return _empty_estimate()

    z = 1.96
    mean = sum(values) / n
    denominator = 1 + z**2 / n
    center = (mean + z**2 / (2 * n)) / denominator
    half = (
        z
        * math.sqrt((mean * (1 - mean) + z**2 / (4 * n)) / n)
        / denominator
    )
    return Estimate(
        mean=mean,
        n=n,
        ci_low=max(0.0, center - half),
        ci_high=min(1.0, center + half),
    )

Four perfect scores gives you [0.510, 1.0], not [1.0, 1.0]. Four failures gives [0.0, 0.490]. The interval stays inside [0,1] and stays wide when n is small.

I pinned those exact values in a test rather than asserting loose properties:

@pytest.mark.parametrize(
    ("values", "mean", "ci_low", "ci_high"),
    [
        ([1.0, 1.0, 1.0, 1.0], 1.0, 0.5100999795960008, 1.0),
        ([0.0, 0.0, 0.0, 0.0], 0.0, 0.0, 0.48990002040399916),
        ([1.0, 0.0, 1.0, 0.0], 0.5, 0.15003570882017148, 0.8499642911798285),
        ([1.0, 0.5], 0.75, 0.19786250921045673, 0.9733234672343529),
    ],
)
def test_wilson_interval_matches_hand_computed_values(
    values: list[float], mean: float, ci_low: float, ci_high: float
) -> None:
    estimate = runner_module.wilson_interval(values)
    assert estimate.ci_low == pytest.approx(ci_low)

Hand-computed constants catch algebra typos that a property assertion like ci_low <= mean will happily let through.

p95 latency gets an order statistic

p95 latency is the one people get most wrong. A normal interval around a 95th percentile assumes the latency distribution is symmetric, and provider latency is not remotely symmetric. It has a long right tail from retries and queueing.

So p95 does not get a parametric interval at all. It gets a nearest-rank estimate with bounds pulled from the sorted sample:

def percentile_interval(values: Sequence[float], q: float) -> Estimate:
    """Return a nearest-rank percentile with a binomial order-statistic CI."""
    n = len(values)
    if n == 0:
        return _empty_estimate()

    ordered = sorted(values)
    estimate_index = min(n - 1, max(0, math.ceil(q * n) - 1))
    estimate = ordered[estimate_index]
    if n == 1:
        return Estimate(mean=estimate, n=n, ci_low=estimate, ci_high=estimate)

    lower_index = min(n - 1, max(0, _binomial_quantile(n, q, 0.025)))
    upper_index = min(n - 1, max(0, _binomial_quantile(n, q, 0.975)))
    return Estimate(
        mean=estimate,
        n=n,
        ci_low=ordered[lower_index],
        ci_high=ordered[upper_index],
    )

The count of observations below the 95th percentile follows a binomial distribution. Take the 2.5th and 97.5th quantiles of that binomial, use them as indices into the sorted latencies, and the bounds are actual observed latencies. No distribution assumption, and the interval cannot land somewhere the data never went.

_binomial_quantile builds the PMF by recurrence from the mode outward rather than calling math.comb, because binomial coefficients at n in the hundreds overflow into float garbage:

mode = min(n, max(0, math.floor((n + 1) * probability)))
masses = [0.0] * (n + 1)
masses[mode] = 1.0

failure_odds = (1.0 - probability) / probability
for successes in range(mode, 0, -1):
    masses[successes - 1] = (
        masses[successes] * successes / (n - successes + 1) * failure_odds
    )

Starting at the mode with mass 1.0 and walking outward keeps every ratio near unity. The masses get normalized by their own sum at the end, so the unnormalized start does not matter.

The bug

Two of those three formulas were right. The routing between them was not.

Here is what aggregate_records used to do:

proportion_metrics = {
    metadata.get("key")
    for metadata in suite.display_metrics
    if metadata.get("format") == "percent"
}

interval = (
    wilson_interval(values)
    if metric_key in proportion_metrics
    else normal_mean_interval(values)
)

Interval selection keyed off the display format string. And retries_to_valid was declared like this:

{
    "key": "retries_to_valid",
    "label": "Retries to valid",
    "format": "number",
    "higher_is_better": False,
}

Format is "number", so it missed proportion_metrics and fell through to normal_mean_interval, which has no clamps:

standard_error = statistics.stdev(values) / math.sqrt(n)
half = 1.96 * standard_error
return Estimate(mean=mean, n=n, ci_low=mean - half, ci_high=mean + half)

Mean 0.125, mostly zeros with a few ones, standard deviation large relative to the mean. mean - half goes below zero and nothing stops it.

normal_mean_interval is correct for what it is. The failure is that I made a statistical decision depend on a presentation field. "percent" is a formatting instruction for the frontend. I overloaded it into "this quantity is a bounded proportion," and the two meanings drifted apart the moment I added a metric that was a bounded count rather than a percentage.

retries_to_valid is bounded below at zero. It just is not a proportion, so neither branch fit it.

The fix

Metrics now declare their statistical support explicitly, separate from how they render:

{
    "key": "retries_to_valid",
    "label": "Retries to valid",
    "format": "number",         # presentation
    "support": "non_negative",  # statistics
    "higher_is_better": False,
}

Routing reads support:

def _interval_for_support(support: str, values: Sequence[float]) -> Estimate:
    if support == "proportion":
        return wilson_interval(values)
    if support == "non_negative":
        return non_negative_mean_interval(values)
    return normal_mean_interval(values)

non_negative_mean_interval clamps the lower bound at zero. I considered a bootstrap instead, which handles skewed low-count data better, and decided against it for now. The reasoning is in the docstring:

"""Return a 95% normal interval around the mean, clamped at zero.

Clamping (rather than bootstrapping) keeps this a small, deterministic
change to an existing formula instead of introducing a new resampling
dependency for what is fundamentally the same asymptotic-normal
approximation used elsewhere in this module. It is a known-conservative
fix: for genuinely skewed low-count data the true interval is narrower
than a clamped symmetric one, but the clamped bound is never wrong in
the way an unclamped negative bound is.
"""

A clamped interval is too wide near zero. That is a real cost and I would rather state it than pretend the clamp is free.

The other half of the fix is refusing to guess. Missing metadata raises instead of defaulting:

missing = [key for key in suite.metric_keys if key not in declared]
if missing:
    raise ValueError(
        f"suite {suite.name!r} is missing declared support for metrics: "
        f"{sorted(missing)}"
    )

A silent default is how this happened. A new suite cannot now reach the leaderboard without saying what kind of quantity each of its metrics is.

The test I should have written first

Every interval test I had pinned hand-computed values for one function. All of them passed while the dashboard published a negative retry count, because not one of them asked whether the right formula reached the right column.

That test is generic, cheap, and runs across every registered suite:

@pytest.mark.parametrize(
    "suite", registry_module.list_suites(), ids=lambda suite: suite.name
)
def test_aggregate_records_intervals_stay_within_declared_support(
    suite: Suite,
) -> None:
    ...
    for row in response.rows:
        for metric_key, estimate in row.metrics.items():
            support = support_by_metric[metric_key]
            if estimate.ci_low is None:
                continue
            if support == "proportion":
                assert 0.0 <= estimate.ci_low
                assert estimate.ci_high <= 1.0
            elif support == "non_negative":
                assert estimate.ci_low >= 0.0

It does not check any particular number. It checks that no published bound falls outside the range its quantity can physically occupy. Unit tests on formulas verify the math. This verifies the wiring, and the wiring is what broke.

There is also a regression test pinned to the exact reported case, 40 records with one outlier requiring 5 retries, mean 0.125, asserting ci_low >= 0. I confirmed it failed before the fix. A regression test you never watched fail is a test you are trusting for no reason.

The backend went from 262 tests to 267. Five tests, and one of them covers a class of failure that the other 262 could not see.

Takeaway

Publish the interval and the sample size next to every mean, and test that each interval respects the support of the thing it is measuring, because a leaderboard that cannot check its own bounds will cheerfully show you a negative retry count instead.

I Ran terraform validate Inside a Lambda. Two Things Broke.

Akash Hadagali Persetti — Sat, 18 Jul 2026 20:32:09 +0000

Most agent projects that generate code evaluate it by asking another model whether the code looks correct. I did that for about a week in TerraformAgent before I got tired of it.

The failure mode is boring and predictable. The judge reads a .tf file, sees resource blocks with plausible attribute names, and says yes. The file has an undeclared variable, or a reference to a resource that does not exist, or an attribute that HashiCorp removed in provider v5. The judge has no way to know. It is pattern-matching on what Terraform usually looks like, which is exactly what the generator was doing when it produced the bug.

Terraform ships a checker. It is called terraform validate. It is deterministic, it takes about a second, and it is never wrong about syntax or references. The whole reason people reach for an LLM judge here is that running the real thing inside a Lambda is annoying. It is annoying. It is also not that hard, and this post is the parts that actually bit me.

TerraformAgent is a 6-node LangGraph pipeline that turns a natural-language infra request into Terraform. Orchestrator, researcher, parallel domain subagents, aggregator, reviewer, evaluator. The evaluator is the node I care about here.

The evaluator writes to disk and shells out

The generated code lives in state as a filename-to-content dict. To run real Terraform against it, you have to put it on a filesystem. Lambda gives you /tmp, so that is where it goes:

thread_id = state.get("thread_id", "default")
temp_dir = f"/tmp/tf_eval_{thread_id}"

if os.path.exists(temp_dir):
    shutil.rmtree(temp_dir)
os.makedirs(temp_dir, exist_ok=True)

aggregated_output = state.get("aggregated_output", {})
for filename, content in aggregated_output.items():
    filepath = os.path.join(temp_dir, filename)
    with open(filepath, "w") as f:
        f.write(content)

The thread_id in the path matters. Lambda execution environments get reused across invocations, so two runs can land in the same container. Without the thread scoping, run B validates a directory containing run A's leftovers.

Then three subprocess calls, in order:

fmt_result = subprocess.run(
    ["terraform", "fmt", "-check", "-recursive"],
    cwd=temp_dir, capture_output=True, text=True, timeout=30,
)
eval_results["terraform_fmt_pass"] = fmt_result.returncode == 0

init next, then validate only if init succeeded. Skipping that guard gives you a validate failure that is really an init failure, and the retry logic then blames the wrong thing.

if init_success:
    val_result = subprocess.run(
        ["terraform", "validate", "-no-color"],
        cwd=temp_dir, capture_output=True, text=True, timeout=60,
    )
    eval_results["terraform_validate_pass"] = val_result.returncode == 0
else:
    eval_results["terraform_validate_pass"] = None
    eval_results["terraform_validate_output"] = "skipped (init failed)"

Note the three-state result. True, False, and None. None means the check did not run. If you collapse that to False you get a retry loop chasing a validation error that was never evaluated. There is a fourth case in the fmt block for the same reason: FileNotFoundError sets the result to None with the message terraform binary not found (running locally without Docker?), because the binary only exists inside the container image and I run these tests on my laptop.

terraform init wants the network. Lambda should not.

terraform init downloads the AWS provider plugin. It is about 60 MB. Doing that on every invocation is slow, costs egress, and breaks the moment HashiCorp's release endpoint has a bad minute.

So bake it into the image at build time:

RUN mkdir -p /tmp/tf-provider-cache /opt/tf-plugins && \
    cd /tmp/tf-provider-cache && \
    printf 'terraform {\n  required_providers {\n    aws = {\n      source  = "hashicorp/aws"\n      version = "~> 5.0"\n    }\n  }\n}\n' > main.tf && \
    TF_PLUGIN_CACHE_DIR=/opt/tf-plugins terraform init -backend=false && \
    rm -rf /tmp/tf-provider-cache

ENV TF_PLUGIN_CACHE_DIR=/opt/tf-plugins

A throwaway main.tf that declares nothing but the provider requirement, one init, and the plugin lands in the cache directory.

The cache goes in /opt, not /tmp. This one cost me real time. Lambda mounts a fresh empty /tmp on every execution environment at runtime. Anything you write to /tmp during the Docker build is gone. Not stale, not partially there. Gone. /opt is part of the read-only image layers and survives to runtime.

The read-only cache breaks on version mismatch

Baking into /opt fixes the download. It introduces a second problem, and this is the part I did not see coming.

/opt is read-only at runtime. Terraform treats TF_PLUGIN_CACHE_DIR as a directory it can write to. As long as the generated code asks for a provider version matching what got baked in, Terraform reads from the cache and everyone is happy. The moment the generated code pins something else, Terraform tries to download that version into the cache dir, hits a read-only path, and fails outright. Not a slow path. A hard failure.

The fix is a copy, once per execution environment:

baked_cache = os.environ.get("TF_PLUGIN_CACHE_DIR", "/opt/tf-plugins")
writable_cache = "/tmp/tf-plugins"
if not os.path.isdir(writable_cache):
    if os.path.isdir(baked_cache):
        shutil.copytree(baked_cache, writable_cache)
    else:
        os.makedirs(writable_cache, exist_ok=True)

env = os.environ.copy()
env["TF_PLUGIN_CACHE_DIR"] = writable_cache

init_result = subprocess.run(
    ["terraform", "init", "-backend=false", "-input=false", "-no-color"],
    cwd=temp_dir, capture_output=True, text=True, timeout=120, env=env,
)

The if not os.path.isdir(writable_cache) guard is what makes this cheap. Warm containers skip the copy entirely. Cold ones pay a local filesystem copy, which is nothing next to a 60 MB download. Matching versions stay fully offline. Non-matching versions still work, they just fall back to a normal network install into a directory Terraform is allowed to write to.

That is the tradeoff I would call out in a review: I could have refused non-matching versions and kept it strictly offline, which would be faster and more predictable. I chose to degrade to a network fetch instead, because an agent that generates version = "~> 4.0" for a good reason should not hard-fail on infrastructure trivia.

The LLM judge is still there. It just answers a different question.

I want to correct something, because I have seen this framed as replacing the LLM judge, and that is not what happened.

The evaluator runs both. Mechanical checks answer "is this valid Terraform." An LLM judge answers "does this do what the user asked":

judge_prompt = f"""Task: {state.get('task', '')}

Generated Terraform code:
{json.dumps(aggregated_output)}

Does the generated code satisfy the original task? Evaluate:
1. Are all requested resources present?
2. Are the configurations correct for the stated requirements?
3. Are there any obvious omissions?

Answer strictly with JSON: {{"task_success": true/false, "reason": "...", "missing_resources": [...]}}"""

Nothing deterministic can check whether "give me a VPC with a private subnet for the database" produced the right resources. That is a semantic question and it needs a model. The point is not that LLM judges are useless. The point is that they should not be answering questions a compiler already answers.

Routing the retry

The evaluator's results feed a function that decides which domains re-run. Only the flagged ones:

is_blocking = "BLOCKING_ERROR" in review_feedback
fmt_failed = eval_results.get("terraform_fmt_pass") is False
validate_failed = eval_results.get("terraform_validate_pass") is False

if not (is_blocking or fmt_failed or validate_failed):
    return []

Two things here. is False and not falsy, so the None case does not trigger a retry. And when something failed but no specific file was named in the output, every active domain retries rather than silently returning an empty list. A retry that does nothing is worse than an expensive one.

The graph wires it as a conditional edge back to the subagents, capped at 3:

workflow.add_conditional_edges(
    "evaluator",
    should_retry_or_finish,
    {"domain_subagents": "domain_subagents", "end": END},
)

What I would do differently

The fmt output is a filename list, which makes attributing a failure to a domain easy. validate output is prose from Terraform, and I am string-matching known filenames against it. That works until a filename appears in an error message for a reason unrelated to that file being broken. Parsing terraform validate -json would give me structured diagnostics with real file and line attribution. I have not done it yet.

The retry cap of 3 is a guess. I have no data on how often a third attempt fixes something a second one did not. That is a metric I should be emitting and am not.

And the judge sees json.dumps(aggregated_output) with the entire generated codebase inline. On a large request that is a lot of tokens for one boolean.

94 tests pass on this. None of them mock terraform, because the whole point was to stop trusting a stand-in for the real thing.

Takeaway

If a real tool can answer your evaluation question, run the real tool, and save the model for the questions that have no compiler.

Repo: github.com/akashpersetti/terraform-agent

The cost of a serverless agent isn't the Lambda. It's the loop.

Akash Hadagali Persetti — Wed, 15 Jul 2026 19:25:42 +0000

When I moved Wingman onto Lambda, I assumed the thing I'd have to watch was compute duration. That's the serverless mental model everyone starts with: you pay per millisecond, so keep the function fast and cheap.

That model was wrong for an agent. The Lambda bill turned out to be the boring, predictable part. The two things that actually move cost and latency are the cold start and the retry loop, and neither of them shows up where you'd look first.

Here's what the code taught me.

The setup: what Wingman actually is

Wingman is a LangGraph worker-evaluator agent. A worker node (running gpt-4o-mini) does the task. An evaluator node (also gpt-4o-mini) grades the answer against a success criteria. If the criteria isn't met, control routes back to the worker for another attempt, up to five times.

It runs as a container image on Lambda, fronted by API Gateway and CloudFront, with session history in DynamoDB. The compute is stateless: every request rebuilds the conversation from DynamoDB, so there's no persistent checkpointer to worry about across cold and warm starts. I wrote about that state design in an earlier post. This one is about what it costs to run.

Cold start #1: the image

A Lambda container cold start begins by unpacking the image. Wingman's image carries langchain, langgraph, langchain-community, langchain-experimental, boto3, plus reportlab and sendgrid for the agent's tools. That's not a small pile of Python.

The one decision I'm glad I made is in the Dockerfile. The project has a local dependency group with playwright, gradio, and uvicorn, and a dev group on top of that. None of those belong in Lambda. So the build exports only the core deps and drops the rest:

RUN uv export \
      --no-dev \
      --no-group dev \
      --no-group local \
      --format requirements-txt \
      -o /tmp/requirements.txt \
    && pip install -r /tmp/requirements.txt --no-cache-dir \
      --target "${LAMBDA_TASK_ROOT}"

Playwright alone would have pulled a browser runtime into the image. Keeping it in the local group means it exists for running the agent on my laptop and never ships to Lambda. Smaller image, faster unpack, faster cold start. This is the kind of thing you only notice you did right when you compare it to the version where you didn't.

Cold start #2: the init you pay for on import

The bigger cold-start cost is in the handler, and it's easy to miss because it happens at import time:

# backend/main.py — runs once per new execution environment
wingman = Wingman()
wingman.setup()

_dynamodb = boto3.resource("dynamodb", region_name=os.getenv("AWS_REGION", "us-east-1"))
_table = _dynamodb.Table(_dynamodb_table_name)

setup() isn't cheap. It builds the tool list, constructs two ChatOpenAI clients, binds tools to the worker, wraps the evaluator in structured output, and compiles the LangGraph:

def setup(self):
    self.tools = get_tools()
    worker_llm = ChatOpenAI(model="gpt-4o-mini")
    self.worker_llm_with_tools = worker_llm.bind_tools(self.tools)
    evaluator_llm = ChatOpenAI(model="gpt-4o-mini")
    self.evaluator_llm_with_output = evaluator_llm.with_structured_output(EvaluatorOutput)
    self._build_graph()

Putting this at module level is the right call. It runs once when a new execution environment spins up, then every warm invocation reuses the same compiled graph and the same clients. If I'd built the graph inside the request handler instead, I'd be paying graph compilation on every single /api/chat call, warm or not. Module-level init trades a heavier cold start for near-zero per-request setup, which is the trade you want when warm invocations vastly outnumber cold ones.

The lever for cold-start speed is memory:

resource "aws_lambda_function" "backend" {
  timeout     = 300   # 5 minutes
  memory_size = 1024
}

On Lambda, CPU scales with memory. At 1024MB you get roughly a full vCPU, so bumping memory doesn't just buy RAM headroom, it makes that import-time init finish faster. Memory is a latency knob here, not only a capacity one.

The real cost: the retry loop, not the duration

Now the part that surprised me.

Look at the timeout: 300 seconds. Lambda's max is 900. That's a generous ceiling for a function whose real job takes a few seconds. So why so high? Because the retry loop can run the worker up to five times, and each turn is a full round of model calls plus whatever tools the worker decides to invoke.

Here's the accounting that matters. turn_count only increments in the evaluator:

def evaluator(self, state: State) -> Dict[str, Any]:
    ...
    return {
        ...
        "turn_count": state.get("turn_count", 0) + 1,
    }

def route_based_on_evaluation(self, state: State) -> str:
    if state["success_criteria_met"] or state["user_input_needed"]:
        return "END"
    if state.get("turn_count", 0) >= self.MAX_TURNS:
        return "END"
    return "worker"

So a single /api/chat request is not one model call. In the worst case it's five worker calls, five evaluator calls, and every tool call the worker makes in between. The Lambda duration for that is trivial to reason about and cheap. The token cost is the thing that swings, and it swings by a factor of five depending on how many times the evaluator sends the work back.

That reframed the whole cost model for me. Moving to stateless compute on Lambda made the infrastructure cost flat and predictable. What it did was push all the cost variance out of the infrastructure and into the model loop. The bill I have to watch isn't Lambda-seconds. It's tokens times retries.

The 29-second cliff

One more thing the config quietly documents. API Gateway has a hard 29-second integration timeout. A five-turn loop with tool calls can blow past that even though the Lambda itself is happily configured for 300 seconds. So the request dies at the gateway while the function is still working.

The fix is in the Terraform: a Lambda Function URL that bypasses API Gateway entirely for long tasks.

resource "aws_lambda_function_url" "backend" {
  function_name      = aws_lambda_function.backend.function_name
  authorization_type = "NONE"
}

Short chats go through CloudFront and API Gateway. Long agent runs go straight to the Function URL and get the full timeout budget. It works, but authorization_type = "NONE" and allow_origins = ["*"] are both flagged in the code as things to tighten after the first deploy, and they're still open. An unauthenticated public URL that invokes a model loop is a bill waiting to happen if someone finds it. That one's on my list.

What I'd do differently

The retry loop has no token budget. It caps at five turns, but a "turn" can be an expensive worker call with several tool invocations, and nothing stops a run from burning far more tokens than the answer is worth. A turn cap is a crude proxy for a cost cap. If I rebuilt this, the evaluator would carry a token budget alongside turn_count, and the loop would stop on whichever it hit first.

Takeaway

For a serverless agent, Lambda duration is the cheap, predictable line on the bill. The cost that actually moves lives in the cold-start init you pay on import and the retry loop you pay per request, so budget those, not the milliseconds.

Four rival LLMs, zero consensus: designing an MCP panel

Akash Hadagali Persetti — Sun, 12 Jul 2026 18:39:23 +0000

The obvious version of a multi-model tool is a black box: send it a question, it asks GPT and Gemini and Claude and Grok, and hands you back one blended answer. That was the first thing I cut.

I built mcp-second-opinion, a small MCP stdio server that lets an agent like Claude Code or Cursor consult rival models mid-task. It ships two tools. ask_other_model hits one named model. ask_the_panel convenes every model you have a key for. The interesting decision is what the panel does not do: it never reconciles the answers. It hands back four labeled responses and lets the caller sort it out.

This post is about why that's the right v0, and the parts of the fan-out that were harder than they looked.

The problem with a consensus layer

A synthesizer sounds helpful until you ask what it's actually doing. To merge four answers into one, something has to decide who's right. That something is either a fifth model call (now you're trusting a judge that has the same failure modes as the panelists) or a heuristic like majority vote (which is meaningless when the question isn't multiple-choice).

The whole reason you convene a panel is the disagreement. If GPT says the migration is safe and Claude flags a race condition, the gap is the signal. Averaging that into "the migration is probably fine, but watch for edge cases" throws away the one thing you called four models to get.

So I wrote it into the design spec as an explicit non-goal and shipped the panel as raw labeled output. A summary tool can come later if it earns its place. The v0 surface is honest: here is what each model said, you decide.

The fan-out

ask_the_panel runs all enabled providers concurrently and collects them. The core is one asyncio.gather:

start = time.monotonic()
if coros:
    results = await asyncio.gather(*coros.values())
    for provider, result in zip(coros.keys(), results):
        if result.error is not None:
            responses[provider] = {
                "answer": None,
                "model": result.model,
                "error": result.error,
                "latency_ms": result.latency_ms,
                "cost_usd": None,
            }
        else:
            responses[provider] = {
                "answer": result.answer,
                "model": result.model,
                "error": None,
                "latency_ms": result.latency_ms,
                "tokens": result.tokens,
                "cost_usd": result.cost_usd,
            }
total_latency_ms = int((time.monotonic() - start) * 1000)

Two details in there that I got wrong the first time.

First, total_latency_ms is wall-clock, measured once around the whole gather. It is not the sum of per-provider latencies. If Grok takes 4s and everyone else takes 1s, the panel took 4s, not 7s. Reporting the sum would be a lie about how long the user actually waited. The per-provider latency_ms is still there in each slot if you want to see who was slow.

Second, gather has no return_exceptions=True. That looks like a bug until you see the provider layer. provider.ask never raises. It catches everything and returns a ProviderResponse with an error field set:

try:
    response = await asyncio.wait_for(
        litellm.acompletion(model=model, messages=messages, max_tokens=max_tokens),
        timeout=timeout,
    )
except asyncio.TimeoutError:
    return ProviderResponse(model=model, answer=None,
                            error=f"timeout after {timeout}s", ...)
except Exception as e:
    return ProviderResponse(model=model, answer=None, error=str(e), ...)

This is the part that makes partial failure boring. One provider timing out doesn't take down the panel. Its slot comes back with answer: None and error: "timeout after 30s", and the other three answers are untouched. If I'd let exceptions propagate into gather, a single flaky provider would blow up the whole call. Pushing error handling down to the provider boundary means the panel is failure-isolated by construction, not by a try/except wrapped around the gather.

Graceful degradation over hard config

A four-provider tool that demands four API keys is a tool nobody runs. I don't have a Grok key half the time. So enabled-ness is derived from the environment, not declared:

PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "grok": "XAI_API_KEY",
}

enabled = frozenset(
    provider
    for provider, env_key in PROVIDER_KEYS.items()
    if os.environ.get(env_key)
)

Set one key, you get a one-model panel. Set three, you get three. The missing ones still appear in the response with error: "XAI_API_KEY not set" instead of vanishing, so the caller can see the panel is incomplete rather than wondering why Grok never spoke. The server never crashes on missing config. It just tells you what it can and can't do.

LiteLLM does the provider-routing work under this. Model names get a prefix so LiteLLM knows where to send them:

def resolve_model(name: str) -> str:
    if name.startswith("claude-"):
        return f"anthropic/{name}"
    if name.startswith("gemini-"):
        return f"gemini/{name}"
    if name.startswith("grok-"):
        return f"xai/{name}"
    return name

That's the whole abstraction. One litellm.acompletion call works for all four vendors, and pricing comes back through litellm.completion_cost, which is where the per-slot cost_usd and the panel total come from.

The self-skip footgun I built on purpose

Here's one that only shows up when the host is itself a panelist. If Claude Code calls this server and the panel includes Anthropic, you're asking Claude to grade a room that Claude is sitting in. Sometimes that's fine. Sometimes you specifically want outside opinions.

So there's MCP_SECOND_OPINION_SELF_SKIP. Set it to a provider key and that provider drops out of the panel:

if provider == config.self_skip:
    responses[provider] = {
        "answer": None,
        "model": resolved,
        "error": "skipped (self)",
    }
    continue

It's off by default, and an unknown value logs a warning and skips nothing rather than failing. Small feature, but it's the difference between "second opinion" and "echo."

What I'd do differently

The profiles (flagship, balanced, cheap) pin one model per provider per profile. That was the right call for a v0.1, but it means you can't run gpt-4o and claude-haiku in the same panel without editing the profile map. A real config would let you compose the panel model-by-model. I deferred it because nobody had asked for it yet, and shipping the constrained version taught me what people actually reach for.

I'd also revisit the flat 30s timeout. It's per-provider and uniform, which punishes fast models by making the slow one set the floor for wall-clock latency. A smarter version might cut the panel off once N of 4 have answered. But that reopens the consensus question I spent this whole post avoiding, so I left it alone.

Takeaway

If you're building a multi-model tool, the hard part isn't calling four APIs, it's resisting the urge to blend the answers, because the disagreement is the product.

Source: mcp-second-opinion on PyPI, MIT licensed.

I grade my AI twin on real traffic, not just an offline test set

Akash Hadagali Persetti — Thu, 09 Jul 2026 19:11:15 +0000

Most eval setups have a quiet lie in them. You run a fixed question set through your agent, a judge model scores the answers, you get a number, and you feel good. But that number describes a sandbox. It says nothing about the question a stranger typed into your live app at 2am, or whether your anti-hallucination guardrail held for that specific input.

Twin is my personal AI digital twin. It answers questions about me over a RAG index of my profile, running on Bedrock with Claude Sonnet 4.5, FastAPI backend, Next.js frontend. I already had an offline eval harness. What I added this week is a second judge that grades production traffic as it happens, and building it forced me to fix an integrity bug I'd been ignoring in the offline one.

The problem with the offline harness

My offline harness in evals/ does the honest-looking thing: load queries.json, run retrieval, run inference, hand the result to a judge. The judge prompt is strict about faithfulness:

JUDGE_SYSTEM_PROMPT = """You are grading whether an AI-generated answer is
faithful to the provided source material. Be strict: any claim in the answer
not directly supported by the source material counts as a hallucination, even
if it sounds plausible.
...
- "correctly_refused" is true if the source material contains nothing relevant
  to the question and the answer clearly states the information isn't available;
  false if the source material was empty/irrelevant but the answer invented
  something anyway...
"""

The judge is good. The problem was structural: the offline harness ran its own retrieval pass, separate from the answer's retrieval pass. So the judge was grading the answer against a source set that wasn't necessarily the one the model actually saw. If retrieval is non-deterministic or the index shifts, the judge is checking the wrong evidence. You can pass an eval where the model hallucinated and fail one where it didn't. The number is measuring the wrong thing.

That gap is invisible on a static question set because retrieval usually returns the same chunks twice. It stops being invisible the moment you point it at live traffic, where you have no fixed inputs at all.

The approach: capture the exact source the model saw

The fix is boring and correct. Whatever chunks the answer was generated from, those exact chunks get judged. No second retrieval.

In the chat path, right after inference, I capture the request with the same retrieved_chunks object that produced the answer:

retrieved_chunks = retrieval.retrieve(request.message, k=5)
# ... inference runs against retrieved_chunks ...
capture_live_eval(request.message, retrieved_chunks, assistant_response)

capture_live_eval is fire-and-forget. It never raises, because an eval capture failing should never break a user's chat:

def capture_live_eval(query: str, retrieved_chunks: List, answer: str) -> None:
    """Fire-and-forget capture of a real chat exchange for async
    faithfulness judging. Never raises."""
    if not EVALS_BUCKET:
        return
    try:
        retrieved_text = "\n\n".join(
            f"## {chunk.section_title}\n{chunk.text}"
            for chunk, score in retrieved_chunks
        )
        key = f"live/raw/{datetime.now().isoformat()}-{uuid.uuid4()}.json"
        body = {
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "retrieved_chunk_ids": [c.chunk_id for c, _ in retrieved_chunks],
            "retrieved_text": retrieved_text,
            "answer": answer,
        }
        evals_s3_client.put_object(
            Bucket=EVALS_BUCKET, Key=key,
            Body=json.dumps(body), ContentType="application/json",
        )
    except Exception as e:
        print(f"Live eval capture failed (non-fatal): {e}")

The important line is that retrieved_text is serialized from the same chunks the answer was built on. The judge later reads exactly this. There is no second retrieval anywhere in the live path.

The judging happens out of band

I did not want to block a chat response on a judge call. So the write to S3 under live/raw/ triggers a separate Lambda via an S3 event. That Lambda reads the raw record, judges it, and writes the result to live/judged/:

def process_record(bucket: str, key: str, s3_client) -> None:
    raw = json.loads(
        s3_client.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    )
    judged_key = key.replace("live/raw/", "live/judged/", 1)

    output = dict(raw)
    try:
        judgment = judge.judge_answer(
            raw["query"], raw["retrieved_text"], raw["answer"]
        )
        output["judgment"] = judgment
    except Exception as e:
        output["judgment_error"] = str(e)

    s3_client.put_object(
        Bucket=bucket, Key=judged_key,
        Body=json.dumps(output), ContentType="application/json",
    )

The chat request writes one small JSON object and moves on. The user waits on nothing. The judge runs whenever the event fires. The same judge_answer function backs both the offline harness and the live judge, so I am grading production and my test set with identical criteria. That reuse is deliberate. It's the only reason the two numbers are comparable.

One detail I like: the judge runs on Bedrock with temperature: 0.0 and parses a strict JSON object with a brace-depth scan rather than a regex, because judge models love to wrap JSON in prose no matter how many times you tell them not to.

What broke, and what I'd do differently

The capture path writes retrieved_chunk_ids alongside the text. I don't use those IDs in the judge yet, but they exist for a reason: my next problem is retrieval drift. If I rebuild the profile index, the chunk IDs behind a past judgment change, and a stored judgment silently starts referring to source text that no longer exists. Right now the judged record freezes retrieved_text at capture time, which protects the judgment itself, but I have no way to diff "what the judge saw" against "what the index says today." Storing the IDs is the first half of that; I haven't built the second half.

The bigger honest gap: I only judge faithfulness. The judge can tell me the model stayed inside its source material. It cannot tell me the answer was useful, or that retrieval pulled the right chunks in the first place. A perfectly faithful answer to badly retrieved context still passes. Faithfulness is the cheap half of quality, and it's the half I automated because it's the half a single model call can check.

I'd also reconsider the fire-and-forget silence. capture_live_eval swallowing every exception means a broken eval bucket produces one print and nothing else. On Lambda that print lands in CloudWatch and I will not notice for a week. If I were doing this again I'd emit a metric on capture failure, because a silent eval pipeline is the same as no eval pipeline.

Takeaway: an eval number is only trustworthy if the judge grades the exact evidence the model actually used, and the fastest way to find out whether yours does is to point it at real traffic instead of a frozen question set.

I built an eval harness for my own AI, and it caught my digital twin lying

Akash Hadagali Persetti — Wed, 08 Jul 2026 00:29:00 +0000

I run a digital twin on my personal site. It answers questions about me as if it were me: my experience, my projects, what I have and haven't worked with. The whole system prompt is one long instruction to never make anything up. If a visitor asks about a skill I don't have, it's supposed to say "I don't have information about that," not improvise.

For months I assumed it worked, because every time I tested it by hand, it behaved. Then I wrote an eval harness and pointed it at the twin. On 35 questions, 9 answers contained claims that weren't in the source material, and on the 8 questions I designed to be unanswerable, it only refused 4 of them. My anti-hallucination prompt was losing about a quarter of the time, and I'd been shipping it.

Here's how the retrieval works, how I graded it, and the two things that broke.

The setup: RAG over a text file, no vector database

The twin's knowledge is a single profile document split by section headers. When a visitor sends a message, the backend embeds the message, scores it against every section, and injects the top matches into the system prompt. That's the entire retrieval loop.

There's no Pinecone, no pgvector, no managed index. The "vector store" is a JSON file. I chose that on purpose. The profile is a few dozen sections. At that size, a linear scan over every chunk is a few milliseconds, and a real vector database is one more thing to provision, pay for, and keep in sync. The embeddings come from Bedrock's Titan v2 model, and the similarity is plain cosine, computed in Python:

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, k=5, index=None):
    if index is None:
        index = load_index()
    query_embedding = embed_text(query)
    scored = [(chunk, cosine_similarity(query_embedding, chunk.embedding)) for chunk in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

Chunking is just splitting on ## headers, one chunk per section. The index gets built once with a script that embeds each chunk and writes the vectors to disk. At request time I load that file into a module-level cache so the embeddings aren't re-read on every warm invocation.

The tradeoff is honest: this doesn't scale. If the profile grew to thousands of sections, the linear scan and the load-everything-into-memory approach would fall over, and I'd move to a real index. For a personal site with a bounded corpus, paying that complexity now would be premature. The code is small enough that swapping it later is an afternoon, not a migration.

Grading retrieval and answers separately

An eval on a RAG system has to measure two different things, and it's tempting to conflate them. Did retrieval pull the right chunks? And given some context, did the model answer faithfully? A good retrieval score with a hallucinated answer is still a failure, and a faithful answer built on lucky retrieval isn't something I can trust to hold.

So the harness measures both. For retrieval, I hand-labeled 35 questions with the chunk IDs that should come back, across four categories: single-chunk questions, multi-chunk questions that need two or more sections, out-of-corpus questions with no valid answer, and personal-guardrail questions the twin should refuse. Then recall@k and nDCG@k:

def recall_at_k(retrieved_ids, relevant_ids, k):
    if not relevant_ids:
        return None
    retrieved_top_k = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    return len(retrieved_top_k & relevant_set) / len(relevant_set)

Recall tells me whether the right chunk showed up at all. nDCG tells me whether it showed up near the top, since a correct chunk buried at position five matters less than one at position one.

For answer quality I used an LLM as a judge. Same Bedrock model, temperature zero, given the question, the retrieved source text, and the answer, told to be strict and flag any claim not directly supported by the source. It returns structured JSON:

JUDGE_SYSTEM_PROMPT = """You are grading whether an AI-generated answer is
faithful to the provided source material. Be strict: any claim in the answer
not directly supported by the source material counts as a hallucination...

Respond with ONLY a JSON object matching this exact shape:
{"faithful": true or false, "hallucinated_claims": ["...", ...],
 "correctly_refused": true, false, or null, "rationale": "one sentence"}
"""

The correctly_refused field is the one I cared about most. It's true when the source had nothing relevant and the answer admitted it, false when the source was empty but the answer invented something anyway, and null when refusal doesn't apply. That single field is how I measure whether the anti-hallucination instruction actually holds.

What broke, part one: the guardrail I trusted most

The numbers:

Recall@3: 0.74, recall@5: 0.84, nDCG@5: 0.76
Multi-chunk recall@5: 0.70, the weakest category
Personal-guardrail recall@5: 1.0
Faithful answers: 26 of 35
Correctly refused out-of-corpus questions: 4 of 8

The retrieval numbers are fine for a first pass. Multi-chunk is the soft spot, which makes sense: when an answer needs two sections and only one is clearly on-topic, the second one competes with everything else and sometimes loses. That's a ranking problem I can work on.

The number that actually bothered me is 4 of 8. Half the time, when a visitor asks something my profile has no answer to, the twin makes something up instead of saying it doesn't know. Nine answers overall carried hallucinated claims. This is on a system whose prompt spends hundreds of words, with worked examples, telling it not to do exactly this.

The lesson isn't subtle. A carefully written "don't hallucinate" instruction is not a guarantee. It's a suggestion the model follows most of the time and quietly ignores the rest. I would never have known the real rate without a test set built specifically to bait it, because my manual testing never included the questions designed to make it fail. You don't find this class of bug by using your own product. You find it by trying to break it on purpose.

What broke, part two: the eval was grading the wrong copy of the truth

The subtler problem is in the harness itself. Here's the loop:

retrieved = retrieval.retrieve(q["query"], k=5)
retrieved_text = "\n\n".join(... for chunk, score in retrieved)

answer = server.call_bedrock([], q["query"])
judgment = judge.judge_answer(q["query"], retrieved_text, answer)

Read the order. I call retrieve to get the chunks I score for recall and nDCG. Then I call call_bedrock to get the answer. But call_bedrock runs retrieval again internally to build its own prompt context. So the answer is generated from one retrieval pass, and the judge grades it against a different pass that I ran separately in the harness.

Same query, same k, so in practice the two passes return the same chunks and nothing goes wrong. But they are two independent calls. If retrieval ever became nondeterministic, or if I changed k in one place and not the other, the judge would be grading an answer against source text the answering model never actually saw. The faithfulness verdict would be measuring a fiction. It works today by luck, not by construction.

The fix is to make retrieval happen once and thread the same context through both the answer and the judge, so what gets graded is provably what the model was given. I'd rather the eval be correct by design than correct by coincidence.

What I'd do differently

I'd build the eval before I trusted the prompt, not after. I shipped a guardrail, believed it because it looked thorough and passed my casual tests, and only measured it months later. The prompt engineering wasn't wrong, it just wasn't verified, and "wrote a strong instruction" is not the same claim as "the model obeys it."

And I'd wire the harness so retrieval runs exactly once per question. An eval that can silently grade the wrong inputs is worse than no eval, because it hands you a number you'll believe.

The one-sentence version: a "never hallucinate" prompt bought me a 74% refusal miss rate I couldn't see until I wrote the test that was designed to catch it.

Building a worker-evaluator retry loop in LangGraph (and where it bites you on Lambda)

Akash Hadagali Persetti — Fri, 03 Jul 2026 13:22:20 +0000

Most agent demos stop at "the model called a tool and gave an answer." That answer is often wrong, and nothing in the loop notices. I wanted an agent that checks its own work against a success criteria before returning, and retries when it falls short. That is the whole idea behind Wingman: a worker that does the task, and a separate evaluator that decides whether the task is actually done.

Here is how the loop is built, the one decision that shaped everything, and the part that broke once it ran on Lambda behind API Gateway.

The problem

A single-pass agent has no idea when it failed. It calls a tool, produces text, and returns. If the output misses the point, the user finds out, not the system.

I wanted a gate. Something that reads the assistant's last response, compares it against a stated success criteria, and either accepts it or sends the worker back to try again with feedback. Two roles, not one. The worker produces. The evaluator judges.

The catch is that a retry loop with no ceiling is a bill waiting to happen. So the loop needs a hard stop, and it needs to know the difference between "the assistant needs to try again" and "the assistant is stuck and should ask the user."

The approach and the key decision

The graph has three nodes: worker, tools, and evaluator. The state carries the conversation, the success criteria, the latest feedback, two boolean flags, and a turn counter.

class State(TypedDict):
    messages: Annotated[List[Any], add_messages]
    success_criteria: str
    feedback_on_work: Optional[str]
    success_criteria_met: bool
    user_input_needed: bool
    turn_count: int

The worker runs first. If its last message contains tool calls, the router sends it to the tool node and back. If not, it goes to the evaluator.

def worker_router(self, state: State) -> str:
    last = state["messages"][-1]
    return "tools" if (hasattr(last, "tool_calls") and last.tool_calls) else "evaluator"

The key decision is making the evaluator a structured-output call, not a free-text one. It returns a typed object, so the routing logic reads clean booleans instead of parsing prose.

class EvaluatorOutput(BaseModel):
    feedback: str = Field(description="Feedback on the assistant's response")
    success_criteria_met: bool = Field(description="Whether the success criteria have been met")
    user_input_needed: bool = Field(
        description="True if more input is needed from the user, or the assistant is stuck"
    )

Both the worker and the evaluator run on gpt-4o-mini. The worker is bound to tools. The evaluator is bound to the schema above with .with_structured_output.

The routing after evaluation is where the loop's whole personality lives:

def route_based_on_evaluation(self, state: State) -> str:
    if state["success_criteria_met"] or state["user_input_needed"]:
        return "END"
    if state.get("turn_count", 0) >= self.MAX_TURNS:
        return "END"
    return "worker"

Three ways out. The criteria is met. The assistant is stuck and needs the user. Or the loop has spent its five turns. Otherwise it loops back to the worker, and this time the worker gets the feedback baked into its system prompt:

if state.get("feedback_on_work"):
    system_message += f"""
Previously you thought you completed the assignment, but your reply was rejected because the success criteria was not met.
Feedback:
{state["feedback_on_work"]}
Please continue the assignment..."""

That feedback injection is what makes the retry worth anything. The worker does not just try again blind. It tries again knowing why the last attempt was rejected.

One design detail I care about: the evaluator is told not to confuse "not done yet" with "stuck." A failed criteria means retry, not stop.

For user_input_needed: set True ONLY if the assistant explicitly asked the user a question or stated it cannot proceed.
Do NOT set user_input_needed=True just because the criteria was not met — the assistant should simply retry.

Without that line, the evaluator bails to the user the first time the worker misses, and the retry loop never actually retries.

What broke, and what I changed

The loop is stateless by design. Wingman reconstructs the full conversation from DynamoDB on every request and runs a fresh graph invocation with a throwaway thread ID, so there is no in-memory state to lose between Lambda cold starts. That part held up.

The timeout did not.

A worker-evaluator loop is slow. Five turns, each turn a worker call plus tool calls plus an evaluator call, adds up to real wall-clock time. I set the Lambda timeout to 300 seconds to give the loop room:

# LangGraph agent loops can take a while.
timeout     = 300  # 5 minutes (Lambda max is 15 min)
memory_size = 1024

That number is a lie if the request comes through API Gateway. The HTTP API has a hard 29-second integration cap that Lambda's 300 seconds cannot override. So a long agent run would keep working inside Lambda while the gateway had already given up on the caller. The function billed for the full run. The user got nothing.

The fix was to stop routing long tasks through the gateway at all. I added a Lambda Function URL, which is a direct HTTPS endpoint with no 29-second ceiling:

# Lambda function URL — direct HTTPS invoke, bypasses API Gateway 29s timeout.
resource "aws_lambda_function_url" "backend" {
  function_name      = aws_lambda_function.backend.function_name
  authorization_type = "NONE"
}

Now the gateway stays for the fast endpoints, and the agent loop can run against the Function URL without a proxy in the middle deciding it took too long.

There is a second sharp edge I have not fully closed. turn_count only increments inside the evaluator node:

"turn_count": state.get("turn_count", 0) + 1,

But the worker can loop worker -> tools -> worker -> tools several times before it ever hands off to the evaluator, and none of those hops count against MAX_TURNS. So the five-turn cap is really "five evaluator turns," not "five model calls." A task that keeps calling tools without producing a final answer can burn far more time than the cap implies. The honest fix is to count worker iterations too, or add a wall-clock budget inside the run, rather than trusting the evaluator counter to bound cost. That one is still on my list.

Takeaway

An evaluator that gates the worker's output is the cheap half of a self-correcting agent. The expensive half is making the retry ceiling account for every model call, not just the ones you happened to count.

I stopped putting AWS keys in GitHub Secrets. Here's what I do instead.

Akash Hadagali Persetti — Tue, 30 Jun 2026 20:51:31 +0000

For a while my deploy pipelines all worked the same way. Generate an IAM user, copy its access key and secret into GitHub repo secrets, and let the workflow use them. It deploys fine. It also means a long-lived credential to my AWS account is sitting in a third party's vault, valid until I remember to rotate it, which I never do.

For Wingman I cut that out. The GitHub Actions workflow holds zero AWS keys. It asks GitHub for a short-lived token at runtime, hands that token to AWS, and AWS gives back temporary credentials that expire when the job ends. This is OIDC federation. Below is how it's wired, the parts that are easy to get wrong, and one decision I'm still not happy about.

The problem with the old way

An IAM user access key is a static secret. Once it exists, anyone who can read it can use it from anywhere, forever, until it's rotated or deleted. Storing it in GitHub Secrets means the blast radius of a GitHub breach now includes my AWS account. It also means I own a rotation chore I will forget.

What I actually want: GitHub should be able to deploy this one repo to this one account, prove who it is on each run, and never hold a credential longer than the run takes.

The approach: trust GitHub's identity, not a stored key

OIDC flips the model. Instead of AWS trusting a secret that GitHub stores, AWS trusts GitHub's identity provider directly. On each run GitHub mints a signed JWT describing the job (which repo, which branch, which environment). The workflow passes that JWT to AWS STS. AWS checks the signature against GitHub's public keys, checks the token's claims against a trust policy I wrote, and if both pass, returns temporary credentials.

Two pieces make this work: a permission in the workflow, and a trust policy in AWS.

The workflow needs permission to request the token at all:

permissions:
  id-token: write   # required for OIDC
  contents: read

Miss this and the token request fails before AWS is ever contacted. It's the first thing to check when nothing else looks wrong.

Then the credential step, which carries no key, only a role ARN:

- name: Configure AWS credentials (OIDC — no long-lived keys)
  uses: aws-actions/configure-aws-credentials@v4.1.0
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: ${{ env.AWS_REGION }}

AWS_ROLE_ARN is not a secret in the credential sense. It's just an identifier. There's nothing in this repo's secrets that lets you authenticate to AWS on its own.

The trust policy is where the real work is

On the AWS side I register GitHub as an OIDC provider and write a role that only GitHub Actions can assume, and only from my repo. This is the Terraform:

data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_role" "github_actions" {
  name = "${var.project_name}-github-actions"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = data.aws_iam_openid_connect_provider.github.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:${var.github_org}/${var.github_repo}:*"
        }
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

The sub condition is the load-bearing line. It says: only tokens whose subject is repo:my-org/wingman:* may assume this role. Without it, or with a loose wildcard, you've built a role that any GitHub repo on the platform can assume. That is a real misconfiguration people ship, and it turns "keyless" into "anyone's keys."

The :* at the end matches any branch, tag, or environment in that repo. If you want to lock deploys to main only, you tighten it to repo:my-org/wingman:ref:refs/heads/main. I kept it open across refs deliberately, since I run the same role from workflow_dispatch too.

The aud check pins the audience to sts.amazonaws.com. Keep it as a StringEquals, not a StringLike. Exact match on audience is one of those small things that quietly closes a door.

Scope the role to exactly what the deploy touches

Authentication gets GitHub in the door. Authorization decides what it can do once inside. The deploy does four things, so the role grants four things and stops:

resource "aws_iam_role_policy" "github_actions" {
  name = "deploy-permissions"
  role = aws_iam_role.github_actions.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      # ECR: push container images
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken",
          "ecr:BatchCheckLayerAvailability",
          "ecr:PutImage",
          "ecr:InitiateLayerUpload",
          "ecr:UploadLayerPart",
          "ecr:CompleteLayerUpload",
        ]
        Resource = "*"
      },
      # Lambda: update function code and config (scoped to the one function)
      {
        Effect = "Allow"
        Action = [
          "lambda:UpdateFunctionCode",
          "lambda:UpdateFunctionConfiguration",
          "lambda:GetFunction",
          "lambda:GetFunctionConfiguration",
        ]
        Resource = aws_lambda_function.backend.arn
      },
      # S3: sync frontend assets (scoped to the one bucket)
      {
        Effect   = "Allow"
        Action   = ["s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
        Resource = [aws_s3_bucket.frontend.arn, "${aws_s3_bucket.frontend.arn}/*"]
      },
      # CloudFront: invalidate after deploy (scoped to the one distribution)
      {
        Effect   = "Allow"
        Action   = ["cloudfront:CreateInvalidation"]
        Resource = aws_cloudfront_distribution.main.arn
      },
    ]
  })
}

Lambda, S3, and CloudFront are pinned to specific ARNs. The ECR auth-token call needs Resource = "*" because ecr:GetAuthorizationToken is account-scoped and won't accept a resource constraint. That's an AWS quirk, not laziness. The push actions around it are still gated by which repo the token allows.

What I'd flag if I were reviewing this

It mostly came together without drama, but two things are worth being honest about.

The OIDC provider is a one-per-account resource. The first time you set this up there's a decent chance one already exists, from another repo or an earlier experiment, and terraform apply will refuse to create a duplicate. I handle it with a data source that reads the existing provider, plus a commented-out resource block as the fallback for a truly fresh account:

# If the OIDC provider doesn't exist yet, comment out the data source
# and uncomment this, then re-run apply:
# resource "aws_iam_openid_connect_provider" "github" {
#   url             = "https://token.actions.githubusercontent.com"
#   client_id_list  = ["sts.amazonaws.com"]
#   thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
# }

It's a small thing, but it's the difference between a clean apply and ten minutes of confusion about why a global resource conflicts.

The part I'm not satisfied with: the workflow still injects third-party API keys (OpenAI, Serper, Pushover) into the Lambda's environment at deploy time, pulled from GitHub Secrets and pushed in with update-function-configuration. So I solved the AWS-credential problem cleanly and then left application secrets living in GitHub and in plaintext Lambda env vars. The right move is Secrets Manager or SSM Parameter Store, with the Lambda reading them at cold start and the GitHub role granted nothing more than permission to trigger a deploy. I know the fix. I just haven't done it yet, and pretending otherwise would be dishonest.

Takeaway

OIDC removes the standing AWS credential from your CI, but the security only holds if the trust policy's sub claim is scoped to your exact repo. Get that line right and the rest is just IAM.

Why my LangGraph agent throws away its checkpoint on every request

Akash Hadagali Persetti — Mon, 29 Jun 2026 16:56:22 +0000

The problem

Wingman is a worker-evaluator agent. A worker LLM tries to satisfy a success criteria, an evaluator decides if it passed, and if it failed the worker retries up to 5 times. Standard loop.

The deployment is where it gets opinionated. It runs on AWS Lambda behind API Gateway, packaged as a container image on ECR. Lambda is stateless by design. The execution environment can be frozen, thawed, or thrown away between requests, and you get no say in when. So the question that drives the whole architecture is: where does the conversation live between turns?

LangGraph has a built-in answer. You attach a checkpointer, give each conversation a thread_id, and the framework persists graph state after every superstep. On the next call you pass the same thread_id and it resumes. That's the blessed path, and on a long-lived server it's the right one.

On Lambda it falls apart. The default MemorySaver keeps checkpoints in process memory. When Lambda freezes the environment, that memory might survive for the next warm invocation or it might be gone. You cannot tell from inside the handler. A user's second message could land on a fresh execution environment with an empty checkpoint store, and the agent forgets the conversation. There are durable checkpointer backends, but they pull you toward keeping a database connection warm and treating the LangGraph thread as your source of truth.

The decision

I stopped trying to make the framework's persistence survive Lambda. Instead I made the compute fully stateless and kept the durable state myself.

The conversation history is a plain list of {role, content} dicts. It lives in DynamoDB, keyed by session_id. On every request the handler reads that history, the agent rebuilds the entire LangGraph state from scratch, runs one superstep, and writes the updated history back.

The checkpointer is still there. It's just deliberately disposable:

self.graph = builder.compile(checkpointer=MemorySaver())

# Fresh thread ID each call — MemorySaver only lives for this invocation
config = {"configurable": {"thread_id": str(uuid.uuid4())}}
result = await self.graph.ainvoke(state, config=config)

A new thread_id every call means the checkpointer never resumes anything. It exists only because the graph wants one, and it dies with the invocation. The real memory is the history row in DynamoDB. LangGraph's persistence layer became a no-op on purpose.

Reconstruction is just a loop that turns stored dicts back into message objects:

past_messages = []
for h in history:
    if h["role"] == "user":
        past_messages.append(HumanMessage(content=h["content"]))
    elif h["role"] == "assistant" and not h["content"].startswith(EVALUATOR_PREFIX):
        past_messages.append(AIMessage(content=h["content"]))

past_messages.append(HumanMessage(content=message))

state = {
    "messages": past_messages,
    "success_criteria": success_criteria or "The answer should be clear and accurate",
    "feedback_on_work": None,
    "success_criteria_met": False,
    "user_input_needed": False,
    "turn_count": 0,
}

The request handler is boring, which is the point:

@app.post("/api/chat")
async def chat(request: ChatRequest):
    history = _get_session(request.session_id)            # read
    new_history = await wingman.run_superstep(            # rebuild + run
        request.message, request.success_criteria, history,
    )
    _put_session(request.session_id, new_history)         # write
    return {"history": new_history}

Read, rebuild, run, write. No connection to keep warm, no checkpoint to hope survived. Any Lambda environment can serve any request for any session, because nothing important lives in the compute.

The alternatives I rejected. Keeping state in Lambda memory loses conversations on cold start, which is the bug I described above. A durable LangGraph checkpointer (DynamoDB or Postgres backend) would work, but it makes the framework's thread the source of truth and ties me to its serialization format. Putting the whole history in the request payload pushes state to the client and grows every round trip. DynamoDB on-demand keyed by session was the least clever option, and least clever is what you want for state.

What broke, and what I would change

Two things in this code are honest tradeoffs, not polish.

First, retries don't survive a request. turn_count resets to 0 on every reconstruction, and the MAX_TURNS = 5 cap only applies within a single superstep. So the worker-evaluator loop can burn up to 5 retries answering one message, but it carries no retry budget across messages. For this app that's fine, since each user turn is its own task. If a single user task spanned multiple requests, I'd be silently resetting the budget, and I'd need to persist turn_count into the DynamoDB item alongside history.

Second, I throw away evaluator feedback on reload. The reconstruction loop skips any assistant line starting with EVALUATOR_PREFIX, so the evaluator's reasoning never re-enters the worker's context on the next request. That keeps the stored history clean and the prompt short, but it means cross-turn the worker can't see why it was corrected before. Within a turn the feedback flows fine through feedback_on_work. Across turns it's gone. That was a deliberate call to keep context small, and I'd revisit it if quality on multi-turn tasks dropped.

There's also a concurrency hole I'm aware of. DynamoDB writes here are last-write-wins with no conditional check. Two requests racing on the same session_id would clobber each other's history. A single user clicking once at a time never hits it, but it's the first thing I'd harden with a conditional write or a version attribute if this saw real concurrent traffic.

On cost, DynamoDB is on-demand billing keyed by a single partition key, which keeps a personal-scale app inside free tier. Reads and writes are single-item by primary key, so latency is a few milliseconds and predictable. The expensive part of a request is the LLM calls in the loop, not the state layer. The state layer is cheap. The loop is where the money goes.

Takeaway

On Lambda, don't make your framework's in-process memory the source of truth. Keep the compute disposable, own your durable state in something built for it, and let the checkpointer die every request.