Akash Hadagali Persetti

Posted on Jul 3 • Originally published at akashpersetti.hashnode.dev

Building a worker-evaluator retry loop in LangGraph (and where it bites you on Lambda)

#langgraph #terraform #lambda #agents

Most agent demos stop at "the model called a tool and gave an answer." That answer is often wrong, and nothing in the loop notices. I wanted an agent that checks its own work against stated success criteria before returning, and retries when it falls short. That is the whole idea behind Wingman: a worker that does the task, and a separate evaluator that decides whether the task is actually done.

Here is how the loop is built, the one decision that shaped everything, and the part that broke once it ran on Lambda behind API Gateway.

The problem

A single-pass agent has no idea when it failed. It calls a tool, produces text, and returns. If the output misses the point, the user finds out, not the system.

I wanted a gate. Something that reads the assistant's last response, compares it against the stated success criteria, and either accepts it or sends the worker back to try again with feedback. Two roles, not one. The worker produces. The evaluator judges.

The catch is that a retry loop with no ceiling is a bill waiting to happen. So the loop needs a hard stop, and it needs to know the difference between "the assistant needs to try again" and "the assistant is stuck and should ask the user."

The approach and the key decision

The graph has three nodes: worker, tools, and evaluator. The state carries the conversation, the success criteria, the latest feedback, two boolean flags, and a turn counter.

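from typing import Annotated, Any, List, Optional, TypedDict

from langgraph.graph.message import add_messages


class State(TypedDict):
    messages: Annotated[List[Any], add_messages]  # reducer adds new messages instead of replacing the list
    success_criteria: str
    feedback_on_work: Optional[str]  # evaluator feedback from the previous turn, if any
    success_criteria_met: bool
    user_input_needed: bool
    turn_count: int  # number of evaluator turns so far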

The worker runs first. If its last message contains tool calls, the router sends it to the tool node and back. If not, it goes to the evaluator.

def worker_router(self, state: State) -> str:
    last = state["messages"][-1]
    return "tools" if (hasattr(last, "tool_calls") and last.tool_calls) else "evaluator"
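To see both branches of that router without a model in the loop, here is a minimal sketch. `FakeMessage` is a hypothetical stand-in for a LangChain message, and the router is written as a free function rather than a method. Neither comes from Wingman itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeMessage:
    """Hypothetical stand-in for an AI message; holds only the fields the router reads."""
    content: str = ""
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


def worker_router(state: Dict[str, Any]) -> str:
    """Same check as the method above; getattr treats a missing attribute as 'no tool calls'."""
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else "evaluator"


# A message that requests a tool, and one that is a plain final answer.
wants_tool = FakeMessage(tool_calls=[{"name": "search", "args": {"q": "langgraph"}, "id": "call_1"}])
final_answer = FakeMessage(content="Here is the answer.")

route_tool = worker_router({"messages": [wants_tool]})
route_final = worker_router({"messages": [final_answer]})
print(route_tool, route_final)
```

An empty `tool_calls` list is falsy, so a plain text reply falls through to the evaluator.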

The key decision is making the evaluator a structured-output call, not a free-text one. It returns a typed object, so the routing logic reads clean booleans instead of parsing prose.

from pydantic import BaseModel, Field


class EvaluatorOutput(BaseModel):
    feedback: str = Field(description="Feedback on the assistant's response")
    success_criteria_met: bool = Field(description="Whether the success criteria have been met")
    user_input_needed: bool = Field(
        description="True if more input is needed from the user, or the assistant is stuck"
    )

Both the worker and the evaluator run on gpt-4o-mini. The worker is bound to tools. The evaluator is bound to the schema above with .with_structured_output.

The routing after evaluation is where the loop's whole personality lives:

def route_based_on_evaluation(self, state: State) -> str:
    if state["success_criteria_met"] or state["user_input_needed"]:
        return "END"
    if state.get("turn_count", 0) >= self.MAX_TURNS:
        return "END"
    return "worker"

Three ways out. The criteria are met. The assistant is stuck and needs the user. Or the loop has spent its five turns. Otherwise it loops back to the worker, and this time the worker gets the feedback baked into its system prompt:

if state.get("feedback_on_work"):
    system_message += f"""
Previously you thought you completed the assignment, but your reply was rejected because the success criteria was not met.
Feedback:
{state["feedback_on_work"]}
Please continue the assignment..."""

That feedback injection is what makes the retry worth anything. The worker does not just try again blind. It tries again knowing why the last attempt was rejected.
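The whole cycle can be simulated in plain Python to check the exits and the feedback path. This is only a sketch: `fake_worker`, `fake_evaluator`, the base prompt, and the feedback text are made-up stubs standing in for the model calls, and the loop is a `while` rather than a compiled graph.

```python
from typing import Any, Dict, List, Optional

MAX_TURNS = 5  # the five-turn cap described above


def build_system_message(feedback: Optional[str]) -> str:
    """Append evaluator feedback to the worker prompt when there is any."""
    message = "You are a helpful assistant."  # placeholder base prompt
    if feedback:
        message += f"\nYour last reply was rejected. Feedback:\n{feedback}"
    return message


def fake_worker(system_message: str) -> str:
    """Stub worker: only produces a complete answer once it has seen feedback."""
    return "complete answer" if "Feedback:" in system_message else "partial answer"


def fake_evaluator(reply: str) -> Dict[str, Any]:
    """Stub evaluator returning the same three fields as EvaluatorOutput."""
    done = reply == "complete answer"
    return {
        "feedback": "" if done else "The answer is missing the second half.",
        "success_criteria_met": done,
        "user_input_needed": False,
    }


def route(state: Dict[str, Any]) -> str:
    """Mirror of route_based_on_evaluation as a free function."""
    if state["success_criteria_met"] or state["user_input_needed"]:
        return "END"
    if state.get("turn_count", 0) >= MAX_TURNS:
        return "END"
    return "worker"


def run_loop() -> Dict[str, Any]:
    state: Dict[str, Any] = {
        "feedback_on_work": None,
        "success_criteria_met": False,
        "user_input_needed": False,
        "turn_count": 0,
    }
    replies: List[str] = []
    while True:
        reply = fake_worker(build_system_message(state["feedback_on_work"]))
        replies.append(reply)
        verdict = fake_evaluator(reply)
        # The evaluator step is the only place turn_count goes up.
        state.update(
            feedback_on_work=verdict["feedback"],
            success_criteria_met=verdict["success_criteria_met"],
            user_input_needed=verdict["user_input_needed"],
            turn_count=state["turn_count"] + 1,
        )
        if route(state) == "END":
            break
    state["replies"] = replies
    return state


result = run_loop()
print(result["replies"], result["turn_count"])
```

With these stubs the first attempt is rejected, the feedback reaches the second prompt, and the loop exits on the second evaluator turn.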

One design detail I care about: the evaluator is told not to confuse "not done yet" with "stuck." Unmet criteria mean retry, not stop. The evaluator prompt says so directly:

For user_input_needed: set True ONLY if the assistant explicitly asked the user a question or stated it cannot proceed.
Do NOT set user_input_needed=True just because the criteria was not met — the assistant should simply retry.

Without that line, the evaluator bails to the user the first time the worker misses, and the retry loop never actually retries.

What broke, and what I changed

The loop is stateless by design. Wingman reconstructs the full conversation from DynamoDB on every request and runs a fresh graph invocation with a throwaway thread ID, so there is no in-memory state to lose between Lambda cold starts. That part held up.

The timeout did not.

A worker-evaluator loop is slow. Five turns, each a worker call plus tool calls plus an evaluator call, add up to real wall-clock time. I set the Lambda timeout to 300 seconds to give the loop room:

# LangGraph agent loops can take a while.
timeout     = 300  # 5 minutes (Lambda max is 15 min)
memory_size = 1024

That number is a lie if the request comes through API Gateway. The HTTP API has a hard 29-second integration cap that Lambda's 300 seconds cannot override. So a long agent run would keep working inside Lambda while the gateway had already given up on the caller. The function billed for the full run. The user got nothing.
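A rough back-of-the-envelope check makes the mismatch concrete. The per-call latencies below are made-up placeholders, not measurements from Wingman. Only the 29-second cap, the 300-second Lambda timeout, and the five-turn limit come from the text above.

```python
GATEWAY_CAP_S = 29      # API Gateway integration cap cited above
LAMBDA_TIMEOUT_S = 300  # Lambda timeout configured above
MAX_TURNS = 5           # evaluator turns allowed by the loop

# Hypothetical latencies in seconds; substitute real measurements.
WORKER_CALL_S = 4.0
TOOL_CALL_S = 3.0
EVALUATOR_CALL_S = 2.0


def turn_seconds(tool_hops: int) -> float:
    """One evaluator turn: each tool hop is a worker call plus a tool call,
    followed by a final worker call and one evaluator call."""
    return tool_hops * (WORKER_CALL_S + TOOL_CALL_S) + WORKER_CALL_S + EVALUATOR_CALL_S


def run_seconds(turns: int, tool_hops_per_turn: int) -> float:
    """Total wall-clock estimate for a run of identical turns."""
    return turns * turn_seconds(tool_hops_per_turn)


worst_case = run_seconds(MAX_TURNS, tool_hops_per_turn=1)
print(f"worst case: {worst_case:.0f}s")
print("fits Lambda timeout:", worst_case <= LAMBDA_TIMEOUT_S)
print("fits gateway cap:", worst_case <= GATEWAY_CAP_S)
```

Under these assumed numbers a full run fits comfortably inside the Lambda timeout but is more than double the gateway cap, and even three turns would outlast it.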

The fix was to stop routing long tasks through the gateway at all. I added a Lambda Function URL, which is a direct HTTPS endpoint with no 29-second ceiling:

# Lambda function URL — direct HTTPS invoke, bypasses API Gateway 29s timeout.
resource "aws_lambda_function_url" "backend" {
  function_name      = aws_lambda_function.backend.function_name
  authorization_type = "NONE"
}

Now the gateway stays for the fast endpoints, and the agent loop can run against the Function URL without a proxy in the middle deciding it took too long.

There is a second sharp edge I have not fully closed. turn_count only increments inside the evaluator node:

"turn_count": state.get("turn_count", 0) + 1,

But the worker can loop worker -> tools -> worker -> tools several times before it ever hands off to the evaluator, and none of those hops count against MAX_TURNS. So the five-turn cap is really "five evaluator turns," not "five model calls." A task that keeps calling tools without producing a final answer can burn far more time than the cap implies. The honest fix is to count worker iterations too, or add a wall-clock budget inside the run, rather than trusting the evaluator counter to bound cost. That one is still on my list.
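One possible shape for that fix, sketched under assumptions: `RunBudget` is a hypothetical helper, not something in Wingman or LangGraph, and the limits in the demo are arbitrary. It counts every model call and also tracks elapsed time, and the worker router consults it before allowing another tool hop.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunBudget:
    """Hypothetical helper: caps both model calls and elapsed wall-clock time."""
    max_model_calls: int
    max_seconds: float
    clock: Callable[[], float] = time.monotonic  # injectable so tests can fake time
    model_calls: int = 0
    started_at: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def record_model_call(self) -> None:
        """Call once per worker or evaluator invocation."""
        self.model_calls += 1

    def exhausted(self) -> bool:
        out_of_calls = self.model_calls >= self.max_model_calls
        out_of_time = (self.clock() - self.started_at) >= self.max_seconds
        return out_of_calls or out_of_time


def worker_router(has_tool_calls: bool, budget: RunBudget) -> str:
    """Route to END when the budget is spent, even if the worker still wants tools."""
    if budget.exhausted():
        return "END"
    return "tools" if has_tool_calls else "evaluator"


# Demo with a fake clock so the outcome is deterministic.
fake_now = [0.0]
budget = RunBudget(max_model_calls=3, max_seconds=25.0, clock=lambda: fake_now[0])

routes = []
for _ in range(4):
    budget.record_model_call()  # one worker call per hop
    routes.append(worker_router(has_tool_calls=True, budget=budget))
print(routes)
```

Here a worker that keeps asking for tools is cut off after its third model call, without ever reaching the evaluator. In a real graph, ending this way would return whatever partial state exists, so the caller would need a way to tell a budget stop from a successful finish.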

Takeaway

An evaluator that gates the worker's output is the cheap half of a self-correcting agent. The expensive half is making the retry ceiling account for every model call, not just the ones you happened to count.

Top comments (2)

Raju Dandigam • Jul 3

Splitting the loop into a worker and an evaluator is the right way to turn “the model said it’s done” into an actual control point. I especially like that you kept explicit success criteria, a hard retry ceiling, and a separate user_input_needed path, because those are the pieces that stop a retry loop from becoming an expensive infinite shrug. In production the next pain point is usually visibility into why the evaluator kept bouncing the worker, which is where execution-tree tooling like agent-inspect starts paying for itself. Curious what failed first on Lambda in your setup: latency, state handoff, or just the difficulty of preserving enough context between retries.

Akash Hadagali Persetti • Jul 8

"expensive infinite shrug" is exactly the failure mode I was designing against, that's a good way to put it.

To your question: it was latency, and not where I expected. State handoff was fine. I reconstruct the full conversation from DynamoDB on every request and run a fresh graph invocation, so there's no in-memory state to lose between retries or cold starts. That part held up.

What broke was the timeout ceiling. I'd set the Lambda timeout to 300s to give the loop room, but every request went through API Gateway, which has a hard 29s integration cap Lambda can't override. So a long run kept churning inside Lambda while the gateway had already dropped the caller. Function billed for the full run, user got nothing. Fix was a Lambda Function URL for the long tasks so there's no proxy deciding it took too long.

Haven't tried agent-inspect but the visibility point is real, right now my evaluator feedback is just logged, and reading why it bounced the worker across five turns is more archaeology than I'd like. Will take a look.