Most agent demos stop at "the model called a tool and gave an answer." That answer is often wrong, and nothing in the loop notices. I wanted an agent that checks its own work against explicit success criteria before returning, and retries when it falls short. That is the whole idea behind Wingman: a worker that does the task, and a separate evaluator that decides whether the task is actually done.
Here is how the loop is built, the one decision that shaped everything, and the part that broke once it ran on Lambda behind API Gateway.
The problem
A single-pass agent has no idea when it failed. It calls a tool, produces text, and returns. If the output misses the point, the user finds out, not the system.
I wanted a gate. Something that reads the assistant's last response, compares it against the stated success criteria, and either accepts it or sends the worker back to try again with feedback. Two roles, not one. The worker produces. The evaluator judges.
The catch is that a retry loop with no ceiling is a bill waiting to happen. So the loop needs a hard stop, and it needs to know the difference between "the assistant needs to try again" and "the assistant is stuck and should ask the user."
The approach and the key decision
The graph has three nodes: worker, tools, and evaluator. The state carries the conversation, the success criteria, the latest feedback, two boolean flags, and a turn counter.
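from typing import Annotated, Any, List, Optional, TypedDict

from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[List[Any], add_messages]  # conversation; new messages are appended
    success_criteria: str
    feedback_on_work: Optional[str]  # latest evaluator feedback, if any
    success_criteria_met: bool
    user_input_needed: bool
    turn_count: int  # number of evaluator passes so far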
The worker runs first. If its last message contains tool calls, the router sends it to the tool node and back. If not, it goes to the evaluator.
def worker_router(self, state: State) -> str:
    last = state["messages"][-1]
    return "tools" if (hasattr(last, "tool_calls") and last.tool_calls) else "evaluator"
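To see the router's behavior in isolation, here is a minimal sketch that uses stand-in message objects instead of real LangChain messages. `FakeMessage` and the `search` tool name are hypothetical and exist only for illustration; the routing check is equivalent to the one above, written as a free function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeMessage:
    """Hypothetical stand-in for an AI message; not a LangChain class."""
    content: str
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


def worker_router(state: Dict[str, Any]) -> str:
    # Route to the tool node only when the last message requests tool calls.
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else "evaluator"


wants_tool = FakeMessage("", tool_calls=[{"name": "search", "args": {"q": "x"}}])
final_answer = FakeMessage("Here is the result.")

print(worker_router({"messages": [wants_tool]}))    # tools
print(worker_router({"messages": [final_answer]}))  # evaluator
```

An empty `tool_calls` list is falsy, so a plain text reply falls through to the evaluator.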
The key decision is making the evaluator a structured-output call, not a free-text one. It returns a typed object, so the routing logic reads clean booleans instead of parsing prose.
from pydantic import BaseModel, Field

class EvaluatorOutput(BaseModel):
    feedback: str = Field(description="Feedback on the assistant's response")
    success_criteria_met: bool = Field(description="Whether the success criteria have been met")
    user_input_needed: bool = Field(
        description="True if more input is needed from the user, or the assistant is stuck"
    )
Both the worker and the evaluator run on gpt-4o-mini. The worker is bound to tools. The evaluator is bound to the schema above with .with_structured_output.
The routing after evaluation is where the loop's whole personality lives:
def route_based_on_evaluation(self, state: State) -> str:
    if state["success_criteria_met"] or state["user_input_needed"]:
        return "END"
    if state.get("turn_count", 0) >= self.MAX_TURNS:
        return "END"
    return "worker"
Three ways out. The criteria are met. The assistant is stuck and needs the user. Or the loop has spent its five turns. Otherwise it loops back to the worker, and this time the worker gets the feedback baked into its system prompt:
if state.get("feedback_on_work"):
    system_message += f"""
Previously you thought you completed the assignment, but your reply was rejected because the success criteria was not met.
Feedback:
{state["feedback_on_work"]}
Please continue the assignment..."""
That feedback injection is what makes the retry worth anything. The worker does not just try again blind. It tries again knowing why the last attempt was rejected.
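The injection step can be sketched as a small pure function, which makes it easy to check that the first attempt and a retry differ only by the feedback block. `BASE_PROMPT`, `build_system_message`, and the example criteria are assumptions for illustration; Wingman's real base prompt is not shown here.

```python
from typing import Optional

# Hypothetical base prompt; the real one is not shown in this post.
BASE_PROMPT = "You are a helpful assistant that completes tasks using tools."


def build_system_message(base: str, success_criteria: str, feedback: Optional[str]) -> str:
    """Return the worker's system prompt, appending evaluator feedback on retries."""
    message = f"{base}\nSuccess criteria:\n{success_criteria}"
    if feedback:
        message += (
            "\nPreviously you thought you completed the assignment, but your reply "
            "was rejected because the success criteria was not met.\n"
            f"Feedback:\n{feedback}\n"
            "Please continue the assignment."
        )
    return message


first_try = build_system_message(BASE_PROMPT, "Cite two sources.", None)
retry = build_system_message(BASE_PROMPT, "Cite two sources.", "Only one source was cited.")
print(retry)
```

Keeping the prompt builder free of graph state makes the retry path testable without calling a model.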
One design detail I care about: the evaluator is told not to confuse "not done yet" with "stuck." Unmet criteria mean retry, not stop.
For user_input_needed: set True ONLY if the assistant explicitly asked the user a question or stated it cannot proceed.
Do NOT set user_input_needed=True just because the criteria was not met — the assistant should simply retry.
Without that instruction, the evaluator bails to the user the first time the worker misses, and the retry loop never actually retries.
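The routing rules and the retry-versus-stuck distinction can be checked as a small truth table. This is a standalone sketch: the function mirrors the routing method shown earlier but takes a plain dict, and `MAX_TURNS = 5` is assumed from the five-turn cap described above.

```python
from typing import Any, Dict

MAX_TURNS = 5  # assumed from the five-turn cap described in the text


def route_based_on_evaluation(state: Dict[str, Any]) -> str:
    if state["success_criteria_met"] or state["user_input_needed"]:
        return "END"
    if state.get("turn_count", 0) >= MAX_TURNS:
        return "END"
    return "worker"


cases = {
    "criteria met": {"success_criteria_met": True, "user_input_needed": False, "turn_count": 1},
    "stuck, ask the user": {"success_criteria_met": False, "user_input_needed": True, "turn_count": 1},
    "missed, turns left": {"success_criteria_met": False, "user_input_needed": False, "turn_count": 2},
    "missed, cap reached": {"success_criteria_met": False, "user_input_needed": False, "turn_count": 5},
}

for name, state in cases.items():
    print(f"{name}: {route_based_on_evaluation(state)}")
```

Only the third case loops back to the worker. If the evaluator set `user_input_needed=True` on every miss, that case would collapse into the second and the loop would end after one attempt.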
What broke, and what I changed
The loop is stateless by design. Wingman reconstructs the full conversation from DynamoDB on every request and runs a fresh graph invocation with a throwaway thread ID, so there is no in-memory state to lose when Lambda cold-starts a new execution environment. That part held up.
The timeout did not.
A worker-evaluator loop is slow. Five turns, each turn a worker call plus tool calls plus an evaluator call, adds up to real wall-clock time. I set the Lambda timeout to 300 seconds to give the loop room:
# LangGraph agent loops can take a while.
timeout = 300 # 5 minutes (Lambda max is 15 min)
memory_size = 1024
That number is a lie if the request comes through API Gateway. API Gateway enforces its own integration timeout of roughly 29 seconds (the exact limit depends on the API type), and Lambda's 300 seconds cannot override it. So a long agent run would keep working inside Lambda while the gateway had already given up on the caller. The function billed for the full run. The user got nothing.
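A back-of-the-envelope check makes the mismatch concrete. The per-call latencies below are made-up placeholders, not measurements from Wingman, and the model assumes exactly one worker call, one tool round, and one evaluator call per turn.

```python
GATEWAY_CAP_S = 29      # API Gateway integration timeout cited above
LAMBDA_TIMEOUT_S = 300  # Lambda timeout from the config above
MAX_TURNS = 5           # assumed from the five-turn cap


def worst_case_seconds(worker_s: float, tools_s: float, evaluator_s: float,
                       turns: int = MAX_TURNS) -> float:
    """Total time if every turn runs one worker call, one tool round, one evaluator call."""
    return turns * (worker_s + tools_s + evaluator_s)


# Placeholder latencies: 4 s worker, 3 s tools, 2 s evaluator per turn.
total = worst_case_seconds(4, 3, 2)
print(total)                      # 45
print(total <= LAMBDA_TIMEOUT_S)  # True
print(total <= GATEWAY_CAP_S)     # False
```

Even with modest per-call latencies, a full five-turn run fits comfortably inside the Lambda timeout and still blows past the gateway's limit.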
The fix was to stop routing long tasks through the gateway at all. I added a Lambda Function URL, which is a direct HTTPS endpoint with no 29-second ceiling:
# Lambda function URL — direct HTTPS invoke, bypasses API Gateway 29s timeout.
resource "aws_lambda_function_url" "backend" {
  function_name      = aws_lambda_function.backend.function_name
  authorization_type = "NONE"
}
Now the gateway stays for the fast endpoints, and the agent loop can run against the Function URL without a proxy in the middle deciding it took too long.
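On the caller's side, the client timeout has to be raised as well, or the client gives up before the loop finishes. Here is a minimal standard-library sketch. The URL is a placeholder, and the payload fields (`message`, `success_criteria`) are a hypothetical request shape, not Wingman's actual schema.

```python
import json
import urllib.request

# Placeholder; a real Function URL looks like https://<url-id>.lambda-url.<region>.on.aws/
FUNCTION_URL = "https://example.lambda-url.us-east-1.on.aws/"
CLIENT_TIMEOUT_S = 310  # slightly above the 300 s Lambda timeout


def build_request(message: str, success_criteria: str) -> urllib.request.Request:
    # Hypothetical payload shape, for illustration only.
    body = json.dumps({"message": message, "success_criteria": success_criteria}).encode("utf-8")
    return urllib.request.Request(
        FUNCTION_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_agent(message: str, success_criteria: str) -> dict:
    """Send the task to the Function URL and wait for the full agent loop to finish."""
    request = build_request(message, success_criteria)
    with urllib.request.urlopen(request, timeout=CLIENT_TIMEOUT_S) as response:
        return json.loads(response.read().decode("utf-8"))
```

Setting the client timeout a little above the Lambda timeout means that, if a run hangs, the function times out first and the caller sees its error response.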
There is a second sharp edge I have not fully closed. turn_count only increments inside the evaluator node:
"turn_count": state.get("turn_count", 0) + 1,
But the worker can loop worker -> tools -> worker -> tools several times before it ever hands off to the evaluator, and none of those hops count against MAX_TURNS. So the five-turn cap is really "five evaluator turns," not "five model calls." A task that keeps calling tools without producing a final answer can burn far more time than the cap implies. The honest fix is to count worker iterations too, or add a wall-clock budget inside the run, rather than trusting the evaluator counter to bound cost. That one is still on my list.
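One possible shape for that fix is a per-run budget that every model call has to charge against, covering both the call count and wall-clock time. This is a sketch under assumptions, not what Wingman does today: `RunBudget` and `BudgetExceeded` are hypothetical names, and how the budget is wired into the graph nodes is left open.

```python
import time
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a run has spent its model-call or wall-clock budget."""


class RunBudget:
    def __init__(self, max_model_calls: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_model_calls = max_model_calls
        self.model_calls = 0
        self._clock = clock
        self._deadline = clock() + max_seconds

    def charge_model_call(self) -> None:
        """Call before every worker or evaluator model call."""
        if self._clock() >= self._deadline:
            raise BudgetExceeded("wall-clock budget spent")
        if self.model_calls >= self.max_model_calls:
            raise BudgetExceeded("model-call budget spent")
        self.model_calls += 1


budget = RunBudget(max_model_calls=3, max_seconds=240)
for _ in range(3):
    budget.charge_model_call()  # worker, worker-after-tools, evaluator
try:
    budget.charge_model_call()
except BudgetExceeded as exc:
    print(exc)  # model-call budget spent
```

Because the worker charges the budget on every pass, a worker -> tools -> worker cycle that never reaches the evaluator still runs out, which the evaluator-only counter cannot guarantee.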
Takeaway
An evaluator that gates the worker's output is the cheap half of a self-correcting agent. The expensive half is making the retry ceiling account for every model call, not just the ones you happened to count.
Top comments (1)
Splitting the loop into a worker and an evaluator is the right way to turn “the model said it’s done” into an actual control point. I especially like that you kept explicit success criteria, a hard retry ceiling, and a separate user_input_needed path, because those are the pieces that stop a retry loop from becoming an expensive infinite shrug. In production the next pain point is usually visibility into why the evaluator kept bouncing the worker, which is where execution-tree tooling like agent-inspect starts paying for itself. Curious what failed first on Lambda in your setup: latency, state handoff, or just the difficulty of preserving enough context between retries.