DEV Community

mgd43b for AgentEnsemble

Posted on • Originally published at agentensemble.net

Human-in-the-Loop Agent Systems in Java

Fully autonomous agents make great demos. In production, someone on your team will eventually ask: "Can a human check this before it goes out?"

The answer should be yes, and it shouldn't require bolting on a custom approval system. Human-in-the-loop isn't a limitation of agent systems -- it's a feature. The best agent architectures make it easy to insert human judgment at exactly the right points, without breaking the execution flow.

This post covers how AgentEnsemble handles human review: the review handler API, review policies, pre-flight validation, and the patterns that make this work in practice.

Why Human-in-the-Loop?

Three reasons show up repeatedly in production agent deployments:

  1. Quality assurance. LLMs produce plausible-sounding output that's sometimes wrong. A human reviewer catches factual errors, hallucinations, and tone problems that automated checks miss.

  2. Compliance. Regulated industries (finance, healthcare, legal) often require human approval before AI-generated content is used in customer-facing contexts. It's not optional.

  3. Calibration. When you deploy a new agent workflow, you want to review the first few outputs to verify the agents are behaving as expected before letting them run autonomously.

The Review Handler

The core abstraction is reviewHandler() -- a function that receives a task's output and returns a Review decision:

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        System.out.println("=== REVIEW REQUIRED ===");
        System.out.println("Task: " + taskOutput.getTaskDescription());
        System.out.println("Agent: " + taskOutput.getAgentRole());
        System.out.println("Output:\n" + taskOutput.getRaw());
        System.out.println("========================");
        System.out.print("Decision (approve/reject/edit): ");

        String decision = scanner.nextLine().trim().toLowerCase(); // scanner: a java.util.Scanner over System.in

        return switch (decision) {
            case "approve" -> Review.approve();
            case "reject" -> {
                System.out.print("Reason: ");
                yield Review.reject(scanner.nextLine());
            }
            case "edit" -> {
                System.out.print("Enter corrected output: ");
                yield Review.edit(scanner.nextLine());
            }
            default -> Review.approve(); // unrecognized input falls through to approve; consider re-prompting instead
        };
    })
    .build()
    .run();

Three possible outcomes:

  • Review.approve() -- accept the output as-is. The ensemble continues with the next task.
  • Review.reject(reason) -- reject the output. The task is re-executed with the rejection reason fed back to the agent as additional context.
  • Review.edit(correctedOutput) -- replace the output with a human-provided correction. The ensemble continues with the edited version.

The review handler is a plain Java function. It can be a console prompt (as above), a REST call to an approval service, a Slack message that blocks until someone responds, or a database-backed workflow that pauses the ensemble until an approval flag is flipped.
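The database-flag variant can be sketched as a simple polling loop. Everything here is illustrative -- the store is modeled as a `Supplier`, and `awaitDecision` is not an AgentEnsemble API; in a real handler the supplier would be a JDBC query or repository call checking whether a human has filled in a decision:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

public class PollingReview {

    // Block by polling an external store until a human records a decision,
    // or until the timeout elapses.
    public static String awaitDecision(Supplier<Optional<String>> store,
                                       Duration timeout,
                                       Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            // e.g. SELECT decision FROM reviews WHERE id = ? (empty until a human decides)
            Optional<String> decision = store.get();
            if (decision.isPresent()) {
                return decision.get(); // "approve", "reject", or "edit"
            }
            try {
                Thread.sleep(interval.toMillis()); // back off before polling again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while awaiting review", e);
            }
        }
        throw new IllegalStateException("Review timed out after " + timeout);
    }
}
```

Inside a `reviewHandler`, the returned string would then be mapped to `Review.approve()`, `Review.reject(...)`, or `Review.edit(...)` exactly as in the console example above.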

Review Policies

Not every task needs human review. Review policies control which tasks trigger the handler:

Ensemble.builder()
    .agents(researcher, writer, editor)
    .tasks(researchTask, writeTask, editTask)
    .chatLanguageModel(model)
    .reviewHandler(this::handleReview)
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

Available policies:

  • REVIEW_ALL -- every task output goes through review. Use for high-stakes workflows.
  • REVIEW_FAILED -- only tasks that failed and were retried, or that hit the max iteration limit.
  • FIRST_TASK_ONLY -- review the first task's output to calibrate. If approved, the remaining tasks run without review.

FIRST_TASK_ONLY is particularly useful during the deployment phase. You review the first output to verify the agents are producing what you expect, then let the pipeline run autonomously for the remaining tasks.
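The decision each policy implies can be sketched as a small predicate. This is illustrative logic only, not AgentEnsemble's internal implementation -- the enum here is a local stand-in:

```java
public class PolicyDemo {
    public enum ReviewPolicy { REVIEW_ALL, REVIEW_FAILED, FIRST_TASK_ONLY }

    // Should this task's output trigger the review handler?
    public static boolean needsReview(ReviewPolicy policy, int taskIndex,
                                      boolean taskWasRetried) {
        return switch (policy) {
            case REVIEW_ALL -> true;                 // every output is reviewed
            case REVIEW_FAILED -> taskWasRetried;    // only failed-and-retried tasks
            case FIRST_TASK_ONLY -> taskIndex == 0;  // calibrate on the first output only
        };
    }
}
```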

Pre-Flight Validation

Sometimes you want an automated quality check before a human sees the output. The beforeReview() hook runs first:

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .beforeReview(taskOutput -> {
        String raw = taskOutput.getRaw();

        // Automated checks
        if (raw == null || raw.isBlank()) {
            return Review.reject("Empty output");
        }
        if (raw.length() < 200) {
            return Review.reject("Output too short -- minimum 200 characters");
        }
        if (raw.contains("I don't know") || raw.contains("I cannot")) {
            return Review.reject("Agent declined the task");
        }

        // Passed automated checks -- proceed to human review
        return Review.skip();
    })
    .reviewHandler(this::humanReview)
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

The flow is:

  1. Task completes.
  2. beforeReview() runs automated checks.
    • If it returns Review.reject(), the task re-executes. No human is bothered.
    • If it returns Review.skip(), the output passes to the human reviewHandler().
    • If it returns Review.approve(), the output is accepted without human review.
  3. reviewHandler() presents the output to a human (if beforeReview didn't already decide).

This pattern keeps humans focused on judgment calls, not on catching obvious failures that a simple check could handle.
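The dispatch order above can be sketched as follows -- the enum and method names are illustrative stand-ins, not the framework's internal types, but the ordering matches the flow described: the pre-flight check decides first, and the human handler is consulted only on `SKIP`:

```java
import java.util.function.Function;

public class ReviewDispatch {
    public enum Decision { APPROVE, REJECT, SKIP }

    // beforeReview runs first; the human handler runs only if it returns SKIP.
    public static Decision decide(Function<String, Decision> beforeReview,
                                  Function<String, Decision> humanReview,
                                  String output) {
        Decision automated = beforeReview.apply(output);
        if (automated != Decision.SKIP) {
            return automated;              // pre-flight already decided; no human involved
        }
        return humanReview.apply(output);  // output passed automated checks; ask the human
    }
}
```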

Rejection and Re-Execution

When a review is rejected -- whether by beforeReview or the human reviewer -- the framework re-executes the task with the rejection reason injected as additional context. The agent sees:

Previous attempt was rejected. Reason: "Output too short -- minimum 200 characters"

This gives the agent a chance to correct its approach. It's not just a retry -- the agent has feedback on what went wrong.

You can limit the number of review cycles to prevent infinite loops:

Task criticalTask = Task.builder()
    .description("Write the executive summary")
    .expectedOutput("A concise, accurate summary")
    .agent(writer)
    .maxOutputRetries(3) // max 3 re-executions after rejection
    .build();

If the output is still rejected after 3 re-executions, the task fails with a clear error.
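The reject-and-re-execute cycle can be sketched as a bounded loop that feeds the rejection reason back into the next attempt. This is a simplified model of the behavior described above, not AgentEnsemble's actual retry code; here `review` returns `null` for approval or a rejection reason otherwise:

```java
import java.util.function.Function;

public class RetryLoop {
    // executeTask receives the feedback string (empty on the first attempt);
    // review returns null when approved, or a rejection reason.
    public static String runWithReview(Function<String, String> executeTask,
                                       Function<String, String> review,
                                       int maxOutputRetries) {
        String feedback = "";
        for (int attempt = 0; attempt <= maxOutputRetries; attempt++) {
            String output = executeTask.apply(feedback);
            String reason = review.apply(output);
            if (reason == null) {
                return output; // approved
            }
            // Inject the rejection reason as extra context for the next attempt
            feedback = "Previous attempt was rejected. Reason: \"" + reason + "\"";
        }
        throw new IllegalStateException(
            "Output still rejected after " + maxOutputRetries + " re-executions");
    }
}
```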

Patterns for Production Review Workflows

Pattern 1: Console Review (Development)

Good for testing and debugging:

.reviewHandler(taskOutput -> {
    System.out.println(taskOutput.getRaw());
    System.out.print("Approve? (y/n): ");
    return scanner.nextLine().equals("y")
        ? Review.approve()
        : Review.reject("Rejected by developer");
})

Pattern 2: REST API Review (Production)

Block the ensemble until an external approval system responds:

.reviewHandler(taskOutput -> {
    // Submit for review
    String reviewId = approvalService.submitForReview(
        taskOutput.getTaskDescription(),
        taskOutput.getRaw()
    );

    // Poll until decision is made
    ReviewDecision decision = approvalService.awaitDecision(reviewId);

    return switch (decision.status()) {
        case APPROVED -> Review.approve();
        case REJECTED -> Review.reject(decision.reason());
        case EDITED -> Review.edit(decision.correctedOutput());
    };
})

Pattern 3: Slack/Teams Notification

Send a message and wait for a reaction:

.reviewHandler(taskOutput -> {
    String messageId = slack.postMessage(
        "#agent-reviews",
        formatForSlack(taskOutput)
    );

    // Block until thumbs-up or thumbs-down reaction
    SlackReaction reaction = slack.awaitReaction(messageId,
        Duration.ofMinutes(30));

    return reaction.isPositive()
        ? Review.approve()
        : Review.reject("Rejected via Slack");
})

Pattern 4: Automated-Only Review

Skip the human entirely -- use beforeReview for automated quality gates:

.beforeReview(taskOutput -> {
    QualityScore score = qualityChecker.evaluate(taskOutput.getRaw());

    if (score.overall() >= 0.8) {
        return Review.approve(); // good enough, no human needed
    } else if (score.overall() >= 0.5) {
        return Review.skip(); // borderline, send to human
    } else {
        return Review.reject("Quality score too low: " + score.overall());
    }
})
.reviewHandler(this::humanReviewForBorderlineCases)

Pattern 5: Tiered Review by Task

Use task-level configuration to vary review intensity:

// Critical task -- always reviewed
Task customerEmail = Task.builder()
    .description("Draft a response to the customer complaint")
    .expectedOutput("Professional, empathetic email response")
    .agent(writer)
    .build();

// Internal task -- skip review
Task internalSummary = Task.builder()
    .description("Summarize the complaint for internal tracking")
    .expectedOutput("Brief internal summary")
    .agent(writer)
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(customerEmail, internalSummary)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        // Only review customer-facing tasks
        if (taskOutput.getTaskDescription().contains("customer")) {
            return humanReview(taskOutput);
        }
        return Review.approve(); // skip internal tasks
    })
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

Combining Review with Other Production Features

Review gates compose naturally with other AgentEnsemble features:

Review + Guardrails

Guardrails catch invalid content at the agent level. Review catches quality issues at the workflow level.

Agent writer = Agent.builder()
    .role("Content Writer")
    .goal("Write marketing copy")
    .outputGuardrail(output -> {
        if (containsPII(output)) {
            return GuardrailResult.reject("Output contains PII");
        }
        return GuardrailResult.accept();
    })
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(writeTask)
    .chatLanguageModel(model)
    .beforeReview(this::automatedQualityCheck)
    .reviewHandler(this::humanReview)
    .build()
    .run();

The execution flow is: Agent runs -> Guardrail validates -> Pre-flight check -> Human review. Each layer catches different classes of problems.
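The layered idea generalizes to a chain where each validator either rejects with a reason or passes the output along. This sketch is illustrative -- the layers here are plain functions standing in for guardrail, pre-flight, and human review -- but it shows the key property: the first rejecting layer short-circuits the rest:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class ValidationChain {
    // Each layer returns a rejection reason, or empty to pass the output along.
    public static Optional<String> firstRejection(
            List<Function<String, Optional<String>>> layers, String output) {
        for (Function<String, Optional<String>> layer : layers) {
            Optional<String> rejection = layer.apply(output);
            if (rejection.isPresent()) {
                return rejection; // this layer rejected; later layers never run
            }
        }
        return Optional.empty(); // all layers passed
    }
}
```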

Review + Callbacks

Track review decisions alongside other execution events:

Ensemble.builder()
    .agents(writer)
    .tasks(writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        Review decision = humanReview(taskOutput);
        auditLog.record(taskOutput.getTaskDescription(),
            decision.getType(), decision.getReason());
        return decision;
    })
    .listener(event -> {
        if (event instanceof TaskCompleteEvent e) {
            metrics.recordTaskCompletion(e);
        }
    })
    .build()
    .run();

Review + Structured Output

Review typed output, not raw strings:

record ProposalDraft(
    String title,
    String executiveSummary,
    List<String> keyPoints,
    double estimatedBudget
) {}

Task proposalTask = Task.builder()
    .description("Draft a project proposal")
    .expectedOutput("Structured proposal")
    .agent(writer)
    .outputType(ProposalDraft.class)
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(proposalTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        ProposalDraft draft = taskOutput
            .getStructuredOutput(ProposalDraft.class);

        // Review specific fields
        if (draft.estimatedBudget() > 100_000) {
            return Review.reject("Budget exceeds approval threshold");
        }
        if (draft.keyPoints().size() < 3) {
            return Review.reject("Need at least 3 key points");
        }
        return Review.approve();
    })
    .build()
    .run();

The Design Philosophy

Human-in-the-loop isn't an escape hatch for when agents fail. It's a first-class architectural decision. The best agent systems are designed with human review points from the start, not retrofitted when something goes wrong in production.

AgentEnsemble makes this easy by treating review as a builder method, not a separate system. Same API, same execution flow, same observability. A human reviewer is just another step in the pipeline.


Get started: AgentEnsemble is MIT-licensed and available on GitHub.
