TLDR: GraphScout explores candidate reasoning paths, Plan Validator grades the resulting plans across completeness, efficiency, safety, coherence, and fallback, then feeds actionable edits back into the loop. The result is workflow execution that is auditable, cheaper, and more resilient than one-shot chains. This article shows how to pair them in OrKa, why the pairing matters, and how to ship the pairing responsibly with tests, metrics, and trace visibility.
Why pair a scout with a gate
If you rely on a single agent to decide the entire path for a complex task, you inherit that agent’s blind spots. It will often miss an edge case, overuse a costly tool, skip verification, or choose a path that works only on the happy path. The answer may look fine, until it is not. When you ship that pattern into production, your latency and failure curves will tell the story.
GraphScout and Plan Validator attack this from different angles that click together:
- GraphScout generates candidate routes through your agent graph. It is a search process that proposes plausible sequences of nodes with cost and capability in mind.
- Plan Validator evaluates any proposed plan with a strict rubric, returns a score, and gives concrete repair suggestions. You can loop on those suggestions until the plan clears a threshold, then execute.
The scout explores. The gate enforces. Together they close the planning loop with a feedback path that is inspectable and deterministic enough for real systems.
Quick primer on GraphScout
GraphScout’s job is to propose. You define a graph of agents and service nodes, and GraphScout returns one or more candidate execution paths that try to satisfy an intent given constraints. Typical constraints include required steps like retrieval and verification, cost ceilings, and hard requirements such as safety checks before any external side effect.
Important properties you should know:
- Search space: You can limit or expand it through heuristics. Fan out too much and you burn tokens. Constrain too tightly and you miss creative but valid routes.
- Cost hints: GraphScout tracks simple cost estimates per node. That lets it prefer cheaper paths when capability is similar.
- Output: A plan object, usually a list of steps with metadata. You can serialize it to JSON, log it, and pass it to the validator.
Example plan shape, simplified:
{
"plan_id": "run_2025_10_26_001",
"steps": [
{"id": "retrieve", "type": "service", "args": {"k": 8}},
{"id": "dedupe", "type": "service", "args": {}},
{"id": "reason", "type": "agent", "args": {"mode": "synthesis"}},
{"id": "fact_check", "type": "agent", "args": {"sources_required": true}},
{"id": "write", "type": "agent", "args": {"format": "final"}}
],
"est_cost_tokens": 1200,
"notes": ["RAG with verification and structured output"]
}
On its own, GraphScout is already useful. But discovery without validation is how brittle plans sneak in. That is where the validator enters.
Quick primer on Plan Validator
Plan Validator’s job is to judge and repair. It reads a proposed plan and returns two things: a score and specific suggestions. The score is a float in the range [0.0, 1.0]. The suggestions are structured, so your loop can apply them programmatically.
The validator grades across five dimensions that map to real failure modes:
- Completeness. Are required stages present, like retrieval, synthesis, and verification for RAG? Does the plan define clear inputs, outputs, and dependencies?
- Efficiency. Does the plan meet budget guidance? Are there redundant steps or unnecessary high-cost agents?
- Safety. Are there guardrails before tools that hit the network, execute code, or write data? Are there rate limits and timeouts?
- Coherence. Does the data flow make sense? Are outputs of one step correctly consumed by the next? Are variables named and scoped predictably?
- Fallback. Are there on-ramp and off-ramp strategies? What happens when retrieval is weak, or a tool is unavailable?
A typical output looks like this:
{
"validation_score": 0.88,
"overall_assessment": "Good structure with minor issues around fallback and cost control.",
"dimensions": {
"completeness": {"score": 0.92, "issues": [], "suggestions": []},
"efficiency": {"score": 0.82, "issues": ["High token use in synthesis"], "suggestions": ["Lower max_tokens in reason step to 800"]},
"safety": {"score": 0.95, "issues": [], "suggestions": []},
"coherence": {"score": 0.87, "issues": ["Ambiguous variable names"], "suggestions": ["Rename 'ctx' to 'retrieved_passages'"]},
"fallback": {"score": 0.75, "issues": ["No web failover when vector search is empty"], "suggestions": ["Add a web_search branch when k_hits < 2"]}
},
"blocking": ["fallback"]
}
You establish a threshold to proceed, for example 0.85. If the score is lower, you apply the suggestions to patch the plan and validate again. That makes the validator a gate, not a suggestion box.
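A minimal sketch of that gate in Python, assuming you wrap your GraphScout, Plan Validator, and repair agents behind three callables; the function names and report shape below are illustrative, not OrKa APIs:

from typing import Callable

PASS_THRESHOLD = 0.85
MAX_ROUNDS = 3

def plan_with_gate(
    intent: dict,
    propose: Callable[[dict], dict],       # wraps the GraphScout call
    validate: Callable[[dict], dict],      # wraps the Plan Validator call
    repair: Callable[[dict, dict], dict],  # applies validator suggestions to a plan
) -> dict:
    # Propose once, then grade, patch, and regrade until the plan clears
    # the threshold or the loop budget is exhausted.
    plan = propose(intent)
    for _ in range(MAX_ROUNDS):
        report = validate(plan)
        if report["validation_score"] >= PASS_THRESHOLD:
            return plan
        plan = repair(plan, report)
    raise RuntimeError("plan did not meet threshold within loop budget")

The same structure shows up later as a LoopNode plus gate in the YAML sketch.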
The pairing pattern that works in practice
The canonical loop looks like this:
- GraphScout proposes one or more candidate plans.
- Plan Validator scores a plan and lists fixes.
- A LoopNode applies fixes, revalidates, and repeats until the plan passes the threshold or the loop budget is exhausted.
- The executor runs the validated plan and streams results to the trace log.
You can run this loop on a single model, or bring in a second model as the validator to reduce variance. For production, log every plan, score, and patch so that risk reviews and audits can replay how the plan evolved.
YAML sketch
Below is a compact YAML that captures the flow. Adjust ids and prompts to match your repo.
orchestrator:
  id: orka_orchestrator_095
  strategy: sequential
  queue: redis

agents:
  - id: graph_scout
    type: builder
    prompt: |
      You are GraphScout. Given an intent and a catalog of nodes, propose a minimal safe plan that satisfies the task with cost awareness.
      Return JSON with fields: plan_id, steps[], est_cost_tokens, notes[].
      Intent: {{ input.intent }}
      Constraints: {{ input.constraints }}
      Catalog: {{ input.catalog }}
  - id: plan_validator
    type: classification
    prompt: |
      You are Plan Validator. Grade the plan on completeness, efficiency, safety, coherence, and fallback. Return JSON as specified.
      Plan JSON:
      {{ previous_outputs.graph_scout }}
  - id: loop_repair
    type: builder
    prompt: |
      Apply Plan Validator suggestions to produce an updated plan JSON. Keep the original intent and constraints.
      Original plan:
      {{ previous_outputs.graph_scout }}
      Validator output:
      {{ previous_outputs.plan_validator }}
      Return only the repaired plan JSON.
  - id: executor
    type: builder
    prompt: |
      Execute the validated plan step by step and stream tool calls to the trace. Respect cost and safety limits.

# Pseudo control plane
flow:
  - graph_scout
  - plan_validator
  - loop_node:
      check: "previous_outputs.plan_validator.validation_score < 0.85"
      body: ["loop_repair", "plan_validator"]
      max_rounds: 3
  - gate:
      check: "previous_outputs.plan_validator.validation_score >= 0.85"
      on_fail: "return error: plan did not meet threshold"
  - executor
This is not meant to be pasted as is. It is a blueprint for how to structure the flow so that the validator becomes a real gate and not an advisory step.
What improves when you run the pairing
1. Consistency and explainability
A passing plan is no longer a subjective decision. The score and the list of fixes form a persistent record. If someone asks why a plan ran, you can point to the score, the thresholds, and the patch history. If a plan fails in production, you can inspect whether the validator missed a class of issue and update your rubric or few-shot examples.
2. Better cost control
One of the fastest ways to burn budget is to let a search process expand without any pressure to simplify. The validator adds that pressure. Efficiency is a first-class dimension, which lets you prevent accidental high-cost defaults like long context windows and high token limits. Over time the combination of cost hints in GraphScout and efficiency scoring in the validator converges on cheaper and still safe plans.
3. Safer tool use
Anything that touches the network, runs code, or writes to a datastore is a risk surface. The validator can block plans that attempt those steps without rate limits, timeouts, or a human approval step. This is not a substitute for runtime policy. It is a structural gate that prevents obvious unsafe plans from reaching execution at all.
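As an illustration of such a structural check, here is a sketch that flags risky steps missing basic guardrails. The side_effects, timeout_s, and rate_limit fields are hypothetical annotations, not part of any OrKa plan schema:

# Hypothetical side-effect tags a plan step might carry.
RISKY_EFFECTS = {"network", "code_exec", "data_write"}

def unsafe_steps(plan: dict) -> list[str]:
    # Return the ids of steps that touch a risk surface without a timeout
    # and a rate limit declared in their args.
    flagged = []
    for step in plan.get("steps", []):
        effects = set(step.get("side_effects", []))
        if effects & RISKY_EFFECTS:
            args = step.get("args", {})
            if "timeout_s" not in args or "rate_limit" not in args:
                flagged.append(step["id"])
    return flagged

A non-empty result should push the safety score down or block the plan outright, depending on your bands.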
4. Clear fallbacks
Most production failures come from thin fallbacks. The validator forces you to declare what happens when a critical step has no result or the result has low confidence. A plan that lacks a failover path simply does not pass the gate.
Determinism and variance control
Validation is model-mediated, so variance exists. Here are tactics that push toward stable outcomes.
- Lower temperature. Start at 0.1 for the validator. The goal is consistent grading, not creativity.
- Few-shot anchors. Add two or three canonical plans with scores and justifications. That gives the model a rubric to imitate.
- Secondary validator. If you have the budget, run a second model as a spot checker and block plans when the validators disagree by more than a delta. Start with a delta of 0.1.
- Gold set. Create a small set of hand-graded plans and run a nightly job that compares model grades with human grades. Alert on drift.
- Score bands. Define pass, repair, and block bands, for example pass at 0.88 and above, repair between 0.70 and 0.87, block below 0.70. Make these explicit in config, not tribal knowledge; a sketch follows this list.
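Here is what those bands can look like as explicit config, a sketch using the example thresholds above:

BANDS = {"pass": 0.88, "repair": 0.70}  # block anything below the repair floor

def band_for(score: float) -> str:
    # Map a validation score to an action, so the policy lives in config
    # rather than in someone's head.
    if score >= BANDS["pass"]:
        return "pass"    # execute
    if score >= BANDS["repair"]:
        return "repair"  # run the repair loop
    return "block"       # reject without repair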
Observability and traceability
You need full visibility into what the scout proposed, what the validator rejected, and why. That means capturing the following artifacts per run:
- Intent, constraints, and catalog snapshot.
- GraphScout output plan JSON.
- Plan Validator output JSON, including per dimension scores.
- Loop history with the applied patches.
- Final executed plan and its step results.
- Token, time, and cost metrics per step and per loop.
Emit these into your logging sink with a stable schema. If you have OrKa’s trace viewer, wire the plan and validator outputs into the run timeline. That gives you a replayable story: input, exploration, evaluation, repair, execution.
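One way to keep that schema stable is a single record per run. The sketch below is illustrative; the field names are not an OrKa schema, but they cover the artifacts listed above:

from dataclasses import dataclass, field

@dataclass
class PlanTrace:
    # One run's planning artifacts, emitted to the logging sink as a unit.
    run_id: str
    intent: dict
    constraints: dict
    catalog: list[str]
    scout_plan: dict                                        # GraphScout output
    validations: list[dict] = field(default_factory=list)   # one validator report per round
    patches: list[dict] = field(default_factory=list)       # repairs applied by the loop
    executed_plan: dict | None = None                       # what the executor actually ran
    step_metrics: list[dict] = field(default_factory=list)  # tokens, time, cost per step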
A concrete example: RAG with verification and web failover
Below is a compact example that most teams can run locally with a small model. The goal is to answer a question with citations. The plan must include retrieval, dedupe, synthesis, and a fact check. If the vector search returns fewer than two relevant hits, the plan must branch to a web search and retry synthesis with those results included.
Intent and constraints
{
"intent": "Explain the role of Plan Validator in OrKa and provide two citations.",
"constraints": {
"must_include": ["retrieve", "synthesis", "fact_check"],
"failover": "web_search_when_vector_hits_lt_2",
"max_tokens_total": 1800
},
"catalog": ["retrieve", "dedupe", "synthesis", "fact_check", "web_search", "write"]
}
A typical scout output
{
"steps": [
{"id": "retrieve", "args": {"k": 6}},
{"id": "dedupe"},
{"id": "synthesis", "args": {"max_tokens": 700}},
{"id": "fact_check", "args": {"require_citations": true}},
{"id": "write", "args": {"format": "citations"}}
],
"est_cost_tokens": 1400
}
Validator response and repair
The validator flags the missing web failover and returns a suggestion to add a branch. The loop applies the patch and resubmits. The repaired plan looks like this:
{
"steps": [
{"id": "retrieve", "args": {"k": 8}},
{"id": "dedupe"},
{"id": "branch",
"args": {
"if": "vector_hits < 2",
"then": [{"id": "web_search", "args": {"q_template": "{{ question }} site:reputable_domain"}}],
"else": []
}},
{"id": "synthesis", "args": {"max_tokens": 600}},
{"id": "fact_check", "args": {"require_citations": true, "min_sources": 2}},
{"id": "write", "args": {"format": "citations"}}
],
"est_cost_tokens": 1550
}
Score climbs from 0.78 to 0.91 and the plan passes the gate. The executor runs it. Your trace has the full paper trail.
Local model considerations
For r/LocalLLaMA readers, you can run both the scout and the validator on a small instruction model. Start with 3B to 8B class models, temperature near zero for the validator, and top-p at 0.95 for the scout to maintain some diversity. On CPU-only boxes, keep max tokens low in the validator and prefer compact JSON outputs. Cache static inputs like catalogs to reduce latency. Expect validator calls in the 1 to 4 second range on modern CPUs, and scout calls slightly higher if you allow the search to fan out.
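As a concrete starting point, here is a sketch of those settings. The parameter names follow common OpenAI-style local servers, and the scout temperature and token limits are assumptions to tune for your setup:

# Deterministic grading for the validator, mild diversity for the scout.
VALIDATOR_PARAMS = {"temperature": 0.1, "top_p": 1.0, "max_tokens": 400}
SCOUT_PARAMS = {"temperature": 0.7, "top_p": 0.95, "max_tokens": 800}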
Anti-patterns to avoid
- Validator as rubber stamp. If you always pass at 0.5, nothing meaningful is happening. Choose real thresholds and stick to them.
- One giant step. A plan that stuffs everything into a single agent is not a plan. The validator should block it. If it does not, strengthen the rubric.
- Overfitting to examples. Few-shot examples are anchors, not the full map. If the validator becomes blind to new valid shapes, rotate the examples.
- Ignoring loop budget. Infinite repair loops are a cost sink. Three rounds is a sensible default for most tasks.
- Missing logs. If you do not log plans and scores, you will not be able to explain incidents. Treat plan artifacts as first class logs.
Minimal test plan
- Create three seed plans: one that should pass, one that should repair, one that should block.
- Run the loop ten times on each with a small model and record scores, variance, and latency.
- Assert that the pass seed clears your threshold with at most one repair, the repair seed converges within two repairs, and the block seed never clears within the loop budget (see the sketch after this list).
- Alert if variance on the validator exceeds 0.08 over time on the same seed inputs.
- Store all artifacts for audit.
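A pytest-style sketch of those assertions. gate_rounds, validate_plan, and the three seed plans are placeholders for your own fixtures; gate_rounds is assumed to return whether the loop passed and how many repair rounds it used:

# PASS_SEED, REPAIR_SEED, BLOCK_SEED: the three hand-built seed plans.
# gate_rounds(seed) -> (passed: bool, repair_rounds: int) is a thin wrapper
# around the propose/validate/repair loop described earlier.

def test_pass_seed_clears_with_at_most_one_repair():
    passed, rounds = gate_rounds(PASS_SEED)
    assert passed and rounds <= 1

def test_repair_seed_converges_within_two_repairs():
    passed, rounds = gate_rounds(REPAIR_SEED)
    assert passed and rounds <= 2

def test_block_seed_never_clears():
    passed, _ = gate_rounds(BLOCK_SEED)
    assert not passed

def test_validator_spread_stays_bounded():
    # Re-grade the same seed ten times and check the score spread.
    scores = [validate_plan(PASS_SEED)["validation_score"] for _ in range(10)]
    assert max(scores) - min(scores) <= 0.08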
What to publish when you ship
- One reference doc that shows the validator JSON schema, the five dimensions, score bands, and two loop patterns.
- Two runnable examples: a simple designer loop and a GraphScout paired loop with a failover branch.
- A short note on variance mitigation and default temperatures.
- A micro table reporting token use and latency for small, medium, and large plans across two models.
Final thoughts
Pairing GraphScout with Plan Validator gives you a planning feedback loop that is concrete instead of performative. The scout explores a space of candidate paths. The validator steers that exploration toward a safe and efficient plan with a clear definition of done. When you add a small repair loop, the system becomes self-improving within controlled budgets. None of this removes the need for runtime checks or human oversight where it matters. It does remove a large class of low-quality plans from ever reaching execution. That is a trade worth making.
Next steps
- Wire the example YAML into your project and run it on a local model.
- Tune thresholds until your pass and repair sets behave as expected.
- Add a nightly validation job that uses your gold set and alerts on drift.
- Share traces with your team and ask them to try to break the rubric. When they do, make the rubric stronger.
If you want to discuss edge cases or failure stories, open an issue or send a trace. Real failures make the best tests.