Most multi-agent frameworks optimize for capability. POLARIS optimizes for something harder: making an AI agent's behavior provably auditable before it touches production data.
Published on arXiv in January 2026 (2601.11816), "POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation" describes a three-component architecture that applies type-checking, rubric-guided selection, and compiled policy guardrails to document-centric finance workflows. This article walks through how it works, what the paper's results actually show, and how to apply the core pattern in your own system.
## Why Generic Multi-Agent Setups Fail in Enterprise
When you wire together a general-purpose agent framework — a planner, a set of tool-calling agents, maybe a critic — the result is often capable but opaque. The planner proposes a sequence of actions, the agents execute them, and if something goes wrong midway through a financial document workflow, you get a corrupted ledger entry and no clean audit trail explaining which step produced it.
This is not a hypothetical problem. Back-office finance processes have hard requirements that most agentic frameworks do not address out of the box:
- Every transformation on sensitive data needs to be traceable to a specific step in a specific plan
- Side effects (writes to accounting systems, external API calls) must be blocked or routed unless the plan explicitly authorizes them
- Failures should be handled deterministically, not by having an LLM improvise a recovery strategy
Generic frameworks like LangGraph and AutoGen give you the building blocks, but they leave policy enforcement to you. If your team is busy, policy enforcement is the thing that gets deferred. POLARIS bakes it into the planning layer so deferral is structurally impossible.
## What POLARIS Is: The Three-Component Architecture
POLARIS stands for Policy-Aware LLM Agentic Reasoning for Integrated Systems. The name is dense; the architecture is cleaner than it sounds.
The system has three components that work in sequence:
1. Planner — An LLM that generates multiple structurally diverse candidate DAGs for the task at hand. Each node in a DAG is a typed step (extract, validate, route, write). The planner's output is a set of plans, not a single one.
2. Rubric-guided reasoning module — A second reasoning pass that scores each candidate plan against a compliance rubric and selects one. The rubric encodes things like "every plan must include a validate step before a write step" and "PII-tagged fields cannot flow directly to an external write." This module outputs exactly one plan.
3. Execution guard — The selected plan runs inside a guard that performs validator-gated checks at each step boundary, runs a bounded repair loop if a step fails, and enforces compiled policy guardrails that block or reroute side effects before they occur.
The insight is that separating planning from selection from execution lets you enforce invariants at each handoff. By the time a plan reaches the execution guard, it has already passed structural type-checking and compliance scoring. The guard only needs to enforce runtime invariants — not re-evaluate the entire plan.
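That three-stage handoff can be sketched end-to-end. Everything below is illustrative: the function names and the canned planner are assumptions standing in for the paper's components, with deterministic stubs so the control flow is runnable without an LLM.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    steps: list[str]  # ordered step names, e.g. ["extract", "validate", "write"]


def generate_candidates(task: str) -> list[Plan]:
    # Stage 1 stand-in for the LLM planner: structurally diverse candidates.
    return [Plan(["extract", "write"]), Plan(["extract", "validate", "write"])]


def is_type_valid(plan: Plan) -> bool:
    # Stand-in for the DAG type-check described later in this article.
    return "extract" in plan.steps


def select_plan(candidates: list[Plan]) -> Plan:
    # Stage 2 stand-in rubric: prefer plans that validate before writing.
    def score(p: Plan) -> int:
        return 10 if "validate" in p.steps else 0
    return max(candidates, key=score)


def execute_guarded(plan: Plan) -> list[str]:
    # Stage 3 stand-in guard: would run validators and policy checks per step.
    return [f"ran:{s}" for s in plan.steps]


# Structurally invalid plans are discarded before selection ever happens.
candidates = [p for p in generate_candidates("invoice") if is_type_valid(p)]
chosen = select_plan(candidates)
print(execute_guarded(chosen))  # ['ran:extract', 'ran:validate', 'ran:write']
```

The point of the sketch is the ordering: by the time `execute_guarded` runs, both the type-check and the rubric have already had their veto.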
## Typed DAG Planning Explained
In the POLARIS model, a plan is a directed acyclic graph where each node declares its input tokens and output tokens. A token is just a named data artifact ("raw_fields", "validated", "result"). A step is valid only if all its inputs were produced by a previous step or are in the initial input set.
This is type-checking applied to data flow. The structure looks like this:
```
[input]
   │
   ▼
[EXTRACT: raw_fields]   ← consumes: input
                          produces: raw_fields, pii-safe tag
   │
   ▼
[VALIDATE: validated]   ← consumes: raw_fields
                          produces: validated, audit-log entry
   │
   ▼
[WRITE: result]         ← consumes: validated
                          produces: result, immutable tag
```
Each step also carries `policy_tags` that annotate which compliance properties it satisfies. A step tagged `["audit-log"]` tells the execution guard to record its inputs and outputs to the audit trail before proceeding to the next node. A step tagged `["immutable"]` tells the guard that the write output cannot be modified by any subsequent step in this plan.
The type-check fails at plan generation time, not at runtime. If the planner proposes a step that reads validated before any upstream step has produced it, the plan is discarded before it ever reaches the rubric module. This is the core safety property: structurally invalid plans cannot reach execution.
## Rubric-Guided Plan Selection
The rubric-guided reasoning module receives a set of type-valid candidate plans and selects one. In the paper, the rubric is a policy document — a set of rules describing what a compliant plan for a given task class must include.
The module scores each plan against the rubric and selects the highest-scoring compliant plan. If no candidate plan is compliant, the system surfaces this as an error rather than proceeding with the best available option. This is a deliberate design choice: in enterprise finance, a non-compliant plan executed is worse than no plan executed.
In practice, the rubric can be as simple as a weighted scoring function over step types:
- Plans that include a validate step before any write step score higher
- Plans that tag PII-handling steps with `pii-safe` score higher
- Plans with shorter paths to the same output score higher (parsimony)
The module does not need to understand the document domain. It just needs to know how to score a plan against a structured rubric. This separation means you can update compliance rules by updating the rubric, without retraining or modifying the planner.
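A toy version of such a scoring function, with hypothetical rule weights (the paper does not publish exact rubric values, so these numbers are assumptions):

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    step_type: str                       # "extract" | "validate" | "route" | "write"
    policy_tags: list[str] = field(default_factory=list)


# Hypothetical weights -- chosen for illustration, not taken from the paper.
RUBRIC = {"validate_before_write": 5, "pii_safe_tag": 3, "parsimony": 1}


def rubric_score(plan: list[Step]) -> int:
    score = 0
    step_types = [s.step_type for s in plan]
    # Rule 1: a validate step must precede any write step.
    if "validate" in step_types and "write" in step_types:
        if step_types.index("validate") < step_types.index("write"):
            score += RUBRIC["validate_before_write"]
    # Rule 2: every PII-handling extract step should carry the pii-safe tag.
    if all("pii-safe" in s.policy_tags for s in plan if s.step_type == "extract"):
        score += RUBRIC["pii_safe_tag"]
    # Rule 3: parsimony -- shorter plans to the same output score higher.
    score -= RUBRIC["parsimony"] * len(plan)
    return score


compliant = [Step("extract", ["pii-safe"]), Step("validate"), Step("write")]
sloppy = [Step("extract"), Step("write")]
print(rubric_score(compliant), rubric_score(sloppy))  # 5 -2
```

Note that the scorer inspects only step types and tags, never document content, which is exactly the domain-independence the section describes.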
## Validator-Gated Execution and the Bounded Repair Loop
Once the rubric module selects a plan, it passes to the execution guard. The guard runs each step in DAG order and performs a validator check at each node boundary before passing outputs to the next node.
If a validator check fails — the extracted field is in the wrong format, the validated output is missing a required key — the guard enters a bounded repair loop. It retries the failed step up to a configurable maximum, passing the validator's error message back as context. If the step succeeds within the bound, execution continues. If it exhausts the repair limit, the guard surfaces a structured escalation event rather than improvising.
This bounded repair behavior is the difference between predictable and unpredictable failure modes. An agent system without a repair bound can loop indefinitely or escalate in ways that trigger unintended side effects. The POLARIS guard treats repair as a finite process with a known exit condition.
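A minimal sketch of such a repair loop, assuming a step function that accepts validator feedback as context and a validator that returns an error string or `None` (the shapes are assumptions; the paper specifies the behavior, not this API):

```python
def run_with_bounded_repair(step_fn, validator, max_repairs: int = 3) -> dict:
    """Retry a failing step with validator feedback, up to a hard bound."""
    feedback = None
    for attempt in range(max_repairs + 1):
        output = step_fn(feedback)
        error = validator(output)
        if error is None:
            return {"status": "ok", "output": output, "attempts": attempt + 1}
        feedback = error  # the validator's error becomes context for the retry
    # Bound exhausted: surface a structured escalation, never improvise.
    return {"status": "escalate", "last_error": feedback,
            "attempts": max_repairs + 1}


# Toy step that succeeds only once it has seen the validator's feedback.
def flaky_step(feedback):
    return {"amount": "42.00"} if feedback else {"amount": None}


def amount_validator(output):
    return None if output["amount"] else "missing required key: amount"


print(run_with_bounded_repair(flaky_step, amount_validator))
# {'status': 'ok', 'output': {'amount': '42.00'}, 'attempts': 2}
```

The known exit condition is the whole point: every path out of the loop is either `ok` or a structured `escalate` event, never an open-ended retry.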
## Policy Guardrails: How They Work
The execution guard includes a compiled policy layer that intercepts side effects before they occur. A side effect is any action that changes external state: writing to an accounting system, calling an external API, sending a notification.
Policy guardrails are compiled from the same rubric that guided plan selection. If a step is tagged `["pii-safe"]` but the execution guard detects that the step's output would flow to an external write not authorized in the current plan, the guardrail blocks the write and routes it to a review queue instead. The original plan execution is halted cleanly, and the audit trail records the block event with the full context.
This pre-emptive blocking is the key distinction from post-hoc monitoring. Most enterprise AI governance approaches watch for problems after they happen. POLARIS blocks non-compliant side effects before state changes occur. The audit trail records what was blocked and why, which is exactly what a compliance team needs.
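A sketch of that pre-emptive blocking under assumed data shapes. In POLARIS these checks are compiled from the rubric rather than hand-written, so treat this as the shape of the behavior, not the mechanism:

```python
def guard_side_effect(step: dict, authorized_writes: set[str],
                      review_queue: list, audit: list) -> bool:
    """Intercept a write before it occurs; block and reroute if unauthorized."""
    target = step["write_target"]
    if target not in authorized_writes:
        # Non-compliant side effect: reroute to review, record why, halt.
        review_queue.append(step)
        audit.append({"event": "blocked", "step": step["id"], "target": target,
                      "reason": "write target not authorized by selected plan"})
        return False  # execution halts cleanly; no external state has changed
    audit.append({"event": "allowed", "step": step["id"], "target": target})
    return True


audit, review = [], []
ok = guard_side_effect({"id": "s3", "write_target": "external_api"},
                       authorized_writes={"ledger"},
                       review_queue=review, audit=audit)
print(ok, len(review), audit[0]["event"])  # False 1 blocked
```

The ordering matters: the audit entry and the review-queue routing happen before the external system is ever touched, which is what distinguishes this from post-hoc monitoring.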
## The Minimal Python Reproduction
Effloow Lab reproduced the core POLARIS DAG-planning pattern in Python. The type-check and rubric-selection logic was implemented from the paper description. No public pip package or official GitHub repo exists — implementation is based on arXiv:2601.11816.
```python
from dataclasses import dataclass, field
from enum import Enum


class StepType(Enum):
    EXTRACT = "extract"
    VALIDATE = "validate"
    ROUTE = "route"
    WRITE = "write"


@dataclass
class PlanStep:
    step_id: str
    step_type: StepType
    inputs: list[str]   # must be outputs of prior steps or "input"
    outputs: list[str]
    policy_tags: list[str] = field(default_factory=list)


def type_check_dag(steps: list[PlanStep]) -> bool:
    """Verify no step reads an output that hasn't been produced yet."""
    produced: set[str] = {"input"}
    for step in steps:
        for inp in step.inputs:
            if inp not in produced:
                return False
        produced.update(step.outputs)
    return True


def select_plan(candidate_plans: list[list[PlanStep]], rubric: dict) -> list[PlanStep]:
    """Pick the type-valid plan with the highest rubric score (simplified)."""
    def score(plan: list[PlanStep]) -> int:
        if not type_check_dag(plan):  # structurally invalid plans never win
            return 0
        return sum(rubric.get(step.step_type.value, 0) for step in plan)
    return max(candidate_plans, key=score)


# Example: 3-step finance document plan
steps = [
    PlanStep("s1", StepType.EXTRACT, ["input"], ["raw_fields"], ["pii-safe"]),
    PlanStep("s2", StepType.VALIDATE, ["raw_fields"], ["validated"], ["audit-log"]),
    PlanStep("s3", StepType.WRITE, ["validated"], ["result"], ["immutable"]),
]

print("DAG valid:", type_check_dag(steps))  # DAG valid: True
print("Steps:", [s.step_id for s in steps])
```
Running this produces:

```
DAG valid: True
Steps: ['s1', 's2', 's3']
```
Passing an invalid plan (where `s2` reads from a token no step has produced) returns `DAG valid: False`. The rubric-based `select_plan` function correctly selects the plan that includes a validate step when the rubric weights validation steps more heavily. Both behaviors match the paper's described invariants.
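The rejection case is easy to check directly. This condensed variant drops `step_type` and `policy_tags` to isolate the data-flow check:

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    step_id: str
    inputs: list[str]
    outputs: list[str]


def type_check_dag(steps: list[PlanStep]) -> bool:
    produced = {"input"}
    for step in steps:
        # Every input must have been produced upstream (or be the initial input).
        if any(inp not in produced for inp in step.inputs):
            return False
        produced.update(step.outputs)
    return True


# s2 reads "validated", but no upstream step produces that token -> rejected.
bad_plan = [
    PlanStep("s1", ["input"], ["raw_fields"]),
    PlanStep("s2", ["validated"], ["result"]),
]
print("DAG valid:", type_check_dag(bad_plan))  # DAG valid: False
```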
## Benchmark Results: What the Numbers Mean
The paper reports two performance figures, and it is worth being precise about what each measures.
SROIE dataset, micro-F1: 0.81. SROIE (Scanned Receipts OCR and Information Extraction, ICDAR 2019) is a public benchmark where the task is extracting four fields from scanned receipt images: company, date, address, and total amount. A micro-F1 of 0.81 on this task is competitive with other document extraction systems in the literature. It reflects the quality of the extract and validate steps in a real document pipeline — not the governance layer specifically.
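For readers unfamiliar with the metric: micro-F1 pools true positives, false positives, and false negatives across all four field types before computing a single F1 score. The counts below are illustrative only, not the paper's per-field numbers:

```python
def micro_f1(counts: dict) -> float:
    """Micro-F1 pools TP/FP/FN across all field types before computing F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative per-field counts for the four SROIE fields (not from the paper).
counts = {
    "company": {"tp": 80, "fp": 10, "fn": 12},
    "date":    {"tp": 90, "fp": 5,  "fn": 6},
    "address": {"tp": 70, "fp": 20, "fn": 18},
    "total":   {"tp": 85, "fp": 8,  "fn": 9},
}
print(round(micro_f1(counts), 2))  # 0.88
```

Because pooling happens before the F1 computation, frequent field types dominate the score, which is worth remembering when comparing the 0.81 figure across papers.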
Controlled synthetic suite, precision: 0.95–1.00 for anomaly routing. This figure measures the execution guard's ability to correctly identify and route anomalous steps (steps that would violate policy) in a synthetic test set. The 0.95–1.00 range with preserved audit trails is the stronger governance claim. It shows the policy guardrails are catching what they are designed to catch under controlled conditions.
Neither figure has been independently reproduced by Effloow Lab. Both are from the original paper (arXiv:2601.11816). The SROIE benchmark infrastructure would require OCR tooling and the SROIE dataset itself; reproducing it was outside the scope of this PoC.
## POLARIS vs. LangGraph vs. AutoGen for Enterprise Governance
Where does POLARIS fit relative to the frameworks teams are already using?
| Capability | POLARIS | LangGraph | AutoGen |
|---|---|---|---|
| Typed step definitions | Built-in (DAG nodes with I/O tokens) | Manual (user-defined state schema) | Not native |
| Pre-execution type check | Yes, at plan generation time | No | No |
| Multi-plan generation + selection | Yes, rubric-guided | No (single graph) | Partial (conversation-based) |
| Compiled policy guardrails | Yes, blocks side effects pre-emptively | No (hooks possible but manual) | No |
| Bounded repair loop | Yes, configurable limit | Manual (custom node retry logic) | Partial (conversation retry) |
| Audit trail per step | Yes, enforced by execution guard | Manual (logging hooks) | Manual (logging hooks) |
| Public package available | No (research framework) | Yes (pip install langgraph) | Yes (pip install autogen) |
| Production maturity | Research paper | Production-ready | Production-ready |
The honest framing: LangGraph and AutoGen are production tools you can deploy today. POLARIS is a research framework that describes a design pattern you have to implement. If you need governed, auditable agentic pipelines, POLARIS gives you the architecture to build one. LangGraph gives you a graph execution engine you can wire up with manual governance logic. AutoGen gives you conversational multi-agent primitives with governance left entirely to you.
Teams building enterprise finance automation from scratch would do well to read the POLARIS paper and implement the type-check and rubric-selection pattern on top of whatever execution framework they prefer. The ideas are not tied to any specific library.
## When to Apply POLARIS vs. Simpler Agentic Patterns
POLARIS adds real complexity. You are maintaining a typed schema for every data artifact, a rubric for every task class, and a compiled policy set for every execution context. That overhead is worth it in some situations and not in others.
Use the POLARIS pattern when:
- Your pipeline produces regulated outputs (financial records, compliance reports, PII-handling workflows)
- You need a clean audit trail at the step level, not just the pipeline level
- The cost of a non-compliant side effect is high enough that pre-emptive blocking is worth the overhead
- You are operating in an environment where compliance teams need to inspect plan selection reasoning, not just execution logs
Use a simpler pattern when:
- The task is exploratory or the cost of a wrong action is easily reversible
- You are prototyping and governance can be added later
- Your team does not yet have a formal rubric for what a compliant plan looks like — forcing a rubric prematurely creates false safety guarantees
A useful heuristic: if the word "audit" appears in your product requirements, reach for the POLARIS architecture. If it does not, start simple and layer governance in when you have concrete compliance requirements to encode.
## Frequently Asked Questions
Q: Is there a pip-installable POLARIS package I can use today?
No. As of the publication date of the paper (January 16, 2026) and the date of this article (May 2026), there is no public Python package for POLARIS and no official GitHub repository. The framework is described in the research paper (arXiv:2601.11816). You would need to implement it from the paper. The Python code in this article covers the core planning layer.
Q: What does "bounded repair loop" mean in practice?
When a step fails — for example, the validation step rejects a field as malformed — the execution guard retries that step with the validator's error message as additional context. "Bounded" means there is a hard limit on retries (the paper uses a fixed maximum; the exact number is implementation-defined). Once the bound is exhausted without a successful step, the guard raises a structured escalation event and stops. This prevents infinite retry loops that could leave downstream systems in inconsistent states.
Q: How does the rubric differ from a prompt instruction?
A prompt instruction tells an LLM what to do. A rubric in the POLARIS sense is a structured scoring function applied to candidate plans after they have been generated. The LLM proposes plans; the rubric evaluates them. This separation means the compliance logic lives outside the LLM's context window and cannot be accidentally overridden by a sufficiently persuasive input document.
Q: Can I use this pattern with any LLM, or is it model-specific?
The POLARIS architecture is model-agnostic. The Planner and Rubric-guided reasoning module are LLM calls; the type-check, rubric scoring, and execution guard are deterministic code. You can swap the LLM backbone without changing the governance layer. The paper uses unspecified LLM infrastructure; nothing in the architecture is OpenAI- or Anthropic-specific.
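One way to keep that boundary explicit in code is a small protocol for the planner backend. This seam is a design suggestion, not the paper's API; `PlannerBackend` and `propose_plans` are hypothetical names:

```python
from typing import Protocol


class PlannerBackend(Protocol):
    """Any LLM client that can propose candidate plans fits this interface.

    The governance layer underneath (type-check, rubric scoring, execution
    guard) is deterministic code and never changes when the backend is swapped.
    """
    def propose_plans(self, task: str, n: int) -> list[list[dict]]: ...


def plan_with(backend: PlannerBackend, task: str) -> list[list[dict]]:
    # Governance code depends only on the protocol, not on any vendor SDK.
    return backend.propose_plans(task, n=4)


class CannedBackend:
    # Deterministic stand-in so the sketch runs without any LLM at all.
    def propose_plans(self, task: str, n: int) -> list[list[dict]]:
        return [[{"step": "extract"}, {"step": "validate"}, {"step": "write"}]]


print(len(plan_with(CannedBackend(), "invoice")))  # 1
```

Swapping OpenAI for Anthropic, or either for a local model, then means writing one adapter class; the type-check, rubric, and guard code is untouched.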
Q: What is the SROIE benchmark and why is it used here?
SROIE (Scanned Receipts OCR and Information Extraction) is a public benchmark from the ICDAR 2019 competition. It consists of scanned receipt images with ground-truth annotations for four fields. It is a reasonable proxy for real document extraction difficulty in finance automation because receipts are unstructured, visually inconsistent, and require both OCR and semantic understanding to process correctly. POLARIS uses it to demonstrate that the extraction component works on realistic document inputs, not just clean structured data.
## Verdict: Architecture worth borrowing, not a drop-in library
POLARIS is not something you install and deploy this week. It is a research framework that formalizes a design pattern most enterprise AI teams are reinventing ad-hoc. The typed DAG + rubric-selection + execution guard combination is sound and addresses real gaps in production agentic systems.
The SROIE micro-F1 of 0.81 and the 0.95–1.00 anomaly routing precision on the synthetic suite are credible results for the described tasks. Neither figure warrants uncritical extrapolation to your specific use case.
If you are building agentic pipelines that touch regulated data, read arXiv:2601.11816 and implement the type-check and rubric-selection pattern. The Python reproduction in this article is a working starting point. The execution guard layer — bounded repair loop, compiled policy guardrails, pre-emptive side-effect blocking — is where the real engineering investment goes, and the paper gives you enough detail to design it.
Track: paper-poc | Evidence: Effloow Lab Python reproduction | Paper: arXiv:2601.11816