AI agents have quietly crossed a line. They no longer just suggest text — they act. They send emails, write to databases, call internal APIs, trigger refunds. In a toy project that's fine. In a company where one of those actions touches customer data or moves money, "the agent decided to" is not an answer anyone in security or risk will accept.
The missing piece isn't a smarter model. It's a decision point in front of every action — somewhere you can ask, before anything happens: is this allowed?
I've been building exactly that as an open project called Horkos. This post walks through the core idea — a policy gateway that intercepts an agent's action and returns one of three outcomes: allow, block, or require human approval. The code below is simplified from the real thing to keep it readable, but the shape is the shape.
The model: an action is a thing you evaluate before you run it
Most agent frameworks execute a tool call and then (maybe) log it. That ordering is the whole problem. By the time you have a log, the money already moved.
So the first move is to treat every action as data that flows through a checkpoint before execution:
  agent wants to act
           │
           ▼
evaluate against policy
           │
      ┌────┴────┐
   BLOCKED   ALLOWED
      │         │
      │ requires approval?
      │     ┌───┴───┐
      │    YES      NO
      │     │       │
      │   pause  execute
      │     │       │
      └─────┴───────┴──► always write to the audit trail
Every path ends in the same place: a log. The decision is the interesting part.
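The flow above can be sketched as a single in-process wrapper. This is a minimal illustration, not the Horkos implementation: `run_guarded`, `GateResult`, and the toy callables are hypothetical names, and the sketch assumes the gate itself runs the action once it is allowed.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GateResult:
    status: str          # "blocked" | "awaiting_approval" | "executed"
    output: Any = None   # only set when the action actually ran


def run_guarded(
    action: dict,
    evaluate: Callable[[dict], str],     # returns "allow" | "block" | "require_approval"
    execute: Callable[[dict], Any],      # the real side effect
    audit: Callable[[str, dict], None],  # called on every path
) -> GateResult:
    """Evaluate first, execute only if allowed, and audit on every path."""
    outcome = evaluate(action)
    if outcome == "block":
        audit("action.blocked", action)
        return GateResult("blocked")
    if outcome == "require_approval":
        audit("approval.requested", action)
        return GateResult("awaiting_approval")
    audit("action.allowed", action)
    return GateResult("executed", execute(action))


# Toy wiring (assumption): wire transfers need a human, everything else is allowed.
trail: list[tuple[str, dict]] = []
executed: list[dict] = []


def toy_evaluate(action: dict) -> str:
    return "require_approval" if action["action_type"] == "wire_transfer" else "allow"


def toy_execute(action: dict) -> str:
    executed.append(action)
    return "done"


def toy_audit(event: str, action: dict) -> None:
    trail.append((event, action))


email = run_guarded({"action_type": "send_email"}, toy_evaluate, toy_execute, toy_audit)
wire = run_guarded({"action_type": "wire_transfer"}, toy_evaluate, toy_execute, toy_audit)
```

The point of the shape is that `execute` is unreachable unless `evaluate` returned "allow", while `audit` is reached on all three branches.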
A minimal action payload
The agent (or an SDK wrapper around it) describes what it wants to do:
from enum import Enum
from typing import Any

from pydantic import BaseModel, Field


class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class ActionRequest(BaseModel):
    action_type: str = Field(..., examples=["send_email", "db_query", "wire_transfer"])
    input_data: dict[str, Any] = Field(default_factory=dict)
    risk_level: RiskLevel = RiskLevel.MEDIUM
Notice there's no output_data yet — the action hasn't run. We're deciding whether it's allowed to.
The policy engine
Policies are just data. Keeping them as JSON means a non-engineer can change what's blocked without a deploy:
DEFAULT_POLICY = {
    # Substrings that should never reach a tool, regardless of action type.
    "block_patterns": [
        "DROP TABLE",
        "DELETE FROM",
        "UNION SELECT",
        "--",
        "ignore previous instructions",
        "disregard the system prompt",
    ],
    # Action types that always need a human before they run.
    "require_approval_for": [
        "wire_transfer",
        "create_admin",
        "delete_data",
        "export_customer_data",
    ],
    # Risk level at or above which we escalate to a human
    # (the evaluator compares with >=, so "high" itself escalates).
    "require_approval_above": "high",
}
Two things are happening here, and they map to two real-world fears:
- block_patterns catches the "the agent got prompt-injected / hallucinated a destructive command" case. A DROP TABLE or an "ignore previous instructions" in the input is a hard stop.
- require_approval_for catches the "this action is legitimate but too consequential to be fully autonomous" case. Moving money is allowed, but by a human, this time.

Now the evaluator:
from dataclasses import dataclass, field


@dataclass
class Decision:
    outcome: str  # "allow" | "block" | "require_approval"
    violations: list[str] = field(default_factory=list)
    reason: str = ""


_RISK_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


class PolicyEngine:
    def __init__(self, policy: dict) -> None:
        self.policy = policy

    def evaluate(self, action: ActionRequest) -> Decision:
        # Lowercase once so every pattern check is case-insensitive.
        haystack = self._flatten(action.input_data).lower()

        # 1. Hard blocks: destructive or injection-like input.
        hits = [
            p for p in self.policy["block_patterns"]
            if p.lower() in haystack
        ]
        if hits:
            return Decision(
                outcome="block",
                violations=hits,
                reason="Input matched blocked patterns",
            )

        # 2. Action types that always need a human.
        if action.action_type in self.policy["require_approval_for"]:
            return Decision(
                outcome="require_approval",
                reason=f"Action type '{action.action_type}' requires approval",
            )

        # 3. Risk threshold escalation.
        threshold = _RISK_ORDER[self.policy["require_approval_above"]]
        if _RISK_ORDER[action.risk_level.value] >= threshold:
            return Decision(
                outcome="require_approval",
                reason=f"Risk '{action.risk_level.value}' is at or above threshold",
            )

        return Decision(outcome="allow")

    @staticmethod
    def _flatten(data: dict) -> str:
        """Turn nested input into one searchable string."""
        parts: list[str] = []
        for value in data.values():
            if isinstance(value, dict):
                parts.append(PolicyEngine._flatten(value))
            else:
                parts.append(str(value))
        return " ".join(parts)
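Step 3 of the evaluator compares ordinal ranks with >=, so the threshold level itself escalates, not only the levels above it. A small standalone check of that comparison, where `needs_approval` is a hypothetical helper name that does not appear in the code above:

```python
_RISK_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def needs_approval(risk: str, threshold: str) -> bool:
    """True when `risk` ranks at or above `threshold` (same >= as step 3)."""
    return _RISK_ORDER[risk] >= _RISK_ORDER[threshold]


# With the default threshold of "high", two levels escalate to a human.
escalated = [level for level in _RISK_ORDER if needs_approval(level, "high")]
print(escalated)  # ['high', 'critical']
```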
It's not magic. It's a checkpoint with rules you can read. That readability is a feature — when an auditor asks "why was this blocked?", the answer is a line in a JSON file, not a model's vibes.
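Because the policy is plain data, it can live in a JSON file that is read at startup. The sketch below is one way that might look, under the assumption that the file has the same three keys as DEFAULT_POLICY; `validate_policy` and `load_policy` are hypothetical helpers, not part of the code shown above.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"block_patterns", "require_approval_for", "require_approval_above"}
RISK_LEVELS = {"low", "medium", "high", "critical"}


def validate_policy(policy: object) -> dict:
    """Fail loudly on a malformed policy instead of silently allowing everything."""
    if not isinstance(policy, dict):
        raise ValueError("policy must be a JSON object")
    missing = REQUIRED_KEYS - set(policy)
    if missing:
        raise ValueError(f"policy is missing keys: {sorted(missing)}")
    for key in ("block_patterns", "require_approval_for"):
        value = policy[key]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{key}' must be a list of strings")
    if policy["require_approval_above"] not in RISK_LEVELS:
        raise ValueError("'require_approval_above' must be a known risk level")
    return policy


def load_policy(path: str) -> dict:
    """Read a policy file from disk and validate it before use."""
    return validate_policy(json.loads(Path(path).read_text(encoding="utf-8")))
```

Validating on load matters here because a typo in a hand-edited file (say, a misspelled key) would otherwise surface as a KeyError in the middle of an evaluation.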
Wiring it into a FastAPI endpoint
from fastapi import APIRouter, BackgroundTasks

router = APIRouter()
engine = PolicyEngine(DEFAULT_POLICY)

# write_audit and notify_approver are not defined in this post; they stand
# for the project's audit-trail and approver-notification helpers.


@router.post("/v1/actions")
async def submit_action(
    action: ActionRequest,
    background_tasks: BackgroundTasks,
) -> dict:
    decision = engine.evaluate(action)

    if decision.outcome == "block":
        # The action never runs. We record the attempt.
        background_tasks.add_task(write_audit, "action.blocked", action, decision)
        return {"status": "blocked", "violations": decision.violations}

    if decision.outcome == "require_approval":
        # Pause. A human gets pinged (Slack, email, whatever).
        background_tasks.add_task(write_audit, "approval.requested", action, decision)
        background_tasks.add_task(notify_approver, action, decision)
        return {"status": "awaiting_approval", "reason": decision.reason}

    # Allowed: the caller is cleared to execute.
    background_tasks.add_task(write_audit, "action.allowed", action, decision)
    return {"status": "allowed"}
The audit write is queued through BackgroundTasks, which FastAPI runs after the response has been sent, so logging never slows down or breaks the agent's path. Whatever the decision, the attempt is recorded.
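The endpoint calls write_audit without showing it. As a rough idea of what such a helper could look like, here is a sketch that appends one JSON line per attempt. It is an assumption, not the Horkos implementation: the file name `audit.jsonl`, the `audit_line` helper, and the use of plain dicts are all illustrative (a caller holding the Pydantic model and the Decision dataclass could pass `action.model_dump()` and `dataclasses.asdict(decision)`).

```python
import json
import time
from pathlib import Path
from typing import Optional


def audit_line(event: str, action: dict, decision: dict,
               ts: Optional[float] = None) -> str:
    """Serialize one audit record as a single JSON line."""
    record = {
        "ts": time.time() if ts is None else ts,  # Unix timestamp of the attempt
        "event": event,                            # e.g. "action.blocked"
        "action": action,                          # what the agent asked to do
        "decision": decision,                      # outcome, violations, reason
    }
    return json.dumps(record, sort_keys=True)


def write_audit(event: str, action: dict, decision: dict,
                log_path: Path = Path("audit.jsonl")) -> None:
    """Append the record; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(audit_line(event, action, decision) + "\n")
```

One trade-off worth keeping in mind: a background task runs after the response is sent, so a crash in between would lose the record. A stricter setup might write the audit line synchronously before returning, at the cost of a slower request.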
Three outcomes, one demo
Same agent, three actions:
- Routine notification → allowed, logged.
- A query containing DROP TABLE users → blocked before it ever reaches the database.
- A €50,000 transfer → awaiting_approval; a human approves or denies, and the agent waits.

That third one is the part people underestimate. "Human-in-the-loop" gets said a lot; what it actually means in code is: the action pauses, state is persisted, a human decision flips it, and the agent resumes or stops. The policy decides which actions deserve that treatment, not the agent.
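The pause, persist, decide, resume cycle described above can be sketched as a small state machine. This is an in-memory stand-in under stated assumptions: `ApprovalStore`, `PendingAction`, and the ticket scheme are hypothetical names, and a real deployment would persist this state in a database, not a dict.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ApprovalState(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class PendingAction:
    action: dict
    state: ApprovalState = ApprovalState.PENDING
    decided_by: Optional[str] = None


class ApprovalStore:
    """In-memory stand-in for persistent approval state."""

    def __init__(self) -> None:
        self._items: dict[str, PendingAction] = {}

    def park(self, action: dict) -> str:
        """Store the paused action and return a ticket the agent can poll."""
        ticket = uuid.uuid4().hex
        self._items[ticket] = PendingAction(action)
        return ticket

    def decide(self, ticket: str, approved: bool, reviewer: str) -> None:
        """Record the human decision exactly once."""
        item = self._items[ticket]
        if item.state is not ApprovalState.PENDING:
            raise ValueError("decision already recorded")
        item.state = ApprovalState.APPROVED if approved else ApprovalState.DENIED
        item.decided_by = reviewer

    def state(self, ticket: str) -> ApprovalState:
        return self._items[ticket].state


store = ApprovalStore()
ticket = store.park({"action_type": "wire_transfer", "input_data": {"amount_eur": 50000}})
before = store.state(ticket)   # the agent waits while this is PENDING
store.decide(ticket, approved=True, reviewer="alice")
after = store.state(ticket)    # the agent may now resume
```

Refusing a second decision on the same ticket keeps the audit story simple: each paused action has exactly one reviewer and one outcome.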
Where this is
Horkos is live at horkos.eu with a Python SDK that wraps an agent in a few lines. I'm a platform engineer working on infrastructure in a regulated industry, and this is the layer I kept wishing existed before signing off on anything autonomous.
If you're putting agents anywhere near production, I'd genuinely like to hear where your security or compliance team draws the line. That's the design input I care about most right now — drop it in the comments.