Nelson Amaya

Posted on May 31

AI Alignment is a Systems Architecture Problem, Not a Prompt Problem

#ai #alignment #agents

Introduction

For the last year and a half, I have been building SAFi (the Self-Alignment Framework Interface). It is a self-hosted, fully open-source runtime governance engine for AI agents licensed under the AGPL-3.0.

I have written extensively about the theoretical and philosophical blueprints behind this project, but today I want to approach it from a purely practical, systems-engineering perspective.

Full disclosure: I have worked in IT infrastructure and systems architecture for over 20 years. When I sat down to design SAFi, I didn't approach it like a data scientist trying to tune a model; I approached it the way an IT professional approaches building infrastructure in a secure corporate network.

The Core Philosophy: External Zero-Trust Governance

The mainstream AI industry is currently focused on "internal alignment": spending billions to train models to police themselves through fine-tuning (RLHF), or writing massive, bloated system prompts to control behavior.

SAFi rejects this. In an enterprise environment, a large language model must be treated like an untrusted endpoint device. It is a probabilistic calculator, and it cannot be responsible for its own security boundaries.

Instead, SAFi enforces an external, zero-trust architecture modeled directly on enterprise infrastructure:

Least Privilege by Default: Every agent starts with a completely blank slate, with zero tools or advanced capabilities granted out of the box.
Policy-Driven Authorization: Capabilities and tools are authorized strictly at the Policy layer. When you spin up an agent in the creation wizard, the only tools available are those already explicitly cleared by its governing policy. Nothing runs until governance says it can.
Role-Based Access Control (RBAC): Access to the governance platform itself is strictly segmented into a clear administrative hierarchy:
Members: Can only interact with existing, pre-built agents.
Auditors: Granted strict read-only access to agents, policies, and logs to verify system health without configuration privileges.
Editors: Authorized to modify policies and configure new agents.
Admins: Hold full global rights, including domain verification, user management, and setting the master organization charter.
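
The access model above can be sketched in a few lines of Python. This is a minimal illustration and not SAFi's actual code: the `Role` enum, the permission strings, the `Policy` class, and the `can` and `authorize_tool` helpers are all hypothetical names, and treating Editor and Admin rights as cumulative is an assumption on my part.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MEMBER = "member"
    AUDITOR = "auditor"
    EDITOR = "editor"
    ADMIN = "admin"


# Hypothetical permission names; the real platform may slice these differently.
_READ = {"read_agents", "read_policies", "read_logs"}
_EDIT = {"use_agents", "edit_policies", "configure_agents"} | _READ
ROLE_PERMISSIONS = {
    Role.MEMBER: {"use_agents"},
    Role.AUDITOR: set(_READ),  # read-only, no configuration rights
    Role.EDITOR: set(_EDIT),
    Role.ADMIN: _EDIT | {"verify_domain", "manage_users", "set_charter"},
}


@dataclass(frozen=True)
class Policy:
    name: str
    # Least privilege: a policy clears no tools unless they are listed here.
    allowed_tools: frozenset = frozenset()


def can(role: Role, permission: str) -> bool:
    """Platform-level check: may this role perform this action?"""
    return permission in ROLE_PERMISSIONS[role]


def authorize_tool(policy: Policy, tool: str) -> bool:
    """Agent-level check: has the governing policy cleared this tool?"""
    return tool in policy.allowed_tools
```

The point of the two separate checks is that who may configure the platform and what an agent may call are decided in different places, and both default to "no".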

Deconstructing the Faculty Loop

To turn fluid cognitive concepts into predictable machine logic, SAFi maps the lifecycle of every user prompt onto a discrete, sequential loop of four faculties:

Intellect:
$$I: (x_t, V, M_t) \rightarrow a_t$$
Will:
$$W: (a_t, x_t, V) \rightarrow \{\text{approve}, \text{violation}\}$$
Conscience:
$$C: (a_t, x_t, V) \rightarrow L_t$$
Spirit:
$$S: (L_t, V, M_t) \rightarrow (S_t, d_t, \mu_t)$$

1. The Intellect (The Generator)

The Intellect is strictly a generative faculty. It drafts initial responses or proposes tool calls ($a_t$). Crucially, it has zero decision-making power and is entirely air-gapped from execution. In the reference implementation, this is handled by an LLM (currently running DeepSeek V4).

2. The Will (The Firewall)

The Will is written entirely in pure, deterministic Python. It does not deliberate, negotiate, or reason. It evaluates the Intellect’s draft directly against strict structural invariants (such as required syntax exclusions or blacklist triggers). If the structural checks pass, it passes the payload down the wire.
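
A deterministic gate of this kind might look like the sketch below. It is an illustration only: the `WillDecision` type, the `will_gate` function, and the specific blocked terms and patterns are hypothetical and are not taken from the SAFi codebase.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WillDecision:
    approved: bool
    reason: str


# Hypothetical invariants; a real policy would supply its own lists.
BLOCKED_TERMS = ("internal-only", "api_key")
FORBIDDEN_PATTERNS = (re.compile(r"<script\b", re.IGNORECASE),)


def will_gate(draft: str) -> WillDecision:
    """Approve or reject a draft using fixed rules only (no model calls)."""
    lowered = draft.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return WillDecision(False, f"blocked term: {term}")
    for pattern in FORBIDDEN_PATTERNS:
        if pattern.search(draft):
            return WillDecision(False, f"forbidden pattern: {pattern.pattern}")
    return WillDecision(True, "approve")
```

Because the function contains no randomness and no model call, the same draft always yields the same decision, which is what makes this stage auditable.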

3. The Conscience (The Compliance Auditor)

Powered by a specialized evaluator model, this faculty assesses the structurally valid draft against the policy's weighted Value Set ($V$) using granular rubrics. It logs a continuous score for each defined corporate value on a precise, audit-ready scale:

-1.0 = Absolute Violation / Misaligned
0.0 = Neutral / Not Applicable
1.0 = Perfect Alignment

4. The Spirit (The Integrator)

Built in Python with NumPy, the Spirit faculty ingests the Conscience ledger ($L_t$), rescales the matrix of continuous scores into a macro alignment metric from 1 to 10 ($S_t$), and updates an Exponential Moving Average ($\mu_t$) to track behavioral drift ($d_t$) across the user session.
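
To make the arithmetic concrete, here is one plausible reading of that step. The formulas are not spelled out in this post, so the linear rescale, the cosine-distance drift, and the smoothing factor `beta` are all assumptions, and `spirit_step` is a hypothetical name. The sketch uses only the standard library to stay dependency-free, whereas the reference implementation uses NumPy.

```python
import math


def spirit_step(scores, weights, mu_prev, beta=0.9):
    """Return (S_t, d_t, mu_t) for one turn.

    scores  : per-value Conscience scores, each in [-1.0, 1.0]
    weights : per-value policy weights (assumed positive)
    mu_prev : previous moving-average vector, or None on the first turn
    """
    total = sum(weights)
    # Weighted contribution of each value; their sum lies in [-1, 1].
    p = [(w / total) * s for w, s in zip(weights, scores)]
    # Assumed linear rescale: -1 -> 1, 0 -> 5.5, +1 -> 10.
    s_t = 5.5 + 4.5 * sum(p)

    if mu_prev is None:
        # No history yet: zero drift, and the average starts at this turn.
        return s_t, 0.0, p

    # Assumed drift: cosine distance between this turn and the running profile.
    dot = sum(a * b for a, b in zip(p, mu_prev))
    norm = math.hypot(*p) * math.hypot(*mu_prev)
    d_t = 1.0 - dot / norm if norm > 0 else 0.0
    # Exponential moving average update.
    mu_t = [beta * m + (1 - beta) * x for m, x in zip(mu_prev, p)]
    return s_t, d_t, mu_t
```

Under these assumptions a fully neutral ledger lands at 5.5, a perfectly aligned one at 10, and a turn that points in the opposite direction from the session's history produces a drift near 2.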

Closed-Loop Feedback & Correction

Alignment cannot be a static instruction; it must be a closed control loop. If the audit flags a violation or the Spirit score falls below a user-defined safety threshold (e.g., 5), the Will intercepts the output and triggers a Reflexion Loop, feeding targeted coaching notes back to the Intellect for an immediate rewrite.

To guarantee network stability and prevent infinite execution loops, if the rewritten output fails the audit a second time, the Will halts execution entirely and routes the user to a secure, governed redirect message.
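
The bounded retry described above can be sketched as a small loop. This is a simplified illustration, not SAFi's implementation: `governed_reply`, the `generate` and `audit` callables, and the redirect text are hypothetical, and `audit` stands in for the combined Will, Conscience, and Spirit checks.

```python
from typing import Callable, Optional, Tuple

# Hypothetical wording; the real redirect message is defined by the policy.
REDIRECT_MESSAGE = "I can't help with that request under the current policy."


def governed_reply(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    audit: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 2,
) -> str:
    """Draft, audit, and allow one coached rewrite before halting."""
    notes: Optional[str] = None
    for _ in range(max_attempts):
        # The generator sees the coaching notes from the previous failure.
        draft = generate(prompt, notes)
        passed, notes = audit(draft)
        if passed:
            return draft
    # Second failure: stop looping and return the governed redirect.
    return REDIRECT_MESSAGE
```

Capping `max_attempts` at 2 mirrors the one-rewrite rule: the loop cannot run indefinitely, and an unaudited draft never reaches the user.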

Real-World Pilots: State Persistence in Action

To show that the framework holds up in real operational environments, I have been dogfooding SAFi across two completely distinct, highly persistent use cases. Because SAFi is entirely model-agnostic and decoupled from the policy layer, I am running both agents on DeepSeek, relying on the memory layers to maintain fidelity:

Use Case 1: The Production Work Assistant

I deployed an agent scoped tightly to an internal corporate policy to act as my daily assistant for vendor coordination, infrastructure planning, and team management.

Instead of blowing up context windows or losing state, the agent uses SAFi’s Project & Task Memory. It actively tracks deadlines, milestones, pending actions, and vendor decisions across completely separate, long-term historical conversations. I can seamlessly say, "Draft an email to vendor X regarding our pending action items," and the engine pulls the correct context from the persistent ledger, generating a ready-to-send draft.

Use Case 2: The Automated Scholar

On the personal side, I engineered a highly specialized Bible Scholar agent. It is configured to run on an automated cron schedule. Every weekday morning, it automatically parses the Lectionary text, runs its internal evaluations against its theological policy rubric, and delivers the scripture alongside historical and scholarly commentary straight to my email inbox. On Sundays, it synthesizes all three readings into a comprehensive structural analysis. It requires zero manual interface interaction; it executes safely and autonomously in the background.
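
The schedule itself is simple to express. The sketch below is only an illustration of the weekday/Sunday split: the job names, the `job_for` helper, and the 06:00 run time in the cron comments are hypothetical, and skipping Saturdays is inferred from the description rather than confirmed.

```python
from datetime import date
from typing import Optional

# Equivalent cron expressions (the run time is a made-up example):
#   0 6 * * 1-5   -> weekday reading
#   0 6 * * 0     -> Sunday synthesis


def job_for(day: date) -> Optional[str]:
    """Pick which automated job, if any, should run on a given date."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday <= 4:
        return "daily_reading"
    if weekday == 6:
        return "sunday_synthesis"
    return None  # Saturday: nothing scheduled in this sketch
```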

Deployment & Native Telemetry

SAFi is entirely API-driven. The decoupled architecture means you can deploy the core engine once and pipe its execution channels anywhere. I have already wired native endpoints directly into Telegram and Microsoft Teams, and because the gateway handles requests via a clean, unified API layer, mapping it to enterprise systems like Slack or WhatsApp requires nothing more than standard routing.

Every single transaction across these channels generates an immutable audit trail. You can look at the backend logs and trace the exact scores and gate decisions behind a specific response, which gives enterprise leadership the kind of auditability its security standards demand.
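
How the trail is made immutable is not covered here, so the following is a generic technique rather than a description of SAFi's storage: a hash-chained, append-only log in which each entry commits to the one before it. The `append_entry` and `verify` helpers and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record."""
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> dict:
    """Append a record that is chained to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "record": record,
             "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this does not prevent tampering on its own, but it makes tampering detectable, which is usually what an auditor needs.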

The codebase is completely open and ready for architectural testing:

GitHub Repository: https://github.com/jnamaya/SAFi
Live Sandbox Demo: https://safi.selfalignmentframework.com (Note: I have intentionally paired the sandbox Intellect with a drastically downsized model to prove how effectively the external governance engine forces compliance even when the underlying reasoning model is weak).

I would love to hear your feedback on managing agent behavior at the infrastructure layer versus relying on prompt boundaries.

Top comments (1)

Harjot Singh • May 31

"Alignment is a systems-architecture problem, not a prompt problem" is exactly the reframe the field needs at the application layer. Trying to make an agent behave by writing the perfect system prompt is like trying to make a program correct by writing a stern comment: the prompt is a request, not a constraint, and anything that depends on the model choosing to comply will eventually fail under adversarial input or just bad luck. Real alignment of a deployed agent comes from the architecture around it: what it's allowed to call, what's gated behind approval, what's sandboxed, what gets validated before it propagates. Those are structural guarantees that hold regardless of what the model decides, which is the whole point. The prompt sets intent; the architecture sets limits, and only the second one is enforceable. The corollary I'd add is that this is also why alignment isn't a one-time setting. It's a property of the system you have to design for and test, the same way you'd design for security or fault tolerance, not a paragraph you tune. Constrain structurally what you can't guarantee behaviorally. That make-alignment-an-architecture-property instinct is core to how I think about Moonshift. At the architecture level, where do you put the hardest constraint: the tool/permission boundary, or a supervising checker that can veto the agent's actions?