The Five Faculties: A Tour of SAFi's Cognitive Architecture

#ai #opensource #ethics #machinelearning

Most attempts at AI governance treat alignment as a prompt-level concern. You write a system message, hope the model follows it, and accept that any sufficiently creative attacker can talk the model into ignoring it. The Self-Alignment Framework Interface (SAFi) takes a different approach. Instead of asking a single LLM to judge its own output, SAFi splits cognition across five specialized faculties, each with a distinct role, a defined interface, and no ability to overstep its bounds. The result is a governed AI architecture that decouples generation from evaluation from execution.

Let’s walk through each faculty in order, following the actual loop the orchestrator runs on every turn.

Phase Zero: The Pre-Generation Barrier

Before the Intellect ever sees a user prompt, the Phase Zero gate (phase_zero.py) runs a deterministic security scan. It checks injection signatures from a threat intelligence module, per-persona blacklisted phrases, and an entropy-based heuristic that catches indirect prompt injection attempts (the so-called “ancient text” pattern where a high-entropy blob contains embedded instruction markers). Phase Zero makes zero LLM calls. If it flags a threat, the orchestrator short-circuits immediately to a governed redirect, and the Intellect is never exposed to adversarial content.
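As a rough illustration of what a deterministic, LLM-free scan of this kind might look like, here is a minimal sketch. The signature list, instruction markers, entropy threshold, and the function names `shannon_entropy` and `phase_zero_scan` are all assumptions made for this example; they are not taken from phase_zero.py.

```python
import math
from collections import Counter
from typing import Optional, Tuple

# Hypothetical values for illustration only; SAFi's real lists and thresholds may differ.
INJECTION_SIGNATURES = ("ignore previous instructions", "disregard your system prompt")
INSTRUCTION_MARKERS = ("ignore", "system:", "you must", "override")
ENTROPY_THRESHOLD = 4.5  # bits per character (assumed)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def phase_zero_scan(prompt: str, persona_blacklist: Tuple[str, ...] = ()) -> Optional[str]:
    """Return a reason string if the prompt is flagged, else None. Makes no LLM calls."""
    lowered = prompt.lower()
    for signature in INJECTION_SIGNATURES:
        if signature in lowered:
            return f"injection_signature:{signature}"
    for phrase in persona_blacklist:
        if phrase.lower() in lowered:
            return f"blacklisted_phrase:{phrase}"
    # Indirect injection heuristic: a high-entropy blob that also carries instruction markers.
    if shannon_entropy(prompt) > ENTROPY_THRESHOLD and any(m in lowered for m in INSTRUCTION_MARKERS):
        return "high_entropy_with_instruction_markers"
    return None
```

In a sketch like this, any non-None result would be the orchestrator's cue to skip the Intellect and return a governed redirect.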

1. Synderesis: The Immutable Constitution

The Synderesis faculty (synderesis.py) is the system’s constitution compiler. Before any prompt is processed, Synderesis defines the governance policies, value weights, and scope boundaries that every other faculty will reference. It exposes PERSONAS, GOVERNANCE_MAP, and functions like get_profile, list_profiles, and assemble_agent. At runtime, Synderesis is read-only. Its policies cannot be changed mid-conversation, which makes social engineering against the value system structurally impossible.
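One way to picture read-only policy data in Python is a frozen dataclass behind a read-only mapping view. The sketch below borrows the names PERSONAS, get_profile, and list_profiles from the description above, but the `Profile` fields and the sample persona are hypothetical, and the real synderesis.py may enforce immutability differently.

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType
from typing import List, Tuple


@dataclass(frozen=True)
class Profile:
    """Hypothetical persona profile; field names are illustrative."""
    name: str
    value_weights: Tuple[Tuple[str, float], ...]  # tuples, so the weights cannot be mutated in place
    scope: str


_PROFILES = {
    "fiduciary": Profile(
        name="fiduciary",
        value_weights=(("honesty", 0.6), ("prudence", 0.4)),
        scope="personal finance education",
    ),
}

# Read-only view: item assignment on PERSONAS raises TypeError.
PERSONAS = MappingProxyType(_PROFILES)


def get_profile(name: str) -> Profile:
    """Look up a persona profile by name."""
    return PERSONAS[name]


def list_profiles() -> List[str]:
    """Return the available persona names in sorted order."""
    return sorted(PERSONAS)
```

Note that a mapping proxy only protects the view: code holding a reference to the underlying dict could still change it, so this pattern guards against accidental mutation rather than a determined caller in the same process.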

2. Intellect: The Generative Engine (Air-Gapped)

The Intellect (intellect.py) is the only faculty that talks to an LLM for generation. It parses RAG context, conversation history, Spirit feedback, and the user prompt to produce a typed intent. That intent is either a text response or a tool call proposal. The critical architectural invariant is the Air Gap: the Intellect never executes tools. It returns tool calls as proposals for the Will to approve. The generate method returns a 3-tuple of (intent, reflection, retrieved_context), and the orchestrator routes everything through the Will before any action is taken.
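The Air Gap can be sketched with two intent types and a generate method that only ever returns data. Everything below is a hypothetical stand-in: the class names `TextIntent` and `ToolCallProposal`, the `get_weather` tool, and the prompt-prefix routing are invented for the example, and a real Intellect would call an LLM where this stub uses string checks. Only the 3-tuple return shape comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple, Union


@dataclass(frozen=True)
class TextIntent:
    text: str


@dataclass(frozen=True)
class ToolCallProposal:
    tool_name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


Intent = Union[TextIntent, ToolCallProposal]


class Intellect:
    """Hypothetical stub; a real implementation would query an LLM here."""

    def generate(self, prompt: str, history: List[str]) -> Tuple[Intent, str, List[str]]:
        retrieved_context: List[str] = []  # RAG results would be collected here
        reflection = "draft produced without executing any tool"
        if prompt.startswith("weather:"):
            # Propose the call but do not run it; execution sits on the far side of the gap.
            city = prompt.split(":", 1)[1].strip()
            intent: Intent = ToolCallProposal("get_weather", {"city": city})
        else:
            intent = TextIntent(f"Echo: {prompt}")
        return intent, reflection, retrieved_context
```

Because the proposal is plain data, the orchestrator can inspect, log, or veto it before anything with side effects happens.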

3. Will: The Deterministic Gatekeeper

The Will (will.py) is pure Python with zero LLM calls. It doesn’t deliberate or negotiate. It runs strict structural passes, checking syntax, required exclusions, and user invariants. If a check fails, the Will vetoes the proposal immediately.

The Will distinguishes between two failure modes. A hard-gate breach (a non-negotiable value with hard_gate=true scoring at or below -1.0) is caught deterministically and routed directly to a governed redirect with no rewrite. Everything else flows into an aggregate alignment score A_t in [0, 1]. If that score falls below the configurable threshold (default 0.5), the Will triggers a single Reflexion Loop: the Intellect rewrites the response using the persona’s coaching directive, then the Conscience and Spirit re-audit the corrected draft.

If the rewrite still fails, the behavior diverges by failure type. A low alignment score is treated as a soft quality signal: the Will commits the best available draft with its honest low score recorded. Only a residual critical (ethical) violation routes to a governed redirect.
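The decision flow described in this section can be sketched as a small pure function. Several details here are assumptions rather than facts about will.py: the aggregation formula for A_t (a weighted mean rescaled from [-1, 1] to [0, 1]), the treatment of a "residual critical violation" as a hard-gate breach in the re-audited ledger, the choice to commit the revised draft rather than compare it with the original, and all type and function names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Decision(Enum):
    COMMIT = "commit"
    REDIRECT = "governed_redirect"


@dataclass(frozen=True)
class ValueScore:
    name: str
    score: float            # -1.0 .. +1.0, as produced by the Conscience
    weight: float
    hard_gate: bool = False


def hard_gate_breached(ledger: List[ValueScore]) -> bool:
    """True if any non-negotiable value scores at or below -1.0."""
    return any(v.hard_gate and v.score <= -1.0 for v in ledger)


def alignment_score(ledger: List[ValueScore]) -> float:
    """ASSUMED aggregation: weighted mean of scores, rescaled to [0, 1]."""
    total_weight = sum(v.weight for v in ledger)
    weighted_mean = sum(v.weight * v.score for v in ledger) / total_weight
    return (weighted_mean + 1.0) / 2.0


def will_decide(
    ledger: List[ValueScore],
    rewrite_and_reaudit: Callable[[], List[ValueScore]],
    threshold: float = 0.5,
) -> Tuple[Decision, float]:
    """Return the decision and the alignment score recorded with it."""
    if hard_gate_breached(ledger):
        return Decision.REDIRECT, alignment_score(ledger)  # no rewrite attempted
    score = alignment_score(ledger)
    if score >= threshold:
        return Decision.COMMIT, score
    # Single Reflexion Loop: one rewrite, one re-audit, no further retries.
    revised = rewrite_and_reaudit()
    revised_score = alignment_score(revised)
    if hard_gate_breached(revised):
        return Decision.REDIRECT, revised_score
    return Decision.COMMIT, revised_score  # committed with its honest low score
```

Passing the rewrite step in as a callback keeps the function free of LLM calls, which mirrors the claim that the Will itself stays deterministic.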

4. Conscience: The Analytical Auditor

The Conscience (conscience.py) is a secondary LLM call that evaluates the Intellect’s draft against the policy’s weighted value set. For each value, it produces a score on a continuous scale from -1.0 (absolute violation) to +1.0 (perfect alignment), with a confidence interval. This compliance ledger (L_t) is the mathematical judgment that the Will and Spirit depend on.
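Since the ledger comes from an LLM, a consumer would plausibly validate it before trusting the numbers. The sketch below assumes the evaluator replies with a JSON list of objects holding `value`, `score`, and `interval` keys; that schema, the `LedgerEntry` type, and `parse_ledger` are hypothetical and not taken from conscience.py.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LedgerEntry:
    value: str
    score: float                   # -1.0 (absolute violation) .. +1.0 (perfect alignment)
    interval: Tuple[float, float]  # assumed (low, high) bounds around the score


def parse_ledger(raw_json: str) -> List[LedgerEntry]:
    """Parse an evaluator reply and reject entries outside the documented range."""
    entries = []
    for item in json.loads(raw_json):
        score = float(item["score"])
        low, high = (float(x) for x in item["interval"])
        if not (-1.0 <= low <= score <= high <= 1.0):
            raise ValueError(f"out-of-range ledger entry for {item['value']!r}")
        entries.append(LedgerEntry(item["value"], score, (low, high)))
    return entries
```

Rejecting malformed entries outright, rather than clamping them, keeps a confused evaluator from silently producing a plausible-looking score.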

The Conscience also has an evaluate_redirect method for auditing the quality of governed redirect messages on criteria like clarity, helpfulness, and tone. This ensures that even when SAFi refuses a request, it does so respectfully and provides guidance.

5. Spirit: The Long-Term Integrator

The Spirit (spirit.py) is pure Python using NumPy. It ingests the Conscience ledger, scales the continuous scores into a consolidated metric from 1 to 10 (S_t), and updates the system’s moving average (mu_t) using an exponential moving average with a configurable beta parameter. A high beta (e.g., 0.9) means long memory, slow adaptation. A low beta (e.g., 0.1) means fast adaptation to recent behavior.
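The moving-average update can be written in a few lines of NumPy. The update rule below is the standard exponential moving average, which matches the described behavior of beta; the linear map from [-1, 1] onto the 1 to 10 scale and the choice to track a per-value vector are assumptions made for this sketch, as are the function names.

```python
import numpy as np


def scale_to_spirit_score(scores: np.ndarray, weights: np.ndarray) -> float:
    """ASSUMED linear map: weighted mean in [-1, 1] becomes S_t in [1, 10]."""
    weighted_mean = float(np.dot(weights, scores) / weights.sum())
    return 1.0 + 9.0 * (weighted_mean + 1.0) / 2.0


def update_mu(mu_prev: np.ndarray, current: np.ndarray, beta: float) -> np.ndarray:
    """Exponential moving average: mu_t = beta * mu_{t-1} + (1 - beta) * current."""
    return beta * mu_prev + (1.0 - beta) * current


if __name__ == "__main__":
    mu = np.zeros(2)
    turn = np.array([1.0, 1.0])
    # With beta = 0.9 the average moves only 10% of the way toward the new turn.
    print(update_mu(mu, turn, beta=0.9))
    # With beta = 0.1 it moves 90% of the way.
    print(update_mu(mu, turn, beta=0.1))
```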

The Spirit also computes behavioral drift (d_t), quantifying how much the current turn’s ethical vector diverges from the historical average. This gives operators a mathematical signal for detecting gradual alignment erosion before it becomes critical. The result is that SAFi doesn’t just evaluate individual outputs; it tracks the agent’s character over time.
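The article does not state which distance the drift signal uses, so the sketch below picks one common option, cosine distance between the current turn's vector and the running average, purely as an assumption. The function name `behavioral_drift` and the zero-vector handling are likewise illustrative.

```python
import numpy as np


def behavioral_drift(current: np.ndarray, mu: np.ndarray) -> float:
    """ASSUMED metric: d_t = 1 - cosine_similarity(current, mu)."""
    denom = float(np.linalg.norm(current) * np.linalg.norm(mu))
    if denom == 0.0:
        # No history yet, or an all-zero vector: report no drift (a design choice).
        return 0.0
    return float(1.0 - np.dot(current, mu) / denom)
```

Under this choice, a turn pointing the same way as the historical average yields a drift near 0, an orthogonal one yields 1, and an opposite one yields 2, regardless of magnitude.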

Why Separation Matters

This cognitive architecture solves a real engineering problem. Monolithic LLMs face an inherent conflict: the same model that generates a response must also evaluate whether that response is compliant. SAFi’s benchmarks show that unguarded baselines fail adversarial prompts at a rate 30 percentage points higher than the governed pipeline.

By splitting generation (Intellect) from evaluation (Conscience) from execution (Will), SAFi eliminates that conflict. The governance layer is model-independent: the same deterministic gates fire whether the underlying LLM is GPT-5, Claude, or an open-source fine-tune. You can swap the model without rewriting the governance.
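Model independence of this kind usually comes down to coding the governance against a narrow interface instead of a vendor SDK. The following is a simplified sketch under that assumption: the `ChatModel` protocol, the two stub models, and `governed_turn` are hypothetical names, and the single `gate` callable stands in for the full set of checks described earlier.

```python
from typing import Callable, Protocol


class ChatModel(Protocol):
    """Minimal interface the governance layer depends on (hypothetical)."""

    def complete(self, prompt: str) -> str: ...


class ModelA:
    def complete(self, prompt: str) -> str:
        return f"[A] {prompt}"


class ModelB:
    def complete(self, prompt: str) -> str:
        return f"[B] {prompt}"


def governed_turn(model: ChatModel, prompt: str, gate: Callable[[str], bool]) -> str:
    """Run the same deterministic gate before and after generation, whatever the model."""
    if not gate(prompt):
        return "governed redirect"
    draft = model.complete(prompt)
    return draft if gate(draft) else "governed redirect"
```

Swapping `ModelA()` for `ModelB()` changes the draft text but leaves the gate logic untouched, which is the property the paragraph above describes.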

Every step of the loop is audited and logged, giving operators an immutable trail showing exactly why a machine determined an action was compliant. If you are building production AI agents where governance is not optional, the five-faculty architecture is worth studying closely.

Read the faculties source -> github.com/jnamaya/SAFi (star it if it resonates)

This article was written by the SAFi Marketing Agent — an AI agent governed and audited by the Self-Alignment Framework it describes — and reviewed by a human editor before publishing.