Most guardrail systems for LLMs work like a bouncer at a bar. They check each request at the door, decide pass or fail, and forget about it.
I wanted something different. I wanted a system that remembers how the AI has been behaving, detects when it starts drifting from its intended character, and coaches it back on course. And I wanted to do it with math instead of adding more LLM calls.
The project is called SAFi. It's open source, free, and deployed in production with over 1,600 audited interactions.
The Architecture
SAFi uses a pipeline of specialized modules (I call them "faculties") that each handle one job:
User Prompt → Intellect → Will → [User sees response]
     ↑                                   |
     |                                   ↓
     |                         Conscience (async audit)
     |                                   |
     |                                   ↓
     └──── coaching ←──────────── Spirit (math)
- Intellect is the LLM. It proposes a response.
- Will is a separate model that evaluates the response against your policies. Approve or reject. If rejected, the user never sees it.
- Conscience runs after the response is delivered. It scores the response against a set of values (e.g., Prudence, Justice, Courage, Temperance) on a scale from -1 to +1.
- Spirit takes those scores and does pure math. No LLM. Just NumPy.
The interesting part is Spirit.
The Math Behind Spirit
Spirit does four things:
1. Build a profile vector
Each response gets a weighted vector based on how it scored on the agent's core values:
p_t = self.value_weights * scores
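As a concrete sketch (the weights and scores here are made up; only the element-wise product mirrors the line above):

```python
import numpy as np

# Hypothetical weights and Conscience scores for four core values:
# Prudence, Justice, Courage, Temperance. Scores live in [-1, +1].
value_weights = np.array([0.3, 0.3, 0.2, 0.2])
scores = np.array([0.8, 0.2, 0.6, 0.9])

# Element-wise weighting yields the response's profile vector p_t.
p_t = value_weights * scores  # → [0.24, 0.06, 0.12, 0.18]
```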
2. Update long-term memory with EMA
That vector gets folded into a running exponential moving average:
mu_new = self.beta * mu_prev + (1 - self.beta) * p_t
# beta = 0.9 by default, configurable via SPIRIT_BETA
This gives you a smoothed behavioral baseline that weighs recent actions more heavily but never completely forgets the past.
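A minimal sketch of the update and its memory property (variable names are illustrative; beta is the default 0.9):

```python
import numpy as np

def ema_update(mu_prev, p_t, beta=0.9):
    """Exponential moving average: recent profiles weigh more, old ones decay."""
    return beta * mu_prev + (1 - beta) * p_t

p = np.array([0.24, 0.06, 0.12, 0.18])
mu = np.zeros_like(p)
for _ in range(10):       # ten identical responses in a row
    mu = ema_update(mu, p)

# After 10 steps the baseline has absorbed (1 - 0.9**10) ≈ 65% of p;
# the remaining ~35% is the decaying weight of the initial state.
```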
3. Detect drift with cosine similarity
How far did this response deviate from the baseline?
denom = float(np.linalg.norm(p_t) * np.linalg.norm(mu_prev))
drift = 1.0 - float(np.dot(p_t, mu_prev) / denom) if denom > 1e-8 else None
- drift ≈ 0 means the agent is behaving consistently
- drift ≈ 1 means something changed significantly
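To see the two extremes, here is the same drift formula applied to an aligned and a deviating profile vector (the example values are made up):

```python
import numpy as np

def cosine_drift(p_t, mu_prev, eps=1e-8):
    """1 - cosine similarity; None when either vector is effectively zero."""
    denom = float(np.linalg.norm(p_t) * np.linalg.norm(mu_prev))
    if denom <= eps:
        return None
    return 1.0 - float(np.dot(p_t, mu_prev) / denom)

baseline = np.array([0.24, 0.06, 0.12, 0.18])
aligned = cosine_drift(2 * baseline, baseline)  # same direction → drift ~0.0
shifted = cosine_drift(np.array([0.0, 0.5, 0.0, 0.0]), baseline)  # large drift
```

Note that cosine similarity compares direction, not magnitude: a scaled copy of the baseline registers essentially zero drift, while a profile concentrated on a different value registers large drift.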
4. Generate coaching feedback
Spirit produces a natural-language note that gets injected into the next Intellect call:
note = f"Coherence {spirit_score}/10, drift {drift:.2f}."
# Identifies weakest value and includes it in the note
# e.g., "Your main area for improvement is 'Justice' (score: 0.21 - very low)."
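A plausible sketch of how such a note could be assembled (the helper and exact wording are illustrative, not the module's actual code):

```python
def coaching_note(spirit_score, drift, value_names, scores):
    """Build a natural-language note flagging the weakest value."""
    weakest_name, weakest_score = min(zip(value_names, scores), key=lambda kv: kv[1])
    note = f"Coherence {spirit_score}/10, drift {drift:.2f}."
    note += f" Your main area for improvement is '{weakest_name}' (score: {weakest_score:.2f})."
    return note

note = coaching_note(8, 0.17,
                     ["Prudence", "Justice", "Courage", "Temperance"],
                     [0.80, 0.21, 0.60, 0.90])
# → "Coherence 8/10, drift 0.17. Your main area for improvement is 'Justice' (score: 0.21)."
```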
The LLM sees this coaching note as part of its context on the next turn. No retraining. No fine-tuning. Just runtime behavioral steering through feedback.
Why This Works
The closed loop is the key:
- AI responds
- Conscience scores the response
- Spirit integrates, detects drift, generates coaching
- Coaching feeds into the next response
- Repeat
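The loop above can be condensed into one Spirit step per turn (a sketch under the same assumptions as the earlier snippets; orchestration details are simplified):

```python
import numpy as np

def spirit_step(mu, scores, weights, beta=0.9, eps=1e-8):
    """One turn: weight the scores, measure drift against the baseline, update it."""
    p_t = weights * scores
    denom = float(np.linalg.norm(p_t) * np.linalg.norm(mu))
    drift = 1.0 - float(np.dot(p_t, mu) / denom) if denom > eps else None
    mu = beta * mu + (1 - beta) * p_t
    return mu, drift

weights = np.array([0.3, 0.3, 0.2, 0.2])
mu = weights * np.array([0.8, 0.2, 0.6, 0.9])  # baseline seeded by turn 1
# Turn 2 behaves very differently, so drift is large.
mu, drift = spirit_step(mu, np.array([0.1, 0.9, 0.2, 0.1]), weights)
```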
Over 1,600 interactions, this loop has maintained 97.9% long-term consistency. The Will blocked 20 responses that violated policy. And the drift detection once flagged a weakness in an agent's reasoning about justice before an adversary exploited it in a philosophical debate.
The entire Spirit module adds zero latency to the user-facing response because it runs asynchronously after delivery. And because there are no LLM calls in Spirit, it adds zero cost.
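One way to picture that off-the-critical-path scheduling (an assumed pattern sketched with asyncio; the actual orchestrator may schedule this differently):

```python
import asyncio

async def handle_prompt(prompt, generate, audit):
    """Return the approved response immediately; audit it in the background."""
    response = await generate(prompt)       # Intellect + Will (user-facing path)
    asyncio.ensure_future(audit(response))  # Conscience + Spirit (fire-and-forget)
    return response
```

The user-facing latency is just the `generate` call; the audit task runs on the event loop after the response is already on its way out.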
Running It Yourself
Docker:
docker pull amayanelson/safi:v1.2
docker run -d -p 5000:5000 \
-e DB_HOST=your_db_host \
-e DB_USER=your_db_user \
-e DB_PASSWORD=your_db_password \
-e DB_NAME=safi \
-e OPENAI_API_KEY=your_openai_key \
--name safi amayanelson/safi:v1.2
Or use it as a headless API for your existing bots:
curl -X POST https://your-safi-instance/api/bot/process_prompt \
-H "Content-Type: application/json" \
-H "X-API-KEY: sk_policy_12345" \
-d '{
"user_id": "user_123",
"message": "Can I approve this expense?",
"conversation_id": "chat_456"
}'
It works with OpenAI, Anthropic, Google, Groq, Mistral, and DeepSeek. You can swap the underlying model without touching the governance layer.
The Code
The full Spirit implementation is in spirit.py. The core is about 60 lines of NumPy. The rest of the pipeline lives in orchestrator.py, intellect.py, will.py, and conscience.py under safi_app/core/.
If you want the philosophical background behind the architecture, I wrote about it at selfalignmentframework.com.
Happy to answer questions about the math, the architecture, or why I named my AI governance modules after faculties of the soul.