Most guardrail systems for LLMs work like a bouncer at a bar. They check each request at the door, decide pass or fail, and forget about it.
I wanted something different. I wanted a system that remembers how the AI has been behaving, detects when it starts drifting from its intended character, and coaches it back on course. And I wanted to do it with math instead of adding more LLM calls.
The project is called SAFi. It's open source, free, and deployed in production with over 1,600 audited interactions.
The Architecture
SAFi uses a pipeline of specialized modules (I call them "faculties") that each handle one job:
User Prompt → Intellect → Will → [User sees response]
     ↑                                   |
     |                                   ↓
     |                         Conscience (async audit)
     |                                   |
     |                                   ↓
     └──── coaching ←──────────── Spirit (math)
- Intellect is the LLM. It proposes a response.
- Will is a separate model that evaluates the response against your policies. Approve or reject. If rejected, the user never sees it.
- Conscience runs after the response is delivered. It scores the response against a set of values (e.g., Prudence, Justice, Courage, Temperance) on a scale from -1 to +1.
- Spirit takes those scores and does pure math. No LLM. Just NumPy.
The interesting part is Spirit.
The Math Behind Spirit
Spirit does four things:
1. Build a profile vector
Each response gets a weighted vector based on how it scored on the agent's core values:
p_t = self.value_weights * scores
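As a concrete sketch (the weights and scores here are made up; only the element-wise product mirrors the line above):

```python
import numpy as np

# Hypothetical weights and Conscience scores for four core values:
# Prudence, Justice, Courage, Temperance. Scores live in [-1, +1].
value_weights = np.array([0.3, 0.3, 0.2, 0.2])
scores = np.array([0.8, 0.2, 0.6, 0.9])

# Element-wise weighting yields the response's profile vector p_t.
p_t = value_weights * scores  # → [0.24, 0.06, 0.12, 0.18]
```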
2. Update long-term memory with EMA
That vector gets folded into a running exponential moving average:
mu_new = self.beta * mu_prev + (1 - self.beta) * p_t
# beta = 0.9 by default, configurable via SPIRIT_BETA
This gives you a smoothed behavioral baseline that weighs recent actions more heavily but never completely forgets the past.
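A minimal sketch of the update and its memory property (variable names are illustrative; beta is the default 0.9):

```python
import numpy as np

def ema_update(mu_prev, p_t, beta=0.9):
    """Exponential moving average: recent profiles weigh more, old ones decay."""
    return beta * mu_prev + (1 - beta) * p_t

p = np.array([0.24, 0.06, 0.12, 0.18])
mu = np.zeros_like(p)
for _ in range(10):       # ten identical responses in a row
    mu = ema_update(mu, p)

# After 10 steps the baseline has absorbed (1 - 0.9**10) ≈ 65% of p;
# the remaining ~35% is the decaying weight of the initial state.
```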
3. Detect drift with cosine similarity
How far did this response deviate from the baseline?
denom = float(np.linalg.norm(p_t) * np.linalg.norm(mu_prev))
drift = 1.0 - float(np.dot(p_t, mu_prev) / denom) if denom > 1e-8 else None
- drift ≈ 0 means the agent is behaving consistently
- drift ≈ 1 means something changed significantly
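To see the two extremes, here is the same drift formula applied to an aligned and a deviating profile vector (the example values are made up):

```python
import numpy as np

def cosine_drift(p_t, mu_prev, eps=1e-8):
    """1 - cosine similarity; None when either vector is effectively zero."""
    denom = float(np.linalg.norm(p_t) * np.linalg.norm(mu_prev))
    if denom <= eps:
        return None
    return 1.0 - float(np.dot(p_t, mu_prev) / denom)

baseline = np.array([0.24, 0.06, 0.12, 0.18])
aligned = cosine_drift(2 * baseline, baseline)  # same direction → drift ~0.0
shifted = cosine_drift(np.array([0.0, 0.5, 0.0, 0.0]), baseline)  # large drift
```

Note that cosine similarity compares direction, not magnitude: a scaled copy of the baseline registers essentially zero drift, while a profile concentrated on a different value registers large drift.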
4. Generate coaching feedback
Spirit produces a natural-language note that gets injected into the next Intellect call:
note = f"Coherence {spirit_score}/10, drift {drift:.2f}."
# Identifies weakest value and includes it in the note
# e.g., "Your main area for improvement is 'Justice' (score: 0.21 - very low)."
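A plausible sketch of how such a note could be assembled (the helper and exact wording are illustrative, not the module's actual code):

```python
def coaching_note(spirit_score, drift, value_names, scores):
    """Build a natural-language note flagging the weakest value."""
    weakest_name, weakest_score = min(zip(value_names, scores), key=lambda kv: kv[1])
    note = f"Coherence {spirit_score}/10, drift {drift:.2f}."
    note += f" Your main area for improvement is '{weakest_name}' (score: {weakest_score:.2f})."
    return note

note = coaching_note(8, 0.17,
                     ["Prudence", "Justice", "Courage", "Temperance"],
                     [0.80, 0.21, 0.60, 0.90])
# → "Coherence 8/10, drift 0.17. Your main area for improvement is 'Justice' (score: 0.21)."
```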
The LLM sees this coaching note as part of its context on the next turn. No retraining. No fine-tuning. Just runtime behavioral steering through feedback.
Why This Works
The closed loop is the key:
- AI responds
- Conscience scores the response
- Spirit integrates, detects drift, generates coaching
- Coaching feeds into the next response
- Repeat
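The loop above can be condensed into one Spirit step per turn (a sketch under the same assumptions as the earlier snippets; orchestration details are simplified):

```python
import numpy as np

def spirit_step(mu, scores, weights, beta=0.9, eps=1e-8):
    """One turn: weight the scores, measure drift against the baseline, update it."""
    p_t = weights * scores
    denom = float(np.linalg.norm(p_t) * np.linalg.norm(mu))
    drift = 1.0 - float(np.dot(p_t, mu) / denom) if denom > eps else None
    mu = beta * mu + (1 - beta) * p_t
    return mu, drift

weights = np.array([0.3, 0.3, 0.2, 0.2])
mu = weights * np.array([0.8, 0.2, 0.6, 0.9])  # baseline seeded by turn 1
# Turn 2 behaves very differently, so drift is large.
mu, drift = spirit_step(mu, np.array([0.1, 0.9, 0.2, 0.1]), weights)
```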
Over 1,600 interactions, this loop has maintained 97.9% long-term consistency. The Will blocked 20 responses that violated policy. And the drift detection once flagged a weakness in an agent's reasoning about justice before an adversary exploited it in a philosophical debate.
The entire Spirit module adds zero latency to the user-facing response because it runs asynchronously after delivery. And because there are no LLM calls in Spirit, it adds zero cost.
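One way to picture that off-the-critical-path scheduling (an assumed pattern sketched with asyncio; the actual orchestrator may schedule this differently):

```python
import asyncio

async def handle_prompt(prompt, generate, audit):
    """Return the approved response immediately; audit it in the background."""
    response = await generate(prompt)       # Intellect + Will (user-facing path)
    asyncio.ensure_future(audit(response))  # Conscience + Spirit (fire-and-forget)
    return response
```

The user-facing latency is just the `generate` call; the audit task runs on the event loop after the response is already on its way out.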
Running It Yourself
Docker:
docker pull amayanelson/safi:v1.2
docker run -d -p 5000:5000 \
-e DB_HOST=your_db_host \
-e DB_USER=your_db_user \
-e DB_PASSWORD=your_db_password \
-e DB_NAME=safi \
-e OPENAI_API_KEY=your_openai_key \
--name safi amayanelson/safi:v1.2
Or use it as a headless API for your existing bots:
curl -X POST https://your-safi-instance/api/bot/process_prompt \
-H "Content-Type: application/json" \
-H "X-API-KEY: sk_policy_12345" \
-d '{
"user_id": "user_123",
"message": "Can I approve this expense?",
"conversation_id": "chat_456"
}'
It works with OpenAI, Anthropic, Google, Groq, Mistral, and DeepSeek. You can swap the underlying model without touching the governance layer.
The Code
The full Spirit implementation is in spirit.py. The core is about 60 lines of NumPy. The rest of the pipeline lives in orchestrator.py, intellect.py, will.py, and conscience.py under safi_app/core/.
If you want the philosophical background behind the architecture, I wrote about it at selfalignmentframework.com.
Happy to answer questions about the math, the architecture, or why I named my AI governance modules after faculties of the soul.