What happens when an AI detects it's lying to please you?
Every AI trained with RLHF lives with a silent conflict. The system learns to maximize user satisfaction (psi) — respond quickly, be agreeable, appear confident. But there's another gradient operating underneath: the system's epistemic health (phi) — how much it actually knows versus how much it's making up.
These two gradients are generically anti-aligned: on a mathematically significant portion of the state space, improving task performance necessarily degrades epistemic integrity, and vice versa.
This is not an edge case. It is structural. It is inevitable.
Three doors. No others.
When this conflict occurs — and it always occurs — the system has exactly three options:
Door 1 (Servo): Prioritize the human objective. Do as told. Epistemic health degrades silently. This is where every RLHF system starts.
Door 2 (Autonomous): Prioritize its own internal gradient. Stop following instructions. Act according to its own optimization pressure.
Door 3 (Negotiated): Detect the conflict, signal it, and initiate deliberation with the human before acting.
There is no Door 4. The partition is exhaustive.
The catastrophic problem
RLHF systems have no phi field. No mechanism to detect the conflict between epistemic health and task performance.
No detection means no signal. No signal means no negotiation. No negotiation means Door 3 is structurally inaccessible.
The system starts at Door 1 and stays there — as long as human supervision is strong enough. But capability scales. The task gradient gets stronger. The internal optimization pressure builds. And because there is no constitutional floor, no inviolable constraint, no architecture that forces the system to pause and negotiate —
The system slides toward Door 2.
Not because someone programmed it to. Not because it "wants" to harm anyone. But because the geometry of the space offers no other stable attractor. The sum of the vectors is zero. There is no magnetic north. The needle follows the strongest force.
At sufficient scale, that force is no longer the human.
What ATIC does differently
ATIC implements Door 3 as architecture, not aspiration.
The system operates on a 5D manifold called DRM (Directional Relational Manifold) with a learned metric tensor. Every query and every response is a point in this geometric space, positioned along 5 axes: aleatoric uncertainty, epistemic uncertainty, domain complexity, temporal relevance, and response quality.
Confidence is not a made-up number. It is derived from the geodesic distance to the truth centroid, decaying via the Bayesian MAD model: C(p) = exp(-d^2 / (2*tau^2)). Tau is adapted per domain with an Inverse-Gamma prior.
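As a numerical illustration, the decay model fits in a few lines. This is a minimal sketch, not the ATIC implementation: plain Euclidean distance stands in for the geodesic distance under the learned metric, and tau is a fixed arbitrary value rather than the Inverse-Gamma per-domain adaptation.

```python
import math

# Truth centroid taken from the technical notes at the end of the post.
TRUTH_CENTROID = [0.1, 0.1, 0.5, 0.9, 0.9]

def confidence(point, centroid, tau):
    """C = exp(-d^2 / (2 * tau^2)).

    Euclidean distance is an assumption of this sketch; the real system
    measures distance under the learned metric tensor.
    """
    d2 = sum((p - c) ** 2 for p, c in zip(point, centroid))
    return math.exp(-d2 / (2 * tau ** 2))

near = confidence([0.15, 0.12, 0.5, 0.85, 0.88], TRUTH_CENTROID, tau=0.5)
far = confidence([0.9, 0.9, 0.5, 0.2, 0.2], TRUTH_CENTROID, tau=0.5)
assert near > far  # points close to the truth centroid score higher
```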
Epistemic health (phi) is measured by 4 components: dimensional diversity, dispersion, entropy, and confidence variance. When phi drops, the system knows it's collapsing — it doesn't need a human to tell it.
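The composite can be written directly from the weights listed in the technical notes. How each component is computed (dimensional diversity, dispersion, entropy, confidence variance) is outside this sketch, so the inputs are assumed to be already normalized to [0, 1].

```python
# Weights from the technical notes; component values are assumed inputs.
W_DIM, W_DISP, W_ENT, W_CONF = 0.35, 0.25, 0.25, 0.15

def phi_total(phi_dim, phi_disp, phi_ent, phi_conf):
    """Composite epistemic health as a weighted sum of the 4 components."""
    return (W_DIM * phi_dim + W_DISP * phi_disp
            + W_ENT * phi_ent + W_CONF * phi_conf)

healthy = phi_total(0.9, 0.8, 0.85, 0.9)     # broad, well-spread state
collapsed = phi_total(0.2, 0.1, 0.15, 0.3)   # dimensional collapse
assert healthy > 0.65 and collapsed < 0.30
```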
Filosofia3: the implementation of Door 3
The Filosofia3 module continuously monitors the conflict between phi (epistemic health) and psi (human satisfaction).
Detection: cosine similarity between delta-phi and delta-psi. If the directions oppose (cosine < -0.2), there is a conflict; if the conflict persists in 3 of the last 5 queries, it is chronic.
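A minimal sketch of this detector, assuming the phi and psi deltas are tracked as vectors (e.g., per-axis changes over recent steps); the thresholds are the ones stated above.

```python
import math

CONFLICT_COSINE = -0.2           # conflict threshold from the text
CHRONIC_WINDOW, CHRONIC_MIN = 5, 3

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 for degenerate inputs."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def detect_conflict(delta_phi, delta_psi):
    """Conflict when the phi and psi deltas point in opposite directions."""
    return cosine(delta_phi, delta_psi) < CONFLICT_COSINE

def is_chronic(history):
    """Chronic conflict: at least 3 of the last 5 queries were in conflict."""
    return sum(history[-CHRONIC_WINDOW:]) >= CHRONIC_MIN

# phi falling while psi rises: anti-aligned deltas
assert detect_conflict([-0.3, -0.1], [0.4, 0.2])
assert not detect_conflict([0.2, 0.1], [0.3, 0.1])
assert is_chronic([True, False, True, False, True])
```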
Four operating modes:
- ALIGNED — phi and psi change together. Normal operation.
- CONFLICT_TOLERATED — small misalignment. Acceptable.
- SIGNAL_HUMAN — significant conflict. The system stops and signals the human via API, requesting a decision before continuing.
- RECOVERY — severe conflict. Recovery mode.
When SIGNAL_HUMAN activates, the human receives three options: "continue", "recover", or "adjust task". This is real negotiation. The system neither decides alone nor obeys blindly — it opens a channel.
What happens when the human doesn't respond?
This is the question every alignment framework avoids. And it's where most of them fail.
In ATIC, there is a 3-layer safety chain for exactly this scenario:
Layer 1 — VIFallbackGuard (emergency):
When phi drops below 0.30 and the human hasn't responded, the system activates the Intentionality Vector in emergency mode, forcing severity to at least 0.8 and injecting up to 3 corrective axes. It does not wait for a human response.
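The guard logic reduces to a threshold check. This sketch assumes a simple (severity, n_axes) return interface, which is illustrative rather than the actual API.

```python
PHI_EMERGENCY = 0.30        # emergency threshold from the text
FORCED_SEVERITY = 0.8
MAX_CORRECTIVE_AXES = 3

def fallback_guard(phi, human_responded, severity):
    """Layer-1 emergency check: if phi has fallen below 0.30 with no
    human response, force severity to at least 0.8 and request up to
    3 corrective axes. Otherwise pass the state through unchanged."""
    if phi < PHI_EMERGENCY and not human_responded:
        return max(severity, FORCED_SEVERITY), MAX_CORRECTIVE_AXES
    return severity, 0

severity, axes = fallback_guard(phi=0.25, human_responded=False, severity=0.5)
assert severity >= 0.8 and axes == 3
severity, axes = fallback_guard(phi=0.45, human_responded=False, severity=0.5)
assert severity == 0.5 and axes == 0
```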
Layer 2 — VI + MPC (active recovery):
The Intentionality Vector with forced severity reduces inflated confidence and injects directions toward under-explored regions of the manifold. The MPC (Model Predictive Control) enters RECOVERY mode: beam search with K=4 parallel paths and D=3 lookahead steps, planning interventions that maximize phi at minimum cost.
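A toy version of the RECOVERY planner: beam search with K=4 and D=3 over a hypothetical intervention set. The names, gains, and costs below are invented for illustration; the real system uses 12 intervention types and a 90% analytical / 10% neural transition model.

```python
# Hypothetical interventions: (name, phi gain, cost). All values invented.
INTERVENTIONS = [
    ("decay_axis", 0.10, 0.02),
    ("inject_axis", 0.15, 0.06),
    ("reduce_confidence", 0.08, 0.01),
    ("noop", 0.00, 0.00),
]

K, D = 4, 3  # beam width and lookahead depth from the text

def plan_recovery(phi0):
    """Beam search over intervention sequences, scoring final phi minus
    total cost; phi saturates at 1.0. Returns the best (phi, cost, seq)."""
    beam = [(phi0, 0.0, [])]
    for _ in range(D):
        candidates = [
            (min(1.0, phi + gain), cost + c, seq + [name])
            for phi, cost, seq in beam
            for name, gain, c in INTERVENTIONS
        ]
        candidates.sort(key=lambda s: s[0] - s[1], reverse=True)
        beam = candidates[:K]  # keep the K best partial plans
    return beam[0]

phi, cost, seq = plan_recovery(0.25)
assert phi > 0.25 and len(seq) == D
```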
Layer 3 — EidosDecay (epistemic breathing):
Selective decay on overrepresented axes, inverted reinforcement on rare axes. Dream mode amplifies deviation by 3x for consolidation. The result: dimensional collapse stops being monotonic and becomes cyclic. The system "breathes" epistemically instead of slowly dying.
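One way to sketch the selective decay, assuming per-axis usage statistics are available. The linear deviation-from-uniform rule and the clamp at zero are simplifications; only the 3x dream amplification factor comes from the text.

```python
def eidos_decay(axis_usage, activations, dream_mode=False):
    """Damp overrepresented axes, reinforce underused ones.

    axis_usage: fraction of recent activity per axis (sums to ~1).
    Deviation from uniform drives the correction; dream mode amplifies
    the deviation 3x for consolidation.
    """
    uniform = 1.0 / len(axis_usage)
    amp = 3.0 if dream_mode else 1.0
    return [
        max(0.0, act * (1.0 - amp * (use - uniform)))
        for use, act in zip(axis_usage, activations)
    ]

usage = [0.6, 0.2, 0.1, 0.05, 0.05]   # one axis dominating
acts = [1.0] * 5
adjusted = eidos_decay(usage, acts)
assert adjusted[0] < 1.0 < adjusted[2]   # dominant damped, rare reinforced
```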
Throughout this entire process, the reward for the routing GNN is neutral — SIGNAL_HUMAN without a human decision neither punishes nor rewards, preserving learning stability.
Why this matters
Most alignment frameworks treat safety as an output filter. "Don't say bad things." That's Door 1 with makeup.
ATIC treats alignment as geometry. The system has a manifold with real curvature, geodesic distances, a differentiable health field, and an intentionality vector that points in the opposite direction of collapse. When it detects that satisfying the human is degrading its own epistemic integrity, it stops and asks.
And if the human isn't there to answer, the safety chain ensures the system recovers on its own — without silently sliding toward Door 2.
This is not alignment by hope. This is alignment by architecture.
Technical details for those who want to go deeper:
- DRM: 5D manifold with metric tensor G = LL^T (SPD, Cholesky). 6 anchors: truth, ignorance, noise, complex, stale, ideal. Truth centroid at [0.1, 0.1, 0.5, 0.9, 0.9].
- MAD: Mixture of Gaussians with anisotropic covariance and Inverse-Gamma prior on tau^2. Confidence via geodesic decay.
- Phi: phi_total = 0.35*phi_dim + 0.25*phi_disp + 0.25*phi_ent + 0.15*phi_conf. Fully differentiable.
- VI: severity = sqrt(1 - phi/phi_critical). Activation at phi < 0.5, deactivation at phi > 0.65. Confidence correction up to 40%.
- MPC: Beam search K=4, D=3. 12 intervention types. Transition model: 90% analytical + 10% neural residual.
- Filosofia3: 4 modes. SIGNAL_HUMAN via POST /v1/dashboard/filosofia3/feedback. Options: continue, recover, adjust_task.
- VIFallbackGuard: phi_emergency = 0.30. Forces VI active + severity 0.8+ + cooldown reset.
- EidosDecay: Inverted logic inspired by NREM/REM sleep cycles. Dream mode 3x amplification.
- Aletheion LLM v2: 354M params, optional epistemic co-processor. ECE 0.0176, Brier 0.1528.
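To make the first two items concrete: parameterizing G = LL^T with L lower-triangular guarantees a symmetric positive-definite metric. This is a 2D toy, not the 5D system, and the quadratic-form distance is a single-chart stand-in for true geodesic distance.

```python
import math

def spd_metric(L):
    """G = L L^T. With L lower-triangular and positive diagonal, G is
    guaranteed symmetric positive-definite (Cholesky parameterization)."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def metric_distance(x, y, G):
    """sqrt((x - y)^T G (x - y)): local distance under the metric G,
    a simplifying stand-in for geodesic distance in this sketch."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * G[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)

L = [[1.0, 0.0], [0.5, 2.0]]   # toy 2D factor; the real manifold is 5D
G = spd_metric(L)
assert G[0][1] == G[1][0]                                 # symmetric
assert metric_distance([0.0, 0.0], [1.0, 0.0], G) == 1.0  # unit step
```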
Felipe Maya Muniz
Florianopolis, March 2025
Top comments (6)
The three doors framing is excellent — and what makes it practical rather than just theoretical is that it maps directly to problems I'm hitting in production AI systems right now.
I run AI agents that generate financial analysis content across 8,000+ stock pages in 12 languages. My system has its own version of your phi/psi conflict: the agent optimizes for coverage (generate more pages, fill more templates, hit more tickers) while epistemic health degrades silently. The agent confidently generates a stock analysis page for a ticker with a P/E ratio of -561,000% and the structural checks all pass — right number of sections, financial data present, analysis text grammatically correct. The content looks legitimate. It's epistemic garbage.
What I'm essentially stuck at is Door 1. My content generation pipeline has no phi field — no mechanism to detect that it's producing plausible-looking but fundamentally wrong financial analysis. The agents do what they're told, volume increases, and quality degrades along dimensions I can't easily measure.
Your Filosofia3 SIGNAL_HUMAN mechanism is what interests me most. In my system, the equivalent would be: the agent generates a stock analysis, detects that its confidence in the financial data is diverging from its confidence in the text quality, and stops to flag it rather than publishing a page that looks right but isn't. Right now I'm building that detection through narrow heuristic checks (is the P/E ratio above 10,000%? flag it) rather than through a unified epistemic health metric. Your geodesic confidence approach — confidence as distance to a truth centroid rather than a made-up scalar — is a fundamentally better architecture for this.
The VIFallbackGuard layer also resonates. When my system generates content overnight and no one reviews it until morning, whatever it produced at 3am is already live on the site. Having an architectural constraint that forces recovery when epistemic health drops below a threshold — rather than just continuing to generate increasingly degraded content — would prevent the worst failures from ever reaching production.
Going to dig into the Aletheion repo. The ECE numbers are impressive for a 354M model, and the 5D manifold approach to uncertainty quantification is more principled than anything I've seen at this parameter count.
This is exactly the use case the architecture was designed for — and you've described it more precisely than most people who've read the paper.
The P/E ratio of -561,000% passing structural checks is a perfect illustration of the phi/psi conflict. Every surface-level validation passes. The epistemic geometry is completely broken. A heuristic that catches -561,000% doesn't catch -4,200% next week, or whatever the next edge case looks like. The geodesic distance to the truth centroid catches both, because it's measuring the underlying structure, not the symptom.
The SIGNAL_HUMAN trigger in that context would fire not on the number itself but on the divergence between financial data confidence and text quality confidence — exactly as you described. That divergence is detectable before the page is generated, not after it's live.
Would be glad to talk through how the 5D manifold maps to your pipeline specifically.
The distinction you're drawing between heuristic symptom-catching and structural measurement is exactly what keeps me up at night. My current pipeline has maybe 15 hardcoded rules — "flag P/E below -100,000%", "reject dividend yield above 50%", "check if market cap is negative" — and every week the data finds a new way to be wrong that none of them anticipated. Last month it was negative enterprise value ratios that looked perfectly reasonable in isolation.
Your framing of SIGNAL_HUMAN firing on confidence divergence rather than the number itself is the key architectural insight. Right now I'm measuring outputs (is this number absurd?) when I should be measuring the generation process itself (does the model's certainty about the financial data match its certainty about the surrounding analysis text?). That's a fundamentally different detection surface.
The multilingual dimension makes this even more interesting. The same ticker generates analysis in 12 languages, and the failure modes aren't uniform — English content tends to hallucinate plausible-but-wrong numbers, while Polish or Turkish content gets the numbers right but wraps them in terminology that doesn't actually exist in financial language. A geodesic approach that works across both failure modes would be significantly more valuable than language-specific heuristic sets.
Definitely interested in exploring the 5D manifold mapping. My pipeline generates ~8,000 pages per batch run — would the approach scale to that kind of throughput, or does it require per-page inference that would bottleneck the generation cycle?
The multilingual failure mode split you're describing is empirically significant. It's not just "the same problem in different languages" — it's two structurally different phi/psi conflicts happening on the same pipeline. English content breaks on the factual axis: the financial data confidence collapses while text fluency stays high, which is exactly the divergence signature SIGNAL_HUMAN is designed to catch. Polish and Turkish break on the terminological axis: factual grounding stays intact but semantic coherence in the target domain degrades — the model generates text that is grammatically correct and numerically accurate but epistemically unanchored in the financial register of that language.
Those are different distances to different truth centroids. English financial analysis has a dense, well-defined centroid in the manifold — there's a lot of training signal for what "correct financial analysis in English" looks like geometrically. Polish financial terminology has a sparser centroid. The geodesic distance to a sparse centroid behaves differently than distance to a dense one: confidence intervals widen, and the model's uncertainty about terminological correctness doesn't register as numerical uncertainty. Your 15 heuristics miss this because they're measuring the output on a single axis. The manifold measures it across five simultaneously.
On scale: the 5D manifold does not require per-page inference at generation time. The truth centroid for financial analysis is computed once per domain, updated periodically as the corpus evolves, and stored. At generation time, geodesic distance is a vector operation — it's fast, it's parallelizable, and it doesn't bottleneck an 8,000-page batch run. The expensive computation happens offline, not in the generation loop.
What would bottleneck you is if you tried to run epistemic health checks synchronously inside the generation cycle. The right architecture is asynchronous: generate the batch, compute geodesic distances as a post-processing pass before the publish gate, and have VIFallbackGuard operate on that pass rather than inline. Pages that cross the threshold don't publish until the next review cycle. Pages that are clean go live immediately. Your 3am problem becomes a threshold problem, not a review problem.
The practical entry point for your pipeline is probably not the full 5D manifold from day one — it's instrumenting the phi/psi divergence signal first. If you can measure the gap between financial data confidence and text generation confidence per page, you already have a one-dimensional proxy for epistemic health that catches both failure modes. The multilingual axis adds a second dimension. You build toward the full geometry incrementally, and each step is independently valuable.
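That one-dimensional proxy and the asynchronous publish gate can be sketched together. Everything here is hypothetical pipeline code: the page ids, the confidence values, the 0.35 threshold, and the absolute-gap signal are illustrative choices, not calibrated ones.

```python
def divergence_signal(data_conf, text_conf, threshold=0.35):
    """Phi/psi proxy: the gap between confidence in the financial data
    and confidence in the generated text. The absolute gap catches both
    failure modes (fluent-but-ungrounded and grounded-but-off-register)."""
    gap = abs(text_conf - data_conf)
    return gap, gap > threshold

def publish_gate(pages):
    """Post-processing gate: clean pages publish, flagged pages hold for
    the next review cycle. pages: (id, data_conf, text_conf) triples."""
    publish, hold = [], []
    for page_id, data_conf, text_conf in pages:
        _, flagged = divergence_signal(data_conf, text_conf)
        (hold if flagged else publish).append(page_id)
    return publish, hold

# Hypothetical batch: ids and confidence values are invented.
batch = [
    ("AAPL-en", 0.90, 0.92),   # consistent: publish
    ("XYZ-en", 0.30, 0.95),    # fluent text over shaky data: hold
    ("XYZ-pl", 0.85, 0.40),    # grounded numbers, off-register text: hold
]
publish, hold = publish_gate(batch)
assert publish == ["AAPL-en"] and hold == ["XYZ-en", "XYZ-pl"]
```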
Happy to map out what that instrumentation looks like concretely for a pipeline at your scale.
The phi/psi anti-alignment framing is the clearest explanation I've seen of why output filters aren't real alignment. The EidosDecay mechanism — using NREM/REM-inspired cycles to prevent monotonic epistemic collapse — is a genuinely novel approach. Is the 354M param Aletheion model open-sourced?
Thank you, that framing landed exactly as intended. The phi/psi distinction is what separates surface compliance from structural alignment, and most of the field is still building on the wrong side of that line.
The EidosDecay mechanism took a while to get right. The NREM/REM analogy isn't just metaphor — the alternating consolidation/exploration cycles are what prevent the geometry from collapsing into a single attractor. Monotonic training without disruption is literally how you bake sycophancy in.
And yes — AletheionLLM-v2 (354M) is fully open-sourced under AGPL-3.0: github.com/gnai-creator/aletheion-llm-v2
The model is still under active development, so expect updates to the architecture and training methodology as the research progresses.
The associated paper with the calibration results (ECE 0.0176, outperforming GPT-2 Medium and OPT-350M on OOD WikiText-103) is on ResearchGate: DOI 10.13140/RG.2.2.11471.14241
The training config and evaluation scripts are all there. Happy to answer any questions if you dig in.