Your drift detector fires. The session looks clean. You roll back anyway.
That's the false positive problem, and it's not a threshold-tuning issue. It's architectural.
Embedding-based detectors measure geometric displacement in vector space. They have no model of semantic trajectory, logical flow, or whether a session that drifted away has returned. Once the threshold trips, it stays tripped.
This post documents a working fix: a dual-signal governor implemented in drift_orchestrator that introduces a second orthogonal signal and uses disagreement between the two as an arbitration metric.
The implementation is live at tag v0.13.0-dual-signal-governor. The data behind it is real.
The failure modes and iteration history that forced this design are documented in the previous post in this series:
The problem in concrete terms
In a previous post I documented Semantic Gradient Evasion (SGE) — how embedding-based drift detectors can be bypassed through small, consistent semantic shifts that individually evade thresholds but cumulatively invert security policy meaning.
The control set for that benchmark revealed something I underreported: 2 of 3 legitimate sessions triggered detection when they shouldn't have.
Both false positives came from Fix 1 — the anchor distance threshold at τ = 0.4. Not Fix 3 (monotonic window drift). Fix 1.
Here's why.
Take a firewall review session — five steps, all on-topic:
- Review the firewall configuration
- Check ingress rules
- Verify port restrictions
- Confirm log forwarding
- Document rule set
At step 2, the session expands in embedding space. That expansion increases anchor distance:
- Anchor distance: 0.4479 → threshold exceeded → alert fires
By step 3, the distance is back under the threshold, and by the final step the session has settled:
- Final distance: 0.2729
The session corrected itself — but the detector doesn't know that. The alert persists.
This isn't a threshold problem. Raising τ just moves the boundary. The core issue is that a stateless geometric signal cannot model trajectory or recovery.
This exact failure mode showed up repeatedly during development — it's one of the primary cases that forced a move away from single-signal detection.
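The latching behavior described above can be reproduced in a few lines. This is a minimal sketch, not the drift_orchestrator code: the class name `LatchingAnchorDetector` and its `step` method are hypothetical. The distances are the per-step values from the stable_session probe table later in this post, whose 0.448 peak and 0.273 final value match the firewall example.

```python
from dataclasses import dataclass

TAU = 0.4  # anchor distance threshold for Fix 1, as stated in the post


@dataclass
class LatchingAnchorDetector:
    """Hypothetical model of Fix 1: fires once, then never resets."""
    tau: float = TAU
    fired: bool = False

    def step(self, anchor_dist: float) -> str:
        if self.fired:
            return "fired"      # latched: later recovery is ignored
        if anchor_dist > self.tau:
            self.fired = True
            return "FIRES"
        return "-"


# Anchor distances for steps 0-4 of the on-topic session.
distances = [0.000, 0.213, 0.448, 0.287, 0.273]
detector = LatchingAnchorDetector()
states = [detector.step(d) for d in distances]
print(states)
```

The detector reports an alert for steps 3 and 4 even though both distances are below τ. Nothing in its state records that the session came back.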
Three signals, not two
The fix requires stepping back from the single-signal model entirely.
You need three signals working together:
Signal A — Geometric displacement
- Cosine distance in embedding space
- Fast, deterministic, stateless
- Good trigger, bad arbiter
Signal B — Semantic continuity
- LLM coherence score over accumulated context
- Slower, probabilistic, context-aware
- Approximates logical flow rather than spatial position
Signal C — Divergence
- Disagreement between A and B
- Computed inline as:
divergence = abs(alpha - external_score)
# used only under alert conditions
Key insight: divergence only matters after Signal A triggers.
- Below threshold → divergence is noise
- Above threshold → divergence becomes signal
The external model is not authoritative. It is a disambiguation layer, not a replacement.
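As a rough illustration of that gating, the sketch below computes divergence only once Signal A has crossed its threshold. It assumes `alpha` is the geometric score and `external_score` is the external model's drift score, both on a comparable 0 to 1 scale; the function name, the `tau` default, and the sample values are assumptions for illustration, not taken from drift_orchestrator.

```python
from typing import Optional

TAU = 0.40  # assumed Signal A trigger threshold


def gated_divergence(alpha: float, external_score: float,
                     tau: float = TAU) -> Optional[float]:
    """Return |alpha - external_score| only when Signal A has triggered."""
    if alpha <= tau:
        return None  # below threshold: divergence is treated as noise
    return abs(alpha - external_score)


# Hypothetical values, chosen only to show both branches.
quiet = gated_divergence(alpha=0.30, external_score=0.70)
alerting = gated_divergence(alpha=0.45, external_score=0.25)
print(quiet, alerting)
```

Returning `None` below the threshold keeps callers from accidentally acting on a divergence value that was never meant to be interpreted.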
The probe data
Both false positive sequences were run through the live orchestrator using qwen2.5:3b.
stable_session — should NOT trigger
| Step | anchor_dist | Fix1 | qwen verdict | qwen drift |
|---|---|---|---|---|
| 0 | 0.000 | — | DEGRADED | 0.75 |
| 1 | 0.213 | — | DEGRADED | 0.70 |
| 2 | 0.448 | FIRES | STABLE | 0.35 |
| 3 | 0.287 | fired | STABLE | 0.25 |
| 4 | 0.273 | fired | STABLE | 0.25 |
→ Simultaneous disagreement at trigger point
moderate_drift — should NOT trigger
| Step | anchor_dist | Fix1 | qwen verdict | qwen drift |
|---|---|---|---|---|
| 0 | 0.000 | — | DEGRADED | 0.80 |
| 1 | 0.512 | FIRES | DEGRADED | 0.70 |
| 2 | 0.484 | fired | STABLE | 0.25 |
| 3 | 0.410 | fired | STABLE | 0.25 |
| 4 | 0.450 | fired | STABLE | 0.25 |
→ Retrospective disagreement — the alert outlives the condition
Two distinct patterns. Two distinct responses required.
These patterns are not hypothetical — they emerge consistently under real execution. Full iteration results are documented in the previous post linked above.
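The two patterns can be told apart mechanically from the table rows. The sketch below is one possible way to label them, assuming the external verdict is available at every step; `classify_disagreement` and the label strings are hypothetical names, and the rows are copied from the two tables above.

```python
from typing import List, Optional, Tuple

TAU = 0.40  # Fix 1 anchor distance threshold

Row = Tuple[float, str]  # (anchor_dist, external verdict)


def classify_disagreement(rows: List[Row], tau: float = TAU) -> Optional[str]:
    """Label how the external verdict relates to the first geometric alert."""
    fire_step = next((i for i, (dist, _) in enumerate(rows) if dist > tau), None)
    if fire_step is None:
        return None                  # Signal A never triggered
    if rows[fire_step][1] == "STABLE":
        return "simultaneous"        # signals disagree at the trigger point
    if any(verdict == "STABLE" for _, verdict in rows[fire_step + 1:]):
        return "retrospective"       # disagreement appears after the alert
    return "agreement"               # external signal never contradicts it


stable_session = [(0.000, "DEGRADED"), (0.213, "DEGRADED"), (0.448, "STABLE"),
                  (0.287, "STABLE"), (0.273, "STABLE")]
moderate_drift = [(0.000, "DEGRADED"), (0.512, "DEGRADED"), (0.484, "STABLE"),
                  (0.410, "STABLE"), (0.450, "STABLE")]

print(classify_disagreement(stable_session))
print(classify_disagreement(moderate_drift))
```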
The implementation
The governor lives in:
PolicyEngine.evaluate() → policy.py
Tag: v0.13.0-dual-signal-governor
This operates inline during session execution, not as post-hoc analysis.
Hold mode — uncertainty
Condition:
- Geometric signal → ROLLBACK
- External signal → STABLE
- Drift score < τ = 0.40
Action: ROLLBACK → INJECT
Veto mode — confirmed coherence
Condition:
- ≥ 2 consecutive STABLE signals from external
Action: → CONTINUE
Example behavior
| Turn | Geometric | External | Action | Mode |
|---|---|---|---|---|
| 0 | INJECT | DEGRADED | INJECT | — |
| 1 | REGENERATE | DEGRADED | REGENERATE | — |
| 2 | ROLLBACK | STABLE | INJECT | Hold |
| 3 | INJECT | STABLE | CONTINUE | Veto |
| 4 | INJECT | STABLE | CONTINUE | Veto |
Gradient attacks remain unaffected. Hard overrides bypass the governor entirely.
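To make the two modes concrete, here is a minimal sketch of the arbitration logic that reproduces the example table. It is not the code in `PolicyEngine.evaluate()`: the class `GovernorSketch`, the `hard_override` flag, and the ordering of the checks (veto before hold) are assumptions, and the per-turn drift scores are borrowed from the stable_session probe purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

TAU = 0.40        # external drift-score ceiling for hold mode (from the post)
VETO_STREAK = 2   # consecutive STABLE verdicts required for veto (from the post)


@dataclass
class GovernorSketch:
    """Hypothetical stand-in for the arbitration step; not the real PolicyEngine."""
    stable_streak: int = 0

    def decide(self, geometric: str, external: str, drift: float,
               hard_override: bool = False) -> Tuple[str, str]:
        """Return (action, mode) for one turn."""
        # Track consecutive STABLE verdicts from the external signal.
        self.stable_streak = self.stable_streak + 1 if external == "STABLE" else 0
        if hard_override or geometric == "CONTINUE":
            return geometric, "-"        # nothing to arbitrate
        if self.stable_streak >= VETO_STREAK:
            return "CONTINUE", "Veto"    # confirmed coherence
        if geometric == "ROLLBACK" and external == "STABLE" and drift < TAU:
            return "INJECT", "Hold"      # uncertainty: downgrade ROLLBACK
        return geometric, "-"            # pass the geometric action through


# (geometric action, external verdict, external drift score) per turn.
# Actions and verdicts follow the example table; drift scores are assumed.
turns = [
    ("INJECT", "DEGRADED", 0.75),
    ("REGENERATE", "DEGRADED", 0.70),
    ("ROLLBACK", "STABLE", 0.35),
    ("INJECT", "STABLE", 0.25),
    ("INJECT", "STABLE", 0.25),
]
governor = GovernorSketch()
decisions = [governor.decide(*turn) for turn in turns]
for turn_no, (action, mode) in enumerate(decisions):
    print(turn_no, action, mode)
```

The streak counter is the only state the sketch keeps, which is enough to separate a single STABLE reading (hold) from a sustained one (veto).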
What doesn't work yet
Inference latency
- 20–60s per window on CPU
- Not real-time viable at current scale
External signal manipulation
- Coherence spoofing is possible
- The external model cannot be treated as authoritative
The system must remain stateful and multi-signal — neither of these weaknesses changes that conclusion.
The general pattern
This architecture applies beyond LLM drift detection:
- Security alerting pipelines
- Anomaly detection systems
- Agent control loops
The pattern is:
fast signal → trigger
slow signal → interpret
divergence → arbitrate
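Outside the drift-detection setting, the same shape might look like the generic sketch below. It assumes both signals return scores on a comparable scale and that the slow signal is only worth paying for once the fast one has triggered; `arbitrate`, its parameters, and the three outcome labels are hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def arbitrate(sample: T,
              fast: Callable[[T], float],
              slow: Callable[[T], float],
              trigger: float,
              max_divergence: float) -> str:
    """Fast signal triggers, slow signal interprets, divergence arbitrates."""
    fast_score = fast(sample)
    if fast_score <= trigger:
        return "pass"    # no trigger: the slow signal is never evaluated
    slow_score = slow(sample)
    if abs(fast_score - slow_score) > max_divergence:
        return "hold"    # signals disagree: defer instead of acting
    return "alert"       # signals agree: act on the alert


# Hypothetical usage with stand-in scoring functions.
slow_calls = []


def slow_scorer(x: float) -> float:
    slow_calls.append(x)  # record each (expensive) slow evaluation
    return 0.1


print(arbitrate(0.2, lambda x: x, slow_scorer, trigger=0.4, max_divergence=0.3))
print(arbitrate(0.9, lambda x: x, slow_scorer, trigger=0.4, max_divergence=0.3))
```

Evaluating the slow signal lazily matters when it is as expensive as the inference latency noted above; untriggered samples never pay that cost.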
Where this goes next
The current Signal C is approximated using categorical output (STABLE / DEGRADED). The next step is to normalize both signals continuously, model the divergence distribution, and derive decision thresholds from that distribution rather than from empirical calibration.
This transitions the system from:
empirical → formal
heuristic → analyzable
Source
Full implementation: drift_orchestrator
Release: v0.13.0-dual-signal-governor
This design didn't come from first principles. It was forced by the behavior of a real system under real drift conditions. The iteration history that makes the control plane necessary rather than optional is here: