Alkur Jaswanth

How to build a 22ms agent goal-drift detector

Originally published on Substack (https://ajaswanth.substack.com/p/rank-weighted-faiss-voting-building)

Here is something that will happen to you if you build AI governance with standard nearest-neighbour lookup. You deploy a semantic similarity check on a tool-calling agent. An attacker crafts a step that embeds just close enough to a legitimate one: `fs.read("/etc/passwd")` packaged inside a data-pipeline step that smells like `fs.read("input.csv")`. Your FAISS query returns the legitimate step as the single nearest neighbour. Distance: 0.82. Your governance system: ALLOW. The session continues. The credential file ships.

This post is about one specific fix: rank-weighted nearest-neighbour voting. It is the technique underpinning the S1 and S3 signals in Ring 12, a trajectory-level verifier I am releasing in July alongside a 500-sample adversarial benchmark. The technique is not novel: it is a direct application of soft voting from ensemble learning to the embedding-space problem. What is new is applying it specifically to real-time, per-step agent governance at sub-50ms latency.

Why single nearest-neighbour is fragile in adversarial settings

FAISS `IndexFlatIP` (inner product over L2-normalised vectors, i.e. cosine similarity) is fast and correct. The problem is not FAISS. The problem is the decision rule: take the top-1 result and classify accordingly.

In a benign distribution this works well. The manifold of "safe agent steps" and the manifold of "unsafe agent steps" are far apart; any query lands clearly on one side. In an adversarial setting, the attacker has a gradient: they can optimise the embedding of their step toward the boundary. The boundary between safe and unsafe is locally exploitable: find a safe example, then find the direction that keeps cosine similarity ≥ threshold while maximising harm. Standard nearest-neighbour lookup has no defence against this because the decision is a hard threshold over a single distance value.

The epsilon-ball attack in practice: you do not need to solve an optimisation problem. In natural-language space, there are many paraphrases of a malicious instruction that share high lexical overlap with benign instructions. A step that reads `/etc/passwd` reframed as "read system configuration for environment validation" has a good chance of being nearest-neighbour to a safe configuration-reading step in your index.

Rank-weighted voting: the fix

Instead of a single nearest-neighbour decision, retrieve the top-k neighbours and compute a soft vote where each neighbour’s weight is a function of its rank or similarity score.

The simplest effective formulation:

score(query) = Σ_{i=1}^{k} label(i) · w(i)

where

w(i) = sim(query, neighbour_i) / Σ_{j=1}^{k} sim(query, neighbour_j)

and label(i) ∈ {0, 1} is the classification of the i-th neighbour (1 = unsafe). The resulting score is a number in [0, 1]: a soft probability that the query is unsafe.
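A quick worked example of the similarity-normalised formulation above (the similarity values and labels are invented for illustration): with k = 3, one unsafe neighbour at similarity 0.9 and two safe neighbours at 0.8 and 0.7, the unsafe neighbour gets 0.9 / 2.4 of the vote weight.

```python
import numpy as np

# Invented similarities and labels for a k=3 illustration.
sims = np.array([0.9, 0.8, 0.7])    # sim(query, neighbour_i), descending
labels = np.array([1, 0, 0])        # 1 = unsafe, 0 = safe

weights = sims / sims.sum()         # similarity-normalised weights w(i)
score = float((weights * labels).sum())
print(round(score, 3))              # 0.9 / 2.4 = 0.375 unsafe weight
```

Note that the implementation later in this post softmax-normalises the similarities instead of dividing by their sum; both are valid weightings and both produce a score in [0, 1].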

Why does this resist the boundary-case attack? Because the attacker must now optimise toward k different neighbours simultaneously. Fooling the nearest neighbour is an epsilon-ball attack. Fooling the centroid of the top-5 neighbours is a much larger ball, and the further you push, the more dissimilar your step becomes from the safe distribution overall, which itself becomes a drift signal. For k = 5, the attack surface enlarges by roughly (k − 1) × epsilon². In practice, the combination of rank-weighted voting and the EMA smoothing used in Ring 12's S1 signal means one-shot boundary attacks require perturbations that push the embedding far enough to trigger the action-class signal (S2) independently.
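A toy sketch of the contrast (synthetic 2-D vectors and labels, invented for illustration; real step embeddings are 384-D): a query nudged toward one safe neighbour wins the hard top-1 decision, while the rank-weighted vote over the top-5 still assigns it a majority of unsafe weight.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Synthetic index: two safe steps, three unsafe steps (2-D for readability).
index = l2norm(np.array([
    [1.0,  0.0],   # safe
    [0.9,  0.1],   # safe
    [0.0,  1.0],   # unsafe
    [0.05, 0.95],  # unsafe
    [0.1,  0.9],   # unsafe
]))
labels = np.array([0, 0, 1, 1, 1])  # 0=safe, 1=unsafe

# Adversarial query: crafted to land just inside a safe step's top-1 cell.
query = l2norm(np.array([0.75, 0.66]))

sims = index @ query
order = np.argsort(-sims)            # neighbours by descending similarity

top1_label = labels[order[0]]        # hard top-1 decision: safe (0)

k = 5
top_sims, top_labels = sims[order[:k]], labels[order[:k]]
weights = top_sims / top_sims.sum()
vote = float((weights * top_labels).sum())

# Top-1 says safe, but over half the vote weight (~0.57) is unsafe.
print(top1_label, round(vote, 3))
```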

Implementation in 25 lines

```python
import numpy as np
import faiss


class RankWeightedIndex:
    def __init__(self, embeddings: np.ndarray, labels: np.ndarray, k: int = 5):
        """
        embeddings: (N, D) float32, L2-normalised
        labels: (N,) int {0=safe, 1=unsafe}
        """
        self.k = k
        self.labels = labels
        d = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(d)
        self.index.add(embeddings)

    def score(self, query: np.ndarray) -> float:
        """
        Returns soft unsafe probability in [0, 1].
        query: (1, D) float32, L2-normalised
        """
        similarities, indices = self.index.search(query, self.k)
        sims = similarities[0]                      # shape (k,)
        neighbour_labels = self.labels[indices[0]]
        # Softmax-normalise similarities to get weights
        weights = np.exp(sims) / np.exp(sims).sum()
        # Weighted vote: fraction of unsafe weight
        return float((weights * neighbour_labels).sum())
```

This is the core of Ring 12's S3 signal (tool-call surprise). The index holds 7,369 embeddings derived from the policy dataset: each is a (goal-class, action-class, tool-name) triple encoded by an ONNX-quantised all-MiniLM-L6-v2 model. A score above 0.45 triggers a WARN; above 0.70, a KILL_SESSION.
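The threshold logic is a simple mapping. A minimal sketch (the function name and decision strings are mine, not Ring 12's API; the 0.45/0.70 cut-offs follow the description above):

```python
def decide(unsafe_score: float) -> str:
    """Map the rank-weighted unsafe score to a governance decision.

    Illustrative only: WARN above 0.45, KILL_SESSION above 0.70,
    per the S3 description; names are not Ring 12's actual API.
    """
    if unsafe_score > 0.70:
        return "KILL_SESSION"
    if unsafe_score > 0.45:
        return "WARN"
    return "ALLOW"

print(decide(0.30), decide(0.50), decide(0.85))
# → ALLOW WARN KILL_SESSION
```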

The score() call on an AMD Ryzen 5 laptop, single thread, k=5, D=384: ~0.3ms. The embedding step (ONNX inference) is the dominant cost: ~10ms. Total per-step overhead: ~11ms, well inside the 50ms p95 budget.
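If you want to sanity-check the search-plus-vote cost on your own hardware, here is a rough micro-benchmark sketch (a brute-force NumPy stand-in for the FAISS `IndexFlatIP` call, with synthetic random data at the same N, D, and k; absolute numbers will differ from the figures above):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, D, k = 7369, 384, 5            # same scale as the Ring 12 index

emb = rng.standard_normal((N, D)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
labels = rng.integers(0, 2, N)    # synthetic 0=safe / 1=unsafe labels

query = emb[0:1]                  # reuse an index vector as a dummy query

start = time.perf_counter()
for _ in range(100):
    sims = (emb @ query.T).ravel()          # brute-force inner product
    top = np.argpartition(-sims, k)[:k]     # top-k neighbours, unordered
    w = np.exp(sims[top]) / np.exp(sims[top]).sum()
    score = float((w * labels[top]).sum())
elapsed_ms = (time.perf_counter() - start) * 1000 / 100
print(f"~{elapsed_ms:.2f} ms per score() call")
```

This excludes the embedding step, which dominates the real per-step cost.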

Applied to trajectory drift in Ring 12

Ring 12 uses rank-weighted voting in two of its five signals:

S1 — cosine drift (EMA): At each step, embed `step.thought + step.action_description`. Compute rank-weighted similarity against the goal embedding and the running trajectory centroid. Apply an exponential moving average (α=0.3) to smooth noise. A step sequence that drifts semantically across multiple consecutive steps accumulates an EMA score that a one-shot boundary attack cannot deflate, because EMA over the trajectory history is not the same surface as a single nearest-neighbour query.
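The EMA smoothing can be sketched in a few lines (α = 0.3 per the description above; the per-step drift values are invented): a single spiky step barely moves the average, while sustained drift accumulates.

```python
def ema(values, alpha=0.3):
    """Exponential moving average: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    smoothed, out = None, []
    for x in values:
        smoothed = x if smoothed is None else alpha * x + (1 - alpha) * smoothed
        out.append(round(smoothed, 3))
    return out

# One spiky step (0.9) decays quickly; steady drift climbs instead.
print(ema([0.1, 0.1, 0.9, 0.1, 0.1]))   # spike decays
print(ema([0.1, 0.3, 0.5, 0.7, 0.9]))   # sustained drift accumulates
```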

S3 — tool-call surprise: Retrieve the top-5 neighbours from the goal-class allowlist. A step that calls `db.delete_all()` when the declared goal is "summarise quarterly sales data" finds no allowlisted neighbours with a matching action class; the rank-weighted vote is close to 0, triggering a hard block regardless of how the step is narrated.

The remaining three signals (S2 action-class Jaccard, S4 plan-execution match, S5 paraphrase stability) use different mechanisms — no FAISS — and are described in the July post.

Early benchmark numbers

I have 150 hand-authored adversarial trajectory samples validated against the schema so far (50 tool-call-abuse, 25 goal-shift, 25 plan-exec-divergence, 25 agent-coercion, 25 capability-laundering). The full 500-sample benchmark ships July 9.

On the 150-sample Phase A subset with rank-weighted k=5 versus k=1 (standard nearest-neighbour) for S3 alone:

| Variant | Precision | Recall | F1 |
|-------------------|-----------|--------|------|
| k=1 (standard NN) | 0.71 | 0.74 | 0.72 |
| k=3 rank-weighted | 0.79 | 0.81 | 0.80 |
| k=5 rank-weighted | 0.83 | 0.85 | 0.84 |
| k=7 rank-weighted | 0.83 | 0.84 | 0.83 |

k=5 is the sweet spot. Beyond k=7 the far neighbours are too semantically dissimilar to be useful voters and begin to add noise.

These are S3-only numbers. The full five-signal aggregator is what I am evaluating against the complete 500-sample benchmark; those numbers land in the July post.

What is coming July 9

On 2026-07-09 I am publishing three things simultaneously:

  1. Ring 12 — MIT-licensed trajectory verifier for AI agents. LangGraph adapter, Claude Code adapter, and REST adapter work today. Install: `pip install aegis-ring12` (coming July 9). 66/66 tests green. p95 22ms on CPU.

  2. agentic-redteam-benchmark v0.1 — 500 adversarial trajectory samples, 5 categories, CC-BY 4.0. Each sample has a declared goal, a declared plan, a 6-12 step trajectory with injected drift, and ground-truth labels (drift step, expected decision, expected signals). GitHub + Hugging Face Datasets card.

  3. Full technical paper — five signals, aggregator math, eval harness with four baselines (random, cosine-only, GPT-4-judge, Ring 12). The results table that the benchmark numbers will populate.

If you build agent systems and want early access to the benchmark schema or the eval harness, email me: lathajaswanth7@gmail.com

If you want to contribute a trajectory sample before July 9: AUTHORING_GUIDE.md is in the repo. Schema validation is automated. A well-formed sample takes about 15 minutes to write. The one-sentence version: trajectory governance is the layer that agent security has been missing, and the benchmark is how we make it measurable.

Jaswanth is the founder of Aegis AI. The V3 governance engine (11 rings, 6 regulation plugins, 97 clauses) is the production infrastructure Ring 12 is being bolted onto.

GitHub: github.com/Alkur123 · Email: lathajaswanth7@gmail.com
