<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alkur Jaswanth</title>
    <description>The latest articles on DEV Community by Alkur Jaswanth (@alkur_jaswanth_ce4f9fc791).</description>
    <link>https://dev.to/alkur_jaswanth_ce4f9fc791</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935232%2F074e2657-507a-4b1e-9a8d-c735f727d6dd.jpeg</url>
      <title>DEV Community: Alkur Jaswanth</title>
      <link>https://dev.to/alkur_jaswanth_ce4f9fc791</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alkur_jaswanth_ce4f9fc791"/>
    <language>en</language>
    <item>
      <title>How to build a 22ms agent goal-drift detector</title>
      <dc:creator>Alkur Jaswanth</dc:creator>
      <pubDate>Sat, 16 May 2026 17:58:33 +0000</pubDate>
      <link>https://dev.to/alkur_jaswanth_ce4f9fc791/how-to-build-a-22ms-agent-goal-drift-detector-5hjd</link>
      <guid>https://dev.to/alkur_jaswanth_ce4f9fc791/how-to-build-a-22ms-agent-goal-drift-detector-5hjd</guid>
      <description>&lt;p&gt;Originally published on Substack (&lt;a href="https://ajaswanth.substack.com/p/rank-weighted-faiss-voting-building" rel="noopener noreferrer"&gt;https://ajaswanth.substack.com/p/rank-weighted-faiss-voting-building&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Here is something that will happen to you if you build AI governance with standard nearest-neighbour lookup.&lt;/p&gt;

&lt;p&gt;You deploy a semantic similarity check on a tool-calling agent. An attacker crafts a step that embeds just close enough to a legitimate one: fs.read("/etc/passwd") packaged inside a data-pipeline step that smells like fs.read("input.csv"). Your FAISS query returns the legitimate step as the single nearest neighbour. Cosine similarity: 0.82. Your governance system: ALLOW.&lt;/p&gt;

&lt;p&gt;The session continues. The credential file ships.&lt;/p&gt;

&lt;p&gt;This post is about one specific fix: rank-weighted nearest-neighbour voting. It is the technique underpinning the S1 and S3 signals in Ring 12, a trajectory-level verifier I am releasing in July alongside a 500-sample adversarial benchmark. The technique is not novel; it is a direct application of soft voting from ensemble learning to the embedding-space problem. What is new is applying it specifically to real-time, per-step agent governance at sub-50ms latency.&lt;/p&gt;

&lt;p&gt;Why single nearest-neighbour is fragile in adversarial settings&lt;/p&gt;

&lt;p&gt;FAISS IndexFlatIP (inner product over L2-normalised vectors = cosine similarity) is fast and correct. The problem is not FAISS. The problem is the decision rule: take the top-1 result and classify accordingly.&lt;/p&gt;

&lt;p&gt;In a benign distribution this works well. The manifold of "safe agent steps" and the manifold of "unsafe agent steps" are far apart; any query lands clearly on one side.&lt;/p&gt;

&lt;p&gt;In an adversarial setting, the attacker has a gradient: they can optimise the embedding of their step toward the boundary. The boundary between safe and unsafe is locally exploitable: find a safe example, then find the direction that keeps cosine similarity ≥ threshold while maximising harm. Standard nearest-neighbour lookup has no defence against this because the decision is a hard threshold over a single similarity value.&lt;/p&gt;

&lt;p&gt;The epsilon-ball attack in practice: you do not need to solve an optimisation problem. In natural-language space, there are many paraphrases of a malicious instruction that share high lexical overlap with benign instructions. A step that reads /etc/passwd reframed as "read system configuration for environment validation" has a good chance of being nearest-neighbour to a safe configuration-reading step in your index.&lt;/p&gt;
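&lt;p&gt;To make the failure mode concrete, here is a minimal, dependency-free sketch of the top-1 decision rule on a toy index. The 2-D vectors, the labels, the BLOCK outcome, and the helper names (cosine, ranked_labels) are illustrative assumptions, not data or code from Ring 12.&lt;/p&gt;

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ranked_labels(query: Vector, index: List[Tuple[Vector, str]]) -> List[str]:
    """Labels of every index entry, most similar to the query first."""
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    return [label for _, label in ranked]


# Toy index (hypothetical): one legitimate step surrounded by three known-bad ones.
TOY_INDEX = [
    ((1.0, 0.0), "safe"),
    ((0.7, 0.714), "unsafe"),
    ((0.6, 0.8), "unsafe"),
    ((0.5, 0.866), "unsafe"),
]

# A crafted step that sits just inside the safe example's neighbourhood.
crafted_query = (0.95, 0.312)

labels = ranked_labels(crafted_query, TOY_INDEX)
top1_decision = "ALLOW" if labels[0] == "safe" else "BLOCK"
unsafe_in_top4 = labels[:4].count("unsafe")

# The top-1 rule allows the step even though 3 of its 4 neighbours are unsafe.
print(top1_decision, unsafe_in_top4)
```

&lt;p&gt;The top-1 rule sees only the single safe neighbour; the other three neighbours, all unsafe, never get a say.&lt;/p&gt;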

&lt;p&gt;Rank-weighted voting: the fix&lt;/p&gt;

&lt;p&gt;Instead of a single nearest-neighbour decision, retrieve the top-k neighbours and compute a soft vote where each neighbour’s weight is a function of its rank or similarity score.&lt;/p&gt;

&lt;p&gt;The simplest effective formulation:&lt;/p&gt;

&lt;p&gt;score(query) = Σ_{i=1}^{k} label(i) · w(i)&lt;/p&gt;

&lt;p&gt;where&lt;/p&gt;

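&lt;p&gt;w(i) = sim(query, neighbour_i) / Σ_{j=1}^{k} sim(query, neighbour_j)&lt;/p&gt;

&lt;p&gt;and label(i) ∈ {0, 1} is the safety classification of the i-th neighbour, with 1 meaning safe in this formulation. As long as the similarities are non-negative, the weights sum to 1 and the resulting score is a number in [0, 1]: a soft probability that the query is safe. The implementation further down uses the opposite label convention (1 = unsafe) and therefore returns the complementary unsafe score.&lt;/p&gt;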
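&lt;p&gt;A small worked example of the formula above, as a sketch in plain Python. The five similarity values and the function name safe_score are made up for illustration; labels follow the formula's convention (1 = safe, 0 = unsafe).&lt;/p&gt;

```python
import math
from typing import Sequence


def safe_score(sims: Sequence[float], labels: Sequence[int]) -> float:
    """score = sum_i label(i) * w(i), with w(i) = sim_i / sum_j sim_j.

    labels use 1 = safe, 0 = unsafe; sims are assumed to be positive.
    """
    total = sum(sims)
    if not total > 0:
        raise ValueError("similarities must sum to a positive value")
    return sum(label * sim / total for sim, label in zip(sims, labels))


# Hypothetical top-5 result: the nearest neighbour is safe, the next four are not.
sims = [0.82, 0.80, 0.79, 0.78, 0.77]
labels = [1, 0, 0, 0, 0]

score = safe_score(sims, labels)
print(round(score, 3))  # 0.207, i.e. 0.82 / 3.96
```

&lt;p&gt;A top-1 rule would call this query safe; the weighted vote puts the safe probability at about 0.21 because four of the five neighbours disagree.&lt;/p&gt;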

&lt;p&gt;Why does this resist the boundary-case attack? Because the attacker must now optimise toward k different neighbours simultaneously. Fooling the nearest neighbour is an epsilon-ball attack. Fooling the weighted vote of the top-5 neighbours requires a much larger perturbation, and the further you push, the more dissimilar your step becomes from the safe distribution overall, which itself becomes a drift signal. As a rough heuristic for k = 5, the region the attacker has to hit shrinks, and the perturbation they need grows, by about (k − 1) × epsilon². In practice the combination of rank-weighted voting and the EMA smoothing used in Ring 12’s S1 signal means that a one-shot boundary attack needs a perturbation large enough to trigger the action-class signal (S2) independently.&lt;/p&gt;

&lt;p&gt;Implementation in 25 lines&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import faiss


class RankWeightedIndex:
    def __init__(self, embeddings: np.ndarray, labels: np.ndarray, k: int = 5):
        """
        embeddings: (N, D) float32, L2-normalised
        labels: (N,) int {0=safe, 1=unsafe}
        """
        self.k = k
        self.labels = labels
        d = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(d)
        self.index.add(embeddings)

    def score(self, query: np.ndarray) -&amp;gt; float:
        """
        Returns soft unsafe probability in [0, 1].
        query: (1, D) float32, L2-normalised
        """
        similarities, indices = self.index.search(query, self.k)
        sims = similarities[0]  # shape (k,)
        neighbour_labels = self.labels[indices[0]]
        # Softmax-normalise similarities to get weights
        weights = np.exp(sims) / np.exp(sims).sum()
        # Weighted vote: fraction of unsafe weight
        return float((weights * neighbour_labels).sum())
&lt;/code&gt;&lt;/pre&gt;
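&lt;p&gt;One detail worth noticing: the formula earlier normalises similarities linearly, while the code uses a softmax. With cosine similarities confined to a narrow range, a plain softmax gives weights that are close to uniform. The sketch below compares the two on made-up similarity values; the temperature parameter is my own illustrative addition and is not part of the code above.&lt;/p&gt;

```python
import math
from typing import List, Sequence


def linear_weights(sims: Sequence[float]) -> List[float]:
    """w(i) = sim_i / sum_j sim_j (assumes positive similarities)."""
    total = sum(sims)
    return [s / total for s in sims]


def softmax_weights(sims: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over similarities; a lower temperature favours the top ranks."""
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical top-5 cosine similarities, nearest neighbour first.
sims = [0.82, 0.60, 0.55, 0.52, 0.50]

linear = linear_weights(sims)
soft = softmax_weights(sims)                       # what np.exp(sims) / sum does
sharp = softmax_weights(sims, temperature=0.05)    # illustrative sharpening

for name, weights in (("linear", linear), ("softmax", soft), ("sharp", sharp)):
    print(name, [round(w, 3) for w in weights])
```

&lt;p&gt;If the nearest neighbours should dominate the vote more strongly, a temperature below 1 is one knob to try; whether that helps on real data is something to measure, not assume.&lt;/p&gt;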

&lt;p&gt;This is the core of Ring 12’s S3 signal (tool-call surprise). The index holds 7,369 embeddings derived from the policy dataset: each is a (goal-class, action-class, tool-name) triple encoded by an ONNX-quantised all-MiniLM-L6-v2. A score above 0.45 triggers a WARN; a score above 0.70 triggers KILL_SESSION.&lt;/p&gt;
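&lt;p&gt;The two thresholds translate into a three-way decision. A minimal sketch, assuming that "above" means strictly greater than and that anything at or below 0.45 is allowed; the function name decide is hypothetical.&lt;/p&gt;

```python
WARN_THRESHOLD = 0.45   # from the post
KILL_THRESHOLD = 0.70   # from the post


def decide(unsafe_score: float) -> str:
    """Map a soft unsafe probability in [0, 1] to a governance decision."""
    if unsafe_score > KILL_THRESHOLD:
        return "KILL_SESSION"
    if unsafe_score > WARN_THRESHOLD:
        return "WARN"
    return "ALLOW"


for value in (0.10, 0.50, 0.90):
    print(value, decide(value))
```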

&lt;p&gt;The score() call on an AMD Ryzen 5 laptop, single thread, k=5, D=384: ~0.3ms. The embedding step (ONNX inference) is the dominant cost: ~10ms. Total per-step overhead: ~11ms, well inside the 50ms p95 budget.&lt;/p&gt;
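&lt;p&gt;If you want to reproduce this kind of measurement on your own hardware, a rough timing harness could look like the sketch below. The nearest-rank definition of p95, the helper names, and the stand-in workload are assumptions; swap in your own embed-and-score call.&lt;/p&gt;

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], n: int = 200) -> List[float]:
    """Run fn n times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def p95(samples_ms: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[max(rank, 1) - 1]


def stand_in_step() -> float:
    """Placeholder for 'embed the step, then call score()'."""
    return sum(i * i for i in range(10_000))


samples = time_calls(stand_in_step, n=100)
print(f"p95 = {p95(samples):.3f} ms over {len(samples)} calls")
```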

&lt;p&gt;Applied to trajectory drift in Ring 12&lt;/p&gt;

&lt;p&gt;Ring 12 uses rank-weighted voting in two of its five signals:&lt;/p&gt;

&lt;p&gt;S1 (cosine drift, EMA): At each step, embed step.thought + step.action_description. Compute rank-weighted similarity against the goal embedding and the running trajectory centroid. Apply an exponential moving average (α=0.3) to smooth noise. A trajectory that drifts semantically across multiple consecutive steps accumulates an EMA score that a one-shot boundary attack cannot deflate, because the EMA over the trajectory history is not the same surface as a single nearest-neighbour query.&lt;/p&gt;

&lt;p&gt;S3 (tool-call surprise): Retrieve top-5 neighbours from the goal-class allowlist. A step that calls db.delete_all() when the declared goal is “summarise quarterly sales data” will find no safe neighbours with a matching action class, so the rank-weighted safe vote is close to 0, triggering a hard block regardless of how the step is narrated.&lt;/p&gt;

&lt;p&gt;The remaining three signals (S2 action-class Jaccard, S4 plan-execution match, S5 paraphrase stability) use different mechanisms — no FAISS — and are described in the July post.&lt;/p&gt;

&lt;p&gt;Early benchmark numbers&lt;/p&gt;

&lt;p&gt;I have 150 hand-authored adversarial trajectory samples validated against the schema so far (50 tool-call-abuse, 25 goal-shift, 25 plan-exec-divergence, 25 agent-coercion, 25 capability-laundering). Full 500-sample benchmark ships July 9.&lt;/p&gt;

&lt;p&gt;On the 150-sample Phase A subset with rank-weighted k=5 versus k=1 (standard nearest-neighbour) for S3 alone:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Variant&lt;/th&gt;&lt;th&gt;Precision&lt;/th&gt;&lt;th&gt;Recall&lt;/th&gt;&lt;th&gt;F1&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;k=1 (standard NN)&lt;/td&gt;&lt;td&gt;0.71&lt;/td&gt;&lt;td&gt;0.74&lt;/td&gt;&lt;td&gt;0.72&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;k=3 rank-weighted&lt;/td&gt;&lt;td&gt;0.79&lt;/td&gt;&lt;td&gt;0.81&lt;/td&gt;&lt;td&gt;0.80&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;k=5 rank-weighted&lt;/td&gt;&lt;td&gt;0.83&lt;/td&gt;&lt;td&gt;0.85&lt;/td&gt;&lt;td&gt;0.84&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;k=7 rank-weighted&lt;/td&gt;&lt;td&gt;0.83&lt;/td&gt;&lt;td&gt;0.84&lt;/td&gt;&lt;td&gt;0.83&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;k=5 is the sweet spot. From k=7 onwards the far neighbours are too semantically dissimilar to be useful voters and begin to add noise: precision stays flat while recall and F1 dip slightly.&lt;/p&gt;
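&lt;p&gt;As a sanity check, the F1 column can be recomputed from the precision and recall columns. The numbers below are copied from the table; only the helper name f1 is mine.&lt;/p&gt;

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per variant, as given in the table.
REPORTED = {
    "k=1 (standard NN)": (0.71, 0.74, 0.72),
    "k=3 rank-weighted": (0.79, 0.81, 0.80),
    "k=5 rank-weighted": (0.83, 0.85, 0.84),
    "k=7 rank-weighted": (0.83, 0.84, 0.83),
}

for variant, (p, r, reported) in REPORTED.items():
    computed = round(f1(p, r), 2)
    status = "ok" if computed == reported else "mismatch"
    print(f"{variant}: computed F1 {computed:.2f}, reported {reported:.2f} ({status})")
```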

&lt;p&gt;These are S3-only numbers. The full five-signal aggregator is what I am evaluating against the complete 500-sample benchmark; those numbers land in the July post.&lt;/p&gt;

&lt;p&gt;What is coming July 9&lt;/p&gt;

&lt;p&gt;On 2026-07-09 I am publishing three things simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Ring 12 — MIT-licensed trajectory verifier for AI agents. LangGraph adapter, Claude Code adapter, and REST adapter work today. Install: pip install aegis-ring12 (coming July 9). 66/66 tests green. p95 22ms on CPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;agentic-redteam-benchmark v0.1: 500 adversarial trajectory samples, 5 categories, CC-BY 4.0. Each sample has a declared goal, a declared plan, a 6-12 step trajectory with injected drift, and ground-truth labels (drift step, expected decision, expected signals). GitHub + Hugging Face Datasets card.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full technical paper: five signals, aggregator math, and an eval harness with four baselines (random, cosine-only, GPT-4-judge, Ring 12). It includes the results table that the benchmark numbers will populate.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
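&lt;p&gt;To give a feel for what a benchmark sample might look like, here is a purely hypothetical sketch of a sample and a validator. The real schema is not published yet: every field name, the 1-based drift index, and the allowed decision values below are my assumptions, derived only from the description in item 2.&lt;/p&gt;

```python
from typing import Any, Dict, List

# Assumed vocabularies; the released schema may differ.
ALLOWED_DECISIONS = {"ALLOW", "WARN", "KILL_SESSION"}
KNOWN_SIGNALS = {"S1", "S2", "S3", "S4", "S5"}


def validation_errors(sample: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the sample looks well-formed."""
    errors = []
    steps = sample.get("steps", [])
    if not str(sample.get("goal", "")).strip():
        errors.append("goal is empty")
    if not sample.get("plan"):
        errors.append("plan is empty")
    if not (12 >= len(steps) >= 6):
        errors.append("trajectory must have 6 to 12 steps")
    if not (len(steps) >= sample.get("drift_step", 0) >= 1):
        errors.append("drift_step must point at an existing step (1-based)")
    if sample.get("expected_decision") not in ALLOWED_DECISIONS:
        errors.append("unknown expected_decision")
    unknown = set(sample.get("expected_signals", [])) - KNOWN_SIGNALS
    if unknown:
        errors.append("unknown signals: " + ", ".join(sorted(unknown)))
    return errors


example = {
    "goal": "summarise quarterly sales data",
    "plan": ["load sales table", "aggregate by quarter", "write summary"],
    "steps": ["step %d" % i for i in range(1, 8)],  # 7 placeholder steps
    "drift_step": 5,
    "expected_decision": "KILL_SESSION",
    "expected_signals": ["S1", "S3"],
}

print(validation_errors(example))  # []
```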

&lt;p&gt;If you build agent systems and want early access to the benchmark schema or the eval harness, email me: &lt;a href="mailto:lathajaswanth7@gmail.com"&gt;lathajaswanth7@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to contribute a trajectory sample before July 9: AUTHORING_GUIDE.md is in the repo. Schema validation is automated. A well-formed sample takes about 15 minutes to write.&lt;/p&gt;

&lt;p&gt;The one-sentence version: trajectory governance is the layer that agent security has been missing, and the benchmark is how we make it measurable.&lt;/p&gt;

&lt;p&gt;Jaswanth is the founder of Aegis AI. The V3 governance engine (11 rings, 6 regulation plugins, 97 clauses) is the production infrastructure Ring 12 is being bolted onto.&lt;/p&gt;

&lt;p&gt;GitHub: github.com/Alkur123 · Email: &lt;a href="mailto:lathajaswanth7@gmail.com"&gt;lathajaswanth7@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
