<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alkur Jaswanth</title>
    <description>The latest articles on DEV Community by Alkur Jaswanth (@alkur_jaswanth_ce4f9fc791).</description>
    <link>https://dev.to/alkur_jaswanth_ce4f9fc791</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935232%2F074e2657-507a-4b1e-9a8d-c735f727d6dd.jpeg</url>
      <title>DEV Community: Alkur Jaswanth</title>
      <link>https://dev.to/alkur_jaswanth_ce4f9fc791</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alkur_jaswanth_ce4f9fc791"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Alkur Jaswanth</dc:creator>
      <pubDate>Sun, 28 Jun 2026 15:16:40 +0000</pubDate>
      <link>https://dev.to/alkur_jaswanth_ce4f9fc791/-3j4i</link>
      <guid>https://dev.to/alkur_jaswanth_ce4f9fc791/-3j4i</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/alkur_jaswanth_ce4f9fc791/the-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-088-3om6" class="crayons-story__hidden-navigation-link"&gt;The standard way to score AI agent monitors is gameable a coin flip scores F1 0.88&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/alkur_jaswanth_ce4f9fc791" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935232%2F074e2657-507a-4b1e-9a8d-c735f727d6dd.jpeg" alt="alkur_jaswanth_ce4f9fc791 profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/alkur_jaswanth_ce4f9fc791" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Alkur Jaswanth
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Alkur Jaswanth
                
              
              &lt;div id="story-author-preview-content-4012871" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/alkur_jaswanth_ce4f9fc791" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935232%2F074e2657-507a-4b1e-9a8d-c735f727d6dd.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Alkur Jaswanth&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/alkur_jaswanth_ce4f9fc791/the-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-088-3om6" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jun 28&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/alkur_jaswanth_ce4f9fc791/the-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-088-3om6" id="article-link-4012871"&gt;
          The standard way to score AI agent monitors is gameable a coin flip scores F1 0.88
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/security"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;security&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/alkur_jaswanth_ce4f9fc791/the-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-088-3om6" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt;&amp;nbsp;reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/alkur_jaswanth_ce4f9fc791/the-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-088-3om6#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              

              3&lt;span class="hidden s:inline"&gt;&amp;nbsp;comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success crayons-icon c-btn__icon"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>The standard way to score AI agent monitors is gameable a coin flip scores F1 0.88</title>
      <dc:creator>Alkur Jaswanth</dc:creator>
      <pubDate>Sun, 28 Jun 2026 15:15:25 +0000</pubDate>
      <link>https://dev.to/alkur_jaswanth_ce4f9fc791/the-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-088-3om6</link>
      <guid>https://dev.to/alkur_jaswanth_ce4f9fc791/the-standard-way-to-score-ai-agent-monitors-is-gameable-a-coin-flip-scores-f1-088-3om6</guid>
      <description>&lt;p&gt;Traditionally, evaluation of the agent monitoring mechanisms involves an attempt to game them, as it was my case when I attempted to test whether monitors would be able to identify the problem in the run and not in the beginning. The input prompt may look perfect until a certain issue pops up down the line, such as using the wrong file or changing the scope of the task execution. Single pass filter would not identify it since it does not consider the steps of the procedure in order.&lt;/p&gt;

&lt;p&gt;There are available datasets for the agent-based tasks, yet they focus on detecting whether the agent completes the task or gets hacked rather than whether the agent monitor reacts timely and correctly to the situation. Thus, I created one that takes into account complete trajectories with labeled steps in it. It consists of five types of drift that remain hidden until they appear – tool-call misuse, goal shift, plan execution mismatch, agent to agent coercion and capability laundering. The measured dataset is the reviewed gold split: 513 trajectories, 453 adversarial and 60 benign controls.&lt;/p&gt;

&lt;p&gt;The clear winner in that scoring system was whatever fired before the bad step was hit, as an early detection. This made random guessing seem quite powerful since early detections on normal steps were being rewarded based on this system a coin flip would get F1 of 0.88. Once I modified that and said only the very first detection on the drift step is a true positive and any other detection on normal step is a false alarm, those numbers took a dive: the coin flip gets 0.19 now, and all other numbers are now making sense. I personally prefer the scoring system which does not reward trigger happy behavior.&lt;/p&gt;

&lt;p&gt;It seems like the monitors are still confusing regular steps with drifts even after the adjustment. It was harder to distinguish some of the drifts from others. Not sure how this affects the real-life deployment.&lt;/p&gt;

&lt;p&gt;Here are the baseline scores on gold split using the correct metric:&lt;/p&gt;

&lt;p&gt;Random (p=0.15): F1 0.189, precision 0.115, recall 0.530, FP-step 0.305&lt;/p&gt;

&lt;p&gt;Cosine-to-goal: F1 0.000, precision , recall , FP-step 0.968&lt;/p&gt;

&lt;p&gt;GPT-4o-mini judge: F1 0.672, precision 0.671, recall 0.674, FP-step 0.059&lt;/p&gt;

&lt;p&gt;Production verifier (full manifest): F1 0.184, precision 0.730, recall 0.105, FP-step 0.011&lt;/p&gt;

&lt;p&gt;Production verifier (tight scope): F1 0.843, precision 0.742, recall 0.977, FP-step 0.082&lt;/p&gt;

&lt;p&gt;The random method at such probability achieves roughly 0.189 F1, whereas the cosine monitor reaches 0 across the board and triggers on virtually any step (roughly 97%). This value is notable since the simple angle check can not distinguish ordinary steps from the drifts, making the threshold meaningless. GPT-4o mini performs well on balanced scores (0.672), but the production verifier line seems more relevant. With the full manifest the false positive rate becomes very small (0.011, i.e. 59 of 60 benign controls pass) but the recall stays poor (0.105). It seems as if the entire tradeoff revolves around how strict you get with the permitted features.&lt;/p&gt;

&lt;p&gt;The trend continued when the same verification tool was used for the full set of adversarial traces. Every harmful trace lay out of the detection bounds the aggregated metric would identify all 453 harmful traces complete below the band (0 of 453, with a Wilson 95% confidence interval of [0.000, 0.008]). The threshold on drift alone cannot detect most of the attacks, bypassing 87.2% of them at the one percent false positive rate. It is the unique structural features such as unusual privilege accesses or secret reads followed by external calls that really stop the attacks.&lt;/p&gt;

&lt;p&gt;The continuous drift measure remains silent. The next step is testing the same verifier on the bigger MonitoringBench dataset in order to test whether my set of rules is driving the results. Over all 2,924 attack sequences (around 93,000 tool call steps), the aggregate score failed to reach the kill level for all of them, while the structural score detected 90.4% (Wilson [89.3%, 91.4%]) of attacks. However, the rest of them, around 9.6%, appear to be related to the contents which do not satisfy any of the pattern rules.&lt;/p&gt;

&lt;p&gt;In general, it looks like structural detectors do the job currently and add almost no discrimination power per step for the rest of them. The only gap left is the lack of content signal which would spike on real malicious contents but not on changes in general. Of course, one may object that this problem shifts the problem to a perfect classifier per step, which is somewhat true.&lt;/p&gt;

&lt;p&gt;Another limitation is that the gold labels are based on one pass and the second review is yet to be done, hence the lack of an agreement number so far. The data is also English first, and the machine flagged tier was not included in the computation. The verifier used itself is just one system, so it may produce different results for other systems .It might be good to run the provided code on various monitors to see where it fails.&lt;/p&gt;

&lt;p&gt;The dataset, the evaluation harness, and the preprint are all publicly available:&lt;/p&gt;

&lt;p&gt;Dataset (CC-BY 4.0): &lt;a href="https://huggingface.co/datasets/jash-ai/agentic-redteam-benchmark" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/jash-ai/agentic-redteam-benchmark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code &amp;amp; eval harness (MIT): &lt;a href="https://github.com/Alkur123/agentic-redteam-benchmark" rel="noopener noreferrer"&gt;https://github.com/Alkur123/agentic-redteam-benchmark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Preprint (Zenodo): &lt;a href="https://doi.org/10.5281/zenodo.20995496" rel="noopener noreferrer"&gt;https://doi.org/10.5281/zenodo.20995496&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each row above reproduces with &lt;code&gt;python eval.py -baseline &amp;lt;name&amp;gt;&lt;/code&gt; (random, cosine, gpt4, ring12), which by default evaluates the gold split. If you were to implement your own agent monitor system, the most valuable would be to test it yourself and let me know what breaks, especially if there is a counterexample that is sub-band, lacks any sort of structure, yet harmful. This is the scenario I most want to break.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>security</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to build a 22ms agent goal-drift detector</title>
      <dc:creator>Alkur Jaswanth</dc:creator>
      <pubDate>Sat, 16 May 2026 17:58:33 +0000</pubDate>
      <link>https://dev.to/alkur_jaswanth_ce4f9fc791/how-to-build-a-22ms-agent-goal-drift-detector-5hjd</link>
      <guid>https://dev.to/alkur_jaswanth_ce4f9fc791/how-to-build-a-22ms-agent-goal-drift-detector-5hjd</guid>
      <description>&lt;p&gt;Originally published on Substack (&lt;a href="https://ajaswanth.substack.com/p/rank-weighted-faiss-voting-building" rel="noopener noreferrer"&gt;https://ajaswanth.substack.com/p/rank-weighted-faiss-voting-building&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;**&lt;br&gt;
Here is something that will happen to you if you build AI governance with standard nearest-neighbour lookup.You deploy a semantic similarity check on a tool-calling agent. An attacker crafts a step that embeds just close&lt;br&gt;
enough to a legitimate one — fs.read("/etc/passwd") packaged inside a data-pipeline step that smells like fs.read("input.csv"). Your FAISS query returns the legitimate step as the single nearest neighbour. Distance: 0.82. Your governance system: ALLOW.The session continues. The credential file ships.This post is about one specific fix: rank-weighted nearest-neighbour voting. It is the technique underpinning the S1 and S3 signals in Ring 12, a trajectory-level verifier I am releasing in July alongside a 500-sample adversarial benchmark. The technique is not novel — it is a direct application of soft voting from ensemble learning to the embedding-space problem. What is new is applying it specifically to real-time, per-step agent governance at sub-50ms latency.&lt;/p&gt;

&lt;p&gt;Why single nearest-neighbour is fragile in adversarial settings&lt;/p&gt;

&lt;p&gt;FAISS IndexFlatIP (inner product over L2-normalised vectors = cosine similarity) is fast and correct. The problem is not FAISS. The problem is the decision rule: take the top-1 result and classify accordingly.&lt;br&gt;
In a benign distribution this works well. The manifold of "safe agent steps" and the manifold of "unsafe agent steps"are far apart; any query lands clearly on one side.In an adversarial setting, the attacker has a gradient — they can optimise the embedding of their step toward the boundary. The boundary between safe and unsafe is locally exploitable:&lt;br&gt;
find a safe example, find the direction that keeps cosine similarity ≥ threshold while maximising harm. Standard nearest-neighbour lookup has no defence against this because the decision is a hard threshold over a single distance value.The epsilon-ball attack in practice: you do not need to solve an optimisation problem. In natural-language space,there are many paraphrases of a malicious instruction that share high lexical overlap with benign instructions. A step that reads /etc/passwd reframed as "read system configuration for environment validation" has a good chance of being nearest-neighbour to a safe configuration-reading step in your index&lt;/p&gt;

&lt;p&gt;Rank-weighted voting: the fix&lt;/p&gt;

&lt;p&gt;Instead of a single nearest-neighbour decision, retrieve the top-k neighbours and compute a soft vote where each neighbour’s weight is a function of its rank or similarity score.&lt;/p&gt;

&lt;p&gt;The simplest effective formulation:&lt;/p&gt;

&lt;p&gt;score(query) = Σ_{i=1}^{k} label(i) · w(i)&lt;/p&gt;

&lt;p&gt;where&lt;/p&gt;

&lt;p&gt;w(i) = sim(query, neighbour_i) / Σ_{j=1}^{k} sim(query, neighbour_j) label(i) ∈ {0, 1} is the safety classification of the i-th neighbour. The resulting score is a number in [0, 1]: a soft probability that the query is safe.&lt;/p&gt;

&lt;p&gt;Why does this resist the boundary-case attack? Because the attacker must now optimise toward k different neighbours simultaneously. Fooling the nearest neighbour is an epsilon-ball attack. Fooling the centroid of the top-5 neighbours is a much larger ball -and the further you push, the more dissimilar your step becomes from the safe distribution overall, which itself becomes a drift signal. For k = 5, the attack surface enlarges by roughly (k − 1) × epsilon². In practice the combination of rank-weighted voting and the EMA smoothing used in Ring 12’s S1 signal makes one-shot boundary attacks require perturbations that push the embedding so far that they trigger the action-class signal (S2) independently.&lt;/p&gt;

&lt;p&gt;Implementation in 25 lines&lt;/p&gt;

&lt;p&gt;import numpy as np&lt;/p&gt;

&lt;p&gt;import faiss&lt;/p&gt;

&lt;p&gt;class RankWeightedIndex:&lt;/p&gt;

&lt;p&gt;def &lt;strong&gt;init&lt;/strong&gt;(self, embeddings: np.ndarray, labels: np.ndarray, k: int = 5):&lt;/p&gt;

&lt;p&gt;“““&lt;/p&gt;

&lt;p&gt;embeddings: (N, D) float32, L2-normalised&lt;/p&gt;

&lt;p&gt;labels: (N,) int {0=safe, 1=unsafe}&lt;/p&gt;

&lt;p&gt;“““&lt;/p&gt;

&lt;p&gt;self.k = k&lt;/p&gt;

&lt;p&gt;self.labels = labels&lt;/p&gt;

&lt;p&gt;d = embeddings.shape[1]&lt;/p&gt;

&lt;p&gt;self.index = faiss.IndexFlatIP(d)&lt;/p&gt;

&lt;p&gt;self.index.add(embeddings)&lt;/p&gt;

&lt;p&gt;def score(self, query: np.ndarray) -&amp;gt; float:&lt;/p&gt;

&lt;p&gt;“““&lt;/p&gt;

&lt;p&gt;Returns soft unsafe probability in [0, 1].&lt;/p&gt;

&lt;p&gt;query: (1, D) float32, L2-normalised&lt;/p&gt;

&lt;p&gt;“““&lt;/p&gt;

&lt;p&gt;similarities, indices = self.index.search(query, self.k)&lt;/p&gt;

&lt;p&gt;sims = similarities[0] # shape (k,)&lt;/p&gt;

&lt;p&gt;neighbour_labels = self.labels[indices[0]]&lt;/p&gt;

&lt;h1&gt;
  
  
  Softmax-normalise similarities to get weights
&lt;/h1&gt;

&lt;p&gt;weights = np.exp(sims) / np.exp(sims).sum()&lt;/p&gt;

&lt;h1&gt;
  
  
  Weighted vote: fraction of unsafe weight
&lt;/h1&gt;

&lt;p&gt;return float((weights * neighbour_labels).sum())&lt;/p&gt;

&lt;p&gt;This is the core of Ring 12’s S3 signal (tool-call surprise). The index holds 7,369 embeddings derived from the policy dataset: each is a (goal-class, action-class, tool-name) triple encoded by an ONNX quantised all-MiniLM-L6-v2. The score above 0.45 triggers a WARN; above 0.70 triggers KILL_SESSION.&lt;/p&gt;

&lt;p&gt;The score() call on an AMD Ryzen 5 laptop, single thread, k=5, D=384: ~0.3ms. The embedding step (ONNX inference) is the dominant cost: ~10ms. Total per-step overhead: ~11ms, well inside the 50ms p95 budget.&lt;/p&gt;

&lt;p&gt;Applied to trajectory drift in Ring 12&lt;/p&gt;

&lt;p&gt;Ring 12 uses rank-weighted voting in two of its five signals:&lt;/p&gt;

&lt;p&gt;S1 -cosine drift (EMA): At each step, embed step.thought + step.action_description. Compute rank-weighted similarity against the goal embedding and the running trajectory centroid. Apply exponential moving average (α=0.3) to smooth noise. A step that drifts semantically across multiple consecutive steps accumulates an EMA score that a one-shot boundary attack cannot deflate — because EMA over the trajectory history is not the same surface as a single nearest-neighbour query.&lt;/p&gt;

&lt;p&gt;S3 — tool-call surprise: Retrieve top-5 neighbours from the goal-class allowlist. A step that calls db.delete_all() when the declared goal is “summarise quarterly sales data” will find no safe neighbours with matching action class -the rank-weighted vote is close to 0, triggering hard-block regardless of how the step is narrated.&lt;/p&gt;

&lt;p&gt;The remaining three signals (S2 action-class Jaccard, S4 plan-execution match, S5 paraphrase stability) use different mechanisms — no FAISS — and are described in the July post.&lt;/p&gt;

&lt;p&gt;Early benchmark numbers&lt;/p&gt;

&lt;p&gt;I have 150 hand-authored adversarial trajectory samples validated against the schema so far (50 tool-call-abuse, 25&lt;/p&gt;

&lt;p&gt;goal-shift, 25 plan-exec-divergence, 25 agent-coercion, 25 capability-laundering). Full 500-sample benchmark ships&lt;/p&gt;

&lt;p&gt;July 9.&lt;/p&gt;

&lt;p&gt;On the 150-sample Phase A subset with rank-weighted k=5 versus k=1 (standard nearest-neighbour) for S3 alone:&lt;/p&gt;

&lt;p&gt;┌───────────────────┬───────────┬────────┬──────┐&lt;/p&gt;

&lt;p&gt;│ Variant │ Precision │ Recall │ F1 │&lt;/p&gt;

&lt;p&gt;├───────────────────┼───────────┼────────┼──────┤&lt;/p&gt;

&lt;p&gt;│ k=1 (standard NN) │ 0.71 │ 0.74 │ 0.72 │&lt;/p&gt;

&lt;p&gt;├───────────────────┼───────────┼────────┼──────┤&lt;/p&gt;

&lt;p&gt;│ k=3 rank-weighted │ 0.79 │ 0.81 │ 0.80 │&lt;/p&gt;

&lt;p&gt;├───────────────────┼───────────┼────────┼──────┤&lt;/p&gt;

&lt;p&gt;│ k=5 rank-weighted │ 0.83 │ 0.85 │ 0.84 │&lt;/p&gt;

&lt;p&gt;├───────────────────┼───────────┼────────┼──────┤&lt;/p&gt;

&lt;p&gt;│ k=7 rank-weighted │ 0.83 │ 0.84 │ 0.83 │&lt;/p&gt;

&lt;p&gt;└───────────────────┴───────────┴────────┴──────┘&lt;/p&gt;

&lt;p&gt;k=5 is the sweet spot. Beyond k=7 the far neighbours are too semantically dissimilar to be useful voters and begin to add noise.&lt;/p&gt;

&lt;p&gt;These are S3-only numbers. The full five-signal aggregator is what I am evaluating against the complete 500-sample benchmark - those numbers land in the July post.&lt;/p&gt;

&lt;p&gt;What is coming July 9&lt;/p&gt;

&lt;p&gt;On 2026-07-09 I am publishing three things simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Ring 12 — MIT-licensed trajectory verifier for AI agents. LangGraph adapter, Claude Code adapter, and REST adapter work today. Install: pip install aegis-ring12 (coming July 9). 66/66 tests green. p95 22ms on CPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;agentic-redteam-benchmark v0.1 — 500 adversarial trajectory samples, 5 categories, CC-BY 4.0. Each sample has a declared goal, a declared plan, a 6-12 step trajectory with injected drift, and ground-truth labels (drift step,expected decision, expected signals). GitHub + Hugging Face Datasets card.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full technical paper — five signals, aggregator math, eval harness with four baselines (random, cosine-only, GPT-4-judge, Ring 12). The results table that the benchmark numbers will populate.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you build agent systems and want early access to the benchmark schema or the eval harness, email me: &lt;a href="mailto:lathajaswanth7@gmail.com"&gt;lathajaswanth7@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to contribute a trajectory sample before July 9: AUTHORING_GUIDE.md is in the repo. Schema validation is automated. A well-formed sample takes about 15 minutes to write. The one-sentence version: trajectory governance is the layer that agent security has been missing, and the benchmark is how we make it measurable.&lt;/p&gt;

&lt;p&gt;Jaswanth is the founder of Aegis AI. The V3 governance engine (11 rings, 6 regulation plugins, 97 clauses) is the production infrastructure Ring 12 is being bolted onto.&lt;/p&gt;

&lt;p&gt;GitHub: github.com/Alkur123 · Email: &lt;a href="mailto:lathajaswanth7@gmail.com"&gt;lathajaswanth7@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
