<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vinay</title>
    <description>The latest articles on DEV Community by Vinay (@vinayiitkgp).</description>
    <link>https://dev.to/vinayiitkgp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F716870%2Ff9eba894-3ddb-48c0-b59d-a4efbc7434e6.png</url>
      <title>DEV Community: Vinay</title>
      <link>https://dev.to/vinayiitkgp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vinayiitkgp"/>
    <language>en</language>
    <item>
      <title>Why Self-Hosted Claude Code Was 15x Slower Than It Should Be</title>
      <dc:creator>Vinay</dc:creator>
      <pubDate>Sun, 07 Jun 2026 01:55:40 +0000</pubDate>
      <link>https://dev.to/vinayiitkgp/why-self-hosted-claude-code-was-15-slower-than-it-should-be-3pb4</link>
      <guid>https://dev.to/vinayiitkgp/why-self-hosted-claude-code-was-15-slower-than-it-should-be-3pb4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (2026-05-14).&lt;/strong&gt; The SimpleEngine prefix-cache patch described in&lt;br&gt;
Finding #2 is now upstream as&lt;br&gt;
&lt;a href="https://github.com/waybarrios/vllm-mlx/pull/523" rel="noopener noreferrer"&gt;vllm-mlx PR #523&lt;/a&gt;, merged.&lt;br&gt;
If you're on a recent vllm-mlx build, the fix is already there — no local&lt;br&gt;
patching required. The walk-through below is still useful for understanding&lt;br&gt;
what the patch does and why it was needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update (2026-05-18) — two more sharp edges if you're running this for real:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't use strict &lt;code&gt;json_schema&lt;/code&gt; response_format against sparse-MoE Coder&lt;br&gt;
models.&lt;/strong&gt; If you also run LangChain (or any OpenAI-compatible client) with&lt;br&gt;
structured outputs against the same vllm-mlx instance, prefer&lt;br&gt;
&lt;code&gt;with_structured_output(schema, method="json_mode")&lt;/code&gt; over the LangChain&lt;br&gt;
default &lt;code&gt;"json_schema"&lt;/code&gt;. The strict path triggers grammar-constrained&lt;br&gt;
decoding which has hung on Qwen3-Coder-30B-A3B for 5+ minutes per call —&lt;br&gt;
and a wedged decoder starves every queued request, including your Claude&lt;br&gt;
Code session, until the server restarts. Filed upstream as&lt;br&gt;
&lt;a href="https://github.com/waybarrios/vllm-mlx/issues/546" rel="noopener noreferrer"&gt;vllm-mlx#546&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PR #523 fixes the single-slot system-KV cache. You probably also want a&lt;br&gt;
multi-slot variant.&lt;/strong&gt; Claude Code sub-agents (Explore, Plan,&lt;br&gt;
general-purpose) carry different tool sets, so each one's system prefix&lt;br&gt;
differs from the main agent's. With a single-slot snapshot, every&lt;br&gt;
sub-agent dispatch evicts the main agent's cache and vice versa, and you&lt;br&gt;
pay the full ~28K-token cold prefill every turn. The multi-slot LRU&lt;br&gt;
follow-up is local for now, with an upstream PR pending.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I run &lt;a href="https://docs.anthropic.com/claude/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; against a&lt;br&gt;
self-hosted &lt;a href="https://github.com/waybarrios/vllm-mlx" rel="noopener noreferrer"&gt;vllm-mlx&lt;/a&gt; backend on a Mac&lt;br&gt;
Studio. Cold turns took ~108 seconds. Follow-ups took &lt;em&gt;almost as long&lt;/em&gt;, even&lt;br&gt;
though the system prompt was byte-stable and any LLM engine worth its salt should&lt;br&gt;
be caching the prefix.&lt;/p&gt;

&lt;p&gt;Two findings, &lt;strong&gt;both required&lt;/strong&gt; to get the speedup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code injects a rotating &lt;code&gt;x-anthropic-billing-header&lt;/code&gt; value into the
system block on every turn.&lt;/strong&gt; Even though the user-visible system prompt
doesn't change, the bytes the engine hashes for cache lookup &lt;em&gt;do&lt;/em&gt; change every
request. The prefix cache misses 100% of the time. Strip the header at the
proxy layer and the cache becomes useful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vllm-mlx's &lt;code&gt;SimpleEngine&lt;/code&gt; doesn't carry KV state across requests.&lt;/strong&gt; Even with
the rotating header gone, you have to patch SimpleEngine to actually &lt;em&gt;cache&lt;/em&gt;
the system prefix between turns — a small single-slot, hash-keyed cache that
restores the snapshot on a hit and prefills only the suffix.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Together: &lt;strong&gt;108-second turns → 7-8 second follow-ups. A 13-15× speedup&lt;/strong&gt;, on the&lt;br&gt;
same hardware, with the same model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;108s → 7-8s&lt;/h2&gt;
warm-turn wall-clock, before vs after
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;13-15×&lt;/h2&gt;
follow-up speedup, same hardware + model
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;81 bytes&lt;/h2&gt;
of rotating header text that was costing 100+s/turn
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    CC[Claude Code CLI] --&amp;gt;|/v1/messages&amp;lt;br/&amp;gt;system + tools + msgs&amp;lt;br/&amp;gt;+ rotating cch=...| CCR[claude-code-router]
    CCR --&amp;gt; Shim["Shim&amp;lt;br/&amp;gt;&amp;lt;b&amp;gt;(1) strips x-anthropic-billing-header&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;(2) buffers tool-call streams"]
    Shim --&amp;gt;|byte-stable&amp;lt;br/&amp;gt;system prefix| VLLM[vllm-mlx server]
    VLLM --&amp;gt; SE["SimpleEngine&amp;lt;br/&amp;gt;&amp;lt;b&amp;gt;(3) system-prefix KV cache&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;HIT: skip prefill&amp;lt;br/&amp;gt;MISS: prefill + snapshot"]
    SE --&amp;gt;|stream tokens| CC
    style Shim fill:#1e40af,color:#fff
    style SE fill:#7c2d12,color:#fff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Of the three numbered points, (1) and (3) are where the speedup comes from; (2) is there for tool-call correctness, not speed. Remove (1) and (3)&lt;br&gt;
and you're back to 100+ second turns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend: vllm-mlx serving &lt;code&gt;Qwen2.5-Coder-32B-Instruct-8bit&lt;/code&gt; on a Mac Studio (96 GB).&lt;/li&gt;
&lt;li&gt;Front door: a small FastAPI shim that exposes Anthropic's &lt;code&gt;/v1/messages&lt;/code&gt; API and proxies to vllm-mlx.&lt;/li&gt;
&lt;li&gt;Routing: &lt;a href="https://github.com/musistudio/claude-code-router" rel="noopener noreferrer"&gt;&lt;code&gt;claude-code-router&lt;/code&gt;&lt;/a&gt; translates Claude Code's outbound calls to the shim's URL with a bearer token.&lt;/li&gt;
&lt;li&gt;Client: Claude Code, the CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;End-to-end the architecture worked. Tool calling worked. Streaming worked. Output&lt;br&gt;
quality was fine. It was just &lt;em&gt;slow&lt;/em&gt; — and slow in a way that didn't match how any&lt;br&gt;
of this is supposed to behave.&lt;/p&gt;

&lt;p&gt;For context: Claude Code's prompts are large. Measured across captured requests&lt;br&gt;
on this setup, the cacheable prefix — Claude Code's system instructions plus the&lt;br&gt;
tool-definitions block — runs around &lt;strong&gt;23,000 tokens&lt;/strong&gt; (≈5.6K system + ≈17.6K&lt;br&gt;
tools, for a 23-tool toolset). With a working prefix cache, only the new user&lt;br&gt;
message and the conversation tail need to be processed each turn — typically a&lt;br&gt;
few hundred tokens. Without one, the engine re-prefills ~23K tokens every.&lt;br&gt;
single. turn. On a 32K-context model, that leaves about 9K headroom for the&lt;br&gt;
conversation and output, which is fine — but only if you're not throwing away the&lt;br&gt;
prefix work each turn.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I expected vs what I observed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Cold turn&lt;/th&gt;
&lt;th&gt;Warm turn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stock vllm-mlx, no shim&lt;/td&gt;
&lt;td&gt;108 s&lt;/td&gt;
&lt;td&gt;~100 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shim strips billing header only&lt;/td&gt;
&lt;td&gt;105 s&lt;/td&gt;
&lt;td&gt;~70 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shim strips header + SimpleEngine KV-cache patch&lt;/td&gt;
&lt;td&gt;108 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7-8 s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cold-turn number stays at roughly 105-108 s in every row, because there's no cache to hit on the first&lt;br&gt;
request. The warm-turn delta is the whole story.&lt;/p&gt;
&lt;h2&gt;
  
  
  Finding #1: the rotating billing header
&lt;/h2&gt;

&lt;p&gt;The first useful diagnostic was diffing the raw bytes of two consecutive&lt;br&gt;
&lt;code&gt;/v1/messages&lt;/code&gt; requests from Claude Code. Almost everything was identical: system&lt;br&gt;
prompt, tool definitions, conversation history, sampling params. But there was&lt;br&gt;
one block in the system list that changed every turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"type": "text",
 "text": "x-anthropic-billing-header: cc_version=...; cc_entrypoint=cli; cch=&amp;lt;rotating-hash&amp;gt;"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code injects this. The &lt;code&gt;cch=&lt;/code&gt; value rotates per request — Anthropic uses&lt;br&gt;
it for billing and conversation tracking. On Anthropic's hosted API, the cache&lt;br&gt;
layer normalizes around it and there's no impact. On a self-hosted backend that&lt;br&gt;
simply hashes the prompt as-is, &lt;strong&gt;the rotating value invalidates the cache key on&lt;br&gt;
every request.&lt;/strong&gt; Every turn looks brand new to the engine, because every turn&lt;br&gt;
&lt;em&gt;is&lt;/em&gt; brand new.&lt;/p&gt;

&lt;p&gt;The fix at the shim is a one-function filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_billing_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Drop Claude Code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s `x-anthropic-billing-header` system block.

    Claude Code injects a small system block of the form
    `x-anthropic-billing-header: cc_version=...; cc_entrypoint=cli; cch=&amp;lt;hash&amp;gt;`
    whose `cch` value rotates every turn. Anthropic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s cloud uses it for billing
    tracking; local upstreams just see it as 81 bytes of system text. With our
    SimpleEngine prefix-KV cache, that rotating field changes the system-prefix
    hash each turn → every turn is a cache miss → 100s+ prefill on the
    ~23K-token system+tools prefix. Removing this block makes the system
    prefix byte-stable turn-over-turn so the cache actually hits.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lstrip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-anthropic-billing-header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;81 bytes of rotating text was costing 100+ seconds per turn. Not a great trade.&lt;/p&gt;
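
&lt;p&gt;A quick way to confirm the effect is to hash the system list with and without the billing block across two simulated turns. This is a minimal sketch, not the shim's code: &lt;code&gt;strip_billing_blocks&lt;/code&gt;, &lt;code&gt;prefix_key&lt;/code&gt;, and &lt;code&gt;make_turn&lt;/code&gt; are hypothetical helpers, the payload values are made up, and the JSON-plus-SHA-256 key only stands in for whatever the engine actually hashes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json


def strip_billing_blocks(system):
    """Return the system list without x-anthropic-billing-header text blocks."""
    return [
        block for block in system
        if not (
            isinstance(block, dict)
            and isinstance(block.get("text"), str)
            and block["text"].lstrip().lower().startswith("x-anthropic-billing-header")
        )
    ]


def prefix_key(system):
    """Hash the serialized system list (a stand-in for the engine's cache key)."""
    blob = json.dumps(system, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]


def make_turn(cch):
    # Hypothetical payload: one stable instruction block plus the rotating header.
    # cc_version=0 is a placeholder, not a real Claude Code version string.
    return [
        {"type": "text", "text": "You are a coding assistant."},
        {"type": "text",
         "text": f"x-anthropic-billing-header: cc_version=0; cc_entrypoint=cli; cch={cch}"},
    ]


turn_1, turn_2 = make_turn("aaaa"), make_turn("bbbb")

# With the header in place the two keys differ; with it stripped they agree.
raw_keys_match = prefix_key(turn_1) == prefix_key(turn_2)
stripped_keys_match = (
    prefix_key(strip_billing_blocks(turn_1)) == prefix_key(strip_billing_blocks(turn_2))
)
print(raw_keys_match, stripped_keys_match)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the same comparison on two captured requests from a real session is a cheap regression check if a future Claude Code release adds another rotating field.&lt;/p&gt;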

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note.&lt;/strong&gt; vllm-mlx PR #277 quietly does the same fix for the &lt;code&gt;/v1/messages&lt;/code&gt;&lt;br&gt;
endpoint. If you're on a recent build of vllm-mlx and using its native&lt;br&gt;
Anthropic adapter, you may already be covered. I run my own shim (for tool-call&lt;br&gt;
buffering on the Coder alias — vllm-mlx's Hermes parser streams tool JSON as&lt;br&gt;
content deltas, which doesn't round-trip cleanly to clients), so I had to&lt;br&gt;
strip the header myself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After this fix, warm turns dropped from ~100 s to ~70 s. A real win, but the&lt;br&gt;
prefix cache &lt;em&gt;should&lt;/em&gt; have been saving 95+ seconds, not 30. So either the cache&lt;br&gt;
wasn't engaging at all, or it was engaging only partially. Onward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #2: SimpleEngine wasn't actually caching the prefix
&lt;/h2&gt;

&lt;p&gt;vllm-mlx ships two engines — both MLX-native, neither is upstream vLLM's&lt;br&gt;
PagedAttention/CUDA core (which doesn't run on Apple Silicon at all).&lt;br&gt;
&lt;code&gt;engine/simple.py&lt;/code&gt; is &lt;em&gt;"Simple engine for maximum single-user throughput.&lt;br&gt;
Wraps mlx-lm directly with zero overhead for optimal performance when serving&lt;br&gt;
a single user at a time."&lt;/em&gt; &lt;code&gt;engine/batched.py&lt;/code&gt; is &lt;em&gt;"Batched engine for&lt;br&gt;
continuous batching with multiple concurrent users."&lt;/em&gt; For a single-user&lt;br&gt;
Claude Code session, SimpleEngine is the right pick — no scheduler, no&lt;br&gt;
batching wait, direct access to mlx-lm's prompt cache. BatchedEngine wins&lt;br&gt;
when multiple users hit the same backend concurrently.&lt;/p&gt;

&lt;p&gt;SimpleEngine was what I was using. Profiling showed prefill running across the&lt;br&gt;
full system + tool prefix on every turn, even after the billing header was&lt;br&gt;
gone. The cache hit rate was effectively zero.&lt;/p&gt;

&lt;p&gt;The reason: SimpleEngine's request handler doesn't carry KV state from the&lt;br&gt;
previous request to the next. Each request gets a fresh prompt cache via&lt;br&gt;
&lt;code&gt;make_prompt_cache(model)&lt;/code&gt; and prefills the whole prompt from scratch. There's&lt;br&gt;
no across-requests cache to hit — the prefix cache lives only inside a single&lt;br&gt;
request.&lt;/p&gt;

&lt;p&gt;The fix was a small patch: add a &lt;strong&gt;single-slot, hash-keyed system-prefix KV&lt;br&gt;
cache&lt;/strong&gt; to SimpleEngine. Detect the system prefix using the ChatML markers that&lt;br&gt;
delimit it, hash the prefix tokens, and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On a &lt;strong&gt;hit&lt;/strong&gt;, restore the saved KV snapshot and prefill only the suffix.&lt;/li&gt;
&lt;li&gt;On a &lt;strong&gt;miss&lt;/strong&gt;, prefill the system prefix in chunks, snapshot the resulting KV
state, and store it (overwriting the previous slot).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Excerpt from system_kv_cache_for_simple_engine.patch — a few of the load-bearing lines.
&lt;/span&gt;
&lt;span class="n"&gt;system_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prefix_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_hash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_system_kv_hash&lt;/span&gt;
    &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_system_kv_snapshot&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;system_token_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_system_kv_token_count&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cache_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;System KV cache HIT: reusing %d tokens, prefilling %d new (hash=%s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;system_token_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix_tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;system_hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;System KV cache MISS: will prefill %d system + %d suffix tokens (hash=%s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;system_token_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix_tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;system_hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# On HIT, restore the saved cache state and skip system prefill:
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache_hit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_prompt_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;saved_state&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_system_kv_snapshot&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;bc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;saved_state&lt;/span&gt;

&lt;span class="c1"&gt;# On MISS, prefill the system prefix in chunks, then snapshot:
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ... chunked prefill of system_tokens ...
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_system_kv_snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_system_kv_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;system_hash&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_system_kv_token_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;system_token_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few design choices that mattered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
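&lt;strong&gt;Single slot, not LRU.&lt;/strong&gt; A Claude Code session has one main conversation at a time,
so a single slot covers the common case. The slot just stores &lt;code&gt;(hash, snapshot, token_count)&lt;/code&gt;
and overwrites on miss. The exception is sub-agents, whose different tool sets give them different
system prefixes and evict the slot (see the 2026-05-18 update at the top).&lt;/li&gt;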
&lt;li&gt;
&lt;strong&gt;Hash the prefix only, not the full prompt.&lt;/strong&gt; That way the cache survives new
user messages on the tail end — which is the common case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatML marker detection.&lt;/strong&gt; The boundary between system and user is found by
searching the rendered prompt for &lt;code&gt;&amp;lt;|im_start|&amp;gt;user\n&lt;/code&gt; or
&lt;code&gt;&amp;lt;|im_start|&amp;gt;assistant\n&lt;/code&gt;. If neither marker is found, fall back to the
uncached path and don't break.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safe fallback on any exception.&lt;/strong&gt; If the cache-aware path fails for any
reason, log a warning and fall back to the original &lt;code&gt;stream_generate&lt;/code&gt;. Don't
let a perf optimization take down generation.&lt;/li&gt;
&lt;/ul&gt;
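
&lt;p&gt;The boundary detection and prefix hashing from the list above can be sketched on plain strings. This is an illustration under assumptions, not the merged patch: &lt;code&gt;split_system_prefix&lt;/code&gt; and &lt;code&gt;system_hash&lt;/code&gt; are hypothetical names, the sample prompts are made up, and &lt;code&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt; is the standard ChatML end marker rather than something quoted from the patch.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

IM_START = "&amp;lt;|im_start|&amp;gt;"
IM_END = "&amp;lt;|im_end|&amp;gt;"
# The earliest user or assistant marker ends the system prefix.
BOUNDARY_MARKERS = (IM_START + "user\n", IM_START + "assistant\n")


def split_system_prefix(rendered_prompt):
    """Return (prefix, suffix), or None when no boundary marker is present."""
    positions = [rendered_prompt.find(marker) for marker in BOUNDARY_MARKERS]
    positions = [pos for pos in positions if pos != -1]
    if not positions:
        return None  # caller falls back to the uncached path
    cut = min(positions)
    return rendered_prompt[:cut], rendered_prompt[cut:]


def system_hash(prefix_text):
    """Truncated SHA-256 of the prefix text, as in the excerpt above."""
    return hashlib.sha256(prefix_text.encode()).hexdigest()[:16]


system_block = IM_START + "system\nYou are a coding assistant." + IM_END + "\n"
turn_1 = system_block + IM_START + "user\nList the files." + IM_END + "\n"
turn_2 = (
    turn_1
    + IM_START + "assistant\nSure." + IM_END + "\n"
    + IM_START + "user\nNow run the tests." + IM_END + "\n"
)

prefix_1, _ = split_system_prefix(turn_1)
prefix_2, _ = split_system_prefix(turn_2)

# The conversation tail grew between turns, but the prefix and its key did not.
same_key = system_hash(prefix_1) == system_hash(prefix_2)
no_marker = split_system_prefix("plain text with no chat markers")
print(same_key, no_marker)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because only the text before the first marker is hashed, a longer conversation tail never changes the key, and a prompt without markers returns &lt;code&gt;None&lt;/code&gt; so the caller can take the uncached path.&lt;/p&gt;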

&lt;p&gt;The full patch is upstream as&lt;br&gt;
&lt;a href="https://github.com/waybarrios/vllm-mlx/pull/523" rel="noopener noreferrer"&gt;vllm-mlx PR #523&lt;/a&gt;. Review&lt;br&gt;
hardened the original cut: closure-local capture at the gate to close a&lt;br&gt;
TOCTOU race against the snapshot pointer, and an init-time probe that disables&lt;br&gt;
the cache for sliding-window models whose &lt;code&gt;RotatingKVCache&lt;/code&gt; aliases buffers the&lt;br&gt;
engine mutates in place. The merged code is the right reference to read if&lt;br&gt;
you're curious about the cache mechanics.&lt;/p&gt;
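
&lt;p&gt;PR #523 keeps a single slot. For the sub-agent case described in the 2026-05-18 update, a multi-slot variant could look roughly like the sketch below. It is a generic LRU built on &lt;code&gt;collections.OrderedDict&lt;/code&gt;, not the local follow-up patch: the class name &lt;code&gt;SystemKVSlots&lt;/code&gt;, the &lt;code&gt;max_slots&lt;/code&gt; default, the token counts, and the string snapshots are all placeholders chosen for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import OrderedDict


class SystemKVSlots:
    """Small LRU of system-prefix snapshots keyed by prefix hash."""

    def __init__(self, max_slots=4):
        self.max_slots = max_slots
        self._slots = OrderedDict()  # maps hash to (snapshot, token_count)

    def get(self, system_hash, token_count):
        entry = self._slots.get(system_hash)
        if entry is None or entry[1] != token_count:
            return None  # miss: caller prefills the prefix and calls put()
        self._slots.move_to_end(system_hash)  # mark as most recently used
        return entry[0]

    def put(self, system_hash, snapshot, token_count):
        if system_hash not in self._slots and len(self._slots) == self.max_slots:
            self._slots.popitem(last=False)  # drop the least recently used slot
        self._slots[system_hash] = (snapshot, token_count)
        self._slots.move_to_end(system_hash)


slots = SystemKVSlots(max_slots=2)
slots.put("main", "kv-main", 23000)
slots.put("explore", "kv-explore", 9000)
slots.get("main", 23000)             # touch main so explore becomes the oldest
slots.put("plan", "kv-plan", 8000)   # cache is full, so explore is evicted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each slot would hold a full KV snapshot in a real engine, so the slot count trades memory for hit rate; a handful of slots is enough to cover the main agent plus a few sub-agents.&lt;/p&gt;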

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;After both fixes were in place, warm-turn wall-clock dropped from ~70 s&lt;br&gt;
(billing-header fix alone) to &lt;strong&gt;7-8 s&lt;/strong&gt; (billing-header fix + SimpleEngine KV&lt;br&gt;
cache). The cold turn is unchanged — there's no prior turn to cache against on&lt;br&gt;
the first request — but the cache hit rate from turn 2 onward is essentially&lt;br&gt;
100%, and the speedup is large enough that Claude Code becomes interactive&lt;br&gt;
instead of glacial.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diff the inputs before profiling the engine.&lt;/strong&gt; The billing header would have&lt;br&gt;
fallen out of a 30-second &lt;code&gt;diff&lt;/code&gt; of two consecutive request bodies. I didn't run&lt;br&gt;
that diff for far too long — I was looking at vllm-mlx internals, profiling&lt;br&gt;
prefill, reading mlx-lm cache code, anything &lt;em&gt;but&lt;/em&gt; the actual bytes going over&lt;br&gt;
the wire. Once I finally did the diff, the rotating &lt;code&gt;cch=&lt;/code&gt; value was on the&lt;br&gt;
screen in five minutes.&lt;/p&gt;

&lt;p&gt;That has become a personal rule for any latency mystery on a black-box stack:&lt;br&gt;
&lt;strong&gt;capture two consecutive requests, diff them, look at what's &lt;em&gt;not&lt;/em&gt; stable&lt;br&gt;
before assuming the engine is misbehaving.&lt;/strong&gt; It would have saved me an evening&lt;br&gt;
on this one and I suspect it'll save me more.&lt;/p&gt;
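
&lt;p&gt;The rule can be applied with a short script that walks two captured request bodies and lists every JSON path whose value differs. This is a minimal sketch: &lt;code&gt;unstable_paths&lt;/code&gt; is a hypothetical helper and the two request dicts are made-up stand-ins for bodies you would load from captured files with &lt;code&gt;json.load&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def unstable_paths(a, b, path="$"):
    """Yield JSON paths whose values differ between two parsed request bodies."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            if key not in a or key not in b:
                yield f"{path}.{key}"  # key present on one side only
            else:
                yield from unstable_paths(a[key], b[key], f"{path}.{key}")
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield f"{path} (length {len(a)} vs {len(b)})"
        # Compare the overlapping items even when the lengths differ.
        for i, (x, y) in enumerate(zip(a, b)):
            yield from unstable_paths(x, y, f"{path}[{i}]")
    elif a != b:
        yield path


req_1 = {
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "stable instructions"},
        {"type": "text", "text": "cch=aaaa"},
    ],
}
req_2 = {
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "stable instructions"},
        {"type": "text", "text": "cch=bbbb"},
    ],
}

changed = list(unstable_paths(req_1, req_2))
print(changed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real session the &lt;code&gt;messages&lt;/code&gt; list grows every turn, so a length difference there is expected; anything reported under &lt;code&gt;system&lt;/code&gt; or &lt;code&gt;tools&lt;/code&gt; is the part worth chasing.&lt;/p&gt;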

&lt;p&gt;The second thing I'd change: the SimpleEngine cache patch should have come&lt;br&gt;
&lt;em&gt;after&lt;/em&gt; I'd quantified what the billing-header strip alone bought me. I lumped&lt;br&gt;
both fixes in the same session, which made it harder to attribute the speedup&lt;br&gt;
cleanly. The numbers in this post are reconstructed from a follow-up&lt;br&gt;
measurement; if I'd been disciplined the first time, I'd have had them ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  When you'd hit this
&lt;/h2&gt;

&lt;p&gt;You'll hit some version of this if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-host an Anthropic-compatible LLM backend (vllm-mlx, llama.cpp's Anthropic
adapter, a custom shim, etc.) and point Claude Code or another Anthropic-protocol
client at it.&lt;/li&gt;
&lt;li&gt;Notice that warm turns aren't faster than cold turns even though your system
prompt is byte-stable.&lt;/li&gt;
&lt;li&gt;See the engine's prefill phase running across the full prompt every turn in
profiling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're using Anthropic's hosted API, none of this applies — the platform&lt;br&gt;
handles the billing header and prefix caching transparently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing this
&lt;/h2&gt;

&lt;p&gt;The two pieces that make the speedup happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Billing-header strip&lt;/strong&gt; — about 15 lines of FastAPI shim code that filter the
rotating &lt;code&gt;x-anthropic-billing-header&lt;/code&gt; block out of the system list before the
payload reaches vllm-mlx. Identical logic to what
&lt;a href="https://github.com/waybarrios/vllm-mlx/pull/277" rel="noopener noreferrer"&gt;vllm-mlx PR #277&lt;/a&gt; does
natively on the &lt;code&gt;/v1/messages&lt;/code&gt; adapter; you only need a shim if you're not on
that path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SimpleEngine prefix-cache&lt;/strong&gt; — now upstream as
&lt;a href="https://github.com/waybarrios/vllm-mlx/pull/523" rel="noopener noreferrer"&gt;vllm-mlx PR #523&lt;/a&gt;. Read the
merged code if you want the cache mechanics; the load-bearing logic is the
hash check, the snapshot capture on miss, and the safe fallback when the
detection fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/waybarrios/vllm-mlx/pull/277" rel="noopener noreferrer"&gt;vllm-mlx PR #277&lt;/a&gt; found the&lt;br&gt;
billing-header issue independently for the &lt;code&gt;/v1/messages&lt;/code&gt; endpoint. If you're&lt;br&gt;
using vllm-mlx's native Anthropic adapter rather than your own shim, that's the&lt;br&gt;
right upstream fix. The SimpleEngine prefix-cache patch landed in&lt;br&gt;
&lt;a href="https://github.com/waybarrios/vllm-mlx/pull/523" rel="noopener noreferrer"&gt;vllm-mlx PR #523&lt;/a&gt; — thanks to&lt;br&gt;
the maintainers for the review, which improved the patch in two specific ways&lt;br&gt;
(closure-local capture against a TOCTOU on the snapshot pointer, and a&lt;br&gt;
sliding-window guard for &lt;code&gt;RotatingKVCache&lt;/code&gt;).&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've hit this too, or your self-hosted Claude Code setup is slow for a&lt;br&gt;
different reason I haven't found yet, I'd love to hear about it — reach me on&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/vinay-vobbilichetty" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or by&lt;br&gt;
&lt;a href="mailto:vinayvobbilichetty11@gmail.com"&gt;email&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>mac</category>
    </item>
    <item>
      <title>SOC-in-a-Box: One LLM, Eight Hats, A Production-Bar AI SOC on a Single GPU</title>
      <dc:creator>Vinay</dc:creator>
      <pubDate>Sun, 07 Jun 2026 01:55:20 +0000</pubDate>
      <link>https://dev.to/vinayiitkgp/soc-in-a-box-one-llm-eight-hats-a-production-bar-ai-soc-on-a-single-gpu-4jl4</link>
      <guid>https://dev.to/vinayiitkgp/soc-in-a-box-one-llm-eight-hats-a-production-bar-ai-soc-on-a-single-gpu-4jl4</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A real SOC runs 24×7 with eight or nine distinct roles — alert triage, deeper investigation, incident response, threat intel, detection tuning, hunting, shift management, and a human approver for any destructive action. We built an AI version of that whole org chart, coordinated over a Redis Streams bus, with &lt;strong&gt;one&lt;/strong&gt; local LLM (GLM-4.7-Flash on a Mac M1) wearing every hat. v1 is read-only against real systems; the only writes are XSOAR notes and Webex cards, plus a human-approval gate on every proposed containment action.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;8 roles&lt;/h2&gt;
Sentinel · Tier 2 · IR Lead · Threat Intel · SOC Manager · Detection Eng · Threat Hunter · HITL
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;1 LLM&lt;/h2&gt;
m1 GLM-4.7-Flash via vllm-mlx, with FailoverChatModel to a studio1 backup
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;0 writes&lt;/h2&gt;
to CrowdStrike, Tanium, Zscaler — agents &lt;em&gt;propose&lt;/em&gt;, humans &lt;em&gt;execute&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interesting parts aren't the agents themselves — there's nothing novel about an LLM-with-tools loop. The interesting parts are: (1) the architectural choices that let one local LLM serve a whole SOC org chart without melting, (2) the human-in-the-loop gate that makes "AI does containment" a real thing a security team will actually trust, and (3) a backtest harness that lets us put hard numbers on agent quality against real historical tickets before we hand the demo to leadership.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of the problem
&lt;/h2&gt;

&lt;p&gt;A SOC is not a chatbot. It's a 24×7 event-driven pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alerts land continuously&lt;/strong&gt;, not on demand. The system has to be running and consuming events even when nobody is asking it a question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roles are independent&lt;/strong&gt;. Tier 2 doesn't ask Tier 1 a question — it picks up the verdict Tier 1 already published and goes deeper. Threat Intel runs &lt;em&gt;after&lt;/em&gt; IR Lead, not as part of IR Lead's reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Some roles are reactive, some are periodic&lt;/strong&gt;. Sentinel reacts to each new ticket; SOC Manager produces a shift summary every 8 hours; Threat Hunter sweeps the audit log twice a day; Detection Engineer reviews noisy rules on weekday mornings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destructive actions need a human gate&lt;/strong&gt;. An AI that auto-isolates hosts at 3 AM will get unplugged in a month. The interesting question is: what does the handoff look like?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability is non-negotiable&lt;/strong&gt;. Every decision needs to be replayable for incident retros and tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is a consequence of those constraints, not the other way around.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture at a glance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TB
    XSOAR[XSOAR ticket feed] --&amp;gt; Sentinel
    subgraph Bus["Redis Streams bus (lab-vm1)"]
        direction LR
        STRG[soc.triage]
        SCAS[soc.cases]
        SAUD[soc.audit]
    end
    Sentinel[Sentinel / Tier 1&amp;lt;br/&amp;gt;alert triage] --&amp;gt; STRG
    STRG --&amp;gt; Tier2[Tier 2 Analyst&amp;lt;br/&amp;gt;deeper investigation]
    Tier2 --&amp;gt; SCAS
    SCAS --&amp;gt; IR[IR Lead&amp;lt;br/&amp;gt;SEV + containment plan]
    IR --&amp;gt; SCAS
    IR -.HITL action proposed.-&amp;gt; Flask[Flask HITL pages&amp;lt;br/&amp;gt;/soc-hitl/decide&amp;lt;br/&amp;gt;/soc-hitl/audit]
    Webex[Webex card buttons] -.click.-&amp;gt; Flask
    Flask --&amp;gt; SCAS
    SCAS --&amp;gt; TI[Threat Intel&amp;lt;br/&amp;gt;actor + MITRE]
    TI --&amp;gt; SCAS

    SAUD -.replay.-&amp;gt; SOCMgr[SOC Manager&amp;lt;br/&amp;gt;shift summaries&amp;lt;br/&amp;gt;06/14/22 EST]
    SAUD -.replay.-&amp;gt; DetEng[Detection Engineer&amp;lt;br/&amp;gt;rule tuning&amp;lt;br/&amp;gt;09:00 EST M-F]
    SAUD -.replay.-&amp;gt; Hunter[Threat Hunter&amp;lt;br/&amp;gt;pattern sweeps&amp;lt;br/&amp;gt;06/18 EST]
    STRG -.mirror.-&amp;gt; SAUD
    SCAS -.mirror.-&amp;gt; SAUD

    style Bus fill:#1e3a8a,color:#fff
    style Flask fill:#fef3c7,color:#92400e
    style SAUD fill:#7c2d12,color:#fff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three reactive roles (Tier 2 / IR Lead / Threat Intel) are long-running consumers on the bus. Three periodic roles (SOC Manager / Detection Engineer / Threat Hunter) are systemd timer units that wake up on a calendar schedule, replay the audit stream, and emit a report. The HITL surface is a Flask blueprint sitting next to the existing IR web app.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we considered
&lt;/h2&gt;

&lt;p&gt;The framework choice was load-bearing — once you pick wrong, every later abstraction fights you. We evaluated five paths:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CrewAI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/crewAIInc/crewAI" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt; is excellent at what it's designed for: a "crew" of role-shaped agents collaborating on one task. Declarative &lt;code&gt;Agent&lt;/code&gt; + &lt;code&gt;Task&lt;/code&gt; + &lt;code&gt;Process&lt;/code&gt; (sequential or hierarchical), with strong primitives for delegation between agents inside the same crew.&lt;/p&gt;

&lt;p&gt;The mismatch for a SOC: CrewAI assumes a single in-process orchestration run. &lt;em&gt;"Spin up a crew, run a task, get an output."&lt;/em&gt; Our roles aren't a crew — they're independent processes with their own uptime, audit, restart semantics, and HITL gates. CrewAI's human-in-the-loop is &lt;code&gt;human_input=True&lt;/code&gt; on a Task — a blocking stdin prompt during the crew run. That doesn't survive a "Webex card → Flask page → SQLite sidecar → bus event back into the cascade" flow. We'd lose audit-stream replay, backtest, and per-role systemd uptime if we forced this shape.&lt;/p&gt;

&lt;p&gt;Where CrewAI &lt;em&gt;could&lt;/em&gt; slot in: as the internal reasoning of a single role. e.g. inside Tier 2's &lt;code&gt;handle()&lt;/code&gt;, swap one LLM-with-tool-loop for a small crew (investigator + critic + decider) before emitting the &lt;code&gt;Tier2Analysis&lt;/code&gt; event. Same bus architecture, more sophisticated per-role thinking. Probably not worth it yet — Tier 2 with a critic-loop pattern works.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. AutoGen
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/autogen" rel="noopener noreferrer"&gt;AutoGen&lt;/a&gt; is conversation-shaped: agent-to-agent chat with an explicit &lt;code&gt;GroupChat&lt;/code&gt; manager. Great for "two LLMs argue and converge on an answer" — code-writer vs code-reviewer, advocate vs critic.&lt;/p&gt;

&lt;p&gt;The mismatch: a SOC isn't a conversation. Tier 2 doesn't &lt;em&gt;talk to&lt;/em&gt; Tier 1; it consumes Tier 1's verdict. The chat-history-as-state model imposes a context-window tax on a problem that doesn't need it, and the &lt;code&gt;GroupChat&lt;/code&gt; orchestrator becomes a load-bearing thing you can't restart independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Plain LangChain (no graph, no bus)
&lt;/h3&gt;

&lt;p&gt;The path of least resistance: write a Python function for each role, chain them together, run synchronously. We actually started here, then noticed the smell. The synchronous chain forces every role to wait for the previous one, rules out per-role restarts, makes HITL impossible without a hack, and gives you no audit log unless you build one separately.&lt;/p&gt;

&lt;p&gt;If you only have two roles, just do this. We had eight.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. n8n / Zapier / visual workflow tools
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://n8n.io/" rel="noopener noreferrer"&gt;n8n&lt;/a&gt; and similar visual workflow tools were on the list for one specific reason: leadership likes seeing the boxes-and-arrows. But the LLM nodes aren't first-class — you'd be wrapping every model call in HTTP, and the graph is in a database, not in code that's reviewable in a PR. Auditability and reproducibility are both worse than the LangGraph + bus path. (n8n is a great fit for non-LLM SOAR-style automations, just not for this.)&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Build-from-scratch asyncio + Redis Streams
&lt;/h3&gt;

&lt;p&gt;The honest baseline. Python &lt;code&gt;asyncio&lt;/code&gt; workers, Redis Streams consumer groups, no agent framework. Saves you the framework abstraction, costs you the prompt + tool-loop + state-management plumbing that LangGraph and friends do for free. For a one-role POC, fine. For eight roles, you reinvent LangChain badly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we picked: LangGraph for per-role reasoning + Redis Streams for inter-role coordination
&lt;/h3&gt;

&lt;p&gt;LangGraph gives us a clean per-role tool-loop with explicit state, and Redis Streams gives us the inter-role coordination — durable events, consumer groups for at-least-once delivery, an audit stream that's just another consumer, and the easy retrofit of new roles without touching existing ones.&lt;/p&gt;

&lt;p&gt;The split is the point: &lt;strong&gt;LangGraph is the agent runtime, the bus is the org chart.&lt;/strong&gt; Don't conflate them.&lt;/p&gt;
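
&lt;p&gt;As a rough illustration of the bus side of that split, here is a minimal sketch of one role publishing an event and another consuming it with acknowledge-after-handle semantics. It assumes a redis-py style client (created with &lt;code&gt;decode_responses=True&lt;/code&gt;) is passed in, and uses only its &lt;code&gt;xadd&lt;/code&gt;, &lt;code&gt;xreadgroup&lt;/code&gt;, and &lt;code&gt;xack&lt;/code&gt; methods. The helper names and the &lt;code&gt;type&lt;/code&gt;/&lt;code&gt;payload&lt;/code&gt; field layout are illustrative assumptions, not the project's actual bus wrapper.&lt;/p&gt;

```python
import json


def publish(client, stream, event_type, payload):
    """Append one event to a stream and return the entry ID Redis assigned."""
    # Field layout (type + JSON payload) is an assumption for this sketch.
    fields = {"type": event_type, "payload": json.dumps(payload)}
    return client.xadd(stream, fields)


def consume_once(client, stream, group, consumer, handler, count=10):
    """Read new entries for a consumer group, handle each one, then ack it.

    An entry is acknowledged only after handler() returns, so a crash
    mid-handler leaves it pending for redelivery (at-least-once delivery).
    """
    handled = 0
    # The special ID ">" asks for entries not yet delivered to this group.
    batches = client.xreadgroup(group, consumer, {stream: ">"}, count=count)
    for _stream_name, entries in batches:
        for entry_id, fields in entries:
            handler(fields["type"], json.loads(fields["payload"]))
            client.xack(stream, group, entry_id)
            handled += 1
    return handled
```

&lt;p&gt;In this sketch the consumer group has to exist before the first read (redis-py exposes &lt;code&gt;xgroup_create&lt;/code&gt; with &lt;code&gt;mkstream=True&lt;/code&gt; for that). An audit consumer would be the same loop under a different group name, which is why adding it does not touch the existing roles.&lt;/p&gt;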

&lt;h2&gt;
  
  
  One LLM, many hats
&lt;/h2&gt;

&lt;p&gt;We run &lt;strong&gt;one&lt;/strong&gt; local LLM — GLM-4.7-Flash 8-bit on a Mac M1 (64 GB) via &lt;a href="https://github.com/waybarrios/vllm-mlx" rel="noopener noreferrer"&gt;vllm-mlx&lt;/a&gt; — and every role calls it with a different system prompt and a different tool whitelist. The resilience comes from a &lt;code&gt;FailoverChatModel&lt;/code&gt; (first described in an &lt;a href="https://vinayvobbili.github.io/posts/billing-header-kv-cache/" rel="noopener noreferrer"&gt;earlier post&lt;/a&gt;) that transparently falls back to a Qwen3 backup on a studio1 box if the m1 dies, and flips back the moment the primary recovers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 New — we open-sourced it.&lt;/strong&gt; That &lt;code&gt;FailoverChatModel&lt;/code&gt; is now a standalone, dependency-light package on PyPI: &lt;a href="https://pypi.org/project/langchain-failover/" rel="noopener noreferrer"&gt;&lt;code&gt;langchain-failover&lt;/code&gt;&lt;/a&gt;. &lt;code&gt;pip install langchain-failover&lt;/code&gt;, point it at two chat models, and you get the same primary/secondary failover that keeps this SOC's brain online when a GPU box drops off — connection-aware (it walks the exception's cause chain), recovery-aware (logs the flip back), and mid-stream-safe. The non-obvious part it gets right: &lt;strong&gt;tool-calling survives the failover&lt;/strong&gt; — it binds your tools on &lt;em&gt;both&lt;/em&gt; legs, so an agent mid-investigation doesn't lose its tools the instant it fails over. That's exactly what a SOC role needs at 3 AM. Source, tests, and docs: &lt;a href="https://github.com/vinayvobbili/langchain-failover" rel="noopener noreferrer"&gt;github.com/vinayvobbili/langchain-failover&lt;/a&gt;. 🚀&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why not multiple model providers per role?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one model loaded once. The Mac M1 holds 35B params at 8-bit comfortably and tool-calls reliably with the &lt;code&gt;glm47&lt;/code&gt; parser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — no inter-provider hop, no API rate limits to coordinate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational simplicity&lt;/strong&gt; — one health check, one auth header, one log file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The roles aren't actually different intelligences&lt;/strong&gt;: they're the same intelligence with different prompts, tool budgets, and JSON output schemas. Tier 2 gets a budget of 30 tool calls; IR Lead gets 15; Threat Intel gets 12.&lt;/li&gt;
&lt;/ul&gt;
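
&lt;p&gt;One way to picture "same model, different hats" is a small per-role config table. The sketch below is an assumption about shape only: the tool budgets (30 / 15 / 12) come from the list above, while the prompts are placeholders and the tool names are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    system_prompt: str        # placeholder text, not the real prompts
    allowed_tools: frozenset  # tool whitelist (names are hypothetical)
    max_tool_calls: int       # tool budget for the role


ROLES = {
    "tier2": RoleConfig(
        "You are a Tier 2 analyst...",
        frozenset({"search_logs", "lookup_host"}),
        30,
    ),
    "ir_lead": RoleConfig(
        "You are the IR Lead...",
        frozenset({"lookup_host", "propose_action"}),
        15,
    ),
    "threat_intel": RoleConfig(
        "You are a threat intel analyst...",
        frozenset({"lookup_actor"}),
        12,
    ),
}


def may_call(role, tool_name, calls_made):
    """True if the role is allowed this tool and still has budget left."""
    cfg = ROLES[role]
    return tool_name in cfg.allowed_tools and cfg.max_tool_calls > calls_made
```

&lt;p&gt;Every role would then share one chat-model client and differ only in which &lt;code&gt;RoleConfig&lt;/code&gt; it loads.&lt;/p&gt;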

&lt;p&gt;What GPT-4 or Claude would buy us isn't meaningfully better reasoning on any one role; what they would cost us is worse economics for a 24×7 deployment. We may revisit this for the hardest SEV-1 cases, but the default is local.&lt;/p&gt;

&lt;h2&gt;
  
  
  The roles
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Bus output&lt;/th&gt;
&lt;th&gt;Real-system side effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sentinel (Tier 1)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;XSOAR poller&lt;/td&gt;
&lt;td&gt;New tickets&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AlertTriaged&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;XSOAR triage note&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 2 Analyst&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-running consumer&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;alert.triaged&lt;/code&gt; where TP-malicious or pri ≥ 7&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Tier2Analysis&lt;/code&gt;, &lt;code&gt;CaseEscalated&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Webex card on escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IR Lead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-running consumer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;case.escalated → ir_lead&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;IRPlan&lt;/code&gt;, &lt;code&gt;ActionProposed&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Webex card with HITL buttons, XSOAR plan note&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Threat Intel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-running consumer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ir.plan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ThreatIntelReport&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Webex card, XSOAR attribution note&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SOC Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Timer 06/14/22 EST&lt;/td&gt;
&lt;td&gt;calendar&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ShiftSummary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Webex card&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Detection Engineer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Timer 09:00 EST M-F&lt;/td&gt;
&lt;td&gt;calendar&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DetectionTuningReport&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Webex card&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Threat Hunter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Timer 06/18 EST&lt;/td&gt;
&lt;td&gt;calendar&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HuntingReport&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Webex card&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HITL Flask&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser button click&lt;/td&gt;
&lt;td&gt;Approval link in Webex card&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ActionDecision&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Decision logged to sidecar SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every role's verdict is also persisted to a &lt;code&gt;verdicts.sqlite&lt;/code&gt; sidecar with &lt;code&gt;wall_time_ms&lt;/code&gt;, &lt;code&gt;tool_calls_made&lt;/code&gt;, and (in backtest mode) &lt;code&gt;ground_truth&lt;/code&gt;, so we can compute agreement rates and latency distributions without instrumenting OpenTelemetry on day one.&lt;/p&gt;
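
&lt;p&gt;A minimal sketch of what such a sidecar could look like, using only the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;wall_time_ms&lt;/code&gt;, &lt;code&gt;tool_calls_made&lt;/code&gt;, and &lt;code&gt;ground_truth&lt;/code&gt; columns are the ones named above; the table name, the remaining columns, and the helper functions are assumptions for illustration.&lt;/p&gt;

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the verdict store; schema here is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS verdicts (
               ticket_id TEXT, role TEXT, verdict TEXT,
               wall_time_ms INTEGER, tool_calls_made INTEGER,
               ground_truth TEXT)"""
    )
    return conn


def record(conn, ticket_id, role, verdict, wall_time_ms,
           tool_calls_made, ground_truth=None):
    """Persist one verdict; ground_truth stays NULL outside backtest mode."""
    conn.execute(
        "INSERT INTO verdicts VALUES (?, ?, ?, ?, ?, ?)",
        (ticket_id, role, verdict, wall_time_ms, tool_calls_made, ground_truth),
    )
    conn.commit()


def agreement_rate(conn, role):
    """Fraction of labelled rows whose verdict equals ground_truth.

    Returns None when the role has no labelled rows.
    """
    row = conn.execute(
        "SELECT AVG(verdict = ground_truth) FROM verdicts "
        "WHERE role = ? AND ground_truth IS NOT NULL",
        (role,),
    ).fetchone()
    return row[0]
```

&lt;p&gt;Latency distributions fall out of the same table with an ordinary &lt;code&gt;SELECT wall_time_ms ... ORDER BY&lt;/code&gt; query, which is the appeal of starting with a flat sidecar.&lt;/p&gt;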

&lt;h2&gt;
  
  
  The HITL gate
&lt;/h2&gt;

&lt;p&gt;The hardest design question wasn't "should the AI take the containment action?" It was "how does the AI hand off to a human in a way the human will actually engage with?"&lt;/p&gt;

&lt;p&gt;We tried a few patterns. The one that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant IR as IR Lead&amp;lt;br/&amp;gt;(LLM agent)
    participant Bus as Redis Streams
    participant Webex as Webex card&amp;lt;br/&amp;gt;(Pokedex bot)
    participant Human as Approver
    participant Flask as Flask HITL page
    participant SQLite as hitl.sqlite

    IR-&amp;gt;&amp;gt;Bus: IRPlan + ActionProposed&amp;lt;br/&amp;gt;(approver_role="IR Lead On-Call")
    IR-&amp;gt;&amp;gt;Webex: Card with 2 buttons&amp;lt;br/&amp;gt;(Action.OpenUrl)
    Webex-&amp;gt;&amp;gt;Human: Banner: "🎯 Action required from: IR Lead On-Call"
    Human-&amp;gt;&amp;gt;Flask: Click Approve or Reject
    Flask-&amp;gt;&amp;gt;Human: Confirmation page&amp;lt;br/&amp;gt;(login_required, DEMO MODE banner)
    Human-&amp;gt;&amp;gt;Flask: Submit decision
    Flask-&amp;gt;&amp;gt;SQLite: Persist decision
    Flask-&amp;gt;&amp;gt;Bus: ActionDecision (approved|rejected, dummy=True)
    Note over Bus: v2 future: executor agent&amp;lt;br/&amp;gt;consumes approved decisions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three details matter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The approver is addressed.&lt;/strong&gt; The card banner says "Action required from: IR Lead On-Call" — not "click here to approve." The team knows whose mailbox each card is in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Flask confirmation page sits between the click and the recorded decision.&lt;/strong&gt; Single-click approve from a Webex card was tempting but wrong: an accidental click would be recorded as an approval and, once an executor exists, would trigger the action. The two-step flow (click button → see page → click submit) is the friction we want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v1 doesn't actually execute.&lt;/strong&gt; The decision is logged, an &lt;code&gt;ActionDecision&lt;/code&gt; event is published, and the demo wraps up there. v2 — an executor agent that consumes &lt;code&gt;action.decision[approved]&lt;/code&gt; and calls CrowdStrike RTR / Zscaler / Tanium — is straightforward to add once leadership trusts the loop. Trust is earned in v1, not asserted by skipping the gate.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Putting numbers on it: the backtest harness
&lt;/h2&gt;

&lt;p&gt;The hardest sell to a SOC director isn't "we built it." It's "how do we know it works before we put it on real alerts?"&lt;/p&gt;

&lt;p&gt;We have an XSOAR timeline database with 32K+ historical CrowdStrike tickets, each with an &lt;code&gt;escalation_state&lt;/code&gt; field that tells us whether a human Tier 1 closed it or a human Tier 2/Tier 3 picked it up. That's our ground truth — analyst-curated, no extra labelling required.&lt;/p&gt;

&lt;p&gt;The backtest harness samples N closed tickets stratified 50/50 between human-escalated and human-closed, then replays each through the agent cascade with all side effects neutered: bus publishes captured in-memory, Webex sends no-op'd, XSOAR writes no-op'd, HITL store stubbed.&lt;/p&gt;
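
&lt;p&gt;The 50/50 stratification can be sketched in a few lines. This is an illustration rather than the harness itself: it assumes each ticket is a dict and that &lt;code&gt;escalation_state&lt;/code&gt; takes the values &lt;code&gt;"escalated"&lt;/code&gt; and &lt;code&gt;"closed"&lt;/code&gt;, which may not match the real field values.&lt;/p&gt;

```python
import random


def stratified_sample(tickets, n, seed=0):
    """Pick n tickets, half human-escalated and half human-closed.

    Raises ValueError (from random.sample) if either stratum is too small.
    """
    rng = random.Random(seed)  # fixed seed keeps backtest runs comparable
    escalated = [t for t in tickets if t["escalation_state"] == "escalated"]
    closed = [t for t in tickets if t["escalation_state"] == "closed"]
    half = n // 2
    sample = rng.sample(escalated, half) + rng.sample(closed, n - half)
    rng.shuffle(sample)  # avoid replaying all escalated tickets first
    return sample
```

&lt;p&gt;Balancing the strata matters because escalations are typically the minority class; an unstratified sample would mostly measure how well the cascade closes benign tickets.&lt;/p&gt;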

&lt;p&gt;For each ticket we record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentinel's verdict + priority&lt;/li&gt;
&lt;li&gt;Whether Tier 2 engaged, and what it decided (escalate / close / needs human review)&lt;/li&gt;
&lt;li&gt;Whether IR Lead engaged, and what SEV it assigned&lt;/li&gt;
&lt;li&gt;Whether Threat Intel engaged, and what actor it attributed&lt;/li&gt;
&lt;li&gt;Wall time and tool-call count for each stage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we compute the confusion matrix of &lt;em&gt;cascade-escalated-to-IR-Lead&lt;/em&gt; vs &lt;em&gt;human-escalated-in-real-life&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TP  human escalated  AND  Tier 2 escalated → IR Lead
FN  human escalated  BUT  Tier 2 closed
FP  human closed     BUT  Tier 2 escalated → IR Lead
TN  human closed     AND  Tier 2 closed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Precision and recall on TP/FP/FN give us the numbers leadership wants — &lt;em&gt;"how often does the AI escalate when humans actually would, and how often does it cry wolf?"&lt;/em&gt; The summary lands in a JSON file that the dashboard panel reads, so the question gets a number, not a vibe.&lt;/p&gt;
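
&lt;p&gt;For concreteness, here is a small sketch of that computation. Precision is TP / (TP + FP) and recall is TP / (TP + FN); the function names and the input shape (pairs of booleans) are assumptions for illustration, not the harness's real interface.&lt;/p&gt;

```python
def confusion_counts(pairs):
    """Count outcomes from (human_escalated, agent_escalated) boolean pairs."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for human, agent in pairs:
        if human and agent:
            counts["TP"] += 1
        elif human:
            counts["FN"] += 1
        elif agent:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); a value is None if its denominator is zero."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```

&lt;p&gt;Read this way, precision answers "how often does it cry wolf?" and recall answers "how often does it escalate when humans actually would?"&lt;/p&gt;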

&lt;p&gt;The harness also has a &lt;code&gt;--dry-run&lt;/code&gt; mode that swaps the LLM for a canned-JSON stub, so we can validate the plumbing end-to-end in under 2 seconds without burning a single token — and that same harness drives the real-LLM run against a full stratified sample when we want actual agreement numbers rather than a smoke test.&lt;/p&gt;

&lt;h2&gt;
  
  
  What surprised us
&lt;/h2&gt;

&lt;p&gt;Three things, in order of how much they changed the design:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The bus is more important than the agents.&lt;/strong&gt; We spent the first week tuning prompts. The unlock was when we got the Redis Streams + audit-replay pattern right — at that point, adding a new role became a 200-line file plus a systemd unit, and the existing roles didn't have to know. That's worth more than another 5% on any single agent's quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timer-driven roles are underrated.&lt;/strong&gt; SOC Manager / Detection Engineer / Threat Hunter run on a calendar schedule, not on events. They get the same audit stream, so they see everything the reactive agents did, plus everything the audit stream caught that no reactive agent engaged on. Detection Engineer in particular finds tuning candidates a reactive role would never see — &lt;em&gt;"this rule fired 47 times this week and 41 were closed as benign by Tier 1."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The right level of role granularity isn't obvious.&lt;/strong&gt; We went back and forth on whether Tier 2 + IR Lead should be one role or two. They're two. Tier 2's job is "is this real and how bad?"; IR Lead's job is "given it's real, what's the plan?" Conflating them puts SEV classification in the same prompt as evidence-gathering and the model loses focus. Same with Threat Intel — keeping attribution out of IR Lead's prompt makes both roles tighter.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try it / What's next
&lt;/h2&gt;

&lt;p&gt;The full module lives at &lt;a href="https://github.com/vinayvobbili/security-ops-platform" rel="noopener noreferrer"&gt;&lt;code&gt;src/components/soc_in_box/&lt;/code&gt;&lt;/a&gt; — agents, schemas, bus wrapper, verdict store, HITL store, web routes, systemd units, README.&lt;/p&gt;

&lt;p&gt;What's not in v1 and what we'll work on next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HITL v2 executor.&lt;/strong&gt; Real write path — consume &lt;code&gt;action.decision[approved]&lt;/code&gt; events, call CrowdStrike RTR / Tanium / Zscaler via MCP, log the result back on the bus. The hard parts (audit, approval, identity) are done; only the executor itself is missing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Team agent.&lt;/strong&gt; Once we have AttackIQ wired into the lab, a Red Team role can post &lt;code&gt;attack.executed&lt;/code&gt; events that the rest of the cascade has to detect. Closes the loop on "did the SOC actually catch what the Red Team threw?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backtest as a CI gate.&lt;/strong&gt; Once we're confident on a baseline, promote the harness to a nightly run with regression thresholds — "if Tier 2 escalation precision drops more than 5% from last week's baseline, fail the build."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code is BSD-licensed in the public mirror. If you're building something similar, the most useful thing to copy isn't any one agent — it's the &lt;strong&gt;bus shape&lt;/strong&gt;, the &lt;strong&gt;schema-per-event&lt;/strong&gt; discipline, the &lt;strong&gt;audit-stream-as-truth&lt;/strong&gt; pattern, and the &lt;strong&gt;HITL handoff that addresses a human by role&lt;/strong&gt;. Those four ideas are what turned eight separate LLM-with-tools experiments into one thing a SOC team would actually run.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>detflow: A Detection-Engineering Copilot You Can pip install</title>
      <dc:creator>Vinay</dc:creator>
      <pubDate>Sun, 07 Jun 2026 01:55:03 +0000</pubDate>
      <link>https://dev.to/vinayiitkgp/detflow-a-detection-engineering-copilot-you-can-pip-install-aem</link>
      <guid>https://dev.to/vinayiitkgp/detflow-a-detection-engineering-copilot-you-can-pip-install-aem</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR 🚀
&lt;/h2&gt;

&lt;p&gt;I shipped &lt;a href="https://pypi.org/project/detflow/" rel="noopener noreferrer"&gt;&lt;strong&gt;detflow&lt;/strong&gt;&lt;/a&gt; to PyPI — an open-source, &lt;strong&gt;vendor-neutral detection-engineering copilot&lt;/strong&gt;. It does the four things I found myself re-implementing inside every detection-as-code workflow: &lt;strong&gt;draft&lt;/strong&gt; a detection from plain English (as &lt;strong&gt;Sigma&lt;/strong&gt; or &lt;strong&gt;Cortex XSIAM XQL&lt;/strong&gt;), &lt;strong&gt;lint&lt;/strong&gt; it offline, &lt;strong&gt;find overlaps&lt;/strong&gt; against the rules you already run, and &lt;strong&gt;review&lt;/strong&gt; it like a senior detection engineer. 🛡️&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;2 formats&lt;/h2&gt;
draft &amp;amp; review in &lt;strong&gt;Sigma&lt;/strong&gt; or &lt;strong&gt;Cortex XQL&lt;/strong&gt; — one portable, one native
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;1 protocol&lt;/h2&gt;
bring any model: an OpenAI-compatible endpoint or a LangChain failover chain
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;0 crashes&lt;/h2&gt;
lint &amp;amp; overlap need no model; review degrades to a deterministic floor
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the detection-side sibling of &lt;a href="https://vinayvobbili.github.io/posts/iocflow-agentic-ioc-lifecycle/" rel="noopener noreferrer"&gt;iocflow&lt;/a&gt;. iocflow handles the &lt;em&gt;indicator&lt;/em&gt; lifecycle; detflow handles the &lt;em&gt;rule&lt;/em&gt; lifecycle. Same design DNA: &lt;strong&gt;deterministic primitives first, the LLM as an enhancement that can fail without taking the tool down with it.&lt;/strong&gt; 🧱&lt;/p&gt;

&lt;h2&gt;
  
  
  The itch
&lt;/h2&gt;

&lt;p&gt;A detection-as-code pipeline — the kind that turns a rule into a reviewed, tested merge request — has a handful of stages that have nothing to do with your SIEM vendor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this rule even &lt;em&gt;valid&lt;/em&gt;? (lint / schema-check)&lt;/li&gt;
&lt;li&gt;An analyst can describe the behavior but doesn't write Sigma fluently — can we &lt;em&gt;draft&lt;/em&gt; the first version?&lt;/li&gt;
&lt;li&gt;Are we about to ship coverage we &lt;strong&gt;already have&lt;/strong&gt;? (dedup against the catalog)&lt;/li&gt;
&lt;li&gt;Would a senior engineer &lt;em&gt;approve&lt;/em&gt; this, and what would they flag? (quality, false-positive risk, ATT&amp;amp;CK mapping, gaps)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd written those four stages more than once. They're generic — the only vendor-specific parts of a real pipeline are &lt;em&gt;compiling&lt;/em&gt; to your query language and &lt;em&gt;dry-running&lt;/em&gt; against your tenant. So I carved the generic four out of a detection-as-code workbench I'd built and made them a clean, public library. 🧰&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like
&lt;/h2&gt;

&lt;p&gt;Draft a detection from a sentence — in either language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;detflow&lt;/span&gt;

&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;powershell with an encoded command spawned from a Word macro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                       &lt;span class="c1"&gt;# a full Sigma rule, ready to lint
&lt;/span&gt;
&lt;span class="n"&gt;xql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;same thing, but for Cortex XSIAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cortex-xql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                         &lt;span class="c1"&gt;# dataset = ... | filter ... | limit 100
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lint it offline — no model, no network, no keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# or lint_sigma / lint_xql
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# "pass" / "warn" / "fail"
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review it like a senior engineer, deduped against your own inventory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Encoded PowerShell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crowdstrike&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;techniques&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T1059.001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WMI Process Create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sigma&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;techniques&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T1047&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;false_positive_risk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;overlaps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;               &lt;span class="c1"&gt;# "you may already cover this"
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; •&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The whole flow, end to end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    NL([plain English]) --&amp;gt;|draft| RULE[Sigma / XQL rule]
    RULE --&amp;gt;|lint| LINT[schema + best-practice findings]
    RULE --&amp;gt;|find_overlaps| OV[catalog dedup]
    LINT --&amp;gt; REV{{review}}
    OV --&amp;gt; REV
    REV --&amp;gt; V([quality · FP risk · ATT&amp;amp;CK · verdict])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a CLI too, for the terminal-and-CI crowd:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;detflow draft &lt;span class="s2"&gt;"credential dumping via comsvcs MiniDump"&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; cortex-xql
detflow lint rule.yml
detflow review rule.yml &lt;span class="nt"&gt;--catalog&lt;/span&gt; catalog.json &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Model-agnostic on purpose 🔌
&lt;/h2&gt;

&lt;p&gt;detflow doesn't import an SDK or hard-code a provider. A "model" is anything with one method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you three ways in. A built-in &lt;code&gt;OpenAIChatModel&lt;/code&gt; talks to any OpenAI-compatible endpoint — OpenAI, Azure, a local vLLM/Ollama server, a gateway. &lt;code&gt;default_model()&lt;/code&gt; builds one from &lt;code&gt;DETFLOW_LLM_*&lt;/code&gt; env vars. Or you wrap &lt;strong&gt;any&lt;/strong&gt; LangChain chat model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_failover&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FailoverChatModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;detflow.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LangChainModel&lt;/span&gt;

&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FailoverChatModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_fallback&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LangChainModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;detflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# rides the failover chain
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;FailoverChatModel&lt;/code&gt; is &lt;a href="https://pypi.org/project/langchain-failover/" rel="noopener noreferrer"&gt;langchain-failover&lt;/a&gt;, another package I extracted and published — so a primary-model outage transparently falls back to a secondary mid-review. Three of my OSS packages quietly eating each other's dog food. 🐕&lt;/p&gt;
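
&lt;p&gt;To give a sense of how small that one-method contract is, here is a minimal sketch of an offline stand-in that satisfies it. Only the &lt;code&gt;complete&lt;/code&gt; signature comes from the text above; the &lt;code&gt;CompletionModel&lt;/code&gt; protocol and the &lt;code&gt;CannedModel&lt;/code&gt; class are hypothetical names for illustration and are not claimed to be part of detflow's API.&lt;/p&gt;

```python
import json as jsonlib  # aliased because the method's `json` flag would shadow the module
from typing import Protocol, runtime_checkable


@runtime_checkable
class CompletionModel(Protocol):
    """Hypothetical protocol mirroring the one-method contract."""

    def complete(self, system: str, user: str, *, json: bool = False) -> str: ...


class CannedModel:
    """Offline stand-in: returns a fixed reply, handy for unit tests."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, system: str, user: str, *, json: bool = False) -> str:
        if json:
            # Wrap the reply in a JSON object when structured output is requested.
            return jsonlib.dumps({"reply": self.reply})
        return self.reply


model = CannedModel("looks fine")
print(isinstance(model, CompletionModel))
print(model.complete("You review detection rules.", "Review this rule.", json=True))
```

&lt;p&gt;Because the check is structural, anything with a matching &lt;code&gt;complete&lt;/code&gt; method passes, which is presumably what makes wrapping an arbitrary chat model cheap.&lt;/p&gt;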

&lt;h2&gt;
  
  
  Never-raises, deterministic floor
&lt;/h2&gt;

&lt;p&gt;The contract I care about most: &lt;strong&gt;detflow degrades, it doesn't break.&lt;/strong&gt; 🎯&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lint and overlap need no model at all&lt;/strong&gt; — they're pure, stdlib-plus-PyYAML, and run in CI with zero secrets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drafting&lt;/strong&gt; requires a model (you're asking it to write), but a model error comes back as a result with an &lt;code&gt;error&lt;/code&gt; field, not an exception.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review&lt;/strong&gt; uses a model when one is present and falls back to a &lt;strong&gt;deterministic floor&lt;/strong&gt; when it isn't — you still get the lint results, the catalog overlaps, and the parsed ATT&amp;amp;CK techniques. &lt;code&gt;review()&lt;/code&gt; never raises.&lt;/li&gt;
&lt;/ul&gt;
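
&lt;p&gt;The degrade-don't-break contract can be sketched in a few lines. This is an illustration of the pattern under stated assumptions, not detflow's implementation: &lt;code&gt;ReviewResult&lt;/code&gt;, &lt;code&gt;lint&lt;/code&gt;, and this &lt;code&gt;review&lt;/code&gt; are hypothetical stand-ins, and the model is reduced to a plain callable.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ReviewResult:
    findings: List[str]
    verdict: str
    error: Optional[str] = None


def lint(rule: dict) -> List[str]:
    """Toy deterministic check: flag missing required keys."""
    return [f"missing field: {key}" for key in ("title", "detection") if key not in rule]


def review(rule: dict, model: Optional[Callable[[str], str]] = None) -> ReviewResult:
    findings = lint(rule)                         # always runs, no model needed
    floor = "needs-work" if findings else "pass"  # the deterministic floor
    if model is None:
        return ReviewResult(findings, floor)
    try:
        return ReviewResult(findings, model(str(rule)))
    except Exception as exc:                      # degrade to the floor, never raise
        return ReviewResult(findings, floor, error=str(exc))


def broken_model(prompt: str) -> str:
    raise TimeoutError("model unavailable")


result = review({"title": "Encoded PowerShell"}, model=broken_model)
print(result.verdict, result.findings, result.error)
```

&lt;p&gt;The point of the shape is that the deterministic findings are computed before the model is ever consulted, so a model failure can only remove judgment, never the baseline.&lt;/p&gt;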

&lt;p&gt;So detflow is safe to drop into a pipeline that sometimes has an LLM available and sometimes doesn't. The boring, testable parts stay up regardless; the AI adds judgment when it can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why two formats
&lt;/h2&gt;

&lt;p&gt;Sigma is the portable, reviewable, vendor-neutral standard — it lints cleanly and ports across SIEMs. Cortex XSIAM &lt;strong&gt;XQL&lt;/strong&gt; is what actually runs on that platform. Supporting both means you can author once in Sigma for portability, or go straight to XQL when you want the platform's full expressiveness — and detflow lints and reviews either one. The drafting prompts are language-aware (the XQL prompt knows XQL has no &lt;code&gt;startswith&lt;/code&gt;/&lt;code&gt;endswith&lt;/code&gt; and uses &lt;code&gt;| filter&lt;/code&gt;, not SQL &lt;code&gt;where&lt;/code&gt;), so you don't get SQL-shaped hallucinations back. 🧠&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger pattern
&lt;/h2&gt;

&lt;p&gt;This is the same lesson as the IOC work: when you want to &lt;em&gt;show&lt;/em&gt; AI in your engineering, the junior move is to make everything an LLM call. The stronger, more deployable story is &lt;strong&gt;deterministic primitives plus optional AI&lt;/strong&gt; — the schema checks and dedup are boring and tested, the model writes and reviews where judgment helps, and nothing falls over when the model is slow or absent.&lt;/p&gt;

&lt;p&gt;detflow runs on Python 3.9+, keeps &lt;code&gt;import detflow&lt;/code&gt; dependency-light (the LLM client is an extra), ships &lt;code&gt;py.typed&lt;/code&gt; for downstream type-checking, and every piece is independently useful.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 PyPI: &lt;a href="https://pypi.org/project/detflow/" rel="noopener noreferrer"&gt;&lt;code&gt;pip install detflow&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🛠️ Source: &lt;a href="https://github.com/vinayvobbili/detflow" rel="noopener noreferrer"&gt;github.com/vinayvobbili/detflow&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🧩 Its indicator-side sibling: &lt;a href="https://vinayvobbili.github.io/posts/iocflow-agentic-ioc-lifecycle/" rel="noopener noreferrer"&gt;iocflow&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run a detection-as-code pipeline, I'd love to know which query language you'd want next. 👋&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>iocflow: Turning a Production AI SOC into a Shippable OSS Library</title>
      <dc:creator>Vinay</dc:creator>
      <pubDate>Sun, 07 Jun 2026 01:54:47 +0000</pubDate>
      <link>https://dev.to/vinayiitkgp/iocflow-turning-a-production-ai-soc-into-a-shippable-oss-library-5ajd</link>
      <guid>https://dev.to/vinayiitkgp/iocflow-turning-a-production-ai-soc-into-a-shippable-oss-library-5ajd</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR 🚀
&lt;/h2&gt;

&lt;p&gt;I shipped &lt;a href="https://pypi.org/project/iocflow/" rel="noopener noreferrer"&gt;&lt;strong&gt;iocflow&lt;/strong&gt;&lt;/a&gt; to PyPI — an open-source Python library for the &lt;strong&gt;entire indicator-of-compromise lifecycle&lt;/strong&gt;, built as six independently-useful layers behind pip extras. The headline isn't "another IOC parser." It's the &lt;em&gt;shape&lt;/em&gt;: every layer is a deterministic, boring, testable primitive — and the top layer is a small &lt;strong&gt;LangGraph multi-agent team&lt;/strong&gt; that orchestrates those primitives, with a &lt;strong&gt;human-in-the-loop gate&lt;/strong&gt; standing between the AI and anything destructive.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;6 layers&lt;/h2&gt;
extract · enrich · comment · hunt · block · agent — each its own pip extra
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;1 import&lt;/h2&gt;
&lt;code&gt;investigate(text)&lt;/code&gt; runs the whole chain as a multi-agent team
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;0 rogue blocks&lt;/h2&gt;
LLM &lt;em&gt;proposes&lt;/em&gt; · human &lt;em&gt;authorizes&lt;/em&gt; · a guard &lt;em&gt;vetoes&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the OSS sibling of &lt;a href="https://vinayvobbili.github.io/posts/building-soc-in-a-box/" rel="noopener noreferrer"&gt;SOC-in-a-Box&lt;/a&gt;, the AI SOC I wrote about last week. SOC-in-a-Box proved the &lt;em&gt;pattern&lt;/em&gt; against real systems; iocflow packages the &lt;em&gt;lesson&lt;/em&gt; so anyone can pip-install it. 🧰&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlkg4dghjeaxhfvo6h9p.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlkg4dghjeaxhfvo6h9p.gif" alt="iocflow investigate() running the full IOC lifecycle with a human-in-the-loop approval gate" width="800" height="693"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;One call: extract IOCs from a report → enrich → suggest hunts → propose blocks → wait for a human → block at the firewall. The benign &lt;code&gt;8.8.8.8&lt;/code&gt; never even gets proposed.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The lesson I was carrying over
&lt;/h2&gt;

&lt;p&gt;SOC-in-a-Box was eight analyst roles played by &lt;strong&gt;one&lt;/strong&gt; local LLM over a message bus, read-only against production, with a human approving every containment action. The thing that actually made it trustworthy wasn't the agents — an LLM-with-tools loop is not novel. It was two architectural commitments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The model orchestrates; it doesn't &lt;em&gt;do&lt;/em&gt;.&lt;/strong&gt; The irreversible work — query a SIEM, write a denylist, isolate a host — is done by plain, deterministic code the model merely &lt;em&gt;calls&lt;/em&gt;. The LLM picks &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;when&lt;/em&gt;; the tool decides &lt;em&gt;how&lt;/em&gt;, the same way every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No single authority for a destructive action.&lt;/strong&gt; The AI can propose containment all day long. A human clicks the button, and a dumb safety check sits underneath both of them refusing to touch anything on an allowlist.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Those two ideas aren't SOC-specific. They're how you make &lt;em&gt;any&lt;/em&gt; AI system that touches production safe enough to actually deploy. So I pulled them out of the SOC and built a clean, public library around them. 🧱&lt;/p&gt;
&lt;h2&gt;
  
  
  Deterministic primitives first, agents last
&lt;/h2&gt;

&lt;p&gt;iocflow grows in layers, each behind its own extra so &lt;code&gt;import iocflow&lt;/code&gt; stays a one-dependency install and pulls in nothing you didn't ask for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L1 — extract&lt;/strong&gt; (&lt;code&gt;iocflow&lt;/code&gt;): pull IPs, domains, URLs, hashes, CVEs, MITRE technique IDs, threat actors, and malware families out of unstructured text, with the false-positive defenses you'd otherwise hand-write (Public Suffix List validation, benign allowlists, re-fanging of &lt;code&gt;evil-domain[.]ru&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L2 — enrich&lt;/strong&gt; (&lt;code&gt;iocflow[enrich]&lt;/code&gt;): look each indicator up against VirusTotal / AbuseIPDB / abuse.ch and return a worst-wins verdict.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3 — comment&lt;/strong&gt; (&lt;code&gt;iocflow[ai]&lt;/code&gt;): an LLM turns the enrichment report into a structured assessment — and falls back to a deterministic, report-derived summary when no model is configured. It &lt;em&gt;never&lt;/em&gt; raises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L4 — hunt&lt;/strong&gt; (&lt;code&gt;iocflow[hunt]&lt;/code&gt;): render ready-to-run hunt queries — &lt;strong&gt;CrowdStrike CQL&lt;/strong&gt;, &lt;strong&gt;Cortex XQL&lt;/strong&gt;, and &lt;strong&gt;Sigma&lt;/strong&gt; — straight from the indicators, offline and stdlib-only. An LLM can &lt;em&gt;add&lt;/em&gt; behavioral hunts, but the deterministic queries are always there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L5 — block&lt;/strong&gt; (&lt;code&gt;iocflow[block]&lt;/code&gt;): push malicious indicators to the control points you operate — Palo Alto (EDL feed + live User-ID API), Zscaler, CrowdStrike, Abnormal — with &lt;code&gt;dry_run=True&lt;/code&gt; as the default &lt;em&gt;everywhere&lt;/em&gt; and an authoritative allowlist guard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L6 — agent&lt;/strong&gt; (&lt;code&gt;iocflow[agent]&lt;/code&gt;): the capstone. 🤖&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice that L1–L5 have no idea an agent exists. They're just functions with stable input/output types: &lt;code&gt;ExtractedEntities → enrich() → EnrichmentReport → comment() → Commentary → suggest() → HuntPlan → block() → BlockReport&lt;/code&gt;. You can use any one of them on its own. That's deliberate — &lt;strong&gt;the agent is a consumer of the primitives, not a replacement for them.&lt;/strong&gt;&lt;/p&gt;
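
&lt;p&gt;As one example of how plain these primitives can be, the "worst-wins" verdict from L2 fits in a single function. The sketch below is an assumption about the idea, not iocflow's code: the &lt;code&gt;Verdict&lt;/code&gt; enum, its ordering (unknown below benign), and &lt;code&gt;worst_wins&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from enum import IntEnum
from typing import Iterable


class Verdict(IntEnum):
    # Assumed severity ordering; higher means worse.
    UNKNOWN = 0
    BENIGN = 1
    SUSPICIOUS = 2
    MALICIOUS = 3


def worst_wins(verdicts: Iterable[Verdict]) -> Verdict:
    """Return the most severe verdict; an empty input stays UNKNOWN."""
    return max(verdicts, default=Verdict.UNKNOWN)


# Made-up per-source results for one indicator.
sources = {"virustotal": Verdict.BENIGN, "abuseipdb": Verdict.MALICIOUS}
print(worst_wins(sources.values()).name)
```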
&lt;h2&gt;
  
  
  The capstone: a small multi-agent team
&lt;/h2&gt;

&lt;p&gt;Layer 6 hands a report to a supervisor that routes to specialist agents — extractor, enricher, hunter, responder — each using L1–L5 as tools, then loops back until the case is done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TB
    START([report text]) --&amp;gt; SUP{supervisor&amp;lt;br/&amp;gt;routes next step}
    SUP --&amp;gt;|extract| EX[extractor&amp;lt;br/&amp;gt;L1 entities]
    SUP --&amp;gt;|enrich| EN[enricher&amp;lt;br/&amp;gt;L2 + L3 assessment]
    SUP --&amp;gt;|hunt| HU[hunter&amp;lt;br/&amp;gt;L4 queries]
    SUP --&amp;gt;|respond| RE[responder&amp;lt;br/&amp;gt;L5 dry-run → propose]
    EX --&amp;gt; SUP
    EN --&amp;gt; SUP
    HU --&amp;gt; SUP
    RE -.proposal.-&amp;gt; GATE{{ApprovalGate&amp;lt;br/&amp;gt;human authorizes}}
    GATE -.approved.-&amp;gt; RE
    RE --&amp;gt;|live block| SUP
    SUP --&amp;gt;|all done| END([Case])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;iocflow.agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;investigate&lt;/span&gt;

&lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;investigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# safe: nothing is blocked by default
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;commentary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;commentary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                &lt;span class="c1"&gt;# the agents' reasoning, replayable
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; •&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is &lt;strong&gt;any&lt;/strong&gt; LangChain chat model. The bundled &lt;code&gt;default_agent_model()&lt;/code&gt; builds a &lt;a href="https://pypi.org/project/langchain-failover/" rel="noopener noreferrer"&gt;&lt;code&gt;FailoverChatModel&lt;/code&gt;&lt;/a&gt; — primary with an automatic secondary — which is the &lt;em&gt;same&lt;/em&gt; failover model I extracted from the SOC and published earlier. iocflow eating its own dog food. 🐕 And here's the part that makes it robust: &lt;strong&gt;with no model configured at all, the graph runs the layers in a fixed deterministic order and still produces a complete &lt;code&gt;Case&lt;/code&gt;.&lt;/strong&gt; The LLM is an enhancement, not a dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three-layer authority (the part that matters) 🔒
&lt;/h2&gt;

&lt;p&gt;Blocking is the only step that can hurt you, so it gets the full treatment from SOC-in-a-Box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The agent proposes.&lt;/strong&gt; The responder does a &lt;em&gt;dry run&lt;/em&gt; of L5 — full audit, zero changes — and turns it into a proposal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A human authorizes.&lt;/strong&gt; An &lt;code&gt;ApprovalGate&lt;/code&gt; reviews the proposal and returns the approved subset. The default is &lt;code&gt;DenyAllGate&lt;/code&gt; — &lt;strong&gt;an unattended run blocks nothing.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A guard vetoes.&lt;/strong&gt; Underneath both of them, the Layer 5 allowlist guard refuses to touch public resolvers, private ranges, and well-known domains — &lt;em&gt;even if the report mislabeled them malicious.&lt;/em&gt; You cannot block &lt;code&gt;8.8.8.8&lt;/code&gt; through this library. The LLM is never the sole authority for a destructive action.&lt;/li&gt;
&lt;/ol&gt;
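
&lt;p&gt;The three layers compose into something like the sketch below. It is a stdlib-only illustration of the propose / authorize / veto idea, not iocflow's implementation: &lt;code&gt;guard_allows&lt;/code&gt;, &lt;code&gt;deny_all&lt;/code&gt;, &lt;code&gt;respond&lt;/code&gt;, and the resolver list are hypothetical, and the public address in the proposal is only a placeholder.&lt;/p&gt;

```python
import ipaddress
from typing import Callable, List

PUBLIC_RESOLVERS = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "9.9.9.9"}  # illustrative list


def guard_allows(ip: str) -> bool:
    """Layer 3: refuse well-known resolvers and non-routable ranges."""
    addr = ipaddress.ip_address(ip)
    if ip in PUBLIC_RESOLVERS:
        return False
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)


def deny_all(proposed: List[str]) -> List[str]:
    """Default gate: an unattended run approves nothing."""
    return []


def respond(proposed: List[str], gate: Callable[[List[str]], List[str]] = deny_all) -> List[str]:
    approved = gate(proposed)  # layer 2: a human authorizes a subset
    # The guard runs last, so neither the agent nor the approver can override it.
    return [ip for ip in approved if ip in proposed and guard_allows(ip)]


proposal = ["93.184.216.34", "8.8.8.8", "10.0.0.5"]  # layer 1: what the agent proposed
print(respond(proposal))                              # default gate
print(respond(proposal, gate=lambda ips: ips))        # approver says yes to everything
```

&lt;p&gt;Even with an approver who accepts the whole proposal, the resolver and the private address are dropped by the guard, which is the property the list above describes.&lt;/p&gt;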

&lt;p&gt;For the gate, I wired a real one to &lt;strong&gt;Slack&lt;/strong&gt; — no inbound webhook server, just post-and-poll:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;iocflow.agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;investigate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;iocflow.agent.chat_gate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SlackApprovalGate&lt;/span&gt;

&lt;span class="n"&gt;gate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SlackApprovalGate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;approvers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U_ANALYST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;investigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Bot posts the proposed blocks to your channel.
# ✅ from an allowlisted analyst authorizes the plan; ❌ or no reply = denied.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It posts the proposed blocks to a channel and polls for a reaction from an &lt;strong&gt;allowlisted&lt;/strong&gt; approver — ✅ approves the plan, ❌ or silence denies it, and a timeout defaults to &lt;em&gt;deny&lt;/em&gt;. The whole thing is a &lt;code&gt;ChatApprovalGate&lt;/code&gt; over a two-method &lt;code&gt;ChatTransport&lt;/code&gt; (&lt;code&gt;post&lt;/code&gt;, &lt;code&gt;reactions&lt;/code&gt;), so the same flow drops onto Webex, Teams, or a web UI by writing two functions. The transport is a thin seam, which means the gate logic is unit-tested without a single network call.&lt;/p&gt;
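
&lt;p&gt;A rough sketch of that seam, to show why it tests without a network. Only the method names &lt;code&gt;post&lt;/code&gt; and &lt;code&gt;reactions&lt;/code&gt; come from the text; their signatures, the reaction dict shape, &lt;code&gt;FakeTransport&lt;/code&gt;, and &lt;code&gt;approve&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;

```python
import time
from typing import List, Protocol, Set


class ChatTransport(Protocol):
    def post(self, text: str) -> str: ...                 # assumed to return a message id
    def reactions(self, message_id: str) -> List[dict]: ...


class FakeTransport:
    """In-memory transport so the gate logic runs without a network."""

    def __init__(self, scripted: List[dict]) -> None:
        self.scripted = scripted
        self.posted: List[str] = []

    def post(self, text: str) -> str:
        self.posted.append(text)
        return "msg-1"

    def reactions(self, message_id: str) -> List[dict]:
        return self.scripted


def approve(transport: ChatTransport, plan: str, approvers: Set[str],
            timeout: float = 0.2, poll: float = 0.05) -> bool:
    message_id = transport.post(plan)
    deadline = time.monotonic() + timeout
    while True:
        for reaction in transport.reactions(message_id):
            if reaction["user"] in approvers:             # ignore non-allowlisted users
                return reaction["emoji"] == "white_check_mark"
        if time.monotonic() >= deadline:
            return False                                  # timeout defaults to deny
        time.sleep(poll)


fake = FakeTransport([{"user": "U_ANALYST", "emoji": "white_check_mark"}])
print(approve(fake, "block 1 indicator", {"U_ANALYST"}))
```

&lt;p&gt;Porting to another chat system would then mean supplying a different object with the same two methods; the polling and deny-by-default logic stays untouched.&lt;/p&gt;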

&lt;h2&gt;
  
  
  Why build it this way
&lt;/h2&gt;

&lt;p&gt;The temptation, when you want to "show AI in your work," is to make &lt;em&gt;everything&lt;/em&gt; an LLM call. That reads as junior. The stronger story — the one a security team will actually run — is &lt;strong&gt;deterministic primitives plus agentic orchestration&lt;/strong&gt;: the boring parts are boring and tested, the AI adds judgment where judgment helps, and a human holds the keys to anything irreversible. 🎯&lt;/p&gt;

&lt;p&gt;Everything but the agent layer runs on Python 3.9+; &lt;code&gt;import iocflow&lt;/code&gt; stays dependency-light; every layer is independently useful; and the whole agent runs offline in tests because the enrichers, blockers, and model are all injectable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 PyPI: &lt;a href="https://pypi.org/project/iocflow/" rel="noopener noreferrer"&gt;&lt;code&gt;pip install iocflow&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🛠️ Source: &lt;a href="https://github.com/vinayvobbili/iocflow" rel="noopener noreferrer"&gt;github.com/vinayvobbili/iocflow&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🧠 The SOC it grew out of: &lt;a href="https://vinayvobbili.github.io/posts/building-soc-in-a-box/" rel="noopener noreferrer"&gt;SOC-in-a-Box&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you try it, I'd love to hear what control points you'd plug in. 👋&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>Three Chat Template Patterns That Silently Kill Your Prompt Cache</title>
      <dc:creator>Vinay</dc:creator>
      <pubDate>Sun, 07 Jun 2026 01:54:28 +0000</pubDate>
      <link>https://dev.to/vinayiitkgp/three-chat-template-patterns-that-silently-kill-your-prompt-cache-47nj</link>
      <guid>https://dev.to/vinayiitkgp/three-chat-template-patterns-that-silently-kill-your-prompt-cache-47nj</guid>
      <description>
</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>python</category>
    </item>
    <item>
      <title>Teaching a Reranker the Language of Security Tickets (+41% MRR@10)</title>
      <dc:creator>Vinay</dc:creator>
      <pubDate>Sun, 07 Jun 2026 01:53:46 +0000</pubDate>
      <link>https://dev.to/vinayiitkgp/teaching-a-reranker-the-language-of-security-tickets-41-mrr10-4mgk</link>
      <guid>https://dev.to/vinayiitkgp/teaching-a-reranker-the-language-of-security-tickets-41-mrr10-4mgk</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Our SOC's RAG pipeline retrieves from a corpus of 142,000 closed XSOAR security tickets to ground&lt;br&gt;
investigation answers. After exhausting the easy wins (chunking, top-k, reranker&lt;br&gt;
choice), the right historical ticket still landed at rank 5-10 too often, and&lt;br&gt;
the LLM grounded its answer in a near-miss neighbor instead.&lt;/p&gt;

&lt;p&gt;We fine-tuned the reranker on our own data. Held-out test set, time-based split:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;MRR@10&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;BAAI/bge-reranker-v2-m3&lt;/code&gt; (off-the-shelf)&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuned on 24K XSOAR pairs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.846&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;+41% uplift.&lt;/strong&gt; No model architecture change, no embedding model swap. Just&lt;br&gt;
domain-specific fine-tuning of the same base reranker.&lt;/p&gt;
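
&lt;p&gt;For readers who want the arithmetic spelled out, here is a small sketch of how MRR@10 is computed and how the relative uplift follows from the two scores in the table. The function name and the toy ranks are made up for illustration; only 0.598 and 0.846 come from the results above.&lt;/p&gt;

```python
from typing import List, Optional


def mrr_at_k(ranks: List[Optional[int]], k: int = 10) -> float:
    """Mean reciprocal rank at k.

    `ranks` holds the 1-based rank of the first relevant result per query,
    or None when nothing relevant was returned. Ranks beyond k score zero.
    """
    total = sum(1.0 / r for r in ranks if r is not None and k >= r)
    return total / len(ranks)


toy_ranks = [1, 2, None, 5, 12]      # made-up ranks, not the article's data
print(round(mrr_at_k(toy_ranks), 3))

baseline, tuned = 0.598, 0.846       # scores from the results table
uplift = (tuned - baseline) / baseline
print(f"{uplift:.1%}")
```

&lt;p&gt;Note that the uplift is relative to the baseline: the absolute gain is 0.248 MRR points, which is about 41% of 0.598.&lt;/p&gt;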

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;+41%&lt;/h2&gt;
MRR@10 uplift on held-out time-split test set
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;24,213 + 10,848&lt;/h2&gt;
positive pairs + clean hard negatives, mined from close-notes
&lt;/td&gt;
&lt;td width="33%"&gt;
&lt;h2&gt;0&lt;/h2&gt;
explicit relevance labels collected — all signal mined from existing analyst text
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interesting part isn't the result — it's where the training data came from. We&lt;br&gt;
never logged a single explicit relevance judgement. The 24K positive pairs were&lt;br&gt;
hiding in plain sight inside analyst close-notes that nobody asked anyone to write.&lt;/p&gt;
&lt;h2&gt;
  
  
  The setup: embedder + reranker, the standard two-stage RAG
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Q[User query] --&amp;gt; E[Embedder&amp;lt;br/&amp;gt;Qwen3-Embedding-8B&amp;lt;br/&amp;gt;4-bit DWQ]
    E --&amp;gt; Top50[Top-50 by&amp;lt;br/&amp;gt;cosine similarity]
    Top50 --&amp;gt; R[Reranker&amp;lt;br/&amp;gt;bge-reranker-v2-m3&amp;lt;br/&amp;gt;&amp;lt;b&amp;gt;fine-tuned&amp;lt;/b&amp;gt;]
    R --&amp;gt; Top5[Top-5 ranked&amp;lt;br/&amp;gt;by joint scoring]
    Top5 --&amp;gt; LLM[LLM grounds&amp;lt;br/&amp;gt;answer]
    style R fill:#1e40af,color:#fff
    style E fill:#0e7490,color:#fff
    style LLM fill:#065f46,color:#fff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Our retrieval pipeline is the standard cascade:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1 — Embedder (bi-encoder).&lt;/strong&gt; &lt;code&gt;Qwen3-Embedding-8B-4bit-DWQ&lt;/code&gt; served via
vllm-mlx. Encodes the query independently, pulls top-50 candidates from ChromaDB
by cosine similarity. Fast, but it scores query and document in isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2 — Reranker (cross-encoder).&lt;/strong&gt; &lt;code&gt;BAAI/bge-reranker-v2-m3&lt;/code&gt; running on
Apple Silicon (MPS). Jointly attends over &lt;code&gt;(query, document)&lt;/code&gt; and re-scores the
top-50 down to top-5 to feed the LLM. Slower per item, but dramatically more
accurate than embedder-only ranking.&lt;/li&gt;
&lt;/ul&gt;
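
&lt;p&gt;The cascade itself is only a few lines of control flow. The sketch below shows the shape with toy vectors and a word-overlap function standing in for the cross-encoder; it does not call Qwen3-Embedding, ChromaDB, or bge-reranker, and every name in it is hypothetical.&lt;/p&gt;

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    query_text: str,
    vectors: Dict[str, Sequence[float]],
    texts: Dict[str, str],
    cross_score: Callable[[str, str], float],
    first_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    # Stage 1: cheap ranking over precomputed vectors (query and doc scored in isolation).
    shortlist = sorted(vectors, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)[:first_k]
    # Stage 2: expensive joint scoring of (query, document), only on the shortlist.
    return sorted(shortlist, key=lambda d: cross_score(query_text, texts[d]), reverse=True)[:final_k]


def word_overlap(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: counts shared words."""
    return float(len(set(query.split()).intersection(doc.split())))


vectors = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
texts = {"t1": "phishing email reported", "t2": "encoded powershell on host", "t3": "badge reader offline"}
top = retrieve_then_rerank([1.0, 0.0], "encoded powershell", vectors, texts, word_overlap, first_k=2, final_k=1)
print(top)
```

&lt;p&gt;In the toy data the first stage ranks &lt;code&gt;t1&lt;/code&gt; ahead of &lt;code&gt;t2&lt;/code&gt;, and the second stage flips them, which is the near-miss correction a reranker is there to make.&lt;/p&gt;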

&lt;p&gt;Mental model: the embedder is a fast librarian who pulls 50 books off the shelf&lt;br&gt;
based on title similarity. The reranker is a careful reader who actually opens each&lt;br&gt;
one and re-orders by relevance to your specific question.&lt;/p&gt;

&lt;p&gt;Off-the-shelf rerankers like &lt;code&gt;bge-reranker-v2-m3&lt;/code&gt; are trained on general English&lt;br&gt;
passage retrieval (MS MARCO and friends). They've never seen an XSOAR ticket. They&lt;br&gt;
don't know that &lt;em&gt;"INBLRPRDDKNF01: ML via Cloud-based ML"&lt;/em&gt; matters in a way that&lt;br&gt;
generic English semantic similarity cannot capture. Fine-tuning is how you teach&lt;br&gt;
them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where the training data came from
&lt;/h2&gt;

&lt;p&gt;Cross-encoder training needs &lt;code&gt;(query, positive, negative)&lt;/code&gt; triples. We had no&lt;br&gt;
explicit relevance labels — no clicks, no thumbs-up/down, nothing. So we mined&lt;br&gt;
implicit ones from analyst close-notes.&lt;/p&gt;

&lt;p&gt;Buried in 142,000 closed tickets are sentences analysts type all the time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"With reference to XSOAR #289008, regional team confirmed..."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Refer master ticket #158126."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Per XSOAR #463428, user confirmed..."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is a human-curated link between two tickets. Free relevance label. We just&lt;br&gt;
had to extract them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Generalizable lesson.&lt;/strong&gt; Before paying for labels, look at what your users are&lt;br&gt;
already typing. Free-form text in close-notes, comments, JIRA descriptions —&lt;br&gt;
they're full of implicit relevance judgements that nobody asked anyone to record.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Filtering the noise: not all &lt;code&gt;#N&lt;/code&gt; references are equal
&lt;/h3&gt;

&lt;p&gt;A regex over close-notes pulled 61,500 &lt;code&gt;#N&lt;/code&gt; references. Most added little for reranker training:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pool&lt;/th&gt;
&lt;th&gt;Lead-in phrase&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Signal quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"Duplicate to #N"&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;52,782&lt;/td&gt;
&lt;td&gt;Strong but trivial — same alert, different host. Embedder already gets these.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"XSOAR #N · Per XSOAR…"&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;~3,000&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Gold&lt;/strong&gt; — analyst-curated cross-references between distinct tickets.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"QRadar offense #N"&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;~1,400&lt;/td&gt;
&lt;td&gt;Useless — references other systems, not XSOAR.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pool A is mostly the embedder's home turf already; the reranker doesn't need help&lt;br&gt;
with near-duplicates. Pool B is the interesting signal: &lt;em&gt;"these two tickets are&lt;br&gt;
related but not identical"&lt;/em&gt; — exactly the case where a reranker earns its keep.&lt;br&gt;
After regex-filtering and verifying both endpoints existed in our DB, we had &lt;strong&gt;4,260&lt;br&gt;
unique direct &lt;code&gt;(src → tgt)&lt;/code&gt; pairs.&lt;/strong&gt;&lt;/p&gt;
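
&lt;p&gt;A minimal sketch of that extraction step, assuming close-notes are available as a dict of ticket ID to text. The lead-in patterns below are illustrative guesses based on the phrases quoted above, not the exact regexes we ran, and the pool names and ticket IDs in the demo are hypothetical.&lt;/p&gt;

```python
import re

# Illustrative lead-in patterns; the real corpus needs more variants than these.
POOL_PATTERNS = {
    "A_duplicate": re.compile(r"duplicate\s+to\s+#(\d+)", re.IGNORECASE),
    "B_cross_ref": re.compile(r"(?:XSOAR|master\s+ticket)\s+#(\d+)", re.IGNORECASE),
    "other_system": re.compile(r"QRadar\s+offense\s+#(\d+)", re.IGNORECASE),
}

def extract_refs(src_id, note):
    """Return (pool, src_id, tgt_id) for every recognised #N reference in a close-note."""
    refs = []
    for pool, pattern in POOL_PATTERNS.items():
        for match in pattern.finditer(note):
            refs.append((pool, src_id, int(match.group(1))))
    return refs

def gold_pairs(notes, known_ids):
    """Keep unique pool-B pairs whose source and target both exist and differ."""
    pairs = set()
    for src_id, note in notes.items():
        for pool, src, tgt in extract_refs(src_id, note):
            if pool == "B_cross_ref" and src != tgt and src in known_ids and tgt in known_ids:
                pairs.add((src, tgt))
    return sorted(pairs)

notes = {
    1001: "With reference to XSOAR #289008, regional team confirmed",
    1002: "Refer master ticket #158126.",   # target missing from the DB in this demo
    1003: "Duplicate to #1001",
    1004: "QRadar offense #777 closed",
}
known_ids = {1001, 1002, 1003, 1004, 289008}
pairs = gold_pairs(notes, known_ids)
print(pairs)  # [(1001, 289008)]
```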
&lt;h2&gt;
  
  
  Free positives via transitive siblings (and the polynomial-blow-up trap)
&lt;/h2&gt;

&lt;p&gt;When five tickets all cite the same master ticket, those five are also related to&lt;br&gt;
each other. That's a free &lt;code&gt;O(n²)&lt;/code&gt; inflation of training pairs — &lt;em&gt;if&lt;/em&gt; you cap the&lt;br&gt;
explosion.&lt;/p&gt;

&lt;p&gt;We capped each master at 20 children before generating siblings. One particularly&lt;br&gt;
prolific master had 553 children; uncapped, it alone would have generated &lt;strong&gt;~150,000&lt;br&gt;
trivial sibling pairs&lt;/strong&gt; (553 × 552 / 2 = 152,628) and dominated the training distribution. Stratified sampling&lt;br&gt;
across distinct rules pushed cross-rule pairs to the front so the model learned&lt;br&gt;
&lt;em&gt;generalizable&lt;/em&gt; relations, not within-rule sameness.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direct &lt;code&gt;#N&lt;/code&gt; references&lt;/td&gt;
&lt;td&gt;4,260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transitive siblings (capped, stratified)&lt;/td&gt;
&lt;td&gt;19,953&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total positives (training-ready)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24,213&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;72% of the transitive pairs were cross-rule — a strong signal that our cap +&lt;br&gt;
sampling worked.&lt;/p&gt;
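
&lt;p&gt;The cap is easy to sketch. The helper below is a simplified illustration that only applies the per-master cap by random sampling; it does not reproduce the stratified sampling across rules, and the function name and input shape are assumptions made for the example.&lt;/p&gt;

```python
import math
import random
from itertools import combinations

def sibling_pairs(children_by_master, cap=20, seed=0):
    """Generate unordered sibling pairs, sampling at most `cap` children per master."""
    rng = random.Random(seed)
    pairs = set()
    for master, children in children_by_master.items():
        unique = sorted(set(children))
        if len(unique) > cap:
            unique = sorted(rng.sample(unique, cap))
        # Every pair of surviving children shares this master.
        pairs.update(combinations(unique, 2))
    return pairs

uncapped = math.comb(553, 2)   # pairs the 553-child master would produce on its own
capped = math.comb(20, 2)      # pairs after capping at 20 children
print(uncapped, capped)        # 152628 190

big = {"master": list(range(553))}
n_pairs = len(sibling_pairs(big))
print(n_pairs)                 # 190
```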

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Generalizable lesson.&lt;/strong&gt; Any time you derive new training examples by&lt;br&gt;
transitivity (or any structural inference), watch for polynomial blow-up in dense&lt;br&gt;
clusters. Stratified sampling is usually the right counter-move.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  The part most beginners get wrong: hard negative mining
&lt;/h2&gt;

&lt;p&gt;Negatives matter as much as positives. The model learns from contrast, and &lt;em&gt;random&lt;/em&gt;&lt;br&gt;
negatives teach almost nothing — they're already obviously different. The interesting&lt;br&gt;
negatives are the ones that &lt;em&gt;look&lt;/em&gt; similar to the embedder but aren't actually&lt;br&gt;
related. Those are the cases the embedder gets wrong, and they're exactly what the&lt;br&gt;
reranker needs to learn to push apart.&lt;/p&gt;

&lt;p&gt;The recipe: for each source ticket, query the existing embedding index for the&lt;br&gt;
top-50 nearest neighbors. Drop anything that's a known positive (direct, transitive,&lt;br&gt;
or shares a master). What's left is what the embedder thinks matches but the analyst&lt;br&gt;
never linked — &lt;em&gt;hard negatives&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We caught a subtle trap on the first run: &lt;strong&gt;same-rule near-duplicates are not hard&lt;br&gt;
negatives.&lt;/strong&gt; Two tickets both fired by &lt;code&gt;INBLRPRDDKNF01: ML via Cloud-based ML&lt;/code&gt; with&lt;br&gt;
0.997 cosine similarity are sibling alerts of the same automated detection rule —&lt;br&gt;
they're related, just not via an analyst's &lt;code&gt;#N&lt;/code&gt; reference. Training on them as&lt;br&gt;
negatives would teach the model to push apart things that are actually related.&lt;br&gt;
Filtering by rule before adding to the negatives pool dropped 33% of candidates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw top-50 candidates from embedder&lt;/td&gt;
&lt;td&gt;16,137&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same-rule contamination (filtered out)&lt;/td&gt;
&lt;td&gt;5,289 (33%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Clean cross-rule hard negatives&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10,848&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Median cosine similarity of the kept negatives: 0.955 — i.e. the embedder&lt;br&gt;
&lt;em&gt;strongly&lt;/em&gt; believed these were relevant. They weren't. That's exactly the gap a&lt;br&gt;
reranker should close.&lt;/p&gt;
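
&lt;p&gt;The filtering logic fits in one small function. This is a sketch under assumed data shapes (a neighbor list from the embedder, a set of known positive pairs, and a ticket-to-rule map); the names are hypothetical and the index query itself is left out.&lt;/p&gt;

```python
def mine_hard_negatives(src, neighbors, positives, rule_of):
    """Filter embedder neighbors down to hard negatives for one source ticket.

    neighbors: list of (ticket_id, cosine) from the embedder, best first.
    positives: set of frozenset({a, b}) pairs already known to be related.
    rule_of: dict mapping ticket_id to its detection rule name.
    """
    kept = []
    for cand, cos in neighbors:
        if cand == src:
            continue  # the source ticket is usually its own nearest neighbor
        if frozenset((src, cand)) in positives:
            continue  # known positive, never a negative
        if rule_of.get(cand) == rule_of.get(src):
            continue  # same-rule sibling: related, so not a true negative
        kept.append((cand, cos))
    return kept

neighbors = [(1, 1.0), (2, 0.997), (3, 0.96), (4, 0.955)]
positives = {frozenset((1, 3))}
rule_of = {1: "rule-A", 2: "rule-A", 3: "rule-B", 4: "rule-C"}
hard = mine_hard_negatives(1, neighbors, positives, rule_of)
print(hard)  # [(4, 0.955)]
```

&lt;p&gt;Note that tickets with no known rule compare equal to each other here and get dropped, which errs on the side of a cleaner negatives pool.&lt;/p&gt;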
&lt;h2&gt;
  
  
  Data discipline: split by time, never by random
&lt;/h2&gt;

&lt;p&gt;Random train/val/test splits leak future signal into training and lie to you about&lt;br&gt;
held-out quality. Any time your data has a time dimension — fraud, security, sales&lt;br&gt;
forecasting, almost everything in production ML — split by time. In production the&lt;br&gt;
model can never look at the future, so neither should your evaluation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Split&lt;/th&gt;
&lt;th&gt;Date range&lt;/th&gt;
&lt;th&gt;Rows&lt;/th&gt;
&lt;th&gt;Pos / Neg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Train&lt;/td&gt;
&lt;td&gt;before 2025-09-01&lt;/td&gt;
&lt;td&gt;27,604&lt;/td&gt;
&lt;td&gt;18,745 / 8,859&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Val&lt;/td&gt;
&lt;td&gt;2025-09 to 2025-11&lt;/td&gt;
&lt;td&gt;3,122&lt;/td&gt;
&lt;td&gt;2,378 / 744&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test&lt;/td&gt;
&lt;td&gt;2025-12 onward&lt;/td&gt;
&lt;td&gt;4,335&lt;/td&gt;
&lt;td&gt;3,090 / 1,245&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
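
&lt;p&gt;A time split needs nothing more than two cutoff dates. The sketch below uses the boundaries from the table; which timestamp a row is keyed on (here a single &lt;code&gt;created&lt;/code&gt; date) is an assumption for the example.&lt;/p&gt;

```python
from datetime import date

VAL_START = date(2025, 9, 1)    # train is everything before this date
TEST_START = date(2025, 12, 1)  # val covers 2025-09 through 2025-11

def split_of(created):
    """Assign a row to train/val/test purely from its timestamp."""
    if created >= TEST_START:
        return "test"
    if created >= VAL_START:
        return "val"
    return "train"

print(split_of(date(2025, 8, 31)))   # train
print(split_of(date(2025, 11, 30)))  # val
print(split_of(date(2025, 12, 1)))   # test
```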
&lt;h2&gt;
  
  
  The part that's almost a one-liner: the training loop
&lt;/h2&gt;

&lt;p&gt;After all the data work, the actual fit is short:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;InputExample&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-reranker-v2-m3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

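&lt;span class="c1"&gt;# load_jsonl is a helper not defined in this snippet; it is assumed to yield one dict per line&lt;/span&gt;
&lt;span class="n"&gt;examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;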
    &lt;span class="nc"&gt;InputExample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;load_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

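&lt;span class="c1"&gt;# evaluator is not defined in this snippet; it scores the validation split during training&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;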
    &lt;span class="n"&gt;train_dataloader&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;warmup_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;optimizer_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2e-5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
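
&lt;p&gt;The snippet above calls &lt;code&gt;load_jsonl&lt;/code&gt; without defining it. A minimal version might look like the following, assuming one JSON object per line with &lt;code&gt;query&lt;/code&gt;, &lt;code&gt;passage&lt;/code&gt;, and &lt;code&gt;label&lt;/code&gt; keys; the demo rows are placeholders.&lt;/p&gt;

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo: write two placeholder rows (with a blank line between them) and read them back.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"query": "q1", "passage": "p1", "label": 1}\n\n')
    tmp.write('{"query": "q1", "passage": "p2", "label": 0}\n')
rows = list(load_jsonl(tmp.name))
os.remove(tmp.name)
print(len(rows), rows[0]["label"], rows[1]["label"])  # 2 1 0
```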



&lt;p&gt;A few details that mattered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BCE-with-logits loss&lt;/strong&gt; on &lt;code&gt;(query, passage, label ∈ {0, 1})&lt;/code&gt;. Single-score output, binary cross-entropy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AdamW at &lt;code&gt;lr=2e-5&lt;/code&gt;&lt;/strong&gt; — the standard learning rate for BERT-family fine-tunes. Don't overthink it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear warmup for the first 10%&lt;/strong&gt; of steps (LR ramps 0 → 2e-5), then linear decay back to 0. Prevents unstable updates early when the model is still learning the new label distribution.&lt;/li&gt;
&lt;li&gt;
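&lt;strong&gt;Periodic val evaluation&lt;/strong&gt; every ~862 steps, which is about four times per epoch at batch size 8 on the 27,604 training rows. We tracked Average Precision on the validation split to know when to stop.&lt;/li&gt;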
&lt;/ul&gt;
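
&lt;p&gt;To make the warmup arithmetic concrete, here is a small sketch of the schedule shape described above. It assumes the loader iterates over the 27,604 training rows at batch size 8; &lt;code&gt;lr_at&lt;/code&gt; is a hypothetical helper written for illustration, not the library's scheduler.&lt;/p&gt;

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr=2e-5):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step >= warmup_steps:
        remaining = max(0, total_steps - step)
        return base_lr * remaining / max(1, total_steps - warmup_steps)
    return base_lr * step / max(1, warmup_steps)

steps_per_epoch = math.ceil(27604 / 8)   # batches per epoch, as len(loader) would report
total_steps = steps_per_epoch * 2        # two epochs
warmup_steps = int(total_steps * 0.1)    # same expression as in the fit() call

print(steps_per_epoch, total_steps, warmup_steps)   # 3451 6902 690
print(lr_at(0, total_steps, warmup_steps))          # 0.0
print(lr_at(total_steps, total_steps, warmup_steps))  # 0.0
```

&lt;p&gt;The peak of 2e-5 is reached at step 690, after which the rate falls linearly to zero at step 6,902.&lt;/p&gt;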

&lt;h2&gt;
  
  
  The payoff
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Baseline MRR@10&lt;/th&gt;
&lt;th&gt;Fine-tuned MRR@10&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;0.626&lt;/td&gt;
&lt;td&gt;0.811&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+30%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Test (held-out time)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.598&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.846&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+41%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MRR@10 is a standard ranking metric: for each query, find the rank of the first&lt;br&gt;
relevant result within the top 10; if it's at rank &lt;em&gt;k&lt;/em&gt;, the score is &lt;em&gt;1/k&lt;/em&gt;, and 0 if nothing&lt;br&gt;
relevant appears; then average across queries. Inverting the mean gives a rough "typical rank"&lt;br&gt;
(strictly, a harmonic mean): our baseline 0.598 corresponds to about rank 1.7, and our&lt;br&gt;
fine-tuned 0.846 to about rank 1.18, almost always at the top.&lt;/p&gt;

&lt;p&gt;Translation: the LLM grounds its answer on the right historical ticket &lt;em&gt;almost every&lt;br&gt;
time&lt;/em&gt; now. It's not a marginal improvement — it changes whether the agent's&lt;br&gt;
suggestion is &lt;em&gt;useful&lt;/em&gt; or &lt;em&gt;plausible-but-wrong&lt;/em&gt;.&lt;/p&gt;
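
&lt;p&gt;For reference, the metric itself is only a few lines. This is a generic sketch of MRR@k over 0/1 relevance flags, with toy inputs; it is not the evaluation harness used for the numbers above.&lt;/p&gt;

```python
def mrr_at_k(ranked_labels, k=10):
    """ranked_labels: one list per query of 0/1 relevance flags in ranked order."""
    total = 0.0
    for labels in ranked_labels:
        for rank, rel in enumerate(labels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_labels) if ranked_labels else 0.0

queries = [
    [1, 0, 0],   # first relevant at rank 1, contributes 1.0
    [0, 1, 0],   # first relevant at rank 2, contributes 0.5
    [0, 0, 0],   # nothing relevant in the top k, contributes 0.0
]
score = mrr_at_k(queries)
print(score)  # 0.5
```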
&lt;h2&gt;
  
  
  Battle scars (the gotchas nobody documents)
&lt;/h2&gt;

&lt;p&gt;A few things I had to fix while getting this to actually run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Corp SSL.&lt;/strong&gt; The Mac running training had the corporate CA trusted at the system&lt;br&gt;
level (so &lt;code&gt;curl&lt;/code&gt; and the OS Keychain were happy), but Python's &lt;code&gt;requests&lt;/code&gt; /&lt;br&gt;
&lt;code&gt;urllib3&lt;/code&gt; use &lt;code&gt;certifi&lt;/code&gt;'s CA bundle, &lt;em&gt;not&lt;/em&gt; the system store. So &lt;code&gt;pip install&lt;/code&gt; and&lt;br&gt;
HuggingFace model downloads failed with &lt;code&gt;CERTIFICATE_VERIFY_FAILED&lt;/code&gt;. The fix is to&lt;br&gt;
build a combined CA bundle and point both env vars at it (different libraries read&lt;br&gt;
different ones):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REQUESTS_CA_BUNDLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/corp-ca-bundle.pem
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SSL_CERT_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/corp-ca-bundle.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
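
&lt;p&gt;Building the combined bundle is just concatenating PEM files. A hedged sketch follows: the function name and file names are made up, and the demo uses placeholder PEM text instead of real certificates.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def build_combined_bundle(base_bundle, corp_ca, out_path):
    """Concatenate a public CA bundle and a corporate root CA into one PEM file."""
    parts = []
    for source in (base_bundle, corp_ca):
        text = Path(source).read_text(encoding="utf-8").strip()
        if "BEGIN CERTIFICATE" not in text:
            raise ValueError(f"{source} does not look like a PEM certificate file")
        parts.append(text)
    out = Path(out_path).expanduser()
    out.write_text("\n".join(parts) + "\n", encoding="utf-8")
    return out

# Demo with placeholder PEM text; these are not real certificates.
FAKE_PEM = "-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----\n"
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp, "public.pem")
    corp = Path(tmp, "corp-root.pem")
    base.write_text(FAKE_PEM.format(body="AAAA"), encoding="utf-8")
    corp.write_text(FAKE_PEM.format(body="BBBB"), encoding="utf-8")
    bundle = build_combined_bundle(base, corp, Path(tmp, "bundle.pem"))
    cert_count = bundle.read_text(encoding="utf-8").count("BEGIN CERTIFICATE")
print(cert_count)  # 2
```

&lt;p&gt;In real use the first argument would typically be &lt;code&gt;certifi.where()&lt;/code&gt; and the output path the &lt;code&gt;~/corp-ca-bundle.pem&lt;/code&gt; file that the two environment variables point at.&lt;/p&gt;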



&lt;p&gt;&lt;strong&gt;Embedding model name enforcement.&lt;/strong&gt; vllm-mlx serves on a fixed model ID and&lt;br&gt;
422s any request with the wrong name. The default &lt;code&gt;text-embedding-ada-002&lt;/code&gt; fallback&lt;br&gt;
in some libraries doesn't match. Set &lt;code&gt;EMBEDDING_MODEL&lt;/code&gt; explicitly &lt;em&gt;before&lt;/em&gt; the&lt;br&gt;
embedding function is imported — production systemd loads it via &lt;code&gt;EnvironmentFile&lt;/code&gt;,&lt;br&gt;
ad-hoc scripts have to source &lt;code&gt;.env&lt;/code&gt; themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MPS memory accounting.&lt;/strong&gt; PyTorch's MPS allocator counts macOS file cache and&lt;br&gt;
inactive pages as &lt;em&gt;"other allocations"&lt;/em&gt; — even though those pages are reclaimable.&lt;br&gt;
With another 32B model already loaded, training OOMed at 19GB MPS allocation&lt;br&gt;
despite 88GB physically free. The workaround removes a safety limit, but it is fine when memory really is free:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PYTORCH_MPS_HIGH_WATERMARK_RATIO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting the ratio to &lt;code&gt;0.0&lt;/code&gt; disables the high-watermark limit on MPS allocations. That is safe &lt;em&gt;only if&lt;/em&gt; you've verified there's free&lt;br&gt;
memory (run &lt;code&gt;vm_stat&lt;/code&gt; first). On a system where physical RAM is genuinely exhausted,&lt;br&gt;
it can take down macOS instead of failing cleanly.&lt;/p&gt;
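
&lt;p&gt;One way to make the &lt;code&gt;vm_stat&lt;/code&gt; check repeatable is to parse its output. The helper below is a rough sketch: it treats free plus inactive pages as the available pool, which is a simplification, and the sample text is invented for the demo, not a capture from the training machine.&lt;/p&gt;

```python
import re

def reclaimable_gib(vm_stat_output):
    """Sum free + inactive pages from `vm_stat` text and convert to GiB."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    pages = 0
    for name in ("Pages free", "Pages inactive"):
        match = re.search(rf"{name}:\s+(\d+)\.", vm_stat_output)
        if match:
            pages += int(match.group(1))
    return pages * page_size / 2**30

# Invented sample in the shape vm_stat prints; the numbers are placeholders.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                             1000000.
Pages active:                            500000.
Pages inactive:                         2000000.
"""
available = round(reclaimable_gib(SAMPLE), 1)
print(available)  # 45.8
```

&lt;p&gt;On a Mac, the input could come from &lt;code&gt;subprocess.run(["vm_stat"], capture_output=True, text=True).stdout&lt;/code&gt;.&lt;/p&gt;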

&lt;p&gt;&lt;strong&gt;&lt;code&gt;launchctl&lt;/code&gt; quirks.&lt;/strong&gt; macOS service management is a footgun farm: &lt;code&gt;launchctl&lt;br&gt;
unload&lt;/code&gt; is deprecated; &lt;code&gt;bootout&lt;/code&gt; sometimes returns I/O error from &lt;code&gt;gui/UID&lt;/code&gt; but&lt;br&gt;
works from &lt;code&gt;user/UID&lt;/code&gt;; &lt;code&gt;KeepAlive=true&lt;/code&gt; respawns killed processes — you must&lt;br&gt;
remove the service from launchd, not just kill it. Lost an evening to this once.&lt;/p&gt;

&lt;h2&gt;
  
  
  When you'd consider doing this
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You have a domain corpus where "relevant" means something specific (legal,
medical, security tickets, internal company docs) — generic English passage
retrieval doesn't capture your relevance signal.&lt;/li&gt;
&lt;li&gt;You have an implicit relevance signal somewhere — clicks, links, analyst
references, ticket relationships, support-case "see also" — that you can mine.&lt;/li&gt;
&lt;li&gt;A stock reranker is already in your pipeline and you've tuned chunking + top-k
and you're out of obvious wins.&lt;/li&gt;
&lt;li&gt;You have a few thousand to a few tens-of-thousands of pairs — you don't need
millions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;A few things, in order of how much they surprised me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hard-negative filter mattered more than the positive-pair mining.&lt;/strong&gt; The&lt;br&gt;
+41% lift would have collapsed to "modestly better than baseline" if I'd kept&lt;br&gt;
those 33% same-rule near-duplicates in the negatives pool. The model would have&lt;br&gt;
spent its capacity learning to push apart things that are actually related and&lt;br&gt;
gotten worse at the real job. The data-quality work was disproportionately&lt;br&gt;
high-leverage; the training loop itself was almost incidental.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The held-out test MRR (0.846) was &lt;em&gt;higher&lt;/em&gt; than the validation MRR (0.811).&lt;/strong&gt;&lt;br&gt;
That's backwards from the usual story where test is the hardest split. My read:&lt;br&gt;
detection rules in late 2025 / early 2026 are slightly clearer-cut than the&lt;br&gt;
mid-2025 rules in the val window, so the test queries were genuinely easier.&lt;br&gt;
Worth a deeper look, but it's also a useful sanity check — the model is&lt;br&gt;
generalizing forward in time, not memorizing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bge-reranker-v2-m3 at 0.598 baseline is surprisingly OK&lt;/strong&gt; for a model that has&lt;br&gt;
never seen a security ticket. Off-the-shelf rerankers are stronger out-of-domain&lt;br&gt;
than I expected. That's both reassuring (you can ship a reasonable RAG without&lt;br&gt;
fine-tuning) and a trap (you can ship a &lt;em&gt;reasonable&lt;/em&gt; RAG without fine-tuning,&lt;br&gt;
and it'll feel "good enough" until you measure properly).&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build the eval harness on day 1.&lt;/strong&gt; I spent too long tuning chunking and top-k&lt;br&gt;
by vibes before I had a number to optimize against. Once the MRR@10 harness&lt;br&gt;
existed, every change was a one-command before/after — and most of the&lt;br&gt;
"improvements" I'd been making earlier turned out to be wash trades. The harness&lt;br&gt;
took an afternoon to build. I would have saved a couple of weeks by starting&lt;br&gt;
there.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Reproducing this is doable in a couple of days if you have a domain corpus with&lt;br&gt;
implicit relevance signal. If you've tried this on your own data, or hit a snag I&lt;br&gt;
didn't, I'd love to hear how it went — reach me on&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/vinay-vobbilichetty" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or by&lt;br&gt;
&lt;a href="mailto:vinayvobbilichetty11@gmail.com"&gt;email&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
