DEV Community: Vinay

Why Self-Hosted Claude Code Was 15x Slower Than It Should Be

Vinay — Sun, 07 Jun 2026 01:55:40 +0000

Update (2026-05-14). The SimpleEngine prefix-cache patch described in
Finding #2 is now upstream as
vllm-mlx PR #523, merged.
If you're on a recent vllm-mlx build, the fix is already there — no local
patching required. The walk-through below is still useful for understanding
what the patch does and why it was needed.

Update (2026-05-18) — two more sharp edges if you're running this for real:

Don't use strict json_schema response_format against sparse-MoE Coder
models. If you also run LangChain (or any OpenAI-compatible client) with
structured outputs against the same vllm-mlx instance, prefer
with_structured_output(schema, method="json_mode") over the LangChain
default "json_schema". The strict path triggers grammar-constrained
decoding which has hung on Qwen3-Coder-30B-A3B for 5+ minutes per call —
and a wedged decoder starves every queued request, including your Claude
Code session, until the server restarts. Filed upstream as
vllm-mlx#546.

PR #523 fixes the single-slot system-KV cache. You probably also want a
multi-slot variant. Claude Code sub-agents (Explore, Plan,
general-purpose) carry different tool sets, so each one's system prefix
differs from the main agent's. With a single-slot snapshot, every
sub-agent dispatch evicts the main agent's cache and vice versa, and you
pay the full ~28K-token cold prefill every turn. The multi-slot LRU
follow-up is local for now — upstream PR pending.
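To make the multi-slot idea concrete, here is a minimal sketch of an LRU keyed by the system-prefix hash. It is not the local follow-up patch itself: the class name `SystemKVSlots`, the `max_slots` default, and the treatment of snapshots as opaque objects are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Any, Optional


class SystemKVSlots:
    """Hypothetical multi-slot LRU for system-prefix KV snapshots."""

    def __init__(self, max_slots: int = 4) -> None:
        self.max_slots = max_slots
        # Maps system_hash -> (snapshot, token_count), oldest entry first.
        self._slots = OrderedDict()

    def get(self, system_hash: str, token_count: int) -> Optional[Any]:
        """Return the snapshot on a hit (and mark it most recent), else None."""
        entry = self._slots.get(system_hash)
        if entry is None or entry[1] != token_count:
            return None
        self._slots.move_to_end(system_hash)
        return entry[0]

    def put(self, system_hash: str, snapshot: Any, token_count: int) -> None:
        """Store a snapshot, evicting the least recently used slot if full."""
        self._slots[system_hash] = (snapshot, token_count)
        self._slots.move_to_end(system_hash)
        while len(self._slots) > self.max_slots:
            self._slots.popitem(last=False)
```

With something like this, the main agent and each sub-agent keep their own slot, so a sub-agent dispatch only costs a cold prefill the first time that particular tool set shows up. The trade-off is memory: every slot holds a full KV snapshot, so `max_slots` has to stay small.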

TL;DR

I run Claude Code against a
self-hosted vllm-mlx backend on a Mac
Studio. Cold turns took ~108 seconds. Follow-ups took almost the same, even
though the system prompt was byte-stable and any LLM engine worth its salt should
be caching the prefix.

Two findings, both required to get the speedup:

Claude Code injects a rotating x-anthropic-billing-header value into the system block on every turn. Even though the user-visible system prompt doesn't change, the bytes the engine hashes for cache lookup do change every request. The prefix cache misses 100% of the time. Strip the header at the proxy layer and the cache becomes useful.
vllm-mlx's SimpleEngine doesn't carry KV state across requests. Even with the rotating header gone, you have to patch SimpleEngine to actually cache the system prefix between turns — a small single-slot, hash-keyed cache that restores the snapshot on a hit and prefills only the suffix.

Together: 108-second turns → 7-8 second follow-ups. A 13-15× speedup, on the
same hardware, with the same model.

108s → 7-8s: warm-turn wall-clock, before vs after

13-15×: follow-up speedup, same hardware + model

81 bytes: of rotating header text that was costing 100+s/turn

The setup

flowchart LR
    CC[Claude Code CLI] -->|/v1/messages<br/>system + tools + msgs<br/>+ rotating cch=...| CCR[claude-code-router]
    CCR --> Shim["Shim<br/><b>(1) strips x-anthropic-billing-header</b><br/>(2) buffers tool-call streams"]
    Shim -->|byte-stable<br/>system prefix| VLLM[vllm-mlx server]
    VLLM --> SE["SimpleEngine<br/><b>(3) system-prefix KV cache</b><br/>HIT: skip prefill<br/>MISS: prefill + snapshot"]
    SE -->|stream tokens| CC
    style Shim fill:#1e40af,color:#fff
    style SE fill:#7c2d12,color:#fff

Points (1) and (3) are where the speedup comes from; (2) is there for tool-call
correctness, not latency. Take away (1) and (3) and you're back to 100+ second turns.

Backend: vllm-mlx serving Qwen2.5-Coder-32B-Instruct-8bit on a Mac Studio (96 GB).
Front door: a small FastAPI shim that exposes Anthropic's /v1/messages API and proxies to vllm-mlx.
Routing: claude-code-router translates Claude Code's outbound calls to the shim's URL with a bearer token.
Client: Claude Code, the CLI.

End-to-end the architecture worked. Tool calling worked. Streaming worked. Output
quality was fine. It was just slow — and slow in a way that didn't match how any
of this is supposed to behave.

For context: Claude Code's prompts are large. Measured across captured requests
on this setup, the cacheable prefix — Claude Code's system instructions plus the
tool-definitions block — runs around 23,000 tokens (≈5.6K system + ≈17.6K
tools, for a 23-tool toolset). With a working prefix cache, only the new user
message and the conversation tail need to be processed each turn — typically a
few hundred tokens. Without one, the engine re-prefills ~23K tokens every.
single. turn. On a 32K-context model, that leaves about 9K headroom for the
conversation and output, which is fine — but only if you're not throwing away the
prefix work each turn.

What I expected vs what I observed

	Cold turn	Warm turn
Stock vllm-mlx, no shim	108 s	~100 s
Shim strips billing header only	108 s	~70 s
Shim strips header + SimpleEngine KV-cache patch	108 s	7-8 s

The cold-turn number doesn't change — there's no cache to hit on the first
request. The warm-turn delta is the whole story.

Finding #1: the rotating billing header

The first useful diagnostic was diffing the raw bytes of two consecutive
/v1/messages requests from Claude Code. Almost everything was identical: system
prompt, tool definitions, conversation history, sampling params. But there was
one block in the system list that changed every turn:

{"type": "text",
 "text": "x-anthropic-billing-header: cc_version=...; cc_entrypoint=cli; cch=<rotating-hash>"}

Claude Code injects this. The cch= value rotates per request — Anthropic uses
it for billing and conversation tracking. On Anthropic's hosted API, the cache
layer normalizes around it and there's no impact. On a self-hosted backend that
simply hashes the prompt as-is, the rotating value invalidates the cache key on
every request. Every turn looks brand new to the engine, because every turn
is brand new.

The fix at the shim is a one-function filter:

def _strip_billing_header(payload: dict) -> None:
    """Drop Claude Code's `x-anthropic-billing-header` system block.

    Claude Code injects a small system block of the form
    `x-anthropic-billing-header: cc_version=...; cc_entrypoint=cli; cch=<hash>`
    whose `cch` value rotates every turn. Anthropic's cloud uses it for billing
    tracking; local upstreams just see it as 81 bytes of system text. With our
    SimpleEngine prefix-KV cache, that rotating field changes the system-prefix
    hash each turn → every turn is a cache miss → 100s+ prefill on the
    ~23K-token system+tools prefix. Removing this block makes the system
    prefix byte-stable turn-over-turn so the cache actually hits.
    """
    system = payload.get("system")
    if not isinstance(system, list):
        return
    payload["system"] = [
        b for b in system
        if not (
            isinstance(b, dict)
            and isinstance(b.get("text"), str)
            and b["text"].lstrip().lower().startswith("x-anthropic-billing-header")
        )
    ]

81 bytes of rotating text was costing 100+ seconds per turn. Not a great trade.
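A quick way to convince yourself the filter does what it claims is to hash the system list before and after stripping. The sketch below repeats a condensed copy of the filter above so it runs on its own; the `system_digest` helper, the `make_turn` payloads, and the `cc_version`/`cch` values are made up for illustration, and the real engine hashes the rendered prompt rather than this JSON.

```python
import hashlib
import json


def strip_billing_header(payload: dict) -> None:
    """Condensed copy of the filter above: drop the billing-header block in place."""
    system = payload.get("system")
    if not isinstance(system, list):
        return
    payload["system"] = [
        b for b in system
        if not (
            isinstance(b, dict)
            and isinstance(b.get("text"), str)
            and b["text"].lstrip().lower().startswith("x-anthropic-billing-header")
        )
    ]


def system_digest(payload: dict) -> str:
    """Stand-in for a naive prefix-cache key: hash the system list as-is."""
    blob = json.dumps(payload.get("system"), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]


def make_turn(cch: str) -> dict:
    """Build an illustrative request body; only the cch value varies."""
    header = f"x-anthropic-billing-header: cc_version=0.0.0; cc_entrypoint=cli; cch={cch}"
    return {
        "system": [
            {"type": "text", "text": header},
            {"type": "text", "text": "placeholder system instructions"},
        ],
        "messages": [{"role": "user", "content": "hi"}],
    }


turn1, turn2 = make_turn("aaa111"), make_turn("bbb222")
digests_before = (system_digest(turn1), system_digest(turn2))
strip_billing_header(turn1)
strip_billing_header(turn2)
digests_after = (system_digest(turn1), system_digest(turn2))

# Before stripping the keys differ; after stripping they match.
print(digests_before[0] != digests_before[1], digests_after[0] == digests_after[1])
```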

Note. vllm-mlx PR #277 quietly does the same fix for the /v1/messages
endpoint. If you're on a recent build of vllm-mlx and using its native
Anthropic adapter, you may already be covered. I run my own shim (for tool-call
buffering on the Coder alias — vllm-mlx's Hermes parser streams tool JSON as
content deltas, which doesn't round-trip cleanly to clients), so I had to
strip the header myself.

After this fix, warm turns dropped from ~100 s to ~70 s. A real win, but the
prefix cache should have been saving 95+ seconds, not 30. So either the cache
wasn't engaging at all, or it was engaging only partially. Onward.

Finding #2: SimpleEngine wasn't actually caching the prefix

vllm-mlx ships two engines — both MLX-native, neither is upstream vLLM's
PagedAttention/CUDA core (which doesn't run on Apple Silicon at all).
engine/simple.py is "Simple engine for maximum single-user throughput.
Wraps mlx-lm directly with zero overhead for optimal performance when serving
a single user at a time." engine/batched.py is "Batched engine for
continuous batching with multiple concurrent users." For a single-user
Claude Code session, SimpleEngine is the right pick — no scheduler, no
batching wait, direct access to mlx-lm's prompt cache. BatchedEngine wins
when multiple users hit the same backend concurrently.

SimpleEngine was what I was using. Profiling showed prefill running across the
full system + tool prefix on every turn, even after the billing header was
gone. The cache hit rate was effectively zero.

The reason: SimpleEngine's request handler doesn't carry KV state from the
previous request to the next. Each request gets a fresh prompt cache via
make_prompt_cache(model) and prefills the whole prompt from scratch. There's
no across-requests cache to hit — the prefix cache lives only inside a single
request.

The fix was a small patch: add a single-slot, hash-keyed system-prefix KV
cache to SimpleEngine. Detect the system prefix using the ChatML markers that
delimit it, hash the prefix text, check that the prefix token count also matches, and:

On a hit, restore the saved KV snapshot and prefill only the suffix.
On a miss, prefill the system prefix in chunks, snapshot the resulting KV state, and store it (overwriting the previous slot).

# Excerpt from system_kv_cache_for_simple_engine.patch — a few of the load-bearing lines.

system_hash = hashlib.sha256(system_prefix_text.encode()).hexdigest()[:16]

# ...
if (
    system_hash == self._system_kv_hash
    and self._system_kv_snapshot is not None
    and system_token_count == self._system_kv_token_count
):
    cache_hit = True
    logger.info("System KV cache HIT: reusing %d tokens, prefilling %d new (hash=%s)",
                system_token_count, len(suffix_tokens), system_hash)
else:
    logger.info("System KV cache MISS: will prefill %d system + %d suffix tokens (hash=%s)",
                system_token_count, len(suffix_tokens), system_hash)

# On HIT, restore the saved cache state and skip system prefill:
if cache_hit:
    bc = make_prompt_cache(model)
    for i, saved_state in enumerate(self._system_kv_snapshot):
        bc[i].state = saved_state

# On MISS, prefill the system prefix in chunks, then snapshot:
else:
    # ... chunked prefill of system_tokens ...
    self._system_kv_snapshot = [c.state for c in bc]
    self._system_kv_hash = system_hash
    self._system_kv_token_count = system_token_count

A few design choices that mattered:

Single slot, not LRU. A Claude Code session has one conversation at a time, so multi-slot looked like overkill when I wrote the patch (sub-agents change that; see the 2026-05-18 update at the top). The slot just stores (hash, snapshot, token_count) and overwrites on miss.
Hash the prefix only, not the full prompt. That way the cache survives new user messages on the tail end — which is the common case.
ChatML marker detection. The boundary between system and user is found by searching the rendered prompt for <|im_start|>user\n or <|im_start|>assistant\n. If neither marker is found, fall back to the uncached path and don't break.
Safe fallback on any exception. If the cache-aware path fails for any reason, log a warning and fall back to the original stream_generate. Don't let a perf optimization take down generation.
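The marker detection and prefix-only hashing can be sketched in a few lines. This is an illustration of the idea, not the merged code: the function names `split_system_prefix` and `prefix_key` are hypothetical, and it works on the rendered prompt string only.

```python
import hashlib
from typing import Optional, Tuple

# The two ChatML turn markers the boundary search looks for.
CHATML_TURN_MARKERS = ("<|im_start|>user\n", "<|im_start|>assistant\n")


def split_system_prefix(rendered_prompt: str) -> Optional[Tuple[str, str]]:
    """Return (system_prefix, suffix), or None if neither marker is present."""
    positions = [rendered_prompt.find(m) for m in CHATML_TURN_MARKERS]
    positions = [p for p in positions if p != -1]
    if not positions:
        return None
    cut = min(positions)  # the first user/assistant turn ends the prefix
    return rendered_prompt[:cut], rendered_prompt[cut:]


def prefix_key(rendered_prompt: str) -> Optional[str]:
    """Hash only the prefix, so new user messages don't change the key."""
    parts = split_system_prefix(rendered_prompt)
    if parts is None:
        return None  # caller should fall back to the uncached path
    return hashlib.sha256(parts[0].encode()).hexdigest()[:16]
```

Two prompts that share a system block but differ in the user turn produce the same key, which is exactly the property that lets the cache survive follow-up messages.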

The full patch is upstream as
vllm-mlx PR #523. Review
hardened the original cut: closure-local capture at the gate to close a
TOCTOU race against the snapshot pointer, and an init-time probe that disables
the cache for sliding-window models whose RotatingKVCache aliases buffers the
engine mutates in place. The merged code is the right reference to read if
you're curious about the cache mechanics.

The numbers

After both fixes were in place, warm-turn wall-clock dropped from ~70 s
(billing-header fix alone) to 7-8 s (billing-header fix + SimpleEngine KV
cache). The cold turn is unchanged — there's no prior turn to cache against on
the first request — but the cache hit rate from turn 2 onward is essentially
100%, and the speedup is large enough that Claude Code becomes interactive
instead of glacial.

What I'd do differently

Diff the inputs before profiling the engine. The billing header would have
fallen out of a 30-second diff of two consecutive request bodies. I didn't run
that diff for far too long — I was looking at vllm-mlx internals, profiling
prefill, reading mlx-lm cache code, anything but the actual bytes going over
the wire. Once I finally did the diff, the rotating cch= value was on the
screen in five minutes.

That has become a personal rule for any latency mystery on a black-box stack:
capture two consecutive requests, diff them, look at what's not stable
before assuming the engine is misbehaving. It would have saved me an evening
on this one and I suspect it'll save me more.
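For what it's worth, the diff doesn't need tooling beyond a short script. Here's a minimal sketch that walks two parsed request bodies and reports the JSON paths that differ; the helper name `unstable_paths` and the sample bodies are made up for illustration, and how you capture the bodies (shim logging, a proxy dump) is up to you.

```python
from typing import Any, Iterator


def unstable_paths(a: Any, b: Any, path: str = "$") -> Iterator[str]:
    """Yield JSON paths whose values differ between two parsed request bodies."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            if key not in a or key not in b:
                yield f"{path}.{key}"
            else:
                yield from unstable_paths(a[key], b[key], f"{path}.{key}")
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield f"{path} (length {len(a)} vs {len(b)})"
        # Compare the overlapping elements position by position.
        for i, (x, y) in enumerate(zip(a, b)):
            yield from unstable_paths(x, y, f"{path}[{i}]")
    elif a != b:
        yield path


# Illustrative bodies: same tools, growing conversation, one rotating system block.
turn_a = {
    "system": [{"type": "text", "text": "hdr cch=111"}, {"type": "text", "text": "stable"}],
    "tools": [{"name": "Read"}],
    "messages": [{"role": "user", "content": "first"}],
}
turn_b = {
    "system": [{"type": "text", "text": "hdr cch=222"}, {"type": "text", "text": "stable"}],
    "tools": [{"name": "Read"}],
    "messages": [
        {"role": "user", "content": "first"},
        {"role": "assistant", "content": "ok"},
        {"role": "user", "content": "second"},
    ],
}

for p in unstable_paths(turn_a, turn_b):
    print(p)
```

A growing `messages` list is expected between turns. Anything that shows up under `system` or `tools` is the part worth staring at, because that is the prefix the cache depends on.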

The second thing I'd change: the SimpleEngine cache patch should have come
after I'd quantified what the billing-header strip alone bought me. I lumped
both fixes in the same session, which made it harder to attribute the speedup
cleanly. The numbers in this post are reconstructed from a follow-up
measurement; if I'd been disciplined the first time, I'd have had them ready.

When you'd hit this

You'll hit some version of this if you:

Self-host an Anthropic-compatible LLM backend (vllm-mlx, llama.cpp's Anthropic adapter, a custom shim, etc.) and point Claude Code or another Anthropic-protocol client at it.
Notice that warm turns aren't faster than cold turns even though your system prompt is byte-stable.
See the engine's prefill phase running across the full prompt every turn in profiling.

If you're using Anthropic's hosted API, none of this applies — the platform
handles the billing header and prefix caching transparently.

Reproducing this

The two pieces that make the speedup happen:

Billing-header strip — about 15 lines of FastAPI shim code that filter the rotating x-anthropic-billing-header block out of the system list before the payload reaches vllm-mlx. Identical logic to what vllm-mlx PR #277 does natively on the /v1/messages adapter; you only need a shim if you're not on that path.
SimpleEngine prefix-cache — now upstream as vllm-mlx PR #523. Read the merged code if you want the cache mechanics; the load-bearing logic is the hash check, the snapshot capture on miss, and the safe fallback when the detection fails.

Credits

vllm-mlx PR #277 found the
billing-header issue independently for the /v1/messages endpoint. If you're
using vllm-mlx's native Anthropic adapter rather than your own shim, that's the
right upstream fix. The SimpleEngine prefix-cache patch landed in
vllm-mlx PR #523 — thanks to
the maintainers for the review, which improved the patch in two specific ways
(closure-local capture against a TOCTOU on the snapshot pointer, and a
sliding-window guard for RotatingKVCache).

If you've hit this too, or your self-hosted Claude Code setup is slow for a
different reason I haven't found yet, I'd love to hear about it — reach me on
LinkedIn or by
email.

SOC-in-a-Box: One LLM, Eight Hats, A Production-Bar AI SOC on a Single GPU

Vinay — Sun, 07 Jun 2026 01:55:20 +0000

TL;DR

A real SOC runs 24×7 with eight or nine distinct roles — alert triage, deeper investigation, incident response, threat intel, detection tuning, hunting, shift management, and a human approver for any destructive action. We built an AI version of that whole org chart, coordinated over a Redis Streams bus, with one local LLM (GLM-4.7-Flash on a Mac M1) wearing every hat. v1 is read-only against real systems; the only writes are XSOAR notes and Webex cards, plus a human-approval gate on every proposed containment action.

8 roles: Sentinel · Tier 2 · IR Lead · Threat Intel · SOC Manager · Detection Eng · Threat Hunter · HITL

1 LLM: m1 GLM-4.7-Flash via vllm-mlx, with FailoverChatModel to a studio1 backup

0 writes: to CrowdStrike, Tanium, Zscaler; agents propose, humans execute

The interesting parts aren't the agents themselves — there's nothing novel about an LLM-with-tools loop. The interesting parts are: (1) the architectural choices that let one local LLM serve a whole SOC org chart without melting, (2) the human-in-the-loop gate that makes "AI does containment" a real thing a security team will actually trust, and (3) a backtest harness that lets us put hard numbers on agent quality against real historical tickets before we hand the demo to leadership.

The shape of the problem

A SOC is not a chatbot. It's a 24×7 event-driven pipeline:

Alerts land continuously, not on demand. The system has to be running and consuming events even when nobody is asking it a question.
Roles are independent. Tier 2 doesn't ask Tier 1 a question — it picks up the verdict Tier 1 already published and goes deeper. Threat Intel runs after IR Lead, not as part of IR Lead's reasoning.
Some roles are reactive, some are periodic. Sentinel reacts to each new ticket; SOC Manager produces a shift summary every 8 hours; Threat Hunter sweeps the audit log twice a day; Detection Engineer reviews noisy rules on weekday mornings.
Destructive actions need a human gate. An AI that auto-isolates hosts at 3 AM will get unplugged in a month. The interesting question is: what does the handoff look like?
Auditability is non-negotiable. Every decision needs to be replayable for incident retros and tuning.

The architecture is a consequence of those constraints, not the other way around.

Architecture at a glance

flowchart TB
    XSOAR[XSOAR ticket feed] --> Sentinel
    subgraph Bus["Redis Streams bus (lab-vm1)"]
        direction LR
        STRG[soc.triage]
        SCAS[soc.cases]
        SAUD[soc.audit]
    end
    Sentinel[Sentinel / Tier 1<br/>alert triage] --> STRG
    STRG --> Tier2[Tier 2 Analyst<br/>deeper investigation]
    Tier2 --> SCAS
    SCAS --> IR[IR Lead<br/>SEV + containment plan]
    IR --> SCAS
    IR -.HITL action proposed.-> Flask[Flask HITL pages<br/>/soc-hitl/decide<br/>/soc-hitl/audit]
    Webex[Webex card buttons] -.click.-> Flask
    Flask --> SCAS
    SCAS --> TI[Threat Intel<br/>actor + MITRE]
    TI --> SCAS

    SAUD -.replay.-> SOCMgr[SOC Manager<br/>shift summaries<br/>06/14/22 EST]
    SAUD -.replay.-> DetEng[Detection Engineer<br/>rule tuning<br/>09:00 EST M-F]
    SAUD -.replay.-> Hunter[Threat Hunter<br/>pattern sweeps<br/>06/18 EST]
    STRG -.mirror.-> SAUD
    SCAS -.mirror.-> SAUD

    style Bus fill:#1e3a8a,color:#fff
    style Flask fill:#fef3c7,color:#92400e
    style SAUD fill:#7c2d12,color:#fff

Three reactive roles (Tier 2 / IR Lead / Threat Intel) are long-running consumers on the bus. Three periodic roles (SOC Manager / Detection Engineer / Threat Hunter) are systemd timer units that wake up on a calendar schedule, replay the audit stream, and emit a report. The HITL surface is a Flask blueprint sitting next to the existing IR web app.

What we considered

The framework choice was load-bearing — once you pick wrong, every later abstraction fights you. We evaluated five paths:

1. CrewAI

CrewAI is excellent at what it's designed for: a "crew" of role-shaped agents collaborating on one task. Declarative Agent + Task + Process (sequential or hierarchical), with strong primitives for delegation between agents inside the same crew.

The mismatch for a SOC: CrewAI assumes a single in-process orchestration run. "Spin up a crew, run a task, get an output." Our roles aren't a crew — they're independent processes with their own uptime, audit, restart semantics, and HITL gates. CrewAI's human-in-the-loop is human_input=True on a Task — a blocking stdin prompt during the crew run. That doesn't survive a "Webex card → Flask page → SQLite sidecar → bus event back into the cascade" flow. We'd lose audit-stream replay, backtest, and per-role systemd uptime if we forced this shape.

Where CrewAI could slot in: as the internal reasoning of a single role. e.g. inside Tier 2's handle(), swap one LLM-with-tool-loop for a small crew (investigator + critic + decider) before emitting the Tier2Analysis event. Same bus architecture, more sophisticated per-role thinking. Probably not worth it yet — Tier 2 with a critic-loop pattern works.

2. AutoGen

AutoGen is conversation-shaped: agent-to-agent chat with an explicit GroupChat manager. Great for "two LLMs argue and converge on an answer" — code-writer vs code-reviewer, advocate vs critic.

The mismatch: a SOC isn't a conversation. Tier 2 doesn't talk to Tier 1; it consumes Tier 1's verdict. The chat-history-as-state model imposes a context-window tax on a problem that doesn't need it, and the GroupChat orchestrator becomes a load-bearing thing you can't restart independently.

3. Plain LangChain (no graph, no bus)

The path of least resistance: write a Python function for each role, chain them together, run synchronously. We started here actually, then noticed the smell. The synchronous chain forces every role to wait for the previous one, eliminates per-role restart, makes HITL impossible without a hack, and gives you no audit log unless you build one separately.

If you only have two roles, just do this. We had eight.

4. n8n / Zapier / visual workflow tools

n8n and similar visual workflow tools were on the list for one specific reason: leadership likes seeing the boxes-and-arrows. But the LLM nodes aren't first-class — you'd be wrapping every model call in HTTP, and the graph is in a database, not in code that's reviewable in a PR. Auditability and reproducibility are both worse than the LangGraph + bus path. (n8n is a great fit for non-LLM SOAR-style automations, just not for this.)

5. Build-from-scratch asyncio + Redis Streams

The honest baseline. Python asyncio workers, Redis Streams consumer groups, no agent framework. Saves you the framework abstraction, costs you the prompt + tool-loop + state-management plumbing that LangGraph and friends do for free. For a one-role POC, fine. For eight roles, you reinvent LangChain badly.

What we picked: LangGraph for per-role reasoning + Redis Streams for inter-role coordination

LangGraph gives us a clean per-role tool-loop with explicit state, and Redis Streams gives us the inter-role coordination — durable events, consumer groups for at-least-once delivery, an audit stream that's just another consumer, and the easy retrofit of new roles without touching existing ones.

The split is the point: LangGraph is the agent runtime, the bus is the org chart. Don't conflate them.
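To show what "consumer groups for at-least-once delivery" means in practice, here is a minimal sketch of one read-handle-ack cycle. It assumes `client` is a redis-py `redis.Redis` instance created with `decode_responses=True` and that the group already exists (for example via `xgroup_create`); the function name `consume_once` and the group and consumer names in the usage note are hypothetical, not the module's actual code.

```python
from typing import Any, Callable, Dict


def consume_once(
    client: Any,
    stream: str,
    group: str,
    consumer: str,
    handle: Callable[[Dict[str, str]], None],
    count: int = 10,
    block_ms: int = 5000,
) -> int:
    """Read one batch of new entries for this group, ack only what was handled."""
    # ">" asks for entries never delivered to any consumer in this group.
    batches = client.xreadgroup(group, consumer, {stream: ">"}, count=count, block=block_ms)
    handled = 0
    for _stream_name, entries in batches or []:
        for entry_id, fields in entries:
            try:
                handle(fields)
            except Exception:
                # No XACK: the entry stays in the group's pending list for a retry.
                continue
            client.xack(stream, group, entry_id)
            handled += 1
    return handled
```

A role's main loop would call something like `consume_once(client, "soc.triage", "tier2", "tier2-1", handle_alert)` repeatedly. Acking only after the handler returns is what makes delivery at-least-once, which in turn means handlers should tolerate seeing the same event twice. Retrying the pending entries needs a separate read of the pending list, which this sketch leaves out.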

One LLM, many hats

We run one local LLM — GLM-4.7-Flash 8-bit on a Mac M1 (64 GB) via vllm-mlx — and every role calls it with a different system prompt and a different tool whitelist. The resilience comes from a FailoverChatModel (first described in an earlier post) that transparently falls back to a Qwen3 backup on a studio1 box if the m1 dies, and flips back the moment the primary recovers.

📦 New — we open-sourced it. That FailoverChatModel is now a standalone, dependency-light package on PyPI: langchain-failover. pip install langchain-failover, point it at two chat models, and you get the same primary/secondary failover that keeps this SOC's brain online when a GPU box drops off — connection-aware (it walks the exception's cause chain), recovery-aware (logs the flip back), and mid-stream-safe. The non-obvious part it gets right: tool-calling survives the failover — it binds your tools on both legs, so an agent mid-investigation doesn't lose its tools the instant it fails over. That's exactly what a SOC role needs at 3 AM. Source, tests, and docs: github.com/vinayvobbili/langchain-failover. 🚀

Why not multiple model providers per role?

Cost — one model loaded once. The Mac M1 holds 35B params at 8-bit comfortably and tool-calls reliably with the glm47 parser.
Latency — no inter-provider hop, no API rate limits to coordinate.
Operational simplicity — one health check, one auth header, one log file.
The roles aren't actually different intelligences: they're the same intelligence with different prompts, tool budgets, and JSON output schemas. Tier 2 gets a budget of 30 tool calls; IR Lead gets 15; Threat Intel gets 12.

The thing GPT-4 or Claude would buy us isn't better reasoning on any one role — it's worse cost economics for a 24×7 deployment. We may revisit for SEV-1 hardest cases, but the default is local.

The roles

Role	Driver	Trigger	Bus output	Real-system side effect
Sentinel (Tier 1)	XSOAR poller	New tickets	`AlertTriaged`	XSOAR triage note
Tier 2 Analyst	Long-running consumer	`alert.triaged` where TP-malicious or pri ≥ 7	`Tier2Analysis`, `CaseEscalated`	Webex card on escalation
IR Lead	Long-running consumer	`case.escalated → ir_lead`	`IRPlan`, `ActionProposed`	Webex card with HITL buttons, XSOAR plan note
Threat Intel	Long-running consumer	`ir.plan`	`ThreatIntelReport`	Webex card, XSOAR attribution note
SOC Manager	Timer 06/14/22 EST	calendar	`ShiftSummary`	Webex card
Detection Engineer	Timer 09:00 EST M-F	calendar	`DetectionTuningReport`	Webex card
Threat Hunter	Timer 06/18 EST	calendar	`HuntingReport`	Webex card
HITL Flask	Browser button click	Approval link in Webex card	`ActionDecision`	Decision logged to sidecar SQLite

Every role's verdict is also persisted to a verdicts.sqlite sidecar with wall_time_ms, tool_calls_made, and (in backtest mode) ground_truth, so we can compute agreement rates and latency distributions without instrumenting OpenTelemetry on day one.
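A sidecar like this is small enough to sketch with the standard library. The table layout below is an assumption built around the three fields named above; the `ticket_id`, `role`, and `verdict` columns and the helper names are hypothetical, not the module's actual schema.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS verdicts (
    ticket_id       TEXT NOT NULL,
    role            TEXT NOT NULL,
    verdict         TEXT NOT NULL,
    wall_time_ms    INTEGER NOT NULL,
    tool_calls_made INTEGER NOT NULL,
    ground_truth    TEXT  -- only populated in backtest mode
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the sidecar and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_verdict(conn, ticket_id, role, verdict, wall_time_ms,
                   tool_calls_made, ground_truth=None) -> None:
    """Persist one role verdict with its latency and tool-call count."""
    conn.execute(
        "INSERT INTO verdicts VALUES (?, ?, ?, ?, ?, ?)",
        (ticket_id, role, verdict, wall_time_ms, tool_calls_made, ground_truth),
    )
    conn.commit()


def agreement_rate(conn, role: str) -> Optional[float]:
    """Fraction of a role's backtest verdicts that match ground truth."""
    row = conn.execute(
        "SELECT AVG(verdict = ground_truth) FROM verdicts "
        "WHERE role = ? AND ground_truth IS NOT NULL",
        (role,),
    ).fetchone()
    return row[0]  # None when the role has no backtest rows
```

Latency distributions fall out of the same table with a query over `wall_time_ms` grouped by role.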

The HITL gate

The hardest design question wasn't "should the AI take the containment action?" It was "how does the AI hand off to a human in a way the human will actually engage with?"

We tried a few patterns. The one that works:

sequenceDiagram
    participant IR as IR Lead<br/>(LLM agent)
    participant Bus as Redis Streams
    participant Webex as Webex card<br/>(Pokedex bot)
    participant Human as Approver
    participant Flask as Flask HITL page
    participant SQLite as hitl.sqlite

    IR->>Bus: IRPlan + ActionProposed<br/>(approver_role="IR Lead On-Call")
    IR->>Webex: Card with 2 buttons<br/>(Action.OpenUrl)
    Webex->>Human: Banner: "🎯 Action required from: IR Lead On-Call"
    Human->>Flask: Click Approve or Reject
    Flask->>Human: Confirmation page<br/>(login_required, DEMO MODE banner)
    Human->>Flask: Submit decision
    Flask->>SQLite: Persist decision
    Flask->>Bus: ActionDecision (approved|rejected, dummy=True)
    Note over Bus: v2 future: executor agent<br/>consumes approved decisions

Three details matter:

The approver is addressed. The card banner says "Action required from: IR Lead On-Call" — not "click here to approve." The team knows whose mailbox each card is in.
The Flask confirmation page sits between the click and the recorded decision. Single-click approve from a Webex card was tempting but wrong — accidental clicks would auto-execute. The two-step (click button → see page → click submit) is the friction we want.
v1 doesn't actually execute. The decision is logged, an ActionDecision event is published, and the demo wraps up there. v2 — an executor agent that consumes action.decision[approved] and calls CrowdStrike RTR / Zscaler / Tanium — is straightforward to add once leadership trusts the loop. Trust is earned in v1, not asserted by skipping the gate.

Putting numbers on it: the backtest harness

The hardest sell to a SOC director isn't "we built it." It's "how do we know it works before we put it on real alerts?"

We have an XSOAR timeline database with 32K+ historical CrowdStrike tickets, each with an escalation_state field that tells us whether a human Tier 1 closed it or a human Tier 2/Tier 3 picked it up. That's our ground truth — analyst-curated, no extra labelling required.

The backtest harness samples N closed tickets stratified 50/50 between human-escalated and human-closed, then replays each through the agent cascade with all side effects neutered: bus publishes captured in-memory, Webex sends no-op'd, XSOAR writes no-op'd, HITL store stubbed.
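The 50/50 stratification is simple to sketch. The version below assumes each ticket is a dict whose `escalation_state` is either `"escalated"` or `"closed"`; those two values and the function name are assumptions, since the real field may use different labels.

```python
import random
from typing import Dict, List


def stratified_sample(tickets: List[Dict], n: int, seed: int = 0) -> List[Dict]:
    """Draw up to n tickets, half human-escalated and half human-closed."""
    rng = random.Random(seed)  # seeded so a backtest run is repeatable
    escalated = [t for t in tickets if t["escalation_state"] == "escalated"]
    closed = [t for t in tickets if t["escalation_state"] == "closed"]
    # Shrink both sides together if one stratum is too small.
    per_side = min(n // 2, len(escalated), len(closed))
    sample = rng.sample(escalated, per_side) + rng.sample(closed, per_side)
    rng.shuffle(sample)
    return sample
```

Balancing the strata matters because escalations are typically the rare class. A uniform sample would be dominated by closed tickets and say little about recall.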

For each ticket we record:

Sentinel's verdict + priority
Whether Tier 2 engaged, and what it decided (escalate / close / needs human review)
Whether IR Lead engaged, and what SEV it assigned
Whether Threat Intel engaged, and what actor it attributed
Wall time and tool-call count for each stage

Then we compute the confusion matrix of cascade-escalated-to-IR-Lead vs human-escalated-in-real-life:

TP  human escalated  AND  Tier 2 escalated → IR Lead
FN  human escalated  BUT  Tier 2 closed
FP  human closed     BUT  Tier 2 escalated → IR Lead
TN  human closed     AND  Tier 2 closed

Precision and recall on TP/FP/FN give us the numbers leadership wants — "how often does the AI escalate when humans actually would, and how often does it cry wolf?" The summary lands in a JSON file that the dashboard panel reads, so the question gets a number, not a vibe.
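The arithmetic behind those numbers is short enough to show. This sketch takes (human_escalated, agent_escalated) pairs and produces the four cells plus precision and recall; the function names are hypothetical and the harness's actual output format is not shown here.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def confusion(pairs: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count TP/FN/FP/TN from (human_escalated, agent_escalated) pairs."""
    counts = Counter()
    for human, agent in pairs:
        if human and agent:
            counts["TP"] += 1
        elif human and not agent:
            counts["FN"] += 1
        elif agent:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def precision_recall(counts: Counter) -> Tuple[Optional[float], Optional[float]]:
    """precision = TP/(TP+FP), recall = TP/(TP+FN); None when undefined."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall
```

Precision answers "how often does it cry wolf?" and recall answers "how often does it escalate when humans actually would?". TN never enters either formula, which is why the sentence above only mentions TP, FP, and FN.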

The harness also has a --dry-run mode that swaps the LLM for a canned-JSON stub, so we can validate the plumbing end-to-end in under 2 seconds without burning a single token — and that same harness drives the real-LLM run against a full stratified sample when we want actual agreement numbers rather than a smoke test.

What surprised us

Three things, in order of how much they changed the design:

The bus is more important than the agents. We spent the first week tuning prompts. The unlock was when we got the Redis Streams + audit-replay pattern right — at that point, adding a new role became a 200-line file plus a systemd unit, and the existing roles didn't have to know. That's worth more than another 5% on any single agent's quality.
Timer-driven roles are underrated. SOC Manager / Detection Engineer / Threat Hunter run on a calendar schedule, not on events. They get the same audit stream, so they see everything the reactive agents did, plus everything the audit stream caught that no reactive agent engaged on. Detection Engineer in particular finds tuning candidates a reactive role would never see — "this rule fired 47 times this week and 41 were closed as benign by Tier 1."
The right level of role granularity isn't obvious. We went back and forth on whether Tier 2 + IR Lead should be one role or two. They're two. Tier 2's job is "is this real and how bad?"; IR Lead's job is "given it's real, what's the plan?" Conflating them puts SEV classification in the same prompt as evidence-gathering and the model loses focus. Same with Threat Intel — keeping attribution out of IR Lead's prompt makes both roles tighter.

Try it / What's next

The full module lives at src/components/soc_in_box/ — agents, schemas, bus wrapper, verdict store, HITL store, web routes, systemd units, README.

What's not in v1 and what we'll work on next:

HITL v2 executor. Real write path — consume action.decision[approved] events, call CrowdStrike RTR / Tanium / Zscaler via MCP, log the result back on the bus. The hard parts (audit, approval, identity) are done; only the executor itself is missing.
Red Team agent. Once we have AttackIQ wired into the lab, a Red Team role can post attack.executed events that the rest of the cascade has to detect. Closes the loop on "did the SOC actually catch what the Red Team threw?"
Backtest as a CI gate. Once we're confident on a baseline, promote the harness to a nightly run with regression thresholds — "if Tier 2 escalation precision drops more than 5% from last week's baseline, fail the build."
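A regression gate of that kind could look like the sketch below. It reads "5%" as five percentage points of absolute precision, which is an assumption (a relative drop is the other plausible reading), and the function name and exit-code convention are hypothetical.

```python
def precision_gate(current: float, baseline: float, max_drop_points: float = 5.0) -> int:
    """Return a process exit code: 0 if within threshold, 1 if precision regressed.

    Precision values are fractions in [0, 1]; the threshold is in percentage points.
    """
    drop_points = (baseline - current) * 100.0
    return 1 if drop_points > max_drop_points else 0
```

A nightly job would finish with `sys.exit(precision_gate(current, baseline))`, so a drop beyond the threshold fails the build while improvements and small wobbles pass.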

The code is BSD-licensed in the public mirror. If you're building something similar, the most useful thing to copy isn't any one agent — it's the bus shape, the schema-per-event discipline, the audit-stream-as-truth pattern, and the HITL handoff that addresses a human by role. Those four ideas are what turned eight separate LLM-with-tools experiments into one thing a SOC team would actually run.

detflow: A Detection-Engineering Copilot You Can pip install

Vinay — Sun, 07 Jun 2026 01:55:03 +0000

TL;DR 🚀

I shipped detflow to PyPI — an open-source, vendor-neutral detection-engineering copilot. It does the four things I found myself re-implementing inside every detection-as-code workflow: draft a detection from plain English (as Sigma or Cortex XSIAM XQL), lint it offline, find overlaps against the rules you already run, and review it like a senior detection engineer. 🛡️

2 formats

draft & review in Sigma or Cortex XQL — one portable, one native

1 protocol

bring any model: an OpenAI-compatible endpoint or a LangChain failover chain

0 crashes

lint & overlap need no model; review degrades to a deterministic floor

This is the detection-side sibling of iocflow. iocflow handles the indicator lifecycle; detflow handles the rule lifecycle. Same design DNA: deterministic primitives first, the LLM as an enhancement that can fail without taking the tool down with it. 🧱

The itch

A detection-as-code pipeline — the kind that turns a rule into a reviewed, tested merge request — has a handful of stages that have nothing to do with your SIEM vendor:

Is this rule even valid? (lint / schema-check)
An analyst can describe the behavior but doesn't write Sigma fluently — can we draft the first version?
Are we about to ship coverage we already have? (dedup against the catalog)
Would a senior engineer approve this, and what would they flag? (quality, false-positive risk, ATT&CK mapping, gaps)

I'd written those four stages more than once. They're generic — the only vendor-specific parts of a real pipeline are compiling to your query language and dry-running against your tenant. So I carved the generic four out of a detection-as-code workbench I'd built and made them a clean, public library. 🧰

What it looks like

Draft a detection from a sentence — in either language:

import detflow

sigma = detflow.draft("powershell with an encoded command spawned from a Word macro")
print(sigma.rule)                       # a full Sigma rule, ready to lint

xql = detflow.draft("same thing, but for Cortex XSIAM", fmt="cortex-xql")
print(xql.rule)                         # dataset = ... | filter ... | limit 100

Lint it offline — no model, no network, no keys:

report = detflow.lint(sigma.rule)       # or lint_sigma / lint_xql
print(report.status, report.summary)    # "pass" / "warn" / "fail"
for f in report.findings:
    print(f.level, f.message)

Review it like a senior engineer, deduped against your own inventory:

catalog = [
    {"name": "Encoded PowerShell", "source": "crowdstrike", "techniques": ["T1059.001"]},
    {"name": "WMI Process Create",  "source": "sigma",       "techniques": ["T1047"]},
]
result = detflow.review(sigma.rule, catalog=catalog)
print(result.quality_score, result.false_positive_risk, result.verdict)
for o in result.overlaps:               # "you may already cover this"
    print(" •", o.source, o.name, "—", o.reason)

The whole flow, end to end:

flowchart LR
    NL([plain English]) -->|draft| RULE[Sigma / XQL rule]
    RULE -->|lint| LINT[schema + best-practice findings]
    RULE -->|find_overlaps| OV[catalog dedup]
    LINT --> REV{{review}}
    OV --> REV
    REV --> V([quality · FP risk · ATT&CK · verdict])

There's a CLI too, for the terminal-and-CI crowd:

detflow draft "credential dumping via comsvcs MiniDump" -f cortex-xql
detflow lint rule.yml
detflow review rule.yml --catalog catalog.json --json

Model-agnostic on purpose 🔌

detflow doesn't import an SDK or hard-code a provider. A "model" is anything with one method:

def complete(self, system: str, user: str, *, json: bool = False) -> str: ...

That gives you three ways in. A built-in OpenAIChatModel talks to any OpenAI-compatible endpoint — OpenAI, Azure, a local vLLM/Ollama server, a gateway. default_model() builds one from DETFLOW_LLM_* env vars. Or you wrap any LangChain chat model:

from langchain_failover import FailoverChatModel
from detflow.llm import LangChainModel

chain = FailoverChatModel(models=[primary, local_fallback])
model = LangChainModel(chain)
detflow.review(rule, catalog=catalog, model=model)   # rides the failover chain

That FailoverChatModel is langchain-failover, another package I extracted and published — so a primary-model outage transparently falls back to a secondary mid-review. Three of my OSS packages quietly eating each other's dog food. 🐕
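Here's a minimal sketch of what "anything with one method" can look like in practice, assuming the signature above is read as a structural (duck-typed) interface. The names `ChatModel`, `CannedModel`, and `ask` are hypothetical and only for illustration; they are not detflow's own classes.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ChatModel(Protocol):
    """Hypothetical name: anything with this one method counts as a model."""

    def complete(self, system: str, user: str, *, json: bool = False) -> str: ...


class CannedModel:
    """A stand-in model for tests: no SDK, no network, fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls = []  # record of (system, user, json) for assertions

    def complete(self, system: str, user: str, *, json: bool = False) -> str:
        self.calls.append((system, user, json))
        return self.reply


def ask(model: ChatModel, question: str) -> str:
    """Caller code depends only on the protocol, never on a provider."""
    return model.complete("You are a detection engineer.", question, json=False)


model = CannedModel('{"verdict": "approve"}')
answer = ask(model, "Review this rule.")
```

The payoff of the one-method shape is that a test double like `CannedModel` needs no inheritance and no provider SDK to stand in for a real endpoint.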

Never-raises, deterministic floor

The contract I care about most: detflow degrades, it doesn't break. 🎯

Lint and overlap need no model at all — they're pure, stdlib-plus-PyYAML, and run in CI with zero secrets.
Drafting requires a model (you're asking it to write), but a model error comes back as a result with an error field, not an exception.
Review uses a model when one is present and falls back to a deterministic floor when it isn't — you still get the lint results, the catalog overlaps, and the parsed ATT&CK techniques. review() never raises.

So detflow is safe to drop into a pipeline that sometimes has an LLM available and sometimes doesn't. The boring, testable parts stay up regardless; the AI adds judgment when it can.
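The never-raises contract can be sketched in a few lines. This is an illustration under stated assumptions, not detflow's implementation: `ReviewResult`, its fields, the toy `lint`, and the `"needs-human"` verdict are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ReviewResult:
    """Hypothetical result shape; the real detflow fields may differ."""
    findings: List[str]
    verdict: str                 # "needs-human" on the floor path
    source: str                  # "model" or "deterministic"
    error: Optional[str] = None


def lint(rule: str) -> List[str]:
    """Toy deterministic check standing in for the real linter."""
    findings = []
    if "title:" not in rule:
        findings.append("missing title")
    if "detection:" not in rule:
        findings.append("missing detection block")
    return findings


def review(rule: str, model: Optional[Callable[[str], str]] = None) -> ReviewResult:
    """Never raises: any model failure falls back to the deterministic floor."""
    findings = lint(rule)
    if model is not None:
        try:
            return ReviewResult(findings, model(rule), source="model")
        except Exception as exc:  # broad on purpose: the contract is "never raise"
            return ReviewResult(findings, "needs-human", "deterministic", error=str(exc))
    return ReviewResult(findings, "needs-human", "deterministic")


def broken_model(rule: str) -> str:
    raise TimeoutError("model endpoint unreachable")


rule_text = "title: Encoded PowerShell\nlogsource: {}\n"
no_model = review(rule_text)
with_failure = review(rule_text, model=broken_model)
with_model = review(rule_text, model=lambda r: "approve")
```

In all three cases the lint findings come back; only the verdict and the `error` field change depending on whether a model was present and healthy.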

Why two formats

Sigma is the portable, reviewable, vendor-neutral standard — it lints cleanly and ports across SIEMs. Cortex XSIAM XQL is what actually runs on that platform. Supporting both means you can author once in Sigma for portability, or go straight to XQL when you want the platform's full expressiveness — and detflow lints and reviews either one. The drafting prompts are language-aware (the XQL prompt knows XQL has no startswith/endswith and uses | filter, not SQL where), so you don't get SQL-shaped hallucinations back. 🧠

The bigger pattern

This is the same lesson as the IOC work: when you want to show AI in your engineering, the junior move is to make everything an LLM call. The stronger, more deployable story is deterministic primitives plus optional AI — the schema checks and dedup are boring and tested, the model writes and reviews where judgment helps, and nothing falls over when the model is slow or absent.

detflow runs on Python 3.9+, keeps import detflow dependency-light (the LLM client is an extra), ships py.typed for downstream type-checking, and every piece is independently useful.

📦 PyPI: pip install detflow
🛠️ Source: github.com/vinayvobbili/detflow
🧩 Its indicator-side sibling: iocflow

If you run a detection-as-code pipeline, I'd love to know which query language you'd want next. 👋

iocflow: Turning a Production AI SOC into a Shippable OSS Library

Vinay — Sun, 07 Jun 2026 01:54:47 +0000

TL;DR 🚀

I shipped iocflow to PyPI — an open-source Python library for the entire indicator-of-compromise lifecycle, built as six independently-useful layers behind pip extras. The headline isn't "another IOC parser." It's the shape: every layer is a deterministic, boring, testable primitive — and the top layer is a small LangGraph multi-agent team that orchestrates those primitives, with a human-in-the-loop gate standing between the AI and anything destructive.

6 layers

extract · enrich · comment · hunt · block · agent — each its own pip extra

1 import

investigate(text) runs the whole chain as a multi-agent team

0 rogue blocks

LLM proposes · human authorizes · a guard vetoes

This is the OSS sibling of SOC-in-a-Box, the AI SOC I wrote about last week. SOC-in-a-Box proved the pattern against real systems; iocflow packages the lesson so anyone can pip-install it. 🧰

One call: extract IOCs from a report → enrich → suggest hunts → propose blocks → wait for a human → block at the firewall. The benign 8.8.8.8 never even gets proposed.

The lesson I was carrying over

SOC-in-a-Box was eight analyst roles played by one local LLM over a message bus, read-only against production, with a human approving every containment action. The thing that actually made it trustworthy wasn't the agents — an LLM-with-tools loop is not novel. It was two architectural commitments:

The model orchestrates; it doesn't do. The irreversible work — query a SIEM, write a denylist, isolate a host — is done by plain, deterministic code the model merely calls. The LLM picks what and when; the tool decides how, the same way every time.
No single authority for a destructive action. The AI can propose containment all day long. A human clicks the button, and a dumb safety check sits underneath both of them refusing to touch anything on an allowlist.

Those two ideas aren't SOC-specific. They're how you make any AI system that touches production safe enough to actually deploy. So I pulled them out of the SOC and built a clean, public library around them. 🧱

Deterministic primitives first, agents last

iocflow grows in layers, each behind its own extra so import iocflow stays a one-dependency install and pulls in nothing you didn't ask for:

L1 — extract (iocflow): pull IPs, domains, URLs, hashes, CVEs, MITRE technique IDs, threat actors, and malware families out of unstructured text, with the false-positive defenses you'd otherwise hand-write (Public Suffix List validation, benign allowlists, re-fanging of evil-domain[.]ru).
L2 — enrich (iocflow[enrich]): look each indicator up against VirusTotal / AbuseIPDB / abuse.ch and return a worst-wins verdict.
L3 — comment (iocflow[ai]): an LLM turns the enrichment report into a structured assessment — and falls back to a deterministic, report-derived summary when no model is configured. It never raises.
L4 — hunt (iocflow[hunt]): render ready-to-run hunt queries — CrowdStrike CQL, Cortex XQL, and Sigma — straight from the indicators, offline and stdlib-only. An LLM can add behavioral hunts, but the deterministic queries are always there.
L5 — block (iocflow[block]): push malicious indicators to the control points you operate — Palo Alto (EDL feed + live User-ID API), Zscaler, CrowdStrike, Abnormal — with dry_run=True as the default everywhere and an authoritative allowlist guard.
L6 — agent (iocflow[agent]): the capstone. 🤖

Notice that L1–L5 have no idea an agent exists. They're just functions with stable input/output types: ExtractedEntities → enrich() → EnrichmentReport → comment() → Commentary → suggest() → HuntPlan → block() → BlockReport. You can use any one of them on its own. That's deliberate — the agent is a consumer of the primitives, not a replacement for them.
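To make the L1 idea concrete, here's a small stdlib-only sketch of re-fanging plus allowlist filtering. It is not iocflow's extractor: the regexes, the `BENIGN` set, and the function names are assumptions, and it skips the Public Suffix List validation the real layer does.

```python
import ipaddress
import re

# Hypothetical benign allowlist; a real one would be much larger.
BENIGN = {"8.8.8.8", "1.1.1.1", "example.com"}

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def refang(text: str) -> str:
    """Undo common defanging: evil-domain[.]ru -> evil-domain.ru, hxxp -> http."""
    text = text.replace("[.]", ".").replace("(.)", ".")
    return re.sub(r"\bhxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)


def extract(text: str) -> dict:
    """Return de-duplicated, allowlist-filtered IPs and domains."""
    clean = refang(text)
    ips = []
    for candidate in IPV4_RE.findall(clean):
        try:
            ipaddress.ip_address(candidate)      # rejects things like 999.1.1.1
        except ValueError:
            continue
        if candidate not in BENIGN and candidate not in ips:
            ips.append(candidate)
    domains = []
    for candidate in DOMAIN_RE.findall(clean):
        name = candidate.lower()
        if name not in BENIGN and name not in domains:
            domains.append(name)
    return {"ips": ips, "domains": domains}


# 203.0.113.7 is a documentation-range address used as a stand-in.
report = "C2 at evil-domain[.]ru (203.0.113.7), resolved via 8.8.8.8; see hxxps://evil-domain[.]ru/a"
entities = extract(report)
```

The output is a plain dict here; in the library the equivalent value is the typed `ExtractedEntities` that the next layer consumes.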

The capstone: a small multi-agent team

Layer 6 hands a report to a supervisor that routes to specialist agents — extractor, enricher, hunter, responder — each using L1–L5 as tools, then loops back until the case is done.

flowchart TB
    START([report text]) --> SUP{supervisor<br/>routes next step}
    SUP -->|extract| EX[extractor<br/>L1 entities]
    SUP -->|enrich| EN[enricher<br/>L2 + L3 assessment]
    SUP -->|hunt| HU[hunter<br/>L4 queries]
    SUP -->|respond| RE[responder<br/>L5 dry-run → propose]
    EX --> SUP
    EN --> SUP
    HU --> SUP
    RE -.proposal.-> GATE{{ApprovalGate<br/>human authorizes}}
    GATE -.approved.-> RE
    RE -->|live block| SUP
    SUP -->|all done| END([Case])

from iocflow.agent import investigate

case = investigate(report_text)        # safe: nothing is blocked by default
print(case.commentary.severity.value, "—", case.commentary.summary)
for line in case.trace:                # the agents' reasoning, replayable
    print(" •", line)

The model is any LangChain chat model. The bundled default_agent_model() builds a FailoverChatModel — primary with an automatic secondary — which is the same failover model I extracted from the SOC and published earlier. iocflow eating its own dog food. 🐕 And here's the part that makes it robust: with no model configured at all, the graph runs the layers in a fixed deterministic order and still produces a complete Case. The LLM is an enhancement, not a dependency.

Three-layer authority (the part that matters) 🔒

Blocking is the only step that can hurt you, so it gets the full treatment from SOC-in-a-Box:

The agent proposes. The responder does a dry run of L5 — full audit, zero changes — and turns it into a proposal.
A human authorizes. An ApprovalGate reviews the proposal and returns the approved subset. The default is DenyAllGate — an unattended run blocks nothing.
A guard vetoes. Underneath both of them, the Layer 5 allowlist guard refuses to touch public resolvers, private ranges, and well-known domains — even if the report mislabeled them malicious. You cannot block 8.8.8.8 through this library. The LLM is never the sole authority for a destructive action.
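The three layers compose roughly like this. A minimal sketch, assuming gates are plain callables that return the approved subset; `respond`, `guard_allows`, and the tiny allowlist below are hypothetical, not the library's API.

```python
import ipaddress
from typing import Callable, List

# Hypothetical guard data; the real allowlist covers far more.
PUBLIC_RESOLVERS = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "9.9.9.9"}
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")
]


def guard_allows(indicator: str) -> bool:
    """Layer 3: refuse public resolvers and private ranges, whatever the report says."""
    if indicator in PUBLIC_RESOLVERS:
        return False
    try:
        addr = ipaddress.ip_address(indicator)
    except ValueError:
        return True  # not an IP; a real guard would also check well-known domains
    return not any(addr in net for net in PRIVATE_NETS)


def deny_all_gate(proposal: List[str]) -> List[str]:
    """Layer 2 default: an unattended run approves nothing."""
    return []


def respond(malicious: List[str],
            gate: Callable[[List[str]], List[str]] = deny_all_gate) -> dict:
    """Layer 1 proposes, the gate authorizes, the guard has the final veto."""
    proposal = list(malicious)                        # dry run: nothing changes yet
    approved = [i for i in gate(proposal) if i in proposal]
    blocked = [i for i in approved if guard_allows(i)]
    vetoed = [i for i in approved if not guard_allows(i)]
    return {"proposed": proposal, "blocked": blocked, "vetoed": vetoed}


# 203.0.113.7 is a documentation-range address standing in for a real bad IP.
flagged = ["203.0.113.7", "8.8.8.8", "10.0.0.5"]
unattended = respond(flagged)
approve_everything = respond(flagged, gate=lambda proposal: proposal)
```

Even the most permissive gate (`approve_everything`) cannot get 8.8.8.8 or an internal address blocked, because the guard runs after the human, not before.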

For the gate, I wired a real one to Slack — no inbound webhook server, just post-and-poll:

from iocflow.agent import investigate
from iocflow.agent.chat_gate import SlackApprovalGate

gate = SlackApprovalGate(approvers=["U_ANALYST"], timeout=600)
case = investigate(report_text, gate=gate)
# Bot posts the proposed blocks to your channel.
# ✅ from an allowlisted analyst authorizes the plan; ❌ or no reply = denied.

It posts the proposed blocks to a channel and polls for a reaction from an allowlisted approver — ✅ approves the plan, ❌ or silence denies it, and a timeout defaults to deny. The whole thing is a ChatApprovalGate over a two-method ChatTransport (post, reactions), so the same flow drops onto Webex, Teams, or a web UI by writing two functions. The transport is a thin seam, which means the gate logic is unit-tested without a single network call.
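Here's roughly what the two-method seam buys you. This sketch is my own reconstruction of the pattern, not the shipped classes: the constructor arguments, the injected clock, and the Slack-style reaction names (`white_check_mark`, `x`) are assumptions.

```python
import time
from typing import Callable, Dict, List, Protocol


class ChatTransport(Protocol):
    """The two-method seam: post a message, read the reactions on it."""

    def post(self, text: str) -> str: ...                               # message id
    def reactions(self, message_id: str) -> Dict[str, List[str]]: ...   # emoji -> user ids


class ChatApprovalGate:
    """Post-and-poll approval; a deny reaction, silence, or a timeout all mean deny."""

    def __init__(self, transport, approvers, timeout=600.0, poll_every=5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.transport, self.approvers = transport, set(approvers)
        self.timeout, self.poll_every = timeout, poll_every
        self.clock, self.sleep = clock, sleep

    def __call__(self, proposal: List[str]) -> List[str]:
        message_id = self.transport.post("Proposed blocks: " + ", ".join(proposal))
        deadline = self.clock() + self.timeout
        while self.clock() < deadline:
            seen = self.transport.reactions(message_id)
            if self.approvers & set(seen.get("x", [])):
                return []                          # explicit deny wins
            if self.approvers & set(seen.get("white_check_mark", [])):
                return list(proposal)
            self.sleep(self.poll_every)
        return []                                  # timeout defaults to deny


class FakeTransport:
    """In-memory transport for unit tests: no network calls at all."""

    def __init__(self, scripted):
        self.scripted = list(scripted)             # one reactions dict per poll
        self.posted = []

    def post(self, text):
        self.posted.append(text)
        return "msg-1"

    def reactions(self, message_id):
        return self.scripted.pop(0) if self.scripted else {}


class FakeClock:
    """Advances only when the gate sleeps, so tests finish instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def run(scripted, approvers=("U_ANALYST",)):
    clock = FakeClock()
    gate = ChatApprovalGate(FakeTransport(scripted), approvers, timeout=15.0,
                            poll_every=5.0, clock=clock, sleep=clock.sleep)
    return gate(["203.0.113.7"])


approved = run([{}, {"white_check_mark": ["U_ANALYST"]}])
stranger = run([{"white_check_mark": ["U_RANDOM"]}])
timed_out = run([])
```

Porting to another chat platform means writing a new transport with those two methods; the approval logic and its tests stay untouched.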

Why build it this way

The temptation, when you want to "show AI in your work," is to make everything an LLM call. That reads as junior. The stronger story — the one a security team will actually run — is deterministic primitives plus agentic orchestration: the boring parts are boring and tested, the AI adds judgment where judgment helps, and a human holds the keys to anything irreversible. 🎯

Everything but the agent layer runs on Python 3.9+; import iocflow stays dependency-light; every layer is independently useful; and the whole agent runs offline in tests because the enrichers, blockers, and model are all injectable.

📦 PyPI: pip install iocflow
🛠️ Source: github.com/vinayvobbili/iocflow
🧠 The SOC it grew out of: SOC-in-a-Box

If you try it, I'd love to hear what control points you'd plug in. 👋

Three Chat Template Patterns That Silently Kill Your Prompt Cache

Vinay — Sun, 07 Jun 2026 01:54:28 +0000


Teaching a Reranker the Language of Security Tickets (+41% MRR@10)

Vinay — Sun, 07 Jun 2026 01:53:46 +0000

TL;DR

Our SOC's RAG pipeline retrieves over 142,000 closed XSOAR security tickets to ground
investigation answers. After exhausting the easy wins (chunking, top-k, reranker
choice), we still saw the right historical ticket land at rank 5-10 too often, with
the LLM grounding its answer in a near-miss neighbor instead.

We fine-tuned the reranker on our own data. Held-out test set, time-based split:

Reranker	MRR@10
`BAAI/bge-reranker-v2-m3` (off-the-shelf)	0.598
Fine-tuned on 24K XSOAR pairs	0.846

+41% uplift. No model architecture change, no embedding model swap. Just
domain-specific fine-tuning of the same base reranker.

+41%

MRR@10 uplift on held-out time-split test set

24,213 + 10,848

positive pairs + clean hard negatives, mined from close-notes

0

explicit relevance labels collected — all signal mined from existing analyst text

The interesting part isn't the result — it's where the training data came from. We
never logged a single explicit relevance judgement. The 24K positive pairs were
hiding in plain sight inside analyst close-notes that nobody asked anyone to write.

The setup: embedder + reranker, the standard two-stage RAG

flowchart LR
    Q[User query] --> E[Embedder<br/>Qwen3-Embedding-8B<br/>4-bit DWQ]
    E --> Top50[Top-50 by<br/>cosine similarity]
    Top50 --> R[Reranker<br/>bge-reranker-v2-m3<br/><b>fine-tuned</b>]
    R --> Top5[Top-5 ranked<br/>by joint scoring]
    Top5 --> LLM[LLM grounds<br/>answer]
    style R fill:#1e40af,color:#fff
    style E fill:#0e7490,color:#fff
    style LLM fill:#065f46,color:#fff

Our retrieval pipeline is the standard cascade:

Stage 1 — Embedder (bi-encoder). Qwen3-Embedding-8B-4bit-DWQ served via vllm-mlx. Encodes the query independently, pulls top-50 candidates from ChromaDB by cosine similarity. Fast, but it scores query and document in isolation.
Stage 2 — Reranker (cross-encoder). BAAI/bge-reranker-v2-m3 running on Apple Silicon (MPS). Jointly attends over (query, document) and re-scores the top-50 down to top-5 to feed the LLM. Slower per item, but dramatically more accurate than embedder-only ranking.

Mental model: the embedder is a fast librarian who pulls 50 books off the shelf
based on title similarity. The reranker is a careful reader who actually opens each
one and re-orders by relevance to your specific question.

Off-the-shelf rerankers like bge-reranker-v2-m3 are trained on general English
passage retrieval (MS MARCO and friends). They've never seen an XSOAR ticket. They
don't know that "INBLRPRDDKNF01: ML via Cloud-based ML" matters in a way that
generic English semantic similarity cannot capture. Fine-tuning is how you teach
them.

Where the training data came from

Cross-encoder training needs (query, positive, negative) triples. We had no
explicit relevance labels — no clicks, no thumbs-up/down, nothing. So we mined
implicit ones from analyst close-notes.

Buried in 142,000 closed tickets are sentences analysts type all the time:

"With reference to XSOAR #289008, regional team confirmed..."
"Refer master ticket #158126."
"Per XSOAR #463428, user confirmed..."

Each one is a human-curated link between two tickets. Free relevance label. We just
had to extract them.

Generalizable lesson. Before paying for labels, look at what your users are
already typing. Free-form text in close-notes, comments, JIRA descriptions —
they're full of implicit relevance judgements that nobody asked anyone to record.
{: .prompt-tip }

Filtering the noise: not all `#N` references are equal

A regex over close-notes pulled 61,500 #N references. Most were useless:

Pool	Lead-in phrase	Count	Signal quality
A	"Duplicate to #N"	52,782	Strong but trivial — same alert, different host. Embedder already gets these.
B	"XSOAR #N · Per XSOAR…"	~3,000	Gold — analyst-curated cross-references between distinct tickets.
—	"QRadar offense #N"	~1,400	Useless — references other systems, not XSOAR.

Pool A is mostly the embedder's home turf already; the reranker doesn't need help
with near-duplicates. Pool B is the interesting signal: "these two tickets are
related but not identical" — exactly the case where a reranker earns its keep.
After regex-filtering and verifying both endpoints existed in our DB, we had 4,260
unique direct (src → tgt) pairs.
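The mining step can be sketched as a handful of lead-in-aware regexes. The example notes are the ones quoted earlier; the exact patterns, pool names, and the 5000xx source ticket IDs are illustrative assumptions, not the production ones.

```python
import re
from collections import Counter

# Illustrative patterns only: one regex per lead-in pool.
POOLS = [
    ("duplicate", re.compile(r"duplicate\s+(?:to|of)\s+#(\d+)", re.IGNORECASE)),
    ("external",  re.compile(r"qradar\s+offense\s+#(\d+)", re.IGNORECASE)),
    ("curated",   re.compile(r"(?:xsoar|master\s+ticket)\s+#(\d+)", re.IGNORECASE)),
]


def mine_references(ticket_id: int, close_note: str, known_ids: set) -> list:
    """Return (pool, src, tgt) for each usable #N reference in one close-note."""
    found = []
    for pool, pattern in POOLS:
        for match in pattern.finditer(close_note):
            target = int(match.group(1))
            if pool == "external":
                continue                  # references another system, not XSOAR
            if target == ticket_id or target not in known_ids:
                continue                  # both endpoints must exist in the DB
            found.append((pool, ticket_id, target))
    return found


notes = {
    500001: "With reference to XSOAR #289008, regional team confirmed...",
    500002: "Refer master ticket #158126.",
    500003: "Duplicate to #289008",
    500004: "QRadar offense #77123 closed as benign.",
    500005: "Per XSOAR #999999, user confirmed...",   # target not in the DB
}
known = set(notes) | {289008, 158126}

pairs = [p for tid, note in notes.items() for p in mine_references(tid, note, known)]
pool_counts = Counter(pool for pool, _, _ in pairs)
curated_pairs = sorted((s, t) for pool, s, t in pairs if pool == "curated")
```

Only the `curated` pool feeds the direct (src → tgt) positives; the `duplicate` pool is kept separate so it can be counted without dominating the training set.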

Free positives via transitive siblings (and the polynomial-blow-up trap)

When five tickets all cite the same master ticket, those five are also related to
each other. That's a free O(n²) inflation of training pairs — if you cap the
explosion.

We capped each master at 20 children before generating siblings. One particularly
prolific master had 553 children; uncapped, it would have generated ~150,000
trivial sibling pairs and dominated the training distribution. Stratified sampling
across distinct rules pushed cross-rule pairs to the front so the model learned
generalizable relations, not within-rule sameness.

Source	Count
Direct `#N` references	4,260
Transitive siblings (capped, stratified)	19,953
Total positives (training-ready)	24,213

72% of the transitive pairs were cross-rule — a strong signal that our cap +
sampling worked.
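The arithmetic behind the cap is worth seeing once. The sketch below shows only the capping half of the recipe, using a seeded random sample per master; the stratified cross-rule sampling is not reproduced, and the child ticket IDs are synthetic.

```python
import random
from itertools import combinations
from math import comb


def sibling_pairs(children_by_master: dict, cap: int = 20, seed: int = 0) -> list:
    """Generate sibling pairs, sampling at most `cap` children per master."""
    rng = random.Random(seed)
    pairs = []
    for master, children in sorted(children_by_master.items()):
        unique = sorted(set(children))
        if len(unique) > cap:
            unique = sorted(rng.sample(unique, cap))
        pairs.extend(combinations(unique, 2))
    return pairs


# Why the cap matters: pair count grows quadratically with cluster size.
uncapped_pairs = comb(553, 2)      # the prolific master from the text
capped_pairs = comb(20, 2)

clusters = {
    158126: list(range(1000, 1553)),   # 553 synthetic child ids
    289008: [2001, 2002, 2003],
}
pairs = sibling_pairs(clusters)
```

With the cap in place, the 553-child cluster contributes 190 pairs instead of 152,628, so a 3-child cluster is no longer drowned out by four orders of magnitude.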

Generalizable lesson. Any time you derive new training examples by
transitivity (or any structural inference), watch for polynomial blow-up in dense
clusters. Stratified sampling is usually the right counter-move.
{: .prompt-tip }

The part most beginners get wrong: hard negative mining

Negatives matter as much as positives. The model learns from contrast, and random
negatives teach almost nothing — they're already obviously different. The interesting
negatives are the ones that look similar to the embedder but aren't actually
related. Those are the cases the embedder gets wrong, and they're exactly what the
reranker needs to learn to push apart.

The recipe: for each source ticket, query the existing embedding index for the
top-50 nearest neighbors. Drop anything that's a known positive (direct, transitive,
or shares a master). What's left is what the embedder thinks matches but the analyst
never linked — hard negatives.

We caught a subtle trap on the first run: same-rule near-duplicates are not hard
negatives. Two tickets both fired by INBLRPRDDKNF01: ML via Cloud-based ML with
0.997 cosine similarity are sibling alerts of the same automated detection rule —
they're related, just not via an analyst's #N reference. Training on them as
negatives would teach the model to push apart things that are actually related.
Filtering by rule before adding to the negatives pool dropped 33% of candidates.

Stage	Count
Raw top-50 candidates from embedder	16,137
Same-rule contamination (filtered out)	5,289 (33%)
Clean cross-rule hard negatives	10,848

Median cosine similarity of the kept negatives: 0.955 — i.e. the embedder
strongly believed these were relevant. They weren't. That's exactly the gap a
reranker should close.
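Here's the recipe as a stdlib-only sketch on toy vectors. In the real pipeline the neighbors come from the existing embedding index, so the brute-force cosine scan, the three-dimensional vectors, and the rule names below are all stand-ins.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mine_hard_negatives(source_id, tickets, positives, top_k=50):
    """Nearest neighbors of `source_id` that are neither linked nor same-rule.

    tickets:   {ticket_id: {"vec": [...], "rule": str}}
    positives: set of frozenset({a, b}) for every known related pair
    """
    src = tickets[source_id]
    scored = [
        (cosine(src["vec"], t["vec"]), tid)
        for tid, t in tickets.items() if tid != source_id
    ]
    neighbors = sorted(scored, reverse=True)[:top_k]
    negatives = []
    for score, tid in neighbors:
        if frozenset({source_id, tid}) in positives:
            continue                      # analyst already linked these
        if tickets[tid]["rule"] == src["rule"]:
            continue                      # same-rule sibling: related, not a negative
        negatives.append((tid, round(score, 3)))
    return negatives


tickets = {
    1: {"vec": [1.0, 0.0, 0.1], "rule": "rule-A"},
    2: {"vec": [1.0, 0.0, 0.1], "rule": "rule-A"},   # same-rule near-duplicate
    3: {"vec": [0.9, 0.1, 0.1], "rule": "rule-B"},   # known positive
    4: {"vec": [0.9, 0.2, 0.0], "rule": "rule-C"},   # looks similar, never linked
    5: {"vec": [0.0, 1.0, 0.0], "rule": "rule-D"},   # obviously different
}
positives = {frozenset({1, 3})}
hard = mine_hard_negatives(1, tickets, positives, top_k=3)
```

Ticket 2 is the trap described above: it has the highest similarity of all, and without the same-rule filter it would have gone into the negatives pool.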

Data discipline: split by time, never by random

Random train/val/test splits leak future signal into training and lie to you about
held-out quality. Any time your data has a time dimension — fraud, security, sales
forecasting, almost everything in production ML — split by time. In production the
model can never look at the future, so neither should your evaluation.

Split	Date range	Rows	Pos / Neg
Train	before 2025-09-01	27,604	18,745 / 8,859
Val	2025-09 to 2025-11	3,122	2,378 / 744
Test	2025-12 onward	4,335	3,090 / 1,245
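The split itself is a two-boundary comparison. A minimal sketch, assuming each row carries a single date in a field I'm calling `closed_on`; which timestamp a (query, passage) pair should inherit is a choice the table doesn't spell out.

```python
from datetime import date

TRAIN_END = date(2025, 9, 1)    # train: strictly before this date
TEST_START = date(2025, 12, 1)  # test: this date onward; val sits in between


def assign_split(closed_on: date) -> str:
    if closed_on < TRAIN_END:
        return "train"
    if closed_on < TEST_START:
        return "val"
    return "test"


def split_by_time(rows: list) -> dict:
    """rows: dicts with a `closed_on` date (the field name is an assumption)."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assign_split(row["closed_on"])].append(row)
    return splits


rows = [
    {"id": 1, "closed_on": date(2025, 3, 14)},
    {"id": 2, "closed_on": date(2025, 8, 31)},
    {"id": 3, "closed_on": date(2025, 9, 1)},
    {"id": 4, "closed_on": date(2025, 11, 30)},
    {"id": 5, "closed_on": date(2025, 12, 1)},
    {"id": 6, "closed_on": date(2026, 1, 20)},
]
splits = split_by_time(rows)
```

The boundary rows (ids 2 through 5) are the ones worth unit-testing, since an off-by-one-day error there silently leaks validation data into training.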

The part that's almost a one-liner: the training loop

After all the data work, the actual fit is short:

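import json

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader


def load_jsonl(path):
    # One JSON object per line with "query", "passage", and "label" keys.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


model = CrossEncoder(
    "BAAI/bge-reranker-v2-m3",
    num_labels=1,        # single relevance score per (query, passage) pair
    max_length=512,
    device="mps",        # Apple Silicon GPU
)

examples = [
    InputExample(texts=[r["query"], r["passage"]], label=float(r["label"]))
    for r in load_jsonl("train.jsonl")
]
loader = DataLoader(examples, shuffle=True, batch_size=8)

# `evaluator` is built from the validation split (construction not shown here).
model.fit(
    train_dataloader=loader,
    evaluator=evaluator,
    epochs=2,
    warmup_steps=int(len(loader) * 2 * 0.1),   # 10% of total steps (2 epochs)
    optimizer_params={"lr": 2e-5},
    output_path="checkpoint",
)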

A few details that mattered:

BCE-with-logits loss on (query, passage, label ∈ {0, 1}). Single-score output, binary cross-entropy.
AdamW at lr=2e-5 — the standard learning rate for BERT-family fine-tunes. Don't overthink it.
Linear warmup for the first 10% of steps (LR ramps 0 → 2e-5), then linear decay back to 0. Prevents unstable updates early when the model is still learning the new label distribution.
Periodic val evaluation every ~862 steps. We tracked Average Precision to know when to stop.
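To put numbers on the schedule, here's the warmup-then-decay curve worked out by hand. This assumes the 27,604 training rows from the split table are what `train.jsonl` holds; the function is a standalone re-derivation of a linear warmup schedule, not code lifted from the training library.

```python
import math

BASE_LR = 2e-5
ROWS, BATCH, EPOCHS = 27_604, 8, 2          # train rows from the split table

steps_per_epoch = math.ceil(ROWS / BATCH)   # DataLoader keeps the last partial batch
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * 0.1)


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then linear decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = total_steps - step
    return BASE_LR * max(0.0, remaining / (total_steps - warmup_steps))
```

Under those assumptions there are 3,451 steps per epoch and 690 warmup steps, which would make the ~862-step evaluation interval roughly four evaluations per epoch.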

The payoff

Split	Baseline MRR@10	Fine-tuned MRR@10	Δ
Validation	0.626	0.811	+30%
Test (held-out time)	0.598	0.846	+41%

MRR@10 is the standard ranking metric: for each query, find the rank of the first
relevant result within the top 10; if it's at rank k, the score is 1/k (0 if nothing
relevant appears); average across queries. Taking the reciprocal, our baseline 0.598
corresponds to an effective rank of ~1.7 and our fine-tuned 0.846 to ~1.18, so the
right ticket is almost always at the top. (That reciprocal is a harmonic-mean rank,
not a plain average of ranks.)
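For reference, the metric fits in a dozen lines. The ticket IDs below are toy data, and the function is a generic MRR@k, not the harness from this project.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break                      # only the first relevant hit counts
    return total / len(ranked_lists)


ranked_lists = [
    ["t1", "t2", "t3"],      # relevant at rank 1 -> 1.0
    ["t9", "t4", "t5"],      # relevant at rank 2 -> 0.5
    ["t7", "t8", "t6"],      # nothing relevant in the top k -> 0.0
]
relevant_sets = [{"t1"}, {"t4"}, {"t0"}]
score = mrr_at_k(ranked_lists, relevant_sets)   # (1.0 + 0.5 + 0.0) / 3

baseline_rank = 1 / 0.598      # reciprocal of the reported baseline MRR@10
finetuned_rank = 1 / 0.846     # reciprocal of the reported fine-tuned MRR@10
```

Note that a query with no relevant hit in the top k scores zero rather than being skipped, which is what keeps the metric honest about outright misses.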

Translation: the LLM grounds its answer on the right historical ticket almost every
time now. It's not a marginal improvement — it changes whether the agent's
suggestion is useful or plausible-but-wrong.

Battle scars (the gotchas nobody documents)

A few things I had to fix while getting this to actually run:

Corp SSL. The Mac running training had the corporate CA trusted at the system
level (so curl and the OS Keychain were happy), but Python's requests /
urllib3 use certifi's CA bundle, not the system store. So pip install and
HuggingFace model downloads failed with CERTIFICATE_VERIFY_FAILED. The fix is to
build a combined CA bundle and point both env vars at it (different libraries read
different ones):

export REQUESTS_CA_BUNDLE=~/corp-ca-bundle.pem
export SSL_CERT_FILE=~/corp-ca-bundle.pem
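Building the combined bundle is just PEM concatenation. A small sketch of one way to do it, with exact-duplicate certificates skipped; `build_ca_bundle` and the file names in the usage note are hypothetical.

```python
from pathlib import Path

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def build_ca_bundle(sources, out_path):
    """Concatenate PEM files into one bundle, skipping exact duplicate certs."""
    seen, blocks = set(), []
    for src in sources:
        text = Path(src).read_text()
        # Split on the END marker so each chunk holds at most one certificate.
        for chunk in text.split(END):
            if BEGIN not in chunk:
                continue                       # comments or trailing whitespace
            body = chunk[chunk.index(BEGIN):].strip()
            pem = body + "\n" + END + "\n"
            if pem not in seen:
                seen.add(pem)
                blocks.append(pem)
    Path(out_path).write_text("".join(blocks))
    return len(blocks)
```

A typical call would pass `certifi.where()` plus your exported corporate root (say, a hypothetical `corp-root.pem`) as `sources` and write to `~/corp-ca-bundle.pem`, the path the two exports above point at.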

Embedding model name enforcement. vllm-mlx serves on a fixed model ID and
422s any request with the wrong name. The default text-embedding-ada-002 fallback
in some libraries doesn't match. Set EMBEDDING_MODEL explicitly before the
embedding function is imported: production systemd loads it via EnvironmentFile,
but ad-hoc scripts have to source .env themselves.

MPS memory accounting. PyTorch's MPS allocator counts macOS file cache and
inactive pages as "other allocations" — even though those pages are reclaimable.
With another 32B model already loaded, training OOMed at 19GB MPS allocation
despite 88GB physically free. The fix is unsafe-by-default but usually correct:

export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0

This disables the watermark check. Safe if you've actually verified there's free
memory (vm_stat first). On a system where physical RAM is genuinely exhausted,
this will crash macOS.

launchctl quirks. macOS service management is a footgun farm: launchctl unload is deprecated; bootout sometimes returns I/O error from gui/UID but
works from user/UID; KeepAlive=true respawns killed processes — you must
remove the service from launchd, not just kill it. Lost an evening to this once.

When you'd consider doing this

You have a domain corpus where "relevant" means something specific (legal, medical, security tickets, internal company docs) — generic English passage retrieval doesn't capture your relevance signal.
You have an implicit relevance signal somewhere — clicks, links, analyst references, ticket relationships, support-case "see also" — that you can mine.
A stock reranker is already in your pipeline and you've tuned chunking + top-k and you're out of obvious wins.
You have a few thousand to a few tens-of-thousands of pairs — you don't need millions.

What surprised me

A few things, in order of how much they surprised me:

The hard-negative filter mattered more than the positive-pair mining. The
+41% lift would have collapsed to "modestly better than baseline" if I'd kept
those 33% same-rule near-duplicates in the negatives pool. The model would have
spent its capacity learning to push apart things that are actually related and
gotten worse at the real job. The data-quality work was disproportionately
high-leverage; the training loop itself was almost incidental.

The held-out test MRR (0.846) was higher than the validation MRR (0.811).
That's backwards from the usual story where test is the hardest split. My read:
detection rules in late 2025 / early 2026 are slightly clearer-cut than the
mid-2025 rules in the val window, so the test queries were genuinely easier.
Worth a deeper look, but it's also a useful sanity check — the model is
generalizing forward in time, not memorizing.

bge-reranker-v2-m3 at 0.598 baseline is surprisingly OK for a model that has
never seen a security ticket. Off-the-shelf rerankers are stronger out-of-domain
than I expected. That's both reassuring (you can ship a reasonable RAG without
fine-tuning) and a trap (you can ship a reasonable RAG without fine-tuning,
and it'll feel "good enough" until you measure properly).

What I'd do differently

Build the eval harness on day 1. I spent too long tuning chunking and top-k
by vibes before I had a number to optimize against. Once the MRR@10 harness
existed, every change was a one-command before/after — and most of the
"improvements" I'd been making earlier turned out to be wash trades. The harness
took an afternoon to build. I would have saved a couple of weeks by starting
there.

Reproducing this is doable in a couple of days if you have a domain corpus with
implicit relevance signal. If you've tried this on your own data, or hit a snag I
didn't, I'd love to hear how it went — reach me on
LinkedIn or by
email.
