Himansh Tekumudi

Posted on Jun 28

How CascadeFlow "flows" through SENTRI

#ai #gemini #security #showdev

Security Event Notification & Threat Response Intelligence

SENTRI is an AI-powered L1/L2 SOC analyst that runs as a keyboard-first
desktop app. It ingests raw unstructured security data — code snippets,
firewall logs, process lists, email bodies, PowerShell scripts — and
determines in real time:

Is this a threat?

How severe is it?

What's the attack chain?

How do we stop it?

How It Works

Layer	What it does
Intake	Paste raw data into a `>` terminal. Any format, any source.
Drafter	Gemini Flash classifies the threat in ~1–2s.
Verifier	Gemini Pro deep-dives on escalated cases with attack chain + CVSS scoring.
Budget	8K token cap prevents context overflow on oversized payloads.
Audit	Every response includes a structured CascadeFlow audit block.

The frontend is a Tauri + React desktop app with JetBrains Mono, a dark
terminal aesthetic, and full keyboard navigation. The backend is Express
with ChromaDB for vector memory and @cascadeflow/core for model routing.
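
The Budget row in the table above caps input at 8K tokens. Here is a minimal sketch of how such a cap might be enforced before the Drafter call. It assumes a rough four-characters-per-token estimate instead of a real tokenizer, and `MAX_INPUT_TOKENS`, `estimateTokens`, and `clampToBudget` are hypothetical names, not SENTRI's actual code.

```typescript
// Hypothetical budget guard. The names and the chars-per-token heuristic
// are assumptions for illustration, not SENTRI's real implementation.
const MAX_INPUT_TOKENS = 8000;
const APPROX_CHARS_PER_TOKEN = 4; // crude heuristic, not a real tokenizer

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

interface BudgetedPayload {
  text: string;
  truncated: boolean;
  estimatedTokens: number;
}

// Returns the payload unchanged if it fits, otherwise keeps only the head.
function clampToBudget(
  payload: string,
  maxTokens: number = MAX_INPUT_TOKENS,
): BudgetedPayload {
  const estimated = estimateTokens(payload);
  if (estimated <= maxTokens) {
    return { text: payload, truncated: false, estimatedTokens: estimated };
  }
  const text = payload.slice(0, maxTokens * APPROX_CHARS_PER_TOKEN);
  return { text, truncated: true, estimatedTokens: estimateTokens(text) };
}
```

Surfacing the `truncated` flag lets the audit block record that the model only saw part of an oversized payload.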

How we used CascadeFlow's Drafter/Verifier pattern to make a security AI agent that's fast when it can be, and thorough when it has to be.

The Problem We Were Actually Trying to Solve

Every security payload that lands in SENTRI is different. A snippet of PowerShell from a suspicious endpoint. A raw firewall log with 12,000 lines. An email body that might be a phishing attempt. A burst of server traffic metrics that could be a DDoS or could be a deployment.

The naive approach — which we started with — is to send all of it to the most capable model you have. Gemini Pro sees everything. Everything gets deep forensic analysis. Everything takes the same time and costs the same amount.

The problem is that most of what a SOC agent sees is not complex. A large proportion of incoming incidents are variants of things that have been seen before — known malware signatures, standard brute-force patterns, textbook phishing templates. Throwing a frontier reasoning model at a script-kiddie payload is like calling a senior forensic investigator to identify a banana peel. Correct, but absurd.

What we actually needed was a system that could tell the difference — fast — and only escalate when the escalation was warranted. That's the problem CascadeFlow's Drafter/Verifier pattern is designed to solve.

What CascadeFlow Is

CascadeFlow is a runtime model routing framework. The core idea: instead of committing a single model to your entire workload, you define a routing architecture where different models handle different classes of input — and the routing decision happens at inference time, based on the content of the request itself.

For SENTRI, we used the @cascadeflow/core agent package, wired up through @ai-sdk/google to orchestrate two Gemini models. The architecture lives in backend/src/cascade/cascadeRouter.ts and sits between the REST endpoint (/api/analyze) and any model API call.

The two models in our stack:

The Drafter — gemini-2.5-flash
Fast. Cheap. Acts as the intake classifier. Every single payload hits the Drafter first. Its job is to assess the threat, determine complexity, and either return a complete assessment or escalate.

The Verifier — gemini-2.5-pro
Highly capable. Expensive. Only runs when the Drafter explicitly escalates. Its job is deep forensic analysis — full attack chain reconstruction, lateral movement mapping, zero-day pattern recognition, detailed mitigation playbooks.

By design, the Verifier only sees payloads the Drafter has escalated, and the Drafter escalates rather than attempting analysis it isn't suited for.

How the Drafter/Verifier Pattern Actually Works

The CascadeAgent doesn't make routing decisions based on metadata or pre-computed scores. It makes them based on what the Drafter actually produces when it reads the payload.

Here's the flow:

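// backend/src/cascade/cascadeRouter.ts

import { CascadeAgent } from '@cascadeflow/core';
import { google } from '@ai-sdk/google';

// DRAFTER_SYSTEM_PROMPT and VERIFIER_SYSTEM_PROMPT are defined elsewhere (omitted here).
const agent = new CascadeAgent({
  // Fast, cheap first pass: every payload goes here.
  drafter: {
    model: google('gemini-2.5-flash'),
    systemPrompt: DRAFTER_SYSTEM_PROMPT,
  },
  // Deep analysis: only runs when escalationCondition returns true.
  verifier: {
    model: google('gemini-2.5-pro'),
    systemPrompt: VERIFIER_SYSTEM_PROMPT,
  },
  // Escalate on complexity signals in the Drafter's structured output.
  escalationCondition: (drafterOutput) => {
    return drafterOutput.complexity === 'high' ||
           drafterOutput.threatClass === 'zero-day' ||
           // Guard against a missing indicators array.
           (drafterOutput.indicators?.includes('lateral-movement') ?? false);
  },
});

export async function analyzePayload(payload: string) {
  return agent.run({ input: payload });
}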

The Drafter runs first, always. It produces a structured output that includes a complexity assessment, a threat class, and a list of detected indicators. The escalationCondition function reads that output and makes a binary decision: is this something the Drafter can fully resolve, or does it need the Verifier?

If no escalation is needed, the Drafter's assessment is the final output. The Verifier never runs. The request completes in the Drafter's latency window.

If escalation is triggered, the full payload, plus the Drafter's initial assessment, gets passed to the Verifier. The Verifier now has both the raw evidence and the Drafter's preliminary read to work from. It doesn't start from scratch; it starts from a structured first pass that it can validate, expand, or contradict.
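
To make the two branches concrete, here is a framework-agnostic sketch of the same draft-then-verify flow. It is not the @cascadeflow/core API: `runCascade`, `ModelCall`, `DraftAssessment`, and the other names are hypothetical, and the model calls are injected as plain async functions so the control flow is visible.

```typescript
// Illustrative only: these types and names are assumptions, not library APIs.
interface DraftAssessment {
  complexity: 'low' | 'medium' | 'high';
  threatClass: string;
  indicators: string[];
  summary: string;
}

interface VerifierInput {
  payload: string;        // the raw evidence
  draft: DraftAssessment; // the Drafter's first pass
}

interface CascadeResult {
  resolvedBy: 'drafter' | 'verifier';
  draft: DraftAssessment;
  report: string;
}

type ModelCall<I, O> = (input: I) => Promise<O>;

async function runCascade(
  payload: string,
  drafter: ModelCall<string, DraftAssessment>,
  verifier: ModelCall<VerifierInput, string>,
  shouldEscalate: (draft: DraftAssessment) => boolean,
): Promise<CascadeResult> {
  // The Drafter always runs first.
  const draft = await drafter(payload);

  // No escalation: the Drafter's assessment is the final output.
  if (!shouldEscalate(draft)) {
    return { resolvedBy: 'drafter', draft, report: draft.summary };
  }

  // Escalation: the Verifier gets the payload and the Drafter's read.
  const report = await verifier({ payload, draft });
  return { resolvedBy: 'verifier', draft, report };
}
```

Keeping `draft` in the result even on the escalated path means the audit block can show what the Drafter concluded before the Verifier reviewed it.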

This is the key design insight: the Verifier isn't just a smarter version of the Drafter doing the same job better. It's a second analyst reviewing the Drafter's work with more time, more capability, and more context.

What the Drafter Actually Does

The Drafter's job is intake classification, not shallow analysis. When it reads a payload, it's asking:

Is this a known threat pattern or something novel?
Is the attack chain simple (single-stage) or complex (multi-stage, lateral movement, persistence mechanisms)?
Are there indicators that require deep forensic reconstruction — obfuscated code, zero-day signatures, command-and-control patterns?
Can I produce a complete, actionable mitigation with confidence? Or am I uncertain enough that a second opinion is warranted?

For a straightforward brute-force attempt against a login endpoint, the Drafter can do all of this. It knows what a brute-force looks like. It knows the mitigation playbook. It produces a structured assessment with high confidence and the request is done.

For a PowerShell payload with obfuscated base64 encoding, chained execution, and registry modifications that could indicate a persistence mechanism — the Drafter flags complexity, identifies the indicators it can see, and escalates. It doesn't try to reason through something it isn't well-suited for.
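
Routing on the Drafter's output only works if that output is well-formed. The sketch below shows one way to validate it, assuming the Drafter is prompted to reply with JSON carrying the `complexity`, `threatClass`, and `indicators` fields used by the escalation condition. The fail-closed fallback (treat an unparseable reply as high complexity so it escalates) is our assumption for illustration, as are the names `parseDrafterOutput` and `UNPARSEABLE`.

```typescript
// Hypothetical validation step; field names mirror the escalation condition.
interface DrafterOutput {
  complexity: 'low' | 'medium' | 'high';
  threatClass: string;
  indicators: string[];
}

// Fallback used when the reply is malformed. It forces escalation.
const UNPARSEABLE: DrafterOutput = {
  complexity: 'high',
  threatClass: 'unknown',
  indicators: ['unparseable-drafter-output'],
};

function isComplexity(v: unknown): v is DrafterOutput['complexity'] {
  return v === 'low' || v === 'medium' || v === 'high';
}

function parseDrafterOutput(raw: string): DrafterOutput {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return UNPARSEABLE; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return UNPARSEABLE;

  const { complexity, threatClass, indicators } = data as Record<string, unknown>;
  if (!isComplexity(complexity)) return UNPARSEABLE;
  if (typeof threatClass !== 'string') return UNPARSEABLE;
  if (!Array.isArray(indicators)) return UNPARSEABLE;
  if (!indicators.every((i) => typeof i === 'string')) return UNPARSEABLE;

  return { complexity, threatClass, indicators: indicators as string[] };
}
```

Failing closed costs an extra Verifier call on a malformed reply, which seems a better trade for a security tool than silently trusting a broken classification.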

The practical effect: the Drafter handles the volume. The Verifier handles the depth. You get fast responses on the roughly 75-80% of payloads that are routine, and thorough responses on the 20-25% that genuinely need it.

What the Verifier Does Differently

When the Verifier runs, it's not just running the same analysis with a bigger model. It's doing a qualitatively different class of work.

The Drafter produces a linear assessment: here is what this payload is, here is the threat level, here is the mitigation. The Verifier reconstructs the full attack chain — the sequence of events an attacker would have executed, what they were trying to achieve at each step, where defenders have opportunities to intervene.

For a complex payload like obfuscated PowerShell, the Verifier output looks substantially different:

FORENSIC ANALYSIS — SENTRI/HIRA
================================
Payload Type: PowerShell — Obfuscated Multi-Stage Loader
Escalation: Drafter flagged high complexity, lateral movement indicators
Analyst: gemini-2.5-pro (Verifier)

ATTACK CHAIN RECONSTRUCTION:
Stage 1 — Initial Execution
  Obfuscated launcher using [System.Text.Encoding]::UTF8 decode chain
  Decodes to: Invoke-WebRequest to external C2 at 185.220.101.x
  Indicator: Known Tor exit node range

Stage 2 — Payload Retrieval
  Downloads secondary PowerShell script from C2
  Script modifies HKLM\Software\Microsoft\Windows\CurrentVersion\Run
  Establishes persistence via registry autorun key

Stage 3 — Credential Harvesting
  Loads Mimikatz-derivative module (detected: sekurlsa::logonpasswords pattern)
  Targets LSASS memory for credential extraction

THREAT CLASS: APT-pattern intrusion attempt (high confidence)
ZERO-DAY INDICATORS: None confirmed, but C2 domain has 0 prior detections

MITIGATION PLAYBOOK (in order):
1. Isolate affected host immediately — cut network access before Stage 3 completes
2. Kill PowerShell process tree (PID identification steps attached)
3. Remove registry persistence key at [path specified above]
4. Block outbound traffic to 185.220.101.0/24 at perimeter
5. Run full credential rotation for any accounts with sessions on affected host
6. Submit C2 domain to threat intel platform — potential new IOC
7. Preserve memory dump of LSASS before remediation for forensic record

CONFIDENCE: High on Stages 1-2, Medium on Stage 3 (process not confirmed complete)

This is not something the Drafter would produce for this payload. The attack chain reconstruction, the per-stage breakdown, the IOC submission recommendation — these require the reasoning depth of the Verifier. The routing decision that put this payload in front of the Verifier, rather than letting the Drafter handle it, is what made this output possible.

The Cost and Latency Reality

This is the part that surprised us most in practice.

Gemini Flash and Gemini Pro are not just different in capability; they're dramatically different in cost and latency. By our estimate, Flash was roughly 10-15x cheaper per token than Pro and 3-5x faster on typical payloads, though the exact ratios depend on current pricing and on the mix of input and output tokens. Running everything through Pro, which was our starting point, meant paying Pro prices for every Googlebot crawl, every standard brute-force, every textbook phishing email.

With the Drafter/Verifier split:

Drafter-only resolution: ~1-3 seconds, Flash pricing
Escalated to Verifier: ~8-15 seconds, Pro pricing for the Verifier call only (Flash still runs first)
Escalation rate in practice: roughly 20-25% of payloads

The majority of payloads never touch the Pro model. The cost reduction on high-volume deployments is significant — not a small optimisation, a fundamental change in the economics of running a continuously-monitoring security agent.

The latency story matters for a different reason. A SOC analyst watching a live dashboard during an active incident doesn't want to wait 12 seconds for confirmation that the thing they just saw was a Googlebot crawl. Sub-3-second responses on routine events keep the operator's attention on what actually needs it. The Drafter/Verifier pattern aligns latency with stakes, not just cost.
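
A back-of-the-envelope version of that trade-off, as a sketch: because the Drafter always runs, the cascade costs one Drafter call plus a Verifier call on the escalated fraction. The function names are hypothetical, and the model assumes both models process about the same number of tokens per request, which ignores the extra Drafter context handed to the Verifier.

```typescript
// Cost of the cascade relative to sending every request to the Verifier.
// priceRatio = (Verifier price per token) / (Drafter price per token).
function cascadeCostRatio(escalationRate: number, priceRatio: number): number {
  if (escalationRate < 0 || escalationRate > 1) {
    throw new RangeError('escalationRate must be between 0 and 1');
  }
  if (priceRatio <= 0) {
    throw new RangeError('priceRatio must be positive');
  }
  // In units of one Drafter call: Drafter always, Verifier sometimes.
  const cascadeCost = 1 + escalationRate * priceRatio;
  const verifierOnlyCost = priceRatio;
  return cascadeCost / verifierOnlyCost;
}

// Mean latency, given the Drafter-only time and the total time of an
// escalated request (Drafter plus Verifier).
function expectedLatencySeconds(
  escalationRate: number,
  drafterOnlySeconds: number,
  escalatedTotalSeconds: number,
): number {
  return (
    (1 - escalationRate) * drafterOnlySeconds +
    escalationRate * escalatedTotalSeconds
  );
}

// Illustrative inputs: 25% escalation, 10x price gap, 2 s vs 12 s.
console.log(cascadeCostRatio(0.25, 10));          // 0.35
console.log(expectedLatencySeconds(0.25, 2, 12)); // 4.5
```

With those illustrative inputs the cascade costs about 35% of the Pro-only baseline, and the mean response time sits much closer to the Drafter's than to the Verifier's.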

Why Not Just Use a Single Mid-Tier Model?

The obvious question: why not pick a model that's somewhere between Flash and Pro and send everything through that?

We considered this. The problem is that, in our experience, security analysis is bimodal in complexity rather than spread evenly around a middle. Payloads are either routine (well within the capability of a fast, cheap model) or genuinely complex in ways that require serious reasoning depth. There isn't a large middle ground where a mid-tier model would outperform both alternatives.

A mid-tier model would be slower and more expensive than Flash on routine payloads, while being less capable than Pro on complex ones. You'd be paying the latency and cost premium without getting the full capability uplift on the cases that need it.

The Drafter/Verifier architecture exists precisely because the right answer for bimodal workloads is two specialised endpoints, not one compromised one. CascadeFlow makes it straightforward to implement this without manually building the routing logic yourself.

What CascadeFlow Made Easy (and What Wasn't)

The integration itself was clean. The @cascadeflow/core package handles the orchestration — running the Drafter, evaluating the escalation condition, passing context to the Verifier if needed, and returning a unified response object. We didn't have to build the two-call coordination logic, the context-passing between models, or the response merging. That's all in the framework.

What took more thought was defining the escalation condition correctly.

Our first version escalated on severity score alone — if the Drafter assessed severity above a threshold, escalate. This was too aggressive. High-severity events that are also well-understood (a large-scale but textbook DDoS, for instance) don't need the Verifier. They need a fast, confident response, not a deep forensic reconstruction.

We switched to escalating on complexity indicators rather than severity: obfuscation, lateral movement signals, zero-day pattern hints, novel C2 infrastructure. High severity + known pattern = Drafter handles it. Any complexity indicator = Verifier.

The distinction matters because it aligns the escalation condition with what the Verifier actually provides: depth of reasoning for genuinely novel or complex threats, not just a second opinion on things the Drafter already understands.
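
The difference between the two rules is easiest to see side by side. This is a sketch with made-up data: the `Assessment` shape, the 0-10 severity scale, the threshold, and the indicator strings are all assumptions for illustration.

```typescript
// Hypothetical Drafter assessment; the severity scale is an assumption.
interface Assessment {
  label: string;
  severity: number; // 0 (informational) to 10 (critical)
  indicators: string[];
}

// Illustrative indicator names for the complexity signals described above.
const COMPLEXITY_INDICATORS = new Set([
  'obfuscation',
  'lateral-movement',
  'zero-day-hint',
  'novel-c2',
]);

// First version: escalate whenever the event looks severe.
function escalateOnSeverity(a: Assessment, threshold = 7): boolean {
  return a.severity >= threshold;
}

// Revised version: escalate only when a complexity indicator is present.
function escalateOnComplexity(a: Assessment): boolean {
  return a.indicators.some((i) => COMPLEXITY_INDICATORS.has(i));
}

const textbookDdos: Assessment = {
  label: 'large but textbook DDoS',
  severity: 9,
  indicators: ['volumetric-flood'],
};

const oddScript: Assessment = {
  label: 'low-severity obfuscated script',
  severity: 3,
  indicators: ['obfuscation'],
};

for (const a of [textbookDdos, oddScript]) {
  console.log(a.label, {
    bySeverity: escalateOnSeverity(a),
    byComplexity: escalateOnComplexity(a),
  });
}
```

The severity rule sends the DDoS to the Verifier and lets the obfuscated script through; the complexity rule does the opposite, which is the behaviour we wanted.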

Lessons Learned

Escalation conditions should match model capability gaps, not severity scores. Severity and complexity are different axes. A severe but well-understood attack doesn't need the Verifier. A low-severity but genuinely novel payload does. Design your escalation condition around what the smarter model actually does better, not just what feels more important.

The Drafter's output is infrastructure for the Verifier. When escalation happens, the Verifier receives the Drafter's structured first-pass alongside the raw payload. This matters — the Verifier isn't starting from scratch, it's reviewing and extending an existing assessment. Design the Drafter's output format with the Verifier's needs in mind.

Measure escalation rate as a system health metric. If your escalation rate climbs unexpectedly, something changed — either the incoming payload distribution shifted, or your escalation condition is miscalibrated. We track escalation rate over time and treat spikes as signals worth investigating.
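
One simple way to watch for that drift is a sliding-window counter. This is a sketch, not our production monitoring: the class name, the window size, the baseline, and the tolerance are placeholder assumptions you would tune to your own traffic.

```typescript
// Tracks the escalation rate over the most recent N requests and flags
// when it drifts too far from an expected baseline.
class EscalationRateMonitor {
  private readonly outcomes: boolean[] = [];
  private readonly windowSize: number;
  private readonly baselineRate: number;
  private readonly tolerance: number;

  constructor(windowSize: number, baselineRate: number, tolerance: number) {
    this.windowSize = windowSize;
    this.baselineRate = baselineRate;
    this.tolerance = tolerance;
  }

  // Call once per analyzed payload.
  record(escalated: boolean): void {
    this.outcomes.push(escalated);
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift(); // drop the oldest outcome
    }
  }

  rate(): number {
    if (this.outcomes.length === 0) return 0;
    const escalated = this.outcomes.filter(Boolean).length;
    return escalated / this.outcomes.length;
  }

  // Only judges once the window is full, to avoid noisy early alerts.
  isAnomalous(): boolean {
    if (this.outcomes.length < this.windowSize) return false;
    return Math.abs(this.rate() - this.baselineRate) > this.tolerance;
  }
}

// Example: expect about 25% escalation, alert beyond +/- 10 points.
const monitor = new EscalationRateMonitor(200, 0.25, 0.1);
monitor.record(false);
monitor.record(true);
```

Checking the absolute difference catches drops as well as spikes. A sudden fall in escalation rate can mean the Drafter has stopped flagging things it should.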

The cost savings are real, but they're not the point. The Drafter/Verifier pattern isn't primarily a cost optimisation — it's a quality optimisation. Routine payloads get faster, more confident responses. Complex payloads get deeper, more thorough analysis. The cost reduction is a consequence of doing the right thing for each class of input, not the goal itself.

Build the routing layer before you need it. We bolted routing onto a system that already existed with a single-model pipeline, and migrating was non-trivial. If we'd designed cascadeRouter.ts as the primary entry point from the beginning, the migration wouldn't have been a migration at all. Runtime routing is architecture, not an afterthought.

The result is a security agent that behaves the way a good SOC team behaves: fast and confident on the familiar, thorough and methodical on the unknown. The Drafter handles the volume so the Verifier can focus on what actually deserves its attention.

That's what CascadeFlow made possible — not just cheaper inference, but the right kind of intelligence applied to the right kind of problem.

DEV Community