Incident response agent

#ai #agents #python #webdev

Our Agent Burned $40 in 3 Minutes. cascadeflow Got It to $1.

The first time we ran our incident response agent under load, it cost us $40 in three minutes. Forty dollars. For a tool meant to save engineers time, it was burning money faster than the incidents it was diagnosing.

The problem wasn't the agent logic. The problem was we were routing every single alert (a disk usage warning, a minor latency spike, a full database outage) through the same expensive model at the same cost per call. Nobody had thought about the runtime layer.

That changed when we added cascadeflow.

What cascadeflow Actually Does

Most teams think about AI agents in terms of prompts and models. What they miss is the runtime layer sitting between your code and the LLM API — the layer that decides which model, when, at what cost, with what guardrails.

cascadeflow is that layer. It is an open-source runtime intelligence library that handles model routing, budget enforcement, latency control, and audit logging — all inside your agent loop, with no external service required.

Install it in one line:

pip install cascadeflow

cascadeflow itself needs no API key and no dashboard (your model provider still needs its own credentials). It runs in-process, so it adds no extra network hop.

The Problem: Every Alert Is Not Created Equal

Our incident response agent handles everything from P0 database outages to INFO-level disk warnings. Before cascadeflow, every single one of those went through the same model — our most capable, most expensive option.

Here is what that looked like in practice:

A disk usage warning at 60% → $0.12 per call, overkill
A P0 database outage → $0.12 per call, justified
40 INFO alerts per day → $4.80 per day on alerts nobody reads

Multiply that across a week and you are spending real money on alerts that a much cheaper model could handle just as well.
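To make "multiply that across a week" concrete, here is the arithmetic as a minimal sketch. It uses only the two figures quoted above ($0.12 per call and 40 INFO alerts per day); the variable names are ours, and `Decimal` is used so the money math stays exact.

```python
from decimal import Decimal  # exact money arithmetic, no float rounding

COST_PER_CALL = Decimal("0.12")  # uniform per-call cost quoted above
INFO_ALERTS_PER_DAY = 40         # INFO volume quoted above

daily_info_cost = COST_PER_CALL * INFO_ALERTS_PER_DAY
weekly_info_cost = daily_info_cost * 7

print(f"INFO alerts per day:  ${daily_info_cost}")   # $4.80
print(f"INFO alerts per week: ${weekly_info_cost}")  # $33.60
```

That is $33.60 a week on the lowest-severity tier alone, before a single P0 has fired.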

The fix is not to use a worse model everywhere. The fix is to use the right model for each situation.

How We Built the Routing Layer

The core of cascadeflow in our agent is a routing function that maps alert severity to model choice:

# router.py
from cascadeflow import CascadeFlow

cf = CascadeFlow()

# Severity -> model. Critical incidents get the most capable model.
ROUTING_MAP = {
    "P0": "groq/llama3-70b-8192",    # Most capable, for critical incidents
    "P1": "groq/llama3-70b-8192",    # Still serious
    "P2": "groq/llama3-8b-8192",     # Faster, cheaper, good enough
    "P3": "groq/llama3-8b-8192",     # Routine issues
    "INFO": "groq/gemma2-9b-it",     # Cheapest, handles log summaries
}
DEFAULT_MODEL = "groq/llama3-8b-8192"  # Fallback for unknown or missing severities

def route_incident(alert: dict, memory_context: str) -> str:
    # SYSTEM_PROMPT and build_prompt() are agent helpers not shown here.
    model = ROUTING_MAP.get(alert.get("severity"), DEFAULT_MODEL)

    response = cf.complete(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(alert, memory_context)},
        ],
        budget_limit=0.05,  # Hard cap per call in USD
    )
    return response.content

The budget_limit parameter is the one I wish I had known about from day one. It puts a hard ceiling on what any single call can spend. When an alert storm fires 80 parallel calls at 3 a.m., that ceiling bounds the worst case at 80 × $0.05 = $4.00, which is the difference between a manageable bill and a very bad morning.
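The severity lookup is worth testing on its own, without any LLM call. The sketch below is library-independent: the model names come from the routing map above, while `pick_model` and its normalization rules (trimming whitespace, upper-casing, falling back on unknown labels) are our own illustration and not part of cascadeflow.

```python
# Library-independent sketch of the severity lookup. pick_model() is a
# hypothetical helper for illustration, not a cascadeflow function.
ROUTING_MAP = {
    "P0": "groq/llama3-70b-8192",
    "P1": "groq/llama3-70b-8192",
    "P2": "groq/llama3-8b-8192",
    "P3": "groq/llama3-8b-8192",
    "INFO": "groq/gemma2-9b-it",
}
DEFAULT_MODEL = "groq/llama3-8b-8192"


def pick_model(alert: dict) -> str:
    """Return the model for an alert, tolerating missing or oddly cased severities."""
    severity = str(alert.get("severity", "")).strip().upper()
    return ROUTING_MAP.get(severity, DEFAULT_MODEL)


for sample in ({"severity": "P0"}, {"severity": " info "}, {"severity": "SEV-9"}, {}):
    print(sample, "->", pick_model(sample))
```

Normalizing before the lookup means a label like "p0" from a misconfigured alert source still reaches the capable model instead of silently falling through to the default.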

Budget Enforcement: The Part Nobody Talks About

Most articles about AI agents focus on the quality of responses. Almost none talk about what happens when your agent runs 500 times in a day because an alert keeps firing.

cascadeflow handles this with a session-level budget that tracks cumulative spend:

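# budget_session.py
# BudgetExceeded is assumed to be exported by cascadeflow alongside BudgetSession.
from cascadeflow import BudgetExceeded, BudgetSession, CascadeFlow

from router import ROUTING_MAP

cf = CascadeFlow()

def run_with_daily_budget(alerts: list, daily_limit: float = 5.00):
    # build_messages() and queue_for_later() are agent helpers not shown here.
    with BudgetSession(cf, limit=daily_limit) as session:
        for alert in alerts:
            try:
                response = session.complete(
                    # Fall back to the mid-tier model for unknown severities
                    model=ROUTING_MAP.get(alert["severity"], "groq/llama3-8b-8192"),
                    messages=build_messages(alert),
                )
                print(f"✅ {alert['id']}: {response.content[:100]}")
            except BudgetExceeded:
                print(f"⚠️ Daily budget hit. Queuing {alert['id']} for next cycle.")
                queue_for_later(alert)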

When the daily budget is hit, the agent does not crash. It queues remaining alerts gracefully and moves on. Engineers get notified, not surprised.
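If you want to reason about this behavior without the library, the pattern is small enough to sketch in plain Python. Everything below (`SpendTracker`, `process_alerts`, the local `BudgetExceeded` class, the flat cost function) is hypothetical and only mimics the queue-on-exceed flow described above; it is not how cascadeflow implements its sessions.

```python
from dataclasses import dataclass
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when a charge would push cumulative spend past the limit."""


@dataclass
class SpendTracker:
    """Hypothetical stand-in for a session budget: tracks cumulative spend."""
    limit: Decimal
    spent: Decimal = Decimal("0")

    def charge(self, cost: Decimal) -> None:
        # Refuse the charge before spending, so the limit is never crossed.
        if self.spent + cost > self.limit:
            raise BudgetExceeded(f"{self.spent + cost} would exceed {self.limit}")
        self.spent += cost


def process_alerts(alerts, cost_of, tracker):
    """Handle alerts until the budget is hit; return (handled, queued) id lists."""
    handled, queued = [], []
    for alert in alerts:
        try:
            tracker.charge(cost_of(alert))
            handled.append(alert["id"])
        except BudgetExceeded:
            # Keep looping: a cheaper alert later in the list may still fit.
            queued.append(alert["id"])
    return handled, queued


tracker = SpendTracker(limit=Decimal("0.25"))
alerts = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
handled, queued = process_alerts(alerts, lambda _alert: Decimal("0.10"), tracker)
print("handled:", handled, "queued:", queued, "spent:", tracker.spent)
```

The design choice that matters is checking the budget before the charge rather than after it, so the limit is a ceiling and not a tripwire you have already stepped past.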

The Audit Trail: Why Every Decision Gets Logged

One thing we did not expect to care about was the audit log. We thought it was a nice-to-have. It turned out to be essential.

When an incident is resolved and someone asks "why did the agent recommend a rollback instead of a restart?", the answer needs to exist somewhere. cascadeflow logs every decision automatically:

# Every cf.complete() call logs:
{
    "timestamp": "2026-06-15T03:42:11Z",
    "alert_id": "INC-042",
    "model_selected": "groq/llama3-70b-8192",
    "routing_reason": "severity=P0",
    "input_tokens": 847,
    "output_tokens": 312,
    "cost_usd": 0.0034,
    "latency_ms": 1240,
    "budget_remaining": 3.42
}

That log entry is what you show a manager, a compliance team, or yourself at 4am when you are trying to understand what happened. No extra instrumentation required — cascadeflow writes it automatically.
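Once entries like that exist, a few lines of standard-library Python turn them into a per-model summary. This is a sketch under assumptions: it supposes the entries are available one JSON object per line (check the audit log schema in the docs for the real storage format), it reuses only field names from the entry above, and the second sample entry is made up for illustration.

```python
import json
from collections import defaultdict

# Assumed format: one JSON object per line. The first entry mirrors the one
# shown above; the second is invented sample data for illustration only.
SAMPLE_LOG = """\
{"alert_id": "INC-042", "model_selected": "groq/llama3-70b-8192", "cost_usd": 0.0034, "latency_ms": 1240}
{"alert_id": "INC-043", "model_selected": "groq/gemma2-9b-it", "cost_usd": 0.0004, "latency_ms": 310}
"""


def summarize(log_text: str) -> dict:
    """Aggregate call count, total cost, and worst latency per model."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "max_latency_ms": 0})
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        row = totals[entry["model_selected"]]
        row["calls"] += 1
        row["cost_usd"] += entry["cost_usd"]
        row["max_latency_ms"] = max(row["max_latency_ms"], entry["latency_ms"])
    return dict(totals)


summary = summarize(SAMPLE_LOG)
for model, row in summary.items():
    print(model, row)
```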

The Numbers After One Week

After running with cascadeflow routing for a week against our synthetic incident load:

INFO and P3 alerts moved to the cheaper model — cost per call dropped from $0.12 to $0.018
P0 and P1 alerts stayed on the capable model — quality unchanged where it matters
No call exceeded the $0.05 per-call budget cap
Total daily spend stabilized at under $1.20 for our test load
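For a sense of scale, the per-call figures above work out as follows. This is a small worked calculation using only numbers already quoted in this post ($0.12 before, $0.018 after, 40 INFO alerts per day); the variable names are ours.

```python
from decimal import Decimal  # exact money arithmetic, no float rounding

OLD_COST = Decimal("0.12")    # per call before routing
NEW_COST = Decimal("0.018")   # per call on the cheaper model
INFO_ALERTS_PER_DAY = 40

reduction = (OLD_COST - NEW_COST) / OLD_COST       # fraction saved per call
daily_before = OLD_COST * INFO_ALERTS_PER_DAY      # INFO spend before routing
daily_after = NEW_COST * INFO_ALERTS_PER_DAY       # INFO spend after routing

print(f"Per-call reduction: {reduction}")
print(f"INFO alerts per day: ${daily_before} -> ${daily_after}")
```

That is an 85% drop per call, and it takes the 40 daily INFO alerts from $4.80 to $0.72.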

The response quality on P0 incidents was identical. The response quality on INFO alerts was slightly less verbose — which is actually better, since nobody needs a 400-word analysis of a disk at 61% capacity.

What I Would Do Differently

Set budget caps before you run load tests, not after. We learned this the expensive way. The first thing you should do after installing cascadeflow is set a budget_limit on every call and a session limit on every run.

Log severity with every alert. The routing logic is only as good as the severity signal coming in. If everything is labeled P1 because someone was lazy with the alerting config, you lose all the routing benefit. Fix your alerting taxonomy first.

Use the audit log from day one. Even in development, the cost and latency data cascadeflow captures will tell you things about your agent's behavior that you would never discover otherwise.
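On the severity point, a quick check of your label distribution will tell you whether routing has anything to work with. The sketch below is our own illustration: `severity_shares`, `dominant_label`, and the 80% threshold are hypothetical choices, not cascadeflow features.

```python
from collections import Counter


def severity_shares(alerts: list) -> dict:
    """Return the fraction of alerts carrying each severity label."""
    counts = Counter(str(a.get("severity", "UNKNOWN")).upper() for a in alerts)
    total = sum(counts.values())
    return {sev: n / total for sev, n in counts.items()} if total else {}


def dominant_label(alerts: list, threshold: float = 0.8):
    """Return a label covering at least `threshold` of alerts, else None."""
    for sev, share in severity_shares(alerts).items():
        if share >= threshold:
            return sev
    return None


# Nine P1 alerts and one INFO alert: the taxonomy is effectively one bucket.
sample = [{"severity": "P1"}] * 9 + [{"severity": "INFO"}]
print(severity_shares(sample))
print("dominant:", dominant_label(sample))
```

If one label dominates like this, almost every alert lands on the same model and the routing map saves you very little.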

Getting Started

cascadeflow is open source and free. The cascadeflow docs cover model routing, budget sessions, provider setup, and the full audit log schema. It works natively with Groq, Ollama, OpenRouter, HuggingFace, and other major providers.

The install is one line. The routing config is twenty lines of Python. The budget cap is one parameter.

There is no good reason to let your agent decide its own runtime costs. Give it a budget and a routing map, and let cascadeflow handle the rest.