Your AI SRE Doesn't Need One Model — It Needs the Right Model for Each Job
We built our first AI SRE integration with a single model. Opus for everything — incident triage, Kubernetes debugging, IAM policy review, cost anomaly detection. Figured we'd use the best available and not overthink it.
Three months in, the cost was real. And honestly, most of the tasks didn't need Opus-grade reasoning. Checking if a pod is in CrashLoopBackOff doesn't require the same cognitive load as parsing a complex cross-account IAM policy trust relationship.
Rootly published benchmark results this week that put actual numbers on a hunch most of us have been carrying. If you're building AI SRE tooling — or about to — the findings are worth sitting with.
What the Benchmarks Found
Rootly ran Claude Sonnet 4.6 and Opus across four infrastructure task types: Kubernetes, IAM/S3 policy, compute, and general infra work.
The finding: Sonnet 4.6 performs comparably to Opus on Kubernetes and compute tasks. The gap opens up on complex IAM and policy reasoning — that's where Opus pulls ahead noticeably.
This isn't surprising once you think about it.
K8s debugging is largely pattern matching plus log interpretation. Pod OOMKilled, check memory limits. CrashLoopBackOff, check startup command and liveness probe. The model needs to recognize a known pattern and apply a known fix. That's well-represented training data territory — a smaller, faster model handles it well.
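That lookup-style reasoning can be sketched as a literal lookup. The status reasons below are real Kubernetes pod states; the table and function names are illustrative, not from any actual tool:

```python
# Illustrative mapping of common Kubernetes failure signatures to first checks.
KNOWN_PATTERNS = {
    "OOMKilled": "check container memory limits and requests",
    "CrashLoopBackOff": "check startup command, args, and liveness probe",
    "ImagePullBackOff": "check image tag spelling and registry credentials",
}

def first_check(status_reason: str) -> str:
    """Return a known first check, or flag the case for deeper analysis."""
    return KNOWN_PATTERNS.get(status_reason, "unknown pattern: escalate")
```

A task that is mostly this kind of recall is exactly where a smaller model holds its own.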
IAM is different. Cross-account trust policies with condition keys, SCPs interacting with permission boundaries, AssumeRole chains where the principal is a service role with a managed policy attached — you're asking the model to reason through dependency chains where one wrong inference changes the security posture of your entire account. That's where reasoning capacity actually matters.
What Model Routing Looks Like in Practice
You don't need a fancy framework to start. The simplest version is a routing function that maps task type to model at the entry point:
```python
# Map each task type to the cheapest model that handles it well.
TASK_MODEL_MAP = {
    "k8s_debug": "claude-sonnet-4-6",
    "compute_anomaly": "claude-sonnet-4-6",
    "cost_analysis": "claude-sonnet-4-6",
    "iam_policy_review": "claude-opus-4-6",
    "security_audit": "claude-opus-4-6",
    "incident_triage": "claude-sonnet-4-6",  # fast first pass
    "incident_rca": "claude-opus-4-6",       # deep analysis on escalation
}

def route_task(task_type: str, payload: dict) -> str:
    # Unknown task types fall back to the stronger model: misrouting a
    # security-adjacent task costs more than the extra tokens.
    model = TASK_MODEL_MAP.get(task_type, "claude-opus-4-6")
    return call_llm(model, payload)  # call_llm: your own API client wrapper
```
You classify the task type at the entry point — from alert metadata, the PagerDuty service name, or a lightweight pre-routing call — and route accordingly.
For incident workflows specifically, two-stage routing works well: Sonnet for fast first-pass triage (is this P1? what's the likely cause?), Opus for deep RCA if the incident escalates past 15 minutes or the initial assessment comes back inconclusive. Most incidents don't need the second stage. The ones that do, you want the better model for it.
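The escalation decision itself is a few lines. The 15-minute threshold comes from the workflow above; the model names and function shape are illustrative:

```python
def pick_incident_model(elapsed_minutes: float, triage_conclusive: bool) -> str:
    """Escalate to the larger model when triage stalls or is inconclusive."""
    if elapsed_minutes > 15 or not triage_conclusive:
        return "claude-opus-4-6"   # deep RCA pass
    return "claude-sonnet-4-6"     # fast triage is enough
```

Most incidents resolve in the first branch never being taken, which is where the savings come from.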
The Cost Math
At Anthropic's current pricing, Opus is roughly 3–5x the cost of Sonnet per output token.
If your AI SRE system processes 200 alerts per day with Opus on everything, routing 70% of those to Sonnet — the predictable K8s and compute tasks — cuts your monthly LLM spend on that system roughly in half (about 56% at a 5x price ratio) without touching quality where it matters.
At scale — a platform team handling 500+ alerts per day across 20 services — that's a meaningful number. And you're not sacrificing accuracy on the tasks where Sonnet performs comparably anyway.
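The arithmetic, with costs normalized to one Opus-priced alert and a 5x price ratio assumed (both numbers are assumptions for illustration):

```python
ALERTS_PER_DAY = 200
OPUS_COST_PER_ALERT = 1.0      # normalized unit cost
SONNET_COST_PER_ALERT = 0.2    # assumes Sonnet at ~1/5 the price of Opus
SONNET_SHARE = 0.7             # predictable K8s and compute tasks

all_opus = ALERTS_PER_DAY * OPUS_COST_PER_ALERT
routed = ALERTS_PER_DAY * (
    SONNET_SHARE * SONNET_COST_PER_ALERT
    + (1 - SONNET_SHARE) * OPUS_COST_PER_ALERT
)
savings = 1 - routed / all_opus
print(f"spend drops by {savings:.0%}")  # → spend drops by 56%
```

Swap in your own token counts per task type and the real per-token prices; the structure of the calculation stays the same.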
This is the FinOps instinct applied to LLM infrastructure. The same reasoning that makes you right-size EC2 instances and use Reserved capacity for predictable workloads applies here: match the resource to the task, don't overprovision across the board.
What Teams Get Wrong When They Start Doing This
Routing confidence. If your task classifier isn't sure whether something is "K8s debug" or "IAM-related K8s debug" (a role binding issue that surfaces as an auth error), you need a default. Default to Opus, not Sonnet. The cost of a wrong call on a security-adjacent task is much higher than the cost of the model.
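One way to encode that rule — threshold, the security-adjacent set, and the confidence source are all illustrative choices, not a standard:

```python
def route_with_confidence(task_type: str, confidence: float,
                          threshold: float = 0.8) -> str:
    """Fall back to the stronger model when classification is uncertain."""
    OPUS_ONLY = {"iam_policy_review", "security_audit"}
    if task_type in OPUS_ONLY or confidence < threshold:
        return "claude-opus-4-6"   # safe default for ambiguous or sensitive tasks
    return "claude-sonnet-4-6"
```

The asymmetry is deliberate: a false "cheap" route on an IAM-adjacent task is expensive in a way a false "expensive" route never is.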
Skipping caching. Sonnet's lower cost doesn't mean you should skip prompt caching. K8s debugging prompts carry a lot of repeated context — cluster state, runbook references, service topology. Caching the system prompt and the stable portion of the user message can cut token costs another 40–60%, independently of model choice. These aren't mutually exclusive optimizations.
Missing observability. When you're routing across models, your cost dashboard needs to break down by model and task type, not just total spend. Otherwise you can't tell whether cost increases came from higher alert volume or a routing misconfiguration that quietly started sending K8s tasks to Opus.
A simple Prometheus counter does it:
```
llm_requests_total{model="claude-sonnet-4-6", task_type="k8s_debug"}
llm_requests_total{model="claude-opus-4-6", task_type="iam_policy_review"}
```
Ten minutes to add. Saves hours of confusion when the bill comes in.
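In production you would use `prometheus_client`; the underlying pattern is just a counter keyed by (model, task_type) labels, which a dependency-free sketch makes clear:

```python
from collections import Counter

# In-memory stand-in for a labeled Prometheus counter.
llm_requests_total = Counter()

def record_request(model: str, task_type: str) -> None:
    """Increment the per-(model, task_type) counter, Prometheus-label style."""
    llm_requests_total[(model, task_type)] += 1

record_request("claude-sonnet-4-6", "k8s_debug")
record_request("claude-sonnet-4-6", "k8s_debug")
record_request("claude-opus-4-6", "iam_policy_review")
```

The label pair is the whole point: total spend alone can't distinguish a volume spike from a routing bug.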
What I'd Do Differently
Start logging task type and model from day one — even before you build the routing logic.
Run on a single model for a month, but tag every request with what kind of task it was. By the end of the month you have real data on your task distribution. You know what percentage of your volume is K8s debugging versus IAM work. You know where the long-tail reasoning tasks actually show up.
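Tagging can be as simple as one JSON log line per request — the field names here are arbitrary, and the destination is whatever log pipeline you already run:

```python
import json
import time

def log_llm_request(task_type: str, model: str, output_tokens: int) -> str:
    """Emit one JSON line per request so task distribution is queryable later."""
    record = {
        "ts": time.time(),
        "task_type": task_type,   # e.g. "k8s_debug", "iam_policy_review"
        "model": model,
        "output_tokens": output_tokens,
    }
    line = json.dumps(record)
    print(line)  # or ship to your log aggregator
    return line
```

A month of these lines is your task distribution, ready to group by `task_type`.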
Then build the router with data, not intuition.
I still don't fully know whether our routing is optimal — we're iterating on the task classification logic as new alert types emerge. But having the observability in place means we can actually see what's happening and improve it incrementally.
Where This Goes
This is the beginning of LLMOps as a real discipline. Most teams right now are at "pick a model and use it everywhere" — which is fine for experimentation, and honestly fine for small scale. But as AI SRE moves from pilot to production, the operational concerns show up: cost, reliability, latency, quality by task type.
The teams that treat LLM infrastructure the way they treat compute infrastructure — with cost visibility, right-sizing, and observability — will have a meaningful advantage over the ones still paying Opus rates to classify pod restarts.
Rootly's benchmarks are one data point. Your production data is a better one. Start collecting it.
If you're building AI SRE tooling and hitting interesting edge cases in model routing or task classification, reach out — I'm genuinely curious what patterns other teams are finding.
Daily DevOps & AI signals on Telegram → t.me/stackpulse1