What wakes you at 2am when an enterprise operator deploys your agent

#mlops #observability #sre #llmproduction

Month one of our enterprise rollout: three 3am pages. None of them were code bugs. None of them were model quality issues. All of them were operational surprises we had not accounted for before the handoff.

We were production-ready. We were not operator-ready. Those are different things.

Here's what the three incidents were, what they cost, and the alert set I now require before any enterprise agent deployment.

Incident one: rate limits hit at 6pm EST

The operator's usage pattern had a daily spike between 5pm and 7pm EST. Their team was closing out the day, reviewing agent outputs, running follow-up queries. Peak usage was 4x average.

Our OpenAI rate limit was sized for average load. At 6:04pm on day eight, we hit it. Every request after that returned 429 errors for eleven minutes. The agent returned a generic error message. Eleven minutes of silent failure during peak business hours.

Nobody had tested for burst traffic. We had tested for average load. The cost of finding this in production instead of before handoff: one executive complaint, two support tickets, and an emergency weekend call.

The fix was multi-provider routing with automatic failover: when provider A returns 429, route to provider B. We had this wired up within 48 hours. We should have had it before day one.

What this incident taught us about operator-readiness: size for the operator's peak, not your average. The operator's usage pattern will not match your test environment.

Incident two: $11,800 overage in 18 days

The operator had one very active team. Four engineers running batch analysis jobs against our agent at hours that made no operational sense (11pm local, Saturdays). Within 18 days, their usage represented 68% of total API cost.

We had a total cost ceiling. We did not have a per-tenant ceiling. When we hit the total ceiling, the entire deployment slowed down. The heavy team's cost was invisible to us until the month-end bill.

The question from the customer's VP of operations on day 20: "Can you break down the cost by team?"

We could not.

This is a FinOps problem that looks like a monitoring problem. Per-tenant cost tagging needs to be in the request headers before the first request goes out. Not added after month-end when someone asks the question.

Cost of finding this in production instead of before handoff: $11,800 overage, one executive conversation, two weeks of retroactive tagging work to approximate the breakdown.

Minimum viable cost attribution setup: tag every request with operator ID, team ID, and use-case ID at the gateway layer. Aggregate daily by tag. Alert when any single tag hits 70% of its monthly budget by day 15.

Incident three: no audit trail for a compliance request

Six weeks in, the operator's compliance officer needed to reconstruct a decision the agent made on a specific document at a specific time. Customer data question. The agent had made a routing decision that the compliance team needed to audit.

We had trace spans in our observability system. We did not have an immutable per-request audit log that showed: which document, which agent version, which model version, which prompt version, what the output was, what confidence score.

Trace spans are not audit logs. They're operational data. An audit log needs to be write-once, timestamped, and correlated with the business object (the document, the customer record, whatever the operator's domain object is).

We spent four days building a retroactive audit log approximation. It was not what the compliance officer needed, but it was the best we could do. The real audit log went live in week eight.

Cost of finding this in production instead of before handoff: one compliance near-miss, four days of engineering time, one uncomfortable meeting.

What operator-ready means vs. what production-ready means

Production-ready is about your system. Uptime, latency, error rates. Metrics you control.

Operator-ready is about what happens when your system runs inside someone else's organization, with their usage patterns, their cost constraints, their compliance requirements, their data.

The three incidents above were all foreseeable. The operator's usage pattern is discoverable before handoff (just ask them). Per-tenant cost attribution is a gateway configuration decision that takes a day to implement. Audit log requirements for regulated industries are documented in their compliance frameworks.

We didn't discover any of these things before the handoff because we didn't ask the right questions before the handoff.

The pre-operator readiness checklist I run now

Before any enterprise agent deployment, I go through five operational questions. These are not code reviews. They're operational configuration checks.

Rate limit sizing: What is the operator's expected peak usage, and what is our per-provider rate limit at that peak? If peak usage exceeds 60% of the rate limit, configure burst handling or multi-provider routing before go-live.

Cost attribution: Is every request tagged with the minimum attribution set (operator, team, use-case) at the gateway layer? If not, implement it before the first request. Do not add it retroactively.

Audit log schema: Does the operator operate in a regulated industry? If yes, map their compliance requirements to a specific log schema before deployment. Generic trace spans do not satisfy financial services, healthcare, or legal audit requirements.

Failover configuration: Is there a secondary provider configured for automatic failover? If not, is there a documented manual procedure and a stated SLA for the outage window?

Cost ceiling configuration: Is there a per-tenant monthly budget ceiling with an alert at 70% of budget? Not a per-account ceiling. Per tenant. An over-active team should not consume another team's budget.

These five checks take about two hours to complete and review. Three incidents in month one took about three weeks to fully resolve, plus the relationship cost.

What I'd page on

If I were setting up the alert set for a new enterprise agent deployment, these are the four alerts I'd configure first:

Alert one: per-tenant request rate above 80% of the tenant's configured rate limit. Not 100%. Leave 20% headroom to investigate before hitting the ceiling.

Alert two: per-request cost moving average above threshold for any single tenant (set the threshold based on the expected per-request cost, alert at 3x). Catches batch jobs and runaway loops before month-end.

Alert three: agent response with no trace ID in the audit log. Means the audit trail has a gap. You want to know about this in real time, not when a compliance officer asks.

Alert four: first-request p99 latency above 2x the steady-state p99. Catches cold-start regressions before the operator's peak usage hits them.

None of these alerts require custom infrastructure. They require that your gateway and your agent emit the right metadata on every request.

Get the metadata right before handoff. Fix the alerts before go-live. The 3am page is not a production incident. It's a pre-deployment checklist item you deferred.