Three weeks before the enterprise contract, the voice agent wasn't operator-ready

#voiceagents #llmproduction #mlops #ai

Look. We had 99.2% uptime in staging. We had eval coverage on 1,400 test turns. We had first-token latency under 280ms.

We were not operator-ready.

I know this because the enterprise pilot started on a Monday and we had our first critical incident by Tuesday afternoon.

This is what happened, what broke, and what the gateway layer decision actually looks like when you're under pressure to fix it fast.

The incident

The customer was a wealth management firm. Their advisors use a voice agent to pull client portfolio data, answer allocation questions, and schedule follow-ups. We'd been testing with synthetic personas for six weeks. The simulation results were clean.

Day one: a senior advisor ran a session that included three back-to-back allocation queries with large portfolio values. We hit our OpenAI rate limit at 6pm EST, right during peak advisor usage. Every request after that returned a 429. The agent logged nothing useful. The advisor's client was on hold for 4 minutes.

Day two: a compliance officer tried to pull the audit log for the day-one incident. There wasn't one. We had trace spans. We did not have a per-request log that showed which advisor, which client context, which tool calls, what the agent responded. That's a compliance gap, not a monitoring gap.

Week two: the VP of operations asked for the cost breakdown by team. We gave them a single number. They wanted per-advisor attribution. We had no per-tenant tagging.

Week three: the operations team pushed a new prompt version to fix a tone issue. Three hours later, the voice agent started refusing certain allocation questions it had previously handled fine. We had no prompt version pinned at inference time in the trace. We couldn't tell when the failure started or which requests were affected.

Four incidents. None of them were model quality issues. All of them traced back to the gateway layer we hadn't built.

What the gateway layer is supposed to do

Before this pilot I thought of the gateway as routing. Send the request to OpenAI, or Anthropic, or whichever provider. Handle retries. Done.

That was wrong.

The gateway for an enterprise operator deployment does at least five things:

Rate limiting per tenant. Not per account. Per tenant. An advisor with heavy usage should not blow the rate limit for the entire deployment.

Cost attribution. Every request tagged with the operator, the team, the user. Without this, you cannot answer the cost-attribution questions that come in month two.

Guardrail enforcement. For financial services: no advice that sounds like a specific investment recommendation. The guardrail needs to run on every response, not just when you remember to add it.

Audit logging. Immutable, per-request, with enough context to replay the interaction. This is a compliance requirement for most regulated industries, not a nice-to-have.

Multi-provider failover. When OpenAI hits 429, route to Anthropic. Not as a manual intervention. Automatically. The 4-minute incident on day one was preventable.
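
Two of those five jobs, per-tenant rate limiting and automatic failover, fit in a short sketch. This is a minimal illustration under stated assumptions, not how Portkey, LiteLLM, or any other gateway implements it: `TokenBucket`, `Gateway`, and `RateLimited` are hypothetical names, and each provider is just a Python callable that raises `RateLimited` when the upstream returns a 429.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


class RateLimited(Exception):
    """Hypothetical error: a tenant is over quota or an upstream returned 429."""


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size, in requests
    refill_per_sec: float  # sustained requests per second
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gateway:
    def __init__(self, providers: List[Tuple[str, Callable[[str], str]]],
                 capacity: float = 5, refill_per_sec: float = 1.0) -> None:
        self.providers = providers  # ordered, primary first
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets: Dict[str, TokenBucket] = {}

    def complete(self, tenant_id: str, prompt: str) -> str:
        # One bucket per tenant, so a heavy user only exhausts their own quota.
        if tenant_id not in self.buckets:
            self.buckets[tenant_id] = TokenBucket(self.capacity, self.refill_per_sec)
        if not self.buckets[tenant_id].allow():
            raise RateLimited(f"tenant {tenant_id} exceeded its own quota")
        # Try providers in order; an upstream 429 falls through to the next one.
        for _name, call in self.providers:
            try:
                return call(prompt)
            except RateLimited:
                continue
        raise RateLimited("all providers are rate limited")
```

The design point is that the two limits are separate: the tenant bucket protects the deployment from one heavy user, and the provider loop protects every user from one provider's limit. A production version would also need shared state across gateway instances, which is why self-hosted setups tend to put this in Redis.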

What I evaluated

After week one, I spent most of a weekend evaluating gateway options. Here's the honest breakdown:

LiteLLM (open-source, self-hosted). Most complete feature set if you want full control. Per-tenant rate limiting, cost tagging, provider fallback, proxy mode. The setup complexity is real: you need to maintain the deployment, configure Redis for rate-limiting persistence, and write your own audit log schema. For a team with Kubernetes infrastructure already in place, this is probably the right call. We were mid-pilot and needed faster setup.

Portkey (managed). Zero-config guardrails, built-in prompt versioning with a rollback UI, solid multi-provider routing. Pricing gets expensive at scale but the managed model means fast setup and less ops overhead. Their guardrail policies are more configurable than LiteLLM's out of the box. We ended up here for the pilot because we were under time pressure and needed zero-setup guardrail enforcement.

Future AGI's gateway (open-source, part of the future-agi platform). This is the gateway component of their end-to-end eval + observability + guardrail stack. It handles multi-provider routing with guardrail policies, rate limiting, and OTel-native tracing that feeds the same observability stack as the rest of the platform. I evaluated it specifically because we were already running FAGI's simulation tooling for our voice eval harness, and the unified stack had real appeal: guardrails, tracing, and eval all in one place.

For a team already on the FAGI platform for eval and simulation, the gateway is the right next layer. For a team coming in cold with no FAGI tooling, the setup cost is higher than Portkey or Helicone for a first operator deployment.

As of June 2026, the FAGI gateway ships the OpenAI-compatible proxy, multi-provider routing, guardrail policies, and OTel tracing in one stack.

Helicone (managed). Strongest on cost attribution and per-user analytics. The tagging system is granular and the dashboard is readable. Weaker on guardrails (less configurable than Portkey). Right call if your primary need is FinOps visibility and you're handling guardrails separately.

OpenRouter (managed). Pure routing. Multi-provider fallback, good for latency optimization across providers. Does not have per-tenant rate limiting or guardrail enforcement built in. Not the right call for an enterprise deployment that needs compliance features.

Bifrost (open-source). Fast proxy with interesting performance numbers. Newer, smaller community. I evaluated it and the latency story is real. But it was too new to commit to for a regulated industry deployment.

Week three: what we fixed

We were already deployed on Portkey for rate limiting and guardrail enforcement by week three. We added per-advisor tagging to every request. We pinned prompt versions at inference time and logged the version ID in each trace span.
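
For illustration, here is roughly what "tag every request and pin the prompt version" can look like as data. This is a sketch with hypothetical field names, not Portkey's metadata schema and not our exact code. The idea is that every request carries the tenant, team, and advisor, plus both a human-readable prompt version and a content digest, so a trace still identifies the exact prompt text even if someone reuses a version label.

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PinnedPrompt:
    """A prompt resolved to one immutable version before the request is sent."""
    name: str
    version: str
    text: str

    @property
    def digest(self) -> str:
        # Short content hash: changes whenever the prompt text changes.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def build_request_record(prompt: PinnedPrompt, tenant: str, team: str,
                         advisor_id: str) -> dict:
    """Attributes to attach to the trace span and the per-request log."""
    return {
        "request_id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "tenant": tenant,
        "team": team,
        "advisor_id": advisor_id,
        "prompt_name": prompt.name,
        "prompt_version": prompt.version,
        "prompt_digest": prompt.digest,
    }
```

With these fields on every span, "which requests ran on the new prompt" becomes a filter on `prompt_version` instead of a guess based on deploy timestamps.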

The prompt-version incident would have been caught immediately with version pinning. The cost-attribution ask would have been answered in two SQL queries.
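
The two queries are nothing exotic once the tags exist. A sketch using Python's built-in `sqlite3`, assuming a hypothetical `requests` table with one row per tagged request; the table layout, the sample rows, and the cost values are made up for illustration only.

```python
import sqlite3

# Columns that callers are allowed to group by (also guards the f-string below).
GROUPABLE = {"team", "advisor_id"}


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table of tagged requests with illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE requests (
               request_id     TEXT PRIMARY KEY,
               team           TEXT NOT NULL,
               advisor_id     TEXT NOT NULL,
               prompt_version TEXT NOT NULL,
               cost_usd       REAL NOT NULL
           )"""
    )
    conn.executemany(
        "INSERT INTO requests VALUES (?, ?, ?, ?, ?)",
        [
            ("r1", "advisory", "adv-1", "v3", 0.5),
            ("r2", "advisory", "adv-2", "v3", 0.25),
            ("r3", "ops", "adv-3", "v4", 1.0),
        ],
    )
    return conn


def cost_by(conn: sqlite3.Connection, column: str) -> list:
    """Total cost grouped by one tag column, sorted by that column."""
    if column not in GROUPABLE:
        raise ValueError(f"cannot group by {column!r}")
    sql = (
        f"SELECT {column}, ROUND(SUM(cost_usd), 4) "
        f"FROM requests GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(sql).fetchall()


conn = make_demo_db()
per_team = cost_by(conn, "team")           # query one: cost per team
per_advisor = cost_by(conn, "advisor_id")  # query two: cost per advisor
```

The same table answers the prompt-version question too: group or filter by `prompt_version` and you can see when a new version started serving traffic.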

The audit log took longer. Financial services audit logging has specific retention and immutability requirements that generic trace systems don't satisfy out of the box. We built a thin write-once layer on top of Portkey's logging that met the compliance spec. That was two days of work we should have done before the pilot.
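
To show the shape of a write-once layer, here is one common technique: a hash-chained, append-only log. This is a sketch of the general idea and not a description of what we shipped. `AuditLog` and its fields are hypothetical, it runs in memory, and a hash chain only makes tampering detectable. Real immutability and retention still depend on the storage underneath.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    def append(self, payload: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _entry_hash(prev, payload)
        self._entries.append({"prev": prev, "payload": dict(payload), "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != _entry_hash(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.append({"advisor_id": "adv-1", "tool_calls": ["get_portfolio"], "response": "..."})
log.append({"advisor_id": "adv-2", "tool_calls": [], "response": "..."})
```

The payload is where the compliance officer's questions get answered: which advisor, which client context, which tool calls, and what the agent responded.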

What shipped

Portkey for rate limiting and guardrail enforcement. Per-tenant tagging on every request. Prompt version pinning at inference time. Custom audit log layer for compliance.

The rate-limit incident did not recur. The cost-attribution question now takes two minutes to answer. The audit log meets the compliance spec.

What I'd tell past me

Architect the gateway before you talk to the enterprise customer. Not as an afterthought when the pilot starts hitting limits.

The questions you'll be asked in month one: "Who spent what, when, doing what, with what outcome?" If your gateway can't answer those questions, you are not operator-ready. The model quality is probably fine. The infrastructure around it is what will bite you.

And if you're already running FAGI's eval and simulation stack: evaluate their gateway component in parallel. The unified data model between guardrails, traces, and eval signals is genuinely useful for regulated deployments where you need the audit trail to connect back to eval coverage.

What I'm building next: a pre-operator readiness checklist that runs as a CI gate before any enterprise handoff. It checks per-tenant rate limit configuration, audit log schema coverage, and prompt version tracking. None of these should be manual.
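
A first pass at that gate could be as small as the sketch below. Everything here is an assumption about a checklist that does not exist yet: the config keys (`tenants`, `tenant_rate_limits`, `audit_log_fields`, `prompt_version_pinned`) and the required audit fields are hypothetical, chosen to mirror the gaps described above.

```python
from typing import Dict, List

# Fields the audit log must capture per request (assumed from the incidents above).
REQUIRED_AUDIT_FIELDS = {
    "advisor_id", "client_context", "tool_calls", "response", "prompt_version",
}


def readiness_failures(config: Dict) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures: List[str] = []

    # 1. Every tenant needs its own rate limit.
    limits = config.get("tenant_rate_limits", {})
    for tenant in config.get("tenants", []):
        if tenant not in limits:
            failures.append(f"no rate limit configured for tenant {tenant!r}")

    # 2. The audit log schema must cover every required field.
    missing = REQUIRED_AUDIT_FIELDS - set(config.get("audit_log_fields", []))
    if missing:
        failures.append("audit log schema missing: " + ", ".join(sorted(missing)))

    # 3. Prompt versions must be pinned at inference time.
    if not config.get("prompt_version_pinned", False):
        failures.append("prompt version is not pinned at inference time")

    return failures


def main(config: Dict) -> int:
    """Print failures and return a process exit code for the CI job."""
    failures = readiness_failures(config)
    for failure in failures:
        print("FAIL:", failure)
    return 1 if failures else 0
```

In CI this would run against the deployment config and the job would exit with the value of `main(...)`, so a missing tenant limit blocks the handoff instead of surfacing in week one of the pilot.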