Ali
How to reduce on-call friction using AI Voice Agent

*See how we at ilert use our AI Voice Agent to make on-call calls way smoother. It grabs incident context up front and plugs right into your call flows.*

Even with great automation and observability, on-call still has one very human pain point: the phone rings, you wake up, and you have basically no context. In the first minutes of a critical call, you’re not fixing anything yet; you’re just trying to understand what’s going on.

At ilert, we built the AI Voice Agent to change that. Instead of connecting callers straight to a sleepy engineer, the agent speaks to the caller first, collects the essential details, and then routes the call intelligently using up-to-date incident context. That way, when an engineer does get pulled in, they’re starting with real information — not guesswork.

The full version of this post was first published on the ilert Engineering Blog by my colleague, ilert engineer Jan.

What problem are we solving?

On-call engineers often receive urgent calls with minimal or messy context. The result is predictable: they have to ask the same qualifying questions over and over before they can even begin to help. In a high-pressure situation, those minutes matter.

The AI Voice Agent takes that initial burden off the engineer. It gathers the key facts before escalation, so engineers can jump directly into troubleshooting. It can also reduce unnecessary wake-ups by checking for open incidents and letting callers know when an issue is already being handled. And because the agent lives inside ilert’s Call Flow Builder, it fits into your existing routing logic instead of forcing you to bolt on a separate system. You decide which information it should collect: names, contact details, incident descriptions, affected services, or custom fields that align with your workflow.

How it fits into our Call Flow Builder

If you’ve used ilert’s Call Flow Builder, think of the AI Voice Agent as one more node you can place wherever it makes sense. It looks something like this:

*Image: an ilert call flow shown as a tree of nodes, with the AI Voice Agent at the top and create-alert and call-routing nodes descending below it.*

The builder is a visual tool where each node represents a step in call handling. The AI node can greet callers, ask structured questions, enrich context, and then route or escalate based on what it learns.
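Conceptually, the flow the builder produces is a tree you can model in a few lines. The node kinds and branch labels below are assumptions for illustration, not the builder's real configuration format:

```python
# Illustrative sketch: a call flow as a tree of nodes, the way the visual
# builder arranges them. Node kinds and branch labels are assumptions.
class FlowNode:
    def __init__(self, kind, children=None, **params):
        self.kind = kind
        self.children = children or {}  # branch label -> next node
        self.params = params

    def next(self, branch):
        return self.children.get(branch)

# Root: the AI Voice Agent node; which branch is taken depends on
# what the agent learns from the caller.
flow = FlowNode("ai_voice_agent", children={
    "new_incident": FlowNode("create_alert", escalate_to="on-call"),
    "known_incident": FlowNode("play_status", message="We're already on it."),
    "other": FlowNode("route_call", target="support-queue"),
})

print(flow.next("known_incident").kind)  # → play_status
```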

Architecture overview

Under the hood, the agent is designed for fast, modular conversations with low latency. Twilio handles real-time audio streaming to and from callers, while a WebSocket channel connects ilert to OpenAI for conversational turns. The Call Flow Builder provides the configuration layer, letting you tune behavior without touching code.
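On the Twilio side, Media Streams delivers JSON messages over the WebSocket, with audio arriving as base64-encoded payloads in `media` events. A minimal sketch of decoding one such message (the forwarding to OpenAI is omitted here):

```python
# Sketch of the Twilio leg: Media Streams sends JSON messages over a
# WebSocket; audio frames arrive as base64 payloads in "media" events.
# A real service would forward the decoded bytes to the model connection.
import base64
import json

def handle_twilio_message(raw):
    """Return ('audio', bytes) for media frames, or (event, None) otherwise."""
    msg = json.loads(raw)
    if msg.get("event") == "media":
        return "audio", base64.b64decode(msg["media"]["payload"])
    return msg.get("event"), None

frame = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\x7f\x00").decode()},
})
kind, audio = handle_twilio_message(frame)
print(kind, audio)  # → audio b'\x7f\x00'
```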

Inside Call Flow Builder, the AI Voice Agent is just one of the nodes we at ilert provide. The builder is visual: you connect nodes to shape what should happen during a call, step by step. Since the AI is a node too, you can drop it exactly where it makes sense in your flow. Maybe right at the start to collect context, or later to handle a specific part of the conversation.

What the agent does in that spot is pretty simple: it talks to the caller naturally, figures out what they’re calling about, and collects the key details you want before anyone gets escalated. If you enable context enrichment, it can also look at live ilert data like open incidents, service states, or maintenance windows. That way, it doesn’t just follow a script, but it reacts to what’s actually going on right now and routes the call accordingly.

Hard parts we had to solve

Making a voice agent feel natural and reliable in production comes with some real technical headaches.

One of the first was speaker tracking. Both Twilio and OpenAI emit events about who is speaking, but interpreting those signals consistently in real time is tricky. We needed to know precisely whether the bot or the caller was talking at any given moment; otherwise, the AI might interrupt the user or miss what they said.
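The core of the idea can be sketched as a tiny state machine that folds speech start/stop events from both legs into a single "who is talking now" view. The event names here are illustrative, not the exact Twilio or OpenAI event types:

```python
# Minimal sketch of speaker tracking: fold speech start/stop events from
# both legs (caller and bot) into one shared state. Event names are
# illustrative assumptions, not the actual provider event types.
class SpeakerTracker:
    def __init__(self):
        self.caller_speaking = False
        self.bot_speaking = False

    def on_event(self, event):
        if event == "caller_speech_started":
            self.caller_speaking = True
            if self.bot_speaking:
                self.bot_speaking = False  # caller interrupts: stop the bot
                return "interrupt_bot"     # signal to cancel the current reply
        elif event == "caller_speech_stopped":
            self.caller_speaking = False
        elif event == "bot_speech_started":
            self.bot_speaking = True
        elif event == "bot_speech_stopped":
            self.bot_speaking = False
        return None

t = SpeakerTracker()
t.on_event("bot_speech_started")
print(t.on_event("caller_speech_started"))  # → interrupt_bot
```

The `interrupt_bot` signal is what lets the system cut off its own audio instead of talking over the caller.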

Conversation flow was another big challenge. A voice interface that sounds robotic is a fast way to frustrate callers, so we invested heavily in prompt engineering and tuning cadence, tone, and responsiveness. We wanted it to feel like a helpful conversation, not a phone menu.

Finally, we had to keep multiple live connections synchronized. Twilio streams, OpenAI responses, and ilert backend state all need to stay aligned. If any part drifts, context gets messy and the agent starts acting confused. Tight orchestration and careful state management were essential.
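One common shape for this kind of orchestration is per-stream queues feeding a shared state object, so every consumer always works from the full, current context. A purely illustrative asyncio sketch, not ilert's actual implementation:

```python
# Sketch: keeping several live streams aligned via per-stream asyncio
# queues and one shared state dict, so no consumer acts on stale context.
# Purely illustrative; the real streams are network-bound.
import asyncio

async def consume(name, queue, state, log):
    while True:
        item = await queue.get()
        if item is None:  # sentinel: stream closed
            break
        state[name] = item                     # update the shared view
        log.append((name, item, dict(state)))  # each step sees full context

async def main():
    state, log = {}, []
    twilio_q, openai_q = asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(consume("twilio", twilio_q, state, log)),
        asyncio.create_task(consume("openai", openai_q, state, log)),
    ]
    await twilio_q.put("caller_audio_chunk")
    await openai_q.put("bot_reply_text")
    await twilio_q.put(None)
    await openai_q.put(None)
    await asyncio.gather(*tasks)
    return log

log = asyncio.run(main())
print(len(log))  # → 2
```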

Context-aware conversations in practice

What makes the Voice Agent different from traditional IVR systems is that it combines intent recognition with optional context enrichment. At call start, it receives possible intents, gathers follow-up paths, and captures the caller’s number. If enrichment is enabled, it also learns what’s happening in ilert right now: whether there are open incidents, degraded services, or maintenance windows. That lets it respond based on reality instead of reading a static script, and route callers to the right path much faster.
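The routing decision itself can be thought of as a small function over that live context. The context shape below is an assumption for illustration, not ilert's actual API response format:

```python
# Sketch of context-aware routing: given live context (open incidents,
# maintenance windows), decide how to handle the call before waking anyone.
# The context structure is an assumption, not ilert's API format.
def route_call(reported_service, context):
    if any(m["service"] == reported_service
           for m in context["maintenance_windows"]):
        return "inform_maintenance"     # planned work, no escalation
    if any(i["service"] == reported_service and i["status"] == "open"
           for i in context["incidents"]):
        return "inform_known_incident"  # already being worked on
    return "create_alert_and_escalate"  # genuinely new: wake the on-call

context = {
    "incidents": [{"service": "checkout", "status": "open"}],
    "maintenance_windows": [{"service": "search"}],
}
print(route_call("checkout", context))  # → inform_known_incident
```

The same caller report can end in three different places depending on what is true right now, which is exactly what a static IVR script cannot do.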

Lessons learned so far

Beta testing taught us that interruption isn’t an edge case – it’s how people naturally talk on the phone. Letting callers interrupt the AI makes the experience smoother, but it also makes accurate speaker tracking even more important. The same tracking helps detect long silences so calls don’t run forever when nobody is speaking. We also reaffirmed that prompt engineering is essentially part of product design here: the voice needs to sound human while staying inside clear operational boundaries. And, unsurprisingly, multi-stream synchronization remains a core reliability requirement in any real-time voice system.
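The silence-detection part of that tracking reduces to a watchdog over the last-speech timestamp. A minimal sketch, with timestamps injected to keep it deterministic (the timeout value is an arbitrary example):

```python
# Sketch of silence detection on top of speaker tracking: if neither party
# has spoken within the timeout window, end the call instead of letting it
# run forever. Timestamps are injected for determinism; 15 s is arbitrary.
class SilenceWatchdog:
    def __init__(self, timeout_s=15.0):
        self.timeout_s = timeout_s
        self.last_speech = 0.0

    def on_speech(self, now):
        self.last_speech = now  # any speech event resets the clock

    def should_hang_up(self, now):
        return (now - self.last_speech) > self.timeout_s

w = SilenceWatchdog(timeout_s=15.0)
w.on_speech(now=100.0)
print(w.should_hang_up(now=110.0), w.should_hang_up(now=120.0))  # → False True
```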
