Sales reps lose deals in the moments they can't see themselves — talking over the prospect, missing a buying signal, going silent when objections hit. A coach whispering in their ear would fix most of that. Most companies can't afford one per rep. An AI agent listening to the live call can be that coach.
Here's how to build one using the OpenAI Realtime API for low-latency audio understanding and LangGraph to orchestrate the coaching logic.
Why This Stack
- OpenAI Realtime API — streams audio in and text/audio out over a WebSocket, with latency low enough to feel live instead of batched.
- LangGraph — gives you a stateful graph instead of a single prompt loop, so "detect objection → check playbook → generate coaching tip → suppress duplicate tips" becomes explicit, debuggable nodes instead of one giant system prompt hoping for the best.
A single LLM call can transcribe and react. It can't reliably track call state (are we in discovery? pricing? closing?), avoid repeating the same tip twice, or escalate only when it matters. That's an orchestration problem — which is exactly what LangGraph is for.
Architecture
Mic/Call Audio
↓
OpenAI Realtime API (streaming transcription + voice activity detection)
↓
LangGraph State Machine
├── transcript_node → accumulates rolling transcript
├── stage_classifier_node → discovery / pitch / objection / pricing / closing
├── signal_detector_node → buying signals, objections, talk-time ratio
├── coach_node → generates a short, actionable tip (or stays silent)
└── dedupe_node → suppresses repeat tips within a time window
↓
Coaching tip → rep's screen (text overlay or quiet TTS in their earpiece)
The key design decision: the coach node should be allowed to say nothing. A coaching agent that fires a tip every 10 seconds is noise the rep will mute in week one.
Step 1: Stream Audio into the Realtime API
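import asyncio
import json
import os

import websockets

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

async def stream_call_audio(audio_chunks):
    async with websockets.connect(
        REALTIME_URL,
        # In websockets 14+, pass additional_headers instead of extra_headers.
        extra_headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
    ) as ws:
        # Text-only output with input transcription enabled. Session field
        # names vary across Realtime API versions; check the current reference.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "input_audio_transcription": {"model": "whisper-1"},
            },
        }))

        async def send_audio():
            async for chunk in audio_chunks:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": chunk,  # base64-encoded PCM16
                }))

        # Send in the background so transcripts arrive while audio is still
        # streaming, instead of only after the last chunk has been sent.
        sender = asyncio.create_task(send_audio())
        try:
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "conversation.item.input_audio_transcription.completed":
                    yield event["transcript"]
        finally:
            sender.cancel()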
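This gives you a live stream of transcribed utterances. The stream itself carries no speaker labels, so to attribute utterances to a speaker, one option is to run a separate session per audio track (rep and prospect) and tag each utterance with its source. Separate tracks are recommended: talk-time ratio is one of the highest-signal coaching metrics.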
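`stream_call_audio` expects `audio_chunks` to be an async iterator of base64-encoded PCM16 strings, but the walkthrough does not show where those come from. The sketch below is one way to produce them from raw bytes. The name `pcm16_chunks`, the 24 kHz mono sample rate, and the 100 ms chunk size are assumptions for illustration, so match them to whatever your telephony source and session audio format actually use.

```python
import asyncio
import base64
from typing import AsyncIterator

SAMPLE_RATE_HZ = 24_000   # assumed sample rate; match your session's audio format
BYTES_PER_SAMPLE = 2      # PCM16 = 16-bit samples, mono
CHUNK_MS = 100            # assumed chunk duration

# Bytes of audio per chunk: 24000 samples/s * 2 bytes * 0.1 s
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000


async def pcm16_chunks(raw: bytes, chunk_bytes: int = CHUNK_BYTES) -> AsyncIterator[str]:
    """Yield base64 strings, each wrapping one fixed-size slice of raw PCM16 audio."""
    for start in range(0, len(raw), chunk_bytes):
        yield base64.b64encode(raw[start:start + chunk_bytes]).decode("ascii")
        # Hand control back to the event loop so a receive loop can run between chunks.
        await asyncio.sleep(0)
```

For a recorded call you could pass `pcm16_chunks(raw)` as `audio_chunks`; for a live call you would read from the audio device or telephony stream instead of a bytes object.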
Step 2: Define the LangGraph State
import time
from typing import Literal, TypedDict

from langgraph.graph import StateGraph, END

MAX_UTTERANCES = 40        # assumed rough proxy for ~2 min of conversation
TIP_COOLDOWN_SECONDS = 45

class CallState(TypedDict):
    transcript: list[str]
    stage: Literal["discovery", "pitch", "objection", "pricing", "closing"]
    last_tip_at: float
    detected_signals: list[str]
    coaching_tip: str | None

# classify_stage, detect_signals, and generate_tip are placeholders you supply.

def transcript_node(state: CallState) -> CallState:
    # The caller appends the latest utterance; keep only the recent tail
    # for context window discipline.
    state["transcript"] = state["transcript"][-MAX_UTTERANCES:]
    return state

def stage_classifier_node(state: CallState) -> CallState:
    # cheap heuristic + LLM fallback for ambiguous turns
    state["stage"] = classify_stage(state["transcript"])
    return state

def signal_detector_node(state: CallState) -> CallState:
    state["detected_signals"] = detect_signals(state["transcript"], state["stage"])
    return state

def coach_node(state: CallState) -> CallState:
    # No signal means no tip: the coach is allowed to stay silent.
    if not state["detected_signals"]:
        state["coaching_tip"] = None
        return state
    state["coaching_tip"] = generate_tip(state["detected_signals"], state["stage"])
    return state

def dedupe_node(state: CallState) -> CallState:
    if not state["coaching_tip"]:
        return state  # nothing to gate; leave the cooldown clock alone
    now = time.time()
    if now - state["last_tip_at"] < TIP_COOLDOWN_SECONDS:
        state["coaching_tip"] = None  # too soon since last tip
    else:
        state["last_tip_at"] = now  # start the cooldown only when a tip is shown
    return state
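The nodes call `classify_stage`, `detect_signals`, and `generate_tip`, none of which are defined in this walkthrough. As a rough sketch of the "cheap heuristic" half of the stage classifier comment, the first two could start as plain keyword matching. The keyword lists and signal names here are made up for illustration, not a tested playbook, and the LLM fallback for ambiguous turns is left out.

```python
STAGE_KEYWORDS = {
    # Illustrative keywords only; tune against your own call transcripts.
    "pricing": ("price", "cost", "budget", "discount"),
    "objection": ("too expensive", "not sure", "concern", "competitor"),
    "closing": ("contract", "sign", "next steps", "start date"),
    "pitch": ("our product", "feature", "demo"),
}

SIGNAL_KEYWORDS = {
    "buying_signal": ("how soon", "next steps", "can we start"),
    "objection": ("too expensive", "not sure", "concern"),
}


def classify_stage(transcript: list[str], default: str = "discovery") -> str:
    """Return the first stage whose keywords appear in the most recent utterance."""
    if not transcript:
        return default
    latest = transcript[-1].lower()
    for stage, keywords in STAGE_KEYWORDS.items():
        if any(k in latest for k in keywords):
            return stage
    return default


def detect_signals(transcript: list[str], stage: str) -> list[str]:
    """Return the names of signals whose keywords appear in the latest utterance."""
    # `stage` is unused in this sketch; it is kept to match the call in
    # signal_detector_node.
    if not transcript:
        return []
    latest = transcript[-1].lower()
    return [
        name
        for name, keywords in SIGNAL_KEYWORDS.items()
        if any(k in latest for k in keywords)
    ]
```

Substring matching is crude ("sign" also matches "design"), and falling back to `"discovery"` on no match can reset the stage mid-call. A fuller version would probably keep the previous stage or hand the ambiguous turn to the LLM.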
Step 3: Wire the Graph
graph = StateGraph(CallState)
graph.add_node("transcript", transcript_node)
graph.add_node("stage", stage_classifier_node)
graph.add_node("signals", signal_detector_node)
graph.add_node("coach", coach_node)
graph.add_node("dedupe", dedupe_node)
graph.set_entry_point("transcript")
graph.add_edge("transcript", "stage")
graph.add_edge("stage", "signals")
graph.add_edge("signals", "coach")
graph.add_edge("coach", "dedupe")
graph.add_edge("dedupe", END)
app = graph.compile()
Run this graph on every new transcribed utterance:
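async def run_coach(audio_chunks):
    state: CallState = {"transcript": [], "stage": "discovery", "last_tip_at": 0.0,
                        "detected_signals": [], "coaching_tip": None}
    async for transcript_chunk in stream_call_audio(audio_chunks):
        state["transcript"].append(transcript_chunk)
        # ainvoke keeps the event loop free for the audio stream
        state = await app.ainvoke(state)
        if state["coaching_tip"]:
            push_to_rep_screen(state["coaching_tip"])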
What Makes the Coaching Actually Useful
A few non-obvious lessons from building agents like this:
- Silence is a feature. Gate every tip through a dedupe/cooldown node. Reps tune out agents that talk too much.
- Stage-aware tips beat generic tips. "Ask a follow-up question" means nothing without knowing you're in discovery vs. pricing.
- Talk-time ratio is the cheapest high-value signal. You don't need an LLM to compute it — a running word-count ratio between speaker tracks catches "rep is monologuing" instantly.
- Keep the coach node's prompt narrow. One job: turn a detected signal into one short, actionable sentence. Don't let it also try to summarize the call — split that into its own node.
- Log every tip + outcome. You'll want to evaluate which tips actually correlate with better close rates, and that requires structured logging from day one, not an afterthought.
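The talk-time point needs no model call at all. Here is a minimal sketch of a running word-count ratio, assuming each utterance arrives tagged with a speaker label (`"rep"` or `"prospect"`). The `TalkTimeTracker` class and the 0.65 threshold are illustrative assumptions, not values from a real playbook.

```python
from collections import Counter


class TalkTimeTracker:
    """Tracks each speaker's share of the words spoken so far."""

    def __init__(self) -> None:
        self.words = Counter()

    def add(self, speaker: str, utterance: str) -> None:
        # Whitespace word count is a crude but cheap proxy for talk time.
        self.words[speaker] += len(utterance.split())

    def ratio(self, speaker: str = "rep") -> float:
        """Share of all words spoken by `speaker`, or 0.0 before anyone speaks."""
        total = sum(self.words.values())
        return self.words[speaker] / total if total else 0.0

    def is_monologuing(self, speaker: str = "rep", threshold: float = 0.65) -> bool:
        # The threshold is an assumed starting point; tune it per team.
        return self.ratio(speaker) > threshold
```

A windowed variant that only counts the last few utterances would likely react faster to a mid-call monologue than this whole-call total.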
Where to Go Next
- Add a post-call summary node that runs once the call ends, rolling up every signal and tip into a CRM-ready note.
- Add a playbook retrieval node (RAG over your team's actual sales playbook) so tips are grounded in your specific methodology, not generic SaaS sales advice.
- Run an eval suite against recorded calls before shipping to live reps. Silent failures (no tip when one was clearly needed) are worse than noisy ones, because nobody reports a tip that never appeared.
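Structured tip logging and the post-call summary fit together: if every tip is written as a structured record during the call, the summary is mostly a rollup of those records. The sketch below works under stated assumptions: the `TipRecord` fields, the JSON Lines format, and `summarize_call` are all hypothetical, and a real CRM note would need your own schema.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TipRecord:
    call_id: str
    timestamp: float               # seconds since call start
    stage: str
    signals: list[str]
    tip: str
    outcome: Optional[str] = None  # filled in later, e.g. from the CRM


def to_jsonl(records: list[TipRecord]) -> str:
    """Serialize records as JSON Lines, one tip per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


def summarize_call(records: list[TipRecord]) -> dict:
    """Roll up tip records into counts suitable for a short post-call note."""
    return {
        "tip_count": len(records),
        "tips_by_stage": dict(Counter(r.stage for r in records)),
        "signals_seen": sorted({s for r in records for s in r.signals}),
    }
```

Keeping `outcome` on the same record is what later lets you join tips against close rates without reconstructing anything from free-text logs.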
This pattern — realtime audio in, LangGraph for stateful decision logic, narrow single-purpose nodes — generalizes well beyond sales coaching. Swap the stage classifier and playbook for your domain and you've got the same architecture for support-call QA, interview coaching, or live compliance monitoring.
Have you shipped a realtime voice agent? What's tripped you up most — latency, state management, or getting the agent to know when to shut up?