Build a Real-Time Sales Coaching AI Agent with OpenAI Realtime API + LangGraph

#ai #langgraph #agentaichallenge #openai

Sales reps lose deals in the moments they can't see themselves — talking over the prospect, missing a buying signal, going silent when objections hit. A coach whispering in their ear would fix most of that. Most companies can't afford one per rep. An AI agent listening to the live call can be that coach.

Here's how to build one using the OpenAI Realtime API for low-latency audio understanding and LangGraph to orchestrate the coaching logic.

Why This Stack

OpenAI Realtime API — streams audio in and text/audio out over a WebSocket, with latency low enough to feel live instead of batched.
LangGraph — gives you a stateful graph instead of a single prompt loop, so "detect objection → check playbook → generate coaching tip → suppress duplicate tips" becomes explicit, debuggable nodes instead of one giant system prompt hoping for the best.

A single LLM call can transcribe and react. It can't reliably track call state (are we in discovery? pricing? closing?), avoid repeating the same tip twice, or escalate only when it matters. That's an orchestration problem — which is exactly what LangGraph is for.

Architecture

Mic/Call Audio
     ↓
OpenAI Realtime API (streaming transcription + voice activity detection)
     ↓
LangGraph State Machine
  ├── transcript_node      → accumulates rolling transcript
  ├── stage_classifier_node → discovery / pitch / objection / pricing / closing
  ├── signal_detector_node  → buying signals, objections, talk-time ratio
  ├── coach_node            → generates a short, actionable tip (or stays silent)
  └── dedupe_node           → suppresses repeat tips within a time window
     ↓
Coaching tip → rep's screen (text overlay or quiet TTS in their earpiece)

The key design decision: the coach node should be allowed to say nothing. A coaching agent that fires a tip every 10 seconds is noise the rep will mute in week one.
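The "say nothing" rule and the dedupe step in the diagram can be sketched as one small gate in plain Python. This is only an illustration under stated assumptions: `TipGate`, its field names, and the 300-second repeat window are hypothetical, and the 45-second cooldown simply mirrors the value used in Step 2.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TipGate:
    """Decides whether a candidate tip should reach the rep (hypothetical helper)."""

    cooldown_s: float = 45.0         # minimum gap between any two tips
    repeat_window_s: float = 300.0   # assumed gap before the same tip may repeat
    last_tip_at: float = float("-inf")
    sent_at: dict = field(default_factory=dict)  # tip text -> last time shown

    def allow(self, tip, now=None):
        """Return True and record the tip if it should be shown, else False."""
        now = time.monotonic() if now is None else now
        if not tip:
            return False  # the coach chose to stay silent
        if now - self.last_tip_at < self.cooldown_s:
            return False  # too soon after the previous tip
        if now - self.sent_at.get(tip, float("-inf")) < self.repeat_window_s:
            return False  # this exact tip was shown recently
        self.last_tip_at = now
        self.sent_at[tip] = now
        return True
```

Passing `now` explicitly makes the gate easy to replay against recorded calls; in live use you would call `gate.allow(tip)` and let it read the clock.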

Step 1: Stream Audio into the Realtime API

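import asyncio
import json
import os

import websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

async def stream_call_audio(audio_chunks):
    """Yield completed utterance transcripts while audio is still being sent."""
    async with websockets.connect(
        REALTIME_URL,
        # websockets >= 14 names this additional_headers; older releases use extra_headers
        additional_headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
    ) as ws:
        # These session fields vary between Realtime API versions;
        # check the current API reference before relying on them.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text"],
                "input_audio_transcription": {"model": "whisper-1"}
            }
        }))

        async def send_audio():
            async for chunk in audio_chunks:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": chunk  # base64-encoded PCM16
                }))

        # Send in the background so transcripts are read while the call is live;
        # sending everything first would hold back all output until the call ends.
        sender = asyncio.create_task(send_audio())
        try:
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "conversation.item.input_audio_transcription.completed":
                    yield event["transcript"]
        finally:
            sender.cancel()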

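This gives you a live stream of transcribed utterances. The transcription events carry no speaker labels, so to get speaker-segmented output, feed the rep and prospect audio tracks through separate sessions and tag each utterance with its track (recommended: talk-time ratio is one of the highest-signal coaching metrics).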
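One way to get speaker-tagged utterances from separate rep/prospect tracks is to run one transcript stream per track and merge them. The helper below is a hypothetical sketch in plain asyncio: `merge_speaker_streams` is not part of any library, and it assumes each stream is an async iterator of utterance strings, such as one `stream_call_audio(...)` call per track.

```python
import asyncio


async def merge_speaker_streams(streams):
    """Yield (speaker, utterance) pairs from several async transcript streams.

    `streams` maps a speaker label to an async iterator of utterance strings.
    """
    queue = asyncio.Queue()
    done = object()  # sentinel marking the end of one stream

    async def pump(speaker, stream):
        # Forward every utterance from one track, tagged with its speaker.
        try:
            async for utterance in stream:
                await queue.put((speaker, utterance))
        finally:
            await queue.put(done)

    tasks = [asyncio.create_task(pump(s, it)) for s, it in streams.items()]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is done:
                remaining -= 1
            else:
                yield item
    finally:
        # Stop any pumps still running if the consumer exits early.
        for task in tasks:
            task.cancel()
```

With two tracks this might be called as `merge_speaker_streams({"rep": stream_call_audio(rep_chunks), "prospect": stream_call_audio(prospect_chunks)})`, where `rep_chunks` and `prospect_chunks` are assumed names for the two audio sources. Note that errors inside a pump are not re-raised in this sketch; production code would want to surface them.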

Step 2: Define the LangGraph State

import time
from typing import Literal, TypedDict

from langgraph.graph import StateGraph, END

MAX_UTTERANCES = 40   # assumed stand-in for "last ~2 min" of conversation
TIP_COOLDOWN_S = 45   # minimum gap between tips shown to the rep

class CallState(TypedDict):
    transcript: list[str]
    stage: Literal["discovery", "pitch", "objection", "pricing", "closing"]
    last_tip_at: float
    detected_signals: list[str]
    coaching_tip: str | None

# classify_stage, detect_signals and generate_tip are helpers you supply.

def transcript_node(state: CallState) -> CallState:
    # The caller appends the latest utterance; trim here for context window discipline
    state["transcript"] = state["transcript"][-MAX_UTTERANCES:]
    return state

def stage_classifier_node(state: CallState) -> CallState:
    # cheap heuristic + LLM fallback for ambiguous turns
    state["stage"] = classify_stage(state["transcript"])
    return state

def signal_detector_node(state: CallState) -> CallState:
    state["detected_signals"] = detect_signals(state["transcript"], state["stage"])
    return state

def coach_node(state: CallState) -> CallState:
    # Staying silent is a valid outcome: no signals means no tip.
    if not state["detected_signals"]:
        state["coaching_tip"] = None
        return state
    state["coaching_tip"] = generate_tip(state["detected_signals"], state["stage"])
    return state

def dedupe_node(state: CallState) -> CallState:
    if not state["coaching_tip"]:
        return state  # nothing to gate; leave the cooldown clock untouched
    now = time.time()
    if now - state["last_tip_at"] < TIP_COOLDOWN_S:
        state["coaching_tip"] = None  # too soon since last tip
    else:
        state["last_tip_at"] = now  # a tip is going out; restart the cooldown
    return state
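The nodes above call `classify_stage`, `detect_signals`, and `generate_tip` without defining them. As a rough starting point, here is a keyword-only sketch of all three. Everything in it is an assumption for illustration: the keyword lists, signal names, and tip texts are made up, and a real version would add the LLM fallback mentioned in the comments above.

```python
# Hypothetical keyword tables; tune these to your own calls.
STAGE_KEYWORDS = {
    "pricing": ("price", "cost", "budget", "discount"),
    "objection": ("too expensive", "not sure", "concern"),
    "closing": ("contract", "paperwork", "next steps"),
    "pitch": ("demo", "feature", "integration"),
}
SIGNAL_KEYWORDS = {
    "price_objection": ("too expensive", "over budget"),
    "buying_signal": ("how soon", "next steps"),
    "competitor_mention": ("competitor", "currently using"),
}
# Tips keyed by (signal, stage); a stage of None is the generic fallback.
TIPS = {
    ("price_objection", "objection"): "Ask what they are comparing the price to.",
    ("price_objection", None): "Acknowledge the concern before defending the price.",
    ("buying_signal", None): "Propose a concrete next step now.",
    ("competitor_mention", None): "Ask what they like about their current tool.",
}


def classify_stage(transcript, window=5):
    """Pick the stage whose keywords appear most in the last few utterances."""
    recent = " ".join(transcript[-window:]).lower()
    scores = {
        stage: sum(recent.count(word) for word in words)
        for stage, words in STAGE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "discovery"  # default when nothing matches


def detect_signals(transcript, stage):
    """Return signal names whose keywords appear in the latest utterance.

    `stage` is accepted to match the node's call but is unused in this sketch.
    """
    latest = transcript[-1].lower() if transcript else ""
    return [
        name for name, words in SIGNAL_KEYWORDS.items()
        if any(word in latest for word in words)
    ]


def generate_tip(signals, stage):
    """Return the first stage-specific tip, falling back to a generic one."""
    for signal in signals:
        tip = TIPS.get((signal, stage)) or TIPS.get((signal, None))
        if tip:
            return tip
    return None
```

Substring matching is crude (short keywords can match inside longer words), which is one reason the original comment suggests an LLM fallback for ambiguous turns.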

Step 3: Wire the Graph

graph = StateGraph(CallState)

graph.add_node("transcript", transcript_node)
graph.add_node("stage", stage_classifier_node)
graph.add_node("signals", signal_detector_node)
graph.add_node("coach", coach_node)
graph.add_node("dedupe", dedupe_node)

graph.set_entry_point("transcript")
graph.add_edge("transcript", "stage")
graph.add_edge("stage", "signals")
graph.add_edge("signals", "coach")
graph.add_edge("coach", "dedupe")
graph.add_edge("dedupe", END)

app = graph.compile()

Run this graph on every new transcribed utterance:

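async def run_coach(audio_chunks):
    state: CallState = {"transcript": [], "stage": "discovery", "last_tip_at": 0.0,
                        "detected_signals": [], "coaching_tip": None}
    async for transcript_chunk in stream_call_audio(audio_chunks):
        state["transcript"].append(transcript_chunk)
        state = await app.ainvoke(state)  # async call keeps the event loop free
        if state["coaching_tip"]:
            push_to_rep_screen(state["coaching_tip"])  # your own UI hook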

What Makes the Coaching Actually Useful

A few non-obvious lessons from building agents like this:

Silence is a feature. Gate every tip through a dedupe/cooldown node. Reps tune out agents that talk too much.
Stage-aware tips beat generic tips. "Ask a follow-up question" means nothing without knowing you're in discovery vs. pricing.
Talk-time ratio is the cheapest high-value signal. You don't need an LLM to compute it — a running word-count ratio between speaker tracks catches "rep is monologuing" instantly.
Keep the coach node's prompt narrow. One job: turn a detected signal into one short, actionable sentence. Don't let it also try to summarize the call — split that into its own node.
Log every tip + outcome. You'll want to evaluate which tips actually correlate with better close rates, and that requires structured logging from day one, not as an afterthought.
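The talk-time point above needs no model at all. Here is a minimal sketch of a rolling word-count ratio, assuming utterances arrive already tagged with a speaker label; `TalkTimeTracker`, the 20-utterance window, and the 0.7 threshold are illustrative choices, not recommendations from any library.

```python
from collections import deque


class TalkTimeTracker:
    """Tracks each speaker's share of words over the most recent utterances."""

    def __init__(self, window=20):
        # Each entry is (speaker, word_count); old entries fall off automatically.
        self.recent = deque(maxlen=window)

    def add(self, speaker, utterance):
        self.recent.append((speaker, len(utterance.split())))

    def ratio(self, speaker="rep"):
        """Fraction of recent words spoken by `speaker` (0.0 if no words yet)."""
        total = sum(count for _, count in self.recent)
        if total == 0:
            return 0.0
        spoken = sum(count for who, count in self.recent if who == speaker)
        return spoken / total

    def is_monologuing(self, speaker="rep", threshold=0.7, min_words=50):
        """True once enough words were seen and `speaker` dominates them."""
        total = sum(count for _, count in self.recent)
        return total >= min_words and self.ratio(speaker) > threshold
```

The `min_words` guard keeps the tracker from firing in the first few seconds of a call, when one greeting can push the ratio to 100%.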

Where to Go Next

Add a post-call summary node that runs once the call ends, rolling up every signal and tip into a CRM-ready note.
Add a playbook retrieval node (RAG over your team's actual sales playbook) so tips are grounded in your specific methodology, not generic SaaS sales advice.
Run an eval suite against recorded calls before shipping to live reps — silent failures (no tip when one was clearly needed) are worse than noisy ones.
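For the eval suite, a first version can be as simple as replaying labeled utterances and counting the four outcomes, with silent failures tracked separately from noise. The sketch below assumes you have hand-labeled turns from recorded calls; `EvalCounts` and `evaluate` are hypothetical names, and `coach` stands for any callable that wraps your graph and returns a tip or `None`.

```python
from dataclasses import dataclass


@dataclass
class EvalCounts:
    hits: int = 0      # tip expected and produced
    missed: int = 0    # tip expected, none produced (silent failure)
    spurious: int = 0  # tip produced, none expected (noise)
    quiet: int = 0     # correctly stayed silent


def evaluate(coach, labeled_turns):
    """Replay labeled utterances through `coach` and tally the outcomes.

    `coach` is any callable taking an utterance and returning a tip or None.
    `labeled_turns` is an iterable of (utterance, tip_expected) pairs.
    """
    counts = EvalCounts()
    for utterance, tip_expected in labeled_turns:
        tip_given = bool(coach(utterance))
        if tip_expected and tip_given:
            counts.hits += 1
        elif tip_expected:
            counts.missed += 1
        elif tip_given:
            counts.spurious += 1
        else:
            counts.quiet += 1
    return counts
```

This only checks whether a tip fired, not whether it was the right tip; grading tip quality would need a further labeling step.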

This pattern — realtime audio in, LangGraph for stateful decision logic, narrow single-purpose nodes — generalizes well beyond sales coaching. Swap the stage classifier and playbook for your domain and you've got the same architecture for support-call QA, interview coaching, or live compliance monitoring.

Have you shipped a realtime voice agent? What's tripped you up most — latency, state management, or getting the agent to know when to shut up?