Building Real-Time Voice Learning with VAPI: WebRTC + AI Tutors

Text-based learning apps are passive. I wanted students to have real conversations with AI tutors—natural voice interactions where they could ask questions, get explanations, and learn through dialogue. So I built IntelliCourse with VAPI's voice AI SDK, creating a platform where students talk to personalized AI companions about Math, Science, Coding, and more. Here's how I implemented real-time voice sessions, including the WebRTC connection flow and transcript management that makes it feel like talking to a real tutor.

The Problem: Why Voice Learning?

Traditional learning platforms rely on text chat or pre-recorded videos. But learning is conversational:

  • Students need to ask follow-up questions in real-time
  • Complex concepts are easier to explain through dialogue
  • Voice is faster than typing (150 words/min vs 40 words/min)
  • Natural conversation keeps attention better than reading

I needed a solution that could:

  1. Handle real-time bidirectional audio
  2. Transcribe speech to text (Speech-to-Text)
  3. Process with GPT-4 for intelligent responses
  4. Track session duration and save transcripts

The Tech Stack: VAPI + WebRTC

VAPI (Voice AI Platform Interface) handles the entire voice pipeline:

User speaks → Deepgram (STT) → GPT-4 (AI) → 11Labs (TTS) → User hears

Key components:

  • Transcriber: Deepgram Nova-3 (latest high-accuracy model)
  • AI Model: OpenAI GPT-4 (conversational intelligence)
  • Voice: 11Labs (4 voices: male/female × formal/casual)
  • Connection: WebRTC (low-latency audio streaming)
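
Every snippet below talks to a shared vapi client (vapi.start, vapi.on, vapi.stop, vapi.setMuted). Its setup isn't shown in this post; a minimal sketch using the @vapi-ai/web SDK, where the file path and env var name are my assumptions, looks like this:

// lib/vapi.sdk.ts (path and env var name assumed)
import Vapi from '@vapi-ai/web';

// One shared client for the whole app; the public web token comes from the VAPI dashboard
export const vapi = new Vapi(process.env.NEXT_PUBLIC_VAPI_WEB_TOKEN!);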

Implementation: Voice Assistant Configuration

Step 1: Define Voice Personalities

I created 4 voice options based on user preferences:

// constants/index.ts
const voices = {
  male: {
    casual: '2BJW5coyhAzSr8STdHbE',  // 11Labs voice ID
    formal: 'c6SfcYrb2t09NHXiT80T'
  },
  female: {
    casual: 'ZIlrSGI4jZqobxRKprJz',
    formal: 'sarah'                  // 11Labs default
  }
};

Step 2: Configure VAPI Assistant

// lib/utils.ts
const configureAssistant = (
  companionName: string,
  subject: string,
  topic: string,
  voice: 'male' | 'female',
  style: 'formal' | 'casual',
) => {
  const voiceId = voices[voice][style];

  return {
    name: companionName,
    firstMessage: `Hello, let's start the session. Today we'll be talking about ${topic}.`,

    // Speech-to-Text Configuration
    transcriber: {
      provider: 'deepgram',
      model: 'nova-3',           // Latest Deepgram model (95%+ accuracy)
      language: 'en',
    },

    // Text-to-Speech Configuration
    voice: {
      provider: '11labs',
      voiceId: voiceId,
      stability: 0.4,            // Voice consistency (0-1)
      similarityBoost: 0.8,      // Match original voice (0-1)
      speed: 1,                  // Normal playback speed
      style: 0.5,                // Emotional range (0-1)
      useSpeakerBoost: true,     // Enhance clarity
    },

    // Conversational AI Configuration
    model: {
      provider: 'openai',
      model: 'gpt-4',
      messages: [{
        role: 'system',
        content: `You are a highly knowledgeable tutor teaching ${subject}.

        Your goal: Teach the student about ${topic}.

        Guidelines:
        - Stick to the topic and subject
        - Check student understanding regularly
        - Break down complex concepts step-by-step
        - Keep responses short (voice conversation)
        - Use ${style} style (${style === 'formal' ? 'professional' : 'friendly'})
        - No special characters (voice only)`
      }]
    }
  };
};

The Session Lifecycle

States

enum CallStatus {
  INACTIVE,    // Not started
  CONNECTING,  // Establishing WebRTC
  ACTIVE,      // Live conversation
  FINISHED,    // Session ended
}

Component State Management

// components/CompanionComponent.tsx
const [callStatus, setCallStatus] = useState<CallStatus>(CallStatus.INACTIVE);
const [isSpeaking, setIsSpeaking] = useState(false);      // AI speaking?
const [isMuted, setIsMuted] = useState(false);            // Mic muted?
const [messages, setMessages] = useState<Message[]>([]);  // Transcript
const [sessionDuration, setSessionDuration] = useState(0); // Seconds
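
A few things the later snippets lean on aren't declared here: the Message shape, the timer bookkeeping state, and the Lottie ref that drives the soundwave animation. A sketch of what they might look like, with the shapes inferred from those snippets (the exact types are assumptions):

// Assumed supporting declarations (shapes inferred from the snippets below)
interface Message {
  role: 'user' | 'assistant';
  content: string;
  timestamp: Date;
}

// import { useRef } from 'react'; import { LottieRefCurrentProps } from 'lottie-react';
const [sessionStartTime, setSessionStartTime] = useState<Date | null>(null);
const [timerInterval, setTimerInterval] = useState<ReturnType<typeof setInterval> | null>(null);
const lottieRef = useRef<LottieRefCurrentProps>(null);  // controls the soundwave animation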

Starting a Session: WebRTC Connection Flow

const handleCall = useCallback(() => {
  // 1. Set connecting state
  setCallStatus(CallStatus.CONNECTING);
  showLoading('Connecting to your AI companion...');

  // 2. Configure assistant
  const assistantConfig = configureAssistant(
    companion.name,
    companion.subject,
    companion.topic,
    companion.voice,
    companion.style,
  );

  // 3. Start VAPI session
  vapi.start(assistantConfig, {
    variableValues: {
      subject: companion.subject,
      topic: companion.topic,
      style: companion.style,
    },
  });
}, [companion]);
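
One detail the snippet above glosses over: the browser has to grant microphone access before any audio flows. The permission prompt normally appears when the call starts, but preflighting it gives you a friendlier failure path. A hedged sketch, not part of the original flow:

// Optional mic preflight (my addition, not in the original code)
const ensureMicAccess = async (): Promise<boolean> => {
  try {
    // Triggers the browser permission prompt if access hasn't been granted yet
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((track) => track.stop()); // Release the device right away
    return true;
  } catch {
    alert('Microphone access is required to start a session.');
    return false;
  }
};

If you use it, it would run at the top of handleCall (which then needs to be async) before vapi.start().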

Event Handling: Real-Time Updates

Connection Events

useEffect(() => {
  // Session started
  vapi.on('call-start', () => {
    setCallStatus(CallStatus.ACTIVE);
    startTimer();
    closeModal();
  });

  // Session ended
  vapi.on('call-end', async () => {
    setCallStatus(CallStatus.FINISHED);
    await stopTimer();
    await saveSession();
  });

  // Error handling
  vapi.on('error', (error) => {
    console.error('VAPI Error:', error);
    setCallStatus(CallStatus.INACTIVE);
    alert('Connection failed. Please try again.');
  });

  return () => {
    vapi.removeAllListeners();
  };
}, []);

Transcription Events

// Real-time transcript updates
vapi.on('message', (message: Message) => {
  // Only process final transcripts (not interim)
  if (message.type === 'transcript' && message.transcriptType === 'final') {
    setMessages(prev => [{
      role: message.role,        // 'user' or 'assistant'
      content: message.transcript,
      timestamp: new Date()
    }, ...prev]);
  }
});

Speech Detection (Visual Feedback)

// AI starts speaking
vapi.on('speech-start', () => {
  setIsSpeaking(true);
  lottieRef.current?.play();  // Start soundwave animation
});

// AI stops speaking
vapi.on('speech-end', () => {
  setIsSpeaking(false);
  lottieRef.current?.stop();   // Stop soundwave animation
});

Session Timer Implementation

const startTimer = useCallback(() => {
  const startTime = new Date();
  setSessionStartTime(startTime);

  // Update every second
  const interval = setInterval(() => {
    setSessionDuration(prev => prev + 1);
  }, 1000);

  setTimerInterval(interval);
}, []);

const stopTimer = useCallback(async () => {
  if (timerInterval) {
    clearInterval(timerInterval);
  }

  const durationMinutes = Math.max(1, Math.round(sessionDuration / 60));

  // Save to database
  await saveSessionData(companion.id, durationMinutes);
}, [sessionDuration, timerInterval]);
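
The live-session view further down renders this counter through a formatTime helper that isn't shown in the post. A minimal MM:SS version, with the implementation assumed from the name:

// Assumed MM:SS formatter; only the name appears in the UI snippet below
const formatTime = (totalSeconds: number): string => {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${seconds.toString().padStart(2, '0')}`;
};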

Microphone Control

const toggleMicrophone = useCallback(() => {
  if (callStatus !== CallStatus.ACTIVE) return;

  const newMutedState = !isMuted;
  vapi.setMuted(newMutedState);
  setIsMuted(newMutedState);
}, [callStatus, isMuted]);

Ending a Session: Cleanup & Data Persistence

const handleDisconnect = useCallback(async () => {
  // 1. Stop VAPI
  vapi.stop();

  // 2. Stop timer
  await stopTimer();

  // 3. Format transcript
  const transcriptText = [...messages]  // copy first so .reverse() doesn't mutate state
    .reverse()  // Chronological order
    .map(msg => {
      const speaker = msg.role === 'assistant' 
        ? companion.name.split(' ')[0]  // "Neura" from "Neura the Explorer"
        : userName;
      return `${speaker}: ${msg.content}`;
    })
    .join('\n');

  // 4. Save to Supabase
  const durationMinutes = Math.max(1, Math.round(sessionDuration / 60));
  const sessionId = await addToSessionHistory(companion.id);
  await updateSessionDuration(companion.id, durationMinutes);
  await saveSessionTranscript(sessionId, transcriptText);

  // 5. Show success
  showSuccess('Lesson Completed! Great job!');
}, [messages, companion, userName, sessionDuration, stopTimer]);

Database Schema

CREATE TABLE session_history (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  companion_id UUID REFERENCES companions(id),
  user_id TEXT NOT NULL,
  duration_minutes INTEGER DEFAULT 0,
  transcript TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);
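
The addToSessionHistory and saveSessionTranscript calls from handleDisconnect aren't shown either. A rough sketch of how they could map onto this table with the Supabase client and Clerk auth (the function bodies, file path, and createSupabaseClient helper are my assumptions, not the project's actual code):

// lib/actions/session.actions.ts (sketch only; names reused from the calls above)
'use server';

import { auth } from '@clerk/nextjs/server';
import { createSupabaseClient } from '@/lib/supabase'; // assumed helper returning a Supabase client

export const addToSessionHistory = async (companionId: string) => {
  const { userId } = await auth();          // Clerk user id for the NOT NULL user_id column
  const supabase = createSupabaseClient();

  const { data, error } = await supabase
    .from('session_history')
    .insert({ companion_id: companionId, user_id: userId })
    .select('id')
    .single();

  if (error) throw new Error(error.message);
  return data.id as string;
};

export const saveSessionTranscript = async (sessionId: string, transcript: string) => {
  const supabase = createSupabaseClient();

  const { error } = await supabase
    .from('session_history')
    .update({ transcript, completed_at: new Date().toISOString() })
    .eq('id', sessionId);

  if (error) throw new Error(error.message);
};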

Transcript format:

John: What are neural networks?
Neura: Neural networks are computational models inspired by the brain...
John: How do they learn?
Neura: They learn through backpropagation...

UI Implementation: Live Session View

return (
  <div className="session-container">
    {/* AI Avatar with Soundwave */}
    <div className="ai-avatar">
      {callStatus === CallStatus.ACTIVE && isSpeaking ? (
        <Lottie lottieRef={lottieRef} animationData={soundwaves} autoplay loop />
      ) : (
        <Image src={subjectIcon} alt={companion.subject} />
      )}
    </div>

    {/* Session Timer */}
    <div className="timer">
      <p>Session Duration</p>
      <p className="text-2xl">{formatTime(sessionDuration)}</p>
      <div className={callStatus === CallStatus.ACTIVE ? 'pulse-indicator' : ''} />
    </div>

    {/* Microphone Toggle */}
    <button 
      onClick={toggleMicrophone}
      disabled={callStatus !== CallStatus.ACTIVE}
    >
      <Image src={isMuted ? micOff : micOn} alt={isMuted ? 'Unmute microphone' : 'Mute microphone'} />
    </button>

    {/* Start/End Button */}
    <button onClick={callStatus === CallStatus.ACTIVE ? handleDisconnect : handleCall}>
      {callStatus === CallStatus.ACTIVE ? 'End Session' : 'Start Session'}
    </button>

    {/* Live Transcript */}
    <div className="transcript">
      {messages.map((msg, i) => (
        <p key={i} className={msg.role === 'user' ? 'text-primary' : ''}>
          {msg.role === 'assistant' ? companion.name : userName}: {msg.content}
        </p>
      ))}
    </div>
  </div>
);

Results After 2 Months

User metrics:

  • Average session: 18 minutes
  • Completion rate: 82% (users finish sessions they start)

Technical performance:

  • WebRTC connection: < 2 seconds
  • Transcription latency: ~200ms
  • AI response time: 800ms average
  • Voice synthesis: ~400ms

User feedback:

  • "Feels like talking to a real tutor"
  • "Better than reading textbooks"
  • "I can ask clarifying questions immediately"

What I'd Do Differently

1. Add session pause/resume

  • Current: Session runs continuously
  • Better: Pause for breaks, resume later (rough sketch below)
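
VAPI doesn't expose a pause primitive that I know of, so the closest approximation with the pieces already in this component would be muting the mic and freezing the timer, then undoing both on resume. An entirely hypothetical sketch:

// Hypothetical pause/resume built from the existing state (not in the project)
const [isPaused, setIsPaused] = useState(false);

const togglePause = useCallback(() => {
  if (callStatus !== CallStatus.ACTIVE) return;

  if (!isPaused) {
    vapi.setMuted(true);                              // Stop sending the student's audio
    if (timerInterval) clearInterval(timerInterval);  // Freeze the duration counter
  } else {
    vapi.setMuted(false);
    setTimerInterval(setInterval(() => setSessionDuration(prev => prev + 1), 1000));
  }
  setIsPaused(prev => !prev);
}, [callStatus, isPaused, timerInterval]);

This only silences the student's side; the assistant can keep talking, so a real pause would probably mean ending the call and restarting it later with the saved transcript as context.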

2. Implement conversation branching

  • Track topic coverage
  • Suggest related concepts
  • Create learning paths

3. Add real-time sentiment detection

  • Detect confusion in voice tone
  • Adjust explanation complexity
  • Offer alternative explanations

Tech Stack

  • Frontend: Next.js 15 + React 19 + TypeScript
  • Voice AI: VAPI SDK v2.3.0
  • Database: Supabase (PostgreSQL)
  • Auth: Clerk
  • STT: Deepgram Nova-3
  • TTS: 11Labs
  • AI: OpenAI GPT-4

GitHub: https://github.com/chayan-1906/IntelliCourse-Next.js


Voice AI isn't just for chatbots. With VAPI handling the WebRTC complexity, you can build conversational learning experiences in a weekend. The key is good system prompts and proper event handling.

Questions about VAPI or voice AI implementation? Drop them below.
