Text-based learning apps are passive. I wanted students to have real conversations with AI tutors—natural voice interactions where they could ask questions, get explanations, and learn through dialogue. So I built IntelliCourse with VAPI's voice AI SDK, creating a platform where students talk to personalized AI companions about Math, Science, Coding, and more. Here's how I implemented real-time voice sessions, including the WebRTC connection flow and transcript management that makes it feel like talking to a real tutor.
The Problem: Why Voice Learning?
Traditional learning platforms rely on text chat or pre-recorded videos. But learning is conversational:
- Students need to ask follow-up questions in real-time
- Complex concepts are easier to explain through dialogue
- Voice is faster than typing (~150 words/min speaking vs ~40 words/min typing)
- Natural conversation keeps attention better than reading
I needed a solution that could:
- Handle real-time bidirectional audio
- Transcribe speech to text (Speech-to-Text)
- Process with GPT-4 for intelligent responses
- Track session duration and save transcripts
The Tech Stack: VAPI + WebRTC
VAPI handles the entire voice pipeline:
User speaks → Deepgram (STT) → GPT-4 (AI) → 11Labs (TTS) → User hears
Key components:
- Transcriber: Deepgram Nova-3 (latest high-accuracy model)
- AI Model: OpenAI GPT-4 (conversational intelligence)
- Voice: 11Labs (4 voices: male/female × formal/casual)
- Connection: WebRTC (low-latency audio streaming)
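All the snippets below assume a shared vapi client instance. Here's a minimal setup sketch, assuming the @vapi-ai/web package; the file path and env var name are placeholders of my own, not necessarily what the repo uses:
// lib/vapi.sdk.ts — minimal client setup (env var name is a placeholder)
import Vapi from '@vapi-ai/web';

// One shared instance; the web token is VAPI's public key, safe to expose in the browser
export const vapi = new Vapi(process.env.NEXT_PUBLIC_VAPI_WEB_TOKEN!);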
Implementation: Voice Assistant Configuration
Step 1: Define Voice Personalities
I created 4 voice options based on user preferences:
// constants/index.ts
const voices = {
male: {
casual: '2BJW5coyhAzSr8STdHbE', // 11Labs voice ID
formal: 'c6SfcYrb2t09NHXiT80T'
},
female: {
casual: 'ZIlrSGI4jZqobxRKprJz',
formal: 'sarah' // 11Labs default
}
};
Step 2: Configure VAPI Assistant
// lib/utils.ts
const configureAssistant = (companionName: string, subject: string, topic: string, voice: 'male' | 'female', style: 'formal' | 'casual') => {
const voiceId = voices[voice][style];
return {
name: companionName,
firstMessage: `Hello, let's start the session. Today we'll be talking about ${topic}.`,
// Speech-to-Text Configuration
transcriber: {
provider: 'deepgram',
model: 'nova-3', // Latest Deepgram model (95%+ accuracy)
language: 'en',
},
// Text-to-Speech Configuration
voice: {
provider: '11labs',
voiceId: voiceId,
stability: 0.4, // Voice consistency (0-1)
similarityBoost: 0.8, // Match original voice (0-1)
speed: 1, // Normal playback speed
style: 0.5, // Emotional range (0-1)
useSpeakerBoost: true, // Enhance clarity
},
// Conversational AI Configuration
model: {
provider: 'openai',
model: 'gpt-4',
messages: [{
role: 'system',
content: `You are a highly knowledgeable tutor teaching ${subject}.
Your goal: Teach the student about ${topic}.
Guidelines:
- Stick to the topic and subject
- Check student understanding regularly
- Break down complex concepts step-by-step
- Keep responses short (voice conversation)
- Use ${style} style (${style === 'formal' ? 'professional' : 'friendly'})
- No special characters (voice only)`
}]
}
};
};
The Session Lifecycle
States
enum CallStatus {
INACTIVE, // Not started
CONNECTING, // Establishing WebRTC
ACTIVE, // Live conversation
FINISHED, // Session ended
}
Component State Management
// components/CompanionComponent.tsx
const [callStatus, setCallStatus] = useState<CallStatus>(CallStatus.INACTIVE);
const [isSpeaking, setIsSpeaking] = useState(false); // AI speaking?
const [isMuted, setIsMuted] = useState(false); // Mic muted?
const [messages, setMessages] = useState<Message[]>([]); // Transcript
const [sessionDuration, setSessionDuration] = useState(0); // Seconds
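The Message type used for the transcript isn't an SDK export; a minimal shape that matches how the handlers below populate it (an assumption, not official typings):
// types/index.d.ts — assumed shape for stored transcript entries
interface Message {
  role: 'user' | 'assistant';
  content: string;
  timestamp: Date;
}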
Starting a Session: WebRTC Connection Flow
const handleCall = useCallback(() => {
// 1. Set connecting state
setCallStatus(CallStatus.CONNECTING);
showLoading('Connecting to your AI companion...');
// 2. Configure assistant
const assistantConfig = configureAssistant(
companion.name,
companion.subject,
companion.topic,
companion.voice,
companion.style,
);
// 3. Start VAPI session
vapi.start(assistantConfig, {
variableValues: {
subject: companion.subject,
topic: companion.topic,
style: companion.style,
},
});
}, [companion]);
Event Handling: Real-Time Updates
Connection Events
useEffect(() => {
// Session started
vapi.on('call-start', () => {
setCallStatus(CallStatus.ACTIVE);
startTimer();
closeModal();
});
// Session ended
vapi.on('call-end', async () => {
setCallStatus(CallStatus.FINISHED);
await stopTimer();
await saveSession();
});
// Error handling
vapi.on('error', (error) => {
console.error('VAPI Error:', error);
setCallStatus(CallStatus.INACTIVE);
alert('Connection failed. Please try again.');
});
return () => {
vapi.removeAllListeners();
};
}, []);
Transcription Events
// Real-time transcript updates
vapi.on('message', (message: Message) => {
// Only process final transcripts (not interim)
if (message.type === 'transcript' && message.transcriptType === 'final') {
setMessages(prev => [{
role: message.role, // 'user' or 'assistant'
content: message.transcript,
timestamp: new Date()
}, ...prev]);
}
});
Speech Detection (Visual Feedback)
// AI starts speaking
vapi.on('speech-start', () => {
setIsSpeaking(true);
lottieRef.current?.play(); // Start soundwave animation
});
// AI stops speaking
vapi.on('speech-end', () => {
setIsSpeaking(false);
lottieRef.current?.stop(); // Stop soundwave animation
});
Session Timer Implementation
const startTimer = useCallback(() => {
const startTime = new Date();
setSessionStartTime(startTime);
// Update every second
const interval = setInterval(() => {
setSessionDuration(prev => prev + 1);
}, 1000);
setTimerInterval(interval);
}, []);
const stopTimer = useCallback(async () => {
if (timerInterval) {
clearInterval(timerInterval);
}
const durationMinutes = Math.max(1, Math.round(sessionDuration / 60));
// Save to database
await saveSessionData(companion.id, durationMinutes);
}, [sessionDuration, timerInterval]);
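The timer state is kept in seconds; the UI further down renders it with a formatTime helper that isn't shown elsewhere, so here's a minimal sketch:
// Format elapsed seconds as MM:SS (e.g. 125 -> "02:05")
const formatTime = (totalSeconds: number): string => {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${String(minutes).padStart(2, '0')}:${String(seconds).padStart(2, '0')}`;
};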
Microphone Control
const toggleMicrophone = useCallback(() => {
if (callStatus !== CallStatus.ACTIVE) return;
const newMutedState = !isMuted;
vapi.setMuted(newMutedState);
setIsMuted(newMutedState);
}, [callStatus, isMuted]);
Ending a Session: Cleanup & Data Persistence
const handleDisconnect = useCallback(async () => {
// 1. Stop VAPI
vapi.stop();
// 2. Stop timer
await stopTimer();
const durationMinutes = Math.max(1, Math.round(sessionDuration / 60));
// 3. Format transcript (copy before reversing so we don't mutate state)
const transcriptText = [...messages]
.reverse() // Chronological order (newest messages are prepended)
.map(msg => {
const speaker = msg.role === 'assistant'
? companion.name.split(' ')[0] // "Neura" from "Neura the Explorer"
: userName;
return `${speaker}: ${msg.content}`;
})
.join('\n');
// 4. Save to Supabase
const sessionId = await addToSessionHistory(companion.id);
await updateSessionDuration(companion.id, durationMinutes);
await saveSessionTranscript(sessionId, transcriptText);
// 5. Show success
showSuccess('Lesson Completed! Great job!');
}, [messages, companion, userName, sessionDuration, stopTimer]);
Database Schema
CREATE TABLE session_history (
id UUID PRIMARY KEY,
companion_id UUID REFERENCES companions(id),
user_id TEXT NOT NULL,
duration_minutes INTEGER DEFAULT 0,
transcript TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
completed_at TIMESTAMPTZ
);
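handleDisconnect calls addToSessionHistory and saveSessionTranscript, which aren't shown above. A hedged sketch of what they could look like as server actions with supabase-js and Clerk; the createSupabaseClient helper, file path, and error handling are assumptions, not the repo's actual code:
'use server';
// lib/actions/session.actions.ts — illustrative sketch, not the exact implementation

import { auth } from '@clerk/nextjs/server';
import { createSupabaseClient } from '@/lib/supabase'; // assumed helper returning a supabase-js client

export const addToSessionHistory = async (companionId: string) => {
  const { userId } = await auth();
  const supabase = createSupabaseClient();

  // Insert a new row and return its generated id
  const { data, error } = await supabase
    .from('session_history')
    .insert({ companion_id: companionId, user_id: userId })
    .select('id')
    .single();

  if (error || !data) throw new Error(error?.message ?? 'Failed to create session');
  return data.id as string;
};

export const saveSessionTranscript = async (sessionId: string, transcript: string) => {
  const supabase = createSupabaseClient();

  // Attach the transcript and mark the session as completed
  const { error } = await supabase
    .from('session_history')
    .update({ transcript, completed_at: new Date().toISOString() })
    .eq('id', sessionId);

  if (error) throw new Error(error.message);
};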
Transcript format:
John: What are neural networks?
Neura: Neural networks are computational models inspired by the brain...
John: How do they learn?
Neura: They learn through backpropagation...
UI Implementation: Live Session View
return (
<div className="session-container">
{/* AI Avatar with Soundwave */}
<div className="ai-avatar">
{callStatus === CallStatus.ACTIVE && isSpeaking ? (
<Lottie animationData={soundwaves} autoplay loop />
) : (
<Image src={subjectIcon} alt={companion.subject} />
)}
</div>
{/* Session Timer */}
<div className="timer">
<p>Session Duration</p>
<p className="text-2xl">{formatTime(sessionDuration)}</p>
<div className={callStatus === CallStatus.ACTIVE ? 'pulse-indicator' : ''} />
</div>
{/* Microphone Toggle */}
<button
onClick={toggleMicrophone}
disabled={callStatus !== CallStatus.ACTIVE}
>
<Image src={isMuted ? micOff : micOn} alt={isMuted ? 'Unmute microphone' : 'Mute microphone'} />
</button>
{/* Start/End Button */}
<button onClick={callStatus === CallStatus.ACTIVE ? handleDisconnect : handleCall}>
{callStatus === CallStatus.ACTIVE ? 'End Session' : 'Start Session'}
</button>
{/* Live Transcript */}
<div className="transcript">
{messages.map((msg, i) => (
<p key={i} className={msg.role === 'user' ? 'text-primary' : ''}>
{msg.role === 'assistant' ? companion.name : userName}: {msg.content}
</p>
))}
</div>
</div>
);
Results After 2 Months
User metrics:
- Average session: 18 minutes
- Completion rate: 82% (users finish sessions they start)
Technical performance:
- WebRTC connection: < 2 seconds
- Transcription latency: ~200ms
- AI response time: 800ms average
- Voice synthesis: ~400ms
User feedback:
- "Feels like talking to a real tutor"
- "Better than reading textbooks"
- "I can ask clarifying questions immediately"
What I'd Do Differently
1. Add session pause/resume
- Current: Session runs continuously
- Better: Pause for breaks, resume later
2. Implement conversation branching
- Track topic coverage
- Suggest related concepts
- Create learning paths
3. Add real-time sentiment detection
- Detect confusion in voice tone
- Adjust explanation complexity
- Offer alternative explanations
Tech Stack
- Frontend: Next.js 15 + React 19 + TypeScript
- Voice AI: VAPI SDK v2.3.0
- Database: Supabase (PostgreSQL)
- Auth: Clerk
- STT: Deepgram Nova-3
- TTS: 11Labs
- AI: OpenAI GPT-4
GitHub: https://github.com/chayan-1906/IntelliCourse-Next.js
Voice AI isn't just for chatbots. With VAPI handling the WebRTC complexity, you can build conversational learning experiences in a weekend. The key is good system prompts and proper event handling.
Questions about VAPI or voice AI implementation? Drop them below.