
ProRecruit

Originally published at aissence.ai

Building a Real-Time Speech-to-Text Pipeline with Deepgram + Next.js


Real-time speech-to-text (STT) converts spoken audio into text as it is being spoken, typically with under 300 milliseconds of latency. Deepgram's Nova-2 model offers 98.7% accuracy for English at $0.0043 per minute, roughly 3x cheaper than AWS Transcribe.

Prerequisites

  • Node.js 20+, Next.js 15, Deepgram API key (free tier: 45K minutes)

Step 1: Project Setup

```bash
npx create-next-app@latest stt-demo --typescript --tailwind --app
cd stt-demo
npm install @deepgram/sdk
```

Step 2: Backend WebSocket Route

```typescript
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

const connection = deepgram.listen.live({
  model: "nova-2",
  language: "en",
  smart_format: true,     // punctuation, capitalization, number formatting
  interim_results: true,  // partial transcripts while the user is still speaking
  vad_events: true,       // emit voice-activity events
  endpointing: 300,       // finalize a transcript after 300 ms of silence
});

connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const transcript = data.channel.alternatives[0]?.transcript;
  if (transcript) console.log("Transcript:", transcript);
});
```

Step 3: Browser Audio Capture

```typescript
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { sampleRate: 16000, channelCount: 1, echoCancellation: true },
});

const mediaRecorder = new MediaRecorder(stream, {
  mimeType: "audio/webm;codecs=opus",
});

mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) sendToWebSocket(event.data);
};

mediaRecorder.start(100); // emit a chunk every 100 ms
```
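`sendToWebSocket` is left undefined above. A minimal sketch, written against a small socket shape so it runs anywhere (`SocketLike` is an illustrative name, not a Web API; in the browser you would pass the page's `WebSocket` to the backend route, and the socket is taken as a parameter here for clarity):

```typescript
// Illustrative stand-in for the subset of WebSocket we use.
interface SocketLike {
  readyState: number; // 1 === OPEN, per the WebSocket readyState values
  send(data: Blob | ArrayBuffer): void;
}

// Forward a recorded chunk only while the connection is open.
// MediaRecorder keeps emitting chunks even if the socket is still
// connecting or has closed; this sketch drops those (production code
// would queue them instead). Returns true when the chunk was sent.
function sendToWebSocket(data: Blob | ArrayBuffer, socket: SocketLike): boolean {
  if (socket.readyState === 1) {
    socket.send(data); // browser WebSocket.send accepts Blob directly
    return true;
  }
  return false;
}
```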

Step 4: Production Optimizations

  • Connection recovery with exponential backoff
  • Audio buffering during reconnection
  • Multi-language support (language: "auto" for 36 languages)
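The backoff schedule from the first bullet can be sketched as a pure helper plus a retry loop. The names and the 1 s base / 30 s cap / 8 attempts are illustrative defaults chosen here, not Deepgram requirements:

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, ... up to 30s.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Usage sketch: re-run `connect` (e.g. re-open the Deepgram live
// connection) after each failure, waiting longer each time.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 8,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return; // connected
    } catch {
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw new Error("gave up reconnecting");
}
```

While the loop is waiting, incoming audio chunks should go into the buffer from the second bullet so no speech is lost across the gap.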

At AissenceAI, we use this pipeline to power real-time interview transcription in 42 languages.

Our live coaching feature uses Voice Activity Detection to detect when the interviewer stops speaking.
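As a sketch of that detection logic in isolation, the handler below is written against Node's `EventEmitter` rather than the SDK connection. The event names mirror the strings Deepgram's `LiveTranscriptionEvents` enum maps to in recent SDK versions ("SpeechStarted" from `vad_events`, "UtteranceEnd" which additionally requires the `utterance_end_ms` option), and `onSpeakerStopped` is a name introduced here, so verify both against your SDK version:

```typescript
import { EventEmitter } from "node:events";

// Illustrative silence detector: invoke `callback` once each time the
// speaker stops after having spoken. Returns a probe for the current state.
function onSpeakerStopped(
  connection: EventEmitter,
  callback: () => void,
): { speaking: () => boolean } {
  let speaking = false;
  connection.on("SpeechStarted", () => {
    speaking = true;
  });
  connection.on("UtteranceEnd", () => {
    if (speaking) {
      speaking = false;
      callback(); // e.g. trigger the next coaching prompt
    }
  });
  return { speaking: () => speaking };
}
```

Gating the callback on a preceding `SpeechStarted` avoids firing on silence-only stretches where no utterance actually ended.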


See this in action at aissence.ai.
