This is a submission for the AssemblyAI Voice Agents Challenge
What I Built
I built a platform that lets businesses create AI sales reps: voice agents that hold near-real-time conversations, pitch products and services, collect customer information, and answer relevant questions.
The voice agent can be embedded into a website as a widget, and visitors can trigger a conversation through a floating action button.
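Just to illustrate the idea (this is not the platform's actual embed API, only a sketch), the floating action button could be injected with a few lines of JavaScript:

// Hypothetical widget bootstrap: the button styling, the custom event name, and
// the approach of injecting it from a script are my own illustrative assumptions.
(function mountVoiceAgentFab() {
  const fab = document.createElement("button");
  fab.textContent = "Talk to us";
  fab.style.cssText =
    "position:fixed;bottom:24px;right:24px;padding:12px 20px;border-radius:9999px;cursor:pointer;";
  fab.addEventListener("click", () => {
    // In the real widget this opens the conversation panel and starts
    // streaming microphone audio to the backend.
    window.dispatchEvent(new CustomEvent("voice-agent:open"));
  });
  document.body.appendChild(fab);
})();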
Each agent is provided with domain-specific knowledge tailored to the business, which keeps its responses accurate. These agents aren't just chatbots; they act as an extension of the business, using preconfigured behavior and the product/service expertise the business defines.
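As a rough sketch of how that per-business context could be assembled into a system prompt (the config shape and field names here are my own assumptions, not the actual schema):

// Minimal sketch: `businessName`, `behavior`, and `knowledgeBase` are
// illustrative field names, not the platform's real configuration model.
function buildAgentSystemPrompt(agentConfig) {
  return [
    `You are a sales representative for ${agentConfig.businessName}.`,
    `Tone and behavior: ${agentConfig.behavior}`,
    `Only answer questions using the product knowledge below.`,
    `If you don't know the answer, offer to take the customer's contact details instead.`,
    ``,
    `Product/service knowledge:`,
    agentConfig.knowledgeBase,
  ].join("\n");
}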
The app also shows usage analytics and summarized conversations with extracted information: lead quality, customer details and intent, and the next steps the business can take based on the conversation between the visitor and the AI sales agent.
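A minimal sketch of what that post-conversation extraction step could look like; `callLLM` is a placeholder for whatever LLM client the backend uses, and the JSON keys simply mirror the fields shown in the dashboard:

// Sketch of the post-call analysis step. The prompt wording and output keys
// are assumptions modeled on what the dashboard displays.
async function summarizeConversation(transcriptTurns, callLLM) {
  const prompt = `Summarize this sales conversation and return JSON with the keys
"summary", "leadQuality" (hot | warm | cold), "customerInfo", "intent", and "nextSteps".

Conversation:
${transcriptTurns.map((t) => `${t.role}: ${t.text}`).join("\n")}`;

  const raw = await callLLM(prompt);
  return JSON.parse(raw); // e.g. { summary, leadQuality, customerInfo, intent, nextSteps }
}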
This project falls primarily under the Business Automation and Domain Expert prompt categories.
Demo
app link: click here
GitHub Repository
Technical Implementation & AssemblyAI Integration
The frontend is built with Next.js and Tailwind CSS; the backend uses Node.js, Express, PostgreSQL, and the AssemblyAI SDK.
AssemblyAI's Universal-Streaming API powers real-time voice transcription over WebSockets. The low latency it provides is essential for keeping the conversation feeling near-instantaneous.
After a WebSocket connection is established, the app connects to AssemblyAI's streaming transcriber and pipes audio from the frontend into the transcriber stream.
Here is sample Node.js code for connecting to AssemblyAI over a WebSocket:
import { AssemblyAI } from "assemblyai";
import ffmpeg from "fluent-ffmpeg";
import { PassThrough, Readable } from "stream";

async function handleWebSocketConnection(ws) {
  const client = new AssemblyAI({ apiKey: process.env.ASSEMBLY_AI_API_KEY });

  // The streaming transcriber expects 16 kHz, 16-bit, mono PCM audio
  const transcriber = client.streaming.transcriber({
    sampleRate: 16_000,
  });

  // Forward each turn's transcript back to the browser
  transcriber.on("turn", (turn) => {
    if (!turn.transcript) return;
    ws.send(JSON.stringify({ transcript: turn.transcript }));
  });

  await transcriber.connect();

  // ffmpeg converts the browser's WebM audio into raw 16 kHz mono PCM
  const inputStream = new PassThrough();
  const ffmpegProcess = ffmpeg()
    .input(inputStream)
    .inputFormat("webm")
    .audioFrequency(16000)
    .audioChannels(1)
    .audioCodec("pcm_s16le")
    .format("s16le")
    .on("error", (err) => console.error("ffmpeg error:", err))
    .pipe();

  // Pipe the converted PCM audio into AssemblyAI's transcriber
  Readable.toWeb(ffmpegProcess).pipeTo(transcriber.stream());

  // Incoming WebSocket messages are raw WebM audio chunks from the widget
  ws.on("message", (data) => {
    inputStream.write(data);
  });

  ws.on("close", async () => {
    inputStream.end();
    await transcriber.close();
  });
}
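On the browser side, the widget only needs to capture microphone audio and forward it to that WebSocket. A rough sketch, assuming a WebM-capable MediaRecorder and an illustrative endpoint URL:

// Browser-side sketch: capture mic audio as WebM and forward chunks to the
// backend WebSocket, which ffmpeg then decodes to 16 kHz PCM.
// The URL and the 250 ms timeslice are illustrative assumptions.
async function startVoiceSession() {
  const ws = new WebSocket("wss://your-backend.example.com/voice");
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });

  recorder.ondataavailable = (event) => {
    if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(event.data); // binary WebM chunk, matches inputFormat("webm") above
    }
  };

  // Transcript messages come back as JSON from the server
  ws.onmessage = (event) => {
    const { transcript } = JSON.parse(event.data);
    console.log("Live transcript:", transcript);
  };

  ws.addEventListener("open", () => recorder.start(250)); // emit a chunk every 250 ms

  // Return a cleanup function to end the session
  return () => {
    recorder.stop();
    stream.getTracks().forEach((track) => track.stop());
    ws.close();
  };
}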
Once a transcript is received, it's sent to an LLM that generates an appropriate response based on the agent's predefined context. The LLM's response is then read aloud in a natural-sounding voice using one of Amazon Polly's predefined voices.
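A minimal sketch of that text-to-speech step with the AWS SDK v3 Polly client; the voice ID and the choice to send the MP3 back over the same WebSocket are assumptions on my part:

import { PollyClient, SynthesizeSpeechCommand } from "@aws-sdk/client-polly";

// Sketch only: "Joanna" stands in for whichever predefined Polly voice the
// agent is configured with, and the delivery back to the browser may differ.
const polly = new PollyClient({ region: process.env.AWS_REGION });

async function speak(text, ws) {
  const { AudioStream } = await polly.send(
    new SynthesizeSpeechCommand({
      Text: text,
      OutputFormat: "mp3",
      VoiceId: "Joanna",
      Engine: "neural",
    })
  );
  // Buffer the MP3 and push it to the browser over the same WebSocket
  const audioBuffer = Buffer.from(await AudioStream.transformToByteArray());
  ws.send(audioBuffer);
}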
That's all for now. Cheers!